{smcl}
{* 19June2013}{...}
{hline}
help for {hi:pca2}
{hline}

{title:Application of Principal Component Analisys (PCA) to standard and GMM-style instrumental variables}

{p 4}Full syntax

{p 6 10}{cmd:pca2} {it:varlist} 
[{cmd:if} {it:exp}] [{cmd:in} {it:range}]
{bind:[{cmd:,} {cmd:nt(}{it:timevar} or {it:panelvar timevar}{cmd:)}}
{cmdab:var:iance(}{it:#}{cmd:)}
{cmd:avg}
{cmdab:cov:ariance}
{cmdab:pre:fix(}{it:string}{cmd:)}
{cmd:see} 
{cmdab:gmml:iv(}{it:#} or {it:# #}{cmd:)}
{cmdab:gmmd:iv(}{it:#} or {it:# #}{cmd:)}
{cmdab:lagsl(}{it:varlistl, ll(}{it:#} or {it:# #)} {cmd:)}
{cmdab:lagsd(}{it:varlistd, ll(}{it:#} or {it:# #)} {cmd:)}
{cmdab:togvar}
{cmd:togld}
{cmdab:ret:ain} {bind: ]}


{p}{cmd:pca2} may be used with cross-section, time-series or panel data; the nature of data may be specified by using the option {cmd:nt()}.
This option is based on Stata command {cmd:tsset}; see help {help tsset}.

{p} This command does not allow time-series operators in the {it:varlist}; 
if you want to use standard lags of {it:varlist} you need to generate them 
by using Stata time-series operators (see help {it: tsvarlist}) before the {cmd:pca2} command.

{title:Contents}
{p 2}{help pca2##s_description:Description}{p_end}
{p 2}{help pca2##s_options:Options summary}{p_end}
{p 2}{help pca2##s_macros:Remarks and saved results}{p_end}
{p 2}{help pca2##s_examples:Examples}{p_end}
{p 2}{help pca2##s_refs:References}{p_end}
{p 2}{help pca2##s_acknow:Acknowledgements}{p_end}
{p 2}{help pca2##s_citation:Authors}{p_end}
{p 2}{help pca2##s_citation:Citation of pca2}{p_end}

{marker s_description}{title:Description}

{p}{cmd:pca2} applies the Principal Component Analysis (PCA) (see help {help pca}) to a set of different variables, 
or to a set of GMM-style lags of the same variable, or to a set of lags of different variables.
PCA is aimed at data reduction and consists of an eigenvalue-eigenvector decomposition of the correlation
or covariance matrix of the variables in order to obtain a set of orthogonal linear combinations 
of the original variables that account for most of the variability in the original data. 
In order to choose the number of principal components to be retained, several criteria may be adopted, 
among which the "average criterion" and the "variance criterion" are the most commonly used.
The "average criterion" implies that only the principal components whose eigenvalue is above 
the average of the eigenvalues are retained for the score computation; 
according to the "variance criterion", the eigenvectors that account at least for a desired fraction 
of the variability in the original data are considered.
The PCA scores, relative to the retained components, become new variables to be used in the analysis; 
in our framework, the scores can be used as instruments in any instrumental variables Stata command, 
such as {help ivreg} or {help ivreg2} or {help xtivreg}.
This procedure implements a statistical method for the selection of non redundant IVs, 
based on the characteristics of the empirical problem at hand. 

The main distinguishing feature of this procedure is that it allows to obtain scores from the GMM-style instruments and 
hence to use them as IVs in GMM estimation Stata commands, like {cmd: xtdpd}, {cmd: xtabond} and the user-written {cmd: xtabond2}.

The {cmd: pca2} command implements the procedure discussed in detail by Bontempi and Mammi (2012, 2014).

{marker s_options}{title:Options summary}

{p 0 4}{cmd:nt(}{it:timevar} or {it:panelvar timevar}{cmd:)} is required in time-series and panel data 
in order to create GMM-style instruments and to apply PCA on them. If this option is omitted, 
the dataset is treated as a cross-section and all the observations would be pooled. 

{p 0 4}{cmd:variance(}{it:#}{cmd:)} allows you to apply the "variance criterion" (default criterion) i.e. only those principal components that account for at least 
the chosen percentage of the variability in the original data are retained 
for the computation of the scores. The number defining the percentage must be an integer greater than 0 and lower or equal to 100. 
The default is {it:#} = 90.
  
{p 0 4}{cmd:avg} selects the principal components to be kept for score computation 
according to the "average criterion", i.e. only those eigenvectors whose corresponding 
eigenvalues are above the average of the eigenvalues are retained. 
Note that when the options {cmd:avg} is chosen, {cmd:pca2} also computes the scores 
according to the default 90% "variance criterion" and saves both of them in the dataset: 
the scores obtained according to the two criteria can thus be compared.

{p 0 4}{cmd:covariance} performs PCA of the covariance matrix; default
is to perform PCA on the correlation matrix (see {help pca}).

{p 0 4}{cmd:prefix(}{it:string}{cmd:)} specifies the prefix for the name of the scores
generated by the {cmd:pca2} command corresponding to the retained principal components.
If you write e.g. {cmd:prefix(}{it:sys}{cmd:)} you will obtain "_sys_varscore#" and "_sys_avgscore#". 
This option is particularly useful when the {cmd:pca2} command is repeated many times 
on the same dataset in order to create different scores from different instrument sets, 
eventually also according to different criteria. The default prefix is "_BM" which retains
the scores with labels such as "_BM_varscore#" and "_BM_avgscore#". 

{p 0 4}{cmd:see} asks Stata to display the outcome of the PCA. 

{p 0 4}{cmd:gmmliv(}{it:#} or {it:# #}{cmd:)} generates the GMM-style instruments
in levels (for the equations in first-differences) for all the variables included in 
the {cmd:pca2} {it:varlist}. 
If only one argument is specified, e.g. {cmd:gmmliv(}{it:k}{cmd:)}, all the available lags
from {it:t-k} back to the initial observation for each variable in the {it:varlist} of the 
{cmd:pca2} command are used. If two arguments are specified, e.g. {cmd:gmmliv(}{it:k1 k2}{cmd:)}
with k1<=k2, the lags from {it:t-k1} to {it:t-k2} are considered. The PCA is done on all 
the specified GMM-style lags in levels of each variable taken separately. If the option 
{cmd:togvar} (full description below) is also added, the PCA is performed on all the 
generated GMM-style lags in levels of all the variables in the {it:varlist} considered together.
With this option the lag structure is the same for each variable.

{p 0 4}{cmd:gmmdiv(}{it:#} or {it:# #}{cmd:)} generates the GMM-style instruments
in first-differences (for the equations in levels) for all the variables included in 
the {cmd:pca2} {it:varlist}. 
If only one argument is specified, e.g. {cmd:gmmdiv(}{it:k}{cmd:)}, all the available lags
from {it:t-k} back to the initial observation for each variable in the {it:varlist} of the 
{cmd:pca2} command are used. If two arguments are specified, e.g. {cmd:gmmdiv(}{it:k1 k2}{cmd:)}
with k1<=k2, the lags from {it:t-k1} to {it:t-k2} are considered. The PCA is done on all 
the specified GMM-style lags in first-differences of each variable taken separately. If the option 
{cmd:togvar} (full description below) is also added, the PCA is performed on all the 
generated GMM-style lags in first-differences of all the variables in the {it:varlist} considered together.
With this option the lag structure is the same for each variable.

{p 0 4}{cmd:lagsl(}{it:varlistl, ll(}{it:#} or {it:# #))} generates the GMM-style instruments
in levels for a specific {it:varlistl}. It is a more flexible alternative to the 
{it:gmmliv()} option, as it allows for a different lag structure for each variable. 
The option {cmd:lagsl()} may be used more than once: different lag structures may thus be
defined for the variables in each {it:varlistl}. The suboption {it:ll} specifies the
lag structure of the variables in each {it:varlistl}: if only one argument is specified, 
e.g. {cmd:ll(}{it:k}{cmd:)}, all the available lags from {it:t-k} back to the initial 
observation for each variable in the {it:varlistl} are used. If two arguments are specified, 
e.g. {cmd:ll(}{it:k1 k2}{cmd:)} with k1<=k2, the lags from {it:t-k1} to {it:t-k2} 
are considered. The PCA is done on all the specified GMM-style lags in levels
of each variable taken separately. If the option {cmd:togvar} (full description below) is 
also added, the PCA is performed on all the generated GMM-style lags in levels of all the 
variables in the {it:varlistl} considered together. Such option can not be used with the option {it:gmml()}, while it is allowed together 
with either the option {it:lagsd()} or with the option {it:gmmd()}. When used alone or with 
the option {it:lagsd()} the number of variables in both {it:varlistl} and {it:varlistd} 
must be at least equal to the number of the variables in the {it:varlist} of the 
{cmd:pca2} command. Only when associated with the option {it:gmmd()}, the {it:lagsl()} option 
can have fewer variables than those included in the {it:varlist} of the {cmd:pca2} command. 

{p 0 4}{cmd:lagsd(}{it:varlistl, ll(}{it:#} or {it:# #))} generates the GMM-style instruments
in first-differences for a specific {it:varlistd}. It is a more flexible alternative to the 
{it:gmmdiv()} option, as it allows for a different lag structure for each variable. 
The option {cmd:lagsl()} may be used more than once: different lag structures may thus be
defined for the variables in each {it:varlistd}. The suboption {it:ll} specifies the
lag structure of the variables in each {it:varlistd}: if only one argument is specified, 
e.g. {cmd:ll(}{it:k}{cmd:)}, all the available lags from {it:t-k} back to the initial 
observation for each variable in the {it:varlistl} are used. If two arguments are specified, 
e.g. {cmd:ll(}{it:k1 k2}{cmd:)} with k1<=k2, the lags from {it:t-k1} to {it:t-k2} 
are considered. The PCA is done on all the specified GMM-style lags in first-differences
of each variable taken separately. If the option {cmd:togvar} (full description below) is 
also added, the PCA is performed on all the generated GMM-style lags in first-differences
 of all the variables in the {it:varlistd} considered together. Such option can not be used 
 with the option {it:gmmd()}, while it is allowed together with either the option {it:lagsl()}
 or with the option {it:gmml()}. When used alone or with the option {it:lagsl()} 
 the number of variables in both {it:varlistl} and {it:varlistd} must be at least equal 
 to the number of the variables in the {it:varlist} of the {cmd:pca2} command. 
 Only when associated with the option {it:gmml()}, the {it:lagsd()} option can have 
 fewer variables than those included in the {it:varlist} of the {cmd:pca2} command. 
 
{p 0 4}{cmd:togvar} specifies that the PCA is performed on the matrix that includes 
all the variables in the {it:varlist} and not on each variable separately.
E.g. the syntax {cmd:pca2} {it:x z}{cmd:, togvar} implies that the PCA is performed 
jointly on the variables {it:x} and {it:z}. This option needs to be specified in order 
to apply the PCA to GMM-style lags of more than one variable taken together. 
For example, {cmd:pca2} {it:x z}{cmd:, gmml(2) togvar} implies that the principal 
components are extracted from the matrix that includes all the available lags in levels 
from {it:t-2} backward of the variables {it:x} and {it:z}.

{p 0 4}{cmd:togld} specifies that, once that instruments in levels and first-differences 
are generated, the PCA is applied to the matrix that includes all these instruments 
together for each variable in the {it:varlist} of the {cmd:pca2}. If the option 
{cmd:togld} is used together with the option {cmd:togvar}, the principal components
are extracted from the matrix that includes all the lags in first-differences and 
in levels of all the variables in the {it:varlist}.  

{p 0 4}{cmd:retain} adds the generated GMM-style instrumental variables as new 
variables in the dataset. These IVs are named "_GMML""varname""period""lag"; 
e.g. "_GMMLn1978L2" stands for the {it:t-2} observation in levels for the variable "n" 
in the year "t=1978". 

{marker s_macros}{title:Remarks and saved results}

{p}{cmd:pca2} allows to visualize the output of the PCA; use the option {cmd:see}.

{p}{cmd:pca2} saves the following results in {cmd:r()}:

Scalars
{col 4}{cmd:r(number_scores_by_mean)}{col 18} Number of scores retained according to the "average criterion".
{col 4}{cmd:e(var_exp_by_mean)}{col 18} Percentage of total variability of the instruments 
explained by the principal components extracted according to the "average criterion".
{col 4}{cmd:e(number_scores_by_var)}{col 18} Number of scores retained according to the "variance criterion".
{col 4}{cmd:e(var_exp_by_var)}{col 18} Percentage of total variability of the instruments 
explained by the principal components extracted according to the "variance criterion".

{p}{cmd:pca2} also stores in {cmd:e()} the same results saved after the {cmd:pca} command is run.

{marker s_examples}{title:Examples}

{p 8 12}{stata " http://fmwww.bc.edu/ec-p/data/macro/abdata.dta" : . use http://fmwww.bc.edu/ec-p/data/macro/abdata.dta}{p_end}
{p 8 12}(Arellano and Bond, R.Ec.Studies 1991; Blundell and Bond, J.Econometrics 1998) {p_end}
{p 8 12}{stata "pca2 n w k, nt(id year) gmml(2) ret avg var(90)" : . pca2 n w k, nt(id year) gmml(2) ret avg var(90)}{p_end}
{p 8 12}(Generation of the GMM-style IVs in levels for GMM DIF estimation and extraction of the components according to both variance and average criteria) {p_end}
{p 8 12}. qui tab year, g(tauyear) {p_end}

{p 8 12}. xtabond2 n l.n l(0/1).(w k) tauyear3-tauyear9, iv(tauyear3-tauyear9, eq(diff)) gmm(n, lag(2 .) eq(diff)) gmm(w, lag(2 .) eq(diff)) gmm(k, lag(2 .) eq(diff)) h(2) nolev robust nod{p_end}
{p 8 12}(Use of the user written command {cmd:xtabond2}, D. Roodman, Stata Journal, 2009)

{p 8 10}. xtabond2 n l.n l(0/1).(w k) tauyear3-tauyear9, iv(tauyear3-tauyear9, eq(diff)) iv(_GMML_n_*, eq(diff) pass) iv(_GMML_w_*, eq(diff) pass) iv(_GMML_k_*, eq(diff) pass) h(2) nolev robust nod two{p_end}
{p 8 12}(Replication of the same results by using the GMM-style instruments generated by the command {cmd:pca2} as standard instruments)

{p 8 10}. xtabond2 n l.n l(0/1).(w k) tauyear3-tauyear9, iv(tauyear3-tauyear9, eq(diff)) iv(_BM_var*n*, eq(diff) pass) iv(_BM_var*w*, eq(diff) pass) iv(_BM_var*k*, eq(diff) pass) h(2) nolev robust nod {p_end}
{p 8 12}(DIFF-GMM estimation with the scores from PCA as new instruments){p_end}


{marker s_refs}{title:References}

{p 0 4}Bontempi, M.E. and I. Mammi. 2012. A strategy to reduce the count of moment conditions in panel data GMM. Working Papers wp843, Dipartimento Scienze Economiche, Universita' di Bologna.{p_end}
{p 0 4}Bontempi, M.E. and I. Mammi. 2014. {cmd:pca2}: implementing a strategy to reduce the instrument count in panel GMM, mimeo. {p_end}

{marker s_citation}{title:Citation of pca2}

{p}{cmd:pca2} is not an official Stata command. It is a free contribution
to the research community, like a paper. Please cite it as such: {p_end}

{phang}Bontempi, M.E. and I. Mammi, 2014.
pca2: a Stata module for applying PCA on GMM-style instrumental variables.


{title:Authors}

	Maria Elena Bontempi, Department of Economics, University of Bologna, Italy
	mariaelena.bontempi@unibo.it

	Irene Mammi, Department of Economics, University of Bologna, Italy
	irene.mammi@unibo.it

	
{title:Also see}

{p 1 14}Manual:  {hi:[U] 23 Estimation and post-estimation commands}{p_end}
{p 10 14}{hi:[U] 29 Overview of model estimation in Stata}{p_end}
{p 10 14}{hi:[R] ivreg}{p_end}
{p 10 14}{hi:[R] pca}{p_end}

{p 1 10}On-line: help for {help ivregress}, {help ivreg}, {help newey};
{help overid}, {help ivendog}, {help ivhettest}, {help ivreset},
{help xtivreg2}, {help xtoverid}, {help ranktest},
{help condivreg} (if installed);
{help est}, {help postest};
{help regress}{p_end}