-------------------------------------------------------------------------------
help for propcnsreg
-------------------------------------------------------------------------------

Fitting a measurement model with causal indicators

propcnsreg depvar [indepvars] [if] [in] [weight] , constrained(varlist_c) lambda(varlist_l) [ standardized lcons unit(varname) mimic logit poisson vce(vcetype) robust cluster(clustervar) level(#) {or | irr} wald maximize_options ]

by ... : may be used with propcnsreg; see help by.

fweights, pweights, aweights, and iweights are allowed; see help weights.

Description

propcnsreg combines information from several observed variables into a single latent variable and estimates the effect of this latent variable on the dependent variable. propcnsreg assumes that the observed variables influence the latent variable. A common alternative assumption is that the latent variable influences the observed variables; factor analysis, for example, is based on this alternative assumption. To distinguish between these two situations some authors, following Bollen (1984) and Bollen and Lennox (1991), call the observed variables "effect indicators" when they are influenced by the latent variable and "causal indicators" when they influence the latent variable. Distinguishing between the two is important, as they require very different strategies for recovering the latent variable. In a basic (exploratory) factor analysis, which is a model for effect indicators, one assumes that the only thing the observed variables have in common is the latent variable, so any correlation between the observed variables must be due to the latent variable, and it is this correlation that is used to recover the latent variable. In propcnsreg, which estimates models for causal indicators, the latent variable is assumed to be a weighted sum of the observed variables (and optionally an error term), and the weights are estimated such that they are optimal for predicting the dependent variable.

Models for dealing with causal indicators come in roughly three flavors: a model with "sheaf coefficients" (Heise 1972), a model with "parametrically weighted covariates" (Yamaguchi 2002), and a Multiple Indicators and Multiple Causes (MIMIC) model (Hauser and Goldberger 1971). The latter two can be estimated using propcnsreg, while the first can be estimated using sheafcoef, which is also available from SSC.

    +-------------------+
----+ Sheaf coefficient +------------------------------------------------

The sheaf coefficient is the simplest model of the three. Say we want to explain a variable y using three observed variables x1, x2, and x3, and we think that x1 and x2 actually influence y through a latent variable eta. Because eta is a latent variable, we need to fix its origin and unit. The origin can be fixed by setting eta to 0 when both x1 and x2 are 0; the unit can be fixed by setting the standard deviation of eta equal to 1. The model starts with a simple regression model, where the b's are the regression coefficients and e is a normally distributed error term with a mean of 0 and a standard deviation that is to be estimated:

(1) y = b0 + b1 x1 + b2 x2 + b3 x3 + e

and we want to turn this into the following, where l is the effect of the latent variable and the c's are the effects of the observed variables on the latent variable:

(2) y = b0 + l eta + b3 x3 + e

(3) eta = c0 + c1 x1 + c2 x2

We can fix the origin of eta by constraining c0 to be 0; this way eta will be 0 when both x1 and x2 equal 0. This leaves c1 and c2. We want to choose values for these parameters such that eta optimally predicts y and the standard deviation of eta equals 1. This means that c1 and c2 will be a transformation of b1 and b2. We can start with an initial guess that c1 equals b1 and c2 equals b2, and call the resulting latent variable eta'. This gets us closer to where we want to be, as we now have values for all parameters: c0=0, c1'=b1, c2'=b2, and l'=1. The value for l' follows from the fact that it is the only value for which equations (2) and (3) lead to equation (1). However, the standard deviation of eta' will generally not equal 1; in fact, we can calculate the standard deviation of eta' as follows:

sd(eta') = sqrt{b1^2 var(x1) + b2^2 var(x2) + 2 b1 b2 cov(x1, x2)}

We can recover eta by dividing eta' by its standard deviation, which means that the true values of c1 and c2 are b1/sd(eta') and b2/sd(eta'). If we divide eta' by its standard deviation, then we must multiply l' by that same number to ensure that equations (2) and (3) continue to lead to equation (1). As a consequence, l will equal sd(eta').
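For concreteness, this rescaling can be carried out by hand after a regular regression. The following is only a minimal sketch; the variables y, x1, x2, and x3 are hypothetical placeholders and must already exist in memory:

        * fit the regression in equation (1)
        regress y x1 x2 x3

        * initial guess for the latent variable: eta' = b1*x1 + b2*x2
        gen double etaprime = _b[x1]*x1 + _b[x2]*x2

        * the standard deviation of eta' is the effect of the standardized latent variable
        summarize etaprime
        display "l  = " r(sd)
        display "c1 = " _b[x1]/r(sd)
        display "c2 = " _b[x2]/r(sd)

Note that r(sd) is always positive, which is the sign convention discussed next.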

Notice that the effect of the latent variable will thus always be positive. This is necessary because we have only specified the origin and unit of the latent variable, not its direction. Say x1 is the proportion of vegetables in a person's diet and x2 the number of minutes spent exercising each day. If we did not fix the effect of the latent variable to be positive, then there would always be two sets of estimates that represent exactly the same information: if the c's are positive, the latent variable represents the healthiness of someone's lifestyle, and if the c's are negative, it represents the unhealthiness of that person's lifestyle. Saying that the healthiness of someone's lifestyle has a positive effect is exactly the same as saying that the unhealthiness of someone's lifestyle has a negative effect. Stata cannot choose between these two, since both statements are the same, so we need to choose for it. We can do so by either fixing the direction of the effect or fixing the direction of the latent variable. The default is to fix the direction of the effect, but we can also specify one key variable and fix the direction of the latent variable relative to this key variable, either by stating that the latent variable is high when the key variable is high and low when the key variable is low, or exactly the opposite.

This illustrates how the following set of assumptions can be used to recover the latent variable and its effect on the dependent variable:

- the latent variable is a weighted sum of the observed variables such that the latent variable optimally predicts the dependent variable.

- a constraint that fixes the origin of the latent variable.

- a constraint that fixes the unit of the latent variable.

- a constraint that either fixes the direction of the latent variable or the direction of the effect of the latent variable.

However, a sheaf coefficient just reorders the information obtained from a regular regression. It is simply a different way of looking at the regression results, which can be useful, but it does not impose a testable constraint.

One possible application of the sheaf coefficient is the comparison of effect sizes of different blocks of variables. For example, we may have a block of variables representing the family situation of the respondent and another block of variables representing characteristics of the work situation and we wonder whether the work situation or the family situation is more important for determining a certain outcome variable. In that case we would estimate a model with two latent variables, one for the family situation and one for the work situation, and since both latent variables are standardized their effects will be comparable.
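A minimal sketch of such a comparison, using the same manual rescaling as above; the variable names fam1, fam2, wrk1, and wrk2 are hypothetical placeholders for the family and work blocks:

        regress y fam1 fam2 wrk1 wrk2
        gen double eta_fam = _b[fam1]*fam1 + _b[fam2]*fam2
        gen double eta_wrk = _b[wrk1]*wrk1 + _b[wrk2]*wrk2
        summarize eta_fam
        display "effect of family block = " r(sd)
        summarize eta_wrk
        display "effect of work block   = " r(sd)

Because both latent variables are standardized, the two displayed effects are directly comparable.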

    +------------------------------------+
----+ Parametrically weighted covariates +-------------------------------

The model with parametrically weighted covariates builds on the model with sheaf coefficients, but adds a testable constraint by assuming that the effect of the latent variable changes over another observed variable. This means that instead of equation (2) we estimate equation (4), where the effect of eta changes over x3:

(4) y = b0 + (l0 + l1 x3) eta + b3 x3 + e

If we replace eta with equation (3), and fix the origin of eta by constraining c0 to be zero, we get:

y = b0 + (l0 + l1 x3) (c1 x1 + c2 x2) + b3 x3 + e

= b0 + (l0 + l1 x3) c1 x1 + (l0 + l1 x3) c2 x2 + b3 x3 + e

This means that the effect of x1 (through eta) on y equals (l0 + l1 x3) c1, and that the effect of x2 (through eta) on y equals (l0 + l1 x3) c2. This implies the following constraint: for every value of x3, the effect of x1 relative to the effect of x2 is {(l0 + l1 x3) c1} / {(l0 + l1 x3) c2} = c1/c2, a constant. In other words, the model with parametrically weighted covariates imposes a proportionality constraint. A test of this constraint is reported at the bottom of the output of propcnsreg (when the mimic option is not specified).
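The idea behind this test can be sketched by comparing the constrained model with a regression in which the interactions of x1 and x2 with x3 are left free. This is only an illustration with the hypothetical variables used above, and it assumes that propcnsreg leaves the log likelihood behind in e(ll), as ml-based commands normally do:

        * unrestricted model: the effects of x1 and x2 change freely over x3
        gen double x1Xx3 = x1*x3
        gen double x2Xx3 = x2*x3
        quietly regress y x1 x2 x3 x1Xx3 x2Xx3
        local ll_u = e(ll)

        * restricted model: the effects of x1 and x2 change by the same proportion over x3
        quietly propcnsreg y x3, constrained(x1 x2) lambda(x3)
        local ll_r = e(ll)

        * the proportionality constraint costs one degree of freedom
        display "LR chi2(1) = " 2*(`ll_u' - `ll_r')
        display "p-value    = " chi2tail(1, 2*(`ll_u' - `ll_r'))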

This proportionality constraint can also be of substantive interest without referring to a latent variable. Consider a model where one wants to explain the respondent's education (ed) with the education of the father (fed) and the mother (med), and suppose one is interested in testing whether the relative contribution of the mother's education has increased over time. propcnsreg estimates this model under the null hypothesis that the relative contributions of fed and med have remained constant over time. Notice that the effects of fed and med are allowed to change over time, but they are constrained to change by the same proportion. So if the effect of fed drops by 10% over a decade, then so does the effect of med.
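A sketch of how such a model could be specified; ed, fed, med, and cohort are hypothetical variable names, with cohort measuring time:

        * effects of fed and med may change over cohort, but only by the same proportion
        propcnsreg ed cohort, constrained(fed med) lambda(cohort)

The test reported at the bottom of the output then tests whether the relative contributions of fed and med have indeed remained constant over cohorts.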

propcnsreg allows you to identify the unit of the latent variable in one of the following three ways (a sketch after this list illustrates each):

- By setting the standard deviation of the latent variable to 1, effectively standardizing the latent variable. This is the default parametrization, but it can also be requested explicitly by specifying the standardized option. One can specify one key variable by prefixing that variable in the constrained() option with either a + or a -. The + means that the latent variable is high when the key variable is high and low when the key variable is low; the - means exactly the opposite. If no key variable is specified, then l0 is constrained to be positive.

- By setting the coefficient l0 to 1, which means that c1 and c2 represent the indirect effects of x1 and x2 through the latent variable on y when x3 equals 0.

- By setting either the coefficient c1 or c2 to 1, which means that the unit of the latent variable will equal the unit of either x1 or x2 respectively. This can be done by specifying the unit(varname) option.
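Continuing the hypothetical parental-education example from above, the three parametrizations could be requested as follows (the variable names remain placeholders):

        * standardized latent variable (default); fed is the key variable
        * fixing the direction of the latent variable
        propcnsreg ed cohort, constrained(+fed med) lambda(cohort) standardized

        * l0 fixed at 1, so the constrained coefficients are indirect effects
        * when cohort is zero
        propcnsreg ed cohort, constrained(fed med) lambda(cohort) lcons

        * latent variable measured in the units of fed
        propcnsreg ed cohort, constrained(fed med) lambda(cohort) unit(fed)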

    +-------+
----+ MIMIC +------------------------------------------------------------

The MIMIC model builds on the model with parametrically weighted covariates by assuming that the latent variable is measured with error. This means that the following model is estimated:

(5) y = b0 + (l0 + l1 x3) eta + b3 x3 + e_y

(6) eta = c0 + c1 x1 + c2 x2 + e_eta

where e_y and e_eta are independent, normally distributed error terms with means of zero and standard deviations that need to be estimated. By replacing eta in equation (5) with equation (6), one can see that the error term of this model is:

e_y + (l0 + l1 x3) e_eta

This combined error term will also be normally distributed, as the sum of two independent normally distributed variables is itself normally distributed, with mean zero and the following standard deviation:

sqrt{var(e_y) + (l0 + l1 x3)^2 var(e_eta)}

The empirical information used to separate the standard deviation of e_y from the standard deviation of e_eta is thus the change in the residual variance over x3. The data therefore contain only rather indirect information for estimating this model, and the model may not always converge. However, if the model is correct, it enables one to control for measurement error in the latent variable.
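As an illustration of what the mimic option estimates, the following sketch simulates data according to equations (5) and (6) and then fits the MIMIC model. The variable names and parameter values are arbitrary, and because the model is only weakly identified the maximization may be slow or may fail to converge:

        clear
        set seed 12345
        set obs 5000
        gen double x1  = rnormal()
        gen double x2  = rnormal()
        gen double x3  = runiform()
        gen double eta = .6*x1 + .8*x2 + rnormal(0, .5)            // equation (6), e_eta
        gen double y   = 1 + (1 + .5*x3)*eta + .3*x3 + rnormal()   // equation (5), e_y

        propcnsreg y x3, constrained(x1 x2) lambda(x3) mimic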

There is an important downside to this model: heteroscedasticity, in particular changes in the variance of e_y over x3, can distort the parameter estimates of l0 and l1. Consider again the example where one wants to explain the respondent's education with the education of the father and the mother, but now assume that we are interested in how the effect of the latent variable changes over time. In this case we have good reason to suspect that the variance of e_y will also change over time: education consists of a discrete number of categories, and in early cohorts most respondents cluster in the lowest categories. Over time the average level of education tends to increase, which means that respondents cluster less in the lowest category and have more room to differ from one another. As a consequence, the residual variance is likely to have increased over time. Normally this heteroscedasticity would not be a great concern, but in a MIMIC model it is incorrectly interpreted as indicating that there is measurement error in the latent variable representing parental education. Moreover, this "information" on the measurement error is used to "refine" the estimates of l0 and l1. This would therefore be an example where the MIMIC model is not appropriate.

Options

    +-------+
----+ Model +------------------------------------------------------------

constrained(varlist_c) specifies the variables that can be thought of as measurements of the same latent variable. The effects of these variables are constrained to change by the same proportion as the variables specified in lambda() change.

If the standardized option is specified, one variable can be marked as a key variable that fixes the direction of the latent variable, either in the same direction as the key variable (+) or in the opposite direction (-). If the standardized option is specified but no key variable is specified, then the constant of the lambda equation is constrained to be positive.

lambda(varlist_l) specifies the variables over which the effect of the latent variable changes.

mimic specifies that a MIMIC model is to be estimated.

logit specifies that the dependent variable is binary and that the influence of the latent and control variables on the probability is modeled through a logistic regression model.

poisson specifies that the dependent variable is a count and that the influence of the latent and control variables on the rate is modeled through a Poisson regression model.

    +----------------+
----+ Identification +---------------------------------------------------

standardized specifies that the unit of the latent variable is identified by constraining the standard deviation of the latent variable to be equal to 1. This is the default parametrization.

lcons specifies that the parameters of the variables specified in the constrained() option measure the indirect effects of these variables, through the latent variable, on the dependent variable when all variables specified in the lambda() option are zero.

unit(varname) specifies that the scale of the latent variable is identified by constraining the unit of the latent variable to be equal to the unit of varname. The variable varname must be specified in varlist_c.

    +---------------------+
----+ SE/robust/reporting +----------------------------------------------

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory, that are robust to some kinds of misspecification, that allow for intragroup correlation, and that use bootstrap or jackknife methods; see vce_option.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.14 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters).

cluster(clustervar) specifies that the observations are independent across groups (clusters) but not necessarily within groups. clustervar specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. See [U] 23.14 Obtaining robust variance estimates. Specifying cluster() implies robust.

level(#) specifies the confidence level, in percent, for the confidence intervals of the coefficients; see help level.

or specifies that odds ratios are to be displayed. If the lcons option is specified, then the parameters in all three equations (unconstrained, lambda, and constrained) will be exponentiated. In all other cases only the parameters in the first two equations (unconstrained and lambda) will be exponentiated. This option is only allowed in combination with the logit option.

irr specifies that incidence rate ratios are to be displayed. If the lcons option is specified, then the parameters in all three equations (unconstrained, lambda, and constrained) will be exponentiated. In all other cases only the parameters in the first two equations (unconstrained and lambda) will be exponentiated. This option is only allowed in combination with the poisson option.

wald specifies that the test of the proportionality constraint is to be a Wald test instead of a likelihood ratio test. This is the default when robust standard errors have been used. This option is not allowed in combination with the mimic option.

    +------------------+
----+ maximize_options +-------------------------------------------------

difficult, technique(algorithm_spec), iterate(#), trace, gradient, showstep, hessian, shownrtolerance, tolerance(#), ltolerance(#), gtolerance(#), nrtolerance(#), nonrtolerance; see maximize. These options are seldom used.

Example

This example illustrates the use of the poisson option to model a dependent variable that is non-negative but not necessarily a count. For the advantages of this approach see Cox et al. (2007), Nichols (2010), and Gould (2011). However, in a simulation the point estimates seemed to be unbiased while the robust standard errors did not perform as well, so I use bootstrap standard errors instead of robust standard errors. This example also illustrates the use of predict to help with interpreting the model:

sysuse nlsw88, clear

        gen hs = grade == 12 if grade < .
        gen sc = grade > 12 & grade < 16 if grade < .
        gen c = grade >= 16 if grade < .

        gen tenure2 = tenure^2
        gen tenureXunion = tenure*union
        gen tenure2Xunion = tenure2*union

        gen hours2 = ( hours - 40 ) / 5
        gen white = race == 1 if race < .

        propcnsreg wage tenure* union white hours2, /*
            */ lambda(tenure tenureXunion union)    /*
            */ constrained(hs sc c) unit(c)         /*
            */ poisson vce(bootstrap) irr

        predict double effect, effect
        predict double se_effect, stdp eq(lambda)
        gen double lb = effect - invnormal(.975)*se_effect
        gen double ub = effect + invnormal(.975)*se_effect
        replace effect = exp(effect)
        replace lb = exp(lb)
        replace ub = exp(ub)
        sort tenure

        twoway rarea lb ub tenure if union == 1 ||     /*
            */ rarea lb ub tenure if union == 0,       /*
            */ astyle(ci ci) ||                        /*
            */ line effect tenure if union == 1 ||     /*
            */ line effect tenure if union == 0,       /*
            */ yline(1) clpattern(longdash shortdash)  /*
            */ legend(label(1 "95% conf. int.")        /*
            */        label(2 "95% conf. int.")        /*
            */        label(3 "union")                 /*
            */        label(4 "non-union")             /*
            */        order(3 4 1 2))                  /*
            */ ytitle("effect of education on wage")

An example for a binary dependent variable. Note that in this case the parameters in both the unconstrained and the lambda equations are odds ratios.

        sysuse nlsw88, clear
        gen byte high = occupation < 3 if !missing(occupation)
        gen byte white = race == 1 if !missing(race)

        gen byte hs = grade == 12 if !missing(grade)
        gen byte sc = grade > 12 & grade < 16 if !missing(grade)
        gen byte c = grade >= 16 if !missing(grade)

        propcnsreg high white ttl_exp married never_married age, ///
            lambda(ttl_exp white)                                 ///
            constrained(hs sc c) unit(c) logit or

Author

Maarten L. Buis, Wissenschaftszentrum Berlin für Sozialforschung (WZB) maarten.buis@wzb.eu

References

Bollen, Kenneth A. 1984. "Multiple Indicators: Internal Consistency or No Necessary Relationship" Quality and Quantity 18(4): 377-385.

Bollen, Kenneth A. and Richard Lennox. 1991. "Conventional Wisdom on Measurement: A Structural Equation Perspective" Psychological Bulletin 110(2): 305-314.

Cox, Nicholas J., Jeff Warburton, Alona Armstrong and Victoria J. Holliday. 2007. "Fitting concentration and load rating curves with generalized linear models" Earth Surface Processes and Landforms 33(1): 25-39.

Gould, William. 2011. "Use poisson rather than regress; tell a friend" Not Elsewhere Classified, the official Stata blog. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

Hauser, Robert M. and Arthur S. Goldberger. 1971. "The Treatment of Unobservable Variables in Path Analysis." Sociological Methodology 3: 81-117.

Heise, David R. 1972. "Employing nominal variables, induced variables, and block variables in path analysis." Sociological Methods & Research 1(2): 147-173.

Nichols, Austin. 2010. "Regression for nonnegative skewed dependent variables" Stata Conference 2010. http://www.stata.com/meeting/boston10/boston10_nichols.pdf

Yamaguchi, Kazuo. 2002. "Regression models with parametrically weighted explanatory variables." Sociological Methodology 32: 219-245.

Suggested citation if using propcnsreg in published work

propcnsreg is not an official Stata command. It is a free contribution to the research community, like a paper. Please cite it as such.

Buis, Maarten L. 2007. "PROPCNSREG: Stata program fitting a linear regression with a proportionality constraint by maximum likelihood" http://ideas.repec.org/c/boc/bocode/s456858.html

Also see: