Title
calibrate -- Calibrates survey datasets to population totals
Syntax
calibrate, marginals(varlist) poptot(matrix) entrywt(varname) exitwt(varname) [options]
calibrate takes a sampling weight and converts it to a calibration weight. The variables being calibrated to are listed in marginals, and the population totals used in the calibration are in a row matrix poptot.
options               Description
-------------------------------------------------------------------------
Required
  entrywt(varname)    the entry weight (selection weight)
  exitwt(varname)     the exit (calibrated) weight
  marginals(varlist)  variables used for calibration
  poptot(matrix)      row matrix of population totals
Options
  method(method)      specifies the calibration method
  qvar(varname)       scaling variable (default equals 1)
  print(print)        controls the amount of printing
  graphs(graphs)      controls the graphs produced
  outc(varname)       binary variable used by the non-response methods to indicate response
  sampvars(varlist)   additional variables used for the non-response methods
  tolerance(real)     tolerance used in the iterative methods (logistic and blinear)
  maxit(real)         maximum number of iterations used in the iterative methods (logistic and blinear)
  lbound(real)        minimum value of the ratio of exitwt to entrywt for method blinear
  ubound(real)        maximum value of the ratio of exitwt to entrywt for method blinear
-------------------------------------------------------------------------
Description
calibrate calibrates survey datasets to external totals. Seven methods are available. The linear and logistic methods are equivalent to Methods 1 and 2 of Deville and Särndal (1992). The bounded linear method (blinear) is an iterative method that uses the linear method while also constraining the ratio of the exit weight to the entry weight to lie between specified limits (cf. Singh and Mohl, 1996).
The non-response methods (nrSS, nr2A, nr2B and nr2C) assume the dataset contains both responders and non-responders. They calibrate the responders to population-level information on the variables in marginals, while using information about the selected sample on the variables in sampvars. Method nrSS is the single-step procedure in Chapter 8 of Särndal and Lundström (2005). Methods nr2A and nr2B are their two-step procedures with one difference: any negative intermediate weights obtained after the first step are set to zero. Method nr2C is related to the other two-step methods. Details of method nr2C are available on request.
In the special case where the calibration variables are all categorical and the scaling variable is a constant, the logistic method is equivalent to raking (Deming and Stephan, 1940). This case can also be dealt with using the maxentropy program.
entrywt is the selection weight of the individual case. It will usually be the reciprocal of the selection probability. If it has been scaled (for example, to sum to the sample size), it will usually be advisable to rescale it to sum to the population size. The weight exitwt will be generated (or replaced if it already exists). The population totals are held in the row matrix poptot, with one total per variable in marginals. The calibration variables (marginals) should be numeric. Categorical variables will usually need to be converted to indicator variables.
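The rescaling mentioned above can be done in a couple of lines. This is only a sketch: w_scaled, w_entry and Npop are hypothetical names, and the population size of 8,000,000 is used purely for illustration.

```stata
* Hypothetical names: w_scaled is a weight that has been scaled to sum to
* the sample size; Npop is the known population size (illustrative value).
scalar Npop = 8000000
summarize w_scaled, meanonly
generate double w_entry = w_scaled * Npop / r(sum)

* Check: the rescaled weights should now sum to Npop
summarize w_entry, meanonly
display r(sum)
```

The rescaled variable (here w_entry) would then be passed to calibrate as entrywt.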
Options
method(method) specifies the calibration method. linear is the default. Other methods are logistic, blinear, or the non-response methods: nrSS, nr2A, nr2B and nr2C.
qvar(varname) is related to the importance of the observation. See Deville and Särndal (1992) for further details. When using one of the non-response methods, it is usually advisable to use the default value of qvar.
print(print) controls the amount of printing. Options are none (the default), final (which summarises the final weights) and all (which summarises the weights after each iteration). When the method is linear or a non-response method, the options final and all are equivalent.
graphs(graphs) controls the number of graphs produced. Options are none (the default), final (which produces a histogram of the exit weight) and all. The option all produces two additional graphs: a scatterplot of the exit weight against the entry weight, and a histogram either of the ratio of the exit weight to the entry weight (for methods linear, blinear or a non-response method) or of the logarithm of that ratio (for method logistic).
outc(varname) is a binary variable equal to 1 if the case corresponds to a responder and 0 otherwise. This is required when a non-response method is used and is ignored otherwise.
sampvars(varlist) is a list of variables that are available on the complete sample, both responders and non-responders. This is required when a non-response method is used and is ignored otherwise. Variables in marginals should not be included in sampvars.
tolerance(real) specifies the tolerance for the iterative methods.
maxit(real) specifies the maximum number of iterations to be used by the iterative methods. The default is 15.
lbound(real) puts a lower bound on the ratio of exitwt to entrywt for method blinear. The default is 0.2.
ubound(real) puts an upper bound on the ratio of exitwt to entrywt for method blinear. The default is 5.
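As a rough illustration of how outc and sampvars combine with a non-response method, a call might look like the sketch below. The variables resp, iregion1 and iregion2 and the exit weight wt_nr are hypothetical names that do not come from any dataset used in this help file; the marginals and totals are those of the Examples section.

```stata
* Hypothetical variables:
*   resp      1 = responder, 0 = non-responder
*   iregion1, iregion2   indicators known for the whole selected sample
matrix M = [4000000, 4000000, 7000000]
calibrate , marginals(isex1 isex2 irace1) poptot(M) entrywt(sampwgt) exitwt(wt_nr) method(nrSS) outc(resp) sampvars(iregion1 iregion2)
```

qvar is left at its default here, as advised for the non-response methods, and none of the marginals appear in sampvars.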
Warnings and problems
Calibration can result in negative weights. If this happens, calibrate will give a warning. (Note that the logistic method ensures that the calibration weights are positive.) It will also give a warning if the calibration matrix is found to be singular. This is usually a consequence of collinearity among the marginal variables, and the solution is usually to re-calibrate after omitting variables.
Note also that there is no guarantee that a solution to the calibration equations exists.
It is also worth noting that calibrate solves the calibration equations by inverting a matrix with the command invsym. This limits the number of calibration constraints to the maximum matrix size. Numerical problems can also arise if the matrix is nearly singular.
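Both conditions can also be inspected by hand. The sketch below assumes an exit weight wt1 and the indicator variables from the Examples section; it uses Stata's programmer command _rmcoll, which is not necessarily the check calibrate performs internally.

```stata
* Count any negative exit weights (wt1 is assumed to come from an earlier
* calibrate run)
count if wt1 < 0

* Look for collinearity among the calibration variables. noconstant is an
* assumption of this sketch: the calibration constraints have no intercept.
_rmcoll isex1 isex2 irace1, noconstant
display "`r(varlist)'"
```

If every indicator for both sex and race were listed, the two sets would each sum to one and so be collinear, which is why one race indicator is left out of marginals in the examples.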
A further problem might occur when using the logistic method. This method uses Newton-Raphson to solve the calibration equations, and might fail to converge, especially if the initial estimate is not close to the solution. The initial estimate calibrate uses is calculated from the selection weights. Newton-Raphson might fail if the selection weights have been scaled (for example to sum to the sample size). Rescaling them to sum to the population size will sometimes be a solution.
Examples
This example calibrates the multistage dataset. The population consists of 8,000,000 high school seniors. Assume the population is known to be 50% male and 50% female and to contain 7,000,000 white seniors.
. use http://www.stata-press.com/data/r9/multistage
Convert the categorical variables sex and race into binary indicator variables.
. tab sex, gen(isex)
. tab race, gen(irace)
Make a row matrix of population totals (male, female, white).
. matrix M=[4000000, 4000000, 7000000]
An example of linear calibration creating an exit weight called wt1:
. calibrate , marginals(isex1 isex2 irace1) poptot(M) entrywt(sampwgt) exitwt(wt1)
An example of linear calibration with additional printing:
. calibrate , marginals(isex1 isex2 irace1) poptot(M) entrywt(sampwgt) exitwt(wt1) print(all) graphs(all)
To check that the weighted sex and race distributions are correct:
. tab sex [iweight=wt1]
. tab race [iweight=wt1]
It is possible to calibrate to continuous variables. Suppose it is also known that the average weight is 160 lbs (so the total weight is 8,000,000 x 160 = 1,280,000,000 lbs).
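For readers who want to see what the linear method does, the Mata sketch below reproduces the linear calibration of Deville and Särndal (1992) by hand with the scaling variable equal to 1: exit weight = entry weight x (1 + x'lambda), where lambda solves (X'DX) lambda = t - X'd. It is a sketch under stated assumptions, not calibrate's own code: wt_check and diff are hypothetical names, M is assumed still to hold the three totals defined above, and the variables are assumed to have no missing values.

```stata
* Sketch only: linear calibration by hand (q = 1), for comparison with wt1.
* Assumes no missing values in sampwgt or the indicator variables.
mata:
X = st_data(., ("isex1", "isex2", "irace1"))   // n x 3 calibration variables
d = st_data(., "sampwgt")                      // n x 1 entry weights
t = st_matrix("M")'                            // 3 x 1 population totals
// Solve (X'DX) lambda = t - X'd, where D = diag(d)
lambda = invsym(quadcross(X, d, X)) * (t - quadcross(X, d))
w = d :* (1 :+ X * lambda)                     // calibrated weights
st_store(., st_addvar("double", "wt_check"), w)
end

* Compare the hand-computed weights with those produced by calibrate
generate double diff = wt1 - wt_check
summarize diff
```

If calibrate implements the linear method in the same way, the differences should be negligible.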
. matrix M=[4000000, 4000000, 7000000, 1280000000]
Linear, logistic or bounded linear calibration can be used. An example of logistic calibration (with printing turned on) is:
. calibrate , marginals(isex1 isex2 irace1 weight) poptot(M) entrywt(sampwgt) exitwt(wt2) method(logistic) print(all)
Checks:
. tab sex [iweight=wt2]
. tab race [iweight=wt2]
. summ weight [iweight=wt2]
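Bounded linear calibration follows the same pattern. In the sketch below the bounds 0.5 and 2 are illustrative choices rather than recommendations, and wt3 and ratio3 are hypothetical variable names; with tight bounds there is no guarantee that a solution exists.

```stata
* Bounded linear calibration with illustrative bounds on exitwt/entrywt
calibrate , marginals(isex1 isex2 irace1 weight) poptot(M) entrywt(sampwgt) exitwt(wt3) method(blinear) lbound(0.5) ubound(2) print(final)

* Inspect the ratio of exit weight to entry weight against the requested bounds
generate double ratio3 = wt3 / sampwgt
summarize ratio3
```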
Saved results
calibrate saves the following in r():
Scalars
  r(N)          number of observations
  r(mean)       mean exit weight
  r(min)        minimum exit weight
  r(max)        maximum exit weight
  r(entdeff)    approximate design effect (one plus the coefficient of variation) of the entry weights
  r(exitdeff)   approximate design effect (one plus the coefficient of variation) of the exit weights
  r(sclmin)     minimum exit weight after re-scaling to have a mean of one
  r(sclmax)     maximum exit weight after re-scaling to have a mean of one
  r(sclsd)      standard deviation of exit weights after re-scaling to have a mean of one
Matrices
  r(Bhat)       coefficients of the variables in marginals used in the equation calculating the exit weight
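The saved results can be inspected immediately after a run. A minimal sketch, using the linear example above (B is a hypothetical matrix name):

```stata
calibrate , marginals(isex1 isex2 irace1) poptot(M) entrywt(sampwgt) exitwt(wt1)

* List everything calibrate left in r()
return list

* Compare the approximate design effects before and after calibration
display "Design effect: entry = " r(entdeff) ", exit = " r(exitdeff)

* Copy the coefficient matrix before another r-class command overwrites r()
matrix B = r(Bhat)
matrix list B
```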
Also see
Calibration can be thought of as a generalisation of post-stratification. The program calibest generalises Stata's post-stratification estimation commands.
References
Deming, W. E., and F. F. Stephan. 1940. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics 11: 427-444.
Deville, J.-C., and C.-E. Särndal. 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87: 376-382.
Särndal, C.-E., and S. Lundström. 2005. Estimation in Surveys with Nonresponse. New York: Wiley.
Singh, A. C., and C. A. Mohl. 1996. Understanding calibration estimators in survey sampling. Survey Methodology 22: 107-115.
Author
John D'Souza
National Centre for Social Research
London, England, UK
John.D'Souza@natcen.ac.uk