Survey sampling weights: adjustment and replicate weight creation
survwgt create weight_type , strata(varname) psu(varname) weight(varname) stem(stem) [ fay(#) dof(#) hadmat(matname) hadfile(matrix_file_name) nodots ]
weight_type is one of
    brr - balanced repeated replication weights
    jk1 - unstratified delete-one jackknife weights
    jk2 - two per stratum delete-one jackknife weights
    jkn - delete-n jackknife weights
survwgt poststratify varspec , by(varlist) totvar(varname) { generate(varlist) | stem(stem) | prefix(prefix) | replace }
survwgt rake varspec , by(varlist) totvars(varlist) { generate(varlist) | stem(stem) | prefix(prefix) | replace }
survwgt nonresponse varspec , by(varlist) respvar(varname) { generate(varlist) | stem(stem) | prefix(prefix) | replace }
varspec is one of
    varlist - a Stata variable list
    [pw] - indicates the full sample weight
    [rw] - indicates the set of replicate weights
    [all] - indicates both [pw] and [rw]
Description
survwgt creates sets of weights for replication-based variance estimation techniques for survey data. These include balanced repeated replication (BRR) and several versions of the survey jackknife (JK*). These replication methods are alternatives to the Taylor series linearization methods used by Stata's svy-based commands.
In addition, survwgt performs poststratification, raking, and non-response adjustments to survey weights.
survwgt create creates a set of replicate weights for a dataset. survwgt can create four types of replicate weights, depending on the nature of the complex sample design and user preferences. In each method, multiple weight variables are created. Each set of replicate weights is calculated by setting the sampling weights for observations in one or more PSUs to zero, and adjusting the sampling weights for the remaining observations to reproduce the full-sample totals. The svr set of commands uses these weights to calculate (co)variance estimates of parameters by repeatedly estimating statistics of interest with each set of replicate weights. See Wolter (1985) for details.
brr (balanced repeated replication) is appropriate for designs with exactly two PSUs per stratum. Technically, PSUs must have been selected with replacement; any subsequent subsampling within PSUs is acceptable. In BRR, n weights are created, where n is the smallest multiple of four greater than or equal to the number of strata. In each set of weights, one PSU from each stratum is included and the other excluded, in a pattern defined by a Hadamard matrix. Specifications for Hadamard matrices up to dimension 512 are included with survwgt, which allows for designs with up to 512 strata. Larger Hadamard matrices may be provided by the user, up to the limits of matsize.
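The rule for the number of BRR replicates can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name brr_replicate_count is hypothetical and is not part of survwgt.

```python
def brr_replicate_count(n_strata):
    """Smallest multiple of four that is >= the number of strata."""
    if n_strata < 1:
        raise ValueError("need at least one stratum")
    return ((n_strata + 3) // 4) * 4


# A design with 5 strata needs 8 replicates; 512 strata need exactly 512.
counts = [brr_replicate_count(n) for n in (1, 4, 5, 510, 512)]
print(counts)
```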
jk1 (unstratified delete-one jackknife) is appropriate for non-stratified, clustered sampling designs. In JK1, one PSU is deleted from each set of replicate weights, so the number of replicate weights equals the number of PSUs.
jk2 (two per stratum delete-one jackknife) is appropriate for the same designs as the balanced repeated replication method: two PSUs per stratum, selected with replacement, with any subsampling scheme within PSU. In the jk2 method, one PSU is deleted from each replicate (as opposed to one PSU per stratum in the brr method).
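jkn (delete-n jackknife) is appropriate for sampling designs with two or more PSUs per stratum. In JKn, one PSU is deleted from each replicate and the weights for the remaining PSUs in the same stratum are scaled up to compensate, so the number of replicate weights equals the total number of PSUs.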
survwgt poststratify computes post-stratification adjustments to survey sampling weights. The sampling weights in each stratum are adjusted by a multiplicative factor such that the sum of the weights equals the control total for each stratum, as specified in the totvar() option. When more than one sampling weight variable is specified, the command post-stratifies each in turn. This allows for the adjustment of the main sampling weight and a full set of replicate weights in one easy step.
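As a rough illustration of the adjustment just described, the following Python sketch scales the weights in each stratum so that they sum to a control total. The function name poststratify and the toy data are assumptions made for this example; this is not the code survwgt runs.

```python
from collections import defaultdict


def poststratify(weights, strata, control_totals):
    """Scale weights so that each stratum's weights sum to its control total."""
    sums = defaultdict(float)
    for w, s in zip(weights, strata):
        sums[s] += w
    # One multiplicative factor per stratum: control total / current weight sum.
    return [w * control_totals[s] / sums[s] for w, s in zip(weights, strata)]


weights = [10.0, 30.0, 20.0, 20.0]
strata = ["a", "a", "b", "b"]
totals = {"a": 100.0, "b": 60.0}
adjusted = poststratify(weights, strata, totals)
print(adjusted)
```

Applying the same function to each replicate weight variable in turn mirrors what happens when several weight variables are passed in one call.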
survwgt rake computes raking adjustments to survey sampling weights. Raking is used when there are multiple stratification dimensions and control totals are known for the marginal distribution of each dimension but not for the individual cell totals. (When population cell totals are known, post-stratification should be used.) In raking, the sampling weights are iteratively adjusted by a multiplicative factor such that the sum of the weights equals the control total for each marginal dimension, in turn, until convergence is achieved. As with post-stratification, multiple sets of analysis and replicate weights can be raked with one call to survwgt.
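The iteration can be sketched as follows in Python. The function name rake, the convergence tolerance, and the iteration cap are assumptions for the example; the stopping rule that survwgt actually uses is not documented here.

```python
def rake(weights, dims, margins, tol=1e-10, max_iter=100):
    """Iteratively scale weights to match the marginal totals of each dimension.

    dims    : one list of category labels per raking dimension
    margins : one dict {label: control total} per raking dimension
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for dim, targets in zip(dims, margins):
            sums = {}
            for wi, level in zip(w, dim):
                sums[level] = sums.get(level, 0.0) + wi
            for i, level in enumerate(dim):
                factor = targets[level] / sums[level]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:  # assumed stopping rule: all factors close to 1
            break
    return w


sex = ["m", "m", "f", "f"]
age = ["young", "old", "young", "old"]
raked = rake([1.0, 1.0, 1.0, 1.0], [sex, age],
             [{"m": 60.0, "f": 40.0}, {"young": 50.0, "old": 50.0}])
print(raked)
```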
survwgt nonresponse computes non-response adjustments to survey sampling weights. Non-response adjustment requires a dataset that includes the full sample, both responders and non-responders. Separately within each response stratum, the base survey sampling weight for each responder in the sample is adjusted such that the total weight for responders alone equals the total weight for the sample. The weight for non-responders is set to zero. As with the other weight adjustment routines, multiple variables can be subjected to non-response adjustment in one easy call to survwgt.
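A minimal Python sketch of this adjustment is shown below. The function name nonresponse_adjust is hypothetical, and None stands in for a missing (out of sample) response indicator; the sketch does not reproduce the warnings that survwgt displays.

```python
def nonresponse_adjust(weights, strata, responded):
    """Shift the weight of non-responders onto responders within each stratum.

    responded: 1 = responder, 0 = non-responder, None = out of sample.
    """
    total = {}       # weight of all sample members, by stratum
    resp_total = {}  # weight of responders only, by stratum
    for w, s, r in zip(weights, strata, responded):
        if r is None:
            continue
        total[s] = total.get(s, 0.0) + w
        if r == 1:
            resp_total[s] = resp_total.get(s, 0.0) + w

    adjusted = []
    for w, s, r in zip(weights, strata, responded):
        if r is None:
            adjusted.append(None)   # out of sample: adjusted weight is missing
        elif r == 1:
            adjusted.append(w * total[s] / resp_total[s])
        else:
            adjusted.append(0.0)    # non-responders get zero weight
    return adjusted


adj = nonresponse_adjust([10.0, 10.0, 20.0, 5.0, 5.0],
                         [1, 1, 1, 2, 2],
                         [1, 0, 1, 1, None])
print(adj)
```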
Options for survwgt create
strata(varname) specifies the variable that identifies stratum membership. This must be a single variable; if the strata are defined by multiple variables, a single variable can be created with egen's group() option. The strata() option is required for all weight types except JK1, for which it is not allowed.
psu(varname) specifies the variable that identifies the primary sampling units within strata. It is required for all types of replicate weight creation.
weight(varname) specifies the base sampling weights. It is required.
stem(stem) specifies a stem to be used as the basis for the replicate weight variable names. The replicate weight variables are named stem1, stem2, ... stemn. If stem() is not specified, a stem based on the type of weights is used (brr_, jk1_, jk2_, or jkn_).
fay(#) specifies the value of the constant to be used in generating weights according to Fay's variant of balanced repeated replication. In this method, the weights for observations in the selected PSUs are multiplied by (2-fay), and those in the non-selected PSUs are multiplied by (fay), rather than by 2 and 0, respectively. By default, the Fay constant is 0, which implies "regular" BRR. This option is valid only for the brr method.
dof(#) specifies the appropriate degrees of freedom for variance estimates. By default, degrees of freedom is set to the number of strata for the BRR and JK2 methods, to one less than the number of PSUs for JK1, and to the total number of PSUs minus the total number of strata for the JKn method.
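The default degrees of freedom can be summarized in a small Python function. The name default_dof and its arguments are hypothetical and serve only to restate the rule above.

```python
def default_dof(method, n_strata, n_psus):
    """Default degrees of freedom for each replication method."""
    if method in ("brr", "jk2"):
        return n_strata
    if method == "jk1":
        return n_psus - 1
    if method == "jkn":
        return n_psus - n_strata
    raise ValueError("unknown method: " + method)


# 10 strata with 2 PSUs each (BRR/JK2), 20 unstratified PSUs (JK1),
# and 10 strata holding 25 PSUs in total (JKn).
print(default_dof("brr", 10, 20), default_dof("jk1", 0, 20), default_dof("jkn", 10, 25))
```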
hadmat(matname) specifies a Stata system matrix that contains the Hadamard matrix to create the replicates. The program comes with a binary file with Hadamard matrices up to dimension 512, so this option should be little-used. No checking is done that the matrix is in fact a Hadamard matrix -- be careful!
hadfile(matrix_file_name) specifies the system file that contains Hadamard matrices. This should only be necessary when the system file is located off the Stata search path, or named something other than the default.
nodots specifies that a dot should not be displayed for each set of weights that is created. With large datasets, the dots can reassure you that the program has not died.
Options for survwgt poststratify, rake, and nonresponse
varspec specifies the base weight(s) to be adjusted (i.e., post-stratified, raked, or adjusted for non-response). This can be specified as a Stata varlist. More usefully, this can be specified as [pw], which indicates the currently specified main analysis weight, and/or [rw], which indicates the set of replicate weights. [all] is a synonym for [pw] [rw]. If this automated variable specification is used, then the svr settings for the dataset are updated to specify the new weights, unless the noupdate option is specified. See svrset.
noupdate specifies that the svr settings for the dataset should not be updated to reflect the adjusted weights. This only has an effect when the variables to be adjusted are specified with the "automatic" syntax discussed above.
by(varlist) specifies variable(s) identifying the strata. For post-stratification, the base weights are adjusted to sum to the control totals for the cells defined by the strata; for raking, the base weights are adjusted to sum to the control totals for the marginals of the strata. For non-response adjustment, the base weights for respondents are adjusted to sum to the total of the full sample weights for the cells defined by the strata.
totvar[s](varname[s]) specifies the variable[s] containing control totals for post-stratification or raking. For post-stratification, totvar() must be a single variable of control totals, constant within the cells defined by by(varlist). For raking, totvars() must contain one variable per variable specified in by(), each giving the marginal control total for each value of the corresponding stratum variable. This option is not valid for non-response adjustment.
respvar(varname) specifies the variable that contains response information for members of the sample. This variable must take on values of 0 (indicating non-response), 1 (indicating response), or missing (out of sample). The base weights are adjusted such that, within each response stratum, the adjusted weight for respondents sums to the total of the base weight for all sample members. Non-respondent cases are assigned an adjusted weight of zero, and out of sample cases are excluded from the calculations and assigned missing for the adjusted weight.
If there are no respondents in a stratum, all weights for that stratum are set to zero and a warning is displayed. If there are no sample members in a stratum, all weights are set to missing and a warning is displayed. This option is only valid for non-response adjustment.
generate(varlist) specifies explicitly the names for the adjusted weight variable(s) to be created. There must be one name per base sampling weight specified in varlist.
stem(stem) specifies a stem to be used to create names for the adjusted weight variable(s). New variables are numbered from 1, unless [pw] or [all] is indicated for the varspec, in which case they are numbered from 0.
prefix(prefix) specifies a prefix to be prepended to the existing variable names to create the names of the adjusted weight variable(s).
replace specifies that the adjusted variables should replace the existing variables. This option should be used with caution.
Examples
Methods and formulae for weight calculation
survwgt create only works for survey designs that exactly match the specifications for the type of weights requested (two PSUs per stratum for BRR, etc.). Any collapsing of strata or PSUs, splitting of certainty PSUs, or other adjustments to approximate the appropriate design must be done outside of the program.
The program creates R sets of replicate weights, where R is defined as discussed above for the replication method.
For the BRR method, the program selects one of the PSUs from each stratum, according to a Hadamard matrix of the relevant dimension. For replicate j, the weights for each observation i are calculated as follows:
    W_{ij} = W_{iF} * (2-k)    for observations in the PSU selected into the replicate
           = W_{iF} * (k)      for observations in the other PSU
where
    W_{iF} is the full-sample weight for observation i, and
    k is the constant for Fay's method. For standard BRR, k=0.
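The BRR weight formula can be illustrated with the Python sketch below. Several things here are assumptions made for the example: the Hadamard matrix is built with the Sylvester construction (which only yields power-of-two sizes, unlike the stored matrices shipped with survwgt), strata are indexed from 0 and mapped to matrix columns in order, and an entry of +1 is taken to select the first PSU of the stratum. The function names are hypothetical.

```python
def sylvester_hadamard(n):
    """Return an n x n Hadamard matrix of +1/-1 entries (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def brr_weights(weights, strata, psu, hadamard, fay=0.0):
    """One list of replicate weights per Hadamard row.

    strata: stratum index 0..L-1; psu: 0 or 1 within the stratum.
    """
    replicates = []
    for row in hadamard:
        rep = []
        for w, s, p in zip(weights, strata, psu):
            # Assumed convention: +1 selects PSU 0, -1 selects PSU 1.
            selected = (row[s] == 1) == (p == 0)
            rep.append(w * ((2.0 - fay) if selected else fay))
        replicates.append(rep)
    return replicates


weights = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
strata = [0, 0, 1, 1, 2, 2]
psu = [0, 1, 0, 1, 0, 1]
reps = brr_weights(weights, strata, psu, sylvester_hadamard(4))
fay_reps = brr_weights(weights, strata, psu, sylvester_hadamard(4), fay=0.5)
print(reps[0], fay_reps[0])
```

With Fay's constant set to 0.5, no observation receives a zero weight, which is the practical appeal of the variant.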
The program comes with an auxiliary binary file, survwgt_hadamardmatrixfile.ado, which contains Hadamard matrices up to dimension 512. These matrices are stored in a compressed binary format to save disk space; the available matrix sizes can be obtained by typing survwgt create brr sizes. (Note: the binary file is named with an "ado" file extension so that it is installed in the correct directory by Stata's net install and ssc install commands. The file is not, in fact, a Stata program, and will issue an appropriate error message if you attempt to run it.)
For the JK1 method, the program creates one replicate per PSU. In replicate j, the weights for each observation i are calculated as follows:
    W_{ij} = 0                       for observations in PSU j
           = W_{iF} * ( N/(N-1) )    for observations in other PSUs
where N is the number of PSUs, and W_{iF} is defined as above.
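A short Python sketch of the JK1 rule follows. The function name jk1_weights is hypothetical, and ordering the replicates by sorted PSU identifier is an assumption of the example.

```python
def jk1_weights(weights, psu):
    """One replicate per PSU: zero out that PSU, scale the rest by N/(N-1)."""
    psus = sorted(set(psu))
    n = len(psus)
    return [
        [0.0 if p == dropped else w * n / (n - 1) for w, p in zip(weights, psu)]
        for dropped in psus
    ]


jk1 = jk1_weights([10.0] * 6, [1, 1, 2, 2, 3, 3])
print(jk1[0])
```

In this toy case the PSUs carry equal total weight, so each replicate reproduces the full-sample total of 60.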
For the JK2 method, the program creates one replicate per stratum. In each replicate, the weights in the selected stratum are doubled in the first PSU, and set to zero in the second PSU. Weights in other strata are not changed.
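The JK2 rule can be sketched in Python as below. The function name jk2_weights is hypothetical, and treating the PSU with the smallest identifier as the "first" PSU of a stratum is an assumption; the text does not say how survwgt orders the two PSUs.

```python
def jk2_weights(weights, strata, psu):
    """One replicate per stratum: double the first PSU, zero the second."""
    replicates = []
    for h in sorted(set(strata)):
        # Assumption: the "first" PSU is the one with the smallest identifier.
        first = min(p for s, p in zip(strata, psu) if s == h)
        rep = []
        for w, s, p in zip(weights, strata, psu):
            if s != h:
                rep.append(w)          # other strata are left unchanged
            elif p == first:
                rep.append(2.0 * w)
            else:
                rep.append(0.0)
        replicates.append(rep)
    return replicates


jk2 = jk2_weights([5.0, 5.0, 8.0, 8.0], [1, 1, 2, 2], [1, 2, 1, 2])
print(jk2)
```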
For the JKn method, the program creates one replicate per PSU. In replicate j, the weights for each observation i are calculated as follows:
    W_{ij} = 0                             for observations in PSU j (from stratum k)
           = W_{iF} * ( N_k/(N_k-1) )      for observations from other PSUs in stratum k
           = W_{iF}                        for observations in other strata
where
    N_k is the number of PSUs in stratum k
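The three cases of the JKn formula map directly onto the following Python sketch. The function name jkn_weights is hypothetical, and the ordering of replicates (by stratum, then by PSU identifier) is an assumption of the example.

```python
def jkn_weights(weights, strata, psu):
    """One replicate per PSU; rescale only within the deleted PSU's stratum."""
    psus_in = {}
    for s, p in zip(strata, psu):
        psus_in.setdefault(s, set()).add(p)

    replicates = []
    for h in sorted(psus_in):
        n_h = len(psus_in[h])              # number of PSUs in stratum h
        for dropped in sorted(psus_in[h]):
            rep = []
            for w, s, p in zip(weights, strata, psu):
                if s != h:
                    rep.append(w)                      # other strata unchanged
                elif p == dropped:
                    rep.append(0.0)                    # deleted PSU
                else:
                    rep.append(w * n_h / (n_h - 1))    # same stratum, kept PSU
            replicates.append(rep)
    return replicates


jkn = jkn_weights([4.0, 4.0, 4.0, 6.0, 6.0], [1, 1, 1, 2, 2], [1, 2, 3, 1, 2])
print(jkn[0], jkn[3])
```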
Methods and formulae for variance estimation
The svr set of commands makes use of the replication weights produced by survwgt create to estimate (co)variances for parameters and other estimated quantities. See svr for a list of these commands. In general, the parameter estimates are calculated by standard statistical commands using aweights, which yields the same point estimates as Stata's linearization-based svy commands for survey data. The (co)variances are calculated by repeatedly re-estimating the same parameters with each of the sets of replicate weights. From these replicated estimates, the (co)variances are calculated as follows:
    V(b) = F * \sum_{r=1}^{R} f_r (b_r - b)(b_r - b)'
where
    R is the number of replicates,
    b is the vector of full sample point estimate(s),
    b_r is the vector of estimates derived from replicate r,
    F is a constant factor depending on the replication method, and
    f_r is a replicate-specific factor.
The values for F and f_r are:
    method                  F                 f_r
    BRR                     1/R               1
    Fay's variant of BRR    1/(R*(1-k)^2)     1
    JK1                     (R-1)/R           1
    JK2                     1                 1
    JKn                     1                 (N_k-1)/N_k
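To make the variance formula and the table of factors concrete, here is a small Python sketch that accumulates the outer products of the replicate deviations. The function name replicate_variance and the toy estimates are assumptions for the example; this is not how the svr commands are implemented.

```python
def replicate_variance(full, reps, F, f=None):
    """V(b) = F * sum_r f_r (b_r - b)(b_r - b)', returned as a list of rows."""
    if f is None:
        f = [1.0] * len(reps)          # f_r = 1 for every method except JKn
    p = len(full)
    V = [[0.0] * p for _ in range(p)]
    for f_r, b_r in zip(f, reps):
        d = [x - y for x, y in zip(b_r, full)]
        for i in range(p):
            for j in range(p):
                V[i][j] += F * f_r * d[i] * d[j]
    return V


# JK1 with R = 3 replicates of a single estimate: F = (R-1)/R.
reps = [[9.0], [11.0], [10.0]]
R = len(reps)
V_jk1 = replicate_variance([10.0], reps, F=(R - 1) / R)

# Fay's BRR with k = 0.5 on the same toy estimates: F = 1/(R*(1-k)^2).
k = 0.5
V_fay = replicate_variance([10.0], reps, F=1.0 / (R * (1.0 - k) ** 2))
print(V_jk1, V_fay)
```

For JKn, the list f would hold (N_k-1)/N_k for the stratum of the PSU deleted in each replicate, with F set to 1.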
Saved Results
The command saves no results, but it does set data characteristics to identify the full sample and replicate weight variables, degrees of freedom, Fay constant for the BRR method, and replication method (BRR, JK1, JK2, or JKn). The svrset command can be used to set, clear, or display these characteristics.
References
Judkins, D. 1990. Fay's Method for Variance Estimation. Journal of Official Statistics 6:223-239.
Wolter, K. M. 1985. Introduction to Variance Estimation. New York: Springer-Verlag.
Acknowledgements
I would like to thank Bobby Gutierrez at StataCorp for advice on implementation of BRR, and the technical group at StataCorp for feedback on an early version of the BRR programs. The code for raking is partly based upon Nick Cox's program mstdize.
Author
Nick Winter Cornell University nw53@cornell.edu