Generate direct standardization weights for input to estimation commands
dsweight stanvarlist [if] [in] [weight] [using filename] , generate(newvarname) [ groupvars(varlist) by(varlist) nocomplete missing tfreqvar(varname) sorted float fast ]
scdsweight stanvarlist [if] [in] [weight] [using filename] , generate(newvarname) scenvar(varname) [ by(varlist) nocomplete missing tfreqvar(varname) sorted float fast ]
where stanvarlist is a varlist specifying a list of standardization variables.
Description
dsweight generates direct standardization weights for input as pweights to estimation commands, standardizing the joint distribution of a list of standardization variables to a standard target population, possibly within groups defined by value combinations of a list of group variables. A direct standardization weight is the ratio of the frequency of a combination of values of the standardization variables in the standard target population to the frequency of the same combination in the sample or group. The standard target population may be the full sample, or a by-group defined by a combination of values of by-variables, or it may be defined by a using dataset with 1 observation per combination of values of the standardization variables (within each by-group if by() is specified), containing the frequencies of these combinations in the standard target population. scdsweight is a version of dsweight for generating scenario direct standardization weights, which can be input as scenario weights to the SSC package scsomersd.
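The definition above can be written as a formula. The notation below is illustrative only and does not appear in the package documentation: x denotes a combination of values of the standardization variables, g a group, T(x) the frequency (or sum of weights) of x in the standard target population, and F_g(x) the frequency (or sum of weights) of x in group g. The scaling used internally by dsweight may differ from this sketch.

```latex
% Assumed notation: T(x) = target frequency of combination x,
% F_g(x) = frequency of combination x in group g (or in the full sample).
\[
  w_g(x) = \frac{T(x)}{F_g(x)}
\]
% In the unweighted case, the weights of the F_g(x) observations in group g
% that share combination x then sum to the target frequency:
\[
  \sum_{i \in g,\; x_i = x} w_g(x_i) = F_g(x) \cdot \frac{T(x)}{F_g(x)} = T(x)
\]
```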
Options for dsweight and scdsweight
generate(newvarname) must be present. It specifies the name of a new variable to be generated, containing the direct standardization weights.
groupvars(varlist) (dsweight only) specifies a list of variables, whose value combinations will be groups, within which the joint distribution of the standardization variables in the stanvarlist will be standardized, using the sampling probability weights, to the joint distribution of the standardization variables in the target population. If groupvars() is absent, then the standardization weights will standardize the joint distribution of the standardization variables in the full input sample to the standard target population. The full input sample is the set of all observations in the dataset (or in the by-group if by() is specified) for which the values of all standardization variables and all group variables are non-missing, and which are not excluded by the if and/or in qualifiers.
scenvar(varname) (scdsweight only) specifies a binary scenario-indicator variable, with values 0 and 1, indicating that an observation is present in a scenario, for which the scenario direct standardization weights will be calculated. These scenario direct standardization weights are equal to zero for observations not in the scenario, and equal to direct standardization weights for observations in the scenario, standardizing the distribution of the standardization variables for observations in the scenario to the standard population. These scenario direct standardization weights may be input, as scenario-specific weights, to the scsomersd package, downloadable from SSC. The scsomersd package uses rank methods to compare the distributions of outcomes between scenarios. An example of a scenario-comparison rank statistic is the population attributable risk, which may be either crude or age-standardized.
by(varlist) specifies a list of by-variables, whose combinations (missing or non-missing) specify the by-groups. The standardization weights are calculated independently within each by-group. If a using dataset is specified, then the by-variables must be present in this using dataset, and, together with the standardization variables, they must uniquely identify the observations in the using dataset. If a using dataset is not specified, then the generated standardization weights will standardize the joint distribution of the standardization variables to the subset of the total sample within each by-group.
nocomplete specifies that each group specified by the groupvars() option (or the scenario specified by the scenvar() option) does not have to contain the full list of value combinations of the standardization variables. If nocomplete is absent, then dsweight and scdsweight check that each combination of values of the standardization variables (within each by-group if by() is specified) is present in each combination of values of the groupvars() variables, or in the scenario specified by the scenvar() variable. If this condition is not met, then dsweight or scdsweight will fail.
missing specifies that the generated standardization weights, in the variable named by generate(), may have missing values in the input sample, even if the group (or scenario) variables and standardization variables are non-missing. This may be because the sum of weights in the sample, group or scenario is zero, or because a using dataset is specified and does not contain an observation with the current combination of the standardization variables. If missing is not specified, and some standardization weights in the input sample are missing, then dsweight or scdsweight will fail.
tfreqvar(varname) specifies the name of a variable, in the using dataset, containing the frequencies (or sums of weights) of the corresponding combination of standardization variables in the standard target population. If tfreqvar() is not specified, and a using dataset is specified, then dsweight or scdsweight looks for a variable named _freq. Such a variable will usually be present if the using dataset has been created by the Stata command contract, or by the SSC package xcontract.
sorted functions as the option of the same name for merge. It specifies that the observations in the using dataset are already sorted by the standardization variables (or by the by-variables and the standardization variables if by() is specified), so there is no need for Stata to sort them before use. This may save some computational time.
float specifies that the output variable specified by generate() will be of storage type float or lower. If float is not specified, then the output variable will be generated as type double. Note that the output variable will be compressed after being generated (using compress) to the lowest type possible without loss of precision, whether or not the user specifies float.
fast is an option for programmers. It specifies that dsweight or scdsweight will take no action to restore the existing dataset in memory in the event of failure, or if the user presses Break. If fast is not specified, then dsweight and scdsweight will take this action, which uses an amount of time depending on the size of the dataset in memory.
Remarks
dsweight works on the same principle as dstdize. However, dsweight creates weights that can be input to estimation commands as pweights, in order to estimate a wide range of directly-standardized parameters (not only rates and proportions). scdsweight is intended for use with the scsomersd package, which the user can download from SSC, and which calculates rank statistics for comparing scenarios. The user must also download the SSC packages somersd and expgen, if scsomersd is to work.
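The principle can be sketched by hand with standard Stata commands. The following is a minimal illustration of the ratio described in the Description, using toy data and hypothetical variable names (grp, stratum, tfreq, gfreq, w); it takes the full sample as the target population and is not a copy of the code inside dsweight, whose scaling and handling of sampling weights may differ.

```stata
* Toy data: 2 groups (grp) by 2 strata (stratum); all names are hypothetical
clear
input byte grp byte stratum
0 1
0 1
0 1
0 2
1 1
1 2
1 2
1 2
end

* Frequency of each stratum in the target population (here, the full sample)
bysort stratum: generate double tfreq = _N

* Frequency of each stratum within each group
bysort grp stratum: generate double gfreq = _N

* Ratio of target frequency to within-group frequency
generate double w = tfreq / gfreq

* Within each group, the weights for a stratum sum to its target frequency
egen double wsum = total(w), by(grp stratum)
assert abs(wsum - tfreq) < 1e-8

* Weighted two-way table: both groups now show the same stratum totals
tabulate grp stratum [iweight=w]
```

In this sketch the two groups have different stratum mixes before weighting and the same weighted stratum totals afterwards, which is the property that lets the weights be passed to an estimation command.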
Examples
The following examples make use of the xcontract command, which can be downloaded from SSC, and is an extended version of contract.
Set-up:
. use http://www.stata-press.com/data/r11/lbw.dta, clear
. gene agegp=age
. recode agegp (0/19=1) (20/29=2) (30/max=3)
. lab def agegp 1 "<20" 2 "20-29" 3 "30+"
. lab val agegp agegp
. lab var agegp "Age group"
. describe
. tab agegp, m
The following example creates and lists standardization weights, standardizing the children of smoking and non-smoking mothers to the age group distribution in the total sample, and then uses regress, with the standardization weights as sampling probability weights, to estimate an effect of maternal smoking on birth weight, standardized by age group. We then use censlope, part of the SSC package somersd, to estimate an age-standardized median difference in birth weight between the babies of smoking and non-smoking mothers.
. dsweight agegp, groupvars(smoke) gene(swei1)
. xcontract smoke agegp swei1, list(, abbr(32) sepby(smoke))
. regress bwt smoke [pweight=swei1]
. censlope bwt smoke [pweight=swei1], transf(z) tdist
The following example creates a dataset agpfreq1, with 1 observation per maternal age group and data on the frequencies of that maternal age group in the children of non-smoking mothers. We then use dsweight to create sampling probability weights, standardizing the children of smokers and non-smokers to the maternal age group distribution of non-smokers, and display these weights using xcontract. We then use regress to estimate the effect of smoking, in a hypothetical population, where smoking and non-smoking mothers have the age distribution of non-smokers in the sample. Finally, we use censlope to estimate an age-standardized median difference in birth weight between the babies of smoking and non-smoking mothers.
. xcontract agegp if smoke==0, list(, abbr(32)) saving(agpfreq1, replace)
. dsweight agegp using agpfreq1, groupvars(smoke) gene(swei2)
. xcontract smoke agegp swei2, list(, abbr(32) sepby(smoke))
. regress bwt smoke [pweight=swei2]
. censlope bwt smoke [pweight=swei2], transf(z) tdist
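The role of the using dataset and its _freq variable can be sketched with official Stata commands only. The example below is a hedged illustration with toy data and hypothetical names (grp, stratum, gfreq, w): it builds a target frequency dataset with contract, which stores frequencies in _freq by default, attaches those frequencies with merge, and forms the ratio by hand. It is not the implementation used by dsweight.

```stata
* Toy data; grp and stratum are hypothetical names
clear
input byte grp byte stratum
0 1
0 1
0 1
0 2
1 1
1 2
1 2
1 2
end

* Build the target dataset: 1 observation per stratum among grp==0,
* with the frequency stored by contract in its default variable _freq
preserve
keep if grp == 0
contract stratum
tempfile target
save `target'
restore

* Attach the target frequencies to each observation in memory
merge m:1 stratum using `target', nogenerate

* Frequency of each stratum within each group
bysort grp stratum: generate double gfreq = _N

* Ratio of target frequency to within-group frequency
generate double w = _freq / gfreq

* Within each group, the weights for a stratum sum to its target frequency
egen double wsum = total(w), by(grp stratum)
assert abs(wsum - _freq) < 1e-8
```

Observations in the group that supplied the target (grp==0 here) receive a weight of 1, and the other group is reweighted to match it.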
The following example demonstrates the use of the scdsweight module to compute scenario direct standardization weights for use with the scsomersd package, downloadable from SSC. We define a scenario indicator variable nonsmoke, indicating that a subject is a non-smoker. We then use scdsweight to define scenario direct standardization weights, stored in a new variable swei3, and equal to age-standardization weights for children of non-smokers and to zero for children of smokers. We then use scsomersd to compare two scenarios, the real-world scenario and a fantasy scenario where all mothers are non-smoking and the age-group distribution stays the same, and estimate a population attributable risk, equal to the difference between the proportions of babies with low birth weight in the real-world scenario and in the fantasy scenario.
. gene nonsmoke=1-smoke
. scdsweight agegp, scenvar(nonsmoke) gene(swei3)
. xcontract smoke nonsmoke agegp swei3, list(, abbr(32) sepby(smoke))
. scsomersd low [pwei=1], sweight(swei3) transf(z) tdist
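The structure of a scenario weight, as described under scenvar(), can be sketched by hand: zero for observations outside the scenario, and a target-to-scenario frequency ratio for observations inside it. The following uses toy data and hypothetical names (grp, stratum, scen, tfreq, sfreq, sw), takes the full sample as the target, and should be read as an illustration of the idea rather than as the code inside scdsweight.

```stata
* Toy data; grp, stratum and scen are hypothetical names
clear
input byte grp byte stratum
0 1
0 1
0 1
0 2
1 1
1 2
1 2
1 2
end

* Scenario indicator: 1 for observations in the scenario, 0 otherwise
generate byte scen = (grp == 0)

* Target frequency of each stratum: here, the full sample
bysort stratum: generate double tfreq = _N

* Frequency of each stratum among observations in the scenario
egen double sfreq = total(scen), by(stratum)

* Zero outside the scenario, target/scenario frequency ratio inside it
generate double sw = cond(scen == 1, tfreq / sfreq, 0)

* Within each stratum, the scenario weights sum to the target frequency
egen double swsum = total(sw), by(stratum)
assert abs(swsum - tfreq) < 1e-8

sort grp stratum
list grp stratum scen sw, sepby(grp)
```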
Author
Roger Newson, National Heart and Lung Institute, Imperial College London, UK. Email: r.newson@imperial.ac.uk
Also see
Manual: [D] merge, [D] contract, [R] dstdize
On-line: help for merge, contract, dstdize
         help for xcontract, somersd, censlope, scsomersd, expgen if installed