Title
gsample -- Sampling
Syntax
gsample [#|varname] [if] [in] [weight] [, options]
    options               Description
    -------------------------------------------------------------------------
    percent               sample size is in percent
    wor                   sample without replacement
    strata(varlist)       variables identifying strata
    cluster(varlist)      variables identifying resampling clusters
    idcluster(newvar)     create new cluster ID variable
    keep                  keep observations that do not meet if and in
    generate(newvar)      store sampling frequencies in newvar
    replace               overwrite existing variables
    -------------------------------------------------------------------------
    aweights are allowed; see weight.
Description
gsample draws a random sample from the data in memory. Simple random sampling (SRS) is supported, as well as unequal probability sampling (UPS), of which sampling with probabilities proportional to size (PPS) is a special case. Both methods, SRS and UPS/PPS, can be applied with or without replacement. Stratified sampling and cluster sampling are also supported.
# specifies the size of the sample. The default for gsample is to replace the data in memory with the sampled observations in random order. Alternatively, gsample may store a new variable containing the sampling frequencies of the observations (see the generate(newvar) option). In the case of sampling without replacement (see the wor option), the sample size must be less than or equal to the number of sampling units in the data. Sampling units are either single observations or clusters identified by the cluster() option. If # is not specified or if #==., the sample size is equal to the observed number of units in the data. For stratified sampling, # units will be selected from each stratum identified by the strata() option. Alternatively, specify varname instead of #, where varname is a variable containing the sample size for each stratum. varname is assumed to be constant within strata.
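As an illustration of stratum-specific sample sizes, the following sketch uses Stata's shipped auto dataset. The variable name nsamp and the sizes 20 and 10 are arbitrary choices made for this example, not part of gsample itself.

```stata
* Load the example data (52 domestic and 22 foreign cars)
sysuse auto, clear
set seed 12345

* Hypothetical helper variable: the sample size wanted in each stratum,
* constant within strata as gsample assumes
generate nsamp = cond(foreign, 10, 20)

* Draw 20 domestic and 10 foreign cars without replacement
gsample nsamp, wor strata(foreign)

* Check the resulting stratum sizes
tabulate foreign
```

Because wor is specified, each requested size has to stay at or below the number of observations in its stratum.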
Specifying aweights causes unequal probability sampling (UPS/PPS) to be performed. The sampling probabilities of the observations will be proportional to the specified weights in this case.
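A minimal sketch of a PPS draw of fixed size, again assuming the shipped auto dataset; using the car's weight as the size measure is purely illustrative.

```stata
sysuse auto, clear
set seed 12345

* Select 10 cars without replacement, with selection probabilities
* proportional to the variable weight (the size measure here)
gsample 10 [aw=weight], wor

* Inspect the selected cars
list make weight
```

Heavier cars should tend to appear more often across repeated draws. With a much more uneven size measure, a draw without replacement may fail, as noted in the description.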
gsample is implemented as a wrapper for the mm_sample() function from the moremata package. See help for mm_sample() for methodological details and references. Note that for unequal probability sampling without replacement many different algorithms have been proposed in the literature and there may be better solutions than the method implemented here. In addition, UPS without replacement may fail if the distribution of weights is very uneven (see help for mm_sample() for an explanation of this problem).
If the sample needs to be reproducible, set the random-number seed before calling gsample; see help generate.
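The effect of the seed can be checked with a sketch like the following. It assumes the auto dataset; the seed value and the names s1 and s2 are arbitrary.

```stata
sysuse auto, clear

* First draw: store the sampling frequencies rather than dropping data
set seed 12345
gsample 10, wor generate(s1)

* Second draw with the same seed and the same data order
set seed 12345
gsample 10, wor generate(s2)

* The two draws should select the same observations
assert s1 == s2
```

The generate() option is used here only so that both draws can be compared within the same dataset.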
Dependencies
gsample requires moremata. Type
. ssc describe moremata
Options
percent indicates that # (or varname) specifies the percentage of observations to be sampled. For example,
. gsample 50, percent
draws a 50% sample. For stratified sampling, a #-percent sample is drawn from each stratum, thus maintaining the proportion of each stratum.
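For instance, a stratified 50% sample without replacement could be drawn as sketched below, assuming the auto dataset with foreign as the stratum variable.

```stata
sysuse auto, clear
set seed 12345

* Stratum sizes before sampling
tabulate foreign

* Draw half of the cars within each stratum, without replacement
gsample 50, percent wor strata(foreign)

* Stratum sizes after sampling: each should be about half of the original
tabulate foreign
```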
wor causes the observations to be sampled without replacement (each observation may only be drawn once). The default is to sample with replacement (observations may be drawn multiple times).
strata(varlist) specifies the variables identifying strata. If strata() is specified, samples are selected within each stratum. The strata variables may be numeric or string.
cluster(varlist) specifies the variables identifying sampling clusters. If cluster() is specified, the sample drawn is a sample of clusters. The cluster variables may be numeric or string.
idcluster(newvar) creates a new variable containing a unique identifier for each sampled cluster. This is particularly useful when sampling with replacement.
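The two cluster options can be combined as in the following sketch of a cluster bootstrap. It assumes Internet access for webuse and uses the nlswork panel data, in which idcode identifies persons; the name bsid is an arbitrary choice.

```stata
* Panel data with several records per person
webuse nlswork, clear
set seed 12345

* Resample persons with replacement (as many as observed in the data),
* keeping all records of each selected person; bsid labels each draw
gsample, cluster(idcode) idcluster(bsid)

* A person drawn more than once appears under several values of bsid
codebook idcode bsid, compact
```

When clusters are drawn with replacement, the original identifier no longer distinguishes repeated draws of the same cluster, which is why a new identifier is helpful for subsequent panel or clustered analyses.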
keep causes observations that do not meet the optional if and in criteria (and observations that are missing on any of the input variables or have zero weight) to be kept (sampled at 100%). The default is to drop these observations. Alternatively, if generate() is specified, keep changes the stored sampling frequencies of these observations from zero to one.
generate(newvar) causes a variable containing sampling frequencies to be added to the data instead of replacing the data in memory with the sampled observations. Note that the original sort order of the data will be preserved if generate() is specified. newvar will either be zero or one for observations that do not meet the optional if and in criteria (or for observations that are missing on any of the input variables or have zero weight) depending on whether keep is specified or not.
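A short sketch of how the stored frequencies might be used, assuming the auto dataset; the name bsfreq is arbitrary, and using the frequencies as fweights is one possible approach rather than a documented feature of gsample.

```stata
sysuse auto, clear
set seed 12345

* Store bootstrap sampling frequencies; the data stay as they are
gsample, generate(bsfreq)

* Distribution of the frequencies (0 = not drawn, 1 = drawn once, ...)
tabulate bsfreq

* Analyze the bootstrap sample by using the frequencies as weights
summarize price [fweight=bsfreq]
```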
replace permits gsample to overwrite existing variables.
Examples
For example,
. gsample
draws a bootstrap sample of size _N (simple random sample with replacement, SRSWR). The data in memory will be replaced by the sampled observations. Alternatively,
. gsample, wor
draws a simple random sample without replacement (SRSWOR).
Furthermore,
. gsample [aw=size]
draws an unequal probability sample with sampling probabilities proportional to size.
Author
Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch
You may cite this software as follows:
Jann, B. (2006). gsample: Stata module to draw a random sample. Available from http://ideas.repec.org/c/boc/bocode/s456716.html.
Also see
Online: mm_sample(), sample, bsample, generate, moremata