help gsample
-------------------------------------------------------------------------------

Title

gsample -- Sampling

Syntax

gsample [#|varname] [if] [in] [weight] [, options]

options              Description
-------------------------------------------------------------------------
percent              sample size is in percent
wor                  sample without replacement
strata(varlist)      variables identifying strata
cluster(varlist)     variables identifying resampling clusters
idcluster(newvar)    create new cluster ID variable
keep                 keep observations that do not meet if and in
generate(newvar)     store sampling frequencies in newvar
replace              overwrite existing variables
-------------------------------------------------------------------------
aweights are allowed; see weight.

Description

gsample draws a random sample from the data in memory. Simple random sampling (SRS) is supported, as well as unequal probability sampling (UPS), of which sampling with probabilities proportional to size (PPS) is a special case. Both methods, SRS and UPS/PPS, provide sampling with replacement and sampling without replacement. Furthermore, stratified sampling and cluster sampling are supported.

# specifies the size of the sample. The default for gsample is to replace the data in memory with the sampled observations in random order. Alternatively, gsample may store a new variable containing the sampling frequencies of the observations (see the generate(newvar) option).

In the case of sampling without replacement (see the wor option), the sample size must be less than or equal to the number of sampling units in the data. Sampling units are either single observations or clusters identified by the cluster() option. If # is not specified or if #==., the sample size is equal to the observed number of units in the data.

For stratified sampling, # units will be selected from each stratum identified by the strata() option. Alternatively, specify varname instead of #, where varname is a variable containing a specific sample size for each stratum. varname is assumed to be constant within strata.

Specifying aweights causes unequal probability sampling (UPS/PPS) to be performed. In this case, the sampling probabilities of the observations will be proportional to the specified weights.
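The two ways of setting stratum sample sizes can be sketched as follows. This is a minimal illustration, assuming gsample and moremata are installed. It uses Stata's shipped auto dataset, and n_h is a hypothetical variable name introduced here for the stratum-specific sizes.

```stata
* Sketch only: assumes gsample and moremata are installed
sysuse auto, clear
set seed 12345

* Same size in every stratum: 10 cars from each level of foreign
gsample 10, wor strata(foreign)
tabulate foreign

* Stratum-specific sizes via a variable that is constant within strata
* (n_h is a hypothetical name: 15 domestic and 5 foreign cars)
sysuse auto, clear
generate n_h = cond(foreign, 5, 15)
gsample n_h, wor strata(foreign)
tabulate foreign
```

Because wor is specified, each requested size must not exceed the number of observations in its stratum.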

gsample is implemented as a wrapper for the mm_sample() function from the moremata package. See help for mm_sample() for methodological details and references. Note that for unequal probability sampling without replacement many different algorithms have been proposed in the literature and there may be better solutions than the method implemented here. In addition, UPS without replacement may fail if the distribution of weights is very uneven (see help for mm_sample() for an explanation of this problem).

If you are serious about sampling, you should first set the random number seed; see help generate.
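As a rough illustration of why the seed matters, the sketch below draws the same sample twice by resetting the seed before each call. It assumes gsample and moremata are installed and that gsample draws its random numbers from the generator controlled by set seed. The seed value and the variable names draw1 and draw2 are arbitrary choices made for this example.

```stata
* Sketch only: assumes gsample and moremata are installed
sysuse auto, clear

* First draw, stored as sampling frequencies
set seed 12345
gsample 20, wor generate(draw1)

* Second draw with the same seed
set seed 12345
gsample 20, wor generate(draw2)

* Count observations where the two draws disagree
count if draw1 != draw2
```

If the two calls share the same random number stream, the count should be zero, which means the sample can be reproduced later from the recorded seed.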

Dependencies

gsample requires moremata. Type

. ssc describe moremata

Options

percent indicates that # (or varname) specifies the percentage of observations to be sampled. For example,

. gsample 50, percent

draws a 50% sample. For stratified sampling, a #-percent sample is drawn from each stratum, thus maintaining the proportion of each stratum.
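A stratified percentage sample might look like the following sketch, which assumes gsample and moremata are installed and uses the auto dataset shipped with Stata. How fractional stratum sizes are rounded is not stated in this help file, so the resulting counts are best checked with tabulate.

```stata
* Sketch only: assumes gsample and moremata are installed
sysuse auto, clear
set seed 12345

* Stratum sizes before sampling
tabulate foreign

* Draw 25% of the cars within each level of foreign, without replacement
gsample 25, percent wor strata(foreign)

* Stratum sizes after sampling; the shares should be roughly unchanged
tabulate foreign
```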

wor causes the observations to be sampled without replacement (each observation may only be drawn once). The default is to sample with replacement (observations may be drawn multiple times).

strata(varlist) specifies the variables identifying strata. If strata() is specified, samples are selected within each stratum. The strata variables may be numeric or string.

cluster(varlist) specifies the variables identifying sampling clusters. If cluster() is specified, the sample drawn is a sample of clusters. The cluster variables may be numeric or string.

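idcluster(newvar) creates a new variable containing a unique identifier for each sampled cluster. This is particularly useful when sampling with replacement, because a cluster that is drawn more than once would otherwise appear under the same original identifier.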
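The interplay of cluster() and idcluster() can be sketched with a small artificial dataset. The example assumes gsample and moremata are installed. The variables school, pupil, and newid are hypothetical names made up for this illustration.

```stata
* Sketch only: assumes gsample and moremata are installed
clear
set obs 100

* Hypothetical data: 10 schools (clusters) with 10 pupils each
generate school = ceil(_n/10)
generate pupil  = _n

set seed 12345

* Draw 5 schools with replacement; all pupils of a drawn school are kept
gsample 5, cluster(school) idcluster(newid)

* school may repeat across draws, whereas newid labels each drawn cluster
tabulate school newid
```

When the resampled data feed into a later cluster-level analysis, newid rather than school would be the variable to use as the cluster identifier.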

keep causes observations that do not meet the optional if and in criteria (and observations that are missing on any of the input variables or have zero weight) to be kept (sampled at 100%). The default is to drop these observations. Alternatively, if generate() is specified, keep changes the stored sampling frequencies of these observations from zero to one.

generate(newvar) causes a variable containing sampling frequencies to be added to the data instead of replacing the data in memory with the sampled observations. Note that the original sort order of the data will be preserved if generate() is specified. For observations that do not meet the optional if and in criteria (or that are missing on any of the input variables or have zero weight), newvar will be zero if keep is not specified and one if keep is specified.
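The effect of keep on the stored frequencies can be sketched as follows, assuming gsample and moremata are installed. The example uses the auto dataset shipped with Stata, and freq is a hypothetical variable name.

```stata
* Sketch only: assumes gsample and moremata are installed
sysuse auto, clear
set seed 12345

* Sample 20 domestic cars and store the frequencies instead of
* replacing the data. Foreign cars fail the if condition and get 0.
gsample 20 if foreign == 0, wor generate(freq)
tabulate freq foreign

* With keep, cars failing the if condition get frequency 1.
* replace is needed because freq already exists.
gsample 20 if foreign == 0, wor generate(freq) keep replace
tabulate freq foreign
```

The data in memory keep all 74 observations in both calls; only the contents of freq change.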

replace permits gsample to overwrite existing variables.

Examples

For example,

. gsample

draws a bootstrap sample of size _N (simple random sample with replacement, SRSWR). The data in memory will be replaced by the sampled observations. Alternatively

. gsample, wor

draws a simple random sample without replacement (SRSWOR).

Furthermore,

. gsample [aw=size]

draws an unequal probability sample with sampling probabilities proportional to size.
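A PPS draw without replacement could be sketched as below, assuming gsample and moremata are installed. The variable weight from the auto dataset (vehicle weight) stands in for a size measure here purely for illustration, and sel is a hypothetical variable name.

```stata
* Sketch only: assumes gsample and moremata are installed
sysuse auto, clear
set seed 12345

* Select 10 cars without replacement, with selection probabilities
* proportional to vehicle weight, and flag them in sel
gsample 10 [aw=weight], wor generate(sel)
tabulate sel

* Heavier cars are more likely to be drawn, so the mean weight among
* the selected cars will tend to exceed the overall mean
summarize weight if sel == 1
summarize weight
```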

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

You may cite this software as follows:

Jann, B. (2006). gsample: Stata module to draw a random sample. Available from http://ideas.repec.org/c/boc/bocode/s456716.html.

Also see

Online: mm_sample(), sample, bsample, generate, moremata