Title
mm_sample() -- Draw a random sample
Syntax
real colvector mm_sample(n, strata [, cluster, w, wor, count, fast])
where
n: real colvector containing sample size(s)
strata: real matrix containing strata sizes (or the population size) and, in the case of stratified cluster sampling, the number of clusters per stratum
cluster: real colvector containing cluster sizes; cluster==. indicates that there are no clusters
w: real colvector containing weights for unequal probability sampling; w being scalar causes equal probability sampling to be performed
wor: real scalar indicating that sampling be performed without replacement; default is to sample with replacement
count: real scalar indicating that a count vector be returned; default is to return a permutation vector
fast: real scalar indicating that some internal checks be skipped; do not use this option
real colvector mm_srswr(n, N [, count])
real colvector mm_srswor(n, N [, count])
real colvector mm_upswr(n, w [, count])
real colvector mm_upswor(n, w [, count, nowarn])
where
n: real scalar containing sample size
N: real scalar containing population size
w: real colvector containing weights/sizes of elements
count: real scalar indicating that a count vector be returned; default is to return a permutation vector
nowarn: real scalar indicating that repetitions are allowed in UPSWOR
Description
mm_sample() may be used for sampling. Simple random sampling (SRS) is supported, as well as unequal probability sampling (UPS), of which sampling with probabilities proportional to size (PPS) is a special case. Both methods support sampling with replacement and sampling without replacement. Furthermore, stratified sampling and cluster sampling may be performed.
n specifies the desired sample size. n==. indicates that n be equal to the size of the population or, if cluster!=., the number of clusters. If n is scalar and there are several strata, n cases will be sampled from each stratum. Alternatively, specify an individual sample size for each stratum in colvector n.
strata specifies the sizes of the strata to be sampled from. The sizes must be equal to one or larger. In the case of unstratified sampling, strata is a real scalar specifying the population size (i.e. there is only one stratum). Note that strata may be set missing in unstratified sampling if cluster or w is provided. The population size will then be inferred from cluster or w, respectively.
cluster provides the sizes of the clusters within strata. The sizes must be equal to one or larger. If cluster is specified, the drawn sample is a sample of clusters. Note that, for cluster sampling, strata must have a second column containing the number of clusters in each stratum (unless there is only one stratum). cluster==. indicates that there are no clusters (i.e. each population member is its own cluster). Use mm_panels() to generate the required input for mm_sample() from strata and cluster ID variables (see the examples below).
Sampling with probabilities proportional to size or, more generally, unequal probability sampling can be achieved by providing colvector w, where w contains the sizes/weights of the elements in the population or, if cluster is provided, the sizes/weights of the clusters. w being scalar (e.g. w==1 or w==.) indicates that equal probability sampling be applied.
wor!=0 indicates that the sample be drawn without replacement (similar to sample). The default is to sample with replacement (similar to bsample). Note that, when sampling without replacement, n may not be larger than the size of the population/stratum (or the number of clusters within the population/stratum).
The default for mm_sample() is to return a permutation vector representing the sample (see [M-1] permutation). Alternatively, if count!=0 is specified, mm_sample() returns a count vector indicating for each population member the number of times it is in the sample. If sampling is performed without replacement, the counts are restricted to {0, 1}.
mm_srswr(), mm_srswor(), mm_upswr(), and mm_upswor() are the basic sampling functions used by mm_sample(). mm_srswr() and mm_srswor() draw simple random samples (SRS) with and without replacement, respectively. mm_upswr() and mm_upswor() perform unequal probability sampling (UPS) or sampling with probabilities proportional to size (PPS) with and without replacement, respectively.
If you are serious about sampling, you should first set the random number seed; see help generate or help for [M-5] uniform().
Remarks
Remarks are presented under the headings
Introduction: Simple Random Sample with Replacement
Stratified Sampling
Cluster Sampling
Stratified Cluster Sampling
Sampling from Strata and Cluster ID Variables using mm_panels()
Returning a Count Vector
Sampling without Replacement
Unequal Probability Sampling/PPS Sampling
Methods and Formulas
Introduction: Simple Random Sample with Replacement
The simplest (and fastest) application of mm_sample() is to create a permutation vector representing a simple random sample with replacement (SRSWR). For example, the following command samples 10 out of a population of 1000:
: mm_sample(10, 1000)
          1
     +-------+
   1 |  578  |
   2 |  807  |
   3 |   47  |
   4 |    8  |
   5 |  900  |
   6 |  237  |
   7 |  545  |
   8 |   76  |
   9 |  398  |
  10 |  770  |
     +-------+
The numbers in the returned vector represent the positions of the sampled elements in the (hypothetical) list of population members.
Suppose X is a data matrix containing rows(X) observations and cols(X) variables. To create a matrix Xs, which represents a SRSWR containing 100 randomly drawn observations from X, type
: Xs = X[mm_sample(100,rows(X)),.]
Note that in most applications you would want to save the sample permutation vector for further use. For example:
: p = mm_sample(100,rows(X))
: Xs = X[p,.]
: Ys = Y[p,.]
Stratified Sampling
To generate a stratified SRSWR, provide to mm_sample() a column vector containing the sizes of the strata. Example:
: mm_sample(5, (300\700))
          1
     +-------+
   1 |  112  |
   2 |  130  |
   3 |  168  |
   4 |   62  |
   5 |  241  |
   6 |  474  |
   7 |  603  |
   8 |  669  |
   9 |  310  |
  10 |  994  |
     +-------+
From each stratum, five elements were drawn. The first five cases in the returned sample come from the first stratum (1-300), the remaining five cases come from the second stratum (301-1000).
To use different sample sizes in the strata, type, for example,
: mm_sample((3\7), (300\700))
          1
     +-------+
   1 |  298  |
   2 |  226  |
   3 |  192  |
   4 |  998  |
   5 |  956  |
   6 |  338  |
   7 |  900  |
   8 |  378  |
   9 |  980  |
  10 |  992  |
     +-------+
Now the first three cases come from the first stratum and the remaining seven come from the second stratum. Note that mm_sample() has no internal mechanism to determine the sample sizes for proportional stratification from a given total sample size. However, it is easy to compute the appropriate sample sizes in advance and then provide them to mm_sample().
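Proportional allocation can be computed in a line or two before calling mm_sample(). The following is a minimal sketch, assuming a desired total sample size of 10; the names N, ntot, and n are introduced here for illustration only.

```mata
mata:
N    = (300 \ 700)                    // strata sizes from the example above
ntot = 10                             // desired total sample size (assumed)
n    = round(ntot :* N :/ sum(N))     // proportional allocation: 3 \ 7
p    = mm_sample(n, N)                // stratified SRSWR with these sample sizes
end
```

In general the rounded sizes need not add up exactly to ntot, so the allocation may have to be adjusted by hand.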
Cluster Sampling
To generate a sample of clusters, provide to mm_sample() a column vector containing the sizes of the clusters within the population. The sum of cluster sizes must equal the population size (unless the population size is missing, in which case the sum of cluster sizes defines the population size). The sample size n is interpreted as the number of clusters to be sampled in this case.
For example, the following command randomly picks one of three clusters, where the first cluster has 3 members, the second cluster has 2 members, and the third cluster has 5 members (making a population total of 10). Note that, regardless of its size, each cluster has the same sampling probability (see below for sampling with probabilities proportional to size).
: mm_sample(1, ., (3\2\5))
       1
    +-----+
  1 |  4  |
  2 |  5  |
    +-----+
The result indicates that the second cluster was drawn (containing the 4th and 5th member of the population).
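To see which population positions belong to which cluster, the cluster sizes can be cumulated. This is a small illustrative sketch, not part of mm_sample(); the names csize, first_pos, and last_pos are assumptions made here.

```mata
mata:
csize     = (3 \ 2 \ 5)               // cluster sizes from the example above
last_pos  = runningsum(csize)         // position of each cluster's last member
first_pos = last_pos - csize :+ 1     // position of each cluster's first member
(first_pos, last_pos)                 // rows: (1,3), (4,5), (6,10)
end
```

The second row, (4,5), matches the sampled positions 4 and 5 shown above.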
Stratified Cluster Sampling
Generating a stratified sample of clusters requires:
o A matrix containing the sizes of the strata and the number of clusters within each stratum. For example,
: strata = (5, 2) \ (10, 3)
: strata
        1    2
    +-----------+
  1 |   5    2  |
  2 |  10    3  |
    +-----------+
defines two strata, where the first stratum contains 2 clusters with a total of 5 members and the second stratum contains 3 clusters with a total of 10 members.
o A column vector containing the sizes of the clusters.
In the following example, one cluster is sampled from each stratum:
: strata = (5, 2) \ (10, 3)
: cluster = 3 \ 2 \ 2 \ 5 \ 3
: mm_sample(1, strata, cluster)
        1
    +------+
  1 |   4  |
  2 |   5  |
  3 |   8  |
  4 |   9  |
  5 |  10  |
  6 |  11  |
  7 |  12  |
    +------+
In both strata the second cluster was drawn.
Sampling from Strata and Cluster ID Variables using mm_panels()
When resampling real data, information on strata and clusters is usually present in the form of ID variables. The mm_panels() function, which is also part of the moremata package, can be used in this case to generate the appropriate strata and cluster input for mm_sample().
Suppose you want to resample stratified and clustered data. First, sort the data by stratum and cluster ID. For example, in Stata type
. sort strata cluster
where strata is the strata ID variable and cluster is the cluster ID variable. After that, in Mata type something like
: st_view(strata=., ., "strata")
: st_view(cluster=., ., "cluster")
: mm_panels(strata, Sinfo=., cluster, Cinfo=.)
: p = mm_sample(n, Sinfo, Cinfo)
: ...
Alternatively, if the data are stratified only, type
. sort strata
and then
: st_view(strata=., ., "strata")
: mm_panels(strata, Sinfo=.)
: p = mm_sample(n, Sinfo)
: ...
or, if the data are clustered only,
. sort cluster
and then
: st_view(cluster=., ., "cluster")
: mm_panels(cluster, Cinfo=.)
: p = mm_sample(n, ., Cinfo)
: ...
The following example further illustrates the usage of mm_panels():
: strata,clusters
        1   2
     +---------+
   1 |  1   1  |
   2 |  1   1  |
   3 |  1   2  |
   4 |  1   3  |
   5 |  1   3  |
   6 |  1   3  |
   7 |  1   3  |
   8 |  1   4  |
   9 |  2   1  |
  10 |  2   2  |
  11 |  2   2  |
  12 |  2   2  |
  13 |  2   3  |
  14 |  2   3  |
     +---------+
: mm_panels(strata, Sinfo=., clusters, Cinfo=.)
: Sinfo
       1   2
    +---------+
  1 |  8   4  |
  2 |  6   3  |
    +---------+
: Cinfo
       1
    +-----+
  1 |  2  |
  2 |  1  |
  3 |  4  |
  4 |  1  |
  5 |  1  |
  6 |  3  |
  7 |  2  |
    +-----+
: mm_sample(1,Sinfo,Cinfo)
        1
    +------+
  1 |   1  |
  2 |   2  |
  3 |  10  |
  4 |  11  |
  5 |  12  |
    +------+
Returning a Count Vector
mm_sample() can return its results in two different formats. The default is to return a permutation vector containing the positions of the drawn elements in the population list. See the examples above. Alternatively, if count!=0 is specified, a count vector is returned. A count vector contains for each member of the population the number of times it has been drawn into the sample. The following example shows the count vector of a sample of 5 out of a population of 10 (with replacement):
: mm_sample(5,10,.,.,0,1)
        1
     +-----+
   1 |  0  |
   2 |  0  |
   3 |  0  |
   4 |  0  |
   5 |  0  |
   6 |  0  |
   7 |  1  |
   8 |  0  |
   9 |  2  |
  10 |  2  |
     +-----+
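The two formats carry the same information about which members were drawn and how often. As a rough sketch of the relationship (the loop below is illustrative and not how mm_sample() computes its result), a count vector can be tallied from a permutation vector as follows; N, p, and count are names assumed here.

```mata
mata:
N     = 10                            // population size
p     = mm_sample(5, N)               // permutation vector (the default format)
count = J(N, 1, 0)                    // one counter per population member
for (i=1; i<=rows(p); i++) count[p[i]] = count[p[i]] + 1
sum(count)                            // equals the sample size, 5
end
```

The count vector has one row per population member, whereas the permutation vector has one row per draw.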
Sampling without Replacement
The following examples illustrate the difference between sampling with replacement and sampling without replacement. When sampling with replacement, an individual element may be sampled multiple times:
: mm_sample(5,5,.,.,0,1)
       1
    +-----+
  1 |  3  |
  2 |  1  |
  3 |  1  |
  4 |  0  |
  5 |  0  |
    +-----+
However, when sampling without replacement, each element may appear at most once in the sample:
: mm_sample(5,5,.,.,1,1)
       1
    +-----+
  1 |  1  |
  2 |  1  |
  3 |  1  |
  4 |  1  |
  5 |  1  |
    +-----+
Note that, naturally, the sample size n may not exceed the population size when sampling without replacement. (In the case of cluster sampling, n may not exceed the number of clusters.)
Unequal Probability Sampling/PPS Sampling
For sampling with probabilities proportional to size (PPS) or, more generally, unequal probability sampling (UPS), you have to specify a column vector containing the sizes or weights. In the following example an n = 15000 "sample" is drawn out of a population containing 5 members. The population members are sampled with probabilities proportional to size, where the first member has weight 1, the second has weight 2, etc.
: mm_sample(15000, 5, ., (1::5),0,1)
          1
    +--------+
  1 |  1068  |
  2 |  2076  |
  3 |  2909  |
  4 |  3969  |
  5 |  4978  |
    +--------+
We see that, according to the given weights, the first member has been sampled roughly 1000 times, the second has been sampled around 2000 times, etc.
Unequal probability sampling is also possible without replacement. However, note that in the without replacement case a problem exists if there are population members for which w(i) * n / sum(w) > 1. Consider the following example:
: mm_sample(4, 5, ., (1::5),1,1)
           mm_upswor():  3300  2 cases have w_i*n/sum(w)>1
           mm_sample():     -  function returned error
             <istmt>:     -  function returned error
What happened? Population member no. 5 has size 5 and the sum of sizes over all members is 15. That is, the population share of member no. 5 is 5/15 = 33.3%. However, even if member no. 5 is selected with certainty into the sample, i.e. if member no. 5 is sampled with probability 1, it can only reach a maximum sample share of 1/4 = 25%. (A similar problem exists with member no. 4 whose population share is 4/15 = 26.7%.) Apparently, unbiased PPS sampling without replacement is not possible in this situation.
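The condition can be checked before sampling. The following sketch reproduces the arithmetic for the example above; the names w, n, and share are introduced here for illustration.

```mata
mata:
w     = (1::5)                        // sizes from the example above
n     = 4                             // requested sample size
share = w :* n :/ sum(w)              // w_i*n/sum(w) for each member
sum(share :> 1)                       // number of offending members; here 2
end
```

Members 4 and 5 have values of 16/15 and 20/15, respectively, which is why two cases are reported.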
Methods and Formulas
Simple random sampling with replacement (SRSWR) is implemented as ceil(uniform(n,1) * N) where n is the sample size and N is the population size.
Simple random sampling without replacement (SRSWOR) is implemented as unorder(N)[|1 \ n|].
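The two expressions can be tried directly. The sketch below is an approximation under stated assumptions: it uses runiform() and jumble() as stand-ins for uniform() and unorder(), and the values of n and N are arbitrary.

```mata
mata:
n     = 5                             // sample size (arbitrary)
N     = 20                            // population size (arbitrary)
p_wr  = ceil(runiform(n, 1) :* N)     // SRSWR: positions in 1..N, repeats possible
p_wor = jumble(1::N)[|1 \ n|]         // SRSWOR: first n rows of a random permutation
end
```

Because p_wor is cut from a permutation of 1..N, it cannot contain the same position twice.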
Unequal probability sampling with replacement (UPSWR) is implemented using the standard "cumulative" approach (see, e.g., Levy and Lemeshow 1999:354 or Cochran 1977:250; important theoretical results have been provided by Hansen and Hurwitz 1943).
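The idea behind the cumulative approach can be sketched as follows. This is an illustrative version written for this help text, not the source of mm_upswr(); the function name upswr_sketch() and its internals are assumptions.

```mata
mata:
real colvector upswr_sketch(real scalar n, real colvector w)
{
    real colvector cum, u, p
    real scalar    i, j

    cum = runningsum(w) :/ sum(w)     // cumulated selection probabilities
    cum[rows(w)] = 1                  // guard against rounding error in the last entry
    u   = runiform(n, 1)              // one uniform draw per sampled case
    p   = J(n, 1, .)
    for (i=1; i<=n; i++) {
        j = 1
        while (u[i] > cum[j]) j++     // first member whose cumulated share reaches u[i]
        p[i] = j
    }
    return(p)
}

p = upswr_sketch(1000, (0 \ 1 \ 0 \ 3))   // only members 2 and 4 can be drawn
end
```

Members with weight 0 are never selected because their cumulated share does not exceed that of the preceding member.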
Unequal probability sampling without replacement (UPSWOR) is implemented using the random systematic sampling technique discussed in, e.g., Hartley and Rao (1962). Note that many other UPSWOR algorithms can be found in the literature (see the review in Brewer and Hanif 1983; the algorithm implemented here conforms to their "Procedure 2"). An interesting recent approach has been developed by Tillé (1996; also see Ernst 2003).
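A rough sketch of random systematic sampling is given below: the population is put in random order, the inclusion probabilities w_i*n/sum(w) are cumulated, and members are picked where the cumulated total passes u, u+1, ..., u+n-1 for a single random start u. This is an illustration under the assumption that all w_i*n/sum(w) <= 1; it is not the source of mm_upswor(), and the name upswor_sketch() is hypothetical.

```mata
mata:
real colvector upswor_sketch(real scalar n, real colvector w)
{
    real colvector ord, cum, p
    real scalar    u, i, k

    ord = jumble(1::rows(w))                      // random order of the population
    cum = runningsum(w[ord] :* n :/ sum(w))       // cumulated inclusion probabilities
    cum[rows(w)] = n                              // guard against rounding error
    u   = runiform(1, 1)                          // random start in (0,1)
    p   = J(n, 1, .)
    k   = 1
    for (i=1; i<=rows(w) & k<=n; i++) {
        if (cum[i] >= u + k - 1) {                // cumulated total passes the k-th point
            p[k] = ord[i]
            k++
        }
    }
    return(p)
}

p = upswor_sketch(2, (1::5))                      // 2 out of 5, sizes 1 to 5
end
```

Since consecutive selection points are one unit apart and no member covers more than one unit, no member can be selected twice.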
Conformability
mm_sample(n, strata, cluster, w, wor, count, fast)
          n:  1 x 1 or k x 1, where k>0 is the number of strata
     strata:  k x 1 (if cluster!=.: k x 2)
    cluster:  l x 1, where l>0 is the number of clusters; alternatively, cluster==.
          w:  1 x 1 or N x 1 (if cluster!=.: l x 1)
        wor:  1 x 1
      count:  1 x 1
       fast:  1 x 1
     result:  ntot x 1, where ntot is the final sample size, or, if count!=0, N x 1, where N is the population size
mm_srswr(n, N, count)
          n:  1 x 1
          N:  1 x 1
      count:  1 x 1
     result:  n x 1 or, if count!=0, N x 1
mm_srswor(n, N, count)
          n:  1 x 1
          N:  1 x 1
      count:  1 x 1
     result:  n x 1 or, if count!=0, N x 1
mm_upswr(n, w, count)
          n:  1 x 1
          w:  N x 1, where N is the population size
      count:  1 x 1
     result:  n x 1 or, if count!=0, N x 1
mm_upswor(n, w, count, nowarn)
          n:  1 x 1
          w:  N x 1, where N is the population size
      count:  1 x 1
     nowarn:  1 x 1
     result:  n x 1 or, if count!=0, N x 1
Diagnostics
mm_upswr() and mm_upswor() produce erroneous results if w contains negative or missing values or if sum(w)==0.
Source code
mm_sample.mata, mm_srswr.mata, mm_srswor.mata, mm_upswr.mata, mm_upswor.mata
References
Brewer, K. R. W., Muhammad Hanif (1983). Sampling with Unequal Probabilities. New York: Springer.
Cochran, William G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.
Ernst, Lawrence (2003). Sample Expansion for Probability Proportional to Size without Replacement Sampling. Proceedings of the Section on Survey Research Methods, 2003, American Statistical Association: http://www.bls.gov/ore/pdf/st030100.pdf.
Hansen, Morris H., William N. Hurwitz (1943). On the Theory of Sampling from Finite Populations. The Annals of Mathematical Statistics 14: 333-362.
Hartley, H. O., J. N. K. Rao (1962). Sampling with Unequal Probabilities and without Replacement. The Annals of Mathematical Statistics 33: 350-374.
Levy, Paul S., Stanley Lemeshow (1999). Sampling of Populations. Methods and Applications, 3rd ed. New York: Wiley.
Tillé, Yves (1996). An Elimination Procedure for Unequal Probability Sampling without Replacement. Biometrika 83: 238-241.
Author
Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch
Also see
Online: help for mm_panels(), sample, bsample, [M-5] uniform(), [M-4] utility, moremata