help mata mm_sample()
-------------------------------------------------------------------------------

Title

mm_sample() -- Draw a random sample

Syntax

        real colvector mm_sample(n, strata [, cluster, w, wor, count, fast])

where

             n:  real colvector containing sample size(s)

        strata:  real matrix containing strata sizes (or the population size) and, in the case of stratified cluster sampling, the number of clusters per stratum

       cluster:  real colvector containing cluster sizes; cluster==. indicates that there are no clusters

             w:  real colvector containing weights for unequal probability sampling; w being scalar causes equal probability sampling to be performed

           wor:  real scalar indicating that sampling be performed without replacement; default is to sample with replacement

         count:  real scalar indicating that a count vector be returned; default is to return a permutation vector

          fast:  real scalar indicating that some internal checks be skipped; do not use this option

        real colvector mm_srswr(n, N [, count])

        real colvector mm_srswor(n, N [, count])

        real colvector mm_upswr(n, w [, count])

        real colvector mm_upswor(n, w [, count, nowarn])

where

             n:  real scalar containing sample size

             N:  real scalar containing population size

             w:  real colvector containing weights/sizes of elements

         count:  real scalar indicating that a count vector be returned; default is to return a permutation vector

        nowarn:  real scalar indicating that repetitions are allowed in UPSWOR

Description

mm_sample() may be used for sampling. Simple random sampling (SRS) is supported, as well as unequal probability sampling (UPS), of which sampling with probabilities proportional to size (PPS) is a special case. Both methods support sampling with replacement and sampling without replacement. Furthermore, stratified sampling and cluster sampling may be performed.

n specifies the desired sample size. n==. indicates that n be equal to the size of the population or, if cluster!=., the number of clusters. If n is scalar and there are several strata, n cases will be sampled from each stratum. Alternatively, specify an individual sample size for each stratum in colvector n.

strata specifies the sizes of the strata to be sampled from. The sizes must be equal to one or larger. In the case of unstratified sampling, strata is a real scalar specifying the population size (i.e. there is only one stratum). Note that strata may be set missing in unstratified sampling if cluster or w is provided. The population size will then be inferred from cluster or w, respectively.

cluster provides the sizes of the clusters within strata. The sizes must be equal to one or larger. If cluster is specified, the drawn sample is a sample of clusters. Note that, for cluster sampling, strata must have a second column containing the number of clusters in each stratum (unless there is only one stratum). cluster==. indicates that there are no clusters (i.e. each population member is its own cluster). Use mm_panels() to generate the required input for mm_sample() from strata and cluster ID variables (see the examples below).

Sampling with probabilities proportional to size or, more generally, unequal probability sampling can be achieved by providing colvector w, where w contains the sizes/weights of the elements in the population or, if cluster is provided, the sizes/weights of the clusters. w being scalar (e.g. w==1 or w==.) indicates that equal probability sampling be applied.

wor!=0 indicates that the sample be drawn without replacement (similar to sample). The default is to sample with replacement (similar to bsample). Note that, when sampling without replacement, n may not be larger than the size of the population/stratum (or the number of clusters within the population/stratum).

The default for mm_sample() is to return a permutation vector representing the sample (see [M-1] permutation). Alternatively, if count!=0 is specified, mm_sample() returns a count vector indicating for each population member the number of times it is in the sample. If sampling is performed without replacement, the counts are restricted to {0, 1}.

mm_srswr(), mm_srswor(), mm_upswr(), and mm_upswor() are the basic sampling functions used by mm_sample(). mm_srswr() and mm_srswor() draw simple random samples (SRS) with and without replacement, respectively. mm_upswr() and mm_upswor() perform unequal probability sampling (UPS) or sampling with probabilities proportional to size (PPS).

If you are serious about sampling, you should first set the random number seed; see help generate or help for [M-5] uniform().
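As a minimal sketch of the advice above, the seed can be fixed from within Mata before drawing. The seed value 12345 is arbitrary, and the sketch assumes that mm_sample() draws its random numbers from the generator that Stata's set seed command controls, so that rerunning the block reproduces the same sample.

```mata
mata:
// Fix the random-number seed (12345 is an arbitrary choice)
stata("set seed 12345")

// Draw a simple random sample with replacement: 10 out of 1000
p = mm_sample(10, 1000)

// Display the sampled positions as a row vector
p'
end
```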

Remarks

Remarks are presented under the headings

        Introduction: Simple Random Sample with Replacement

        Stratified Sampling

        Cluster Sampling

        Stratified Cluster Sampling

        Sampling from Strata and Cluster ID Variables using mm_panels()

        Returning a Count Vector

        Sampling without Replacement

        Unequal Probability Sampling/PPS Sampling

        Methods and Formulas

Introduction: Simple Random Sample with Replacement

The simplest (and fastest) application of mm_sample() is to create a permutation vector representing a simple random sample with replacement (SRSWR). For example, the following command samples 10 out of a population of 1000:

        : mm_sample(10, 1000)
                  1
             +-------+
           1 |  578  |
           2 |  807  |
           3 |   47  |
           4 |    8  |
           5 |  900  |
           6 |  237  |
           7 |  545  |
           8 |   76  |
           9 |  398  |
          10 |  770  |
             +-------+

The numbers in the returned vector represent the positions of the sampled elements in the (hypothetical) list of population members.

Suppose X is a data matrix containing rows(X) observations and cols(X) variables. To create a matrix Xs, which represents a SRSWR containing 100 randomly drawn observations from X, type

        : Xs = X[mm_sample(100,rows(X)),.]

Note that in most applications you would want to save the sample permutation vector for further use. For example:

        : p = mm_sample(100,rows(X))
        : Xs = X[p,.]
        : Ys = Y[p,.]

Stratified Sampling

To generate a stratified SRSWR, provide to mm_sample() a column vector containing the sizes of the strata. Example:

        : mm_sample(5, (300\700))
                  1
             +-------+
           1 |  112  |
           2 |  130  |
           3 |  168  |
           4 |   62  |
           5 |  241  |
           6 |  474  |
           7 |  603  |
           8 |  669  |
           9 |  310  |
          10 |  994  |
             +-------+

From each stratum, five elements were drawn. The first five cases in the returned sample come from the first stratum (1-300), the remaining five cases come from the second stratum (301-1000).

To use different sample sizes in the strata, type, for example,

        : mm_sample((3\7), (300\700))
                  1
             +-------+
           1 |  298  |
           2 |  226  |
           3 |  192  |
           4 |  998  |
           5 |  956  |
           6 |  338  |
           7 |  900  |
           8 |  378  |
           9 |  980  |
          10 |  992  |
             +-------+

Now the first three cases come from the first stratum and the remaining seven come from the second stratum. Note that mm_sample() has no internal mechanism to determine the sample sizes for proportional stratification from a given total sample size. However, it is easy to compute the appropriate sample sizes in advance and then provide them to mm_sample().
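Computing the stratum sample sizes in advance could look like the following sketch. The names N, ntot, and n are illustrative, and simple rounding is only one possible allocation rule: in general the rounded sizes need not add up exactly to the requested total, and a small stratum could be rounded down to zero.

```mata
mata:
// Illustrative inputs: two strata and a desired total sample size
N    = 300 \ 700                  // strata sizes
ntot = 10                         // desired total sample size

// Proportional allocation: each stratum gets its population share of ntot
// (rounding may make sum(n) differ from ntot for other inputs)
n = round(ntot * N / sum(N))
n'

// Draw the stratified sample with the precomputed sizes
p = mm_sample(n, N)
end
```

With these inputs the allocation is 3 cases for the first stratum and 7 for the second, which matches the sample sizes used in the preceding example.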

Cluster Sampling

To generate a sample of clusters, provide to mm_sample() a column vector containing the sizes of the clusters within the population. The sum of cluster sizes must equal the population size (unless the population size is missing, in which case the sum of cluster sizes defines the population size). The sample size n is interpreted as the number of clusters to be sampled in this case.

For example, the following command randomly picks one of three clusters, where the first cluster has 3 members, the second cluster has 2 members, and the third cluster has 5 members (making a population total of 10). Note that, regardless of its size, each cluster has the same sampling probability (see below for sampling with probabilities proportional to size).

        : mm_sample(1, ., (3\2\5))
                1
            +-----+
          1 |  4  |
          2 |  5  |
            +-----+

The result indicates that the second cluster was drawn (containing the 4th and 5th member of the population).

Stratified Cluster Sampling

Generating a stratified sample of clusters requires:

o A matrix containing the sizes of the strata and the number of clusters within each stratum. For example,

        : strata = (5, 2) \ (10, 3)
        : strata
                 1    2
            +-----------+
          1 |   5    2  |
          2 |  10    3  |
            +-----------+

defines two strata, where the first stratum contains 2 clusters with a total of 5 members and the second stratum contains 3 clusters with a total of 10 members.

o A column vector containing the sizes of the clusters.

In the following example, one cluster is sampled from each stratum:

        : strata = (5, 2) \ (10, 3)
        : cluster = 3 \ 2 \ 2 \ 5 \ 3
        : mm_sample(1, strata, cluster)
                 1
            +------+
          1 |   4  |
          2 |   5  |
          3 |   8  |
          4 |   9  |
          5 |  10  |
          6 |  11  |
          7 |  12  |
            +------+

In both strata the second cluster was drawn.

Sampling from Strata and Cluster ID Variables using mm_panels()

When resampling real data, information on strata and clusters is usually present in the form of ID variables. The mm_panels() function, which is also part of the moremata package, can be used in this case to generate the appropriate strata and cluster input for mm_sample().

Suppose you want to resample stratified and clustered data. First, sort the data by stratum and cluster ID. For example, in Stata type

        . sort strata cluster

where strata is the strata ID variable and cluster is the cluster ID variable. After that, in Mata type something like

        : st_view(strata=., ., "strata")
        : st_view(cluster=., ., "cluster")
        : mm_panels(strata, Sinfo=., cluster, Cinfo=.)
        : p = mm_sample(n, Sinfo, Cinfo)
        : ...

Alternatively, if the data are stratified only, type

        . sort strata

and then

        : st_view(strata=., ., "strata")
        : mm_panels(strata, Sinfo=.)
        : p = mm_sample(n, Sinfo)
        : ...

or, if the data are clustered only,

        . sort cluster

and then

        : st_view(cluster=., ., "cluster")
        : mm_panels(cluster, Cinfo=.)
        : p = mm_sample(n, ., Cinfo)
        : ...

The following example further illustrates the usage of mm_panels():

        : strata, clusters
                 1   2
             +---------+
           1 |  1   1  |
           2 |  1   1  |
           3 |  1   2  |
           4 |  1   3  |
           5 |  1   3  |
           6 |  1   3  |
           7 |  1   3  |
           8 |  1   4  |
           9 |  2   1  |
          10 |  2   2  |
          11 |  2   2  |
          12 |  2   2  |
          13 |  2   3  |
          14 |  2   3  |
             +---------+

        : mm_panels(strata, Sinfo=., clusters, Cinfo=.)
        : Sinfo
                1   2
            +---------+
          1 |  8   4  |
          2 |  6   3  |
            +---------+

        : Cinfo
                1
            +-----+
          1 |  2  |
          2 |  1  |
          3 |  4  |
          4 |  1  |
          5 |  1  |
          6 |  3  |
          7 |  2  |
            +-----+

        : mm_sample(1,Sinfo,Cinfo)
                 1
            +------+
          1 |   1  |
          2 |   2  |
          3 |  10  |
          4 |  11  |
          5 |  12  |
            +------+

Returning a Count Vector

mm_sample() can return its results in two different formats. The default is to return a permutation vector containing the positions of the drawn elements in the population list. See the examples above. Alternatively, if count!=0 is specified, a count vector is returned. A count vector contains for each member of the population the number of times it has been drawn into the sample. The following example shows the count vector of a sample of 5 out of a population of 10 (with replacement):

        : mm_sample(5,10,.,.,0,1)
                 1
             +-----+
           1 |  0  |
           2 |  0  |
           3 |  0  |
           4 |  0  |
           5 |  0  |
           6 |  0  |
           7 |  1  |
           8 |  0  |
           9 |  2  |
          10 |  2  |
             +-----+
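The relation between the two return formats can be illustrated with a small sketch. The data matrix X is made up for the illustration, and the count vector c is tabulated by hand from a permutation vector rather than requested from mm_sample(), so that both formats describe the same draw. Used as weights in Mata's mean(), the counts should give the same column means as selecting the sampled rows.

```mata
mata:
// Made-up data: 50 observations on 2 variables
X = (1::50), (1::50):^2
N = rows(X)

// Permutation vector of a bootstrap-style sample (SRSWR of size N)
p = mm_sample(N, N)

// Tabulate the permutation vector into a count vector by hand
c = J(N, 1, 0)
for (i = 1; i <= rows(p); i++) c[p[i]] = c[p[i]] + 1

// Column means of the sampled rows ...
mean(X[p, .])

// ... and the same means obtained by using the counts as weights
mean(X, c)
end
```

Working with counts avoids building the resampled matrix X[p,.], which may be convenient when the statistic of interest accepts weights.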

Sampling without Replacement

The following examples illustrate the difference between sampling with replacement and sampling without replacement. When sampling with replacement, an individual element may be sampled multiple times:

        : mm_sample(5,5,.,.,0,1)
                1
            +-----+
          1 |  3  |
          2 |  1  |
          3 |  1  |
          4 |  0  |
          5 |  0  |
            +-----+

However, when sampling without replacement, each element may appear at most once in the sample:

        : mm_sample(5,5,.,.,1,1)
                1
            +-----+
          1 |  1  |
          2 |  1  |
          3 |  1  |
          4 |  1  |
          5 |  1  |
            +-----+

Note that, naturally, the sample size n may not exceed the population size when sampling without replacement. (In the case of cluster sampling, n may not exceed the number of clusters.)

Unequal Probability Sampling/PPS Sampling

For sampling with probabilities proportional to size (PPS) or, more generally, unequal probability sampling (UPS), you have to specify a column vector containing the sizes or weights. In the following example an n = 15000 "sample" is drawn out of a population containing 5 members. The population members are sampled with probabilities proportional to size, where the first member has weight 1, the second has weight 2, etc.

        : mm_sample(15000, 5, ., (1::5),0,1)
                  1
            +--------+
          1 |  1068  |
          2 |  2076  |
          3 |  2909  |
          4 |  3969  |
          5 |  4978  |
            +--------+

We see that, in line with the given weights, the first member has been sampled roughly 1000 times, the second has been sampled around 2000 times, etc.

Unequal probability sampling is also possible without replacement. However, note that in the without replacement case a problem exists if there are population members for which w(i) * n / sum(w) > 1. Consider the following example:

        : mm_sample(4, 5, ., (1::5),1,1)
                     mm_upswor():  3300  2 cases have w_i*n/sum(w)>1
                     mm_sample():     -  function returned error
                         <istmt>:     -  function returned error

What happened? Population member no. 5 has size 5 and the sum of sizes over all members is 15. That is, the population share of member no. 5 is 5/15 = 33.3%. However, even if member no. 5 is selected with certainty into the sample, i.e. if member no. 5 is sampled with probability 1, it can only reach a maximum sample share of 1/4 = 25%. (A similar problem exists with member no. 4 whose population share is 4/15 = 26.7%.) Apparently, unbiased PPS sampling without replacement is not possible in this situation.
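The condition can be checked before sampling. The following sketch recomputes w(i) * n / sum(w) for the example above; the names incl and nmax are illustrative and not part of the package.

```mata
mata:
// Sizes and sample size from the example above
w = 1::5
n = 4

// w(i) * n / sum(w) for each population member
incl = n * w / sum(w)
incl'

// Number of members for which the quantity exceeds 1 (two in this example)
sum(incl :> 1)

// Largest sample size for which no member exceeds 1
// (n must, in addition, not exceed the population size)
nmax = floor(sum(w) / max(w))
nmax
end
```

For these sizes nmax is 3, because member no. 5 then has w(i) * n / sum(w) = 5*3/15 = 1.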

Methods and Formulas

Simple random sampling with replacement (SRSWR) is implemented as ceil(uniform(n,1) * N) where n is the sample size and N is the population size.

Simple random sampling without replacement (SRSWOR) is implemented as unorder(N)[|1 \ n|].

Unequal probability sampling with replacement (UPSWR) is implemented using the standard "cumulative" approach (see, e.g., Levy and Lemeshow 1999:354 or Cochran 1977:250; important theoretical results have been provided by Hansen and Hurwitz 1943).

Unequal probability sampling without replacement (UPSWOR) is implemented using the random systematic sampling technique discussed in, e.g., Hartley and Rao (1962). Note that many other UPSWOR algorithms can be found in the literature (see the review in Brewer and Hanif 1983; the algorithm implemented here conforms to their "Procedure 2"). An interesting recent approach has been developed by Tillé (1996; also see Ernst 2003).
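To make the "cumulative" approach for UPSWR concrete, here is a minimal sketch. It is not the source code of mm_upswr(): the function name upswr_sketch() is hypothetical, and the sketch uses runiform(), the random-number function documented in newer Stata releases, where the text above names uniform(). Each draw picks the first element whose cumulative weight exceeds a uniform random number scaled to the total weight.

```mata
mata:
// Hypothetical helper: n draws with replacement, probabilities proportional to w
real colvector upswr_sketch(real scalar n, real colvector w)
{
    real scalar    i, j
    real colvector cw, u, p

    // Cumulative weights; cw[rows(w)] is the total weight
    cw = w
    for (j = 2; j <= rows(w); j++) cw[j] = cw[j-1] + w[j]

    // Uniform draws scaled to [0, total weight)
    u = runiform(n, 1) * cw[rows(w)]

    // For each draw, select the first element with cw[j] > u[i]
    p = J(n, 1, .)
    for (i = 1; i <= n; i++) {
        j = 1
        while (j < rows(w) & u[i] >= cw[j]) j++
        p[i] = j
    }
    return(p)
}

// Usage: 15000 draws from 5 elements with weights 1, ..., 5
p = upswr_sketch(15000, 1::5)
end
```

The linear search keeps the sketch short; for large populations a search over sorted draws would be more efficient.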

Conformability

    mm_sample(n, strata, cluster, w, wor, count, fast)
             n:  1 x 1 or k x 1, where k>0 is the number of strata
        strata:  k x 1 (if cluster!=.: k x 2)
       cluster:  l x 1, where l>0 is the number of clusters; alternatively, cluster==.
             w:  1 x 1 or N x 1 (if cluster!=.: l x 1)
           wor:  1 x 1
         count:  1 x 1
          fast:  1 x 1
        result:  ntot x 1, where ntot is the final sample size, or, if count!=0, N x 1, where N is the population size

    mm_srswr(n, N, count)
             n:  1 x 1
             N:  1 x 1
         count:  1 x 1
        result:  n x 1 or, if count!=0, N x 1

    mm_srswor(n, N, count)
             n:  1 x 1
             N:  1 x 1
         count:  1 x 1
        result:  n x 1 or, if count!=0, N x 1

    mm_upswr(n, w, count)
             n:  1 x 1
             w:  N x 1, where N is the population size
         count:  1 x 1
        result:  n x 1 or, if count!=0, N x 1

    mm_upswor(n, w, count, nowarn)
             n:  1 x 1
             w:  N x 1, where N is the population size
         count:  1 x 1
        nowarn:  1 x 1
        result:  n x 1 or, if count!=0, N x 1

Diagnostics

mm_upswr() and mm_upswor() produce erroneous results if w contains negative or missing values or if sum(w)==0.

Source code

mm_sample.mata, mm_srswr.mata, mm_srswor.mata, mm_upswr.mata, mm_upswor.mata

References

Brewer, K. R. W., Muhammad Hanif (1983). Sampling with Unequal Probabilities. New York: Springer.

Cochran, William G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.

Ernst, Lawrence (2003). Sample Expansion for Probability Proportional to Size without Replacement Sampling. Proceedings of the Section on Survey Research Methods, 2003, American Statistical Association: http://www.bls.gov/ore/pdf/st030100.pdf.

Hansen, Morris H., William N. Hurwitz (1943). On the Theory of Sampling from Finite Populations. The Annals of Mathematical Statistics 14: 333-362.

Hartley, H. O., J. N. K. Rao (1962). Sampling with Unequal Probabilities and without Replacement. The Annals of Mathematical Statistics 33: 350-374.

Levy, Paul S., Stanley Lemeshow (1999). Sampling of Populations. Methods and Applications, 3rd ed. New York: Wiley.

Tillé, Yves (1996). An Elimination Procedure for Unequal Probability Sampling without Replacement. Biometrika 83: 238-241.

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Also see

Online: help for mm_panels(), sample, bsample, [M-5] uniform(), [M-4] utility, moremata