help mata mm_sample()
-------------------------------------------------------------------------------

Title

    mm_sample() -- Draw a random sample


Syntax

        real colvector mm_sample(n, strata [, cluster, w, wor, count, fast])

    where

              n:  real colvector containing sample size(s)

         strata:  real matrix containing strata sizes (or the population
                  size) and, in the case of stratified cluster sampling, the
                  number of clusters per stratum

        cluster:  real colvector containing cluster sizes; cluster==.
                  indicates that there are no clusters

              w:  real colvector containing weights for unequal probability
                  sampling; w being scalar causes equal probability sampling
                  to be performed

            wor:  real scalar indicating that sampling be performed without
                  replacement; default is to sample with replacement

          count:  real scalar indicating that a count vector be returned;
                  default is to return a permutation vector

           fast:  real scalar indicating that some internal checks be
                  skipped; do not use this option


        real colvector mm_srswr(n, N [, count])

        real colvector mm_srswor(n, N [, count])

        real colvector mm_upswr(n, w [, count])

        real colvector mm_upswor(n, w [, count, nowarn])

    where

              n:  real scalar containing sample size

              N:  real scalar containing population size

              w:  real colvector containing weights/sizes of elements

          count:  real scalar indicating that a count vector be returned;
                  default is to return a permutation vector

         nowarn:  real scalar indicating that repetitions are allowed in
                  UPSWOR


Description

    mm_sample() may be used for sampling. Simple random sampling (SRS) is
    supported, as well as unequal probability sampling (UPS), of which
    sampling with probabilities proportional to size (PPS) is a special case.
    Both methods support sampling with replacement and sampling without
    replacement. Furthermore, stratified sampling and cluster sampling may be
    performed.

    n specifies the desired sample size. n==. indicates that n be equal to
    the size of the population or, if cluster!=., the number of clusters. If
    n is scalar and there are several strata, n cases will be sampled from
    each stratum. Alternatively, specify an individual sample size for each
    stratum in colvector n.

    strata specifies the sizes of the strata to be sampled from. The sizes
    must be equal to one or larger. In the case of unstratified sampling,
    strata is a real scalar specifying the population size (i.e. there is
    only one stratum). Note that strata may be set missing in unstratified
    sampling if cluster or w is provided. The population size will then be
    inferred from cluster or w, respectively.

    cluster provides the sizes of the clusters within strata. The sizes must
    be equal to one or larger. If cluster is specified, the drawn sample is a
    sample of clusters. Note that, for cluster sampling, strata must have a
    second column containing the number of clusters in each stratum (unless
    there is only one stratum).  cluster==. indicates that there are no
    clusters (i.e. each population member is its own cluster). Use
    mm_panels() to generate the required input for mm_sample() from strata
    and cluster ID variables (see the examples below).

    Sampling with probabilities proportional to size or, more generally,
    unequal probability sampling can be achieved by providing colvector w,
    where w contains the sizes/weights of the elements in the population or,
    if cluster is provided, the sizes/weights of the clusters. w being scalar
    (e.g. w==1 or w==.) indicates that equal probability sampling be applied.

    wor!=0 indicates that the sample be drawn without replacement (similar to
    sample). The default is to sample with replacement (similar to bsample).
    Note that, when sampling without replacement, n may not be larger than
    the size of the population/stratum (or the number of clusters within the
    population/stratum).

    The default for mm_sample() is to return a permutation vector
    representing the sample (see [M-1] permutation). Alternatively, if
    count!=0 is specified, mm_sample() returns a count vector indicating for
    each population member the number of times it is in the sample. If
    sampling is performed without replacement, the counts are restricted to
    {0, 1}.

    mm_srswr(), mm_srswor(), mm_upswr(), and mm_upswor() are the basic
    sampling functions used by mm_sample(). mm_srswr() and mm_srswor() draw
    simple random samples (SRS) with and without replacement, respectively.
    mm_upswr() and mm_upswor() perform unequal probability sampling (UPS) or
    sampling with probabilities proportional to size (PPS).
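
    As a minimal sketch of direct use (the argument values are arbitrary
    illustrations), the basic functions can be called without going through
    mm_sample():

        : p1 = mm_srswr(10, 1000)       // SRS with replacement, 10 out of 1000

        : p2 = mm_srswor(10, 1000)      // SRS without replacement

        : p3 = mm_upswr(10, (1::5))     // UPS with replacement, weights 1,...,5

        : c  = mm_upswor(2, (1::5), 1)  // UPS without replacement, count vector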

    If you are serious about sampling, you should first set the random number
    seed; see help generate or help for [M-5] uniform().
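
    For example, to make a draw reproducible, the seed might be set in Stata
    before calling Mata (the seed value 12345 is an arbitrary choice):

        . set seed 12345

        . mata: p = mm_sample(10, 1000)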


Remarks

    Remarks are presented under the headings

        Introduction: Simple Random Sample with Replacement

        Stratified Sampling

        Cluster Sampling

        Stratified Cluster Sampling

        Sampling from Strata and Cluster ID Variables using mm_panels()

        Returning a Count Vector

        Sampling without Replacement

        Unequal Probability Sampling/PPS Sampling

        Methods and Formulas


    Introduction: Simple Random Sample with Replacement

    The simplest (and fastest) application of mm_sample() is to create a
    permutation vector representing a simple random sample with replacement
    (SRSWR). For example, the following command samples 10 out of a
    population of 1000:

        : mm_sample(10, 1000)
                  1
             +-------+
           1 |  578  |
           2 |  807  |
           3 |   47  |
           4 |    8  |
           5 |  900  |
           6 |  237  |
           7 |  545  |
           8 |   76  |
           9 |  398  |
          10 |  770  |
             +-------+

    The numbers in the returned vector represent the positions of the sampled
    elements in the (hypothetical) list of population members.

    Suppose X is a data matrix containing rows(X) observations and cols(X)
    variables. To create a matrix Xs, which represents an SRSWR containing 100
    randomly drawn observations from X, type

        : Xs = X[mm_sample(100,rows(X)),.]

    Note that in most applications you would want to save the sample
    permutation vector for further use. For example:

        : p = mm_sample(100,rows(X))
        
        : Xs = X[p,.]
        
        : Ys = Y[p,.]


    Stratified Sampling

    To generate a stratified SRSWR, provide to mm_sample() a column vector
    containing the sizes of the strata. Example:

        : mm_sample(5, (300\700))
                  1
             +-------+
           1 |  112  |
           2 |  130  |
           3 |  168  |
           4 |   62  |
           5 |  241  |
           6 |  474  |
           7 |  603  |
           8 |  669  |
           9 |  310  |
          10 |  994  |
             +-------+

    From each stratum, five elements were drawn. The first five cases in the
    returned sample come from the first stratum (1-300), the remaining five
    cases come from the second stratum (301-1000).

    To use different sample sizes in the strata, type, for example,

        : mm_sample((3\7), (300\700))
                  1
             +-------+
           1 |  298  |
           2 |  226  |
           3 |  192  |
           4 |  998  |
           5 |  956  |
           6 |  338  |
           7 |  900  |
           8 |  378  |
           9 |  980  |
          10 |  992  |
             +-------+

    Now the first three cases come from the first stratum and the remaining
    seven come from the second stratum. Note that mm_sample() has no internal
    mechanism to determine the sample sizes for proportional stratification
    from a given total sample size. However, it is easy to compute the
    appropriate sample sizes in advance and then provide them to mm_sample().
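
    A minimal sketch of such a computation, assuming a total sample size of
    50 (the names N, ntot, and n are illustrative):

        : N = (300\700)

        : ntot = 50

        : n = round(ntot * N / sum(N))

        : p = mm_sample(n, N)

    Here n evaluates to (15\35). Note that, because of rounding, stratum
    sample sizes computed this way need not always sum exactly to ntot.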


    Cluster Sampling

    To generate a sample of clusters, provide to mm_sample() a column vector
    containing the sizes of the clusters within the population. The sum of
    cluster sizes must equal the population size (unless the population size
    is missing, in which case the sum of cluster sizes defines the population
    size). The sample size n is interpreted as the number of clusters to be
    sampled in this case.

    For example, the following command randomly picks one of three clusters,
    where the first cluster has 3 members, the second cluster has 2 members,
    and the third cluster has 5 members (making a population total of 10).
    Note that, regardless of its size, each cluster has the same sampling
    probability (see below for sampling with probabilities proportional to
    size).

        : mm_sample(1, ., (3\2\5))
               1
            +-----+
          1 |  4  |
          2 |  5  |
            +-----+

    The result indicates that the second cluster was drawn (containing the
    4th and 5th member of the population).
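
    As a sketch, to draw the cluster with probability proportional to its
    size instead, the cluster sizes might additionally be passed as weights
    w (csize is an illustrative name):

        : csize = (3\2\5)

        : mm_sample(1, ., csize, csize)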


    Stratified Cluster Sampling

    Generating a stratified sample of clusters requires:

     o  A matrix containing the sizes of the strata and the number of
        clusters within each stratum. For example,

        : strata  = (5, 2) \ (10, 3)
        
        : strata
                1    2
            +-----------+
          1 |   5    2  |
          2 |  10    3  |
            +-----------+

        defines two strata, where the first stratum contains 2 clusters with
        a total of 5 members and the second stratum contains 3 clusters with
        a total of 10 members.

     o  A column vector containing the sizes of the clusters.

    In the following example, one cluster is sampled from each stratum:

        : strata  = (5, 2) \ (10, 3)
        
        : cluster = 3 \ 2 \ 2 \ 5 \ 3
        
        : mm_sample(1, strata, cluster)
                1
            +------+
          1 |   4  |
          2 |   5  |
          3 |   8  |
          4 |   9  |
          5 |  10  |
          6 |  11  |
          7 |  12  |
            +------+

    In both strata the second cluster was drawn.


    Sampling from Strata and Cluster ID Variables using mm_panels()

    When resampling real data, information on strata and clusters is usually
    present in the form of ID variables. The mm_panels() function, which is
    also part of the moremata package, can be used in this case to generate
    the appropriate strata and cluster input for mm_sample().

    Suppose you want to resample stratified and clustered data.  First, sort
    the data by stratum and cluster ID. For example, in Stata type

        . sort strata cluster

    where strata is the strata ID variable and cluster is the cluster ID
    variable. After that, in Mata type something like

        : st_view(strata=., ., "strata")
        
        : st_view(cluster=., ., "cluster")
        
        : mm_panels(strata, Sinfo=., cluster, Cinfo=.)
        
        : p = mm_sample(n, Sinfo, Cinfo)
        
        : ...

    Alternatively, if the data are stratified only, type

        . sort strata

    and then

        : st_view(strata=., ., "strata")
        
        : mm_panels(strata, Sinfo=.)
        
        : p = mm_sample(n, Sinfo)
        
        : ...

    or, if the data are clustered only,

        . sort cluster

    and then

        : st_view(cluster=., ., "cluster")
        
        : mm_panels(cluster, Cinfo=.)
        
        : p = mm_sample(n, ., Cinfo)
        
        : ...

    The following example further illustrates the usage of mm_panels():

        : strata,clusters
                1   2
             +---------+
           1 |  1   1  |
           2 |  1   1  |
           3 |  1   2  |
           4 |  1   3  |
           5 |  1   3  |
           6 |  1   3  |
           7 |  1   3  |
           8 |  1   4  |
           9 |  2   1  |
          10 |  2   2  |
          11 |  2   2  |
          12 |  2   2  |
          13 |  2   3  |
          14 |  2   3  |
             +---------+

        : mm_panels(strata, Sinfo=., clusters, Cinfo=.)
        
        : Sinfo
               1   2
            +---------+
          1 |  8   4  |
          2 |  6   3  |
            +---------+

        : Cinfo
               1
            +-----+
          1 |  2  |
          2 |  1  |
          3 |  4  |
          4 |  1  |
          5 |  1  |
          6 |  3  |
          7 |  2  |
            +-----+

        : mm_sample(1,Sinfo,Cinfo)
                1
            +------+
          1 |   1  |
          2 |   2  |
          3 |  10  |
          4 |  11  |
          5 |  12  |
            +------+


    Returning a Count Vector

    mm_sample() can return its results in two different formats. The default
    is to return a permutation vector containing the positions of the drawn
    elements in the population list. See the examples above. Alternatively,
    if count!=0 is specified, a count vector is returned. A count vector
    contains for each member of the population the number of times it has
    been drawn into the sample. The following example shows the count vector
    of a sample of 5 out of a population of 10 (with replacement):

        : mm_sample(5,10,.,.,0,1)
                1
             +-----+
           1 |  0  |
           2 |  0  |
           3 |  0  |
           4 |  0  |
           5 |  0  |
           6 |  0  |
           7 |  1  |
           8 |  0  |
           9 |  2  |
          10 |  2  |
             +-----+
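
    The counts sum to the sample size. As a sketch (c is an illustrative
    name), a count vector can also be used to look up which population
    members were drawn at least once; select() keeps the rows for which the
    count is nonzero:

        : c = mm_sample(5, 10, ., ., 0, 1)

        : sum(c)                    // equals the sample size, 5

        : select(1::rows(c), c)     // positions of members drawn at least once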


    Sampling without Replacement

    The following examples illustrate the difference between sampling with
    replacement and sampling without replacement. When sampling with
    replacement, an individual element may be sampled multiple times:

        : mm_sample(5,5,.,.,0,1)
               1
            +-----+
          1 |  3  |
          2 |  1  |
          3 |  1  |
          4 |  0  |
          5 |  0  |
            +-----+

    However, when sampling without replacement, each element may appear at
    most once in the sample:

        : mm_sample(5,5,.,.,1,1)
               1
            +-----+
          1 |  1  |
          2 |  1  |
          3 |  1  |
          4 |  1  |
          5 |  1  |
            +-----+

    Note that, naturally, the sample size n may not exceed the population
    size when sampling without replacement. (In the case of cluster sampling,
    n may not exceed the number of clusters.)
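
    If count is omitted or set to 0, a permutation vector is returned
    instead. For example, the following sketch draws 5 out of 10 without
    replacement, so that p contains 5 distinct positions:

        : p = mm_sample(5, 10, ., ., 1)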


    Unequal Probability Sampling/PPS Sampling

    For sampling with probabilities proportional to size (PPS) or, more
    generally, unequal probability sampling (UPS), you have to specify a
    column vector containing the sizes or weights. In the following example,
    an n = 15000 "sample" is drawn out of a population containing 5 members.
    The population members are sampled with probabilities proportional to
    size, where the first member has weight 1, the second has weight 2, etc.

        : mm_sample(15000, 5, ., (1::5),0,1)
                  1
            +--------+
          1 |  1068  |
          2 |  2076  |
          3 |  2909  |
          4 |  3969  |
          5 |  4978  |
            +--------+

    We see that, according to the given weights, the first member has been
    sampled roughly 1000 times, the second has been sampled around 2000 times,
    etc.

    Unequal probability sampling is also possible without replacement.
    However, note that in the without replacement case a problem exists if
    there are population members for which w(i) * n / sum(w) > 1. Consider
    the following example:

        : mm_sample(4, 5, ., (1::5),1,1)
                     mm_upswor():  3300  2 cases have w_i*n/sum(w)>1
                     mm_sample():     -  function returned error
                         <istmt>:     -  function returned error

    What happened? Population member no. 5 has size 5 and the sum of sizes
    over all members is 15. That is, the population share of member no. 5 is
    5/15 = 33.3%. However, even if member no. 5 is selected with certainty
    into the sample, i.e. if member no. 5 is sampled with probability 1, it
    can only reach a maximum sample share of 1/4 = 25%. (A similar problem
    exists with member no. 4 whose population share is 4/15 = 26.7%.)
    Evidently, unbiased PPS sampling without replacement is not possible in
    this situation.
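
    Whether the problem arises can be checked before sampling. A minimal
    sketch (w and n are illustrative names) that counts the offending
    population members:

        : w = (1::5)

        : n = 4

        : sum((w * n / sum(w)) :> 1)

    For the example above, the expression evaluates to 2 (members no. 4 and
    no. 5), in line with the error message.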


    Methods and Formulas

    Simple random sampling with replacement (SRSWR) is implemented as
    ceil(uniform(n,1) * N) where n is the sample size and N is the population
    size.

    Simple random sampling without replacement (SRSWOR) is implemented as
    unorder(N)[|1 \ n|].

    Unequal probability sampling with replacement (UPSWR) is implemented
    using the standard "cumulative" approach (see, e.g., Levy and Lemeshow
    1999:354 or Cochran 1977:250; important theoretical results have been
    provided by Hansen and Hurwitz 1943).
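
    The idea of the cumulative approach can be sketched as follows. This is
    an illustration only, not the code used by mm_upswr(); the function name
    upswr_sketch() is hypothetical, and w is assumed to be nonnegative with
    sum(w)>0:

        real colvector upswr_sketch(real scalar n, real colvector w)
        {
            real scalar    i
            real colvector cw, u, p

            cw = runningsum(w)
            cw = cw / cw[rows(cw)]          // cumulative shares; last is 1
            u  = uniform(n, 1)              // n uniform draws on [0,1)
            p  = J(n, 1, .)
            for (i=1; i<=n; i++) {
                p[i] = sum(cw :< u[i]) + 1  // first element with cw >= u[i]
            }
            return(p)
        }

    Each draw falls into the interval of cumulative shares belonging to one
    element, so that an element is selected with probability proportional to
    its weight.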

    Unequal probability sampling without replacement (UPSWOR) is implemented
    using the random systematic sampling technique discussed in, e.g.,
    Hartley and Rao (1962). Note that many other UPSWOR algorithms can be
    found in the literature (see the review in Brewer and Hanif 1983; the
    algorithm implemented here conforms to their "Procedure 2"). An
    interesting recent approach has been developed by Tillé (1996; also see
    Ernst 2003).


Conformability

    mm_sample(n, strata, cluster, w, wor, count, fast)
           n:  1 x 1 or k x 1, where k>0 is the number of strata
      strata:  k x 1 (if cluster!=.: k x 2)
     cluster:  l x 1, where l>0 is the number of clusters; alternatively,
               cluster==.
           w:  1 x 1 or N x 1 (if cluster!=.: l x 1)
         wor:  1 x 1
       count:  1 x 1
        fast:  1 x 1
      result:  ntot x 1, where ntot is the final sample size, or, if
               count!=0, N x 1, where N is the population size

    mm_srswr(n, N, count)
           n:  1 x 1
           N:  1 x 1
       count:  1 x 1
      result:  n x 1 or, if count!=0, N x 1

    mm_srswor(n, N, count)
           n:  1 x 1
           N:  1 x 1
       count:  1 x 1
      result:  n x 1 or, if count!=0, N x 1

    mm_upswr(n, w, count)
           n:  1 x 1
           w:  N x 1, where N is the population size
       count:  1 x 1
      result:  n x 1 or, if count!=0, N x 1

    mm_upswor(n, w, count, nowarn)
           n:  1 x 1
           w:  N x 1, where N is the population size
       count:  1 x 1
      nowarn:  1 x 1
      result:  n x 1 or, if count!=0, N x 1


Diagnostics

    mm_upswr() and mm_upswor() produce erroneous results if w contains
    negative or missing values or if sum(w)==0.


Source code

    mm_sample.mata, mm_srswr.mata, mm_srswor.mata, mm_upswr.mata,
    mm_upswor.mata


References

    Brewer, K. R. W., Muhammad Hanif (1983). Sampling with Unequal
        Probabilities. New York: Springer.

    Cochran, William G. (1977). Sampling Techniques, 3rd ed. New York: Wiley.

    Ernst, Lawrence (2003). Sample Expansion for Probability Proportional to
        Size without Replacement Sampling. Proceedings of the Section on
        Survey Research Methods, 2003, American Statistical Association: 
        http://www.bls.gov/ore/pdf/st030100.pdf.

    Hansen, Morris H., William N. Hurwitz (1943). On the Theory of Sampling
        from Finite Populations. The Annals of Mathematical Statistics 14:
        333-362.

    Hartley, H. O., J. N. K. Rao (1962). Sampling with Unequal Probabilities
        and without Replacement. The Annals of Mathematical Statistics 33:
        350-374.

    Levy, Paul S., Stanley Lemeshow (1999). Sampling of Populations. Methods
        and Applications, 3rd ed. New York: Wiley.

    Tillé, Yves (1996). An Elimination Procedure for Unequal Probability
        Sampling without Replacement. Biometrika 83: 238-241.


Author

    Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch


Also see

    Online:  help for mm_panels(), sample, bsample, [M-5] uniform(), [M-4]
             utility, moremata