help for samplepps       Stephen P. Jenkins (June 2005, help revised May 2008)

Draw random sample, proportional to size, of n cases

samplepps newvar [if exp] [in range] , ncases(integer) size(sizevar) [withrepl ]


Description

samplepps draws a random sample with ncases observations from the current data set, with probabilities proportional to size (`pps'). The default is to select cases without replacement; optionally cases may be selected with replacement.

If sampling is without replacement, the variable newvar is equal to 1 for selected cases, and 0 for non-selected cases. The program returns an error if the number of cases to be selected is greater than the number of valid observations, or if any observation has sizevar/(SUM_i sizevar) >= 1/ncases (such an observation would have an inclusion probability of 1 or more).
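The two error conditions can be checked by hand before calling samplepps. The lines below are a minimal sketch, assuming the schools.dta data and the n_pupils size variable from the examples in this help file; the scalar name sizetot and the target of 100 cases are illustrative only, and "valid" is taken here to mean observations with non-missing size.

```stata
* Sketch: check the without-replacement conditions by hand (assumed data: schools.dta)
use schools.dta, clear
quietly summarize n_pupils
scalar sizetot = r(sum)               // sum of sizes over non-missing observations
display "observations with non-missing size: " r(N)   // should be at least ncases (here 100)

* count observations whose share of total size is at least 1/ncases
count if !missing(n_pupils) & n_pupils / scalar(sizetot) >= 1 / 100
```

If count reports a nonzero number, samplepps would be expected to stop with an error when asked for 100 cases without the withrepl option.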

If sampling is with replacement, the variable newvar is equal to a positive integer for selected cases (the integer is the number of times the case has been selected), and 0 for non-selected cases. For both types of sampling, newvar is missing if sizevar is missing.
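For with-replacement sampling, the values of newvar can be inspected directly. This is a brief sketch, again assuming the schools.dta example data; the variable name hits is hypothetical.

```stata
* Sketch: inspect selection counts from a with-replacement draw
use schools.dta, clear
set seed 123517
samplepps hits, size(n_pupils) ncases(50) withrepl
tabulate hits, missing          // 0 = not selected; k > 0 = selected k times; . = size missing
quietly summarize hits
display "total selections: " r(sum)   // expected to equal ncases
```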

If you are serious about drawing random samples, you must first set the random number seed; see generate.

Methods for sampling with probabilities proportional to size are discussed by Lohr (1999). See also Levy and Lemeshow (1991, chapter 11) and Som (1973, chapter 5), who focus on the with-replacement case. The algorithm used by samplepps for the with-replacement case is the standard `cumulative method'. For the without-replacement case, I used an algorithm described by Jean-Yves Pip Courbois (formerly at the University of Washington), originally due to Madow (1949). For more details, see Brewer and Hanif (1983) and Cochran (1977, p. 265), who cites Hartley and Rao (1962) and Madow (1949).
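As an illustration of the cumulative method for the with-replacement case, the do-file fragment below draws 50 cases by hand. It is a sketch of the textbook technique under stated assumptions, not the code inside samplepps: sizes are assumed non-negative, the names cumsize, hits2 and draw are hypothetical, and runiform() assumes a Stata release that provides it (older releases use uniform()).

```stata
* Sketch of the cumulative method (with replacement); not samplepps's own code
use schools.dta, clear
set seed 123517
generate double cumsize = sum(n_pupils)     // running total; missing sizes add 0
generate int hits2 = 0
forvalues j = 1/50 {
    scalar draw = runiform() * cumsize[_N]  // uniform draw between 0 and total size
    * select the unit whose cumulative-size interval contains the draw
    quietly replace hits2 = hits2 + 1 if scalar(draw) <= cumsize & ///
        (scalar(draw) > cumsize[_n-1] | _n == 1)
}
replace hits2 = . if missing(n_pupils)      // mirror samplepps: missing size gives missing result
```

Each draw lands in exactly one unit's interval, and the interval's length is the unit's size, so the chance of selection on any draw is proportional to size.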


Options

ncases(integer) specifies the number of observations to be selected.

size(sizevar) specifies the name of the existing variable summarizing `size'.

withrepl specifies selection with replacement. (If the option is specified, a given observation may be selected more than once.)

Saved results

r(ncases) is the integer ncases.

r(nobs) is the number of valid observations at risk of being sampled.

r(sizevar) contains the name sizevar.

r(withrepl) = 1 if the with-replacement option was specified.

r(sample) contains the name newvar.
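The saved results can be listed or used right after a draw. A short sketch, assuming the schools.dta example data and that r(sizevar) and r(sample) are stored as macros (they are described above as containing names):

```stata
* Sketch: inspect saved results after a draw
use schools.dta, clear
set seed 123517
samplepps pick1, size(n_pupils) ncases(100)
return list
display "requested " r(ncases) " of " r(nobs) " valid observations"
display "size variable: `r(sizevar)', indicator: `r(sample)'"
```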


Examples

. // select a sample of schools with selection probabilities depending on # pupils per school.

. use schools.dta, clear

. set seed 123517

. samplepps pick1, size(n_pupils) n(100)

. samplepps pick2, size(n_pupils) n(50) withrepl
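Once the indicator variables exist, they can be turned into sample data sets. The do-file fragment below is a sketch that assumes pick1 and pick2 were created as in the examples above; for the with-replacement sample, expand is used so that a case selected k times appears k times.

```stata
* Sketch: build sample data sets from the indicators created above
preserve
keep if pick1 == 1                     // without-replacement sample
count                                  // number of selected cases
restore

keep if pick2 > 0 & !missing(pick2)    // with-replacement sample: selected cases only
expand pick2                           // one copy per time the case was selected
count
```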


Acknowledgements

Program written with support of ESRC grant number RES-000-22-0995 ("Social segregation in UK schools: benchmarking with international comparisons"). For helpful discussions, I thank project colleagues John Micklewright and Sylke Schnepf, and also Philippe Van Kerm. Steven Samuels drew my attention to the references by Cochran, Hartley and Rao, and Madow. Ben Jann drew my attention to the Brewer and Hanif reference.


Author

Stephen P. Jenkins, ISER, University of Essex, U.K. <stephenj@essex.ac.uk>


References

Brewer, K. R. W. and Muhammad Hanif. 1983. Sampling with Unequal Probabilities. New York: Springer.

Cochran, William G. 1977. Sampling Techniques, 3rd Edition. New York: Wiley.

Madow, William G. 1949. On the theory of systematic sampling. II. Annals of Mathematical Statistics, 20: 333-354.

Hartley, H.O. and J.N.K. Rao. 1962. Sampling with unequal probabilities and without replacement. Annals of Mathematical Statistics, 33: 350-374.

Levy, Paul S. and Stanley Lemeshow. 1991. Sampling of Populations: Methods and Applications, 2nd edition. New York: John Wiley and Sons.

Lohr, Sharon L. 1999. Sampling: Design and Analysis. Pacific Grove CA: Duxbury Press.

Som, Ranjan K. 1973. Practical Sampling Techniques, second edition, revised and expanded. New York: Marcel Dekker.

Also see

Manual: [S-Z] sample

On-line: help for sample, and gsample if installed.