{smcl}
{* revised 30aug2014}{...}
{cmd:help randomtag}
{hline}

{title:Title}

{phang}
{bf:randomtag} {hline 2} Tag a random number of observations

{title:Syntax}

{p 4 16 2}
{cmd:randomtag} {ifin} {cmd:,} {opt c:ount(#)} [{opth g:enerate(newvar)}]

{title:Description}

{pstd}
Like Stata's {help sample} command, {cmd:randomtag} draws observations without replacement.
Unlike {help sample}, {cmd:randomtag} does not discard observations; instead, it creates an indicator variable that tags the observations that are part of the pseudorandom sample.

{pstd}
If the {opth g:enerate(newvar)} option is omitted, the default name for the tag variable is {cmd:_randomtag}.

{pstd}
The desired number of observations in the sample is specified using {opt c:ount(#)}.
If {it:#} is larger than the number of observations, all observations are tagged.
When {ifin} qualifiers are used and {it:#} is larger than the total number of observations that meet the criteria, all observations that meet the criteria are tagged.

{pstd}
{cmd:randomtag} is much faster than Stata's {help sample} command because it does not need to reorder the observations in memory to complete its task.
Given the same seed, {bf:randomtag} and {help sample} draw the same observations, subject to one caveat described in {it:{help randomtag##tech:Technical Details}} below.

{pstd}
{cmd:randomtag} is a stand-alone version of the code that was developed to quickly draw a random sample for {stata "ssc des listsome":listsome} (from {help SSC}).

{pstd}
{cmd:randomtag} requires Stata version 9 or newer.

{marker tech}{...}
{title:Technical Details}

{pstd}
If you have a dataset in memory with {help _N} observations and you want to draw a pseudorandom sample without replacement of {it:n} observations, you could use

        {cmd:.} keep if runiform() <= n/{help _N}

{pstd}
The problem is that while the resulting dataset may contain {it:n} observations, more likely it will contain a few fewer or a few more.
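
{pstd}
As a rough illustration of that variability, the lines below count how many observations the filter would keep when 100 are wanted out of 10,000.
The seed, the sizes, and the variable name {it:u} are arbitrary choices made for this sketch; they do not come from {bf:randomtag}.

        {cmd:.} clear
        {cmd:.} set seed 12345
        {cmd:.} set obs 10000
        {cmd:.} generate double u = runiform()
        {cmd:.} count if u <= 100/_N

{pstd}
The count is a binomial draw with mean 100 and a standard deviation of about 10, so landing on exactly 100 is the exception rather than the rule.

{pstd}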
As explained in satisfying detail by William Gould (StataCorp) in {browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":The Stata Blog}, to draw {it:n} observations without replacement in a reproducible way, you can use

        {cmd:.} set seed #
        {cmd:.} sort {it:variables_that_put_data_in_unique_order}
        {cmd:.} generate double u1 = runiform()
        {cmd:.} generate double u2 = runiform()
        {cmd:.} sort u1 u2
        {cmd:.} keep in 1/n

{pstd}
Stata's {help sample} command implements the above strategy.
However, since this approach requires sorting the data, {help sample} run times increase more than linearly with {help _N}.
The time penalty is quite severe when dealing with millions of observations.

{pstd}
{bf:randomtag} follows exactly the same strategy but capitalizes on the fact that {it:u1} < {it:n}/{help _N} is true for almost all observations that are retained. It defines an initial {it:cutoff} = {it:n}/{help _N}.
If it turns out that count({it:u1} < {it:cutoff}) == {it:n}, {bf:randomtag} simply tags observations using {it:u1} < {it:cutoff} and is done.

{pstd}
If the yield is lower, then {it:lowcut = cutoff} and {it:highcut = cutoff} are set and {it:highcut} is increased in small increments until count({it:u1} < {it:highcut}) > {it:n}.
The reverse is done if the initial {it:cutoff} yielded too many observations.

{pstd}
Once {it:lowcut} and {it:highcut} are determined, all observations with {it:u1} < {it:lowcut} are tagged for inclusion.
A matrix is then put together to hold {it:u1}, {it:u2}, and observation indices for the very small subset where ({it:u1} >= {it:lowcut} & {it:u1} <= {it:highcut}).
The rows are sorted in the order of ({it:u1}, {it:u2}, indices) and the first rows, as many as are needed to bring the total up to {it:n}, are tagged.

{pstd}
Because the main data in memory is never sorted, {bf:randomtag} is much faster than {help sample}.
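
{pstd}
The cutoff logic above can be sketched in plain Stata.
This is an illustration under stated assumptions, not {bf:randomtag}'s Mata code: the names {it:id}, {it:tag}, and {it:band}, the seed, and the sizes are made up for the example; the band is set in one step at five standard deviations on either side of the cutoff instead of being widened in small increments; and {cmd:bysort} reorders the whole dataset for brevity, which is precisely the cost that {bf:randomtag} avoids by ranking only the band in a small matrix.

        {cmd:.} clear
        {cmd:.} set seed 2014
        {cmd:.} set obs 100000
        {cmd:.} local n 1000
        {cmd:.} generate long id = _n
        {cmd:.} generate double u1 = runiform()
        {cmd:.} generate double u2 = runiform()
        {cmd:.} * assumed band: 5 standard deviations around the initial cutoff n/_N
        {cmd:.} local lowcut = (`n' - 5*sqrt(`n'))/_N
        {cmd:.} local highcut = (`n' + 5*sqrt(`n'))/_N
        {cmd:.} * observations below the band are certain to be in the sample
        {cmd:.} generate byte tag = u1 < `lowcut'
        {cmd:.} count if tag
        {cmd:.} local need = `n' - r(N)
        {cmd:.} * only the narrow band around the cutoff has to be ranked
        {cmd:.} generate byte band = u1 >= `lowcut' & u1 <= `highcut'
        {cmd:.} count if band
        {cmd:.} assert `need' >= 0 & `need' <= r(N)
        {cmd:.} bysort band (u1 u2): replace tag = 1 if band & _n <= `need'
        {cmd:.} sort id
        {cmd:.} count if tag

{pstd}
If the {cmd:assert} passes, the final count equals the 1,000 requested, and the tagged observations are the 1,000 smallest in ({it:u1}, {it:u2}) order.
If the band were too narrow to supply the missing observations, the {cmd:assert} would stop the run instead of tagging too few.

{pstd}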
Since {bf:randomtag} follows exactly the same strategy that {help sample} uses to draw random observations, both should generate exactly the same random sample provided the same {help seed} is used.

{pstd}
There is, however, a small chance that {bf:randomtag} will not match {help sample} completely.
{bf:randomtag} is coded in Mata, so {it:u1} and {it:u2} are doubles, while {help sample} uses floats.
Squeezing Stata's 32-bit random numbers generated by {cmd:runiform()} into 23-bit floats increases the chances that {it:u1} will contain duplicate random numbers.
Since {it:u2} is there to break the ties, the two-float approach used by {help sample} is better than using a single double and less resource intensive than using doubles for both {it:u1} and {it:u2}.
Since {bf:randomtag}'s versions of {it:u1} and {it:u2} have more precision, it is possible that the last observation drawn will be different if there is a duplicate in {help sample}'s {it:u1} at observation {it:n}.
A version of {bf:randomtag} that is completely compatible with {help sample} can be provided upon request (it's a bit slower and requires Stata 10 or higher).

{title:Examples}

{pstd}
Load some data in memory

        {cmd:.} {stata sysuse nlsw88.dta}

{pstd}
Draw a random sample using Stata's {help sample} command

        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata sample 10, count}
        {cmd:.} {stata sort idcode}
        {cmd:.} {stata list idcode}

{pstd}
Redo using {bf:randomtag}

        {cmd:.} {stata sysuse nlsw88.dta, clear}
        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata randomtag , count(10) gen(t)}
        {cmd:.} {stata keep if t}
        {cmd:.} {stata list idcode}

{pstd}
A big job for {help sample}

        {cmd:.} {stata clear}
        {cmd:.} {stata set seed 651651}
        {cmd:.} {stata set obs 10000000}
        {cmd:.} {stata gen n = _n}
        {cmd:.} {stata randomtag , count(1000000) gen(t)}
        {cmd:.} {stata sum n if t}
        {cmd:.} {stata set seed 651651}
        {cmd:.} {stata sample 1000000, count}
        {cmd:.} {stata sum n}

{title:References}

{pstd}
Gould, W. W. 2012a.
Using Stata's random-number generators, part 2: Drawing without replacement.
The Stata Blog: Not Elsewhere Classified.
{browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/}

{title:Also see}

{psee}
SSC: {stata "ssc des fastsample":fastsample} is a similar program written by Andrew Maurer.
It also uses Mata to draw a random sample using a different approach that completely avoids sorting.
Observations not in the random sample are dropped.
You can expect similar performance between {stata "ssc des fastsample":fastsample} and {cmd:randomtag}.
{stata "ssc des fastsample":fastsample} requires Stata version 13 or higher.
{p_end}

{psee}
SSC: {stata "ssc des listsome":listsome} uses an embedded version of {bf:randomtag} to draw random observations to list.
{p_end}

{psee}
Help: {manhelp sample D}
{p_end}

{psee}
FAQs: {browse "http://www.stata.com/support/faqs/statistics/random-samples/":How can I take random samples from an existing dataset?}
{p_end}

{title:Author}

{pstd}Robert Picard{p_end}
{pstd}picard@netbox.com{p_end}