{smcl}
{* revised 30aug2014}{...}
{cmd:help randomtag}
{hline}
{title:Title}
{phang}
{bf:randomtag} {hline 2} Tag a random sample of observations
{title:Syntax}
{p 4 16 2}
{cmd:randomtag}
{ifin}
{cmd:,}
{opt c:ount(#)}
[{opth g:enerate(newvar)}]
{title:Description}
{pstd}
Like Stata's {help sample} command, {cmd:randomtag} draws observations without replacement.
Unlike {help sample}, {cmd:randomtag} does not discard
observations;
instead, it creates an indicator variable that tags the observations that are part
of the pseudorandom sample.
{pstd}
If the {opth g:enerate(newvar)}
option is omitted, the default name for the tag variable
is {cmd:_randomtag}.
{pstd}
The desired number of observations in the sample is specified using {opt c:ount(#)}.
If {it:#} is larger than the number of observations, all observations
are tagged.
When {ifin} qualifiers are used and {it:#} is larger
than the number of observations that meet the criteria, all observations
that meet the criteria are tagged.
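{pstd}
As a quick illustration of this rule, the sketch below uses the {cmd:auto}
dataset shipped with Stata and asks for more observations than satisfy the
{cmd:if} condition. The variable name {cmd:t} and the count of 100 are
arbitrary choices. Based on the description above, the two counts should agree.

        {cmd:.} sysuse auto, clear
        {cmd:.} randomtag if foreign, count(100) generate(t)
        {cmd:.} count if foreign
        {cmd:.} count if t == 1
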
{pstd}
{cmd:randomtag} is much faster than Stata's {help sample} command because it
does not need to reorder the observations in memory to complete its task.
Given the same seed, {bf:randomtag} and
{help sample} will draw exactly the same observations.
See one caveat in {it:{help randomtag##tech:Technical Details}} below.
{pstd}
{cmd:randomtag} is a stand-alone version of the code that was developed to
quickly draw a random sample for
{stata "ssc des listsome":listsome} (from {help SSC}).
{pstd}
{cmd:randomtag} requires Stata version 9 or newer.
{marker tech}{...}
{title:Technical Details}
{pstd}
If you have a dataset in memory with {help _N} observations and you want
to draw a pseudorandom sample without replacement of {it:n} observations,
you could use
{cmd:.} keep if runiform() <= n/{help _N}
{pstd}
The problem is that while the resulting dataset may contain exactly {it:n} observations,
it will more likely contain a few fewer or a few more.
As explained in satisfying detail by William Gould (StataCorp) in
{browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":The Stata Blog},
to draw {it:n} observations without replacement in a reproducible way, you can use
{cmd:.} set seed #
{cmd:.} sort {it:variables_that_put_data_in_unique_order}
{cmd:.} generate double u1 = runiform()
{cmd:.} generate double u2 = runiform()
{cmd:.} sort u1 u2
{cmd:.} keep in 1/n
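
{pstd}
The same strategy can tag observations instead of dropping them. The
following is only a sketch on the {cmd:nlsw88} data: the seed, the sample size
of 10, and the variable name {cmd:tag} are arbitrary choices, and
{cmd:idcode} is assumed to put the data in a unique order. The first
{cmd:count} shows how many observations the simple cutoff rule would have kept
(not necessarily 10); the final {cmd:count} reports exactly 10.

        {cmd:.} sysuse nlsw88, clear
        {cmd:.} set seed 12345
        {cmd:.} count if runiform() <= 10/_N
        {cmd:.} set seed 12345
        {cmd:.} sort idcode
        {cmd:.} generate double u1 = runiform()
        {cmd:.} generate double u2 = runiform()
        {cmd:.} sort u1 u2
        {cmd:.} generate byte tag = _n <= 10
        {cmd:.} sort idcode
        {cmd:.} count if tag
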
{pstd}
Stata's {help sample} command implements the above strategy. However,
since this approach requires sorting the data, {help sample} run times
increase more than linearly with {help _N}. The time penalty
is quite severe when dealing with millions of observations.
{pstd}
{bf:randomtag} follows exactly the same strategy but
capitalizes on the fact that {it:u1} < {it:n}/{help _N} is true for almost all observations
that are retained. It therefore defines an initial {it:cutoff} = {it:n}/{help _N}. If it
turns out that count({it:u1} < {it:cutoff}) == {it:n},
{bf:randomtag} simply tags observations using {it:u1} < {it:cutoff} and is done.
{pstd}
If the initial {it:cutoff} yields fewer than {it:n} observations, both {it:lowcut}
and {it:highcut} are set to {it:cutoff}, and
{it:highcut} is then increased in small increments until count({it:u1} < {it:highcut}) > {it:n}.
The inverse is done if the initial {it:cutoff} yielded too many observations.
{pstd}
Once {it:lowcut} and {it:highcut} are determined, then all observations with
{it:u1} < {it:lowcut} are tagged for inclusion. A matrix is then put together
to hold {it:u1}, {it:u2}, and observation indices
for the very small subset where ({it:u1} >= {it:lowcut} & {it:u1} <= {it:highcut}).
The rows are sorted by ({it:u1}, {it:u2}, index), and as many of the first
rows as are needed to bring the total up to {it:n} are tagged.
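
{pstd}
The bracketing steps can be sketched in Mata roughly as follows. This is an
illustration only and not {bf:randomtag}'s actual code: the increment
{cmd:step}, the sample size of 10, and the variable name {cmd:tag_sketch} are
assumptions. For simplicity, the sketch always lowers {it:lowcut} until fewer
than {it:n} observations fall below it. It requires {it:n} <= {help _N} and
uses {cmd:selectindex()}, which needs Stata 13 or newer. The final
{cmd:count} should report 10.

        {cmd:.} sysuse nlsw88, clear
        {cmd:.} set seed 12345
        {cmd:.} mata:
        {cmd::} n = 10
        {cmd::} N = st_nobs()
        {cmd::} u1 = runiform(N, 1)
        {cmd::} u2 = runiform(N, 1)
        {cmd::} step = 1/N
        {cmd::} lowcut = n/N
        {cmd::} highcut = n/N
        {cmd::} while (sum(u1 :< lowcut) >= n) lowcut = lowcut - step
        {cmd::} while (sum(u1 :<= highcut) < n) highcut = highcut + step
        {cmd::} tag = (u1 :< lowcut)
        {cmd::} idx = selectindex(u1 :>= lowcut :& u1 :<= highcut)
        {cmd::} B = sort((u1[idx], u2[idx], idx), (1, 2, 3))
        {cmd::} need = n - sum(tag)
        {cmd::} tag[B[|1,3 \ need,3|], 1] = J(need, 1, 1)
        {cmd::} st_store(., st_addvar("byte", "tag_sketch"), tag)
        {cmd::} end
        {cmd:.} count if tag_sketch

{pstd}
Only the few rows held in {cmd:B} are ever sorted; the dataset itself stays in
its original order.
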
{pstd}
Because the main data in memory is never
sorted, {bf:randomtag} is much faster than {help sample}. Since {bf:randomtag}
follows exactly the same strategy that {help sample} uses to draw random observations,
both should generate exactly the same random sample provided the same {help seed} is used.
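
{pstd}
One way to check the speed difference on your own machine is Stata's
{help timer} command. The sketch below uses an arbitrary dataset size, seed,
and variable names ({cmd:id}, {cmd:t}). Timings depend on hardware and Stata
flavor, so none are shown here.

        {cmd:.} clear
        {cmd:.} set obs 1000000
        {cmd:.} generate long id = _n
        {cmd:.} timer clear
        {cmd:.} set seed 651651
        {cmd:.} timer on 1
        {cmd:.} randomtag , count(100000) generate(t)
        {cmd:.} timer off 1
        {cmd:.} set seed 651651
        {cmd:.} timer on 2
        {cmd:.} sample 100000, count
        {cmd:.} timer off 2
        {cmd:.} timer list
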
{pstd}
There is, however, a small chance that {bf:randomtag} will not match {help sample}
completely. {bf:randomtag} is coded in Mata, so {it:u1} and {it:u2} are
doubles, while {help sample} uses floats. Squeezing Stata's 32-bit random numbers
generated by {cmd:runiform()} into 23-bit floats increases the chances that
{it:u1} will contain duplicate random numbers. Since {it:u2} is there to break
the ties, {help sample}'s use of two floats is better than using a single double
and less resource intensive
than using doubles for both {it:u1} and {it:u2}.
Because {bf:randomtag}'s
versions of {it:u1} and {it:u2} have more precision, it is possible that the last
observation drawn will be different if there
is a duplicate in {help sample}'s {it:u1} at observation {it:n}.
A version of {bf:randomtag} that is completely compatible with {help sample}
can be provided upon request (it's a bit slower and requires Stata 10 or higher).
{title:Examples}
{pstd}
Load some data in memory
{cmd:.} {stata sysuse nlsw88.dta}
{pstd}
Draw a random sample using Stata's {help sample} command
{cmd:.} {stata set seed 9583945}
{cmd:.} {stata sample 10, count}
{cmd:.} {stata sort idcode}
{cmd:.} {stata list idcode}
{pstd}
Redo using {bf:randomtag}
{cmd:.} {stata sysuse nlsw88.dta, clear}
{cmd:.} {stata set seed 9583945}
{cmd:.} {stata randomtag , count(10) gen(t)}
{cmd:.} {stata keep if t}
{cmd:.} {stata list idcode}
{pstd}
A big job for {help sample}
{cmd:.} {stata clear}
{cmd:.} {stata set seed 651651}
{cmd:.} {stata set obs 10000000}
{cmd:.} {stata gen n = _n}
{cmd:.} {stata randomtag , count(1000000) gen(t)}
{cmd:.} {stata sum n if t}
{cmd:.} {stata set seed 651651}
{cmd:.} {stata sample 1000000, count}
{cmd:.} {stata sum n}
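
{pstd}
Check that both commands pick the same observations. This is a sketch: it
tags with {bf:randomtag} first, then runs {help sample} with the same seed on
the same data, so every observation that survives should already be tagged.
The {cmd:assert} could fail in the rare case described in
{it:{help randomtag##tech:Technical Details}}.

        {cmd:.} {stata sysuse nlsw88.dta, clear}
        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata randomtag , count(10) gen(t)}
        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata sample 10, count}
        {cmd:.} {stata assert t == 1}
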
{title:References}
{pstd}
Gould, W. W. 2012a. Using Stata's random-number generators, part 2:
Drawing without replacement. The Stata Blog: Not Elsewhere Classified.
{browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/}
{title:Also see}
{psee}
SSC: {stata "ssc des fastsample":fastsample} is a similar program written by
Andrew Maurer. It also uses Mata to draw a random sample using a different approach
that completely avoids sorting. Observations not in the random sample are dropped.
You can expect similar performance between {stata "ssc des fastsample":fastsample}
and {cmd:randomtag}. {stata "ssc des fastsample":fastsample} requires Stata version 13
or higher.
{p_end}
{psee}
SSC: {stata "ssc des listsome":listsome} uses an embedded version of {bf:randomtag}
to draw random observations to list.
{p_end}
{psee}
Help: {manhelp sample D}
{p_end}
{psee}
FAQs: {browse "http://www.stata.com/support/faqs/statistics/random-samples/":How can I take random samples from an existing dataset?}
{p_end}
{title:Author}
{pstd}Robert Picard{p_end}
{pstd}picard@netbox.com{p_end}