{smcl}
{* revised 30aug2014}{...}
{cmd:help randomtag}
{hline}

{title:Title}

{phang}
{bf:randomtag} {hline 2} Tag a random sample of observations


{title:Syntax}

{p 4 16 2}
{cmd:randomtag}  
	{ifin}
	{cmd:,}
	{opt c:ount(#)}
	[{opth g:enerate(newvar)}]
	

{title:Description}

{pstd}
Like Stata's {help sample} command, {cmd:randomtag} draws observations without replacement.
Unlike {help sample}, {cmd:randomtag} does not discard observations;
instead, it creates an indicator variable that tags the observations that are part
of the pseudorandom sample.

{pstd}
If the {opth g:enerate(newvar)}
option is omitted, the default name for the tag variable
is {cmd:_randomtag}.

{pstd}
The desired number of observations in the sample is specified using {opt c:ount(#)}.
If {it:#} is larger than the number of observations, all observations
are tagged.
When {ifin} qualifiers are used and {it:#} is larger
than the total number of observations that meet the criteria, all observations
that meet the criteria are tagged.
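
{pstd}
As an illustration only, the following sketch restricts the draw with an
{cmd:if} qualifier. The dataset, the seed, the sample size, and the variable
name {cmd:picked} are arbitrary choices made for this example. It asks for 50
college graduates from the {cmd:nlsw88} example dataset and then counts how
many observations were tagged:

        {cmd:.} sysuse nlsw88, clear
        {cmd:.} set seed 12345
        {cmd:.} randomtag if collgrad == 1, count(50) generate(picked)
        {cmd:.} count if picked == 1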

{pstd}
{cmd:randomtag} is much faster than Stata's {help sample} command because it
does not need to reorder the observations in memory to complete its task.
Given the same seed, {bf:randomtag} and {help sample} will draw exactly the
same observations, with one caveat described in
{it:{help randomtag##tech:Technical Details}} below.

{pstd}
{cmd:randomtag} is a stand-alone version of the code that was developed to
quickly draw a random sample for
{stata "ssc des listsome":listsome} (from {help SSC}).

{pstd}
{cmd:randomtag} requires Stata version 9 or newer.


{marker tech}{...}
{title:Technical Details}

{pstd}
If you have a dataset in memory with {help _N} observations and you want
to draw a pseudorandom sample without replacement of {it:n} observations,
you could use

      {cmd:.} keep if runiform() <= n/{help _N}
      
{pstd}
The problem is that, while the resulting dataset may contain exactly {it:n} observations,
it will more likely contain a few fewer or a few more.
As explained in satisfying detail by William Gould (StataCorp) in
{browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":The Stata Blog},
to draw {it:n} observations without replacement in a reproducible way, you can use

        {cmd:.} set seed #
        {cmd:.} sort {it:variables_that_put_data_in_unique_order}
        {cmd:.} generate double u1 = runiform()
        {cmd:.} generate double u2 = runiform()
        {cmd:.} sort u1 u2
        {cmd:.} keep in 1/n

{pstd}
Stata's {help sample} command implements the above strategy. However,
since this approach requires sorting the data, {help sample} run times
increase more than linearly with {help _N}. The time penalty
is quite severe when dealing with millions of observations.
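
{pstd}
For comparison, the sort-based steps can be adapted to tag observations
rather than drop them. The sketch below is an assumption-laden illustration,
not the code used by {help sample} or {bf:randomtag}: the variable names
{cmd:obsno}, {cmd:u1}, {cmd:u2}, and {cmd:tag} are hypothetical, and 10 stands
in for {it:n}. It sorts the full dataset twice, which is exactly the cost
discussed above.

        {cmd:.} sysuse nlsw88, clear
        {cmd:.} set seed 9583945
        {cmd:.} generate long obsno = _n
        {cmd:.} generate double u1 = runiform()
        {cmd:.} generate double u2 = runiform()
        {cmd:.} sort u1 u2
        {cmd:.} generate byte tag = _n <= 10
        {cmd:.} sort obsno
        {cmd:.} count if tag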

{pstd}
{bf:randomtag} follows exactly the same strategy but
capitalizes on the fact that {it:u1} < {it:n}/{help _N} is true for almost all observations
that are retained. It defines an initial {it:cutoff = n / {help _N}}. If it
turns out that count({it:u1} < {it:cutoff}) == {it:n},
{bf:randomtag} simply tags observations using {it:u1} < {it:cutoff} and is done.
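
{pstd}
The initial check can be mimicked with ordinary Stata commands. This is only a
rough sketch of the idea, since {bf:randomtag} itself is coded in Mata; the
local macro names {cmd:n} and {cmd:cutoff} and the variable name {cmd:u1} are
assumptions made for the illustration. The last line displays how far the
first cut lands from the target: zero means the cutoff alone is enough, a
negative value means too few observations, and a positive value means too many.

        {cmd:.} sysuse nlsw88, clear
        {cmd:.} set seed 9583945
        {cmd:.} generate double u1 = runiform()
        {cmd:.} local n 10
        {cmd:.} local cutoff = `n'/_N
        {cmd:.} count if u1 < `cutoff'
        {cmd:.} display r(N) - `n'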

{pstd}
If the yield is lower than {it:n}, both {it:lowcut} and {it:highcut} are set to {it:cutoff}, and
{it:highcut} is then increased in small increments until count({it:u1} < {it:highcut}) > {it:n}.
The inverse is done if the initial {it:cutoff} yielded too many observations.

{pstd}
Once {it:lowcut} and {it:highcut} are determined, all observations with
{it:u1} < {it:lowcut} are tagged for inclusion. A matrix is then put together
to hold {it:u1}, {it:u2}, and observation indices
for the very small subset where ({it:u1} >= {it:lowcut} & {it:u1} <= {it:highcut}).
The rows are sorted in the order of ({it:u1}, {it:u2}, indices), and as many of the
leading rows as are needed to bring the total up to {it:n} are tagged.

{pstd}
Because the main data in memory is never
sorted, {bf:randomtag} is much faster than {help sample}. Since {bf:randomtag} 
follows exactly the same strategy that {help sample} uses to draw random observations, 
both should generate exactly the same random sample provided the same {help seed} is used.

{pstd}
There is, however, a small chance that {bf:randomtag} will not match {help sample}
completely. {bf:randomtag} is coded in Mata, so {it:u1} and {it:u2} are
doubles, while {help sample} uses floats. Squeezing Stata's 32-bit random numbers
generated by {cmd:runiform()} into 23-bit floats increases the chances that
{it:u1} will contain duplicate random numbers. Since {it:u2} is there to break
the ties, this is better than using a single double and less resource intensive
than using doubles for both {it:u1} and {it:u2}.
Since {bf:randomtag}'s versions of {it:u1} and {it:u2} have more precision,
it is possible that the last observation drawn will be different if there
is a duplicate in {help sample}'s {it:u1} at observation {it:n}.
A version of {bf:randomtag} that is completely compatible with {help sample}
can be provided upon request (it's a bit slower and requires Stata 10 or higher).


{title:Examples}

{pstd}
Load some data in memory

        {cmd:.} {stata sysuse nlsw88.dta}

{pstd}
Draw a random sample using Stata's {help sample} command

        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata sample 10, count}
        {cmd:.} {stata sort idcode}
        {cmd:.} {stata list idcode}
        
{pstd}
Redo using {bf:randomtag}

        {cmd:.} {stata sysuse nlsw88.dta, clear}
        {cmd:.} {stata set seed 9583945}
        {cmd:.} {stata randomtag , count(10) gen(t)}
        {cmd:.} {stata keep if t}
        {cmd:.} {stata list idcode}
        

{pstd}
A big job for {help sample}

        {cmd:.} {stata clear}
        {cmd:.} {stata set seed 651651}
        {cmd:.} {stata set obs 10000000}
        {cmd:.} {stata gen n = _n}
        {cmd:.} {stata randomtag , count(1000000) gen(t)}
        {cmd:.} {stata sum n if t}
        {cmd:.} {stata set seed 651651}
        {cmd:.} {stata sample 1000000, count}
        {cmd:.} {stata sum n}
        

{title:References}

{pstd}
Gould, W. W. 2012. Using Stata's random-number generators, part 2:
	Drawing without replacement. The Stata Blog: Not Elsewhere Classified.
{browse "http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/":http://blog.stata.com/2012/08/03/using-statas-random-number-generators-part-2-drawing-without-replacement/}


{title:Also see}

{psee}
SSC:  {stata "ssc des fastsample":fastsample} is a similar program written by
Andrew Maurer. It also uses Mata to draw a random sample using a different approach 
that completely avoids sorting. Observations not in the random sample are dropped.
You can expect similar performance between {stata "ssc des fastsample":fastsample}
and {cmd:randomtag}. {stata "ssc des fastsample":fastsample} requires Stata version 13
or higher.
{p_end}

{psee}
SSC:  {stata "ssc des listsome":listsome} uses an embedded version of {bf:randomtag} 
to draw random observations to list.
{p_end}

{psee}
Help: {manhelp sample D}
{p_end}

{psee}
FAQs:  {browse "http://www.stata.com/support/faqs/statistics/random-samples/":How can I take random samples from an existing dataset?}
{p_end}


{title:Author}

{pstd}Robert Picard{p_end}
{pstd}picard@netbox.com{p_end}