------------------------------------------------------------------------------- help distinct (SJ12-2: dm0042) -------------------------------------------------------------------------------

Title

distinct -- Report number(s) of distinct observations or values

Syntax

distinct [varlist] [if] [in] [, missing abbrev(#) joint minimum(#) maximum(#) ]

by is allowed; see [D] by.

Description

The distinct command displays the number of distinct observations with respect to the variables in varlist. By default, each variable is considered separately so that the number of distinct observations for each variable is reported; the number of distinct observations is the same as the number of distinct values. Optionally, variables can be considered jointly so that the number of distinct groups defined by the values of variables in varlist is reported.

By default, missing values are not counted. varlist may contain both numeric and string variables.

Options

missing specifies that missing values are to be included in counting distinct observations.

abbrev(#) specifies that variable names are to be displayed abbreviated to at most # characters. This option has no effect with joint.

joint specifies that distinctness is to be determined jointly for the variables in varlist.

minimum(#) specifies that numbers of distinct values are to be displayed only if they are greater than or equal to #.

maximum(#) specifies that numbers of distinct values are to be displayed only if they are less than or equal to #.
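
As a hedged illustration of the two filtering options, the sketch below lists only variables in the auto dataset whose distinct counts fall within a chosen range. It assumes that distinct is installed and that minimum() and maximum() may be specified together, which the syntax diagram suggests but this help file does not state outright. The cutoffs 5 and 20 are arbitrary.

```stata
sysuse auto, clear

* show only variables with at least 5 distinct values
distinct, minimum(5)

* show only variables with between 5 and 20 distinct values
* (assumption: the two options can be combined)
distinct, minimum(5) maximum(20)
```

Filtering like this can be handy for picking out variables that are plausible candidates for categorical treatment.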

Remarks

Distinctness, duplication, and uniqueness are different aspects of the similarity and difference of observations. Suppose the values of some variable are 1, 2, 2, 3, 3, 3, 4, 4, 4, 4. Then there are four distinct values: 1, 2, 3, and 4. Alternatively, there are, so far as this variable is concerned, four distinct observations because, for example, the second and third observations, both containing the value 2, are identical with respect to this variable. Of these values, 2, 3, and 4 are duplicated in the data, meaning that each occurs twice or more. Some people refer to the distinct values as unique values, even though in general distinct values could all be repeated in the data. One logic behind that terminology is that if you remove all duplicates from these data, then you are left with four distinct values, each of which occurs once.
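
The count of four distinct values in this example can be checked by hand using official Stata commands alone. What follows is a minimal sketch of one common approach, not a description of how distinct works internally; the variable names x and first are made up for illustration.

```stata
clear
input x
1
2
2
3
3
3
4
4
4
4
end

* flag the first observation within each group of identical values
bysort x: generate byte first = (_n == 1)

* the number of flagged observations is the number of distinct values
count if first
display r(N)
```

With distinct installed, typing distinct x on the same data should report the same number of distinct values.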

Now consider distinctness determined jointly for two variables. Suppose observations are 1 and "a", 2 and "b", 2 and "b", 3 and "c", 3 and "c", 3 and "d", 4 and "c", 4 and "c", 4 and "d", 4 and "d". Then, as far as these two variables are concerned, there are six distinct observations, 1 and "a", 2 and "b", 3 and "c", 3 and "d", 4 and "c", 4 and "d". Considering the variables individually, there are four distinct values for the first variable and four for the second. Clearly, the same principles of considering variables individually and jointly extend to three or more variables.
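
The two-variable example can be reproduced in the same spirit. The sketch below, again using only official commands and hypothetical variable names (x, s, first), flags one observation per distinct combination of the two variables and counts the flags.

```stata
clear
input x str1 s
1 "a"
2 "b"
2 "b"
3 "c"
3 "c"
3 "d"
4 "c"
4 "c"
4 "d"
4 "d"
end

* flag the first observation within each distinct (x, s) combination
bysort x s: generate byte first = (_n == 1)

* the number of flagged observations is the number of distinct pairs
count if first
display r(N)
```

With distinct installed, distinct x s, joint should agree with this joint count, while distinct x s without the option reports each variable separately.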

Saved results

distinct saves the following in r():

Scalars
    r(ndistinct)    distinct count (for last variable, or jointly considered group of variables, and, if specified, last by group)
    r(N)            number of observations (for last variable, or jointly considered group of variables, and, if specified, last by group)
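
A short sketch of how the saved results might be picked up afterward, assuming distinct is installed. The local macro names nd and n are arbitrary; only r(ndistinct) and r(N) come from this help file.

```stata
sysuse auto, clear

* count distinct nonmissing values of rep78
distinct rep78

* copy the saved results into locals before another r-class command replaces them
local nd = r(ndistinct)
local n  = r(N)

display "distinct values: `nd'"
display "observations:    `n'"
display "mean observations per distinct value: " `n' / `nd'
```

Because only the results for the last variable (or last by group) are saved, looping over variables one at a time is one way to collect a count for each.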

Examples

    . sysuse auto
    . distinct
    . distinct, max(10)
    . distinct make-headroom
    . distinct make-headroom, missing abbrev(6)
    . distinct foreign rep78, joint
    . distinct foreign rep78, joint missing

Authors

Gary Longton, Fred Hutchinson Cancer Research Center, USA
glongton@fhcrc.org

Nicholas J. Cox, Durham University, UK
n.j.cox@durham.ac.uk

Acknowledgment

This program grew out of one originally posted to Statalist by Patrick Royston for Stata 4.

Also see

Article: Stata Journal, volume 8, number 4: dm0042

Online: [D] codebook, [D] contract, [D] duplicates, [D] egen, [D] inspect, [D] isid, [P] levelsof, tabulate, groups (if installed)