Title

distinct -- Report number(s) of distinct observations or values

Syntax

distinct [varlist] [if] [in] [, missing abbrev(#) joint minimum(#) maximum(#)]

by is allowed; see [D] by.

Description

The distinct command displays the number of distinct observations with
respect to the variables in varlist.  By default, each variable is
considered separately so that the number of distinct observations for
each variable is reported; the number of distinct observations is the
same as the number of distinct values.  Optionally, variables can be
considered jointly so that the number of distinct groups defined by the
values of variables in varlist is reported.

By default, missing values are not counted.  varlist may contain both
numeric and string variables; if varlist is omitted, all variables are
used.

Options

missing specifies that missing values are to be included in counting
distinct observations.

abbrev(#) specifies that variable names are to be displayed abbreviated
to at most # characters.  This option has no effect with joint.

joint specifies that distinctness is to be determined jointly for the
variables in varlist.

minimum(#) specifies that numbers of distinct values are to be displayed
only if they are equal to or greater than a specified minimum.

maximum(#) specifies that numbers of distinct values are to be displayed
only if they are less than or equal to a specified maximum.
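
As a rough sketch of how the options might be used, the following
commands rely on the auto dataset shipped with Stata.  Specifying
minimum() and maximum() together is assumed here to apply both filters
at once.

. sysuse auto, clear
. * show only variables with between 5 and 20 distinct values
. distinct, minimum(5) maximum(20)
. * include missing values of rep78 when counting
. distinct rep78, missing
. * abbreviate displayed variable names to at most 8 characters
. distinct, abbrev(8)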

Remarks

Distinctness, duplication, and uniqueness are different aspects of the
similarity and difference of observations.  Suppose the values of some
variable are 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.  Then there are four distinct
values: 1, 2, 3, and 4.  Put another way, so far as this variable is
concerned, there are four distinct observations because, for example,
the second and third observations, both containing the value 2, are
identical with respect to this variable.  Of these values, 2, 3, and 4
are duplicated in the data, meaning that each occurs twice or more.  Some
people refer to the distinct values as unique values, even though in
general distinct values could all be repeated in the data.  One logic
behind that terminology is that if you remove all duplicates from these
data then you are left with four distinct values, each of which occurs
once.
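
The single-variable case above can be reproduced with a small made-up
dataset.  This is a minimal sketch; the variable name x is chosen purely
for illustration.

. clear
. set obs 10
. * build the values 1, 2, 2, 3, 3, 3, 4, 4, 4, 4
. generate x = 1 in 1
. replace x = 2 in 2/3
. replace x = 3 in 4/6
. replace x = 4 in 7/10
. distinct x
. * four distinct values among ten observations
. assert r(ndistinct) == 4
. assert r(N) == 10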

Now consider distinctness determined jointly for two variables.  Suppose
observations are 1 and "a", 2 and "b", 2 and "b", 3 and "c", 3 and "c", 3
and "d", 4 and "c", 4 and "c", 4 and "d", 4 and "d".  Then, as far as
these two variables are concerned, there are six distinct observations:
1 and "a", 2 and "b", 3 and "c", 3 and "d", 4 and "c", and 4 and "d".
Considering the variables individually, there are four distinct values
for the first variable and four for the second.  Clearly, the same
principles of considering variables individually and jointly extend to
three or more variables.
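
The two-variable case can be sketched in the same way.  The variable
names x and s are hypothetical; the values are those listed above.

. clear
. set obs 10
. * numeric variable: 1, 2, 2, 3, 3, 3, 4, 4, 4, 4
. generate x = 1 in 1
. replace x = 2 in 2/3
. replace x = 3 in 4/6
. replace x = 4 in 7/10
. * string variable: a, b, b, c, c, d, c, c, d, d
. generate str1 s = "a" in 1
. replace s = "b" in 2/3
. replace s = "c" if inlist(_n, 4, 5, 7, 8)
. replace s = "d" if inlist(_n, 6, 9, 10)
. * considered separately: four distinct values for each variable
. distinct x s
. * considered jointly: six distinct observations
. distinct x s, joint
. assert r(ndistinct) == 6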

Saved results

distinct saves the following in r():

Scalars
    r(ndistinct)  distinct count (for last variable, or jointly
                  considered group of variables, and, if specified,
                  last by group)
    r(N)          number of observations (for last variable, or jointly
                  considered group of variables, and, if specified,
                  last by group)
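
As a brief illustration of how these results might be picked up after a
call, the following sketch uses the auto dataset shipped with Stata.
The local macro name k is hypothetical.

. sysuse auto, clear
. distinct foreign rep78
. * variables were considered separately, so r() refers to rep78
. display r(ndistinct)
. display r(N)
. distinct foreign rep78, joint
. * r() now refers to the groups defined jointly by foreign and rep78
. local k = r(ndistinct)
. display "`k' distinct groups"
. bysort foreign: distinct rep78
. * r() now refers to the last by group
. display r(ndistinct)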

Examples

. sysuse auto
. distinct
. distinct, max(10)
. distinct foreign rep78, joint
. distinct foreign rep78, joint missing

Authors

Gary Longton, Fred Hutchinson Cancer Research Center, USA
glongton@fhcrc.org

Nicholas J. Cox, Durham University, UK
n.j.cox@durham.ac.uk

Acknowledgment

This program grew out of one originally posted to Statalist by Patrick
Royston for Stata 4.

Also see

Article: Stata Journal, volume 8, number 4: dm0042

Online:  [D] codebook, [D] contract, [D] duplicates, [D] egen, [D]
inspect, [D] isid, [P] levelsof, tabulate, groups (if installed)
```