```help digdis
-------------------------------------------------------------------------------

Title

digdis -- Analysis of digit distributions

Syntax

digdis varlist [if] [in] [weight] [, options ]

digdis varname [if] [in] [weight] [, by(groupvar) options ]

options                Description
-------------------------------------------------------------------------
Main
position(#)          digit position (1st is default); # in [1,6]
base(#)              base of number system (10 is default); # in [2,10]
decimalplaces(#)     precision of input values (number of decimal
                     places)
benford              reference is Benford's law (the default)
uniform              reference distribution is uniform
matrix(name)         user defined reference distribution
test(mgof_opts)      options for goodness-of-fit test
notest               suppress goodness-of-fit test
nofreq               suppress frequency table
generate(newvarlist) save variable(s) containing digits
replace              overwrite existing variables

Graph
graph                display graph
percent              scale is in percent (default)
fraction             scale is in proportions
count                scale is in counts
bar_options          affect rendition of observed distribution
ci[(type)]           include confidence intervals (capped spikes)
level(#)             set confidence level; default is level(95)
ciopts(rcap_opts)    affect rendition of confidence spikes
refopts(options)     affect rendition of reference distribution
noref                suppress reference distribution
addplot(plot)        add other plots to the generated graph
twoway_options       any options other than by() documented in help
                     twoway_options

By
by(groupvar)         repeat results for subgroups
byopts(by_subopts)   graph suboptions for by()
-------------------------------------------------------------------------

by is allowed; see help by.
fweights are allowed; see help weight.

Description

digdis tabulates the distribution of digits of the variables in varlist,
performs goodness-of-fit tests against a reference distribution and,
optionally, graphs the distributions. The default is to tabulate the
first (nonzero) digit; specify, e.g., position(2) to tabulate the second
digit. The default reference distribution is given by Benford's law.

The variables in varlist may be numeric or string (see help data types).
It is sensible in some situations to use string variables to store the
numbers to be analyzed, since this ensures that the numbers remain
exactly as is. Note that, if the storage type is float or double, the
numbers will be right-padded with zeros (i.e. 1.3 is interpreted as
1.3000...) unless the decimalplaces() option is specified. Using the
float storage type is strongly discouraged because of its limited
precision. For example, the number 1.30 is 1.29999995... in float
accuracy and digdis will, e.g., read a 2 for the second digit (unless the
decimalplaces() option is used to round the number).

digdis sometimes displays notes such as "x: 7 invalid observations". An
observation is considered invalid if (1) its value is 0, (2) position(#)>1
is specified and the value does not have a #-th digit, or (3) the input
variable is string and contains a nonnumeric value.

Dependencies

digdis requires moremata and mgof. For package descriptions and
installation links, type

. ssc describe moremata

. ssc describe mgof

Options

    +------+
----+ Main +-------------------------------------------------------------

position(#), where # in [1,6], specifies the position of the digits to be
tabulated. position(1) is the default. Examples: The first digits of
236, 4.015, and 0.00789 are 2, 4, and 7; the second digits are 3, 0,
and 8; the third digits are 6, 1, and 9.

base(#), where # in [2,10], specifies the base of the number system.
base(10) is the default. This option is rarely used.

decimalplaces(#) specifies the number of decimal places of the input
values. This option has an effect only if the storage type of the
variable is float or double (see help data types). decimalplaces(#)
rounds the values to # decimal places.

benford specifies that the expected distribution be computed according to
Benford's law (see help for mm_benford()). This is the default.

uniform specifies that the expected distribution is uniform.

matrix(name) provides the name of a matrix containing the expected
distribution. The matrix should be a column vector containing the
proportions or the expected counts of the digits in ascending order
(i.e. frequency of 1's in the first row, frequency of 2's in the
second row, etc., or, if position()>1 is specified, frequency of
0's in the first row, frequency of 1's in the second row, etc.).

test(mgof_opts) specifies options to be passed through to the mgof
command, which is used to perform goodness-of-fit tests (see help
mgof). For example, type test(mc) to perform exact tests using the
Monte Carlo method instead of asymptotic tests.

notest suppresses the goodness-of-fit tests.

nofreq suppresses the frequency table(s).

generate(newvarlist) causes variables to be generated containing the
extracted digits. Specify one newvar for each input variable.

replace allows the generate() option to overwrite existing variables.

    +-------+
----+ Graph +------------------------------------------------------------

graph displays the observed digit distribution in a graph as a bar plot
(see help twoway bar). The reference distribution is overlaid as a
connected-line plot (see help twoway connected). Type noref to omit
the reference distribution.

percent displays percentages. This is the default.

fraction displays proportions.

count displays counts.

bar_options affect the rendition of the plotted distribution. See help
twoway bar.

ci[(type)] specifies that pointwise confidence intervals of the observed
distribution be plotted as capped spikes.  type sets the calculation
method and may be exact (the default), wald, wilson, agresti, or
jeffreys (see help ci). Alternatively, type may be reference, in which
case pointwise probability intervals are plotted around the
reference distribution as a connected-line range plot (the intervals
are the shortest intervals with a probability mass of at least the
value of the confidence level).

level(#) specifies the confidence level, as a percentage, for the plotted
confidence intervals. The default is level(95) or as set by set
level.

ciopts(options) affects the rendition of the confidence spikes. See help
twoway rcap or, if ci(reference) is used, help twoway rconnected.

noref suppresses plotting the reference distribution.

refopts(connected_options) affects the rendition of the plotted reference
distribution. See help twoway connected.

addplot(plot) provides a way to add other plots to the generated graph.

twoway_options are any of the options documented in twoway_options,
excluding by().

    +----+
----+ By +---------------------------------------------------------------

by(groupvar) repeats the analysis for the groups defined by groupvar. The
individual goodness-of-fit tests are included in a single table and
the individual plots are drawn within a single graph. Note that
digdis also allows the by prefix command, which arranges output
differently. Another difference between the by() option and the by
prefix is that by() returns in r() the results for all groups, whereas
the by prefix only returns the results for the last group.

byopts(by_subopts) affects the arrangement of the individual plots in the
graph. See the suboptions in help by_option. Do not use the total
suboption.

Examples

. sysuse auto
(1978 Automobile Data)

. digdis price

Digit distribution (1st digit)

       Value |     Count    Percent    Percent      Diff.    P-value
-------------+------------------------------------------------------
           1 |        10     13.514     30.103    -16.589     0.0014
           2 |         0      0.000     17.609    -17.609     0.0000
           3 |        11     14.865     12.494      2.371     0.4844
           4 |        26     35.135      9.691     25.444     0.0000
           5 |        14     18.919      7.918     11.001     0.0018
           6 |         7      9.459      6.695      2.765     0.3456
           7 |         2      2.703      5.799     -3.096     0.4482
           8 |         2      2.703      5.115     -2.413     0.5918
           9 |         2      2.703      4.576     -1.873     0.7767
-------------+------------------------------------------------------
       Total |        74    100.000    100.000      9.240

Goodness-of-fit tests        method =   approx
                       observations =       74
                         categories =        9
                                 df =        8

----------------------------------------------
       Test statistic |       Coef.    P-value
----------------------+-----------------------
         Pearson's X2 |    84.35215     0.0000
 Log likelihood ratio |    76.29649     0.0000
----------------------------------------------

. digdis price, position(2)

Digit distribution (2nd digit)

       Value |     Count    Percent    Percent      Diff.    P-value
-------------+------------------------------------------------------
           0 |         7      9.459     11.968     -2.508     0.5948
           1 |        13     17.568     11.389      6.179     0.0991
           2 |         7      9.459     10.882     -1.423     0.8522
           3 |         7      9.459     10.433     -0.973     1.0000
           4 |         7      9.459     10.031     -0.571     1.0000
           5 |         4      5.405      9.668     -4.262     0.3212
           6 |         4      5.405      9.337     -3.932     0.3182
           7 |        12     16.216      9.035      7.181     0.0406
           8 |         9     12.162      8.757      3.405     0.3000
           9 |         4      5.405      8.500     -3.094     0.5279
-------------+------------------------------------------------------
       Total |        74    100.000    100.000      3.353

Goodness-of-fit tests        method =   approx
                       observations =       74
                         categories =       10
                                 df =        9

----------------------------------------------
       Test statistic |       Coef.    P-value
----------------------+-----------------------
         Pearson's X2 |    11.75117     0.2277
 Log likelihood ratio |    11.12604     0.2672
----------------------------------------------

. set seed 3217367

. digdis price, position(2) test(mc) nofreq

Goodness-of-fit tests                                method =       mc
                                               observations =       74
                                                 categories =       10
                                               replications =    10000

----------------------------------------------------------------------
       Test statistic |       Coef.    P-value    [99% Conf. Interval]
----------------------+-----------------------------------------------
         Pearson's X2 |    11.75117     0.2234      0.2128      0.2343
 Log likelihood ratio |    11.12604     0.2894      0.2778      0.3012
----------------------------------------------------------------------

. digdis price, position(2) graph nofreq notest ti("Second digit distribution")

. digdis price, position(2) by(foreign) graph byopts(ti("Second digit distribution"))

-----------------------------------------------------------------------
-> foreign = 0

Digit distribution (2nd digit)

       Value |     Count    Percent    Percent      Diff.    P-value
-------------+------------------------------------------------------
           0 |         6     11.538     11.968     -0.429     1.0000
           1 |        10     19.231     11.389      7.842     0.0807
           2 |         3      5.769     10.882     -5.113     0.3687
           3 |         6     11.538     10.433      1.106     0.8189
           4 |         6     11.538     10.031      1.508     0.6448
           5 |         3      5.769      9.668     -3.898     0.4812
           6 |         2      3.846      9.337     -5.491     0.2331
           7 |         7     13.462      9.035      4.426     0.2313
           8 |         6     11.538      8.757      2.781     0.4576
           9 |         3      5.769      8.500     -2.731     0.6239
-------------+------------------------------------------------------
       Total |        52    100.000    100.000      3.533

-----------------------------------------------------------------------
-> foreign = 1

Digit distribution (2nd digit)

       Value |     Count    Percent    Percent      Diff.    P-value
-------------+------------------------------------------------------
           0 |         1      4.545     11.968     -7.422     0.5072
           1 |         3     13.636     11.389      2.247     0.7331
           2 |         4     18.182     10.882      7.300     0.2916
           3 |         1      4.545     10.433     -5.888     0.7224
           4 |         1      4.545     10.031     -5.485     0.7194
           5 |         1      4.545      9.668     -5.122     0.7174
           6 |         2      9.091      9.337     -0.247     1.0000
           7 |         5     22.727      9.035     13.692     0.0431
           8 |         3     13.636      8.757      4.879     0.4355
           9 |         1      4.545      8.500     -3.954     1.0000
-------------+------------------------------------------------------
       Total |        22    100.000    100.000      5.624

-----------------------------------------------------------------------

Goodness-of-fit tests

     foreign |      Obs.         X2    P-value         LR    P-value
-------------+------------------------------------------------------
           0 |        52   8.783497     0.4575   9.041643     0.4334
           1 |        22   9.744576     0.3716    9.01948     0.4355

. digdis displ, graph ci nofreq notest

. digdis displ, graph ci(ref) nofreq notest

Returned results

digdis saves the following in r():

Scalars
r(N)         number of observations
r(position)  digit position
r(base)      base of number system
r(mad)       mean absolute deviation (in percentage points) between
             observed and expected distribution
r(level)     confidence level as a percentage (if ci is specified)
r(stat)      value of test statistic
r(p_stat)    p-value of r(stat)

where stat may be x2, lr, cr, mlnp, or ksmirnov, depending
on test()

Macros
r(cmd)       "digdis"
r(refdist)   type of reference distribution ("Benford", "uniform", or
"user")
r(citype)    confidence interval type (if ci is specified)
r(byvar)     name of variable specified in by()

Matrices
r(count)     observed and expected counts
r(pvals)     p-values of individual differences
r(ci)        pointwise confidence intervals (if ci is specified)

r(N), r(mad), r(stat), and r(p_stat) are matrices if digdis is used with
more than one variable or if by() is specified.

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Also see

Online:  help for tabulate, graph, ci, mgof, mm_mgof(), mm_benford()
```
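
The Benford reference percentages and the summary figures in the Examples section above can be checked outside Stata. The following is a minimal Python sketch, assuming the standard form of Benford's law: probability log_b(1 + 1/d) for first digit d, and for later positions the sum of the same expression over all possible blocks of leading digits. The names `benford_probs` and `summarize` are hypothetical helpers for illustration; they are not part of digdis, mgof, or moremata, and this is not a description of how those packages compute their results. The counts are copied from the first Examples table.

```python
import math


def benford_probs(position=1, base=10):
    """Expected digit probabilities under Benford's law (assumed standard form).

    Returns a dict mapping each digit to its probability. For position 1
    the digits run from 1 to base-1; for later positions from 0 to base-1.
    """
    if position == 1:
        return {d: math.log(1 + 1 / d, base) for d in range(1, base)}
    # For later positions, sum over every possible block k of leading digits.
    lo, hi = base ** (position - 2), base ** (position - 1)
    return {
        d: sum(math.log(1 + 1 / (base * k + d), base) for k in range(lo, hi))
        for d in range(base)
    }


def summarize(counts, probs):
    """Return (mean absolute deviation in percentage points, Pearson X2)."""
    n = sum(counts.values())
    mad = sum(abs(100 * counts[d] / n - 100 * probs[d]) for d in probs) / len(probs)
    x2 = sum((counts[d] - n * probs[d]) ** 2 / (n * probs[d]) for d in probs)
    return mad, x2


# First-digit counts of price in auto.dta, copied from the Examples table.
price_first = {1: 10, 2: 0, 3: 11, 4: 26, 5: 14, 6: 7, 7: 2, 8: 2, 9: 2}
mad, x2 = summarize(price_first, benford_probs(1))
print(f"MAD = {mad:.3f}, X2 = {x2:.3f}")
```

If the sketch is right, the printed values should agree, up to rounding, with the Total of the Diff. column and with Pearson's X2 in the first example, and `benford_probs(2)` should match the expected percentages in the position(2) table.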