help digdis
-------------------------------------------------------------------------------

Title

digdis -- Analysis of digit distributions

Syntax

digdis varlist [if] [in] [weight] [, options ]

digdis varname [if] [in] [weight] [, by(groupvar) options ]

    options                 Description
    -------------------------------------------------------------------------
    Main
      position(#)           digit position (1st is default); # in [1,6]
      base(#)               base of number system (10 is default); # in [2,10]
      decimalplaces(#)      precision of input values (number of decimal places)
      benford               reference is Benford's law (the default)
      uniform               reference distribution is uniform
      matrix(name)          user defined reference distribution
      test(mgof_opts)       options for goodness-of-fit test
      notest                suppress goodness-of-fit test
      nofreq                suppress frequency table
      generate(newvarlist)  save variable(s) containing digits
      replace               overwrite existing variables

    Graph
      graph                 display graph
      percent               scale is in percent (default)
      fraction              scale is in proportions
      count                 scale is in counts
      bar_options           affect rendition of observed distribution
      ci[(type)]            include confidence intervals (capped spikes)
      level(#)              set confidence level; default is level(95)
      ciopts(rcap_opts)     affect rendition of confidence spikes
      refopts(options)      affect rendition of reference distribution
      noref                 suppress reference distribution
      addplot(plot)         add other plots to the generated graph
      twoway_options        any options other than by() documented in help
                              twoway_options

    By
      by(groupvar)          repeat results for subgroups
      byopts(by_subopts)    graph suboptions for by()
    -------------------------------------------------------------------------

by is allowed; see help by. fweights are allowed; see help weight.

Description

digdis tabulates the distribution of digits of the variables in varlist, performs goodness-of-fit tests against a reference distribution and, optionally, graphs the distributions. The default is to tabulate the first (nonzero) digit; specify, e.g., position(2) to tabulate the second digit. The default reference distribution is given by Benford's law.

The variables in varlist may be numeric or string (see help data types). In some situations it is sensible to store the numbers to be analyzed in string variables, since this ensures that the numbers remain exactly as entered. Note that, if the storage type is float or double, the numbers will be right-padded with zeros (i.e. 1.3 is interpreted as 1.3000...) unless the decimalplaces() option is specified. Using the float storage type is strongly discouraged because of its limited precision. For example, the number 1.30 is 1.29999995... in float accuracy and digdis will, e.g., read a 2 for the second digit (unless the decimalplaces() option is used to round the number).
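The float problem described above can be checked directly in Stata. The following is a minimal sketch; the variable name amount and its values are hypothetical, and the digdis calls use only options documented in this help file.

```stata
* float() rounds 1.3 to float precision; the stored value lies slightly below 1.3
display %12.10f float(1.3)

* Hypothetical amounts stored as double
clear
input double amount
1.30
2.75
41.2
end

* Without decimalplaces(), values are treated as right-padded with zeros
digdis amount, position(2) generate(d2_raw) nofreq notest

* decimalplaces(2) rounds the values to two decimal places first
digdis amount, position(2) decimalplaces(2) generate(d2_round) nofreq notest

list
```

Comparing d2_raw and d2_round shows whether rounding changes the extracted digits for a given variable.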

digdis sometimes displays notes such as "x: 7 invalid observations". An observation is considered invalid if (1) its value is 0, (2) position(#)>1 is specified and the value does not have a #-th digit, or (3) the input variable is string and contains a nonnumeric value.

Dependencies

digdis requires moremata and mgof. Type

. ssc describe moremata

. ssc describe mgof
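If the packages are not yet installed, they can be obtained from the SSC archive. A minimal sketch, assuming an internet connection and a standard Stata setup:

```stata
* Install the two required packages from SSC
ssc install moremata
ssc install mgof

* Check that Stata can locate them
which mgof
mata: mata which mm_benford()
```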

Options

        +------+
    ----+ Main +-------------------------------------------------------------

position(#), where # in [1,6], specifies the position of the digits to be tabulated. position(1) is the default. Examples: The first digits of 236, 4.015, and 0.00789 are 2, 4, and 7; the second digits are 3, 0, and 8; the third digits are 6, 1, and 9.

base(#), where # in [2,10], specifies the base of the number system. base(10) is the default. This option is rarely used.

decimalplaces(#) specifies the number of decimal places of the input values. This option has an effect only if the storage type of the variable is float or double (see help data types). decimalplaces(#) rounds the values to # decimal places.

benford specifies that the expected distribution be computed according to Benford's law (see help for mm_benford()). This is the default.
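For first digits in base 10, Benford's law assigns digit d the probability log10(1 + 1/d). The loop below is a small sketch that prints these probabilities as percentages; it does not call digdis or mm_benford(), and the values should agree with the Expected column of the first-digit table in the Examples section.

```stata
* Expected first-digit percentages under Benford's law (base 10)
forvalues d = 1/9 {
    display `d' "   " %7.3f 100*log10(1 + 1/`d')
}
```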

uniform specifies that the expected distribution is uniform.

matrix(name) provides the name of a matrix containing the expected distribution. The matrix should be a column vector containing the proportions or the expected counts of the digits in ascending order (i.e. frequency of 1's in the first row, frequency of 2's in the second row, etc., or, if position()>1 is specified, frequency of 0's in the first row, frequency of 1's in the second row, etc.).
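A user-defined reference distribution might be set up as follows. This is a sketch only; the matrix name ref and the proportions in it are hypothetical and chosen so that they sum to 1.

```stata
sysuse auto, clear

* Hypothetical reference proportions for first digits 1 to 9 (9 x 1 column vector)
matrix ref = (.20 \ .15 \ .15 \ .10 \ .10 \ .10 \ .10 \ .05 \ .05)

* Test the first digits of price against the user-defined distribution
digdis price, matrix(ref)
```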

test(mgof_opts) specifies options to be passed through to the mgof command, which is used to perform goodness-of-fit tests (see help mgof). For example, type test(mc) to perform exact tests using the Monte Carlo method instead of asymptotic tests.

notest suppresses the goodness-of-fit tests.

nofreq suppresses the frequency table(s).

generate(newvarlist) causes variables to be generated containing the extracted digits. Specify one newvar for each input variable.

replace allows the generate() option to overwrite existing variables.
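The sketch below shows how position(), generate(), and replace might be combined. It uses a hypothetical string variable x holding the three numbers from the position() examples; the new variable names d1 and d2 are likewise hypothetical.

```stata
* Hypothetical string data: the numbers from the position() examples
clear
input str10 x
"236"
"4.015"
"0.00789"
end

* Save first and second digits; one new variable per input variable
digdis x, position(1) generate(d1) nofreq notest
digdis x, position(2) generate(d2) nofreq notest
list x d1 d2

* replace lets generate() overwrite d2, here with the third digits
digdis x, position(3) generate(d2) replace nofreq notest
list x d1 d2
```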

        +-------+
    ----+ Graph +------------------------------------------------------------

graph displays the observed digit distribution in a graph as a bar plot (see help twoway bar). The reference distribution is overlaid as a connected-line plot (see help twoway connected). Type noref to omit the reference distribution.

percent displays percentages. This is the default.

fraction displays proportions.

count displays counts.

bar_options affect the rendition of the plotted distribution. See help twoway bar.

ci[(type)] specifies that pointwise confidence intervals of the observed distribution be plotted as capped spikes. type sets the calculation method and may be exact (the default), wald, wilson, agresti, or jeffreys (see help ci). Alternatively, type may be reference, in which case pointwise probability intervals are plotted around the reference distribution as a connected-line range plot (the intervals are the shortest intervals with a probability mass of at least the value of the confidence level).

level(#) specifies the confidence level, as a percentage, for the plotted confidence intervals. The default is level(95) or as set by set level.

ciopts(options) affects the rendition of the confidence spikes. See help twoway rcap or, if ci(reference) is used, help twoway rconnected.

noref suppresses plotting the reference distribution.

refopts(connected_options) affects the rendition of the plotted reference distribution. See help twoway connected.

addplot(plot) provides a way to add other plots to the generated graph. See help addplot_option.

twoway_options are any of the options documented in help twoway_options, excluding by().
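Several of the graph options can be combined in one call. The following is a sketch for a do-file (the /// continuation is not available when typing interactively); the particular styling choices are arbitrary.

```stata
sysuse auto, clear

* Second-digit plot in proportions with 90% Wilson intervals,
* gray confidence spikes, and a dashed reference line
digdis price, position(2) nofreq notest graph fraction ///
    ci(wilson) level(90) ciopts(lcolor(gs6))           ///
    refopts(lpattern(dash) msymbol(Oh))                ///
    title("Second digit of price")
```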

        +----+
    ----+ By +---------------------------------------------------------------

by(groupvar) repeats the analysis for the groups defined by groupvar. The individual goodness-of-fit tests are included in a single table and the individual plots are drawn within a single graph. Note that digdis also allows the by prefix command, which arranges output differently. The two also differ in what they return: by() returns the results for all groups in r(), whereas the by prefix returns only the results for the last group.

byopts(by_subopts) affects the arrangement of the individual plots in the graph. See the suboptions in help by_option. Do not use the total suboption.
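The difference between the by() option and the by prefix can be inspected with return list. A minimal sketch using the auto data:

```stata
sysuse auto, clear

* by() option: one combined test table; r() covers all groups
digdis price, position(2) by(foreign) nofreq
return list

* by prefix: separate output per group; r() holds the last group only
bysort foreign: digdis price, position(2) nofreq
return list
```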

Examples

    . sysuse auto
    (1978 Automobile Data)

    . digdis price

    Digit distribution (1st digit)

           Value |    Count    Percent    Percent      Diff.    P-value
                 |            Observed   Expected      (MAD)
    -------------+------------------------------------------------------
               1 |       10     13.514     30.103    -16.589     0.0014
               2 |        0      0.000     17.609    -17.609     0.0000
               3 |       11     14.865     12.494      2.371     0.4844
               4 |       26     35.135      9.691     25.444     0.0000
               5 |       14     18.919      7.918     11.001     0.0018
               6 |        7      9.459      6.695      2.765     0.3456
               7 |        2      2.703      5.799     -3.096     0.4482
               8 |        2      2.703      5.115     -2.413     0.5918
               9 |        2      2.703      4.576     -1.873     0.7767
    -------------+------------------------------------------------------
           Total |       74    100.000    100.000      9.240

    Goodness-of-fit tests

                        method = approx
                  observations = 74
                    categories = 9
                            df = 8

    ----------------------------------------------
           Test statistic |      Coef.     P-value
    ----------------------+-----------------------
             Pearson's X2 |   84.35215      0.0000
     Log likelihood ratio |   76.29649      0.0000
    ----------------------------------------------

    . digdis price, position(2)

    Digit distribution (2nd digit)

           Value |    Count    Percent    Percent      Diff.    P-value
                 |            Observed   Expected      (MAD)
    -------------+------------------------------------------------------
               0 |        7      9.459     11.968     -2.508     0.5948
               1 |       13     17.568     11.389      6.179     0.0991
               2 |        7      9.459     10.882     -1.423     0.8522
               3 |        7      9.459     10.433     -0.973     1.0000
               4 |        7      9.459     10.031     -0.571     1.0000
               5 |        4      5.405      9.668     -4.262     0.3212
               6 |        4      5.405      9.337     -3.932     0.3182
               7 |       12     16.216      9.035      7.181     0.0406
               8 |        9     12.162      8.757      3.405     0.3000
               9 |        4      5.405      8.500     -3.094     0.5279
    -------------+------------------------------------------------------
           Total |       74    100.000    100.000      3.353

    Goodness-of-fit tests

                        method = approx
                  observations = 74
                    categories = 10
                            df = 9

    ----------------------------------------------
           Test statistic |      Coef.     P-value
    ----------------------+-----------------------
             Pearson's X2 |   11.75117      0.2277
     Log likelihood ratio |   11.12604      0.2672
    ----------------------------------------------

    . set seed 3217367

    . digdis price, position(2) test(mc) nofreq

    Goodness-of-fit tests

                        method = mc
                  observations = 74
                    categories = 10
                  replications = 10000

    ----------------------------------------------------------------------
           Test statistic |      Coef.     P-value    [99% Conf. Interval]
    ----------------------+-----------------------------------------------
             Pearson's X2 |   11.75117      0.2234      0.2128      0.2343
     Log likelihood ratio |   11.12604      0.2894      0.2778      0.3012
    ----------------------------------------------------------------------

    . digdis price, position(2) graph nofreq notest ti("Second digit distribution")

    . digdis price, position(2) by(foreign) graph byopts(ti("Second digit distribution"))

    --------------------------------------------------------------------
    -> foreign = 0

    Digit distribution (2nd digit)

           Value |    Count    Percent    Percent      Diff.    P-value
                 |            Observed   Expected      (MAD)
    -------------+------------------------------------------------------
               0 |        6     11.538     11.968     -0.429     1.0000
               1 |       10     19.231     11.389      7.842     0.0807
               2 |        3      5.769     10.882     -5.113     0.3687
               3 |        6     11.538     10.433      1.106     0.8189
               4 |        6     11.538     10.031      1.508     0.6448
               5 |        3      5.769      9.668     -3.898     0.4812
               6 |        2      3.846      9.337     -5.491     0.2331
               7 |        7     13.462      9.035      4.426     0.2313
               8 |        6     11.538      8.757      2.781     0.4576
               9 |        3      5.769      8.500     -2.731     0.6239
    -------------+------------------------------------------------------
           Total |       52    100.000    100.000      3.533

    --------------------------------------------------------------------
    -> foreign = 1

    Digit distribution (2nd digit)

           Value |    Count    Percent    Percent      Diff.    P-value
                 |            Observed   Expected      (MAD)
    -------------+------------------------------------------------------
               0 |        1      4.545     11.968     -7.422     0.5072
               1 |        3     13.636     11.389      2.247     0.7331
               2 |        4     18.182     10.882      7.300     0.2916
               3 |        1      4.545     10.433     -5.888     0.7224
               4 |        1      4.545     10.031     -5.485     0.7194
               5 |        1      4.545      9.668     -5.122     0.7174
               6 |        2      9.091      9.337     -0.247     1.0000
               7 |        5     22.727      9.035     13.692     0.0431
               8 |        3     13.636      8.757      4.879     0.4355
               9 |        1      4.545      8.500     -3.954     1.0000
    -------------+------------------------------------------------------
           Total |       22    100.000    100.000      5.624
    --------------------------------------------------------------------

    Goodness-of-fit tests

         foreign |     Obs.         X2    P-value         LR    P-value
    -------------+------------------------------------------------------
               0 |       52   8.783497     0.4575   9.041643     0.4334
               1 |       22   9.744576     0.3716    9.01948     0.4355

    . digdis displ, graph ci nofreq notest

    . digdis displ, graph ci(ref) nofreq notest

Returned results

digdis saves the following in r():

    Scalars
      r(N)          number of observations
      r(position)   digit position
      r(base)       base of number system
      r(mad)        mean absolute deviation, in percentage points, between
                      observed and expected distribution
      r(level)      confidence level as a percentage (if ci is specified)
      r(stat)       value of test statistic
      r(p_stat)     p-value of r(stat)

    where stat may be x2, lr, cr, mlnp, or ksmirnov, depending on test()

    Macros
      r(cmd)        "digdis"
      r(refdist)    type of reference distribution ("Benford", "uniform",
                      or "user")
      r(citype)     confidence interval type (if ci is specified)
      r(byvar)      name of variable specified in by()

    Matrices
      r(count)      observed and expected counts
      r(pvals)      p-values of individual differences
      r(ci)         pointwise confidence intervals (if ci is specified)

r(N), r(mad), r(stat), and r(p_stat) are matrices if digdis is used with more than one variable or if by() is specified.
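The returned results can be picked up right after the command. The sketch below assumes the default asymptotic tests, so that stat is x2 and lr and the corresponding scalars are named r(x2), r(p_x2), r(lr), and r(p_lr); use return list to confirm the names that are actually present.

```stata
sysuse auto, clear
quietly digdis price

* Overview of everything left behind in r()
return list

* Selected scalars (single variable, no by(), so these are scalars)
display "N   = " r(N)
display "MAD = " %6.3f r(mad)
display "X2  = " %9.5f r(x2) "   p = " %6.4f r(p_x2)

* Observed and expected counts, and p-values of the individual differences
matrix list r(count)
matrix list r(pvals)
```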

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Also see

Online: help for tabulate, graph, ci, mgof, mm_mgof(), mm_benford()