{smcl} {* 18jan2021}{...} {hi:help digdis}{...} {right:{browse "http://github.com/benjann/digdis/"}} {hline} {title:Title} {pstd}{hi:digdis} {hline 2} Analysis of digit distributions {title:Syntax} {p 8 15 2} {cmd:digdis} {varlist} {ifin} {weight} [{cmd:,} {help digdis##opt:{it:options}} ] {p 8 15 2} {cmd:digdis} {varname} {ifin} {weight} [{cmd:,} {cmd:by(}{it:groupvar}{cmd:)} {help digdis##opt:{it:options}} ] {synoptset 21 tabbed}{...} {marker opt}{synopthdr:options} {synoptline} {syntab :Main} {synopt :{opt p:osition(#)}}digit position (1st is default); {it:#} in [1,6] {p_end} {synopt :{opt b:ase(#)}}base of number system (10 is default); {it:#} in [2,10] {p_end} {synopt :{opt d:ecimalplaces(#)}}precision of input values (number of decimal places) {p_end} {synopt :{opt ben:ford}}reference is Benford's law (the default) {p_end} {synopt :{opt uni:form}}reference distribution is uniform {p_end} {synopt :{opt mat:rix(name)}}user defined reference distribution {p_end} {synopt :{cmd:test(}{it:{help mgof:mgof_opts}}{cmd:)}}options for goodness-of-fit test {p_end} {synopt :{opt notest}}suppress goodness-of-fit test {p_end} {synopt :{opt nofreq}}suppress frequency table {p_end} {synopt :{opth g:enerate(newvarlist)}}save variable(s) containing digits {p_end} {synopt :{opt r:eplace}}overwrite existing variables {p_end} {syntab :Graph} {synopt :{opt gr:aph}}display graph {p_end} {synopt :{opt per:cent}}scale is in percent (default) {p_end} {synopt :{opt frac:tion}}scale is in proportions {p_end} {synopt :{opt count}}scale is in counts {p_end} {synopt :{it:{help twoway_bar:bar_options}}}affect rendition of observed distribution {p_end} {synopt :{cmd:ci}[{cmd:(}{it:{help digdis##ci:type}}{cmd:)}]}include confidence intervals (capped spikes) {p_end} {synopt :{opt l:evel(#)}}set confidence level; default is {cmd:level(95)} {p_end} {synopt :{cmdab:ciopt:s(}{it:{help twoway_rcap:rcap_opts}}{cmd:)}}affect rendition of confidence spikes {p_end} {synopt :{cmdab:refopt:s(}{it:{help scatter:options}}{cmd:)}}affect rendition of reference distribution {p_end} {synopt :{opt noref}}suppress reference distribution {p_end} {synopt :{opt addplot(plot)}}add other plots to the generated graph {p_end} {synopt :{it:{help twoway_options}}}any options other than {cmd:by()} documented in help {it:{help twoway_options}} {p_end} {syntab :By} {synopt :{opt by(groupvar)}}repeat results for subgroups {p_end} {synopt :{cmdab:byopt:s(}{it:{help by_option:by_subopts}}{cmd:)}}graph {it:suboptions} for {cmd:by()} {p_end} {synoptline} {pstd} {cmd:by} is allowed; see help {helpb by}. {p_end} {pstd} {cmd:fweight}s are allowed; see help {help weight}. {title:Description} {pstd} {cmd:digdis} tabulates the distribution of digits of the variables in {varlist}, performs goodness-of-fit tests against a reference distribution and, optionally, graphs the distributions. The default is to tabulate the first (nonzero) digit; specify, e.g., {cmd:position(2)} to tabulate the second digit. The default reference distribution is given by Benford's law. {pstd} The variables in {varlist} may be numeric or string (see help {help data types}). It is sensible in some situations to use string variables to store the numbers to be analyzed, since this ensures that the numbers remain exactly as is. Note that, if the storage type is {cmd:float} or {cmd:double} the numbers will be right-padded with zeros (i.e. 1.3 is interpreted as 1.3000...) unless the {cmd:decimalplaces()} option is specified. Using the {cmd:float} storage type is strongly discouraged because of it's limited precision. For example, the number 1.30 is 1.29999995... in {cmd:float} accuracy and {cmd:digdis} will, e.g., read a 2 for the second digit (unless the {cmd:decimalplaces()} option is used to round the number). {pstd} {cmd:digdis} sometimes displays notes such as "{err:x: 7 invalid observations}". An observation is considered invalid if (1) its value is 0, (2) {cmd:position(#)}>1 is specified and the value does not have a {it:#}-th digit, or (3) the input variable is string and contains a nonnumeric value. {title:Dependencies} {pstd} {cmd:digdis} requires {cmd:moremata} and {cmd:mgof}. Type {com}. {net "describe moremata, from(http://fmwww.bc.edu/repec/bocode/m/)":ssc describe moremata}{txt} {com}. {net "describe mgof, from(http://fmwww.bc.edu/repec/bocode/m/)":ssc describe mgof}{txt} {title:Options} {dlgtab:Main} {phang} {opt position(#)}, where {it:#} in [1,6], specifies the position of the digits to be tabulated. {cmd:position(1)} is the default. Examples: The fist digits of 236, 4.015, and 0.00789 are 1, 4, and 7; the second digits are 3, 0, and 8; the third digits are 6, 1, and 9. {phang} {opt base(#)}, where {it:#} in [2,10], specifies the base of the number system. {cmd:base(10)} is the default. This option is rarely used. {phang} {opt decimalplaces(#)} specifies the number of decimal places of the input values. This option has an effect only if the storage type of the variable is {cmd:float} or {cmd:double} (see help {help data types}). {cmd:decimalplaces(}{it:#}{cmd:)} rounds the values to {it:#} decimal places. {phang} {opt benford} specifies that the expected distribution be computed according to Benford's law (see help for {helpb mf_mm_benford:mm_benford()}). This is the default. {phang} {opt uniform} specifies that the expected distribution is uniform. {phang} {opt matrix(name)} provides the name of a matrix containing the expected distribution. The matrix should be a column vector containing the proportions or the expected counts of the digits in ascending order (i.e. frequency of 1's in the first column, frequency of 2's in the second column, etc., or, if {cmd:position()}>1 is specified, frequency of 0's in the first column, frequency of 1's in the second column, etc.) {phang} {cmd:test(}{it:{help mgof:mgof_opts}}{cmd:)} specifies options to be passed through to the {cmd:mgof} command, which is used to perform goodness-of-fit tests (see help {helpb mgof}). For example, type {cmd:test(mc)} to perform exact tests using the Monte Carlo method instead of asymptotic tests. {phang} {opt notest} suppresses the goodness-of-fit tests. {phang} {opt nofreq} suppresses the frequency table(s). {phang} {opth generate(newvarlist)} causes variables to be generated containing the extracted digits. Specify one {newvar} for each input variable. {phang} {opt replace} allows the {cmd:generate()} option to overwrite existing variables. {dlgtab:Graph} {phang} {opt graph} displays the observed digit distribution in a graph as a bar plot (see help {helpb twoway bar}). The reference distribution is overlayed as connected-line plot (see help {helpb twoway connected}). Type {cmd:noref} to omit the reference distribution. {phang} {opt percent} displays percentages. This is the default {phang} {opt fraction} displays proportions. {phang} {opt count} displays counts. {phang} {it:{help twoway_bar:bar_options}} affect the rendition of the plotted distribution. See help {helpb twoway bar}. {marker ci}{phang} {cmd:ci}[{cmd:(}{it:type}{cmd:)}] specifies that pointwise confidence intervals of the observed distribution be plotted as capped spikes. {it:type} sets the calculation method and may be {opt exa:ct} (the default), {opt wa:ld}, {opt w:ilson}, {opt a:gresti}, or {opt j:effreys} (see help {helpb ci}). Alternatively, {it:type} may be {cmd:reference} in which case point-wise probability intervals are plotted around the reference distribution as a connected-line range plot (the intervals are the shortest intervals with a probability mass of at least the value of the confidence level). {phang} {opt level(#)} specifies the confidence level, as a percentage, for the plotted confidence intervals. The default is {cmd:level(95)} or as set by {helpb set level}. {phang} {opt ciopts(options)} affects the rendition of the confidence spikes. See help {helpb twoway rcap} or, if {cmd:ci(reference)} is used, help {helpb twoway rconnected}. {phang} {opt noref} suppresses plotting the reference distribution. {phang} {cmd:refopts(}{it:{help twoway_connected:connected_options}}{cmd:)} affects the rendition of the plotted reference distribution. See help {helpb twoway connected}. {phang} {opt addplot(plot)} provides a way to add other plots to the generated graph. See help {help addplot_option}. {phang} {it:{help twoway_options}} are any of the options documented in {it:{help twoway_options}}, excluding {cmd:by()}. {dlgtab:By} {phang} {opt by(groupvar)} repeats the analysis for the groups defined by {it:groupvar}. The individual goodness-of-fit tests are included in a single table and the individual plots are drawn within a single graph. Note that {cmd:digdis} also allows the {helpb by} prefix command, which arranges output differently. A difference between the {cmd:by()} option and the {helpb by} prefix is also that {cmd:by()} returns in {cmd:r()} the results for all groups whereas the {helpb by} prefix only returns the results for the last group. {phang} {cmd:byopts(}{it:{help by_option:by_subopts}}{cmd:)} affects the arrangement of the individual plots in the graph. See the {it:suboptions} in help {it:{help by_option}}. Do not use the {cmd:total} suboption. {title:Examples} {com}. {stata "sysuse auto"} {txt}(1978 Automobile Data) {com}. {stata "digdis price"} {res}{txt} Digit distribution ({res}1st{txt} digit) Value {c |} Count Percent Percent Diff. P-value {c |} Observed Expected (MAD) {hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} 1 {c |} {res} 10 13.514 30.103 -16.589 0.0014 {txt}2 {c |} {res} 0 0.000 17.609 -17.609 0.0000 {txt}3 {c |} {res} 11 14.865 12.494 2.371 0.4844 {txt}4 {c |} {res} 26 35.135 9.691 25.444 0.0000 {txt}5 {c |} {res} 14 18.919 7.918 11.001 0.0018 {txt}6 {c |} {res} 7 9.459 6.695 2.765 0.3456 {txt}7 {c |} {res} 2 2.703 5.799 -3.096 0.4482 {txt}8 {c |} {res} 2 2.703 5.115 -2.413 0.5918 {txt}9 {c |} {res} 2 2.703 4.576 -1.873 0.7767 {txt}{hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} Total {c |} {res} 74 100.000 100.000 9.240 {txt} {res} {txt}Goodness-of-fit tests method = {res}approx {txt}observations ={res} 74 {txt}categories ={res} 9 {txt}df ={res} 8 {txt}{hline 22}{c TT}{hline 23} Test statistic {c |} Coef. P-value {hline 22}{c +}{hline 23} Pearson's X2 {c |} {res} 84.35215 0.0000 {txt}Log likelihood ratio {c |} {res} 76.29649 0.0000 {txt}{hline 22}{c BT}{hline 23} {com}. {stata "digdis price, position(2)"} {res}{txt} Digit distribution ({res}2nd{txt} digit) Value {c |} Count Percent Percent Diff. P-value {c |} Observed Expected (MAD) {hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} 0 {c |} {res} 7 9.459 11.968 -2.508 0.5948 {txt}1 {c |} {res} 13 17.568 11.389 6.179 0.0991 {txt}2 {c |} {res} 7 9.459 10.882 -1.423 0.8522 {txt}3 {c |} {res} 7 9.459 10.433 -0.973 1.0000 {txt}4 {c |} {res} 7 9.459 10.031 -0.571 1.0000 {txt}5 {c |} {res} 4 5.405 9.668 -4.262 0.3212 {txt}6 {c |} {res} 4 5.405 9.337 -3.932 0.3182 {txt}7 {c |} {res} 12 16.216 9.035 7.181 0.0406 {txt}8 {c |} {res} 9 12.162 8.757 3.405 0.3000 {txt}9 {c |} {res} 4 5.405 8.500 -3.094 0.5279 {txt}{hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} Total {c |} {res} 74 100.000 100.000 3.353 {txt} {res} {txt}Goodness-of-fit tests method = {res}approx {txt}observations ={res} 74 {txt}categories ={res} 10 {txt}df ={res} 9 {txt}{hline 22}{c TT}{hline 23} Test statistic {c |} Coef. P-value {hline 22}{c +}{hline 23} Pearson's X2 {c |} {res} 11.75117 0.2277 {txt}Log likelihood ratio {c |} {res} 11.12604 0.2672 {txt}{hline 22}{c BT}{hline 23} {com}. {stata "set seed 3217367"} {txt} {com}. {stata "digdis price, position(2) test(mc) nofreq"} {res} {txt}Goodness-of-fit tests method = {res}mc {txt}observations ={res} 74 {txt}categories ={res} 10 {txt}replications ={res} 10000 {txt}{hline 22}{c TT}{hline 47} Test statistic {c |} Coef. P-value [99% Conf. Interval] {hline 22}{c +}{hline 47} Pearson's X2 {c |} {res} 11.75117 0.2234 0.2128 0.2343 {txt}Log likelihood ratio {c |} {res} 11.12604 0.2894 0.2778 0.3012 {txt}{hline 22}{c BT}{hline 47} {com}. {stata `"digdis price, position(2) graph nofreq notest ti("Second digit distribution")"':digdis price, position(2) graph nofreq} {stata `"digdis price, position(2) graph nofreq notest ti("Second digit distribution")"':notest ti("Second digit distribution")} {res}{txt} {com}. {stata `"digdis price, position(2) by(foreign) graph byopts(ti("Second digit distribution"))"':digdis price, position(2) by(foreign) graph} {stata `"digdis price, position(2) by(foreign) graph byopts(ti("Second digit distribution"))"':byopts(ti("Second digit distribution"))} {txt}{hline} -> foreign = 0 {res}{txt} Digit distribution ({res}2nd{txt} digit) Value {c |} Count Percent Percent Diff. P-value {c |} Observed Expected (MAD) {hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} 0 {c |} {res} 6 11.538 11.968 -0.429 1.0000 {txt}1 {c |} {res} 10 19.231 11.389 7.842 0.0807 {txt}2 {c |} {res} 3 5.769 10.882 -5.113 0.3687 {txt}3 {c |} {res} 6 11.538 10.433 1.106 0.8189 {txt}4 {c |} {res} 6 11.538 10.031 1.508 0.6448 {txt}5 {c |} {res} 3 5.769 9.668 -3.898 0.4812 {txt}6 {c |} {res} 2 3.846 9.337 -5.491 0.2331 {txt}7 {c |} {res} 7 13.462 9.035 4.426 0.2313 {txt}8 {c |} {res} 6 11.538 8.757 2.781 0.4576 {txt}9 {c |} {res} 3 5.769 8.500 -2.731 0.6239 {txt}{hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} Total {c |} {res} 52 100.000 100.000 3.533 {txt}{hline} -> foreign = 1 {res}{txt} Digit distribution ({res}2nd{txt} digit) Value {c |} Count Percent Percent Diff. P-value {c |} Observed Expected (MAD) {hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} 0 {c |} {res} 1 4.545 11.968 -7.422 0.5072 {txt}1 {c |} {res} 3 13.636 11.389 2.247 0.7331 {txt}2 {c |} {res} 4 18.182 10.882 7.300 0.2916 {txt}3 {c |} {res} 1 4.545 10.433 -5.888 0.7224 {txt}4 {c |} {res} 1 4.545 10.031 -5.485 0.7194 {txt}5 {c |} {res} 1 4.545 9.668 -5.122 0.7174 {txt}6 {c |} {res} 2 9.091 9.337 -0.247 1.0000 {txt}7 {c |} {res} 5 22.727 9.035 13.692 0.0431 {txt}8 {c |} {res} 3 13.636 8.757 4.879 0.4355 {txt}9 {c |} {res} 1 4.545 8.500 -3.954 1.0000 {txt}{hline 12}{hline 1}{c +}{hline 10}{hline 33}{hline 11} Total {c |} {res} 22 100.000 100.000 5.624 {txt} {hline} Goodness-of-fit tests foreign {c |} Obs. X2 P-value LR P-value {hline 13}{c +}{hline 10}{hline 22}{hline 22} 0 {c |} {res} 52 8.783497 0.4575 9.041643 0.4334 {txt}1 {c |} {res} 22 9.744576 0.3716 9.01948 0.4355 {txt} {com}. {stata "digdis displ, graph ci nofreq notest":digdis displ, graph ci nofreq notest} {res}{txt} {com}. {stata "digdis displ, graph ci(ref) nofreq notest":digdis displ, graph ci(ref) nofreq notest} {res}{txt} {title:Returned results} {pstd}{cmd:digdis} saves the following in {cmd:r()}: {pstd}Scalars{p_end} {p2colset 7 20 20 2}{...} {p2col : {cmd:r(N)}}number of observations {p_end} {p2col : {cmd:r(position)}}digit position {p_end} {p2col : {cmd:r(base)}}base of number system {p_end} {p2col : {cmd:r(mad)}}mean average percentage deviation between observed and expected distribution {p_end} {p2col : {cmd:r(level)}}confidence level as a percentage (if {cmd:ci} is specified) {p_end} {p2col : {cmd:r(}{it:stat}{cmd:)}}value of test statistic {p_end} {p2col : {cmd:r(p_}{it:stat}{cmd:)}}p-value of {cmd:r(}{it:stat}{cmd:)} {p_end} {p 19 19 2}where {it:stat} may be {cmd:x2}, {cmd:lr}, {cmd:cr}, {cmd:mlnp}, or {cmd:ksmirnov}, depending on {cmd:test()} {pstd}Macros{p_end} {p2col : {cmd:r(cmd)}}"digdis" {p_end} {p2col : {cmd:r(refdist)}}type of reference distribution ("Benford", "uniform", or "user") {p_end} {p2col : {cmd:r(citype)}}confidence interval type (if {cmd:ci} is specified) {p_end} {p2col : {cmd:r(byvar)}}name of variable specified in {cmd:by()} {p_end} {pstd}Matrices{p_end} {p2col : {cmd:r(count)}}observed and expected counts {p_end} {p2col : {cmd:r(pvals)}}p-values of individual differences {p_end} {p2col : {cmd:r(ci)}}pointwise confidence intervals (if {cmd:ci} is specified) {p_end} {pstd}{cmd:r(N)}, {cmd:r(mad)}, {cmd:r(}{it:stat}{cmd:)}, and {cmd:r(p_}{it:stat}{cmd:)} are matrices if {cmd:digdis} is used with more than one variable or if {cmd:by()} is specified. {title:Author} {pstd} Ben Jann, University of Bern, ben.jann@soz.unibe.ch {pstd} Thanks for citing this software as follows: {pmore} Jann, B. (2007). digdis: Stata module to analyze the distribution of digits. Available from {browse "http://ideas.repec.org/c/boc/bocode/s456853.html"}. {title:Also see} {psee} Online: help for {helpb tabulate}, {helpb graph}, {helpb ci}, {helpb mgof}, {helpb mf_mm_mgof:mm_mgof()}, {helpb mf_mm_benford:mm_benford()}, {helpb moremata}