help mata mm_mgof()
-------------------------------------------------------------------------------

Title

mm_mgof() -- Goodness-of-fit tests for multinomial data

Syntax

    real matrix mm_mgof(f [, h, method, stats, lambda, nfit | reps, dots])

where

         f:  real colvector containing observed counts

         h:  real colvector containing expected counts (or probabilities; the scale does not matter)

    method:  string scalar containing "approx" (large sample chi-squared approximation test; the default), "mc" (Monte Carlo exact test), or "ee" (exhaustive enumeration exact test)

     stats:  string vector specifying the test statistics to be used; available statistics are "x2" (Pearson's X2 statistic; the default), "lr" (log likelihood ratio statistic), "cr" (statistic from the Cressie-Read family), "mlnp" (outcome probability statistic), and "ksmirnov" (two-sided Kolmogorov-Smirnov statistic); "mlnp" and "ksmirnov" are not allowed with the "approx" method

    lambda:  real scalar specifying the lambda parameter for the Cressie-Read statistic; default is 2/3

      nfit:  real scalar specifying the number of fitted parameters (imposed restrictions) for the chi-squared approximation test (default is 0)

      reps:  real scalar specifying the number of replications for the "mc" method (default is 10000)

      dots:  real scalar causing progress dots to be displayed with the "mc" or "ee" method

Description

mm_mgof() performs goodness-of-fit tests for discrete distributions. It returns a matrix containing, for each requested statistic, a row with the statistic's value in the first column and the associated p-value in the second column.

f is a colvector containing the observed frequency distribution (i.e. the observed counts for each category; sum(f) is the sample size). h is a colvector specifying the null distribution (in counts or proportions) against which the observed distribution is to be tested. If h is omitted or if rows(h)==1, the uniform distribution is used as the null distribution.

method specifies the method used to evaluate the p-values. Available methods are:

    "approx" to perform a classic large sample chi-squared approximation test. "approx" is the default method.

    "mc" to approximate the exact p-value by sampling from the null distribution (Monte Carlo simulation). The proportion of samples in which the test statistic exceeds the observed statistic gives the p-value. The number of replications (i.e. the number of drawn samples) is 10000 or as specified by reps.

    "ee" to compute the exact p-value by iterating through all possible data compositions given the number of observations and the number of categories. The number of possible compositions grows very fast with the number of observations and categories (it is equal to comb(n+k-1,k-1) = comb(n+k-1,n) = (n+k-1)!/((k-1)!n!)). The "ee" method is therefore only useful for very small samples with few categories. An important exception is when the null distribution is the uniform distribution (and stats does not contain "ksmirnov"). In this case the p-values are computed based on the partitions of n. The number of partitions is typically much smaller than the number of compositions.

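To give a feel for the size of the "ee" problem, the following Python sketch (an illustration only, not part of moremata) counts the compositions for the n = 11, k = 5 data used in the examples of this help file and compares the result with the number of partitions. It assumes that "partitions of n" means partitions into at most k parts, i.e. unordered sets of category counts; the function names are hypothetical.

```python
from math import comb


def count_compositions(n, k):
    """Ways to distribute n observations over k ordered categories."""
    return comb(n + k - 1, k - 1)


def count_partitions(n, k):
    """Partitions of n into at most k parts (assumed meaning of 'partitions of n').

    Counted through the equivalent problem of partitions of n whose
    parts are all of size <= k.
    """
    ways = [1] + [0] * n
    for part in range(1, k + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


n, k = 11, 5
print(count_compositions(n, k))  # 1365
print(count_partitions(n, k))    # 37
```

Under this assumption the uniform-null case needs to visit 37 configurations rather than 1365, which illustrates why it remains feasible for larger problems.
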
stats specifies the test statistics for which the p-values are to be computed. Available statistics are:

"x2": Pearson's X2 statistic

    X2 = sum( (f-h)^2 / h )

where f are the observed counts and h are the expected counts. X2 is asymptotically chi-square distributed with k-nfit-1 degrees of freedom. "x2" is the default for the "approx" method ("approx" with "x2" is the classical chi-squared goodness-of-fit test for multinomial data).
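As a cross-check of the formula, the following Python sketch (an illustration outside Mata, not the mm_mgof() source) computes X2 and its large-sample p-value for the counts used in the examples of this help file. Rescaling h to sum to the sample size is an assumption made here to reflect that the scale of h does not matter. The upper-tail function uses the closed form that holds for even degrees of freedom only. All function names are hypothetical.

```python
from math import exp


def pearson_x2(f, h):
    """Pearson's X2 = sum((f - e)^2 / e), with e = h rescaled to sum(f)."""
    scale = sum(f) / sum(h)  # assumption: h may be counts or proportions
    expected = [scale * hi for hi in h]
    return sum((fi - ei) ** 2 / ei for fi, ei in zip(f, expected))


def chi2_upper_tail_even(x, df):
    """P(chi2_df > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2:
        raise ValueError("closed form only valid for positive even df")
    half = x / 2
    term = total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


f = [1, 0, 2, 2, 6]      # observed counts from the examples
h = [1, 1, 1, 1, 1]      # uniform null distribution
nfit = 0                 # no fitted parameters
x2 = pearson_x2(f, h)
df = len(f) - nfit - 1   # k - nfit - 1 degrees of freedom
p = chi2_upper_tail_even(x2, df)
print(x2, p)
```

The two printed numbers should agree with the first row of the "approx" output in the examples.
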

"lr": The log likelihood ratio statistic (or deviance)

    LR = G2 = 2 * sum( f * ln(f/h) )

LR is an alternative to Pearson's X2 and is also asymptotically chi-square distributed with k-nfit-1 degrees of freedom.

"cr": The Cressie-Read statistic

    CR = 2/(l*(l+1)) * sum( f * ((f/h)^l - 1) )

where l stands for lambda, which defaults to 2/3. The Cressie-Read family includes Pearson's X2 and the LR statistic as special cases (lambda=1 and lambda=0, respectively; other special cases are lambda=-1/2 for the Freeman-Tukey statistic, lambda=-1 for the Kullback-Leibler information, and lambda=-2 for Neyman's modified X2 statistic; see Cressie and Read 1984, Weesie 1997). All members are asymptotically chi-square distributed with k-nfit-1 degrees of freedom.
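The relation between the family members can be checked numerically. The following Python sketch (an illustration, not the mm_mgof() source) evaluates the Cressie-Read statistic for the counts used in the examples of this help file. It assumes that h is rescaled to sum to the sample size, that lambda=0 is handled through its limit (the LR statistic, with 0*ln(0) taken as 0), and it deliberately leaves out the lambda=-1 limit. The function name is hypothetical.

```python
from math import log


def cressie_read(f, h, lam=2 / 3):
    """CR = 2/(lam*(lam+1)) * sum(f * ((f/e)^lam - 1)), e = rescaled h."""
    scale = sum(f) / sum(h)  # assumption: scale of h does not matter
    expected = [scale * hi for hi in h]
    if lam == 0:
        # limiting case lam -> 0: log likelihood ratio statistic
        return 2 * sum(fi * log(fi / ei)
                       for fi, ei in zip(f, expected) if fi > 0)
    if lam == -1:
        raise ValueError("lambda = -1 limit not covered by this sketch")
    total = sum(fi * ((fi / ei) ** lam - 1) for fi, ei in zip(f, expected))
    return 2 / (lam * (lam + 1)) * total


f = [1, 0, 2, 2, 6]   # observed counts from the examples
h = [1, 1, 1, 1, 1]   # uniform null distribution

cr = cressie_read(f, h)       # default lambda = 2/3
x2 = cressie_read(f, h, 1)    # lambda = 1 gives Pearson's X2
lr = cressie_read(f, h, 0)    # lambda = 0 gives the LR statistic
print(x2, lr, cr)
```

The three values should agree with the first column of the "approx" output in the examples.
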

"mlnp": The statistic -ln(p), where

    p = n!/(f1!*...*fk!) * p1^f1*...*pk^fk

is the probability of the observed outcome given the null distribution (n denotes the sample size, p1,...,pk are the theoretical probabilities of the categories, and f1,...,fk are the observed counts). "mlnp" is not allowed with the "approx" method. "mlnp" corresponds to the "exact multinomial test", i.e. the computed p-value reflects the exact probability of observing an outcome that is less probable, given the null distribution, than the actually observed outcome (see, e.g., Horn 1977, Cressie and Read 1989).
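For illustration, the statistic itself (not its p-value) can be evaluated on the log scale to avoid overflow in the factorials. The following Python sketch is not the mm_mgof() source; it assumes integer counts and normalizes h to probabilities, and the function name is hypothetical.

```python
from math import lgamma, log


def mlnp(f, h):
    """-ln(p) of the multinomial outcome f under null probabilities h/sum(h)."""
    n = sum(f)
    total = sum(h)
    log_p = lgamma(n + 1)  # ln(n!)
    for fi, hi in zip(f, h):
        # add fi*ln(pi) and subtract ln(fi!)
        log_p += fi * log(hi / total) - lgamma(fi + 1)
    return -log_p


f = [1, 0, 2, 2, 6]   # observed counts from the examples
h = [1, 1, 1, 1, 1]   # uniform null distribution
stat = mlnp(f, h)
print(stat)
```

The printed value should agree with the "mlnp" row (row 4) of the "ee" and "mc" outputs in the examples.
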

"ksmirnov": The two-sided Kolmogorov-Smirnov statistic

    D = max( abs(H-F) )

where H is the theoretical and F is the empirical cumulative distribution function. D is sensitive to the order of the categories and should therefore only be used with data that has a natural order (i.e. ordinal or discrete metric data). While the distribution of D is well known for continuous data, the standard Kolmogorov-Smirnov test (see help ksmirnov) is conservative in the case of discrete data (see, e.g., Conover 1972). mm_mgof() performs the Kolmogorov-Smirnov test without making assumptions about the distribution of D. "ksmirnov" is not available with the "approx" method.
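The dependence of D on the category order can be seen in a small Python sketch (an illustration, not the mm_mgof() source; the function name is hypothetical). It computes D from the cumulative sums of f and h and then repeats the computation with the same counts in a different order.

```python
from itertools import accumulate


def ks_statistic(f, h):
    """D = max |H - F| over the categories, using cumulative proportions."""
    n = sum(f)
    total = sum(h)
    F = [c / n for c in accumulate(f)]      # empirical CDF
    H = [c / total for c in accumulate(h)]  # theoretical CDF
    return max(abs(Hi - Fi) for Hi, Fi in zip(H, F))


h = [1, 1, 1, 1, 1]                          # uniform null distribution
d_original = ks_statistic([1, 0, 2, 2, 6], h)   # counts from the examples
d_reordered = ks_statistic([2, 6, 0, 2, 1], h)  # same counts, other order
print(d_original, d_reordered)
```

The first value should agree with the "ksmirnov" row of the exact-test outputs in the examples; the second differs although the multiset of counts is unchanged.
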

stats may include several statistics, in which case mm_mgof() returns results for each of the specified statistics. In the case of the "mc" method, the same set of samples is used for all specified statistics.

dots!=0 causes progress dots to be displayed with the "mc" or "ee" method (one dot = 2 percent of computations).

See Jann (2008) for a working paper discussing multinomial goodness-of-fit tests (available from http://ideas.repec.org/p/ets/wpaper/2.html).

Remarks

Examples:

    : uniformseed(46)

    : x = ceil(uniform(7,1)*5) \ 5 \ 5 \ 5 \ 5

    : x
            1
         +-----+
       1 |  5  |
       2 |  3  |
       3 |  1  |
       4 |  4  |
       5 |  3  |
       6 |  4  |
       7 |  5  |
       8 |  5  |
       9 |  5  |
      10 |  5  |
      11 |  5  |
         +-----+

    : f = mm_freq(x, 1, (1::5))

    : (1::5), f
           1   2
        +---------+
      1 |  1   1  |
      2 |  2   0  |
      3 |  3   2  |
      4 |  4   2  |
      5 |  5   6  |
        +---------+

    : mm_mgof(f,1,"approx",("x2","lr","cr"))
                     1             2
        +-----------------------------+
      1 |  9.454545455   .0506896631  |
      2 |  9.700229147   .0457916588  |
      3 |  9.102749235   .0585819271  |
        +-----------------------------+

    : mm_mgof(f,J(5,1,1/5),"approx",("x2","lr","cr"))
                     1             2
        +-----------------------------+
      1 |  9.454545455   .0506896631  |
      2 |  9.700229147   .0457916588  |
      3 |  9.102749235   .0585819271  |
        +-----------------------------+

    : mm_mgof(f,1,"ee",("x2","lr","cr","mlnp","ksmirnov"))
                     1             2
        +-----------------------------+
      1 |  9.454545455    .057135616  |
      2 |  9.700229147     .08694016  |
      3 |  9.102749235    .057135616  |
      4 |  8.167054764     .08694016  |
      5 |  .3454545455   .0286034534  |
        +-----------------------------+

    : mm_mgof(f,1,"mc",("x2","lr","cr","mlnp","ks"))
                     1             2
        +-----------------------------+
      1 |  9.454545455         .0546  |
      2 |  9.700229147         .0842  |
      3 |  9.102749235         .0546  |
      4 |  8.167054764         .0842  |
      5 |  .3454545455         .0281  |
        +-----------------------------+

Conformability

    mm_mgof(f, h, method, stats, lambda, nfit | reps):
             f:  k x 1
             h:  k x 1 or 1 x 1
        method:  1 x 1
         stats:  1 x s or s x 1
        lambda:  1 x 1
          nfit:  1 x 1
          reps:  1 x 1
        result:  max(s,1) x 2

Diagnostics

mm_mgof() aborts with error if h contains zero or if f or h contain negative or missing values.

mm_mgof() aborts with error if f contains non-integer values and either the method is "ee", or the method is "mc" and the "mlnp" statistic is requested. The number of observations, i.e. sum(f), is rounded to the nearest integer with the "mc" method.

mm_mgof() aborts with error if f contains zero and the "cr" statistic is used with lambda<0. Furthermore, lambda<0 is only allowed with the "approx" method.

mm_mgof() returns (0,1) for each statistic if rows(f)<2 or sum(f)==0.

Source code

mm_mgof.mata

References

Conover, W. J. (1972). A Kolmogorov Goodness-of-Fit Test for Discontinuous Distributions. Journal of the American Statistical Association 67: 591-596.

Cressie, N., T. R. C. Read (1984). Multinomial Goodness-of-Fit Tests. Journal of the Royal Statistical Society (Series B) 46: 440-464.

Cressie, N., T. R. C. Read (1989). Pearson's X^2 and the Loglikelihood Ratio Statistic G^2: A Comparative Review. International Statistical Review 57: 19-43.

Horn, S. D. (1977). Goodness-of-Fit Tests for Discrete Data: A Review and an Application to a Health Impairment Scale. Biometrics 33: 237-247.

Jann, B. (2008). Multinomial goodness-of-fit: large sample tests with survey design correction and exact tests for small samples. ETH Zurich Sociology Working Paper No. 2. Available from: http://ideas.repec.org/p/ets/wpaper/2.html.

Weesie, J. (1997). sg68: Goodness-of-fit statistics for multinomial distributions. Stata Technical Bulletin Reprints 6: 183-186.

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Also see

Online:  help for ksmirnov, mm_freq(), mm_subset(), moremata; mgof (if