help mgof-------------------------------------------------------------------------------

Title

mgof-- Goodness-of-fit tests for multinomial data

Syntax

mgofvarname[=exp] [if] [in] [weight] [,options]

mgofif1f2...[/p1p2...] [,options]

optionsDescription ------------------------------------------------------------------------- Method 1approx[(nfit)] compute large sample chi-squared tests (the default) *svy[(spec)] adjust tests for survey design *vce(vcetype)adjust tests usingproportionvariance estimate *cluster(varname)adjust tests for intragroup correlationnoisilyshow output fromproportionMethod 2

mccompute Monte Carlo exact testsreps(#)number of replications formc; default isreps(10000)level(#)set confidence level formc; default islevel(99)citype(type)set confidence interval type formc; default iscitype(exact)Method 3

eecompute exhaustive enumeration exact testsTest statistics

nox2suppress Pearson's X2 statisticnolrsuppress log likelihood ratio statisticcr[(lambda)] include Cressie-Read (lambda) statistic;lambdadefaults to2/3mlnpinclude log outcome probability statistic (mcandeeonly)ksmirnovinclude Kolmogorov-Smirnov statistic (mcandeeonly)Other options

freqdisplay frequency distributionpercentdisplay frequency distribution in percent *matrix(name)provide matrix containing observed and expected countsexpected(name)provide matrix (column vector) containing expected countsnodotssuppress progress dots (mcandeeonly) ------------------------------------------------------------------------- *svy,vce(),cluster(), andmatrix()not allowed withmgofi.

byis allowed (unlesssvyis specified); see helpby.

fweights,pweights, andiweights are allowed (see help weight), butpweights are not allowed witheeormc, andiweights are not allowed witheeand not allowed withmcif themlnpoption is specified.

Description

mgofcomputes goodness-of-fit tests for the distribution ofvarname, wherevarnameis a discrete (categorical, multinomial) variable. The default is to perform classical large sample chi-squared approximation tests based on Pearson's X2 statistic and the log likelihood ratio (G2) statistic. Alternatively,mgofcomputes exact tests using Monte Carlo methods or exhaustive enumeration (see themcandeeoptions).The (theoretical) null-distribution (the distribution against which

varnameis tested) is specified byexp.expis assumed to evaluate to the hypothesized probabilities of the categories ofvarnameor to quantities proportional to these probabilities (e.g. expected counts; the scale does not matter). Ifexpis omitted, the uniform (geometric, equiprobable) distribution is used.

mgofiis the immediate form ofmgof(see immed) wheref1,f2, etc. specify the observed counts and, optionally,p1,p2, etc. specify the theoretical probabilities or expected counts.The heavy lifting is done in Mata. See help

mata mm_mgof()for details. See Jann (2008) for an article discussing multinomial goodness-of-fit tests and themgofcommand.

Dependencies

mgofrequiresmoremata. Type. ssc describe moremata

Options+----------+ ----+ Method 1 +---------------------------------------------------------

approx[(nfit)], the default method, computes classical large sample chi-squared approximation tests based on Pearson's X2 and the log likelihood ratio (G2) statistic (see, e.g., Horn 1977; Cressie and Read 1989; Sokal and Rohlf 1995, Ch. 17). The degrees of freedom for chi-squared tests are determined ask-nfit-1 wherekis the number of categories andnfit, provided by the user, indicates the number of fitted parameters (imposed restrictions) (nfit's default is 0). Ifpweights are specified, the tests are corrected as outlined in the Methods and Formulas section.

svy[([vcetype] [,svy_options])] specifies that the test results be adjusted for survey design effects according to thesvysetspecifications.vcetypeandsvy_optionsare as described in helpsvy. The correction procedure is described in the Methods and Formulas section below.svyis not allowed withmgofi.

vce(vcetype)specifies that the variance-covariance matrix of the proportions be estimated usingproportionand that the tests be adjusted based on this estimate (see Methods and Formulas below).vcetypemay beanalytic,clusterclustvar,bootstrap, orjackknife(plus possible suboptions as described in helpvce_option).analyticandclusterclustvarare not allowed with Stata 9.vce()is not allowed withmgofi.

cluster(clustvar)is Stata 9 syntax forvce(clusterclustvar).cluster()is not allowed withmgofi.

noisilydisplays the output from theproportioncommand, which is used to estimate the variances of the proportions ifsvy,vce(), orcluster()is specified or ifpweightsare applied.+----------+ ----+ Method 2 +---------------------------------------------------------

mccauses the exact p-values to be approximated by sampling from the null distribution (Monte Carlo simulation). The default number of replications for the simulation is 10000; see thereps()option (the same set of samples is used for all test statistics). 99% confidence intervals are displayed for the estimated p-values.

reps(#)sets the number of replications for themcmethod. The default is 10'000.

level(#)sets the level for the confidence intervals of the p-values computed by themcmethod. The default islevel(99). Note that, unlike many other Stata commands,mgofdoesnotdepend onset level.

citype(type)specifies how the binomial confidence intervals for the p-values from themcmethod are to be calculated. Available types areexact,wald,wilson,agresti, andjeffreys. See helpci.citype(exact)is the default.+----------+ ----+ Method 3 +---------------------------------------------------------

eecauses the exact p-values to be computed by cycling through all possible data compositions given the sample size and the number of categories. Since the number of compositions grows very fast -- it is equal to (n+k-1)!/((k-1)!n!) wherenis the sample size andkis the number of categories -- theeemethod is only feasible for very small samples and few categories. An important exception is when the null distribution is uniform (andksmirnovis not specified). In this case the tests are based on enumerating partitions, which are much fewer in number than compositions. For example, withn=40 andk=10, the number of compositions is 2'054'455'634, but the number of partitions is only 16'928 (Hirji 1997).+-----------------+ ----+ Test statistics +--------------------------------------------------

nox2suppresses Person's X2 statistic.

nolrsuppresses the log likelihood ratio (G2) statistic.

cr[(lambda)] specifies that the Cressie-Read statistic with parameterlambdabe included (Cressie and Read 1984; also see Weesie 1997). The default forlambdais2/3.

mlnprequests that a test based on the (minus log) multinomial probability of the observed outcome be included (see Horn 1977).mlnpis not allowed with theapproxmethod.

ksmirnovrequests that the two-sided Kolmogorov-Smirnov statistic be included. The Kolmogorov-Smirnov statistic is sensitive to the order of the categories and should only be used with variables that have a natural order (i.e. ordinal or discrete metric data). Note that the Kolmogorov-Smirnov test implemented in official Stata'sksmirnovis conservative in the case of discrete data (see, e.g., Conover 1972). The methods implemented here are exact.ksmirnovis not allowed withapprox.+---------------+ ----+ Other options +----------------------------------------------------

freqdisplays a table containing observed and expected frequencies.

percentdisplays a table containing observed and expected percent.

matrix(name)specifies that the observed and expected counts are to be taken from matrixname(see helpmatrix). The first column of the matrix provides the observed counts and the second column, if present, provides the the expected counts or theoretical probabilities. The uniform distribution is used if the matrix does not contain a second column. Do not provide non-integer observed counts with theeeormcmethod.matrix()is not allowed withmgofi.

expected(name)specifies that the expected counts or theoretical probabilities are to be taken from column vectorname(see helpmatrix).mgofaborts if the number of elements in the vector does not match the number outcomes.

nodotscauses the progress dots for themcandeemethods to be suppressed. The default is to display a dot for each 2 percent of completed computations.

Methods and formulasThe survey design correction procedure for the large-sample tests is based on Rao and Scott (1981) and parallels the default independence test correction used in

svy: tabulate twoway(see[SVY] svy: tabulate twowayand the references therein).Let V/(n-1) be a consistent estimate of the variance-covariance matrix of the proportions p_i, i=1...k, where n is the number of observations. Furthermore, let v_ij denote an element of V, m be the number of PSUs/clusters, and L be the number of strata. The correction then takes

F = X2 / (delta * (a2 + 1)) / d = X2 / delta / (k-1) as asymptotically F(d, d*r) distributed where X2 stands for the uncorrected test statistic and where

delta = 1/(k-1) * sum_i v_ii/p_i d = (k-1) / (1 + a2) a2 = [ 1/(k-1) * sum_ij v_ij^2/(p_i*p_j) ] / delta^2 - 1 r = m - L delta can be seen as the mean of the "generalized design effects" for the proportions, a2 as the square of their variation coefficient. V/(n-1) is estimated by

proportiontaking into accountpweights, cluster structure, or other complex survey design properties.

ExamplesApproximate chi-squared test Monte Carlo exact test Exhaustive enumeration exact test Kolmogorov-Smirnov test Complex survey design correction Test against non-uniform distribution Using the immediate command or the the matrix() option Empty cells

+------------------------------+ ----+ Approximate chi-squared test +-------------------------------------

A classical chi-squared tests against the uniform distribution can be performed as follows:

. drop _all . set seed 38 . set obs 20 obs was 0, now 20 . gen byte x = ceil(uniform()*5) . mgof x, freq cr Number of obs = 20 N of outcomes = 5 Chi2 df = 4 ---------------------------------------------- Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 10 0.0404 Log likelihood ratio | 9.55691 0.0486 Cressie-Read (2/3) | 9.699921 0.0458 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 4 2 | 1 4 3 | 9 4 4 | 2 4 5 | 5 4 -------------+---------------------- Total | 20 20

+------------------------+ ----+ Monte Carlo exact test +-------------------------------------------

If the number of observations is low, exact p-values should be computed. One approach is to approximate the exact p-values by sampling from the null distribution (

mcmethod):. mgof x, mc cr Percent completed (10000 replications) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Replications = 10000 ---------------------------------------------------------------------- | Exact Goodness-of-fit | Coef. P-value [99% Conf. Interval] ----------------------+----------------------------------------------- Pearson's X2 | 10 0.0369 0.0322 0.0420 Log likelihood ratio | 9.55691 0.0753 0.0687 0.0824 Cressie-Read (2/3) | 9.699921 0.0402 0.0353 0.0455 ----------------------------------------------------------------------

+-----------------------------------+ ----+ Exhaustive enumeration exact test +--------------------------------

There are (20+5-1)!/((5-1)!20!) = 10626 different ways in which 20 observations can be divided into 5 (or less) categories. Performing an exact test based on generating all these compositions would not be much of a problem. However, because the null distribution is uniform, only 192 partitions, a subset of the 10626 composition, have to be generated:

. mgof x, ee cr Percent completed (192 partitions) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Partitions = 192 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 10 0.0392 Log likelihood ratio | 9.55691 0.0773 Cressie-Read (2/3) | 9.699921 0.0432 ----------------------------------------------

+-------------------------+ ----+ Kolmogorov-Smirnov test +------------------------------------------

The

ksmirnovoption performs an exact Kolmogorov-Smirnov test for discrete data (not allowed withapprox). Example:. mgof x, ee nox2 nolr ksmirnov Percent completed (10626 compositions) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Kolmogorov-Smirnov D | .2 0.1656 ---------------------------------------------- +----------------------------------+ ----+ Complex survey design correction +---------------------------------

mgofmay be used withpweights, in which case the correction outlined in the Methods and Formulas section is applied. Example:. gen w = exp(invnorm(uniform())) . mgof x [pw = w] Number of obs = 20 N of outcomes = 5 F df1 = 3.46153 F df2 = 65.769 --------------------------------------------------------- Goodness-of-fit | Coef. F-value P-value ----------------------+---------------------------------- Pearson's X2 | 6.867921 1.1741 0.3287 Log likelihood ratio | 7.946182 1.3584 0.2611 ---------------------------------------------------------

More generally, use the

svyoption to take account of the complex survey design set bysvyset: . svyset [pw = w]pweight:w VCE:linearized Single unit:missing Strata 1:<one> SU 1:<observations> FPC 1:<zero> . mgof x, svy Number of strata = 1 Number of obs = 20 Number of PSUs = 20 Pop size = 35.039 Design df = 19 N of outcomes = 5 F df1 = 3.46153 F df2 = 65.769 --------------------------------------------------------- Goodness-of-fit | Coef. F-value P-value ----------------------+---------------------------------- Pearson's X2 | 6.867921 1.1741 0.3287 Log likelihood ratio | 7.946182 1.3584 0.2611 ---------------------------------------------------------

+---------------------------------------+ ----+ Test against non-uniform distribution +----------------------------

. recode x (1=0.1) (2=0.2) (3=0.4) (4=0.2) (5=0.1), generate(p) (20 differences between x and p) . mgof x = p, freq ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 2 2 | 1 4 3 | 9 8 4 | 2 4 5 | 5 2 -------------+---------------------- Total | 20 20

+--------------------------------------------------------+ ----+ Using the immediate command or the the matrix() option +-----------

. mgofi 3 1 9 2 5 / 2 4 8 4 2, ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ---------------------------------------------- . matrix A = (3\1\9\2\5),(2\4\8\4\2) . mgof, matrix(A) ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ----------------------------------------------

+-------------+ ----+ Empty cells +------------------------------------------------------

Empty cells can be introduced by adding observations with 0 weights (or by using

mgofior thematrix()option):. gen w = 1 . set obs 21 obs was 20, now 21 . replace x = 6 in 21 (1 real change made) . replace w = 0 in 21 (1 real change made) . mgof x [fw = w], freq ee nodots Number of obs = 20 N of outcomes = 6 Partitions = 282 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 16 0.0071 Log likelihood ratio | 16.84977 0.0083 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 3.333333 2 | 1 3.333333 3 | 9 3.333333 4 | 2 3.333333 5 | 5 3.333333 6 | 0 3.333333 -------------+---------------------- Total | 20 20

Returned resultsDepending on options,

mgofsaves the following inr():Scalars

r(N)number of observationsr(N_pop)population sizer(N_strata)number of stratar(N_psu)number of PSUsr(N_clust)number of clustersr(df_r)design degrees of freedomr(df)degrees of freedom for the chi2 statisticsr(df1)numerator d.f. for the F statisticsr(df2)denominator d.f. for the F statisticsr(delta)mean generalized design effectr(a2)squared variation coefficient of generalized design effectsr(reps)number of replicationsr(partitions)number of partitionsr(compositions)number of compositionsr(stat)value of test statisticr(F_stat)value of the corrected F statisticr(p_stat)p-value ofr(stat)orr(F_stat)r(p_stat_srs)uncorrected p-value ofr(stat)r(p_stat_lb)lower C.I. bound forr(p_stat)r(p_stat_ub)upper C.I. bound forr(p_stat)where

statis:x2(Pearson's X2)lr(log likelihood ratio)cr(Cressie-Read statistic)mlnp(minus log outcome probability)ksmirnov(Kolmogorov-Smirnov D)Macros

r(depvar)name of tabulated variabler(h0)definition of the theoretical distribution; either"=exp"or"= 1/k", depending on whetherexpwas specified, wherekis the number of categoriesr(method)method used to compute the p-valuesr(stats)list of tested statisticsr(lambda)lambda parameter of the Cressie-Read statisticr(citype)Monte Carlo confidence interval typer(cilevel)Monte Carlo confidence levelMatrices

r(count)observed counts and expected counts

ReferencesConover, W. J. (1972). A Kolmogorov Goodness-of-Fit Test for Discontinuous Distributions. Journal of the American Statistical Association 67: 591-596.

Cressie, N., T. R. C. Read (1984). Multinomial Goodness-of-Fit Tests. Journal of the Royal Statistical Society (Series B) 46: 440-464.

Cressie, N., T. R. C. Read (1989). Pearson's X^2 and the Loglikelihood Ratio Statistic G^2: A Comparative Review. International Statistical Review 57: 19-43.

Hirji, K. F. (1997). A Comparison of Algorithms for Exact Goodness-of-Fit Tests for Multinomial Data. Communications in Statistics–Simulation and Computation 26: 1197-1227.

Horn, S. D. (1977). Goodness-of-Fit Tests for Discrete Data: A Review and an Application to a Health Impairment Scale. Biometrics 33: 237-247.

Jann, B. (2008). Multinomial goodness-of-fit: large sample tests with survey design correction and exact tests for small samples. The Stata Journal 8(2): 147-169. [Working paper version available from http://ideas.repec.org/p/ets/wpaper/2.html.]

Rao, J. N. K., A. J. Scott (1981). The Analysis of Categorical Data From Complex Sample Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables. Journal of the American Statistical Association 76: 221-230.

Sokal, R. R., F. J. Rohlf (1995). Biometry. The Principles and Practice of Statistics in Biological Research. Third Edition. New York: W. H. Freeman and Company.

Weesie, J. (1997). sg68: Goodness-of-fit statistics for multinomial distributions. Stata Technical Bulletin Reprints 6: 183-186.

AuthorBen Jann, ETH Zurich, jann@soz.gess.ethz.ch

Thanks for citing

mgofas follows:Jann, B. (2008). Multinomial goodness-of-fit: large sample tests with survey design correction and exact tests for small samples. The Stata Journal 8(2): 147-169.

or

Jann, B. (2007). mgof: Stata module to perform goodness-of-fit tests for multinomial data. Available from http://ideas.repec.org/c/boc/bocode/s456854.html.

Also seeOnline:

ksmirnov,matrix,svy: tabulate twoway,proportion,mm_mgof(),moremataUser packages:

tabchiby Nicholas J. Coxmultgofby Jeroen Weesiechi2fitby Stas Kolenikovcsgofby Michael N. Mitchell