Title
mgof -- Goodness-of-fit tests for multinomial data
Syntax
mgof varname [ = exp ] [if] [in] [weight] [, options ]
mgofi f1 f2 ... [ / p1 p2 ... ] [, options ]
options Description ------------------------------------------------------------------------- Method 1 approx[(nfit)] compute large sample chi-squared tests (the default) * svy[(spec)] adjust tests for survey design * vce(vcetype) adjust tests using proportion variance estimate * cluster(varname) adjust tests for intragroup correlation noisily show output from proportion
Method 2 mc compute Monte Carlo exact tests reps(#) number of replications for mc; default is reps(10000) level(#) set confidence level for mc; default is level(99) citype(type) set confidence interval type for mc; default is citype(exact)
Method 3 ee compute exhaustive enumeration exact tests
Test statistics nox2 suppress Pearson's X2 statistic nolr suppress log likelihood ratio statistic cr[(lambda)] include Cressie-Read (lambda) statistic; lambda defaults to 2/3 mlnp include log outcome probability statistic (mc and ee only) ksmirnov include Kolmogorov-Smirnov statistic (mc and ee only)
Other options freq display frequency distribution percent display frequency distribution in percent * matrix(name) provide matrix containing observed and expected counts expected(name) provide matrix (column vector) containing expected counts nodots suppress progress dots (mc and ee only) ------------------------------------------------------------------------- * svy, vce(), cluster(), and matrix() not allowed with mgofi.
by is allowed (unless svy is specified); see help by.
fweights, pweights, and iweights are allowed (see help weight), but pweights are not allowed with ee or mc, and iweights are not allowed with ee and not allowed with mc if the mlnp option is specified.
Description
mgof computes goodness-of-fit tests for the distribution of varname, where varname is a discrete (categorical, multinomial) variable. The default is to perform classical large sample chi-squared approximation tests based on Pearson's X2 statistic and the log likelihood ratio (G2) statistic. Alternatively, mgof computes exact tests using Monte Carlo methods or exhaustive enumeration (see the mc and ee options).
The (theoretical) null-distribution (the distribution against which varname is tested) is specified by exp. exp is assumed to evaluate to the hypothesized probabilities of the categories of varname or to quantities proportional to these probabilities (e.g. expected counts; the scale does not matter). If exp is omitted, the uniform (geometric, equiprobable) distribution is used.
mgofi is the immediate form of mgof (see immed) where f1, f2, etc. specify the observed counts and, optionally, p1, p2, etc. specify the theoretical probabilities or expected counts.
The heavy lifting is done in Mata. See help mata mm_mgof() for details. See Jann (2008) for an article discussing multinomial goodness-of-fit tests and the mgof command.
Dependencies
mgof requires moremata. Type
. ssc describe moremata
Options
+----------+ ----+ Method 1 +---------------------------------------------------------
approx[(nfit)], the default method, computes classical large sample chi-squared approximation tests based on Pearson's X2 and the log likelihood ratio (G2) statistic (see, e.g., Horn 1977; Cressie and Read 1989; Sokal and Rohlf 1995, Ch. 17). The degrees of freedom for chi-squared tests are determined as k-nfit-1 where k is the number of categories and nfit, provided by the user, indicates the number of fitted parameters (imposed restrictions) (nfit's default is 0). If pweights are specified, the tests are corrected as outlined in the Methods and Formulas section.
svy[([vcetype] [, svy_options])] specifies that the test results be adjusted for survey design effects according to the svyset specifications. vcetype and svy_options are as described in help svy. The correction procedure is described in the Methods and Formulas section below. svy is not allowed with mgofi.
vce(vcetype) specifies that the variance-covariance matrix of the proportions be estimated using proportion and that the tests be adjusted based on this estimate (see Methods and Formulas below). vcetype may be analytic, cluster clustvar, bootstrap, or jackknife (plus possible suboptions as described in help vce_option). analytic and cluster clustvar are not allowed with Stata 9. vce() is not allowed with mgofi.
cluster(clustvar) is Stata 9 syntax for vce(cluster clustvar). cluster() is not allowed with mgofi.
noisily displays the output from the proportion command, which is used to estimate the variances of the proportions if svy, vce(), or cluster() is specified or if pweights are applied.
+----------+ ----+ Method 2 +---------------------------------------------------------
mc causes the exact p-values to be approximated by sampling from the null distribution (Monte Carlo simulation). The default number of replications for the simulation is 10000; see the reps() option (the same set of samples is used for all test statistics). 99% confidence intervals are displayed for the estimated p-values.
reps(#) sets the number of replications for the mc method. The default is 10'000.
level(#) sets the level for the confidence intervals of the p-values computed by the mc method. The default is level(99). Note that, unlike many other Stata commands, mgof does not depend on set level.
citype(type) specifies how the binomial confidence intervals for the p-values from the mc method are to be calculated. Available types are exact, wald, wilson, agresti, and jeffreys. See help ci. citype(exact) is the default.
+----------+ ----+ Method 3 +---------------------------------------------------------
ee causes the exact p-values to be computed by cycling through all possible data compositions given the sample size and the number of categories. Since the number of compositions grows very fast -- it is equal to (n+k-1)!/((k-1)!n!) where n is the sample size and k is the number of categories -- the ee method is only feasible for very small samples and few categories. An important exception is when the null distribution is uniform (and ksmirnov is not specified). In this case the tests are based on enumerating partitions, which are much fewer in number than compositions. For example, with n=40 and k=10, the number of compositions is 2'054'455'634, but the number of partitions is only 16'928 (Hirji 1997).
+-----------------+ ----+ Test statistics +--------------------------------------------------
nox2 suppresses Person's X2 statistic.
nolr suppresses the log likelihood ratio (G2) statistic.
cr[(lambda)] specifies that the Cressie-Read statistic with parameter lambda be included (Cressie and Read 1984; also see Weesie 1997). The default for lambda is 2/3.
mlnp requests that a test based on the (minus log) multinomial probability of the observed outcome be included (see Horn 1977). mlnp is not allowed with the approx method.
ksmirnov requests that the two-sided Kolmogorov-Smirnov statistic be included. The Kolmogorov-Smirnov statistic is sensitive to the order of the categories and should only be used with variables that have a natural order (i.e. ordinal or discrete metric data). Note that the Kolmogorov-Smirnov test implemented in official Stata's ksmirnov is conservative in the case of discrete data (see, e.g., Conover 1972). The methods implemented here are exact. ksmirnov is not allowed with approx.
+---------------+ ----+ Other options +----------------------------------------------------
freq displays a table containing observed and expected frequencies.
percent displays a table containing observed and expected percent.
matrix(name) specifies that the observed and expected counts are to be taken from matrix name (see help matrix). The first column of the matrix provides the observed counts and the second column, if present, provides the the expected counts or theoretical probabilities. The uniform distribution is used if the matrix does not contain a second column. Do not provide non-integer observed counts with the ee or mc method. matrix() is not allowed with mgofi.
expected(name) specifies that the expected counts or theoretical probabilities are to be taken from column vector name (see help matrix). mgof aborts if the number of elements in the vector does not match the number outcomes.
nodots causes the progress dots for the mc and ee methods to be suppressed. The default is to display a dot for each 2 percent of completed computations.
Methods and formulas
The survey design correction procedure for the large-sample tests is based on Rao and Scott (1981) and parallels the default independence test correction used in svy: tabulate twoway (see [SVY] svy: tabulate twoway and the references therein).
Let V/(n-1) be a consistent estimate of the variance-covariance matrix of the proportions p_i, i=1...k, where n is the number of observations. Furthermore, let v_ij denote an element of V, m be the number of PSUs/clusters, and L be the number of strata. The correction then takes
F = X2 / (delta * (a2 + 1)) / d = X2 / delta / (k-1) as asymptotically F(d, d*r) distributed where X2 stands for the uncorrected test statistic and where
delta = 1/(k-1) * sum_i v_ii/p_i d = (k-1) / (1 + a2) a2 = [ 1/(k-1) * sum_ij v_ij^2/(p_i*p_j) ] / delta^2 - 1 r = m - L delta can be seen as the mean of the "generalized design effects" for the proportions, a2 as the square of their variation coefficient. V/(n-1) is estimated by proportion taking into account pweights, cluster structure, or other complex survey design properties.
Examples
Approximate chi-squared test Monte Carlo exact test Exhaustive enumeration exact test Kolmogorov-Smirnov test Complex survey design correction Test against non-uniform distribution Using the immediate command or the the matrix() option Empty cells
+------------------------------+ ----+ Approximate chi-squared test +-------------------------------------
A classical chi-squared tests against the uniform distribution can be performed as follows:
. drop _all . set seed 38 . set obs 20 obs was 0, now 20 . gen byte x = ceil(uniform()*5) . mgof x, freq cr Number of obs = 20 N of outcomes = 5 Chi2 df = 4 ---------------------------------------------- Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 10 0.0404 Log likelihood ratio | 9.55691 0.0486 Cressie-Read (2/3) | 9.699921 0.0458 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 4 2 | 1 4 3 | 9 4 4 | 2 4 5 | 5 4 -------------+---------------------- Total | 20 20
+------------------------+ ----+ Monte Carlo exact test +-------------------------------------------
If the number of observations is low, exact p-values should be computed. One approach is to approximate the exact p-values by sampling from the null distribution (mc method):
. mgof x, mc cr Percent completed (10000 replications) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Replications = 10000 ---------------------------------------------------------------------- | Exact Goodness-of-fit | Coef. P-value [99% Conf. Interval] ----------------------+----------------------------------------------- Pearson's X2 | 10 0.0369 0.0322 0.0420 Log likelihood ratio | 9.55691 0.0753 0.0687 0.0824 Cressie-Read (2/3) | 9.699921 0.0402 0.0353 0.0455 ----------------------------------------------------------------------
+-----------------------------------+ ----+ Exhaustive enumeration exact test +--------------------------------
There are (20+5-1)!/((5-1)!20!) = 10626 different ways in which 20 observations can be divided into 5 (or less) categories. Performing an exact test based on generating all these compositions would not be much of a problem. However, because the null distribution is uniform, only 192 partitions, a subset of the 10626 composition, have to be generated:
. mgof x, ee cr Percent completed (192 partitions) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Partitions = 192 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 10 0.0392 Log likelihood ratio | 9.55691 0.0773 Cressie-Read (2/3) | 9.699921 0.0432 ----------------------------------------------
+-------------------------+ ----+ Kolmogorov-Smirnov test +------------------------------------------
The ksmirnov option performs an exact Kolmogorov-Smirnov test for discrete data (not allowed with approx). Example:
. mgof x, ee nox2 nolr ksmirnov Percent completed (10626 compositions) 0 ----- 20 ------ 40 ------ 60 ------ 80 ----- 100 .................................................. Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Kolmogorov-Smirnov D | .2 0.1656 ---------------------------------------------- +----------------------------------+ ----+ Complex survey design correction +---------------------------------
mgof may be used with pweights, in which case the correction outlined in the Methods and Formulas section is applied. Example:
. gen w = exp(invnorm(uniform())) . mgof x [pw = w] Number of obs = 20 N of outcomes = 5 F df1 = 3.46153 F df2 = 65.769 --------------------------------------------------------- Goodness-of-fit | Coef. F-value P-value ----------------------+---------------------------------- Pearson's X2 | 6.867921 1.1741 0.3287 Log likelihood ratio | 7.946182 1.3584 0.2611 ---------------------------------------------------------
More generally, use the svy option to take account of the complex survey design set by svyset: . svyset [pw = w]
pweight:w VCE:linearized Single unit:missing Strata 1:<one> SU 1:<observations> FPC 1:<zero> . mgof x, svy Number of strata = 1 Number of obs = 20 Number of PSUs = 20 Pop size = 35.039 Design df = 19 N of outcomes = 5 F df1 = 3.46153 F df2 = 65.769 --------------------------------------------------------- Goodness-of-fit | Coef. F-value P-value ----------------------+---------------------------------- Pearson's X2 | 6.867921 1.1741 0.3287 Log likelihood ratio | 7.946182 1.3584 0.2611 ---------------------------------------------------------
+---------------------------------------+ ----+ Test against non-uniform distribution +----------------------------
. recode x (1=0.1) (2=0.2) (3=0.4) (4=0.2) (5=0.1), generate(p) (20 differences between x and p) . mgof x = p, freq ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 2 2 | 1 4 3 | 9 8 4 | 2 4 5 | 5 2 -------------+---------------------- Total | 20 20
+--------------------------------------------------------+ ----+ Using the immediate command or the the matrix() option +-----------
. mgofi 3 1 9 2 5 / 2 4 8 4 2, ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ---------------------------------------------- . matrix A = (3\1\9\2\5),(2\4\8\4\2) . mgof, matrix(A) ee nodots Number of obs = 20 N of outcomes = 5 Compositions = 10626 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 8.375 0.0762 Log likelihood ratio | 8.170615 0.1115 ----------------------------------------------
+-------------+ ----+ Empty cells +------------------------------------------------------
Empty cells can be introduced by adding observations with 0 weights (or by using mgofi or the matrix() option):
. gen w = 1 . set obs 21 obs was 20, now 21 . replace x = 6 in 21 (1 real change made) . replace w = 0 in 21 (1 real change made) . mgof x [fw = w], freq ee nodots Number of obs = 20 N of outcomes = 6 Partitions = 282 ---------------------------------------------- | Exact Goodness-of-fit | Coef. P-value ----------------------+----------------------- Pearson's X2 | 16 0.0071 Log likelihood ratio | 16.84977 0.0083 ---------------------------------------------- x | observed expected -------------+---------------------- 1 | 3 3.333333 2 | 1 3.333333 3 | 9 3.333333 4 | 2 3.333333 5 | 5 3.333333 6 | 0 3.333333 -------------+---------------------- Total | 20 20
Returned results
Depending on options, mgof saves the following in r():
Scalars r(N) number of observations r(N_pop) population size r(N_strata) number of strata r(N_psu) number of PSUs r(N_clust) number of clusters r(df_r) design degrees of freedom r(df) degrees of freedom for the chi2 statistics r(df1) numerator d.f. for the F statistics r(df2) denominator d.f. for the F statistics r(delta) mean generalized design effect r(a2) squared variation coefficient of generalized design effects r(reps) number of replications r(partitions) number of partitions r(compositions) number of compositions r(stat) value of test statistic r(F_stat) value of the corrected F statistic r(p_stat) p-value of r(stat) or r(F_stat) r(p_stat_srs) uncorrected p-value of r(stat) r(p_stat_lb) lower C.I. bound for r(p_stat) r(p_stat_ub) upper C.I. bound for r(p_stat)
where stat is: x2 (Pearson's X2) lr (log likelihood ratio) cr (Cressie-Read statistic) mlnp (minus log outcome probability) ksmirnov (Kolmogorov-Smirnov D)
Macros r(depvar) name of tabulated variable r(h0) definition of the theoretical distribution; either "= exp" or "= 1/k", depending on whether exp was specified, where k is the number of categories r(method) method used to compute the p-values r(stats) list of tested statistics r(lambda) lambda parameter of the Cressie-Read statistic r(citype) Monte Carlo confidence interval type r(cilevel) Monte Carlo confidence level
Matrices r(count) observed counts and expected counts
References
Conover, W. J. (1972). A Kolmogorov Goodness-of-Fit Test for Discontinuous Distributions. Journal of the American Statistical Association 67: 591-596.
Cressie, N., T. R. C. Read (1984). Multinomial Goodness-of-Fit Tests. Journal of the Royal Statistical Society (Series B) 46: 440-464.
Cressie, N., T. R. C. Read (1989). Pearson's X^2 and the Loglikelihood Ratio Statistic G^2: A Comparative Review. International Statistical Review 57: 19-43.
Hirji, K. F. (1997). A Comparison of Algorithms for Exact Goodness-of-Fit Tests for Multinomial Data. Communications in Statistics–Simulation and Computation 26: 1197-1227.
Horn, S. D. (1977). Goodness-of-Fit Tests for Discrete Data: A Review and an Application to a Health Impairment Scale. Biometrics 33: 237-247.
Jann, B. (2008). Multinomial goodness-of-fit: large sample tests with survey design correction and exact tests for small samples. The Stata Journal 8(2): 147-169. [Working paper version available from http://ideas.repec.org/p/ets/wpaper/2.html.]
Rao, J. N. K., A. J. Scott (1981). The Analysis of Categorical Data From Complex Sample Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables. Journal of the American Statistical Association 76: 221-230.
Sokal, R. R., F. J. Rohlf (1995). Biometry. The Principles and Practice of Statistics in Biological Research. Third Edition. New York: W. H. Freeman and Company.
Weesie, J. (1997). sg68: Goodness-of-fit statistics for multinomial distributions. Stata Technical Bulletin Reprints 6: 183-186.
Author
Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch
Thanks for citing mgof as follows:
Jann, B. (2008). Multinomial goodness-of-fit: large sample tests with survey design correction and exact tests for small samples. The Stata Journal 8(2): 147-169.
or
Jann, B. (2007). mgof: Stata module to perform goodness-of-fit tests for multinomial data. Available from http://ideas.repec.org/c/boc/bocode/s456854.html.
Also see
Online: ksmirnov, matrix, svy: tabulate twoway, proportion, mm_mgof(), moremata
User packages:
tabchi by Nicholas J. Cox multgof by Jeroen Weesie chi2fit by Stas Kolenikov csgof by Michael N. Mitchell