reclass
reclass varname [, pcnttab(string) adjust bestmod q(real) u(real) dist(real) assoc(string) misclass(real) format(string) verbose ]
Description
reclass is called by perturb to create a table of reclassification probabilities for use in a perturbation analysis. It can also be used separately to experiment with different association patterns. varname should refer to a categorical variable. reclass creates a table of reclassification probabilities such that the expected frequencies of the reclassified variable are equal to the frequency distribution of varname. In addition, an appropriate association is imposed between the reclassified variable and varname.
Options
pcnttab can be either a single value, a row or column matrix, or a square matrix. Usually, a single value between 0 and 100 will be specified, indicating the percentage of cases to be reclassified to the same category.
If a row or column matrix is specified its dimensions must correspond with the number of categories of varname. Values should be between 0 and 100 and indicate the percentage of cases to be reclassified to the same category for each category separately.
If a square matrix is specified, its dimensions must correspond with the number of categories of varname. The matrix should indicate the reclassification probabilities with the original variable in the rows and the reclassified variable in the columns. Either percentages or probabilities may be used. It is not necessary for these to add to 100 or to 1 respectively, as reclass transforms them into columnwise proportions.
In most cases, the pcnttab option will suffice. The options below are useful for users familiar with loglinear models for square tables (mobility models).
adjust By default, reclass defines reclassification probabilities such that the expected frequencies of the reclassified variable are the same as those of varname when the pcnttab option is used. Use noadjust to suppress this and use the percentages specified in the pcnttab option unmodified. noadjust implies nobestmod.
bestmod By default, reclass imposes an appropriate pattern of association between varname and its reclassified counterpart when the pcnttab option is used. Use nobestmod to avoid this. The reclassification probabilities will be adjusted to make the expected frequencies of the reclassified variable equal to those of varname but they will otherwise be close approximations of the values specified in the pcnttab option.
misclass Maintained for compatibility with the original version of misclass and perturb. Translated by reclass into pcnttab(100-misclass) noadjust.
The options below can be used to specify the parameters of a pattern of association. They will be ignored if pcnttab was specified.
q the multiplicative parameter of a quasi-independence (constrained) model.
u the multiplicative parameter of a uniform association model.
dist the multiplicative parameter of a distance model.
assoc This allows users familiar with loglinear mobility models to specify an association pattern of their own choice. The argument for assoc should refer to a Stata program in which the variable paras is defined as a function of the row variable orig and the column variable dest to produce a loglinear pattern of association.
format Specify a valid format for printing results. The default is %8.3f.
verbose Display debugging information.
Remarks
The basic idea of reclassifying cases in a perturbation analysis is that each case will have a high probability, say 95%, of being reclassified into the same category. The remaining cases could then be distributed evenly among the remaining categories. There are two problems with this approach. First, smaller categories will tend to grow and larger categories will shrink after reclassification. Second, the association between the original and the reclassified variable will be arbitrary, with some reclassification categories being more likely than others. Both problems occur more strongly to the extent that the variable in question is unevenly distributed.
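The first problem can be checked with a few lines of arithmetic. The sketch below is not part of reclass; it is a Python illustration using an assumed three-category distribution (70%, 20%, 10%) and the naive scheme described above, with 95% of cases staying put and the remainder spread evenly.

```python
# Illustrative numbers only: a three-category variable with an uneven distribution.
margin = [0.70, 0.20, 0.10]   # assumed frequency distribution of the original variable
stay = 0.95                   # probability of being reclassified to the same category

k = len(margin)
move = (1 - stay) / (k - 1)   # remainder spread evenly over the other categories

# probs[i][j]: probability that a case in original category i ends up in category j
probs = [[stay if i == j else move for j in range(k)] for i in range(k)]

# Expected distribution of the reclassified variable under the naive scheme
reclassified = [sum(margin[i] * probs[i][j] for i in range(k)) for j in range(k)]

for j in range(k):
    print(f"category {j + 1}: original {margin[j]:.4f}, reclassified {reclassified[j]:.4f}")
```

With these assumed numbers the largest category shrinks from 0.70 to 0.6725, while the smallest grows from 0.10 to 0.1175.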
reclass solves these problems by creating an initial table of expected frequencies for the original by the reclassified variable, given the initial reclassification probabilities as specified by the pcnttab option. The parameters for an appropriate pattern of association between the original and the reclassified variable are derived from this table. Then an adjusted table of expected frequencies is created with the pattern of association found, such that the expected frequency distribution of the reclassified variable is identical to that of the original.
When a single percentage is specified in pcnttab, this percentage is used for the diagonal cells of the initial reclassification probabilities, with the remaining percentages distributed evenly among the other categories. The appropriate pattern of association in this case is a "quasi-independence (constrained)" loglinear mobility model (Hout 1983, Goodman 1984). The QI-C model makes the odds of reclassification to the same/different category the same for all categories. In addition, the reclassified category is independent of the original category, given that they are not the same. This model is fitted to the initial table of expected frequencies and a single ln(q) parameter is reported. This ln(q) parameter is used to create the adjusted table.
If the argument to pcnttab consists of a row or column vector of percentages, reclass uses different odds of reclassification to the same versus a different category for each category. The pcnttab argument forms the diagonal of the initial reclassification percentages; the remaining percentages are distributed evenly among the other categories. A quasi-independence model is fitted to the initial table of expected frequencies, with separate odds per category for reclassification to the same versus a different category. An ln(q) parameter is reported for each category of varname. These parameters are then used to create the adjusted table.
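The construction of the initial reclassification percentages for the single-value and vector cases can be sketched as follows. This is a Python illustration, not the code reclass runs; the function name initial_table is hypothetical, and the layout (original category in the rows) is an assumption taken from the description of the square-matrix case.

```python
def initial_table(pcnt, k):
    """Return a k x k list of initial reclassification percentages.

    pcnt is either one number (same diagonal for every category) or a
    sequence of k numbers (one diagonal value per category).
    Rows index the original category (assumed layout); each row sums to 100.
    """
    if isinstance(pcnt, (int, float)):
        diag = [float(pcnt)] * k
    else:
        diag = [float(p) for p in pcnt]
    if len(diag) != k:
        raise ValueError("need one percentage per category")
    # Diagonal cell gets the stated percentage; the rest is split evenly.
    return [
        [diag[i] if i == j else (100.0 - diag[i]) / (k - 1) for j in range(k)]
        for i in range(k)
    ]

single = initial_table(95, 3)             # same percentage for every category
per_cat = initial_table([90, 95, 80], 3)  # one percentage per category
```

Here single has rows such as [95, 2.5, 2.5], and the third row of per_cat is [10, 10, 80].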
If the argument to pcnttab consists of a square matrix of percentages, reclass assumes that varname is an ordered categorical variable. Consequently, the percentages should be constructed so that short distance reclassification is more likely than long distance reclassification. reclass fits two models to the initial table of expected frequencies, a "quasi-distance" model and a "quasi-uniform association" model.
Both make short distance reclassification more likely than long distance, but this is even more pronounced for a quasi-uniform association model than for a quasi-distance model. A quasi-distance model makes the likelihood of reclassification proportionately lower for each step away from the main diagonal. A quasi-uniform model, on the other hand, is equivalent to a squared distance model: reclassification is proportionately less likely for the squared number of steps from the main diagonal.
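The equivalence between uniform association and squared distance follows from a simple identity. In the sketch below, i and j denote the scores of the original and the reclassified category (assumed here to be consecutive integers) and u is the uniform association parameter; the "row effect" and "column effect" labels are added for illustration.

```latex
\[
  ij = \tfrac{1}{2}\left(i^2 + j^2\right) - \tfrac{1}{2}(i-j)^2
  \quad\Longrightarrow\quad
  ij\,\ln u
  = \underbrace{\tfrac{1}{2}\, i^2 \ln u}_{\text{row effect}}
  + \underbrace{\tfrac{1}{2}\, j^2 \ln u}_{\text{column effect}}
  - \tfrac{1}{2}\,(i-j)^2 \ln u .
\]
```

The first two terms depend on only one of the two variables and are therefore absorbed by the main effects of the loglinear model. What remains is a term proportional to the squared number of steps from the main diagonal, whereas a distance model uses a term proportional to the absolute number of steps, |i - j|.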
Both models include an ln(q) parameter that increases the likelihood of reclassification to the same category without affecting short or long distance reclassification. In addition, the quasi-uniform model reports an ln(u) parameter and the quasi-distance model reports an ln(dist) parameter. The best fitting model is chosen by reclass and the deviance and df are reported.
If the patterns of association used by reclass are not in fact appropriate to the problem at hand, the nobestmod option could be used. The final reclassification percentages will then be as close as possible to those in the pcnttab option. The reclassification probabilities will be adjusted to make the expected frequencies of the reclassified variable equal to those of the original, leading to some discrepancies.
Alternatively, reclass could be run using the noadjust option. The returned result r(gentab) is then equal to the initial table of expected frequencies. r(gentab) could be used in a loglinear analysis to derive an appropriate model of association, which could then be specified in the assoc option.
The adjusted table is created using a loglinear model of equal main effects (a halfway model) and the appropriate pattern of association as an offset variable. This is fitted to an arbitrary table with the frequency distribution of varname as both its row and column marginals. The predicted frequencies of this model form a symmetric table with the pre-specified marginals and pattern of association (Hendrickx, 2004; Kaufman & Schervish, 1986).
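The effect of this step can be reproduced numerically. The sketch below does not fit a loglinear model as reclass does; it swaps in iterative proportional fitting, which for a symmetric association pattern and identical row and column marginals converges to a table of the same form. The function name adjusted_table, the marginals, and the value q = 20 are all assumptions chosen for illustration.

```python
import math

def adjusted_table(margin, log_assoc, iterations=1000):
    """Scale exp(log_assoc) so that rows and columns both sum to margin.

    Iterative proportional fitting: alternately rescale rows and columns.
    The odds ratios of the starting table (the association pattern) are kept.
    """
    k = len(margin)
    table = [[math.exp(log_assoc[i][j]) for j in range(k)] for i in range(k)]
    for _ in range(iterations):
        for i in range(k):                      # match row totals
            s = sum(table[i])
            table[i] = [x * margin[i] / s for x in table[i]]
        for j in range(k):                      # match column totals
            s = sum(table[i][j] for i in range(k))
            for i in range(k):
                table[i][j] *= margin[j] / s
    return table

margin = [0.70, 0.20, 0.10]          # assumed distribution of the original variable
ln_q = math.log(20.0)                # hypothetical quasi-independence parameter
k = len(margin)
log_assoc = [[ln_q if i == j else 0.0 for j in range(k)] for i in range(k)]

table = adjusted_table(margin, log_assoc)

# Reclassification probabilities: each row divided by its total
probs = [[table[i][j] / margin[i] for j in range(k)] for i in range(k)]
```

The resulting table is symmetric, both sets of marginals equal the assumed distribution, and the odds ratio between any two diagonal cells and their off-diagonal counterparts equals q squared.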
If the q, u, or dist parameters are known, these can be specified directly in the corresponding options. For other patterns of association, the assoc option could be used. assoc should refer to a small program that defines the variable paras in terms of orig and dest to produce a loglinear pattern of association. For example:
    program define q25u5
        gen paras=(orig==dest)*ln(25) + orig*dest*ln(5)
    end
The command reclass myvar, assoc(q25u5) will produce reclassification probabilities for the variable myvar using the program q25u5. This is equivalent to using the options q(25) and u(5). Other loglinear mobility models can be defined in a similar fashion.
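To see what such a program produces, the pattern can be tabulated outside Stata. The Python sketch below mirrors the expression for paras; the choice of three categories coded 1 to 3 is an assumption made for illustration.

```python
import math

def paras(orig, dest, q=25.0, u=5.0):
    # Mirrors the Stata expression:
    #   gen paras=(orig==dest)*ln(25) + orig*dest*ln(5)
    return (orig == dest) * math.log(q) + orig * dest * math.log(u)

k = 3  # assumed number of categories, coded 1..k
pattern = [[paras(i, j) for j in range(1, k + 1)] for i in range(1, k + 1)]

for row in pattern:
    print("  ".join(f"{value:7.3f}" for value in row))
```

Diagonal cells receive both the ln(25) term and the product term, while off-diagonal cells receive only the product term, so the first diagonal cell equals ln(25) + ln(5) and the cell for orig 1, dest 2 equals 2*ln(5).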
Saved results
r(gentab) the initial expected table for the original by reclassified variable
r(margin) the frequency distribution of varname
r(classprob) the cumulative reclassification probabilities. These will be used by perturb to randomly reclassify varname.
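Cumulative probabilities are convenient because a single uniform random draw then selects the new category. The Python sketch below shows the general technique only; the layout assumed here (one list of cumulative probabilities per original category, with 0-based category indices) and the function name reclassify are hypothetical and need not match how perturb stores or reads r(classprob).

```python
import bisect
import random

def reclassify(orig, cumprob, u=None):
    """Draw a new category for a case whose original category is orig.

    cumprob[orig] holds cumulative reclassification probabilities ending at 1.0.
    u is a uniform draw on [0, 1); it is generated if not supplied.
    """
    if u is None:
        u = random.random()
    row = cumprob[orig]
    # First position whose cumulative probability exceeds u
    idx = bisect.bisect_right(row, u)
    return min(idx, len(row) - 1)   # guard against rounding at the upper end

# Assumed example: 95% stay, 2.5% to each other category
cumprob = [
    [0.95, 0.975, 1.0],    # original category 0
    [0.025, 0.975, 1.0],   # original category 1
    [0.025, 0.05, 1.0],    # original category 2
]

new_category = reclassify(0, cumprob)
```

Passing u explicitly makes the draw reproducible: with the assumed table, reclassify(0, cumprob, u=0.5) returns 0 and reclassify(0, cumprob, u=0.96) returns 1.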
References
Goodman, Leo A. (1984). The analysis of cross-classified data having ordered categories. Cambridge, Mass.: Harvard University Press.
Hendrickx, J. (2004). Using standardised tables for interpreting loglinear models. Submitted to Quality & Quantity.
Hout, M. (1983). Mobility tables. Beverly Hills: Sage Publications.
Kaufman, R.L., & Schervish, P.G. (1986). Using adjusted crosstabulations to interpret log-linear relationships. American Sociological Review 51:717-733.
http://www.xs4all.nl/~jhckx/perturb/
Direct comments to: John Hendrickx
reclass is available at SSC-IDEAS. Use ssc install perturb to obtain the latest version.
Also see On-line: help for vif, collin, coldiag, coldiag2, perturb