help cem -------------------------------------------------------------------------------


cem -- Coarsened Exact Matching


cem varname1 [(cutpoints1)] [varname2 [(cutpoints2)]] ... [, options]

    options               Description
    -------------------------------------------------------------------------
    treatment(varname)    name of the treatment variable
    showbreaks            display the cutpoints used for each variable
    autocuts(string)      method used to automatically generate cutpoints
    k2k                   force cem to return a k2k solution
    imbbreaks(string)     method used to automatically generate cutpoints for
                            imbalance checks
    miname(string)        filename root of the imputed datasets, if in
                            separate files
    misets(integer)       number of imputed datasets, if in separate files
    impvar(string)        name of imputed dataset variable, if in stack/flong
                            format
    noimbal               do not evaluate the imbalance in the matched solution


cem implements the Coarsened Exact Matching method described in Iacus, King, and Porro (2008). The main inputs for cem are the variables to use and the cutpoints that define the coarsening. Users can either specify cutpoints for a variable or allow cem to coarsen the data automatically using a binning algorithm chosen by the user. To specify a set of cutpoints for a variable, place a numlist in parentheses after the variable's name. To specify an automatic coarsening, place a string naming the binning algorithm in parentheses after the variable's name. To create a certain number of equally spaced cutpoints, say 10, place "#10" in the parentheses (the cutpoints will include the extreme values of the variable). Omitting the parenthetical statement after the variable name tells cem to use the default binning algorithm, which is itself set by autocuts. For example,

. cem age (10 20 30 40 50) education (scott) re74, treatment(treated)

will coarsen the first variable, age, into bins of (0-10), (10-20), (20-30), (30-40), (40-50) and (50+). The Scott algorithm will be used on the second variable, education, and the third variable, re74, will use the default binning algorithm, Sturges' rule. We could also use

. cem age (#6) education (scott) re74, treatment(treated)

to coarsen age using 6 equally spaced cutpoints. Using #0 tells cem not to coarsen the variable at all. The option autocuts can be used to reset the default binning algorithm. For example,

. cem age education re74, treatment(treated) autocuts(fd)

will coarsen all of the variables using the Freedman-Diaconis rule.
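To make the coarsening step concrete, here is a minimal Python sketch (an illustration, not cem's own code) of how a numlist of cutpoints, or a "#k" specification, can map values to bins. The function names are hypothetical, and the convention that a value equal to a cutpoint falls into the upper bin is an assumption; cem's handling of boundary values may differ.

```python
from bisect import bisect_right


def equally_spaced_cutpoints(values, k):
    """Return k equally spaced cutpoints from min(values) to max(values).

    Mirrors the "#k" idea: the extreme values are themselves cutpoints.
    Hypothetical helper; assumes k >= 2.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def coarsen(values, cutpoints):
    """Map each value to the index of the bin it falls in.

    Bin 0 holds everything below the first cutpoint; the last bin holds
    everything at or above the final cutpoint (boundary rule is assumed).
    """
    cuts = sorted(cutpoints)
    return [bisect_right(cuts, v) for v in values]


ages = [8, 17, 25, 25, 42, 63]
print(coarsen(ages, [10, 20, 30, 40, 50]))
print(equally_spaced_cutpoints([20, 30, 70], 6))
```

Units that share the same bin index on every coarsened variable end up in the same stratum, which is what makes the later exact match possible.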

cem can handle missing data in two ways. If you feed cem data with missing values, cem will simply treat the missing value as an additional category to match on. If you have multiply imputed data, you can specify one of two pieces of information, depending on how the imputations are stored. First, if the imputations are stored in stacked or flong format (the Stata default using the mi commands), then you can simply pass the name of the imputation variable to the impvar option. Second, if the imputed datasets are in different files, you can specify the root of the imputed filenames ("imputed" if the datasets are named "imputed1.dta", "imputed2.dta", etc.) in the option miname and the number of imputations in the option misets. For example:

. cem age education re74, treatment(treated) miname(imputed) misets(5)

In either format, cem includes all imputations in the matching process. For observations with imputed values, cem assigns a stratum by finding the stratum most often assigned to that observation over the imputations (this is like a plurality voting rule with ties broken randomly). Distances for the imbalance measure are calculated using the mean of the imputations. Under either format, cem produces a stacked or flong dataset, which can be used directly with Stata's mi commands.
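The plurality rule described above can be sketched in a few lines of Python. This is only an illustration of the voting idea under the stated tie-breaking rule; the function name and the list-of-labels input are assumptions, not cem's internal representation.

```python
import random
from collections import Counter


def plurality_stratum(strata_across_imputations, rng=random):
    """Pick the stratum assigned most often; break ties at random."""
    counts = Counter(strata_across_imputations)
    top = max(counts.values())
    winners = [s for s, c in counts.items() if c == top]
    return rng.choice(winners)


# One observation, five imputed datasets, one stratum label per dataset.
print(plurality_stratum([3, 3, 5, 3, 7]))  # stratum 3 appears most often
print(plurality_stratum([2, 2, 9, 9, 4]))  # tie: either 2 or 9 is returned
```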

The k2k option forces the algorithm to create strata with equal numbers of treated and control units. This removes the need to use weights, but at a loss of information. We recommend that you instead simply use the output variable cem_weights (see the cem documentation for more information).
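As a rough picture of what a k2k solution looks like, the following Python sketch randomly drops units within each stratum until treated and control counts are equal. The tuple layout, the function name, and the seed argument are hypothetical conveniences; cem's own pruning may select units differently.

```python
import random
from collections import defaultdict


def k2k_prune(units, seed=0):
    """Randomly drop units so each stratum keeps equal treated/control counts.

    `units` is a list of (unit_id, stratum, treated) tuples (assumed layout).
    Returns the sorted ids of the units that are kept.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(lambda: {True: [], False: []})
    for unit_id, stratum, treated in units:
        by_stratum[stratum][bool(treated)].append(unit_id)

    kept = []
    for groups in by_stratum.values():
        # Keep as many units per group as the smaller group can supply.
        k = min(len(groups[True]), len(groups[False]))
        kept += rng.sample(groups[True], k) + rng.sample(groups[False], k)
    return sorted(kept)


units = [
    (1, "a", 1), (2, "a", 0), (3, "a", 0), (4, "a", 0),
    (5, "b", 1), (6, "b", 1), (7, "b", 0),
    (8, "c", 1),
]
print(k2k_prune(units))
```

In this toy input, stratum "a" loses two controls, stratum "b" loses one treated unit, and stratum "c" is dropped entirely, which is the loss of information the paragraph above refers to.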

Note that string variables are ignored by cem and that the ordering of value labels is used by cem. If you have an unordered variable, you may want to create dummy variables for its categories and use those in the matching process.


        +------+
    ----+ Main +-------------------------------------------------------------

varname# is a variable to be included as a covariate.

cutpoints# is either a numlist of cutpoints, a string referring to the automatic coarsening rule to use, or a pound/hash sign (#) followed by a number specifying the number of equally sized bins to use. See Description for more information and examples. The binning algorithms available are "sturges" for Sturges' rule, "fd" for the Freedman-Diaconis rule, "scott" for Scott's rule and "ss" for Shimazaki-Shinomoto's rule. Note that cutpoints# only affects varname#.
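For readers unfamiliar with the named rules, the sketch below gives the textbook forms of three of them in Python. These are the standard formulas, not a transcription of cem's code: the constant 3.49 in Scott's rule, the use of the sample standard deviation, and the quartile method are assumptions that implementations vary on, and the Shimazaki-Shinomoto rule is omitted because it requires an optimisation loop.

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' rule: number of bins = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def scott_width(values):
    """Scott's rule: bin width = 3.49 * sd * n^(-1/3)."""
    n = len(values)
    return 3.49 * statistics.stdev(values) * n ** (-1 / 3)


def fd_width(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) * n ** (-1 / 3)


def bins_from_width(values, width):
    """Convert a bin width into a bin count covering the data range."""
    return max(1, math.ceil((max(values) - min(values)) / width))


data = list(range(1, 101))
print(sturges_bins(len(data)))
print(bins_from_width(data, scott_width(data)))
print(bins_from_width(data, fd_width(data)))
```

Sturges' rule depends only on the sample size, while the Scott and Freedman-Diaconis rules also react to the spread of the variable, so they can give quite different coarsenings for skewed covariates.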

        +---------+
    ----+ Options +----------------------------------------------------------

treatment(varname) sets the treatment variable used for matching. This is optional and if omitted, cem will simply sort the observations into strata based on the coarsening and not return any output related to matching.

showbreaks will have cem display the cutpoints used for each variable on the screen.

autocuts(string) sets the default automatic coarsening algorithm. The default for this is "sturges". Any variable without a cutpoints# specification after its name will use the autocuts argument.

k2k will have cem produce a matching result that has the same number of treated and control units in each matched stratum by randomly dropping observations.

imbbreaks(string) sets the coarsening method for the imbalance checks printed after cem runs. This should match whichever method is used for imbalance checks elsewhere. If either cem or imb has been run and an r(L1_breaks) is available, this will be the default.

miname(string) is the root of the filenames of the imputed datasets. They should be in the working directory. For example, if miname were "imputed", then the filenames should be "imputed1.dta", "imputed2.dta" and so on.

misets(integer) is the number of imputed datasets being used for matching.


The following are added as variables to the main Stata dataset. If you are using miname() for multiple imputation, cem will save each of these to each of the .dta files.

    cem_strata     the stratum that cem assigned each observation
    cem_weights    the weight assigned to the observation's stratum. Equals 0
                     if the observation is unmatched and 1 if the observation
                     is treated.
    cem_matched    indicator if the observation was matched
    cem_treat      when using the multiple imputation features, cem outputs
                     this variable, which is the treatment vector used for
                     matching. cem applies the same combination rule to
                     treatment as to strata.
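The text above states only that unmatched observations get weight 0 and treated observations get weight 1. The Python sketch below fills in the control weights with the formula usually given in the CEM literature, (m_C / m_T) * (m_T_s / m_C_s), where m_C and m_T count matched control and treated units overall and m_C_s and m_T_s count them within stratum s. Treat that formula, the function name, and the tuple layout as assumptions for illustration rather than a statement of what cem computes.

```python
from collections import Counter


def cem_style_weights(units):
    """Return {unit_id: weight} for (unit_id, stratum, treated) tuples.

    Assumed rule: unmatched = 0, matched treated = 1, and a matched control
    in stratum s gets (m_C / m_T) * (m_T_s / m_C_s).
    """
    n_t = Counter(s for _, s, t in units if t)
    n_c = Counter(s for _, s, t in units if not t)
    # A stratum is matched only if it holds both treated and control units.
    matched = {s for s in n_t if n_c[s] > 0}
    m_t = sum(n_t[s] for s in matched)
    m_c = sum(n_c[s] for s in matched)

    weights = {}
    for unit_id, s, t in units:
        if s not in matched:
            weights[unit_id] = 0.0
        elif t:
            weights[unit_id] = 1.0
        else:
            weights[unit_id] = (m_c / m_t) * (n_t[s] / n_c[s])
    return weights


sample = [
    (1, "a", 1), (2, "a", 0), (3, "a", 0), (4, "a", 0),
    (5, "b", 1), (6, "b", 1), (7, "b", 0),
    (8, "c", 1),
]
w = cem_style_weights(sample)
print(w)
```

Under this formula the control weights sum to the number of matched controls, so the weighted control group mirrors the distribution of treated units across strata.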

Saved Results

Scalars
    r(n_strata)      number of strata
    r(n_groups)      number of levels of the treatment variable
    r(n_mstrata)     number of strata with matches
    r(n_matched)     number of matched observations
    r(L1)            multivariate imbalance measure

Matrices
    r(match_table)   cross tabulation of treatment and matched status
    r(groups)        tabulation of treatment variable
    r(imbal)         matrix of univariate imbalance measures

Strings
    r(varlist)       list of covariate variables used
    r(treatment)     treatment variable used for matching
    r(cem_call)      call to cem
    r(L1_breaks)     break method used for L1 distance
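The multivariate imbalance measure saved in r(L1) is, in the CEM literature, half the sum of absolute differences between the treated and control relative frequencies over the cells of the coarsened cross-tabulation. The Python sketch below computes that unweighted quantity; the function name and inputs are hypothetical, and cem's reported value on matched data may additionally apply the matching weights.

```python
from collections import Counter


def l1_imbalance(treated_cells, control_cells):
    """L1 = 0.5 * sum over cells of |f_cell - g_cell|.

    Each argument lists the coarsened cell (any hashable label) of every
    unit in that group; f and g are the groups' relative frequencies.
    """
    f = Counter(treated_cells)
    g = Counter(control_cells)
    n_f, n_g = len(treated_cells), len(control_cells)
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / n_f - g[c] / n_g) for c in cells)


print(l1_imbalance(["a", "a", "b"], ["a", "a", "b"]))  # identical groups
print(l1_imbalance(["a", "a"], ["b", "b"]))            # no overlap at all
```

The measure runs from 0 (identical distributions over the cells) to 1 (complete separation), and its value depends on the breaks used to define the cells, which is why r(L1_breaks) is saved alongside it.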

References and Distribution

cem is licensed under GPL2. For more information, see:

For a full reference on Coarsened Exact Matching, see:

Stefano M. Iacus, Gary King, and Giuseppe Porro, "Matching for Causal Inference Without Balance Checking", copy at <>

To report bugs or give comments, please contact Matthew Blackwell <>.