help cem
-------------------------------------------------------------------------------

Title

cem -- Coarsened Exact Matching

Syntax

cem varname1 [(cutpoints1)] [varname2 [(cutpoints2)]] ... [, options]

options               Description
-------------------------------------------------------------------------
treatment(varname)    name of the treatment variable
showbreaks            display the cutpoints used for each variable
autocuts(string)      method used to automatically generate cutpoints
k2k                   force cem to return a k2k solution
imbbreaks(string)     method used to automatically generate cutpoints for
                        imbalance checks
miname(string)        filename root of the imputed datasets, if in
                        separate files
misets(integer)       number of imputed datasets, if in separate files
impvar(string)        name of imputed dataset variable, if in stack/flong
                        format
noimbal               do not evaluate the imbalance in the matched solution

Description

cem implements the Coarsened Exact Matching method described in Iacus, King, and Porro (2008). The main inputs to cem are the variables to use and the cutpoints that define the coarsening. Users can either specify cutpoints for a variable or allow cem to coarsen the data automatically with a binning algorithm chosen by the user. To specify a set of cutpoints for a variable, place a numlist in parentheses after the variable's name. To specify an automatic coarsening, place a string naming the binning algorithm in parentheses after the variable's name. To create a certain number of equally spaced cutpoints, say 10, place "#10" in the parentheses (this will include the extreme values of the variable). Omitting the parenthetical statement after the variable name tells cem to use the default binning algorithm, itself set by autocuts. For example,

    . cem age (10 20 30 40 50) education (scott) re74, treatment(treated)

will coarsen the first variable, age, into bins of (0-10), (10-20), (20-30), (30-40), (40-50) and (50+). The Scott algorithm will be used on the second variable, education, and the third variable, re74, will use the default binning algorithm, Sturges' rule. We could also use

    . cem age (#6) education (scott) re74, treatment(treated)

to coarsen age using 6 equally spaced cutpoints. Using #0 will force cem into not coarsening the variable at all. The option autocuts can be used to reset the default binning algorithm. For example,

    . cem age education re74, treatment(treated) autocuts(fd)

will coarsen all of the variables using the Freedman-Diaconis rule.
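The coarsening described above can be pictured with a short Python sketch. It is an illustration only, not the code cem runs: the function names are hypothetical, the handling of values that fall exactly on a cutpoint is an assumption, and sturges_bins uses the textbook form of Sturges' rule, which may differ in detail from cem's implementation.

```python
from bisect import bisect_right
from math import ceil, log2

def equal_cutpoints(values, k):
    """k equally spaced cutpoints from min to max, extremes included (k >= 2)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]

def coarsen(value, cutpoints):
    """Index of the bin a value falls into, given sorted cutpoints.

    Assumption: a value equal to a cutpoint goes into the bin above it.
    """
    return bisect_right(cutpoints, value)

def stratum(row, cuts_by_var):
    """A stratum is the tuple of bin indices across all matching variables."""
    return tuple(coarsen(row[v], cuts) for v, cuts in cuts_by_var.items())

def sturges_bins(n):
    """Textbook Sturges' rule: number of bins for n observations."""
    return ceil(log2(n)) + 1

# Explicit cutpoints, as in "age (10 20 30 40 50)"
age_cuts = [10, 20, 30, 40, 50]
age_bins = [coarsen(a, age_cuts) for a in [5, 17, 23, 38, 44, 61]]
```

Two observations land in the same stratum only if every variable falls into the same bin, which is what makes the match "exact" on the coarsened data.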

cem can handle missing data in two ways. If you feed cem data with missing values, cem will simply treat the missing value as an additional category to match on. If you have multiply imputed data, you can specify one of two pieces of information, depending on how the imputations are stored. First, if the imputations are stored in stacked or flong format (the Stata default using the mi commands), then you can simply pass the name of the imputation variable to the impvar option. If the imputed datasets are in different files, you can specify the root of the imputed filenames ("imputed" if the datasets are named "imputed1.dta", "imputed2.dta", etc.) in the option miname and the number of imputations in the option misets. For example:

    . cem age education re74, treatment(treated) miname(imputed) misets(5)

In either format, cem will include all imputations in the matching process. For observations with imputed values, cem assigns strata by finding the stratum most often assigned to that observation over the imputations (this is like a plurality voting rule with ties broken randomly). Distances for the imbalance measure are calculated using the mean of the imputations. Under either format, cem will produce a stacked or flong dataset, which can be used directly with Stata's mi commands.

The k2k option will force the algorithm to create strata with equal numbers of treated and control units. This removes the need to use weights, but at a loss of information. It is recommended that you simply use the output cem_weights (see the cem documentation for more information).

Note that string variables are ignored by cem and that the ordering of value labels is used by cem. If you have an unordered variable, you may want to create dummy variables to use it in the matching process.
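For multiply imputed data, the plurality rule described above (the stratum assigned most often across imputations wins, ties broken at random) can be sketched in Python. This is an illustration under assumptions, not cem's own code: combine_strata is a hypothetical name, and the random generator is passed in only so the tie-break can be made reproducible.

```python
import random
from collections import Counter

def combine_strata(assigned, rng=random):
    """Return the stratum assigned most often across imputations.

    assigned: one stratum label per imputed dataset for a single observation.
    Ties between equally frequent strata are broken at random.
    """
    counts = Counter(assigned)
    top = max(counts.values())
    winners = sorted(s for s, c in counts.items() if c == top)
    return rng.choice(winners)

# Five imputations place one observation in stratum 3 three times
# and in stratum 7 twice, so stratum 3 wins.
chosen = combine_strata([3, 3, 7, 3, 7])
```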

Arguments

    +------+
----+ Main +-------------------------------------------------------------

varname# is a variable to be included as a covariate.

cutpoints# is either a numlist for cutpoints, a string referring to the automatic coarsening rule to use, or a pound/hash sign (#) followed by a number specifying the number of equally sized bins to use. See Description for more information and examples. The binning algorithms available are "sturges" for Sturges' rule, "fd" for the Freedman-Diaconis rule, "scott" for Scott's rule and "ss" for Shimazaki-Shinomoto's rule. Note that cutpoints# only affects varname#.

    +---------+
----+ Options +----------------------------------------------------------

treatment(varname) sets the treatment variable used for matching. This is optional; if omitted, cem will simply sort the observations into strata based on the coarsening and will not return any output related to matching.

showbreaks will have cem display the cutpoints used for each variable on the screen.

autocuts(string) sets the default automatic coarsening algorithm. The default is "sturges". Any variable without a cutpoints# specification after its name will use the autocuts argument.

k2k will have cem produce a matching result that has the same number of treated and control units in each matched stratum by randomly dropping observations.

imbbreaks(string) sets the coarsening method for the imbalance checks printed after cem runs. This should match whichever method is used for imbalance checks elsewhere. If either cem or imb has been run and there is an r(L1_breaks) available, this will be the default.

miname(string) is the root of the filenames of the imputed datasets. They should be in the working directory. For example, if miname were "imputed", then the filenames should be "imputed1.dta", "imputed2.dta" and so on.

misets(integer) is the number of imputed datasets being used for matching.
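The k2k option described above, which equalizes treated and control counts within each stratum by randomly dropping observations, can be pictured with the following Python sketch. It is a minimal illustration, not cem's implementation: k2k_prune and its input layout are hypothetical, and uniform random sampling within a stratum is an assumption about how the dropped units are chosen.

```python
import random

def k2k_prune(units, rng):
    """Keep equal numbers of treated and control units in every stratum.

    units: list of (unit_id, stratum, treated) tuples.
    Returns the sorted ids of the units that are kept.
    """
    by_stratum = {}
    for uid, s, treated in units:
        groups = by_stratum.setdefault(s, {True: [], False: []})
        groups[bool(treated)].append(uid)

    kept = []
    for groups in by_stratum.values():
        # The smaller group sets how many units each side may keep.
        k = min(len(groups[True]), len(groups[False]))
        for ids in groups.values():
            kept.extend(rng.sample(ids, k))
    return sorted(kept)

units = [
    (1, "A", 1), (2, "A", 0), (3, "A", 0), (4, "A", 0),  # 1 treated, 3 controls
    (5, "B", 1), (6, "B", 1),                            # no controls: unmatched
]
kept = k2k_prune(units, random.Random(42))
```

In this example stratum A keeps its one treated unit and one randomly chosen control, and stratum B is dropped entirely. The two discarded controls in A are the "loss of information" that weighting with cem_weights avoids.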

Output

The following are added as variables to the main Stata dataset. If you are using miname() for multiple imputation, cem will save each of these to each of the .dta files.

cem_strata     the stratum that cem assigned each observation

cem_weights    the weight assigned to the observation's stratum. Equals 0 if the observation is unmatched and 1 if the observation is treated.

cem_matched    indicator if the observation was matched

cem_treat      when using the multiple imputation features, cem outputs this variable, which is the treatment vector used for matching. cem applies the same combination rule to treatment as to strata.
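To make cem_weights concrete, here is a Python sketch of one way such weights can be computed. The help text above only states that unmatched units get 0 and treated units get 1. The formula used for matched controls, (m_C / m_T) * (m_T^s / m_C^s), is the one usually given in the CEM literature and is an assumption here, not something this help file confirms. The name cem_weights_sketch is hypothetical.

```python
from collections import defaultdict

def cem_weights_sketch(units):
    """units: list of (stratum, treated) pairs. Returns one weight per unit."""
    n_t = defaultdict(int)  # treated count per stratum
    n_c = defaultdict(int)  # control count per stratum
    for s, treated in units:
        if treated:
            n_t[s] += 1
        else:
            n_c[s] += 1

    # A stratum is matched when it holds at least one unit of each kind.
    matched = {s for s in n_t if n_c.get(s, 0) > 0}
    m_t = sum(n_t[s] for s in matched)
    m_c = sum(n_c[s] for s in matched)

    weights = []
    for s, treated in units:
        if s not in matched:
            weights.append(0.0)      # unmatched observation
        elif treated:
            weights.append(1.0)      # matched treated unit
        else:
            # Assumed control weight: (m_C / m_T) * (m_T^s / m_C^s)
            weights.append((m_c / m_t) * (n_t[s] / n_c[s]))
    return weights

units = [("a", 1), ("a", 0), ("a", 0),   # stratum a: 1 treated, 2 controls
         ("b", 1), ("b", 1), ("b", 0),   # stratum b: 2 treated, 1 control
         ("c", 0)]                       # stratum c: control only, unmatched
w = cem_weights_sketch(units)
```

Under this formula the control weights sum to the number of matched controls, so controls in strata with relatively many treated units count for more.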

Saved Results

Scalars

r(n_strata)      number of strata
r(n_groups)      number of levels of the treatment variable
r(n_mstrata)     number of strata with matches
r(n_matched)     number of matched observations
r(L1)            multivariate imbalance measure

Matrices

r(match_table)   cross tabulation of treatment and matched status
r(groups)        tabulation of treatment variable
r(imbal)         matrix of univariate imbalance measures

Strings

r(varlist)       list of covariate variables used
r(treatment)     treatment variable used for matching
r(cem_call)      call to cem
r(L1_breaks)     break method used for L1 distance
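The multivariate imbalance measure in r(L1) is not defined in this help file. As a rough guide, the Python sketch below uses the definition usually given in the CEM literature: half the sum, over the cells of the coarsened multivariate histogram, of the absolute difference between the treated and control relative frequencies. Treat this as an assumption about what cem reports. The function name is hypothetical, and any weighting that cem applies to matched data is left out.

```python
from collections import Counter

def l1_imbalance(treated_cells, control_cells):
    """L1 distance between two multivariate histograms.

    Each argument lists the histogram cell (for example, a tuple of bin
    indices) of every unit in that group. Returns a value in [0, 1]:
    0 means identical distributions, 1 means no overlap at all.
    """
    f = Counter(treated_cells)
    g = Counter(control_cells)
    n_t, n_c = len(treated_cells), len(control_cells)
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / n_t - g[c] / n_c) for c in cells)

same = l1_imbalance(["a", "b"], ["a", "b"])
disjoint = l1_imbalance(["a", "a"], ["b"])
partial = l1_imbalance(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because the value depends on how the cells are defined, comparisons of L1 are only meaningful when the same break method (see r(L1_breaks) and imbbreaks) is used throughout.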

References and Distribution

cem is licensed under GPL2. For more information, see: http://gking.harvard.edu/cem/

For a full reference on Coarsened Exact Matching, see:

Stefano M. Iacus, Gary King, and Giuseppe Porro, "Matching for Causal Inference Without Balance Checking", copy at <http://gking.harvard.edu/files/abs/cem-abs.shtml>

To report bugs or give comments, please contact Matthew Blackwell <blackwel@fas.harvard.edu>.