help synth
-------------------------------------------------------------------------------

Title

synth -- Synthetic control methods for comparative case studies

Syntax

synth depvar predictorvars, trunit(#) trperiod(#) [counit(numlist) xperiod(numlist) mspeperiod() resultsperiod() nested allopt unitnames(varname) figure keep(file) customV(numlist) optsettings]

The dataset must be declared as a (balanced) panel dataset using tsset panelvar timevar; see tsset. Variables specified in depvar and predictorvars must be numeric variables; abbreviations are not allowed.
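
Before calling synth, it can help to confirm that the data meet these requirements. The following is a minimal sketch on a small made-up panel (the variable names id, year, y, and x1 are hypothetical). It uses only built-in Stata commands and is not part of synth itself.

```stata
* Toy panel: 2 units observed in the same 3 years (hypothetical data)
clear
input id year y x1
1 1980 10 1.5
1 1981 11 1.7
1 1982 12 1.9
2 1980 20 2.5
2 1981 21 2.7
2 1982 22 2.9
end

* Declare the panel structure
tsset id year

* Balance check: tsfill, full adds observations only if periods are missing,
* so an unchanged observation count means the panel is already balanced
local n_before = _N
tsfill, full
assert _N == `n_before'

* Outcome and predictors must be numeric
confirm numeric variable y x1
```

If the assert fails, some unit lacks some time period and the panel would need to be completed or trimmed before use.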

Description

synth implements the synthetic control method for causal inference in comparative case studies. synth estimates the effect of an intervention of interest by comparing the evolution of an aggregate outcome depvar for a unit affected by the intervention to the evolution of the same aggregate outcome for a synthetic control group. synth constructs this synthetic control group by searching for a weighted combination of control units chosen to approximate the unit affected by the intervention in terms of the outcome predictors. The evolution of the outcome for the resulting synthetic control group is an estimate of the counterfactual of what would have been observed for the affected unit in the absence of the intervention. synth can also be used to conduct a variety of placebo and permutation tests that produce informative inference regardless of the number of available comparison units and the number of available time periods. See Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010) for details.
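
To make the idea of a weighted combination concrete, the sketch below forms a synthetic outcome path from two control units by hand. The outcome values and the weights are made up for illustration; in practice synth estimates the weights.

```stata
* Rows are time periods, columns are control units (hypothetical outcomes)
matrix Y0 = (10, 20 \ 12, 22)

* Unit weights: nonnegative and summing to one (hypothetical values)
matrix W = (0.25 \ 0.75)

* Synthetic control outcome = weighted average of the control outcomes
matrix Y_synth = Y0 * W
matrix list Y_synth
```

Here the synthetic outcome is 17.5 in the first period and 19.5 in the second, each lying between the two control units' values.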

Required Settings

predictorvars the list of predictor variables. By default, all predictor variables are averaged over the entire pre-intervention period, which ranges from the earliest time period available in the panel time variable specified in tsset timevar to the period immediately prior to the intervention specified in trperiod. Missing values are ignored in the averages. The user has two options to flexibly specify the time periods over which predictors are averaged:

(1) xperiod(numlist) specifies a common period over which all predictors are averaged; see below for details.

(2) For each particular predictor, the user can specify the period over which the variable is averaged. For this, synth uses a specialized syntax. The time period is specified in parentheses directly following the variable name, e.g. varname(period), with no blanks between the variable name and its period. period can contain either a single period, a numlist of periods, or several periods concatenated by "&". The periods refer to the panel time variable specified in tsset timevar. For example, assume the time periods are given in years and there are four predictors X1, X2, X3, and X4; then

. synth Y X1(1980) X2(1982&1986&1988) X3(1980(1)1990) X4

indicates that:

X1(1980): the value of the variable X1 in the year 1980 is entered as a predictor.

X2(1982&1986&1988): the value of the variable X2 averaged over the years 1982, 1986, and 1988 is entered as a predictor.

X3(1980(1)1990): the value of the variable X3 averaged over the years 1980,1981,...,1990 is entered as a predictor.

X4: since no variable-specific period is provided, the value of the variable X4 is averaged either over the entire pretreatment period (default) or over the period specified in xperiod(numlist) and then entered as a predictor.
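
The period specifications above follow Stata's numlist conventions, and the averaging can be reproduced by hand. The sketch below uses a made-up variable x2 for a single unit. It illustrates the arithmetic only and is not a description of how synth is implemented internally.

```stata
* Expand a numlist such as the one in X3(1980(1)1990)
numlist "1980(1)1990"
display "`r(numlist)'"

* Toy data for one unit, with a missing value in 1988 (hypothetical)
clear
input year x2
1982 2
1984 9
1986 4
1988 .
end

* Average corresponding to x2(1982&1986&1988); summarize skips the missing value
summarize x2 if inlist(year, 1982, 1986, 1988), meanonly
display r(mean)
```

Because the 1988 value is missing, the average is taken over 1982 and 1986 only and equals 3.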

trunit(#) the unit number of the unit affected by the intervention as given in the panel id variable specified in tsset panelvar; see tsset. Notice that only a single unit number can be specified. If the intervention of interest affected several units, the user may choose to combine these units first and then treat them as a single unit affected by the intervention.

trperiod(#) the time period when the intervention occurred. The time period refers to the panel time variable specified in tsset timevar; see tsset. Only a single number can be specified.

Options

counit(numlist) a list of unit numbers for the control units as given in the panel id variable specified in tsset panelvar; see tsset. counit() should be specified as a list of integer numbers (see numlist) and contain at least two control units. The list of control units specified constitutes what is called the `donor pool', the set of potential control units out of which the synthetic control unit is constructed. Notice that counit is optional: if no counit is specified, the donor pool defaults to all units available in the panel id variable specified in tsset, excluding the unit affected by the intervention specified in trunit.

xperiod(numlist) a list of time periods over which the predictor variables specified in predictorvars are averaged. The list of time periods refers to the panel time variable specified in tsset timevar. For example, if the specified panel time variable is given in years, xperiod(1980(1)1988) indicates that the predictor variables are averaged over the years 1980,1981,...,1988. See numlist on how to specify lists of numbers. If no xperiod is specified, xperiod defaults to the entire pre-intervention period, which by default ranges from the earliest time period available in the panel time variable to the period immediately prior to the intervention. Notice that the period of the intervention itself is excluded from the average and missing entries are ignored. Also notice that variable-specific time periods always take precedence over xperiod. Usually, xperiod is specified to contain a number of pre-intervention periods, although post-intervention time periods could be included if the predictors are not affected by the intervention.

mspeperiod(numlist) a list of pre-intervention time periods over which the mean squared prediction error (MSPE) should be minimized. The list of time periods refers to the panel time variable specified in tsset timevar; see tsset. The MSPE refers to the squared deviations between the outcome for the treated unit and the synthetic control unit summed over all pre-intervention periods specified in mspeperiod(numlist). See numlist on how to specify lists of numbers. If no mspeperiod() is specified, mspeperiod() defaults to the entire pre-intervention period ranging from the earliest time period available in the panel time variable to the period immediately prior to the intervention. Notice that the period of the intervention itself is excluded from mspeperiod(). Usually, mspeperiod() is specified to cover the whole pre-treatment period up to the time of the intervention, but other choices are possible.

resultsperiod(numlist) a list of time periods over which the results of synth should be obtained in the optional figure (see figure), the optional results dataset (see keep), and the return matrices (see ereturn results). The list of time periods refers to the panel time variable specified in tsset timevar. If no resultsperiod is specified, resultsperiod defaults to the entire period, which by default ranges from the earliest to the latest time period available in the panel time variable.

nested by default, synth uses a data-driven regression-based method to obtain the variable weights contained in the V-matrix. This method relies on a constrained quadratic programming routine that finds the best-fitting W-weights conditional on the regression-based V-matrix. This procedure is fast and often yields satisfactory results in terms of minimizing the MSPE. Specifying nested will lead to better performance, however, at the expense of additional computing time. If nested is specified, synth embarks on a fully nested optimization procedure that searches among all (diagonal) positive semidefinite V-matrices and sets of W-weights for the best-fitting convex combination of the control units. The fully nested optimization contains the regression-based V as a starting point, but often produces convex combinations that achieve even lower MSPE. If customV is specified and nested is specified, the user-supplied V-matrix forms the starting point for the nested optimization. All parameters of both optimizers can be tuned by the user depending on the application (see Optimization Settings).

allopt if nested is specified (see nested), the user can also specify allopt to trade even more computing time for more robust results. Sometimes the search space may contain local minima, such that the nested optimization procedure starting from the regression-based V-matrix may not find the global minimum in the parameter space. allopt provides a robustness check by running the nested optimization three times using three different starting points (the regression-based V, equal V-weights, and a third procedure that uses Stata's ml search procedure to find good starting values). synth returns the best result of all three attempts. This option will usually take three times the amount of computing time compared to the nested option. Often allopt will lead to no improvement over just the nested method, but sometimes allopt produces even better results.

unitnames(varname) a string variable that contains unit names. The unit names refer to the unit numbers in the panel id variable specified in tsset panelvar; see tsset. If unitnames is provided, the results will be displayed with unit numbers labeled by their respective unit names. So for example, if the user has two variables in the dataset called country_numbers (numeric) and country_names (string), unitnames(country_names) could be specified to display the results using country names instead of numbers. Alternatively, if the user does not specify unitnames, but the panel id variable is labeled, the labels from the latter will be used.
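
If a dataset has only a string identifier, one way to obtain a labeled numeric panel id is Stata's encode command. The sketch below uses made-up data and hypothetical variable names. Afterwards, the value labels on country_id would play the role described above, or country_names could be passed to unitnames().

```stata
* Toy data with a string identifier only (hypothetical)
clear
input str10 country_names year y
"Spain"  1990 1
"Spain"  1991 2
"France" 1990 3
"France" 1991 4
end

* encode assigns integer codes in alphabetical order and attaches value labels
encode country_names, generate(country_id)
tsset country_id year

* Show the mapping between unit numbers and unit names
label list country_id
```

With these data, France receives unit number 1 and Spain unit number 2.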

figure if specified, synth produces a line plot with outcome trends for the treated unit and the synthetic control unit for the years specified in resultsperiod().

keep(filename) saves a dataset with the results in the file filename.dta. This dataset can be used to further process the results. If keep(filename) is specified, filename.dta will hold the following variables:

_time: A variable that contains the respective time period (from the tsset panel time variable (timevar)) for all periods specified in resultsperiod().

_Y_treated: The observed outcome depvar for the treated unit specified in trunit() for each time period specified in resultsperiod().

_Y_synthetic: The estimated outcome depvar for the synthetic control unit estimated using the convex combination of the control units specified in counit() for each time period specified in resultsperiod().

_Co_Number: A variable that contains the unit number (from the tsset panel unit variable (panelvar)) for each control unit specified in counit(). If unit names are supplied via unitnames(), the unit numbers will be labeled accordingly (each control unit number is labeled with its respective name from unitnames()).

_W_weight: A variable that contains the estimated unit weight for each control unit specified in counit().
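
As one example of further processing, the gap between the treated and synthetic outcomes and the pre-intervention RMSPE can be computed from these variables. The sketch below types in a small stand-in for the saved dataset (the numbers and the 1989 intervention year are made up). With real output, one would instead load the file given in keep().

```stata
* Stand-in for a results dataset saved by keep() (hypothetical values)
clear
input _time _Y_treated _Y_synthetic
1986 100 98
1987  95 96
1988  90 92
1989  80 89
1990  75 88
end

* Period-by-period gap between treated and synthetic outcomes
generate double gap = _Y_treated - _Y_synthetic
generate double sq_gap = gap^2

* Root mean squared prediction error over the pre-intervention years
summarize sq_gap if _time < 1989, meanonly
scalar rmspe_pre = sqrt(r(mean))
display "pre-intervention RMSPE = " rmspe_pre
```

In this toy case the pre-intervention squared gaps are 4, 1, and 4, so the RMSPE is the square root of 3.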

replace replaces the dataset specified in keep(filename) if it already exists.

customV(numlist) by default, synth uses a data-driven regression-based method to obtain the variable weights contained in the V-matrix. customV() gives the user the option to supply custom V-weights instead. Notice that the V-weights determine the predictive power of the respective variable for the outcome of interest over the pre-intervention period. Highly predictive variables should be given a high weight, so that the unit affected by the intervention and the synthetic control unit match strongly on this predictor. Weights are specified as a list with weights appearing in the same order as the predictors listed in predictorvars. One weight must be supplied for each predictor. See the papers in the references for details. For now, only one weight per variable is allowed (the V matrix is diagonal), but future releases will allow non-diagonal V matrices to be supplied.

-----------------------------------------------------------------------------
Optimization Settings

Control parameters for the constrained quadratic optimization routine:

The constrained quadratic optimization routine is based on an algorithm that uses the interior point method to solve the constrained quadratic programming problem (see Vanderbei 1999 for more details on the interior point method). It is implemented via a C++ plugin and has the following tuning parameters:

margin(real) Margin for constraint violation tolerance. Default is 5 percent (i.e., 0.05).

maxiter(#) Maximum number of iterations. Default is 1000.

sigf(#) Precision (number of significant figures). Default is 7.

bound(#) Clipping bound for the variables. Default is 10.

Additional control parameters for the nested optimization routine:

If nested is specified, a nested optimization will be performed using the constrained quadratic programming routine and Stata's ml optimizer. By default, synth uses the maximize default settings. The user may tune the maximize settings depending on the application (e.g., synth ... , iterate(20)).
-----------------------------------------------------------------------------

Saved Results

By default, synth ereturns the following matrices, which can be displayed by typing ereturn list after synth is finished (also see ereturn).

e(V_matrix): A diagonal matrix that contains the normalized variable weights on the diagonal.

e(X_balance): A matrix that juxtaposes the predictor values for the unit affected by the intervention and the synthetic control unit. The matrix has two columns (treated and synthetic) and as many rows as predictors.

e(W_weights): A matrix that contains the unit numbers and unit weights, i.e., the relative contribution of each control unit to the synthetic control unit. The matrix has two columns and as many rows as control units.

e(Y_treated): A matrix that contains the values of the response variable for the treated unit for each time period. The matrix has one column and as many rows as time periods specified in resultsperiod().

e(Y_synthetic): A matrix that contains the values of the response variable for the synthetic control unit for each time period. The matrix has one column and as many rows as time periods specified in resultsperiod().

e(RMSPE): A one-by-one matrix that contains the Root Mean Squared Prediction Error (RMSPE).
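
The returned matrices can be handled with ordinary Stata matrix commands. The sketch below works on a typed-in stand-in with the same two-column layout as e(W_weights) (unit number, then weight). The values are made up, and after a real run one would start from matrix W = e(W_weights) instead.

```stata
* Stand-in for e(W_weights): unit number in column 1, weight in column 2
* (hypothetical values)
matrix W = (1, 0 \ 2, 0.25 \ 4, 0.75)

* The weights describe a convex combination, so they should sum to one
matrix ones = J(1, rowsof(W), 1)
matrix total = ones * W[1..., 2]
display "sum of unit weights = " total[1,1]

* Inspect the full table of unit numbers and weights
matrix list W
```

A check of this kind is a quick way to confirm that the matrix being processed is the weight matrix and not one of the outcome matrices.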

Examples

Load Example Data: This panel dataset contains information for 39 US States for the years 1970-2000 (see Abadie, Diamond, and Hainmueller (2010) for details).

. sysuse smoking

Declare the dataset as panel:

. tsset state year

Example 1 - Construct synthetic control group:

. synth cigsale beer(1984(1)1988) lnincome retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989)

In this example, the unit affected by the intervention is unit no 3 (California) in the year 1989. The donor pool (since no counit() is specified) defaults to the control units 1,2,4,5,...,39 (i.e., the other 38 states in the dataset). Since no xperiod() is provided, the predictor variables for which no variable-specific time period is specified (retprice, lnincome, and age15to24) are averaged over the entire pre-intervention period up to the year of the intervention (1970,1971,...,1988). The beer variable has the time period (1984(1)1988) specified, meaning that it is averaged for the periods 1984,1985,...,1988. The variable cigsale will be used three times as a predictor using the values from periods 1988, 1980, and 1975 respectively. The MSPE is minimized over the entire pretreatment period, because mspeperiod() is not provided. By default, results are displayed for the 1970,1971,...,2000 period (the earliest and latest year in the dataset).

Example 2 - Construct synthetic control group:

. synth cigsale beer lnincome(1980&1985) retprice cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) fig

This example is similar to example 1, but now beer is averaged over the entire pretreatment period while lnincome is only averaged over the periods 1980 and 1985. Since no data is available for beer prior to 1984, synth will inform the user that there is missing data for this variable and that the missing values are ignored in the averaging. A results figure is also requested using the fig option.

Example 3 - Construct synthetic control group:

. synth cigsale retprice cigsale(1970) cigsale(1979) , trunit(33) counit(1(1)20) trperiod(1980) fig resultsperiod(1970(1)1990)

In this example, the unit affected by the intervention is state no 33, and the donor pool of potential control units is restricted to states no 1,2,...,20. The intervention occurs in 1980, and results are obtained for the 1970,1971,...,1990 period.

Example 4 - Construct synthetic control group:

. synth cigsale retprice cigsale(1970) cigsale(1979) , trunit(33) counit(1(1)20) trperiod(1980) resultsperiod(1970(1)1990) keep(resout)

This example is similar to example 3, but keep(resout) is specified and thus synth will save a dataset named resout.dta in the current Stata working directory (type pwd to see the path of your working directory). This dataset contains the result from the current fit and can be used for further processing. Also, to easily access results, recall that synth routinely returns all result matrices. These can be displayed by typing ereturn list after synth has terminated.

Example 5 - Construct synthetic control group:

. synth cigsale beer lnincome retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975) , trunit(3) trperiod(1989) xperiod(1980(1)1988) nested

This is again example 2, but the nested option is specified, which may produce lower loss at the expense of additional computing time. Also, xperiod() is specified, indicating that predictors are averaged for the 1980,1981,...,1988 period.

Example 6 - Run placebo in space:

. tempname resmat
. forvalues i = 1/4 {
      synth cigsale retprice cigsale(1988) cigsale(1980) cigsale(1975) , trunit(`i') trperiod(1989) xperiod(1980(1)1988)
      matrix `resmat' = nullmat(`resmat') \ e(RMSPE)
      local names `"`names' `"`i'"'"'
  }
. mat colnames `resmat' = "RMSPE"
. mat rownames `resmat' = `names'
. matlist `resmat' , row("Treated Unit")

This is a code example to run placebo studies by iteratively reassigning the intervention in space to the first four states. To do so, we simply run a loop of four iterations in which the trunit() setting is incremented in each iteration. Thus, in the first run of synth state number one is assigned to the intervention, in the second run state number two, and so on. In each run we store the RMSPE and display it in a matrix at the end.

References

Abadie, A., Diamond, A., and J. Hainmueller. 2010. Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. Journal of the American Statistical Association 105(490): 493-505.

Abadie, A. and Gardeazabal, J. 2003. Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review 93(1): 113-132.

Vanderbei, R.J. 1999. LOQO: An interior point code for quadratic programming. Optimization Methods and Software 11: 451-484.

Authors

Jens Hainmueller, jhainm@mit.edu MIT

Alberto Abadie, alberto_abadie@harvard.edu Harvard University

Alexis Diamond, adiamond@fas.harvard.edu IFC