{smcl}
{* 17may2019}{...}
{cmd:help mata mm_greedy()}
{hline}

{title:Title}

{pstd}
{bf:mm_greedy() -- Greedy one-to-one and one-to-many matching without replacement}


{title:Syntax}

{p 8 21 2}
{it:P} = {cmd:mm_greedy(}{it:T}{cmd:,} {it:C}{cmd:,} {it:n}{cmd:,} {it:caliper}{cmd:,} {it:f} [{cmd:,} {it:arg}]{cmd:)}

{p 8 21 2}
{it:E} = {cmd:mm_greedy2(}{it:T}{cmd:,} {it:C}{cmd:,} {it:n}{cmd:,} {it:caliper}{cmd:,} {it:f} [{cmd:,} {it:arg}]{cmd:)}

{p 8 21 2}
{it:E} = {cmd:mm_greedy_pairs(}{it:P}{cmd:)}

{p 8 8 2}
where

{p 12 16 2}
{it:P}: {it:real matrix} of dimension {it:nT x n} containing the indices of
the matched controls, where {it:nT} is the number of rows of {it:T}; the rows
of {it:P} correspond to the rows of {it:T} and the indices stored in {it:P}
refer to rows of {it:C}; for example, value 3 in row 5 would mean that
treatment observation 5 was matched with control observation 3; elements of
{it:P} are set to missing in cases where no suitable control observation is
found; for example, if you request five matches ({it:n}=5), but for a
particular treatment observation only three matching controls are available,
then the 4th and 5th elements in the corresponding row of {it:P} will be set
to missing

{p 12 16 2}
{it:E}: {it:real matrix} of dimension {it:M x 3} containing an edge-list of
treatment-control pairs, where {it:M} is the total number of matched pairs;
column 1 contains treatment observation indices, column 2 contains control
observation indices, column 3 contains weights defined as 1/{it:k_i}, where
{it:k_i} is the total number of controls that have been matched to treatment
observation {it:i} (the sum of weights is equal to the number of treatment
observations for which at least one match was found)

{p 12 16 2}
{it:T}: {it:transmorphic matrix} containing the treatment observations; rows
are observations, columns are variables; {it:T} and {it:C} must have the same
number of columns

{p 12 16 2}
{it:C}: {it:transmorphic matrix} containing the control
observations; rows are observations, columns are variables; {it:T} and {it:C}
must have the same number of columns

{p 12 16 2}
{it:n}: {it:real scalar} specifying the number of control observations to be
matched with each treatment observation; {it:n}>=. and {it:n}<1 will be
interpreted as {it:n}=1

{p 6 16 2}
{it:caliper}: {it:real scalar} specifying a caliper; if the distance between
a pair of observations is larger than the caliper, the pair will not be
considered as a potential match; set {it:caliper}=. to allow all pairs as
potential matches

{p 12 16 2}
{it:f}: {it:pointer scalar} containing the address of the function to be
used to compute the distances between treatment and control observations;
usually this is coded as {cmd:&}{it:functionname}{cmd:()}; function {it:f}
must return a {it:real colvector} (the computed distances between a single
treatment observation and each control, in the same order as {it:C});
{cmd:mm_greedy()} calls function {it:f} repeatedly (one time for each
treatment observation); in each call, the following three arguments will be
passed on to {it:f}: a single row from {it:T} (1st argument), {it:C} (2nd
argument), and {it:arg} (3rd argument)

{p 10 16 2}
{it:arg}: argument that will be passed on to function {it:f}; {it:arg} can be
of any type


{title:Description}

{pstd}
{cmd:mm_greedy()} matches controls to treatment observations using a greedy
algorithm without replacement. It first matches the pair with the smallest
distance, then the pair with the 2nd smallest distance, and so on. Any scalar
distance metric can be used by providing function {it:f} computing the
distances. Each control will be matched to a treatment observation at most
once; if a control has been used, it will no longer be available for further
matching. Ties (multiple pairs with the same distance) will be processed in
random order. Set the sort seed if you want to obtain stable results (see
{helpb set sortseed}).
{p_end}
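{pstd}
As a minimal sketch of the interface described above, a user-written distance
function for several matching variables could return the Euclidean distance
between the treatment row and each control. The function name
{cmd:eucdist()} and the small matrices {cmd:T} and {cmd:C} are made up for
illustration and are not part of the package; the third argument is accepted
so that the function matches the three-argument call, but it is ignored:
{p_end}

        . {stata "mata: function eucdist(t, C, arg) return(sqrt(rowsum((C:-t):^2)))"} {it:(hypothetical helper)}
        . {stata "mata:"}
        : {stata "T = (1, 2 \ 4, 4)"} {it:(made-up treatment rows)}
        : {stata "C = (1, 3 \ 5, 5 \ 9, 9)"} {it:(made-up control rows)}
        : {stata "P = mm_greedy(T, C, 1, ., &eucdist())"}
        : {stata P}
        : {stata end}

{pstd}
Because {it:f} receives one row of {it:T} per call, the function only has to
return one distance per row of {it:C}, in the same order as {it:C}.
{p_end}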
{pstd}
The computational complexity of the algorithm implemented in
{cmd:mm_greedy()} is of order {it:nT}*{it:nC}, where {it:nT} and {it:nC} are
the numbers of observations in the two groups. For example, in an exercise
with 1000 treatment observations and 10'000 control observations, 10'000'000
distances will have to be evaluated. This means that the algorithm is slow in
large datasets.

{pstd}
{cmd:mm_greedy2()} is like {cmd:mm_greedy()}, but returns the result in a
different format. Whereas {cmd:mm_greedy()} returns a matrix of control
indices with one row per treatment observation, {cmd:mm_greedy2()} returns an
edge-list of matched pairs with treatment indices in the first column,
control indices in the second column, and weights in the third column. The
weights are defined as the inverse of the total number of controls that have
been matched to a single treatment observation. Depending on the application,
either the format returned by {cmd:mm_greedy()} or the format returned by
{cmd:mm_greedy2()} may be more convenient.

{pstd}
{cmd:mm_greedy_pairs()} can be used to transform the result returned by
{cmd:mm_greedy()} into an edge-list as returned by {cmd:mm_greedy2()}.


{title:Examples}

{dlgtab:One-to-one matching}

{pstd}
In the following example a matched sample is generated based on absolute
differences in the estimated propensity score. To compute the differences, we
first have to define an appropriate function that can then be used by
{cmd:mm_greedy()}:

        . {stata "mata: function absdif(T, C, arg) return(abs(C:-T))"}

{pstd}
We can now get some data and apply one-to-one matching:

        . {stata sysuse auto, clear}
        . {stata logit foreign weight mpg turn}
        . {stata predict ps, pr}
        . {stata generate byte domestic = 1 - foreign}
        . {stata "mata:"}
        : {stata T = st_data(., "ps", "foreign")}
        : {stata C = st_data(., "ps", "domestic")}
        : {stata P = mm_greedy(T, C, 1, ., &absdif())}
        : {stata end}

{pstd}
Vector {cmd:P} contains the index numbers of the matched controls.
Here is a table showing the index numbers, the propensity scores of the
treated, the propensity scores of the matched controls, and the
propensity-score differences:

        . {stata "mata: P, T, C[P], C[P] - T"}

{pstd}
Here is how you could compute a mean difference based on the matched sample:

        . {stata "mata:"}
        : {stata T_price = st_data(., "price", "foreign")}
        : {stata C_price = st_data(., "price", "domestic")}
        : {stata mean(T_price) - mean(C_price)} {it:(raw difference)}
        : {stata mean(T_price) - mean(C_price[P])} {it:(matched difference)}
        : {stata end}

{dlgtab:One-to-one matching with caliper}

{pstd}
Some of the matches in the above example are not very good. For example, for
treatment observation 2, the matched control's propensity score deviates by
about 0.7 (see table above). To prevent such bad matches, we could set a
caliper. To set the maximum acceptable difference at 0.2 you could type:

        . {stata "mata: P = mm_greedy(T, C, 1, 0.2, &absdif())"}

{pstd}
For some of the treatment observations no suitable control could be found due
to the caliper. In these cases, the corresponding element of {cmd:P} is set
to missing:

        . {stata "mata: P"}

{pstd}
Let us generate a permutation vector for the non-missing elements:

        . {stata "mata: p = select(1::rows(P), P:<.)"}

{pstd}
With the help of {cmd:p} we can now display a table similar to the one above

        . {stata "mata: P[p], T[p], C[P[p]], C[P[p]] - T[p]"}

{pstd}
and compute the matched mean difference:

        . {stata "mata: mean(T_price[p]) - mean(C_price[P[p]])"}

{pstd}
The matching quality is much better now and the result for the outcome
difference changed somewhat.

{pstd}
Instead of selecting the relevant elements from {cmd:P} we could also use
{cmd:mm_greedy2()} to directly obtain a matrix containing appropriate
permutation vectors for treated and controls:

        . 
          {stata "mata:"}
        : {stata E = mm_greedy2(T, C, 1, 0.2, &absdif())}
        : {stata E}
        : {stata mean(T_price[E[,1]]) - mean(C_price[E[,2]])}
        : {stata end}

{pstd}
Alternatively, {cmd:P} could be transformed into {cmd:E} using
{cmd:mm_greedy_pairs()}:

        . {stata "mata: mm_greedy_pairs(P)"}

{dlgtab:One-to-many matching}

{pstd}
Use argument {it:n} to set the number of controls that should be matched to
each treatment observation. Here is an example with 3 matches each:

        . {stata "mata: P = mm_greedy(T, C, 3, ., &absdif())"}

{pstd}
The columns of {cmd:P} contain the index numbers of the different matches.
Because the number of controls is smaller than three times the number of
treatment observations, not all treatment observations received three matches
(and some did not receive any matches at all):

        . {stata "mata: P"}

{pstd}
For one-to-many matching, working with the edge-list returned by
{cmd:mm_greedy2()} is typically more convenient than working with the matrix
of control indices returned by {cmd:mm_greedy()}:

        . {stata "mata:"}
        : {stata E = mm_greedy2(T, C, 3, ., &absdif())}
        : {stata E}
        : {stata mean(T_price[E[,1]]) - mean(C_price[E[,2]], E[,3])}
        : {stata end}

{pstd}
{cmd:E[,3]} contains the appropriate weights to be applied to the selected
control group observations.

{dlgtab:Mahalanobis distance matching}

{pstd}
For Mahalanobis distance matching we need to define a different distance
function and provide the separate X variables as well as the inverted
variance matrix to the matching procedure:

        . 
          {stata "mata:"}
        : {stata "function mahasq(T, C, S) return(rowsum(((C:-T)*S):*(C:-T)))"}
        : {stata T = st_data(., tokens("weight mpg turn"), "foreign")}
        : {stata C = st_data(., tokens("weight mpg turn"), "domestic")}
        : {stata S = invsym(variance((T\C)))}
        : {stata P = mm_greedy(T, C, 1, ., &mahasq(), S)}
        : {stata P, mahasq(T, C[P,], S)}
        : {stata end}


{title:Source code}

{pstd}
{help moremata_source##mm_greedy:mm_greedy.mata}


{title:Author}

{pstd}
Ben Jann, University of Bern, ben.jann@soz.unibe.ch


{title:Also see}

{psee}
Online: help for {bf:{help moremata}}
{p_end}