{smcl}
{* 17may2019}{...}
{cmd:help mata mm_greedy()}
{hline}
{title:Title}
{pstd}
{bf:mm_greedy() -- Greedy one-to-one and one-to-many matching without replacement}
{title:Syntax}
{p 8 21 2}
{it:P} = {cmd:mm_greedy(}{it:T}{cmd:,} {it:C}{cmd:,} {it:n}{cmd:,} {it:caliper}{cmd:,} {it:f} [{cmd:,} {it:arg}]{cmd:)}
{p 8 21 2}
{it:E} = {cmd:mm_greedy2(}{it:T}{cmd:,} {it:C}{cmd:,} {it:n}{cmd:,} {it:caliper}{cmd:,} {it:f} [{cmd:,} {it:arg}]{cmd:)}
{p 8 21 2}
{it:E} = {cmd:mm_greedy_pairs(}{it:P}{cmd:)}
{p 8 8 2}
where
{p 12 16 2}
{it:P}: {it:real matrix} of dimension {it:nT x n} containing the indices
of the matched controls, where {it:nT} is the number of rows of
{it:T}; the rows of {it:P} correspond to the rows of {it:T} and
the indices stored in {it:P} refer to rows of {it:C}; for example,
value 3 in row 5 would mean that treatment observation 5 was
matched with control observation 3; elements of {it:P} are set to
missing in cases where no suitable control observation is found;
for example, if you request five matches ({it:n}=5), but for a
particular treatment observation only three matching controls are
available, then the 4th and 5th elements in the corresponding row
of {it:P} will be set to missing
{p 12 16 2}
{it:E}: {it:real matrix} of dimension {it:M x 3} containing an
edge-list of treatment-control pairs,
where {it:M} is the total number of matched pairs; column 1 contains
treatment observation indices, column 2 contains control
observation indices, column 3 contains weights defined as
1/{it:k_i}, where {it:k_i} is the total number of controls that have
been matched to treatment observation {it:i} (the sum of
weights is equal to the number of treatment observations for which
at least one match was found)
{p 12 16 2}
{it:T}: {it:transmorphic matrix} containing the treatment observations; rows
are observations, columns are variables; {it:T} and {it:C} must have
the same number of columns
{p 12 16 2}
{it:C}: {it:transmorphic matrix} containing the control observations; rows
are observations, columns are variables; {it:T} and {it:C} must have
the same number of columns
{p 12 16 2}
{it:n}: {it:real scalar} specifying the number of control observations to be
matched with each treatment observation; a missing value ({it:n}>=.) or
{it:n}<1 will be interpreted as {it:n}=1
{p 6 16 2}
{it:caliper}: {it:real scalar} specifying a caliper; if the distance between
a pair of observations is larger than the caliper, the pair will
not be considered as a potential match; set {it:caliper}=. to allow
all pairs as potential matches
{p 12 16 2}
{it:f}: {it:pointer scalar} containing the address of the function to be
used to compute the distances between treatment and control
observations; usually this is coded as
{cmd:&}{it:functionname}{cmd:()}; function {it:f} must return a
{it:real colvector} (the computed distances between a single treatment
observation and each control, in the same order as {it:C});
{cmd:mm_greedy()} calls function {it:f} repeatedly (one time for each
treatment observation); in each call, the following three arguments will be
passed on to {it:f}: a single row from {it:T} (1st argument), {it:C} (2nd
argument), and {it:arg} (3rd argument)
{p 10 16 2}
{it:arg}: argument that will be passed on to function {it:f}; {it:arg} can be
of any type
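{pstd}
As a minimal sketch of how {it:f}, {it:arg}, and {it:caliper} interact, the
following made-up example uses a hypothetical distance function
{cmd:scaledif()} that divides absolute differences by a scale passed
through {it:arg}:
. {stata "mata:"}
: {stata "function scaledif(T, C, s) return(abs(C:-T):/s)"} {it:(s receives arg)}
: {stata scaledif(1, (1.6 \ 0.1 \ 5), 2)} {it:(distances from one treatment row to three controls)}
: {stata mm_greedy((1 \ 2), (1.6 \ 0.1 \ 5), 1, 0.25, &scaledif(), 2)}
: {stata end}
{pstd}
With a scale of 2 and a caliper of 0.25, only the pair of treatment
observation 2 and control 1 (scaled distance 0.2) qualifies, so the result
should be (. \ 1).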
{title:Description}
{pstd}
{cmd:mm_greedy()} matches controls to treatment observations using a greedy
algorithm without replacement. It first matches the pair with the smallest
distance, then the pair with the 2nd smallest distance, and so on. Any
scalar distance metric can be used by providing function {it:f} computing
the distances. Each control will be matched to a treatment observation at
most once; if a control has been used, it will no longer be available for
further matching. Ties (multiple pairs with the same distance) will be processed
in random order. Set the sort seed if you want to obtain stable results
(see {helpb set sortseed}).
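{pstd}
The matching order can be illustrated with a small made-up example (a sketch
only; {cmd:toydif()}, {cmd:T0}, and {cmd:C0} are hypothetical names
introduced for this illustration):
. {stata "mata:"}
: {stata "function toydif(T, C, arg) return(abs(C:-T))"} {it:(absolute difference as distance)}
: {stata T0 = (1 \ 2)} {it:(two treatment observations)}
: {stata C0 = (1.6 \ 0.1 \ 5)} {it:(three control observations)}
: {stata mm_greedy(T0, C0, 1, ., &toydif())}
: {stata end}
{pstd}
The smallest distance (0.4) is between treatment observation 2 and control 1,
so this pair is matched first. Control 1 would also be the closest control for
treatment observation 1 (distance 0.6), but it is no longer available;
treatment observation 1 is therefore matched with control 2 (distance 0.9),
and the result should be (2 \ 1).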
{pstd}
The computational complexity of the algorithm implemented in
{cmd:mm_greedy()} is of order {it:nT}*{it:nC}, where {it:nT} and {it:nC}
are the numbers of observations in the two groups. For example, in an exercise with
1,000 treatment observations and 10,000 control observations, 10,000,000
distances will have to be evaluated. This means that the
algorithm is slow in large datasets.
{pstd}
{cmd:mm_greedy2()} is like {cmd:mm_greedy()}, but returns the
result in a different format. Whereas {cmd:mm_greedy()} returns a matrix
of control indices with one row per treatment observation, {cmd:mm_greedy2()}
returns an edge-list of matched pairs with treatment indices in the first
column, control indices in the second column, and weights in
the third column. The weights are defined as the inverse of the total number
of controls that have been matched to a single treatment observation. Depending
on the application, either the format returned by {cmd:mm_greedy()}
or the format returned by {cmd:mm_greedy2()} may be more convenient.
{pstd}
{cmd:mm_greedy_pairs()} can be used to transform the result returned by
{cmd:mm_greedy()} into an edge-list as returned by {cmd:mm_greedy2()}.
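{pstd}
A small hand-built sketch of the conversion (made-up values; {cmd:P0} and
{cmd:E0} are hypothetical names, and it is assumed here that
{cmd:mm_greedy_pairs()} accepts any matrix laid out like {it:P}):
. {stata "mata:"}
: {stata P0 = (3, 1 \ 2, . \ ., .)} {it:(three treatment observations, n = 2)}
: {stata E0 = mm_greedy_pairs(P0)}
: {stata E0}
: {stata sum(E0[,3])} {it:(sum of weights)}
: {stata end}
{pstd}
{cmd:E0} should contain three pairs: treatment observation 1 with controls 3
and 1 (weight 1/2 each) and treatment observation 2 with control 2 (weight 1).
Treatment observation 3 has no match and does not appear, so the weights sum
to 2.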
{title:Examples}
{dlgtab:One-to-one matching}
{pstd}
In the following example a matched sample is generated based on absolute
differences in the estimated propensity score. To compute the differences,
we first have to define an appropriate function that can then be used by
{cmd:mm_greedy()}:
. {stata "mata: function absdif(T, C, arg) return(abs(C:-T))"}
{pstd}
We can now get some data and apply one-to-one matching:
. {stata sysuse auto, clear}
. {stata logit foreign weight mpg turn}
. {stata predict ps, pr}
. {stata generate byte domestic = 1 - foreign}
. {stata "mata:"}
: {stata T = st_data(., "ps", "foreign")}
: {stata C = st_data(., "ps", "domestic")}
: {stata P = mm_greedy(T, C, 1, ., &absdif())}
: {stata end}
{pstd}
Vector {cmd:P} contains the index numbers of the matched controls. Here is a
table showing the index numbers, the propensity scores of the treated, the
propensity scores of the matched controls, and the propensity-score differences:
. {stata "mata: P, T, C[P], C[P] - T"}
{pstd}
Here is how you could compute a mean difference
based on the matched sample:
. {stata "mata:"}
: {stata T_price = st_data(., "price", "foreign")}
: {stata C_price = st_data(., "price", "domestic")}
: {stata mean(T_price) - mean(C_price)} {it:(raw difference)}
: {stata mean(T_price) - mean(C_price[P])} {it:(matched difference)}
: {stata end}
{dlgtab:One-to-one matching with caliper}
{pstd}
Some of the matches in the above example are not very good. For example, for treatment
observation 2, the matched control's propensity score deviates by
about 0.7 (see table above). To prevent such bad matches, we could set a
caliper. To set the maximum acceptable difference at 0.2 you could type:
. {stata "mata: P = mm_greedy(T, C, 1, 0.2, &absdif())"}
{pstd}
For some of the treatment observations no suitable control could be
found due to the caliper. In these cases, {cmd:P} is set to missing:
. {stata "mata: P"}
{pstd}
Let us generate a permutation vector for the non-missing elements:
. {stata "mata: p = select(1::rows(P), P:<.)"}
{pstd}
With the help of {cmd:p} we can now display a table similar to the one above
. {stata "mata: P[p], T[p], C[P[p]], C[P[p]] - T[p]"}
{pstd}
and compute the matched mean difference:
. {stata "mata: mean(T_price[p]) - mean(C_price[P[p]])"}
{pstd}
The matching quality is much better now and the result for the outcome
difference changed somewhat.
{pstd}
Instead of selecting the relevant elements from
{cmd:P} we could also use {cmd:mm_greedy2()} to directly
obtain a matrix containing appropriate permutation vectors for
treated and controls:
. {stata "mata:"}
: {stata E = mm_greedy2(T, C, 1, 0.2, &absdif())}
: {stata E}
: {stata mean(T_price[E[,1]]) - mean(C_price[E[,2]])}
: {stata end}
{pstd}
Alternatively, {cmd:P} could be transformed into {cmd:E} using
{cmd:mm_greedy_pairs()}:
. {stata "mata: mm_greedy_pairs(P)"}
{dlgtab:One-to-many matching}
{pstd}
Use argument {it:n} to set the number of controls that should be matched to
each treatment observation. Here is an example with 3 matches each:
. {stata "mata: P = mm_greedy(T, C, 3, ., &absdif())"}
{pstd}
The columns of {cmd:P} contain the index numbers of the different
matches. Because the number of controls is smaller than three times the number
of treatment observations, not all treatment observations received three
matches (and some did not receive any matches at all):
. {stata "mata: P"}
{pstd}
For one-to-many matching, working with the edge-list returned by
{cmd:mm_greedy2()} is typically more convenient than working with
the matrix of control indices returned by {cmd:mm_greedy()}:
. {stata "mata:"}
: {stata E = mm_greedy2(T, C, 3, ., &absdif())}
: {stata E}
: {stata mean(T_price[E[,1]], E[,3]) - mean(C_price[E[,2]], E[,3])}
: {stata end}
{pstd}
{cmd:E[,3]} contains the appropriate weights. They are applied to both
groups so that each matched treatment observation, together with its set of
matched controls, counts exactly once.
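{pstd}
As a quick check (a sketch reusing {cmd:E} from above), the sum of the
weights should equal the number of treatment observations that received at
least one match:
. {stata "mata: sum(E[,3]), rows(uniqrows(E[,1]))"}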
{dlgtab:Mahalanobis distance matching}
{pstd}
For Mahalanobis distance matching we need to define a different distance
function and provide the separate X variables as well as the inverted
variance matrix to the matching procedure:
. {stata "mata:"}
: {stata "function mahasq(T, C, S) return(rowsum(((C:-T)*S):*(C:-T)))"}
: {stata T = st_data(., tokens("weight mpg turn"), "foreign")}
: {stata C = st_data(., tokens("weight mpg turn"), "domestic")}
: {stata S = invsym(variance((T\C)))}
: {stata P = mm_greedy(T, C, 1, ., &mahasq(), S)}
: {stata P, mahasq(T, C[P,], S)}
: {stata end}
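{pstd}
To see what {cmd:mahasq()} computes, its result for the first matched pair
can be compared with the quadratic form written out explicitly (a sketch
reusing {cmd:T}, {cmd:C}, {cmd:S}, and {cmd:P} from above; {cmd:d} is a
hypothetical helper name):
. {stata "mata:"}
: {stata d = C[P[1],] - T[1,]} {it:(difference vector for the first pair)}
: {stata "mahasq(T[1,], C[P[1],], S), d * S * d'"} {it:(both values should agree)}
: {stata end}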
{title:Source code}
{pstd}
{help moremata_source##mm_greedy:mm_greedy.mata}
{title:Author}
{pstd} Ben Jann, University of Bern, ben.jann@soz.unibe.ch
{title:Also see}
{psee}
Online: help for {bf:{help moremata}}
{p_end}