.-
help for ^vmatch^
.-

Matching variables between subjects -----------------------------------

^vmatch^ casevar [^if^ exp] , ^g^enerate(newvar) ^show^(idvar) ^save^(filename) ^f^irst ^fuzz^(real 1e-6) [^r^]^d^evia(varlist1 dLo1[/dHi1] [varlist2 dLo2[/dHi2]] ...) [^r^]^e^ucli(varlist1 dist1 [varlist2 dist2] ...) ^o^utof(varlist1 #1 [varlist2 #2]...) ^s^trng(varlist) ^p^ower(#)

Description -----------

^vmatch^ attempts to find records that "match" a given "case" in the sense that they are similar (see rules) with respect to the variables declared in the options. Variable ^casevar^ is assumed to contain a 1 for cases (to be matched) and any other value for non-cases. The new variable ^newvar^ will be created to receive the number of matches for a case and, for non-cases, either the number of times it was called a match or the identification number of the case for which it was first called (see option ^first^). String variables are allowed in the matching process (see option ^strng^)

Cases (casevar==1) are not considered acceptable matches and the optional ^if^ condition only applies to eligible records. To be eligible as match, non-cases must have no missing values on any of the numeric variables declared in the matching options. Stata's -markout- is used for this and therefore string variables are not considered. The pool of eligible records is used when summary statsitics are required to apply matching rules.

There are 4 rules for matching which can be accessed through options. In the description below it is assumed that the values of a given case are denoted X0 or X0_i for vars X or X_i (resp), and d, d1 and d2 represent tolerated values for "distances". Some of the rules may be applied with either absolute or "relative" distances (option prefixed with ^r^). The rules are:

- Deviation ---------> accept as match if -d1 <= devia <= +d2 absolute devia= X-X0 relative devia= 100*( X-X0 ) / X0 %

- Euclidean ---------> accept as match if eucli <= d absolute eucli= p-root( sum{ p-power (X_i-X0_i) over i=1,2,...} ) relative eucli= p-root( sum{ p-power((X_i-X0_i)/s_i) }), where s_i is SD(X_i) The default value of p is 2 and may be optionally modified.

- Out of ---------> accept as match if at least a specified number of exact matches are found among the variables in the specified list.

- String ---------> accept exact matches only (i.e. x==x0)

Options ------- ^generate^ is not an option, the variable newvar must not exist.

^show^ asks that records be identified by the (existing) variable idvar.

^save^ asks that all variables declared in the matching options as well as the idvar and a new var (called Mfor, meaning "match for") be saved in the specified file. If the file already exists it is replaced (no warning). If no idvar is given (i.e. ^show^ option omitted) the sequential number in the present ordering (on entry) is used instead. The variable Mfor is not added to the current data, only saved in the new data file. It identifies (using idvar or seq.num.) the case for which the current record is a match. Cases have Mfor set to missing. Non-cases may appear several times as match(!)

^First^ asks that the generated newvar contain the identification (idvar or seq.num.) of the case for which the present match was first called.

^fuzz^ declares the precision in a numeric comparison: two numbers are considered equal when they differ by less than the specified value; the default is 1e-6. Useful to avoid wrong decisions (exclusions) due to roundoff. It may however cause too many inclusions.

^devia^ or ^rdevia^ request that any record whose deviation from the case (separately on each variable in each successive varlist) does not exceed the specified dLo on the low side nor dHi on the high side be accepted as match. When /dHi is omitted dHi is taken to be the same as dLo; when both are given there must not be blanks within the statement ^dLo/dHi^. When ^rdevia^ is asked, the deviations are taken as percentage deviation from the case (on a scale from 0 to 100, not 0 to 1, see rules above).

^eucli^ or ^reucli^ request that any record which, in the space defined by each varlist successively, lies within the specified distance from the considered case be accepted as match. The distance is computed according to the minkovski formula as the p-th root of the sum of the p-th power of the absolute value of deviations. The value of p is specified via the ^power^ option and defaults to 2 (yielding the euclidean distance). When ^reucli^ is asked the deviations are standardized (X-Xbar)/sdX, so that the difference between a case and a non-case becomes (X-X0)/sdX. The mean Xbar and standard deviation sdX are computed using eligible non-cases only.

^power^ declares the value of the power transformation used in computing distances and only has an effect when the ^[r]eucli^ option is used. A power of 2 produces euclidean distances and a power of 1 the so-called manhattan distance (sum of absolute deviations). When the power is 0 the distance is computed as the antilog of the sum of the log of deviations. Fractional as well as negative values of the power are allowed. Warning: when logs or reciprocals are taken a deviation of zero is replaced by the ^fuzz^ value which may become decisive in declaring matches. Indeed, in the case of logs the distance corresponds to the product of the deviations (i.e. the volume of a box) where some components are zero (or fuzz), thus yielding "small" products and good matching. In the case of negative powers the reciprocals of the fuzz values will become "large", and the sum will result in rather poor matching.

^outof^ : a match is any record for which at least ^#k^ variables within the given varlist^k^ match the considered case's values exactly.

^strng^ : a match is any record for which all string variables in a specified varlist are lexicographically identical to those of the considered case. (note: the option truly is strng without i: at first I was afraid to mix up things, later, well I was afraid to mess them up, or too lazy.)

Examples -------- Here is a do-file using the automobile data, (with an additional string-var color to check ^strng^ option) * cap prog drop aumatch prog def aumatch use auto, clear tempvar u g `u'=1+int(4*uniform()) g str5 color="white" if `u'==1 replace color="blue" if `u'==2 replace color="red" if `u'==3 replace color="green" if `u'==4 sort mak g tbM=mod(_n,15)==11 vmatch tbM, g(nbM1) d(w 200) re(p 1) s(col) show(mak) f vmatch tbM, g(nbM2) e(tru tur mpg 5) p(1) show(mak) save(auto2) end *

- g tbM=mod(_n,15)==11 after sort mak selects seq.num 11,26,41,56,71 (in the alfabetical order of makes) to be matched

- vmatch tbM, g(nbM1) d(w 200) re(p 1) s(col) show(mak) f asks: find matches for cases with tbM==1, store result in new var nbM1 match as follows: ^d^(w 200): weight does not deviate by more than 200 pounds ^re^(p 1) : provided the price is no more than 1 SD away ^s^(col) : make sure they are the same color as the case ^show^(mak) : display casewise (identified by make) list of matches ^f^ : replace nbM1 by first id.num for which non-case was match

- vmatch tbM, g(nbM2) e(tru tur mpg 5) p(1) show(mak) save(auto2) the sum of absolute values of differences (manhattan distance) between a case and its matches on the variables trunk, turn and mpg should not exceed 5; also, casewise display by make and save in file auto2.

Author ------ Guy D. van Melle, university of Lausanne, Switzerland email: guy.van-melle@@inst.hospvd.ch