/* ado\mahascore.ado 1-2-2008 David Kantor Based on earlier mahascore.ado, dated 2-9-2006. This is to compute a (squared) Mahalanobis distance measure. Properly, one would want the unsquared value, but our default is to leave it squared. But in most usages, the results are used in comparisons (or sortings); the proportional magnitude is not significant. So the squared values are just as good. We do have an -unsquared- option, for those who prefer the unsquared result. In Dec. 2007, I was informed that mahascore had a problem in that it does a (squared) Normalized Euclidean measure -- not a true Mahalanobis measure (squared or not). The Normalized Euclidean measure assumes no correlation between coviariates; equivalently, it assumes that the ellipsoids are oriented parallel to the axes. I am now attempting to correct that -- to yield a true (squared) Mahalanobis measure. 2011dec16: changed from use of -word- function to -word- macro extended function; this corrected a truncation problem that arose with lengthy varlists. 2012feb8: enhanced handling of errors from covariancemat. */ *! version 1.0.7 2012feb8 /* prior: 1.0.6 2011dec16 */ prog def mahascore version 8.2 /* A new variable will be computed that reflects the distance between each observation and one of these: a reference observation (the refobs option); a tuple of values passed in via the refvals option; the means of all the variables (the refmeans option) `refobs' is an index to the reference observation. We have a -treated- option; it is used in limiting the covariance and means calculations to the treated cases. ALSO: This could be made to have an -in- and an -if-. (But you might want to assure that the refobs is in the set defined by those qualifiers.) This allows a pre-calculated inverse covariance matrix (invcovarmat) to be passed in as an option. This facilitates prevention of recalculating the (same) covariances in repeated calls. There is another very good reason to have this option. There are times when you want to run this on a subset of the data, by doing a drop or keep operation -- but where you want variances calculated on a larger set. This happens, in particular, when this is called by mahapick using the -sliceby- option. `display' tells some things to display: covar, invcov, means. invcovarmat is optional; we are not using a reserved name such as "none" to indicate to compute the covariances (as was done analogously in mahascore). Instead, the user must put in the COMPUTE_invcovarmat option. The user must specify one of invcovarmat or compute_invcovarmat . We allow weights; this applies only to the computation of the covarmat and the means. The result, if not unsquared, ought to be >=0; this should be true if the invcovariance matrix is truly an inverse covariance matrix, but may not be in general if an arbitrary matrix is presented. (Negative resluts are meaningless and will yield missing under the -unsquared- option.) The matrix product that we compute to get the result is expected to be >=0 if invcovariance matrix is truly an inverse covariance matrix. This is based on a theorem that asserts that a covariance matrix (and it inverse) are positive semi-definite. */ syntax varlist [aw fw iw pw] , gen(name) /// [ refobs(numlist integer min=1 max=1 >=1 <= `=_N') /// treated(varname numeric) INVCOVarmat(name) COMPUTE_invcovarmat DISPlay(string) /// UNSQuared EUCLidean refvals(name) refmeans VERBose float /// noCOVTRLIMitation noMEANTRLIMitation ] /* refobs is given as a numlist rather than an integer -- so it can be optional without having to give a default value (which you can't distinguish from explicitly entering the value). numlist also enables us to specify the range limits in the syntax instead of having to test it later. */ local progname "mahascore" if "`verbose'" ~= "" { if "`refvals'" ~= "" { local refvals_disp "refvals(`refvals')" } if "`refobs'" ~= "" { local refobs_disp "refobs(`refobs')" } if "`invcovarmat'" ~= "" { local invcovarmat_disp "invcovarmat(`invcovarmat')" } disp as text "`progname' called; `refmeans' `refvals_disp' `refobs_disp' `invcovarmat_disp' `compute_invcovarmat'" /* -- we can't display every option. */ } capture confirm new var `gen' if _rc ~=0 { disp as error "`progname': gen (`gen') must be a new variable" exit 198 } if "`treated'" ~= "" { if "`covtrlimitation'" =="" { local iftreated_for_cov "if `treated'" } if "`meantrlimitation'" =="" { local iftreated_for_mean "if `treated'" } } if "`unsquared'" ~= "" { local sqrt "sqrt" } local nvars : word count `varlist' if "`refmeans'" ~= "" { if "`refvals'" ~= "" { disp as err "`progname': refvals will be ignored, since refmeans was specified" } if "`refobs'" ~= "" { disp as err "`progname': refobs will be ignored, since refmeans was specified" } /* create our own set of refvals -- the means of `varlist' */ tempname refvals matrix `refvals' = J(`nvars', 1, .) matrix rownames `refvals' = `varlist' foreach v of local varlist { local rownum = rownumb(matrix(`refvals'), "`v'") summ `v' `iftreated_for_mean' [`weight' `exp'] , meanonly matrix `refvals'[`rownum', 1] = r(mean) } if index("`display'", "means") { disp _new as text "means vector:" mat list `refvals' } } else { /* `refmeans' absent */ if "`refvals'" ~= "" { if "`refobs'" ~= "" { disp as err "`progname': refobs will be ignored, since refvals was specified" } capture confirm matrix `refvals' if _rc ~=0 { disp as error "`progname': refvals must be a matrix" exit 198 } local r_nrows = rowsof(`refvals') local r_ncols = colsof(`refvals') if `r_nrows' ~= `nvars' { disp as err "`progname': refvals (`refvals') must have as many rows as vars in varlist" exit 198 } if `r_ncols' ~= 1 { disp as err "`progname': refvals (`refvals') must have 1 column" exit 198 } local r_rows : rownames(`refvals') forvalues j = 1/`nvars' { if "`:word `j' of `r_rows''" ~= "`:word `j' of `varlist''" { disp as err "`progname': rownames of refvals must correspond to varlist" exit 198 } } } else { /* `refvals' absent */ if "`refobs'" == "" { disp as err "`progname': must specify refobs, refvals, or refmeans" exit 198 } } } if "`compute_invcovarmat'" ~= "" { if "`invcovarmat'" ~= "" { disp as err "`progname': note: both compute_invcovarmat and invcovarmat were specified; ignoring invcovarmat" } /* covariancemat is an ado by d.k. */ tempname cov invcov capture covariancemat `varlist' `iftreated_for_cov' [`weight' `exp'], covarmat(`cov') if _rc { disp as err "error from covariancemat" error _rc } if index("`display'", "covar") { disp _new as text "covariance matrix:" mat list `cov' } if "`euclidean'" ~= "" { /* -euclidean- option means to zero the off-diagonal elements. It should be equivalent to the behavior of the old mahascore; see notes at top. This can only apply under the compute_invcovarmat option, as it needs to be done BEFORE the inverting. The resulting inverted matrix should be just the reciprocals in the diagonal. */ local crows = rowsof(`cov') // local ccols = colsof(`cov') -- should equal crows forvalues j = 2 / `crows' { forvalues k = 1 / `=`j'-1' { mat `cov'[`j', `k'] = 0 mat `cov'[`k', `j'] = 0 } } if index("`display'", "covar") { disp _new as text "covariance matrix, after euclidean operation:" mat list `cov' } } /* disp "begin inv(cov)" */ mat `invcov' = inv(`cov') /* -- you may be able to do invsym(`cov') But there are pitfalls, if `cov' is not positive-definite. (It ought to be at least positive-semidefinite.) But we are most correct to use inv(). */ /* disp "end inv(cov)" */ } else { /* compute_invcovarmat not specified */ if "`invcovarmat'" == "" { disp as err "`progname': you must specify compute_invcovarmat or invcovarmat" exit 198 } else { capture confirm matrix `invcovarmat' if _rc ~=0 { disp as error "`progname': invcovarmat must be a matrix (inverse covariance)" exit 198 } } if "`weight'" ~= "" { disp as err "`progname': weight will be ignored, since compute_invcovarmat was not specified" } if "`treated'" ~= "" { disp as err "`progname': treated will be ignored, since compute_invcovarmat was not specified" } if "`euclidean'" ~= "" { disp as err "`progname': euclidean will be ignored, since compute_invcovarmat was not specified" } local invcov "`invcovarmat'" } local rows : rownames `invcov' local cols : colnames `invcov' local icrows = rowsof(`invcov') local iccols = colsof(`invcov') /* Equivalently, local icrows : word count `rows' local iccols : word count `cols' */ if `icrows' ~= `iccols' { disp as err "`progname': invcovarmat (`invcov') must be square" exit 198 } if `icrows' ~= `nvars' { disp as err "`progname': invcovarmat (`invcov') must by kXk where k=num vars in varlist" exit 198 } /* At this point, `iccols' == `nvars', implicitly. */ if index("`display'", "invcov") { disp _new as text "invcovariance matrix:" mat list `invcov' } forvalues j = 1/`nvars' { if "`:word `j' of `rows''" ~= "`:word `j' of `varlist''" { disp as err "`progname': rownames of invcovarmat must correspond to varlist" exit 198 } if "`:word `j' of `cols''" ~= "`:word `j' of `varlist''" { disp as err "`progname': colnames of invcovarmat must correspond to varlist" exit 198 } } if "`refvals'" ~= "" { /* Note: at this point, `refvals' could be from the refvals option, or computed via the refmeans option. Create macros for delayed expansion. */ local vjref "`refvals'[\`j',1]" local vkref "`refvals'[\`k',1]" local vref "`refvals'[\`rownum',1]" } else { /* `refvals' absent; `refobs' must be filled-in. */ local vjref "\`vj'[\`refobs']" local vkref "\`vk'[\`refobs']" local vref "\`v'[\`refobs']" } /*----> This is a computation of the score based directly on a matrix expression. Unfortunately, it needs to loop through the obs, and is relatively slow. Below is the alternative; much faster. Thus, the latter is to be used. (Testing showed the reuslts to be equal or very close -- differences of about 4e-14.) ---- quietly gen double `gen' = . tempname X Y /* Xt */ forvalues obsno = 1/`=_N' { matrix `X' = J(`nvars', 1, .) matrix rownames `X' = `varlist' foreach v of local varlist { local rownum = rownumb(matrix(`X'), "`v'") matrix `X'[`rownum', 1] = `v'[`obsno'] - `vref' /* last term should be `v'[`refobs'] or `refvals'[`rownum',1] */ } /* disp "obsno=`obsno', ---- listing X:" mat list `X' matrix `Xt' = `X'' disp "obsno=`obsno', ---- listing Xt:" mat list `Xt' */ matrix `Y' = `X'' * `invcov' * `X' // `Y' is 1x1 /* disp "listing Y" mat list `Y' */ quietly replace `gen' = `sqrt'(scalar(`Y'[1,1])) in `obsno' } <-----*/ /* The preferred alternative, following the method in _Dif_mbase in psmatch2 by Leuven & Sianesi. Calculate the equivalent matrix expression -- as a dataset variable. (They also do a trick based on the matrix being symmetrical; we don't do that here; we assume a generic matrix at this point. They also use a trick based on tokenizing a varlist. We don't do that here.) */ quietly gen double `gen' = 0 forvalues j = 1 / `icrows' { forvalues k = 1 / `icrows' { /* or iccols -- the same */ local vj : word `j' of `rows' local vk : word `k' of `rows' /* or cols -- should be the same */ quietly replace `gen' = `gen' + /// `invcov'[`j,', `k'] * (`vj' - `vjref') * (`vk' - `vkref') } } if "`unsquared'" ~= "" { quietly replace `gen' = sqrt(`gen') /* Note that this sqrt goes with the "alternative" computation -- not the matrix-expression computation, as that has `sqrt' built into it. */ } if "`float'" ~= "" { quietly recast float `gen', force } /* Note: for the float option, this -recast- at the end is more accurate than declaring `gen' to be float in the first place; that would introduce rounding at each stage of adding up many terms. */ end // mahascore