Robust confidence intervals for median and other percentile differences
cendif depvar [using filename] [weight] [if] [in], by(groupvar) [centile(numlist) level(#) eform ystargenerate(newvarlist) cluster(varname) cfweight(expression) funtype(functional_type) tdist transf(transformation_name) saving(filename[,replace]) nohold ]
where transformation_name is one of
iden | z | asin
and functional_type is one of
wcluster | bcluster | vonmises
fweights, iweights, and pweights are allowed; see weight.
bootstrap, by, jackknife, and statsby are allowed; see prefix.
cendif calculates confidence intervals for generalized Hodges-Lehmann median differences, and other percentile differences, between values of a Y-variable in depvar for a pair of observations chosen at random from two groups A and B, defined by the groupvar in the by() option. These confidence intervals are robust to the possibility that the population distributions in the two groups are different in ways other than location. This might happen if, for example, the two populations had different variances. For positive-valued variables, cendif can be used to calculate confidence intervals for median ratios or other percentile ratios. cendif is part of the somersd package and requires the somersd program to work. The parameters estimated by cendif are a subset of those estimated by censlope, which is also part of the somersd package. However, cendif may be easier to use than censlope and more time-efficient for small sample numbers.
Options for use with cendif
by(groupvar) is not optional. It specifies the name of the grouping variable. This variable must have exactly two possible values. The lower value indicates group A, and the higher value indicates group B.
centile(numlist) specifies a list of percentile differences to be reported and defaults to centile(50) (median only) if not specified. Specifying centile(25 50 75) will produce the 25th, 50th, and 75th percentile differences.
level(#) specifies the confidence level, as a percentage, for confidence intervals; see level.
eform specifies that exponentiated percentile differences be given. This option is used if depvar is the log of a positive-valued variable. In this case, confidence intervals are calculated for percentile ratios between values of the original positive variable instead of for percentile differences.
ystargenerate(newvarlist) specifies a list of variables to be generated, corresponding to the percentile differences, containing the differences Y*(theta)=Y-group1*theta, where group1 is a binary variable indicating membership of group 1 and theta is the percentile difference. The variable names in the newvarlist are matched to the list of percentiles specified by the centile() option, sorted in ascending order of percentage. If the two lists have different lengths, cendif generates a number nmin of new variables equal to the minimum length of the two lists, matching the first nmin percentiles with the first nmin new variable names. Usually, there is only one percentile difference (the median difference) and one new ystargenerate() variable.
cluster(varname) specifies the variable that defines sampling clusters. If cluster() is defined, then the confidence intervals are calculated assuming that the data are a sample of clusters from a population of clusters rather than a sample of observations from a population of observations.
cfweight(expression) specifies an expression giving the cluster frequency weights. These cluster frequency weights must have the same value for all observations in a cluster. If cfweight() and cluster() are both specified, then each cluster in the dataset is assumed to represent a number of identical clusters equal to the cluster frequency weight for that cluster. If cfweight() is specified and cluster() is unspecified, then each observation in the dataset is treated as a cluster, and assumed to represent a number of identical one-observation clusters equal to the cluster frequency weight. For more details on the interpretation of weights, see Interpretation of weights in the help for somersd. Note that the observation frequency weights are used by cendif for tabulating the group frequencies.
funtype(functional_type) specifies whether the percentile differences estimated are between-cluster, within-cluster or Von Mises percentile differences. These three functional types are specified by the options funtype(bcluster), funtype(wcluster) or funtype(vonmises), respectively, and correspond to the functional types of the same names used by somersd. If funtype() is not specified, then funtype(bcluster) is assumed, and between-cluster percentile differences are estimated. If the clusters are pairs of observations, and if the by() option specifies an indicator variable indicating whether the observation is the first or second member of its pair, then the within-cluster median difference is the parameter corresponding to the sign test, and the Von Mises median difference is the conventional Hodges-Lehmann median difference between the group of first members and the group of second members, with confidence limits adjusted for clustering. For further details, see the manual cendif.pdf, distributed with somersd as an ancillary file.
tdist specifies that the standardized Somers' D estimates are assumed to be sampled from a t distribution with n-1 degrees of freedom, where n is the number of clusters or the number of observations if cluster() is not specified.
transf(transformation_name) specifies that the Somers' D estimates are to be transformed, defining a standard error for the transformed population value, from which the confidence limits for the percentile differences are calculated. z (the default) specifies Fisher's z (the hyperbolic arctangent), asin specifies Daniels' arcsine, and iden specifies identity or untransformed.
saving(filename[,replace]) specifies a dataset to be created, whose observations correspond to the observed values of differences between a value of depvar in group A and a value of depvar in group B. replace instructs Stata to replace any existing dataset of the same name. The saved dataset can then be reused if cendif is called later with using, which avoids the long processing time needed to recalculate the set of observed differences. The saving() option and the using qualifier are provided mainly for programmers to use, at their own risk.
nohold indicates that any existing estimation results be overwritten with a new set of estimation results for the use of programmers. By default, any existing estimation results are restored after execution of cendif.
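As an illustration of the ystargenerate() option described above, the following sketch generates Y*(theta) for the median difference and then estimates Somers' D of the generated variable with respect to the grouping variable. It assumes that the somersd package is installed and that somersd treats the first variable of its varlist as the predictor. The variable name ystar50 is hypothetical. For the median difference (q=0.5), the estimated Somers' D should be zero or close to it.

```stata
* Sketch only: assumes the somersd package is installed.
sysuse auto, clear

* Median difference in weight between the two by() groups.
* ystar50 is a hypothetical name for the generated variable Y*(theta).
cendif weight, by(foreign) tdist ystargenerate(ystar50)

* Somers' D of ystar50 with respect to foreign.
* For the median difference (q = 0.5), the target value is 1 - 2q = 0.
somersd foreign ystar50, tdist transf(z)
```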
cendif is part of the somersd package and uses the program somersd, which calculates confidence intervals for Somers' D. A 100qth percentile difference is defined as a value of theta satisfying the equation
D[ystar(theta)|group_A] = 1-2q
where D[.|.] represents Somers' D, group_A is an indicator variable for membership of group A instead of group B, and ystar(theta) is a variable equal to depvar for observations in group A and equal to depvar+theta for observations in group B. If q=0.5, then the value of theta is the Hodges-Lehmann median difference. In this case, cendif y, by(group) gives the same median difference as npshift y, by(group), although the confidence limits may be different. (The program npshift calculates confidence intervals for the Hodges-Lehmann median difference, assuming that the two group distributions differ only in location. It is available from Stata Technical Bulletin (STB) in STB-52: sg123.)
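The definition above can be illustrated by brute force for the unweighted, unclustered case. The following sketch forms every pair of one observation from group A and one from group B, and takes the median of the pairwise differences. It is not the algorithm used by cendif. It assumes that differences are taken as group A minus group B, and the variable names weight_a, weight_b and diff are hypothetical. The result should agree with the median difference reported by cendif weight, by(foreign), up to the way in which ties and interval-valued solutions are resolved.

```stata
* Sketch only: brute-force median of all pairwise differences.
sysuse auto, clear
keep weight foreign

* Save group B (the higher by() value) in a temporary file.
preserve
keep if foreign == 1
rename weight weight_b
keep weight_b
tempfile groupb
save `groupb'
restore

* Keep group A (the lower by() value) and form every (A, B) pair.
keep if foreign == 0
rename weight weight_a
keep weight_a
cross using `groupb'

* Assumed sign convention: group A value minus group B value.
generate diff = weight_a - weight_b
summarize diff, detail
display "Median pairwise difference: " r(p50)
```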
For extreme percentiles and/or very small sample numbers, cendif sometimes calculates infinite positive upper confidence limits or infinite negative lower confidence limits. These are represented by +/-c(maxdouble), where c(maxdouble) is the c-class value specifying the largest positive number that can be stored in a double.
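A minimal sketch of how such a limit might be recognized follows. The scalar ulimit is hypothetical and stands in for a confidence limit copied from the cendif output; no stored result names are assumed here.

```stata
* c(maxdouble) is the largest number that can be stored in a double.
display c(maxdouble)

* Hypothetical example: ulimit stands in for an upper confidence limit.
scalar ulimit = c(maxdouble)
display cond(abs(ulimit) == c(maxdouble), "infinite limit", "finite limit")
```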
With very large sample numbers, cendif may be slow, as it must calculate every possible paired difference between values in the two samples to calculate the median difference. A possible remedy is to reduce the number of possible differences by grouping the Y variable. For instance, if income is a measure of income in dollars, and group is a binary variable indicating membership of one of two groups, then the user might type
. gene incomegp=100*(int(income/100)+1)
. cendif incomegp, by(group) tdist
to calculate the median difference in income between the two groups to the nearest 100 dollars. This process would probably take less time than if the user typed
. cendif income, by(group) tdist
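The effect of the grouping expression used above can be checked with a few display commands, written here for a do-file. The int() function truncates towards zero, so each income is mapped to the next multiple of 100 above it, that is, to the upper boundary of its 100-dollar band. The round() line is an assumption and not part of the original example: it shows a possible alternative if rounding to the nearest multiple of 100 is preferred.

```stata
* An income of 250 is mapped to 300.
display 100*(int(250/100)+1)

* An income of exactly 200 is also mapped to 300.
display 100*(int(200/100)+1)

* Possible alternative (an assumption): 249 is rounded to the nearest 100, giving 200.
display round(249, 100)
```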
Full documentation of the somersd package (including methods and formulas) is provided in the files somersd.pdf, censlope.pdf, and cendif.pdf, which are distributed with the somersd package as ancillary files (see net). They can be viewed using the Adobe Acrobat Reader.
For a comprehensive review of Kendall's tau-a, Somers' D, and median differences, see Newson (2002). The definitive reference for the statistical and computational methods of censlope is Newson (2006).
. cendif weight, tdist by(foreign)
. cendif weight, tdist by(foreign) ce(25 50 75)
. gene logwt=log(weight)
. cendif logwt, tdist by(foreign) ce(25 50 75) eform
. cendif mpg, by(foreign) saving(trash1, replace)
. cendif mpg using trash1, by(foreign) tr(asin) tdist
The following example uses the funtype() option to estimate median differences between paired data. It uses the bplong dataset, distributed with Stata and accessible using the sysuse command, with one observation for each of 2 blood pressure measurements (before and after treatment) for each of a sample of patients. The option funtype(wcluster) specifies the median difference between measurements on the same patient before and after treatment, which is equal to zero under the null hypothesis tested by the sign test. The option funtype(vonmises) specifies the conventional Hodges-Lehmann median difference between the group of before-treatment measurements and the group of after-treatment measurements, with estimates calculated as if the two groups were two independent samples, but with confidence limits adjusted for clustering by patient. This Von Mises parameter is zero under the null hypothesis tested by the clustered ranksum test presented in Rosner et al. (2006).
. sysuse bplong, clear
. describe, simple
. cendif bp, by(when) tdist cluster(patient) funtype(wcluster)
. cendif bp, by(when) tdist cluster(patient) funtype(vonmises)
Roger Newson, Imperial College London, UK. Email: email@example.com
Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2: 45-64.
Newson, R. 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497-520.
Rosner, B., R. J. Glynn, and M.-L. T. Lee. 2006. Extension of the rank-sum test for clustered data: Two-group comparisons with group membership defined at the subunit level. Biometrics 62(4): 1251-1259.
Manual: [R] spearman, [R] ranksum, [R] signrank, [R] centile
STB: STB-52: sg123; STB-55: snp15; STB-57: snp15.1; STB-58: snp15.2; STB-58: snp16; STB-61: snp15.3; STB-61: snp16.1.
Online: ktau, ranksum, signrank, centile, cid, npshift, somersd, censlope (if installed)