-------------------------------------------------------------------------------
help for cendif (SJ6-4: snp15_7; SJ6-3: snp15_6; SJ5-3: snp15_5; SJ3-3: snp15_4; STB-61: snp15_3; STB-58: snp15_2; STB-57: snp15)
-------------------------------------------------------------------------------

Robust confidence intervals for median and other percentile differences

cendif depvar [using filename] [weight] [if] [in], by(groupvar) [ centile(numlist) level(#) eform ystargenerate(newvarlist) cluster(varname) cfweight(expression) funtype(functional_type) tdist transf(transformation_name) saving(filename[,replace]) nohold ]

where transformation_name is one of

iden | z | asin

and functional_type is one of

wcluster | bcluster | vonmises

fweights, iweights, and pweights are allowed; see weight.

bootstrap, by, jackknife, and statsby are allowed; see prefix.

Description

cendif calculates confidence intervals for generalized Hodges-Lehmann median differences, and other percentile differences, between values of a Y-variable in depvar for a pair of observations chosen at random from two groups A and B, defined by the groupvar in the by() option. These confidence intervals are robust to the possibility that the population distributions in the two groups are different in ways other than location. This might happen if, for example, the two populations had different variances. For positive-valued variables, cendif can be used to calculate confidence intervals for median ratios or other percentile ratios. cendif is part of the somersd package and requires the somersd program to work. The parameters estimated by cendif are a subset of those estimated by censlope, which is also part of the somersd package. However, cendif may be easier to use than censlope and more time-efficient for small sample numbers.

Options for use with cendif

by(groupvar) is not optional. It specifies the name of the grouping variable. This variable must have exactly two possible values. The lower value indicates group A, and the higher value indicates group B.

centile(numlist) specifies a list of percentile differences to be reported and defaults to centile(50) (median only) if not specified. Specifying centile(25 50 75) will produce the 25th, 50th, and 75th percentile differences.

level(#) specifies the confidence level, as a percentage, for confidence intervals; see level.

eform specifies that exponentiated percentile differences be given. This option is used if depvar is the log of a positive-valued variable. In this case, confidence intervals are calculated for percentile ratios between values of the original positive variable instead of for percentile differences.

ystargenerate(newvarlist) specifies a list of variables to be generated, corresponding to the percentile differences, containing the differences Y*(theta) = Y - group1*theta, where group1 is a binary variable indicating membership of group 1 and theta is the percentile difference. The variable names in the newvarlist are matched to the list of percentiles specified by the centile() option, sorted in ascending order of percentage. If the two lists have different lengths, cendif generates a number nmin of new variables equal to the minimum length of the two lists, matching the first nmin percentiles with the first nmin new variable names. Usually, there is only one percentile difference (the median difference) and one new ystargenerate() variable.
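The following lines are a minimal sketch of how ystargenerate() might be used. The auto dataset (shipped with Stata) and the new variable name ystar50 are illustrative assumptions, not part of the original help. If the definition given under Remarks applies, the Somers' D of ystar50 with respect to the grouping variable, as reported by somersd, would be expected to be close to zero for the median difference.

```stata
* Sketch only: the auto data and the variable name ystar50 are illustrative choices
sysuse auto, clear
* Median difference in weight between the two groups defined by foreign,
* saving the shifted Y-variable for the 50th percentile in ystar50
cendif weight, by(foreign) tdist ystargenerate(ystar50)
* Somers' D of ystar50 with respect to group membership (expected near zero)
somersd foreign ystar50, tdist
```

With the default centile(50), only one new variable name is needed. A longer centile() list would need one name per percentile.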

cluster(varname) specifies the variable that defines sampling clusters. If cluster() is specified, then the confidence intervals are calculated assuming that the data are a sample of clusters from a population of clusters rather than a sample of observations from a population of observations.

cfweight(expression) specifies an expression giving the cluster frequency weights. These cluster frequency weights must have the same value for all observations in a cluster. If cfweight() and cluster() are both specified, then each cluster in the dataset is assumed to represent a number of identical clusters equal to the cluster frequency weight for that cluster. If cfweight() is specified and cluster() is unspecified, then each observation in the dataset is treated as a cluster, and assumed to represent a number of identical one-observation clusters equal to the cluster frequency weight. For more details on the interpretation of weights, see Interpretation of weights in the help for somersd. Note that the observation frequency weights are used by cendif for tabulating the group frequencies.

funtype(functional_type) specifies whether the percentile differences estimated are between-cluster, within-cluster, or Von Mises percentile differences. These three functional types are specified by the options funtype(bcluster), funtype(wcluster), or funtype(vonmises), respectively, and correspond to the functional types of the same names used by somersd. If funtype() is not specified, then funtype(bcluster) is assumed, and between-cluster percentile differences are estimated. If the clusters are pairs of observations, and if the by() option specifies an indicator variable indicating whether the observation is the first or second member of its pair, then the within-cluster median difference is the parameter corresponding to the sign test, and the Von Mises median difference is the conventional Hodges-Lehmann median difference between the group of first members and the group of second members, with confidence limits adjusted for clustering. For further details, see the manual cendif.pdf, distributed with somersd as an ancillary file.

tdist specifies that the standardized Somers' D estimates are assumed to be sampled from a t distribution with n-1 degrees of freedom, where n is the number of clusters or the number of observations if cluster() is not specified.

transf(transformation_name) specifies that the Somers' D estimates are to be transformed, defining a standard error for the transformed population value, from which the confidence limits for the percentile differences are calculated. z (the default) specifies Fisher's z (the hyperbolic arctangent), asin specifies Daniels' arcsine, and iden specifies identity or untransformed.
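As a rough sketch, the three transformations can be compared by running the same command with each documented transformation name. The auto dataset and the choice of mpg as the Y-variable are illustrative assumptions. The point estimate of the median difference should not depend on the transformation, so only the confidence limits would be expected to differ between runs.

```stata
* Sketch only: the auto data and the variable mpg are illustrative choices
sysuse auto, clear
* Default: Fisher's z transformation
cendif mpg, by(foreign) transf(z)
* Daniels' arcsine transformation
cendif mpg, by(foreign) transf(asin)
* Identity (untransformed Somers' D)
cendif mpg, by(foreign) transf(iden)
```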

saving(filename[,replace]) specifies a dataset to be created, whose observations correspond to the observed values of differences between a value of depvar in group A and a value of depvar in group B. replace instructs Stata to replace any existing dataset of the same name. The saved dataset can then be reused if cendif is called later with using, to save the long processing times used to calculate the set of observed differences. The saving() option and the using qualifier are provided mainly for programmers to use, at their own risk.

nohold indicates that any existing estimation results be overwritten with a new set of estimation results for the use of programmers. By default, any existing estimation results are restored after execution of cendif.
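A minimal sketch of how a programmer might inspect what nohold leaves behind is given below. It assumes that the new estimation results are visible through the standard ereturn list command; the names and contents of those results are not documented here, so nothing is assumed about them. The auto dataset is an illustrative choice.

```stata
* Sketch only: the auto data are an illustrative choice
sysuse auto, clear
* Run cendif, leaving its own estimation results in memory
cendif weight, by(foreign) tdist nohold
* List whatever estimation results are now stored
ereturn list
```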

Remarks

cendif is part of the somersd package and uses the program somersd, which calculates confidence intervals for Somers' D. A 100qth percentile difference is defined as a value of theta satisfying the equation

D[ystar(theta)|group_A] = 1-2q

where D[.|.] represents Somers' D, group_A is an indicator variable for membership of group A instead of group B, and ystar(theta) is a variable equal to depvar for observations in group A and equal to depvar+theta for observations in group B. If q=0.5, then the value of theta is the Hodges-Lehmann median difference. In this case, cendif y, by(group) gives the same median difference as npshift y, by(group), although the confidence limits may be different. (The program npshift calculates confidence intervals for the Hodges-Lehmann median difference, assuming that the two group distributions differ only in location. It is available from the Stata Technical Bulletin (STB) in STB-52: sg123.)

For extreme percentiles and/or very small sample numbers, cendif sometimes calculates infinite positive upper confidence limits or infinite negative lower confidence limits. These are represented by +/-c(maxdouble), where c(maxdouble) is the c-class value specifying the largest positive number that can be stored in a double.

With very large sample numbers, cendif may be slow, as it must calculate every possible paired difference between values in the two samples to calculate the median difference. A possible remedy is to reduce the number of possible differences by grouping the Y-variable. For instance, if income is a measure of income in dollars, and group is a binary variable indicating membership of one of two groups, then the user might type

. gene incomegp=100*(int(income/100)+1)

. cendif incomegp, by(group) tdist

to calculate the median difference in income between the two groups to the nearest 100 dollars. This process would probably take less time than if the user typed

. cendif income, by(group) tdist

Full documentation of the somersd package (including methods and formulas) is provided in the files somersd.pdf, censlope.pdf, and cendif.pdf, which are distributed with the somersd package as ancillary files (see net). They can be viewed using the Adobe Acrobat Reader, which can be downloaded from http://www.adobe.com/products/acrobat/readermain.html

For a comprehensive review of Kendall's tau-a, Somers' D, and median differences, see Newson (2002). The definitive reference for the statistical and computational methods of censlope is Newson (2006).

Examples

. cendif weight, tdist by(foreign)

. cendif weight, tdist by(foreign) ce(25 50 75)

. gene logwt=log(weight)

. cendif logwt, tdist by(foreign) ce(25 50 75) eform

. cendif mpg, by(foreign) saving(trash1, replace)

. cendif mpg using trash1, by(foreign) tr(asin) tdist

The following example uses the funtype() option to estimate median differences between paired data. It uses the bplong dataset, distributed with Stata and accessible using the sysuse command, with one observation for each of 2 blood pressure measurements (before and after treatment) for each of a sample of patients. The option funtype(wcluster) specifies the median difference between measurements on the same patient before and after treatment, which is equal to zero under the null hypothesis tested by the sign test. The option funtype(vonmises) specifies the conventional Hodges-Lehmann median difference between the group of before-treatment measurements and the group of after-treatment measurements, with estimates calculated as if the two groups were two independent samples, but with confidence limits adjusted for clustering by patient. This Von Mises parameter is zero under the null hypothesis tested by the clustered rank-sum test presented in Rosner et al. (2006).

. sysuse bplong, clear

. describe, simple

. cendif bp, by(when) tdist cluster(patient) funtype(wcluster)

. cendif bp, by(when) tdist cluster(patient) funtype(vonmises)

Author

Roger Newson, Imperial College London, UK. Email: r.newson@imperial.ac.uk

References

Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal 2: 45-64.

Newson, R. 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497-520.

Rosner, B., R. J. Glynn, and M-L. T. Lee. 2006. Extension of the rank-sum test for clustered data: Two-group comparisons with group membership defined at the subunit level. Biometrics 62(4): 1251-1259.

Also see

Manual: [R] spearman, [R] ranksum, [R] signrank, [R] centile

STB: STB-52: sg123, STB-55: snp15, STB-57: snp15.1, STB-58: snp15.2, STB-58: snp16, STB-61: snp15.3, STB-61: snp16.1.

Online: ktau, ranksum, signrank, centile, cid, npshift, somersd, censlope (if installed)