Robust confidence intervals for median and other percentile slopes
censlope yvarname xvarname [weight] [if] [in], [centile(numlist) eform ystargenerate(newvarlist) estaddr somersd_options iteration_options]
where yvarname and xvarname are variable names.
fweights, iweights, and pweights are allowed; see weight. They are interpreted as for somersd.
bootstrap, by, jackknife, and statsby are allowed; see prefix.
censlope calculates confidence intervals for generalized Theil-Sen median slopes and other percentile slopes of a Y variable specified by yvarname with respect to an X variable specified by xvarname. These confidence intervals are robust to the possibility that the population distributions of the Y variable, conditional on different values of the X variable, are different in ways other than location. This might happen if, for example, the conditional distributions had different variances. For positive-valued Y variables, censlope can be used to calculate confidence intervals for median per-unit ratios or other percentile per-unit ratios associated with a unit increment in the X variable. If the X variable is binary with values 0 and 1, then the generalized Theil-Sen percentile slopes are the generalized Hodges-Lehmann percentile differences between the group of observations whose X value is 1 and the group of observations whose X value is 0. censlope is part of the somersd package and requires the somersd program to work. It executes the somersd command,
somersd xvarname yvarname [weight] [if] [in] [, somersd_options]
and then estimates the percentile slopes. The estimates and confidence limits for the percentile slopes are evaluated using an iterative numerical method, which the user may change from the default by using the iteration_options.
centile(numlist) specifies a list of percentile slopes to be reported and defaults to centile(50) (median only) if not specified. Specifying centile(25 50 75) will produce the 25th, 50th, and 75th percentile differences.
eform specifies that exponentiated percentile slopes be given. This option is used if yvarname specifies the log of a positive-valued variable. In this case, confidence intervals are calculated for percentile ratios or per-unit ratios between values of the original positive variable, instead of for percentile differences or per-unit differences.
ystargenerate(newvarlist) specifies a list of variables to be generated, corresponding to the percentile slopes, containing the differences Y*(beta)=Y-X*beta, where beta is the percentile slope. The variable names in the newvarlist are matched to the list of percentiles specified by the centile() option, sorted in ascending order of percent. If the two lists have different lengths, then censlope generates a number nmin of new variables equal to the minimum length of the two lists, matching the first nmin percentiles with the first nmin new variable names. Usually, there is only one percentile slope (the median slope), and one new ystargenerate() variable, whose median can be used as the intercept when drawing a straight line through the data points on a scatterplot.
estaddr specifies that the results saved in r() also be saved in e() (see Saved results below). This option makes it easier to use censlope with parmby, to create an output dataset (or results set) with one observation per by-group and data on confidence intervals for Somers' D and median slopes. parmby is part of the package parmest, downloadable from SSC.
somersd_options are any of the options available with somersd.
iteration_options are any of the options described in censlope_iteration.
censlope is part of the somersd package and uses the program somersd, which calculates confidence intervals for Somers' D and Kendall's tau-a. Given two random variables Y and X, a 100qth percentile slope of Y with respect to X is defined as a value of beta satisfying the equation
theta( Y-beta*X , X ) = 1 - 2q
where theta(U,V) represents either D(U|V) (Somers' D) or tau_a(U,V) (Kendall's tau-a) between the variables U and V. (For definitions of Somers' D and Kendall's tau-a, see somersd.) If q=0.5, then the value of beta is a Theil-Sen median slope. If in addition X is a binary variable, with possible values 0 and 1, then the value of beta is a Hodges-Lehmann median difference between Y values in the subpopulation in which X==1 and Y values in the subpopulation in which X==0. An alternative program for calculating Hodges-Lehmann median (and other percentile) differences is cendif, which is also distributed as part of the somersd package.
For extreme percentiles and/or very small sample numbers, censlope sometimes calculates infinite positive upper confidence limits or infinite negative lower confidence limits. These are represented by +/-c(maxdouble), where c(maxdouble) is the c-class value specifying the largest positive number that can be stored in a double.
censlope can use all the options used by somersd, to use any of the extended versions of Somers' D or Kendall's tau-a in the definition of percentile slopes, differences, and ratios. In particular, we may use the wstrata() option of somersd to estimate within-stratum median differences and slopes, based on comparisons between observations between the same stratum. This method allows us to estimate median differences in an outcome variable, associated with an exposure, within strata defined by grouping a confounder, or by grouping a propensity score for the exposure based on multiple confounders. Therefore, rank parameters (such as median differences) can be adjusted for confounders, just as regression parameters can be adjusted for confounders. However, regression methods are required to define propensity scores.
The program cendif is also part of the somersd package and calculates confidence intervals for a restricted subset of the parameters estimated by censlope, assuming a binary X variable and a restricted range of somersd_options. cendif does not use an iterative method but instead calculates all possible differences between Y values in the two groups defined by the binary X variable. In large samples, this method is more time-consuming than the iterative method used by censlope. However, in small samples (such as the auto data), cendif can be much faster than censlope.
Full documentation of the somersd package (including Methods and Formulas) is provided in the files somersd.pdf, censlope.pdf, and cendif.pdf, which are distributed with the somersd package as ancillary files (see net). They can be viewed using the Adobe Acrobat Reader, which can be downloaded from
For a comprehensive review of Kendall's tau-a, Somers' D, and median differences, see Newson (2002). The definitive reference for the statistical and computational methods of censlope is Newson (2006).
. censlope weight length
. censlope weight length, transf(z)
. censlope weight length, transf(z) centile(25(25)75)
. censlope weight foreign
. censlope weight foreign, transf(z)
. censlope weight foreign, transf(z) centile(0(25)100)
The following example estimates percentile weight ratios between non-U.S. and U.S. cars:
. gene logweight=log(weight) . censlope logweight foreign, transf(z) centile(0(25)100) eform
The following example uses the wstrata option of somersd to calculate median differences in fuel efficiency between non-U.S. and U.S. cars in the same weight quintile. We find that non-U.S. cars typically travel 2 to 7 more miles per gallon than U.S. cars, but 0 to 4 fewer miles per gallon than U.S. cars in the same weight quintile:
. xtile weightgp=weight, nquantiles(5) . tab weightgp foreign . censlope mpg foreign, transf(z) . censlope mpg foreign, transf(z) wstrata(weightgp)
The following example creates a scatterplot of car weight in U.S. pounds against car length in U.S. inches with a straight line through the data points, whose slope is the median slope and whose intercept is the median of the variable resi generated by the ystargenerate() option:
. censlope weight length, transf(z) tdist ystargenerate(resi) . egen intercept=median(resi) . gene what=weight-resi+intercept . lab var what "Predicted weight" . scatter weight length || line what length, sort
The following example uses the estaddr option together with parmby (part of the parmest package) to produce an output dataset (or results set) in the memory, with one observation per by-group, and data on confidence intervals for Somers' D and median slopes. This dataset is then input to the eclplot command to produce a confidence interval plot of Somers' D parameters and a confidence interval plot of median slopes. The packages parmest and eclplot can be downloaded from SSC.
. parmby "censlope weight length, tdist estaddr", by(foreign) norestore ecol(cimat) rename(ec_1_1 percent ec_1_2 pctlslope ec_1_3 minimum ec_1_4 maximum) . list . eclplot estimate min95 max95 foreign, hori ylabel(0 1) xtitle("Somers' D (95% CI)") . eclplot pctlslope minimum maximum foreign, hori ylabel(0 1) xtitle("Percentile slope (95% CI)")
censlope saves the following results in r():
Scalars r(level) confidence level r(fromabs) value of the fromabs() option r(tolerance) value of the tolerance() option
Macros r(yvar) name of the Y variable r(xvar) name of the X variable r(eform) eform if specified r(centiles) list of percentages for the percentiles r(technique) list of techniques from the technique() option r(tech_steps) list of step numbers for the techniques
Matrices r(cimat) confidence intervals for percentile differences or ratios r(rcmat) return codes for entries of r(cimat) r(bracketmat) bracket matrix r(techstepmat) column vector of step numbers for the techniques
The matrix r(cimat) has one row per percentile, as well as columns containing the percentages, percentile estimates, lower and upper confidence limits (labeled Percent, Pctl_Slope, Minimum, and Maximum if eform is not specified, or Percent, Pctl_Ratio, Minimum, and Maximum if eform is specified). The matrix r(rcmat) has the same numbers of rows and columns as r(cimat) with the same labels, and the first column contains the percentages, but the other entries contain return codes for the estimation of the corresponding entries of r(cimat). These return codes are equal to 0 if the beta-value was estimated successfully, 1 if the corresponding zetastar-value could not be calculated, 2 if the corresponding zetastar-value could not be bracketed, 3 if the beta-brackets failed to converge, and 4 if the beta-value could not be calculated from the converged beta-brackets. The matrix r(bracketmat) is the final version of the bracket matrix described in help for the fromabs() and brackets() options of censlope and has one row per beta-bracket, as well as two columns, labeled Beta and Zetastar, containing the beta-brackets and the corresponding zetastar-values. The matrix r(techstepmat) is a column vector with one row for each of the techniques listed in the technique() option, with a row label equal to the name of the technique and a value equal to the number of steps for that technique. The fromabs(), brackets(), tolerance(), and technique() options are described in iteration_options.
censlope also saves in e() a full set of estimation results for the somersd command. If estaddr is specified, this set of estimation results is expanded by adding a set of e() results with the same names and contents as the r() results. This option allows the user to pass a censlope command to parmby, producing an output dataset (or results set) with one observation per by-group and data on confidence intervals for Somers' D and for the median slope.
Roger Newson, Imperial College London, UK. Email: email@example.com
Newson, R. 2002. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D, and median differences. Stata Journal 2: 45-64.
Newson, R. 2006. Confidence intervals for rank statistics: Percentile slopes, differences, and ratios. Stata Journal 6: 497-520.
Manual: [R] spearman, [R] ranksum, [R] signrank, [R] centile
STB: STB-52: sg123, STB-55: snp15, STB-57: snp15.1, STB-58: snp15.2, STB-58: snp16; STB-61: snp15.3; STB-61: snp16.1
Online: ktau, ranksum, signrank cid, npshift, somersd, cendif, parmest, eclplot (if installed)