{smcl}
{hline}
help for {hi:cendif}{right:(SJ6-4: snp15_7; SJ6-3: snp15_6; SJ5-3: snp15_5; SJ3-3: snp15_4;}
{right:STB-61: snp15_3; STB-58: snp15_2; STB-57: snp15)}
{hline}
{title:Robust confidence intervals for median and other percentile differences}
{p 8 21 2}
{cmd:cendif} {it:depvar} [{cmd:using} {it:filename}] {weight} {ifin}{cmd:,}
{cmd:by(}{it:groupvar}{cmd:)}
[{cmdab:ce:ntile}{cmd:(}{it:numlist}{cmd:)} {cmdab:l:evel}{cmd:(}{it:#}{cmd:)} {cmdab:ef:orm}
{cmdab:ys:targenerate}{cmd:(}{help newvarlist:{it:newvarlist}}{cmd:)}
{cmdab:cl:uster}{cmd:(}{it:varname}{cmd:)}
{cmdab:cfw:eight}{cmd:(}{it:expression}{cmd:)}
{cmdab:fu:ntype}{cmd:(}{it:functional_type}{cmd:)}
{cmdab:td:ist} {cmdab:tr:ansf}{cmd:(}{it:transformation_name}{cmd:)}
{cmdab:sa:ving}{cmd:(}{it:filename}[{cmd:,replace}]{cmd:)} no{cmd:hold} ]
{pstd}
where {it:transformation_name} is one of
{p 8 21 2}
{cmd:iden} | {cmd:z} | {cmd:asin}
{pstd}
and {it:functional_type} is one of
{p 8 21 2}
{cmdab:w:cluster} | {cmdab:b:cluster} | {cmdab:v:onmises}
{pstd}
{cmd:fweight}s, {cmd:iweight}s, and {cmd:pweight}s are allowed; see
{help weight}.
{pstd}
{opt bootstrap}, {opt by}, {opt jackknife}, and {opt statsby}
are allowed; see {help prefix}.{p_end}
{title:Description}
{pstd}
{cmd:cendif} calculates confidence intervals for generalized Hodges-Lehmann
median differences, and other percentile differences, between values of a
Y-variable in {it:depvar} for a pair of observations chosen at random
from two groups A and B, defined by the {it:groupvar} in the
{cmd:by()} option. These confidence intervals are robust to the possibility
that the population distributions in the two groups are different in ways
other than location. This might happen if, for example, the two populations
had different variances. For positive-valued variables, {cmd:cendif} can be
used to calculate confidence intervals for median ratios or other percentile
ratios. {cmd:cendif} is part of the {helpb somersd} package and requires the
{helpb somersd} program to work. The parameters estimated by
{cmd:cendif} are a subset of those estimated by {helpb censlope}, which is also
part of the {helpb somersd} package. However, {cmd:cendif} may be more easy to
use than {helpb censlope} and more time-efficient for small sample numbers.
{title:Options for use with cendif}
{p 4 8 2}
{cmd:by(}{it:groupvar}{cmd:)} is not optional. It specifies the name of the
grouping variable. This variable must have exactly two possible values. The
lower value indicates group A, and the higher value indicates group B.
{p 4 8 2}
{cmd:centile(}{it:numlist}{cmd:)} specifies a list of percentile differences
to be reported and defaults to {cmd:centile(50)} (median only) if not
specified. Specifying {cmd:centile(25 50 75)} will produce the 25th, 50th, and
75th percentile differences.
{p 4 8 2}
{cmd:level(}{it:#}{cmd:)} specifies the confidence level, as a percentage, for
confidence intervals; see {helpb level}.
{p 4 8 2}
{cmd:eform} specifies that exponentiated percentile differences be
given. This option is used if {it:depvar} is the log of a positive-valued
variable. In this case, confidence intervals are calculated for percentile
ratios between values of the original positive variable instead of for
percentile differences.
{p 4 8 2}
{cmd:ystargenerate(}{help newvarlist:{it:newvarlist}}{cmd:)} specifies a list
of variables to be generated, corresponding to the percentile differences,
containing the differences {hi:Y*(theta)=Y-group1*theta}, where {hi:group1} is
a binary variable indicating membership of group 1 and {hi:theta} is the
percentile difference. The variable names in the
{help newvarlist:{it:newvarlist}} are matched to the list of percentiles
specified by the {cmd:centile()} option, sorted in ascending order of
percentage. If the two lists have different lengths, {cmd:cendif} generates
a number {it:nmin} of new variables equal to the minimum length of the two
lists, matching the first {it:nmin} percentiles with the first {it:nmin} new
variable names. Usually, there is only one percentile difference (the median
difference) and one new {cmd:ystargenerate()} variable.
{p 4 8 2}
{cmd:cluster(}{it:varname}{cmd:)} specifies the variable that defines
sampling clusters.
If {cmd:cluster()} is defined, then the confidence
intervals are calculated assuming that the data are a sample of clusters from
a population of clusters rather than a sample of observations from a
population of observations.
{p 4 8 2}
{cmd:cfweight(}{it:expression}{cmd:)} specifies an expression giving the
cluster frequency weights. These cluster frequency weights must have the same
value for all observations in a cluster. If {cmd:cfweight()} and
{cmd:cluster()} are both specified, then each cluster in the dataset is
assumed to represent a number of identical clusters equal to the cluster
frequency weight for that cluster. If {cmd:cfweight()} is specified and
{cmd:cluster()} is unspecified, then each observation in the dataset is
treated as a cluster, and assumed to represent a number of identical
one-observation clusters equal to the cluster frequency weight.
For more details on the interpretation of weights,
see {hi:Interpretation of weights} in the help for {helpb somersd}.
Note that the observation frequency weights are used by {cmd:cendif}
for tabulating the group frequencies.
{p 4 8 2}
{cmd:funtype(}{it:functional_type}{cmd:)} specifies
whether the percentile differences estimated are between-cluster, within-cluster
or Von Mises percentile differences.
These three functional types are specified by the
options {cmd:funtype(bcluster)}, {cmd:funtype(wcluster)} or
{cmd:funtype(vonmises)}, respectively,
and correspond to the functional types of the same names used by {helpb somersd}.
If {cmd:funtype()} is not specified, then {cmd:funtype(bcluster)} is assumed,
and between-cluster percentile differences are estimated.
If the clusters are pairs of observations,
and if the {cmd:by()} option specifies an indicator variable
indicating whether the observation is the first or second member of its pair,
then the within-cluster median difference is the parameter
corresponding to the {help signrank:sign test},
and the Von Mises median difference is the conventional Hodges-Lehmann median difference
between the group of first members and the group of second members,
with confidence limits adjusted for clustering.
For further details, see the manual {hi:cendif.pdf}, distributed with
{helpb somersd} as an ancillary file.
{p 4 8 2}
{cmd:tdist} specifies that the standardized Somers' {it:D} estimates are
assumed to be sampled from a t distribution with n-1 degrees of
freedom, where n is the number of clusters or the number of observations
if {cmd:cluster()} is not specified.
If {cmd:tdist} is not specified,
then the standardized Somers' {it:D} estimates are assumed to be sampled from a standard Normal distribution.
Simulation study data suggest that the {cmd:tdist} option should be recommended.
{p 4 8 2}
{cmd:transf(}{it:transformation_name}{cmd:)} specifies that the Somers' {it:D}
estimates are to be transformed, defining a standard error for the transformed
population value, from which the confidence limits for the percentile
differences are calculated. {cmd:z} (the default) specifies Fisher's z
(the hyperbolic arctangent), {cmd:asin} specifies Daniels' arcsine, and
{cmd:iden} specifies identity or untransformed.
{p 4 8 2}
{cmd:saving(}{it:filename}[{cmd:,replace}]{cmd:)} specifies a dataset to be
created, whose observations correspond to the observed values of differences
between a value of {it:depvar} in group A and a value of {it:depvar} in
group B. {cmd:replace} instructs Stata to replace any existing dataset
of the same name. The saved dataset can then be reused if {cmd:cendif} is
called later with {cmd:using} to save the long processing times
used to calculate the set of observed differences. The {cmd:saving()} option
and the {cmd:using} qualifier are provided mainly for programmers to use, at
their own risk.
{p 4 8 2}
{cmd:nohold} indicates that any existing estimation results be
overwritten with a new set of estimation results for the use of programmers.
By default, any existing estimation results are restored after execution of
{cmd:cendif}.
{marker cendif_remarks}{...}
{title:Remarks}
{pstd}
{cmd:cendif} is part of the {helpb somersd} package and uses the program
{helpb somersd}, which calculates confidence intervals for Somers' {it:D}. A
100{hi:q}th percentile difference is defined as a value of {hi:theta}
satisfying the equation
{pstd}
{hi:D[ystar(theta)|group_A] = 1-2q}
{pstd}
where {hi:D[.|.]} represents Somers' {it:D}, {hi:group_A} is an indicator
variable for membership of group A instead of group B, and
{hi:ystar(theta)} is a variable equal to {it:depvar} for observations in group
A and equal to {it:depvar}{hi:+theta} for observations in group B.
If {hi:q}=0.5, then the value of {hi:theta} is the Hodges-Lehmann median
difference. In this case, {cmd:cendif y, by(group)} gives the same median
difference as {cmd:npshift y, by(group)}, although the confidence limits may
be different. (The program {helpb npshift} calculates confidence intervals for
the Hodges-Lehmann minimum difference, assuming that the two group
distributions differ only in location. It is available from Stata Technical
Bulletin (STB) in STB-52: sg123.)
{pstd}
For extreme percentiles and/or very small sample numbers, {cmd:cendif}
sometimes calculates infinite positive upper confidence limits or infinite
negative lower confidence limits. These are represented by
{hi:+/-}{cmd:c(maxdouble)}, where {cmd:c(maxdouble)} is the
{help creturn:c-class value} specifying the largest positive number that can
be stored in a {help data_types:double}.
{pstd}
With very large sample numbers, {cmd:cendif} may be slow, as it must calculate
every possible paired difference between values in the two samples to
calculate the median difference. A possible remedy is to reduce the number of
possible differences by grouping the Y variable. For instance, if {cmd:income}
is a measure of income in dollars, and {cmd:group} is a binary variable
indicating membership of one of two groups, then the user might type
{p 4 8 2}{cmd:. gene incomegp=100*(int(income/100)+1)}{p_end}
{p 4 8 2}{cmd:. cendif incomegp, by(group) tdist}{p_end}
{pstd}
to calculate the median difference in income between the two groups to the
nearest 100 dollars. This process would probably take less time than if the
user typed
{p 4 8 2}{cmd:. cendif income, by(group) tdist}{p_end}
{pstd}
Full documentation of the {helpb somersd} package (including methods and
formulas) is provided in the files {hi:somersd.pdf}, {hi:censlope.pdf}, and
{hi:cendif.pdf}, which are distributed with the {helpb somersd} package as
ancillary files (see {helpb net}). They can be viewed using the Adobe
Acrobat Reader, which can be downloaded from
{browse "http://www.adobe.com/products/acrobat/readermain.html":http://www.adobe.com/products/acrobat/readermain.html}
{pstd}
For a comprehensive review of Kendall's tau-a, Somers' {it:D}, and median
differences, see Newson (2002).
The definitive reference for the statistical and computational methods of {cmd:censlope}
is Newson (2006).
{title:Examples}
{p 4 8 2}{cmd:. cendif weight, tdist by(foreign)}{p_end}
{p 4 8 2}{cmd:. cendif weight, tdist by(foreign) ce(0(25)100)}{p_end}
{p 4 8 2}{cmd:. gene logwt=log(weight)}{p_end}
{p 4 8 2}{cmd:. cendif logwt, tdist by(foreign) ce(0(25)100) eform}{p_end}
{p 4 8 2}{cmd:. cendif mpg, by(foreign) saving(trash1, replace)}{p_end}
{p 4 8 2}{cmd:. cendif mpg using trash1, by(foreign) tr(asin) tdist}{p_end}
{pstd}
The following example uses the {cmd:funtype()} option to estimate median differences between paired data.
It uses the {helpb dta_examples:bplong} dataset,
distributed with Stata and accessible using the {helpb sysuse} command,
with one observation for each of 2 blood pressure measurements (before and after treatment)
for each of a sample of patients.
The option {cmd:funtype(wcluster)} specifies the median difference
between measurements on the same patient before and after treatment,
which is equal to zero under the null hypothesis tested by the {help signrank:sign test}.
The option {cmd:funtype(vonmises)} specifies the conventional Hodges-Lehmann median difference
between the group of before-treatment measures and the group of after-treatment measurements,
with estimates calculated as if the two groups were two independent samples,
but with confidence limits adfjusted for clustering by patient.
This Von Mises parameter is zero under the null hypothesis tested by the clustered ranksum test
presented in Rosner {it:et al.} (2006).
{p 4 8 2}{cmd:. sysuse bplong, clear}{p_end}
{p 4 8 2}{cmd:. describe, simple}{p_end}
{p 4 8 2}{cmd:. cendif bp, by(when) tdist cluster(patient) funtype(wcluster)}{p_end}
{p 4 8 2}{cmd:. cendif bp, by(when) tdist cluster(patient) funtype(vonmises)}{p_end}
{title:Saved results}
{pstd}
{cmd:cendif} saves the following in {cmd:r()}:
{synoptset 20 tabbed}{...}
{p2col 5 20 24 2: Scalars}{p_end}
{synopt:{cmd:r(N)}}number of observations{p_end}
{synopt:{cmd:r(N_clust)}}number of clusters{p_end}
{synopt:{cmd:r(N_1)}}first sample size{p_end}
{synopt:{cmd:r(N_2)}}second sample size{p_end}
{synopt:{cmd:r(df_r)}}residual degrees of freedom (if {cmd:tdist} present){p_end}
{synopt:{cmd:r(level)}}confidence level{p_end}
{p2col 5 20 24 2: Macros}{p_end}
{synopt:{cmd:r(depvar)}}name of {it:Y}-variable{p_end}
{synopt:{cmd:r(by)}}name of {cmd:by()} variable defining groups{p_end}
{synopt:{cmd:r(clustvar)}}name of cluster variable{p_end}
{synopt:{cmd:r(cfweight)}}{cmd:cfweight()} expression{p_end}
{synopt:{cmd:r(funtype)}}{cmd:funtype()} option{p_end}
{synopt:{cmd:r(tdist)}}{cmd:tdist} if specified{p_end}
{synopt:{cmd:r(wtype)}}weight type{p_end}
{synopt:{cmd:r(wexp)}}weight expression{p_end}
{synopt:{cmd:r(centiles)}}list of percents for percentiles{p_end}
{synopt:{cmd:r(Dslist)}}list of D-star values for percentiles{p_end}
{synopt:{cmd:r(transf)}}transformation specified by {cmd:transf()}{p_end}
{synopt:{cmd:r(tranlab)}}transformation label in output{p_end}
{synopt:{cmd:r(eform)}}{cmd:eform} if specified{p_end}
{p2col 5 20 24 2: Matrices}{p_end}
{synopt:{cmd:r(cimat)}}confidence intervals for differences or ratios{p_end}
{synopt:{cmd:r(Dsmat)}}upper and lower limits for D-star values{p_end}
{p2colreset}{...}
{pstd}
The D-star value for a percentile {hi:theta} is the value of {hi:D[ystar(theta)|group_A]},
as defined in {helpb cendif##cendif_remarks:Remarks} above.
{title:Author}
{pstd}
Roger Newson, Imperial College London, UK.{break}
Email: {browse "mailto:r.newson@imperial.ac.uk":r.newson@imperial.ac.uk}
{title:References}
{phang}
Newson, R. 2002.
Parameters behind "nonparametric" statistics:
Kendall's tau, Somers' {it:D} and median differences.
{it:Stata Journal} 2: 45-64.
Download from
{browse "http://www.stata-journal.com/article.html?article=st0007":the {it:Stata Journal} website}.
{phang}
Newson, R. 2006.
Confidence intervals for rank statistics:
Percentile slopes, differences, and ratios.
{it:Stata Journal} 6: 497-520.
Download from
{browse "http://www.stata-journal.com/article.html?article=snp15_7":the {it:Stata Journal} website}.
{phang}
Rosner, B., R. J. Glynn and M-L. T. Lee. 2006.
Extension of the rank-sum test for clustered data:
Two-group comparisons with group membership defined at the subunit level.
{it:Biometrics} 62(4): 1251-1259.
{title:Also see}
{psee}
Manual: {hi:[R] spearman}, {hi:[R] ranksum}, {hi:[R] signrank}, {hi:[R] centile}
{p_end}
{psee}
STB: STB-52: sg123, STB-55: snp15, STB-57: snp15.1, STB-58: snp15.2,
STB-58: snp16; STB-61: snp15.3; STB-61: snp16.1.
{psee}
Online: {helpb ktau}, {helpb ranksum}, {helpb signrank}, {helpb centile}{break}
{helpb cid}, {helpb npshift}, {helpb somersd}, {helpb censlope} (if installed)
{p_end}