-------------------------------------------------------------------------------
help for scsomersd, sccendif and sccenslope                      (Roger Newson)
-------------------------------------------------------------------------------

Rank statistics for scenario comparisons

scsomersd y0 [ y1 ] [weight] [if] [in] [, sweight(expression) nyvar(newvar) nweight(newvar) ncfweight(newvar) nobs(newvar) nscen(newvar) somersd_options ]

sccendif y0 [ y1 ] [weight] [if] [in] [, sweight(expression) nyvar(newvar) nweight(newvar) ncfweight(newvar) nobs(newvar) nscen(newvar) cendif_options ]

sccenslope y0 [ y1 ] [weight] [if] [in] [, sweight(expression) nyvar(newvar) nweight(newvar) ncfweight(newvar) nobs(newvar) nscen(newvar) censlope_options ]

where y0 and y1 are either varnames or numbers, somersd_options is a list of options for somersd other than funtype(), cendif_options is a list of options for cendif other than funtype() and by(), and censlope_options is a list of options for censlope other than funtype().

fweights, iweights, and pweights are allowed; see weight. However, cluster frequency weights must be specified using the cfweight() option of somersd, cendif and censlope.

bootstrap, by, jackknife, and statsby are allowed. See prefix.

Description

scsomersd, sccendif and sccenslope compute confidence intervals for rank statistics comparing scenarios. Scenarios are alternative versions of the data, differing in the values of sampling probability weights and/or in the values of an outcome variable. The scenario-comparison rank statistics compare 2 scenarios, denoted Scenario 0 and Scenario 1, derived from the dataset in memory, in a temporary extended dataset with 1 observation per original observation per scenario. scsomersd estimates the Somers' D or Kendall tau-a of the outcome variable, and sccendif and sccenslope estimate the Hodges-Lehmann percentile differences of the outcome variable, with respect to scenario membership. Examples of between-scenario rank statistics include the Gini coefficient of inequality, the population attributable risk (PAR), percentiles of weighted and/or clustered samples, and Hodges-Lehmann percentile differences between paired samples. scsomersd, sccendif and sccenslope use the packages somersd and expgen, which must be installed in order for the programs to work, and can be downloaded from SSC.

Options for use with scsomersd, sccendif and sccenslope

sweight(expression) specifies the weight expression for use in Scenario 1. The type of weights (fweights, pweights or iweights), and the weight expression for use in Scenario 0, are specified in the weight expression supplied to the command. If sweight() is not specified, then the weight expression for Scenario 1 is set to the weight expression for Scenario 0. Note that both scenario weight expressions are interpreted as importance weights, and that cluster frequency weights must be specified using the cfweight() option of somersd, cendif and censlope.

nyvar(newvar) specifies the name of the temporary variable, in the expanded dataset with 1 observation per original observation per scenario, containing the outcome or Y-values for use in the scenario comparison. In observations in Scenario 0, the value of this variable is equal to the variable (or number) y0. In observations in Scenario 1, the value of this variable is equal to the variable (or number) y1. In default, the name is set to _yvar.

nweight(newvar) specifies the name of the temporary variable, in the expanded dataset with 1 observation per original observation per scenario, containing the scenario-specific weights for use in the scenario comparison. These weights are equal to the weight expression passed to the command, in observations in Scenario 0, and equal to the weight expression specified by sweight(), in observations in Scenario 1. In default, the name is set to _weight.

ncfweight(newvar) specifies the name of the temporary variable, in the expanded dataset with 1 observation per original observation per scenario, containing the cluster frequency weights for use in the scenario comparison, specified in the cfweight() option of somersd, cendif and censlope. These cluster frequency weights belong to clusters in the original dataset, if a cluster() option is specified. Otherwise, they belong to clusters in the extended two-scenario dataset corresponding to the observations in the original dataset. In default, the name is set to _cfweight.

nobs(newvar) specifies the name of the temporary variable, in the expanded dataset with 1 observation per original observation per scenario, containing the sequential order of the observation, in the original dataset, corresponding to each observation in the extended dataset. In default, the name is set to _obs.

nscen(newvar) specifies the name of the temporary variable, in the expanded dataset with 1 observation per original observation per scenario, containing the scenario indicator of each observation in the extended dataset. In the case of scsomersd and sccenslope, this temporary variable is an indicator of membership of Scenario 0, equal to 0 for observations in Scenario 1, and 1 for observations in Scenario 0. In the case of sccendif, this temporary variable is an indicator of membership of Scenario 1, equal to 0 for observations in Scenario 0, and 1 for observations in Scenario 1. In default, the name is set to _scen0 by scsomersd and sccenslope, and to _scen1 by sccendif.

somersd_options, cendif_options and censlope_options specify lists of options, to be passed to somersd, cendif and censlope, respectively. These options must not include the funtype() option, which is set automatically to funtype(vonmises). In the case of sccendif, these options must not include the by() option, which is set automatically to the name of the Scenario 1 membership indicator variable specified by the nscen() option.

Remarks

scsomersd, sccendif and sccenslope work by calling somersd, cendif and censlope, respectively, in a temporary extended dataset, with 1 observation per original observation per scenario. This temporary dataset is generated using the expgen package, downloadable from SSC. It contains temporary variables, which are the outcome variable, scenario-specific weight variable, cluster frequency weight variable, observation sequence variable, and scenario membership indicator variable. The outcome variable is equal to the input variable (or number) y0 for observations in Scenario 0. For observations in Scenario 1, the outcome variable is equal to the input variable (or number) y1, if specified, and otherwise is equal to the input variable (or number) y0. scsomersd calls somersd to estimate the Somers' D, or Kendall's tau-a, of the outcome variable, with respect to membership of Scenario 0 instead of Scenario 1. sccendif and sccenslope call cendif and censlope, respectively, to estimate the Hodges-Lehmann percentile differences between observations in Scenario 0 and observations in Scenario 1. In all cases, the observations of the temporary extended dataset are clustered, and the confidence interval calculation assumes that clusters are sampled from a population of clusters, instead of assuming that observations are sampled from a population of observations. Clusters are defined using the variable specified by the input cluster() option, if one is supplied, and otherwise are defined using the original-observation sequence variable specified by the nobs() option.
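
The extended dataset described above can be sketched by hand with official Stata commands. This is an illustrative sketch only: it uses expand as a stand-in for expgen, borrows the default variable names _obs, _scen0 and _yvar, and assumes unit weights in both scenarios. The final call to somersd is an assumption about what scsomersd does internally, based on the description above, and it requires the somersd package. Unlike scsomersd, the sketch changes the dataset in memory.

    . webuse bpwide, clear
    . * Original-observation sequence, later used as the cluster variable
    . generate long _obs = _n
    . * Two copies of each observation, one per scenario
    . expand 2
    . bysort _obs: generate byte _scen0 = (_n == 1)
    . * Outcome is y0 in Scenario 0 and y1 in Scenario 1
    . generate double _yvar = cond(_scen0 == 1, bp_after, bp_before)
    . list _obs _scen0 _yvar in 1/4
    . somersd _scen0 _yvar, cluster(_obs) funtype(vonmises) tdist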

Scenario comparison statistics include a large number of commonly used statistics as special cases. Examples include the Gini inequality coefficient, the population attributable risk (PAR), percentiles estimated from samples which may be clustered and/or weighted by sampling probability, and Hodges-Lehmann percentile differences between paired samples.

The commands sccenslope and sccendif estimate the same parameters (Hodges-Lehmann percentile differences), with confidence intervals calculated by the same formulas. However, sccendif uses the cendif algorithm, which uses less computer time in small samples, and sccenslope uses the censlope algorithm, which uses less computer time in large samples. For more details on the formulas, see the on-line and .pdf documentation for cendif and censlope.

The somersd, cendif and censlope commands are part of the somersd package. The somersd and expgen packages are downloadable from SSC.

Examples

The following example estimates the Gini inequality coefficient for wages in the womenwage data. In this case, Scenario 0 is a fantasy lottery in which each woman has a number of tickets proportional to her wage, and Scenario 1 is a second fantasy lottery in which each woman has one ticket, whatever her wage, even if it is zero. The Gini inequality coefficient is reported as Somers' D, and is the difference between 2 probabilities, namely the probability that the winner of the first lottery has a higher wage than the winner of the second lottery and the probability that the winner of the second lottery has a higher wage than the winner of the first lottery. This difference is always non-negative, but is higher in populations with more unequal wage distributions.

    . webuse womenwage, clear
    . describe
    . scsomersd wage [pwei=wage], swei(1) transf(z) tdist

The following example estimates the unstandardized population attributable risk (PAR) of case status with respect to exposure in the ugdp data. In this case, Scenario 0 is the sample we have, in which some subjects are exposed, and Scenario 1 is a fantasy sample, in which no subjects are exposed. The PAR is then the difference between the proportion of subjects which are cases in Scenario 0 and the proportion of subjects which are cases in Scenario 1. This difference between proportions is reported as Somers' D. Note that the population attributable risk (PAR) is a between-scenario difference, and is not the same parameter as a population attributable fraction (PAF), which is equal to one minus a between-scenario ratio, and which can be estimated using the punaf package, downloadable from SSC.

    . webuse ugdp, clear
    . sort age exposed case
    . describe
    . list, sepby(age)
    . scsomersd case [pwei=1], sweight(exposed==0) cfwei(pop) tdist transf(z)
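
As a rough cross-check on the definition above, the two scenario proportions can be computed directly with summarize. The sketch assumes, as in the example, that case is a 0/1 indicator and that pop holds the number of subjects represented by each observation. It gives a point estimate only, with no confidence interval, and the scalar names p0 and p1 are purely illustrative.

    . webuse ugdp, clear
    . * Scenario 0: proportion of cases among all subjects
    . summarize case [fweight=pop], meanonly
    . scalar p0 = r(mean)
    . * Scenario 1: proportion of cases among unexposed subjects only
    . summarize case [fweight=pop] if exposed==0, meanonly
    . scalar p1 = r(mean)
    . display "Unstandardized PAR = " p0 - p1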

The following example estimates the age-standardized population attributable risk (PAR) in the ugdp data. We first define the total numbers of subjects by age group, and by age group and exposure, in the variables wfreq and wxfreq, respectively. The ratio dswei=wfreq/wxfreq is a direct standardization weight, standardizing from the sampled population at each exposure level to a target population, with the same age distribution as the total dataset at all exposure levels combined. We then input these direct standardization weights to scsomersd to define a difference between the prevalences of case status between Scenario 0 (the existing sample) and Scenario 1 (a fantasy sample with the same age distribution and no exposure).

    . webuse ugdp, clear
    . sort age exposed case
    . by age: egen wfreq=total(pop)
    . by age exposed: egen wxfreq=total(pop)
    . gene dswei=wfreq/wxfreq
    . describe
    . list, sepby(age exposed)
    . scsomersd case [pwei=1], sweight(dswei*(exposed==0)) cfwei(pop) tdist transf(z)

The following example measures percentile prices (in dollars) of cars in the auto data. First, we define a string variable firm, containing, for each car, the firm that made the car. We then calculate 2 sets of percentile car prices (the 25th percentile, the median and the 75th percentile), each defined as a percentile difference between car prices under Scenario 0 (the sample of cars in the dataset) and Scenario 1 (a fantasy scenario in which all cars have zero price). The first set of percentiles are unweighted, with confidence intervals calculated assuming that car models have been sampled from a population of car models. The second set of percentiles are weighted by the car's volume in cubic inches (displacement), with confidence intervals calculated using the option cluster(firm), assuming that firms have been sampled from a population of firms.

    . webuse auto, clear
    . gene firm=word(make,1)
    . lab var firm "Firm"
    . sort foreign firm make
    . describe
    . tab firm, m
    . sccendif price 0, tdist centile(25(25)75)
    . sccendif price 0 [pwei=displacement], tdist centile(25(25)75) cluster(firm)
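
For the unweighted case, the percentiles can be compared with those from the official centile command, since a percentile difference against a constant zero reduces to a percentile of price itself. This is only a sanity check under that assumption: centile uses its own interpolation rule and confidence interval method, so small discrepancies from the sccendif output are to be expected.

    . webuse auto, clear
    . * Unweighted 25th, 50th and 75th percentiles of price
    . centile price, centile(25 50 75)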

The following example uses the bpwide data, with 1 observation for each of a sample of fictional patients, and data on blood pressures before and after an unidentified treatment. We measure the Hodges-Lehmann median difference between post-treatment and pre-treatment blood pressures in the paired sample, with confidence intervals calculated to allow for the non-independence of paired blood pressures from the same patient. This median difference is reported as a median slope, together with the mean sign of all treated-untreated differences (between the same or different patients), which is reported as Somers' D.

    . webuse bpwide, clear
    . sccenslope bp_after bp_before, tdist
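
The median difference and mean sign described above are both defined over all pairs of one post-treatment and one pre-treatment value, from the same or different patients. A minimal sketch of that definition, using the official cross command to form every such pair, is given below. The variable names diff and sgn and the tempfile name before are illustrative. The sketch yields point estimates only: the confidence intervals shown by centile and summarize ignore the dependence between pairs and should be disregarded, and the interpolation rule of centile may differ slightly from the one used by censlope.

    . webuse bpwide, clear
    . keep bp_before
    . tempfile before
    . save `before'
    . webuse bpwide, clear
    . keep bp_after
    . * Form all pairs of one post-treatment and one pre-treatment value
    . cross using `before'
    . generate diff = bp_after - bp_before
    . * Median of all pairwise differences
    . centile diff, centile(50)
    . * Mean sign of all pairwise differences
    . generate byte sgn = sign(diff)
    . summarize sgn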

The following example uses the same bpwide data, and calculates the median pairwise difference between post-treatment and pre-treatment blood pressures from the same patient (reported as a median slope), together with the mean sign of those differences (reported as Somers' D).

    . webuse bpwide, clear
    . gene bp_diff=bp_after-bp_before
    . lab var bp_diff "After-before blood pressure difference"
    . sccenslope bp_diff 0, tdist
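
Because each patient contributes exactly one difference in this case, the two point estimates can be checked with official commands, as sketched below. The variable name sgn is illustrative, and the confidence intervals from centile and summarize are not those computed by sccenslope.

    . webuse bpwide, clear
    . generate bp_diff = bp_after - bp_before
    . * Median within-patient difference
    . centile bp_diff, centile(50)
    . * Mean sign of the within-patient differences
    . generate byte sgn = sign(bp_diff)
    . summarize sgn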

Saved results

scsomersd saves in e() the estimation results from the somersd command that it calls. sccendif saves in r(), and optionally in e(), the results from the cendif command that it calls. sccenslope saves in r(), and in e(), the results from the censlope command that it calls. However, in all cases, the estimation sample function e(sample) indicates, in each observation, the presence of that observation in Scenario 0, and is not affected by the presence of the same observation in Scenario 1. Observations present in one scenario may be absent in the other, due to zero sampling probability weights.

Author

Roger Newson, National Heart and Lung Institute, Imperial College London, UK. Email: r.newson@imperial.ac.uk

Also see

Manual:  [R] spearman, [R] ranksum, [R] signrank, [R] roc, [R] centile
Online:  ktau, ranksum, signrank, roc, centile
         somersd, cendif, censlope, expgen, punaf if installed