Generalized power calculations saving results in variables
powercal newvarname [if exp] [in range] , [ nunit(expression_1) power(expression_2) alpha(expression_3) delta(expression_4) sdinf(expression_5) tdf(expression_6) noceiling float ]
where expression is a numeric expression. The numeric expression for each option must be in the form required by the generate command. That is to say, each expression must be specified so that the command
gene double newvarname=(expression)
will work.
Description
powercal performs generalized power calculations, storing the result in a new variable with a name specified by newvarname. All except one of the options nunit, power, alpha, delta and sdinf must be specified. The single unspecified option in this list specifies whether the output variable is the number of sampling units, power, alpha (significance level), delta (difference in parameter value to be detected), or the standard deviation (SD) of the influence function. Any of these 5 quantities can be calculated from the other 4. powercal can be used to calculate any of these quantities, assuming that we are testing a hypothesis that a parameter is zero, and that the true value is given by delta, and that the sample statistic is distributed around the population parameter in such a way that the pivotal quantity
PQ = sqrt(nunits) * (delta/sdinf)
has a standard Normal distribution (if tdf() is not specified) or a t-distribution with tdf() degrees of freedom (if tdf() is specified). The formulas used by powercal define power as the probability of detecting a difference in the right direction, using a two-tailed test.
Options
nunit(expression_1) gives an expression whose value is the number of independent sampling units. Sampling units are defined very generally. For instance, in an experiment involving equal-sized samples of individuals from Population A and Population B, a sampling unit might be a pair of sampled individuals, one from each population. Similarly, in a case-control study with 4 controls per case, a sampling unit might be a case together with 4 controls.
power(expression_2) gives an expression whose value is the power to detect a difference specified by the delta() option (see below). The power is defined as the probability that the sample difference is in the correct direction, and also large enough to be significant, using a 2-tailed test, at the level specified by the alpha() option (see below).
alpha(expression_3) gives an expression whose value is the size, or significance level, of the statistical test (in units of probability, not percentage).
delta(expression_4) gives an expression whose value is the true population difference to be detected. This difference is assumed to be positive. Therefore, if the user wishes to detect a negative difference, then s/he should specify an expression equal to minus that difference. The difference may be the log of a ratio parameter, such as an odds ratio, rate ratio, risk ratio or ratio of geometric means.
sdinf(expression_5) gives an expression whose value is the standard deviation of the influence function. That is to say, it is an expression equal to the expected standard error of the sample difference multiplied by the square root of the number of sampling units, where sampling units are defined generally, as specified in the option nunit(). In the simple case of a paired t-test, sdinf() is the standard deviation of the paired differences. More generally, sdinf() can be defined by calculating a standard error for a particular number of units, from a pilot study, from a simulation or from a formula, and multiplying this standard error by the square root of the number of units in the pilot study, simulation or formula.
tdf(expression_6) gives an expression whose value is the degrees of freedom of the t-distribution to be assumed for the pivotal quantity PQ specified above. The degrees of freedom expression is not necessarily integer-valued. If tdf() is absent, then PQ is assumed to follow a standard Normal distribution.
noceiling specifies that, if the output variable specified by newvarname is a number of units, then it will not be rounded up to the lowest integer no less than itself (as calculated by the Stata 8 ceil() function). This option can be useful if the output variable is intended to specify an amount of exposure, such as a number of person-years, and the input sdinf() expression specifies a standard deviation of the influence function per unit exposure. If noceiling is not specified, and power(), alpha(), delta() and sdinf() are specified, then powercal rounds up the output variable, so that it contains a whole number of units
float specifies that the output variable will have a storage type no higher than float. If float is not specified, then powercal creates the output variable with storage type double. Whether or not float is specified, powercal compresses the output variable as much as possible without loss of precision. (See help for compress.)
Remarks
powercal carries out sample size calculations for a more general range of possible experimental designs than sampsi, and stores the result in a new variable, instead of reporting the result in the log. The new variable may be input to further calculations and/or plotted and/or listed. powercal is intended as a low-level programming tool for users intending to carry out sample size calculations for a given experimental design. It is the responsibility of the user to ensure that the expressions are correct, and to choose a parameter scale on which the parameter is expected to be Normally distributed (or t-distributed), with a variance that does not vary excessively with the size of the measured difference.
The formulas used by powercal define power as the probability of detecting a difference in the right direction, using a two-tailed test. It follows that, in the limit, as the difference delta tends to zero, the power to detect a difference of delta with a P-value of alpha tends to a minimum of alpha/2, and not to a minimum of alpha. powercal converts to missing the results of all input expressions for power() and alpha() which evaluate to a number outside the open interval (0,1), and the results of all input expressions for delta(), sdinf() and nunit() which evaluate to a non-positive number. powercal also converts to missing all values in the output variable for which there is not a unique maximum or minimum value of the output quantity. See the manual powercal.pdf (distributed as an ancillary file with the powercal package) for details of the Methods and Formulas.
Examples
The following examples are explained in detail in the manual powercal.pdf, which is distributed with the powercal package. They are designed to work both under Stata 7 and under Stata 8.
This example creates Figure 1, displaying power as a function of the geometric mean ratio between 2 treatment groups:
. clear . scal cv=0.5 . scal sdlog=sqrt(log(cv*cv + 1)) . scal r20=exp(-2*sdlog*invnorm(0.2)) . disp _n as text "Coefficient of variation: " as result cv _n as text "SD of logs: " as result sdlog _n as text "20% tail ratio: " as result r20 . set obs 100 . gene logratio=log(2)*(_n/_N) . lab var logratio "Log GM ratio" . gene gmratio=exp(logratio) . lab var gmratio "GM ratio" . powercal power, alpha(0.01) delta(logratio) sdinf(sdlog*sqrt(2)) nunit(50) tdf(98) . version 7:graph power gmratio, s(.) c(L) ylab(0(0.05)1) yline(0.8 0.9) xlab(1(0.1)2) xlog
This example creates Figure 2, displaying detectable geometric mean ratios between 2 groups as a function of number per group:
. clear . scal cv=0.5 . scal sdlog=sqrt(log(cv*cv + 1)) . scal r20=exp(-2*sdlog*invnorm(0.2)) . disp _n as text "Coefficient of variation: " as result cv _n as text "SD of logs: " as result sdlog _n as text "20% tail ratio: " as result r20 . set obs 100 . gene npergp=_n . lab var npergp "Number per group" . powercal logratio, power(0.9) alpha(0.01) sdinf(sdlog*sqrt(2)) nunit(npergp) tdf(2*(npergp-1)) . gene hiratio=exp(logratio) . gene loratio=exp(-logratio) . lab var hiratio "Detectable GM ratio >1" . lab var loratio "Detectable GM ratio <1" . version 7:graph hiratio loratio npergp if _n>=5, s(..) c(LL) yline(1) ylab xlab(0(10)100)
This example creates Figures 3 and 4, displaying, respectively, detectable odds ratios in a case-control study as a function of number of cases and attainable significance levels as a function of odds ratio:
. clear . scal conprev=0.25 . scal conodds=conprev/(1-conprev) . disp _n as text "Expected control prevalence: " as result conprev _n as text "Expected control odds: " as result conodds . set obs 101 . gene logor=log(1.25)+(log(5)-log(1.25))*(_n-1)/(_N-1) . gene or=exp(logor) . gene caseodds=conodds*or . gene caseprev=caseodds/(1+caseodds) . gene sdinflor=sqrt( 1/caseprev + 1/(1-caseprev) + (1/2)*( 1/conprev + 1/(1-conprev) ) ) . lab var logor "Log odds ratio" . lab var or "Odds ratio" . lab var caseodds "Case exposure odds" . lab var caseprev "Case exposure prevalence" . lab var sdinflor "SD of influence for log OR" . desc . * Detectable OR by number of cases * . powercal ncases, power(0.9) alpha(0.01) delta(logor) sdinf(sdinflor) . version 7:graph or ncases if ncases<=2000, s(.) c(l) ylog ylab xlab . more . * Significance level by odds ratio * . powercal alphamin, power(0.9) delta(logor) sdinf(sdinflor) nunit(100) . version 7:graph alphamin or, s(.) c(l) ylog yreverse ylab(1 0.05 1e-1 1e-2 1e-3 1e-4 1e-5 1e-6 1e-7) yline(0.05 0.01) xlog xlab(1 1.25 1.5 2(1)5) . more
Author
Roger Newson, King's College, London, UK. Email: roger.newson@kcl.ac.uk
Also see
Manual: [R] sampsi On-line: help for sampsi