-------------------------------------------------------------------------------
help for qqvalue                                                 (Roger Newson)
-------------------------------------------------------------------------------

Generate frequentist q-values by inverting multiple-test procedures

    qqvalue varname [if] [in] [ , method(method_name) bestof(#) qvalue(newvarname)
        npvalue(newvarname) rank(newvarname) svalue(newvarname) rvalue(newvarname)
        float fast ]

where method_name is one of

    bonferroni | sidak | holm | holland | hochberg | simes | yekutieli

by varlist: can be used with qqvalue. (See help for by.) If by varlist: is used, then all generated variables are calculated using the specified multiple-test procedure within each by-group defined by the variables in the varlist.

Description

qqvalue is similar to the R function p.adjust. It inputs a single variable, assumed to contain P-values calculated for multiple comparisons, in a dataset with 1 observation per comparison. It outputs a new variable, containing the frequentist q-values corresponding to these P-values, calculated by inverting a multiple-test procedure specified by the user. These q-values represent, for each corresponding P-value, the minimum uncorrected P-value threshold for which that P-value would be in the discovery set, assuming that the specified multiple-test procedure was used on the same set of input P-values to generate a corrected P-value threshold. These minimum uncorrected P-value thresholds may represent familywise error rates or false discovery rates, depending on the procedure used. Optionally, qqvalue may output other variables, containing the various intermediate results used in calculating the q-values. The multiple-test procedures available for qqvalue are a subset of those available using the multproc module of the smileplot package, which can be downloaded from SSC.

Options for qqvalue

method(method_name) specifies the multiple-test procedure to be used for calculating the q-values from the input P-values. The method_name may be bonferroni, sidak, holm, holland, hochberg, simes, or yekutieli. Each method name specifies that the q-values will be calculated from the input P-values by inverting the multiple-test procedure specified by the method() option of the same name for the multproc module of the smileplot package, which can be downloaded from SSC. If method() is unset, then it is set to bonferroni.

bestof(#) specifies an integer number. If the bestof() option is specified (and is greater than the number of input P-values), then the q-values are calculated assuming that the input P-values are a subset (usually the smallest) of a superset of P-values. If the method() option specifies a one-step method (such as bonferroni or sidak), then the q-values do not depend on the other P-values in the superset, but only on the number of P-values in the superset. If the method() option specifies a step-down method (such as holm or holland), then it is assumed that all the other P-values in the superset are greater than the largest of the input P-values. If the method() option specifies a step-up method (such as hochberg, simes, or yekutieli), then it is assumed that all the other P-values in the superset are equal to 1, implying that the q-values will be conservative, and define an upper bound to the respective q-values that would have been calculated, if we knew the other P-values in the superset. If bestof() is unspecified (or non-positive), then the input P-values are assumed to be the full set of P-values calculated. The bestof() option is useful if the input P-values are known (or suspected) to be the smallest of a greater set of P-values, which we do not know. This often happens if the input P-values are from a genome scan reported in the literature.
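As a rough illustration of the one-step case, the following sketch computes Bonferroni and Sidak q-values by hand for 3 hypothetical P-values, assumed to be the smallest of a superset of 1000. The data, the variable names and the scalar nsuper are invented for the illustration, and the formulas are the usual textbook forms of the two procedures, not code taken from qqvalue.

```stata
* Hypothetical input: the 3 smallest P-values from an assumed superset of 1000
clear
input double p
0.00001
0.00004
0.0002
end

* Assumed size of the superset (plays the role of bestof(1000))
scalar nsuper = 1000

* One-step q-values depend only on p and on the superset size,
* not on the unseen P-values in the superset
generate double q_bonferroni = min(1, nsuper * p)
generate double q_sidak      = min(1, 1 - (1 - p)^nsuper)
list
```

If qqvalue is installed, the syntax above suggests that the corresponding call would be qqvalue p, method(bonferroni) bestof(1000) qvalue(myqval).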

qvalue(newvarname) specifies the name of a new output variable to be generated, containing the q-values calculated from the input P-values, using the multiple-test procedure specified by the method() option.

npvalue(newvarname) specifies the name of a new output variable to be generated, containing, in each observation, the total number of P-values in the sample of observations specified by the if and in qualifiers, or in the by-group containing that observation, if the by: prefix is specified.

rank(newvarname) specifies the name of a new output variable to be generated, containing, in each observation, the rank of the corresponding P-value, from the lowest to the highest. Tied P-values are ranked according to their position in the input dataset. If the by: prefix is specified, then the ranks are defined within the by-group.
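The quantities described for npvalue() and rank() can be reproduced by hand, which may help when checking output. The sketch below is an illustration only: the data and the variable names grp, inputseq, npval and prank are hypothetical, and the code uses official Stata commands, not qqvalue itself.

```stata
* Hypothetical input: 5 P-values in 2 by-groups, with a tie in group 1
clear
input byte grp double p
1 0.04
1 0.01
1 0.04
2 0.30
2 0.02
end

* Record the input order, so that tied P-values are ranked by position
generate long inputseq = _n

* Within each by-group: number of P-values, and rank from lowest to highest
bysort grp (p inputseq): generate long npval = _N
by grp: generate long prank = _n

* Restore the input order
sort inputseq
list
```

In group 1, the two tied P-values of 0.04 receive ranks 2 and 3, in the order in which they appear in the input data.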

svalue(newvarname) specifies the name of a new output variable to be generated, containing the s-values calculated from the input P-values. The s-values are an intermediate result, calculated in the course of calculating the q-values, and are used mainly for validation. They are calculated from the input P-values by inverting the formulas used for the rank-specific critical P-value thresholds calculated by the multproc module of the smileplot package. These rank-specific P-value thresholds are returned in the generated variable specified by the critical() option of multproc. Note that the s-values may have values greater than 1.
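To make the idea of inverting a critical-threshold formula concrete, the sketch below computes s-values by hand for 4 hypothetical P-values. It assumes the usual textbook thresholds for each procedure (shown in the comments, with i the rank and m the number of P-values) and solves each one for alpha. The variable names are invented, and the authoritative formulas are those in Newson (2010).

```stata
* Hypothetical input: 4 P-values
clear
input double p
0.005
0.020
0.030
0.600
end

* Rank i from lowest to highest, and number of P-values m
sort p
generate long i = _n
generate long m = _N

* Harmonic sum 1 + 1/2 + ... + 1/m, used by the Yekutieli procedure
generate double harm = sum(1 / _n)

* One-step: c = alpha/m and c = 1-(1-alpha)^(1/m)
generate double s_bonferroni = m * p
generate double s_sidak      = 1 - (1 - p)^m

* Step-down: c = alpha/(m-i+1) and c = 1-(1-alpha)^(1/(m-i+1))
generate double s_holm       = (m - i + 1) * p
generate double s_holland    = 1 - (1 - p)^(m - i + 1)

* Step-up: c = alpha/(m-i+1), c = i*alpha/m, c = i*alpha/(m*harmonic sum)
generate double s_hochberg   = (m - i + 1) * p
generate double s_simes      = (m / i) * p
generate double s_yekutieli  = (m / i) * harm[_N] * p

list p i s_bonferroni s_holm s_simes s_yekutieli
```

The Bonferroni s-value for the largest P-value is 4 x 0.6 = 2.4, which shows how an s-value can exceed 1.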

rvalue(newvarname) specifies the name of a new output variable to be generated, containing the r-values calculated from the input P-values. The r-values are an intermediate result, calculated in the course of calculating the q-values, and are used mainly for validation. They are calculated from the s-values by truncating the s-values to a maximum of 1. The q-values are calculated from the r-values using a procedure dependent on the multiple-test procedure specified by the method() option. If the multiple-test procedure is a one-step procedure, such as bonferroni or sidak, then the q-values are equal to the corresponding r-values. If the multiple-test procedure is a step-down procedure, such as holm or holland, then the q-value for each P-value is equal to the cumulative maximum of all the r-values corresponding to P-values of rank equal to or less than that P-value. If the multiple-test procedure is a step-up procedure, such as hochberg, simes or yekutieli, then the q-value for each P-value is equal to the cumulative minimum of all the r-values corresponding to P-values of rank equal to or greater than that P-value.
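The step-down and step-up rules can be sketched by hand as follows, using the Holm and Simes procedures on 4 hypothetical P-values. The data and variable names are invented for the illustration, the s-value formulas are the usual textbook forms, and the code is not taken from qqvalue.

```stata
* Hypothetical input: 4 P-values
clear
input double p
0.010
0.012
0.030
0.600
end
sort p
generate long i = _n
generate long m = _N

* r-values: s-values truncated to a maximum of 1
generate double r_holm  = min(1, (m - i + 1) * p)
generate double r_simes = min(1, (m / i) * p)

* Step-down (Holm): cumulative maximum over ranks <= i.
* replace works through the observations in order, so q_holm[_n-1]
* already holds the running maximum when observation _n is processed.
generate double q_holm = r_holm
replace q_holm = max(q_holm, q_holm[_n-1]) if _n > 1

* Step-up (Simes): cumulative minimum over ranks >= i,
* computed by working down the data in descending order of p
gsort -p
generate double q_simes = r_simes
replace q_simes = min(q_simes, q_simes[_n-1]) if _n > 1

sort p
list p r_holm q_holm r_simes q_simes
```

In this example the Holm r-values are 0.04, 0.036, 0.06 and 0.6, so the cumulative maximum raises the second q-value to 0.04. The Simes r-values are 0.04, 0.024, 0.04 and 0.6, so the cumulative minimum lowers the first q-value to 0.024.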

float specifies that the output variables specified by the qvalue(), rvalue() and svalue() options will be created as variables of type float. If float is absent, then these variables are created as variables of type double. Whether or not float is specified, all generated variables are stored to the lowest precision possible without loss of information.

fast is an option for programmers. It specifies that qqvalue will not do the extra work required to restore the original data if the command fails, or if the user presses Break.

Remarks

The methods and formulas for qqvalue are given in Newson (2010). Multiple-test procedures are reviewed in Newson et al. (2003), and described in the on-line help for multproc. All of these sources contain extensive references for further reading.

The qqvalue package is similar to the R function p.adjust, which also calculates frequentist q-values corresponding to multiple-test procedures. Note that, in the on-line documentation for p.adjust in R, the q-values are referred to as "adjusted P-values", although a lot of users refer to them as "q-values". There is no clear consensus regarding the correct terminology to use, even among statisticians. The term "q-value" was introduced in Storey (2003) to describe a minimum positive false discovery rate (pFDR) under which a P-value will be included in a discovery set, assuming that this discovery set is defined to control the pFDR. The pFDR is a quantity defined for empirical Bayesian methods. By contrast, the multiple-test procedures used by qqvalue and p.adjust define the discovery set to control either the familywise error rate (FWER) or the false discovery rate (FDR), both of which are defined for purely frequentist methods. For this reason, I originally used the term "quasi-q-values" to denote frequentist q-values, and chose the name qqvalue for the package to compute these. However, I was later advised that the prefix "quasi-" was not really necessary. I therefore now simply use the term "q-values", or "frequentist q-values" if I need to distinguish them from Bayesian q-values.

qqvalue, multproc and smileplot all require input datasets with 1 observation for each of a set of P-values, usually corresponding to a set of estimated parameters. Such input datasets may be produced using the official Stata utilities statsby and postfile, or alternatively by the user-written Stata package parmest, which can be downloaded from SSC.
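As one possible way of building such a dataset with official Stata commands only, the sketch below uses postfile to collect 1 P-value per regression from the auto data. The choice of models, the handle name memhold and the variable names parm and p are hypothetical, chosen for the illustration, and the code is intended to be run as a do-file.

```stata
sysuse auto, clear

* Temporary handle and file for the posted results
tempname memhold
tempfile results
postfile `memhold' str32 parm double p using `results'

* One simple regression of price on each predictor,
* posting the two-sided P-value for the slope
foreach v of varlist mpg headroom trunk weight {
    quietly regress price `v'
    local t = _b[`v'] / _se[`v']
    post `memhold' ("`v'") (2 * ttail(e(df_r), abs(`t')))
}
postclose `memhold'

* Replace the data in memory with 1 observation per P-value
use `results', clear
list
```

The resulting dataset has 1 observation per estimated slope, with the P-values in the variable p, which is the layout that qqvalue expects.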

Technical note

If the user specifies method(sidak), then qqvalue uses the formula for method(bonferroni) to calculate the s-values, r-values and q-values corresponding to input P-values too small to be subtracted from 1 in double precision to give a result less than 1. Similarly, if the user specifies method(holland), then qqvalue uses the formula for method(holm) to calculate the s-values, r-values and q-values corresponding to input P-values too small to be subtracted from 1 in double precision to give a result less than 1. This practice ensures that the q-values will be sensible, and no smaller than the corresponding P-values, even if these P-values are too small to be subtracted from 1 to give a result less than 1. It works because the Sidak q-value converges in ratio to the Bonferroni q-value in the limit as the corresponding P-value tends to zero.
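The numerical problem behind this note can be seen in a short sketch. The scalars pval and m hold a hypothetical P-value and a hypothetical number of comparisons, and the Sidak formula used is the usual textbook form, 1-(1-p)^m.

```stata
* A hypothetical P-value far below the double-precision resolution near 1
scalar pval = 1e-20
scalar m = 10

* Displays 1 if 1 - pval is stored as exactly 1 in double precision
display (1 - pval) == 1

* The Sidak formula then collapses to 0, which is smaller than pval itself
display 1 - (1 - pval)^m

* The Bonferroni formula still gives a positive value, m*pval
display m * pval
```

For P-values this small, the Sidak and Bonferroni values would agree to many significant figures in exact arithmetic, so substituting the Bonferroni formula loses nothing of practical importance.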

Examples

The following example uses the auto data, distributed with Stata. The somersd package is used to estimate the Somers' D parameters for rank associations between a list of car-specific variables and non-US origin. The parmest package is then used to replace the dataset in memory with a new dataset, with 1 observation per estimated parameter and data on parameter estimates, confidence limits and P-values. We then use qqvalue to calculate q-values corresponding to the P-values, using the Simes procedure. The parmest and somersd packages can be downloaded from SSC.

    . sysuse auto, clear
    . somersd foreign price mpg headroom trunk weight length turn displacement gear_ratio, tdist
    . parmest, norestore
    . qqvalue p, method(simes) qvalue(myqval)
    . list

The following example also uses the auto data. It first uses the somersd package, together with the parmby module of the parmest package, to create a new dataset in memory, with 1 observation for each of a list of rank correlations involving car price in each car origin group (US and non-US cars). We then use qqvalue, with the by varlist: prefix, to demonstrate the calculation of 2 separate sets of q-values, one for US-made cars and one for non-US-made cars.

    . sysuse auto, clear
    . parmby "somersd price mpg headroom trunk weight length turn displacement gear_ratio, tdist", by(foreign) norestore
    . by foreign: qqvalue p, method(simes) qvalue(myqval)
    . by foreign: list

Author

Roger B. Newson, National Heart and Lung Institute, Imperial College London, UK. Email: r.newson@imperial.ac.uk

References

Newson, R. B. 2010. Frequentist q-values for multiple-test procedures. The Stata Journal 10(4): 568-584. Download from The Stata Journal website.

Newson, R. and the ALSPAC Study Team. 2003. Multiple-test procedures and smile plots. The Stata Journal 3(2): 109-132. Download from The Stata Journal website.

Storey, J. D. 2003. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 31(6): 2013-2035.

Also see

Manual:  [R] by, [R] statsby, [P] postfile
On-line: help for [D] by, [D] statsby, [P] postfile
         help for multproc, smileplot, parmest, somersd if installed