{smcl}
{* *! version 1.0 7/1/2019}{...}
{vieweralsosee "" "--"}{...}
{viewerjumpto "Syntax" "sentinel##syntax"}{...}
{viewerjumpto "Description" "sentinel##description"}{...}
{viewerjumpto "Remarks" "sentinel##remarks"}{...}
{viewerjumpto "Algorithm" "sentinel##Algorithm"}{...}
{viewerjumpto "Examples" "sentinel##examples"}{...}
{viewerjumpto "Reference" "sentinel##Reference"}{...}
{phang}
{bf:sentinel} {hline 2} Select sentinel genetic variants
{marker syntax}{...}
{title:Syntax}
{p 8 17 2}
{cmdab:sentinel}
{it:depvar} {it:indepvars}
[{help if}]
[{help in}]
[{cmd:,} {it:options}]
{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{opt del:ta(#)}} a step size used to decrement the value of R-squared used in the sentinel program; default value is 0.025.{p_end}
{synopt:{opt r2values(#)}} number of different values of R-squared considered; r2values must be <= 1/delta; default value is 1/delta.{p_end}
{synopt:{opt p:value(#)}} p-value for inclusion in the sentinel model; default value is 0.01.{p_end}
{synopt:{opt ver:sion}} If present then the version of the sentinel program will be displayed.{p_end}
{synopt:{opt lis:tvariants}} If present then a list of the variants considered by the sentinel program will be displayed.{p_end}
{synopt:{opt showprog:ress}} If present then a report of the progress through the R**2 values will be displayed.{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}{it:depvar} is an indicator variable that designates case ({it:depvar} = 1) or control ({it:depvar} = 0)
status of study subjects.
{p_end}
{p 4 6 2}{it:indepvars} are SNPs observed on each subject. Each SNP gives the number of variant alleles for each subject.
{p_end}
{marker description}{...}
{title:Description}
{pstd}
{pstd}
{cmd:sentinel} selects sentinel SNPs from the genetic variants in {it:indepvars}. These are SNPs
that best detect independent risk-altering signals. In a multivariable multiplicative logistic
regression model that regresses {it:depvar} against the sentinel variants, each variant is significantly
associated with {it:depvar} and the absolute value of the correlation coefficient of each pair of variants is low.
{marker remarks}{...}
{title:Remarks}
{pstd}
Sentinel variants are those best detecting independent risk-altering signals. This program identifies sentinel variants
using the RISSc algorithm of Dupont et al. 2020. It is based explicitly upon LD patterns
and identifies variants that optimally detect the risk
signal of a given LD bin, and those which detect independent risk signals across LD bins under mutual
adjustment. Because any given set of variants may be sufficiently correlated that they are not significant
under mutual adjustment, the algorithm judiciously employs LD patterns to ensure that variants optimally
detecting independent risk signals are retained in the model, while others are removed. The algorithm
works well with highly correlated variants. It seeks a multivariable model of sentinel variants with low
pairwise correlation coefficients and high significance under mutual adjustment.
{marker Algorithm}{...}
{title:Algorithm}
{pstd}The RISSc algorithm selects SNPs that are mutually significant in a multivariable model, and which have low pair-wise R-squared
values. These are sentinel SNPs, optimally detecting the independent risk-altering association signals of the starting SNP
set. In what follows, all regressions are logistic and use multiplicative (additive genetic) models; {it:depvar}
is an indicator variable that identifies cases and controls. {it:d}, {it:#n} and {it:#p} are values passed to
the program by the delta, r2values and pvalue options. The algorithm identifies bins of SNPs
that are correlated with each other with diminishing R-squared thresholds. "Selected" means kept for
possible consideration in the final sentinel model. A selected SNP is "marked" if its association with
disease is sufficient to keep it from being deleted in the next step. Not all marked SNPs will make it
into the final model. Once a SNP is deleted, however, it is permanently excluded from further
consideration for inclusion in the final model.
{pstd}Step 1:
{p 8 8 2}Set R2 = 1. Identify bins of SNPs that are perfectly correlated with each other (R-squared = 1). Select
one SNP from each bin and delete all other SNPs in each bin from further consideration. Bins of size 1
are allowed. Regress {it:depvar} against all selected SNPs in a multivariable logistic
regression model. If this regression converges then mark all selected SNPs with
P <= {it:#p} for further consideration and designate those of P > {it:#p} as unmarked. If
the regression does not converge, then all selected SNPs are unmarked but remain as
candidates for further evaluation. Set R2 = 1-{it:#d}. Proceed to Step 2 with the selected SNPs, each
categorized as either marked or unmarked.
{pstd}Step {it:i}: {it:i} = 2 to {it:#n}:
{p 8 8 2}Identify bins of selected SNPs from Step {it:i} - 1 whose squared correlation coefficient is >= R2. For each bin:
{p 12 12 2}a) Identify the SNP with the greatest association with disease using simple logistic
regression. This SNP is denoted {it:best-in-bin}.
{p 12 12 2}b) Regress {it:depvar} against all of the SNPs in the bin. The {it:best-in-bin} SNP plus
any SNP in the multivariable regression for this bin that has P <= {it:#p} are selected together with
all SNPs that were marked in Step {it:i} - 1. Delete all SNPs in the bin that have not been selected
from further consideration.
{p 8 8 2}After the selections and deletions from each bin have been made, regress {it:depvar} against
all of these remaining selected SNPs in a multivariable logistic regression model. If this regression
converges, then mark all SNPs of P <= {it:#p} while designating those of P > {it:#p} as unmarked. Any
SNP that was previously marked will become unmarked if it no longer meets this P-value threshold. If
the model instead fails to converge, then retain the modeled SNPs but designate them as unmarked
unless they were marked at the previous step. Subtract {it:#d} from R2 and increment {it:i} by 1.
If {it:i} <= {it:#n} loop to repeat Step {it:i}.
{pstd}The final sentinel SNPs identified by this algorithm are those that were marked in
Step {it:#n}. In the application of this algorithm to the 183 genome-wide significant
variant set described in Dupont et al. 2020, the only multivariate model that actually
failed to converge was at Step 1 (SNPs representing bins of R2 = 1).
{marker examples}{...}
{title:Examples}
{phang}{cmd:. use testSNPs.dta}{p_end}
{phang}{cmd:. sentinel case_hpc snp8_128104117 rs6983267_T snp8_128191672}{p_end}
{phang}{cmd:. ds case_hpc, not}{p_end}
{phang}{cmd:. local snplist `r(varlist)'}{p_end}
{phang}{cmd:. * When the input list of SNPs is large it is less tedious}{p_end}
{phang}{cmd:. * to use a local macro to enter them into the sentinel program}{p_end}
{phang}{cmd:. sentinel case_hpc `snplist', delta(.05)}{p_end}
{title:Stored results}
{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Locals}{p_end}
{synopt:{cmd:r(sentinel)}} local macro consisting of the names of the sentinel SNPs selected by this program {p_end}
{title:Author}
{pstd}William D. Dupont{p_end}
{pstd}Dale Plummer{p_end}
{pstd}Department of Biostatistics{p_end}
{pstd}Vanderbilt University School of Medicine{p_end}
{pstd}Jeffrey R. Smith{p_end}
{pstd}Division of Genetic Medicine{p_end}
{pstd}Vanderbilt University Medical Center{p_end}
{pstd}Email {browse "mailto:william.dupont@vumc.org":william.dupont@vumc.org}{p_end}
{pstd}Email {browse "mailto:dale.plummer@vumc.org":dale.plummer@vumc.org}{p_end}
{pstd}Email {browse "mailto:jeffrey.smith@vumc.org":jeffrey.smith@vumc.org}{p_end}
{marker Reference}{...}
{title:Reference}
{pstd}Dupont WD, Breyer JP, Plummer WD et al. 8q24 genetic variation and comprehensive haplotypes
altering familial prostate cancer. {it:Nature Communications} {bf:11,} 1523 (2020). https://doi.org/10.1038/s41467-020-15122-1{p_end}
{pstd}(a pdf of this paper is posted at https://www.nature.com/articles/s41467-020-15122-1.pdf).{p_end}