{smcl}
{* *! version 1.0 7/1/2019}{...}
{vieweralsosee "" "--"}{...}
{viewerjumpto "Syntax" "sentinel##syntax"}{...}
{viewerjumpto "Description" "sentinel##description"}{...}
{viewerjumpto "Remarks" "sentinel##remarks"}{...}
{viewerjumpto "Algorithm" "sentinel##Algorithm"}{...}
{viewerjumpto "Examples" "sentinel##examples"}{...}
{viewerjumpto "Reference" "sentinel##Reference"}{...}
{phang}
{bf:sentinel} {hline 2} Select sentinel genetic variants

{marker syntax}{...}
{title:Syntax}

{p 8 17 2}
{cmdab:sentinel}
{it:depvar}
{it:indepvars}
[{help if}]
[{help in}]
[{cmd:,}
{it:options}]

{synoptset 20 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Main}
{synopt:{opt del:ta(#)}}step size by which the R-squared threshold is decremented at each step of the algorithm; default is 0.025{p_end}
{synopt:{opt r2values(#)}}number of different values of R-squared considered; r2values must be <= 1/delta; default is 1/delta{p_end}
{synopt:{opt p:value(#)}}p-value threshold for inclusion in the sentinel model; default is 0.01{p_end}
{synopt:{opt ver:sion}}display the version of the {cmd:sentinel} program{p_end}
{synopt:{opt lis:tvariants}}display a list of the variants considered by the {cmd:sentinel} program{p_end}
{synopt:{opt showprog:ress}}display a report of the progress through the R-squared values{p_end}
{synoptline}
{p2colreset}{...}

{p 4 6 2}{it:depvar} is an indicator variable that designates the case ({it:depvar} = 1) or control ({it:depvar} = 0) status of each study subject.{p_end}

{p 4 6 2}{it:indepvars} are SNPs observed on each subject. Each SNP variable gives the number of variant alleles carried by the subject.{p_end}

{marker description}{...}
{title:Description}

{pstd}
{cmd:sentinel} selects sentinel SNPs from the genetic variants in {it:indepvars}. These are SNPs that best detect independent risk-altering signals.
In a multivariable multiplicative logistic regression model that regresses {it:depvar} against the sentinel variants, each variant is significantly associated with {it:depvar} and the absolute value of the correlation coefficient of each pair of variants is low.

{marker remarks}{...}
{title:Remarks}

{pstd}
Sentinel variants are those that best detect independent risk-altering signals. This program identifies sentinel variants using the RISSc algorithm of Dupont et al. (2020). The algorithm is based explicitly upon LD patterns. It identifies variants that optimally detect the risk signal of a given LD bin, and those that detect independent risk signals across LD bins under mutual adjustment. Because any given set of variants may be sufficiently correlated that they are not significant under mutual adjustment, the algorithm uses LD patterns to ensure that variants optimally detecting independent risk signals are retained in the model, while others are removed. The algorithm works well with highly correlated variants. It seeks a multivariable model of sentinel variants with low pairwise correlation coefficients and high significance under mutual adjustment.

{marker Algorithm}{...}
{title:Algorithm}

{pstd}The RISSc algorithm selects SNPs that are mutually significant in a multivariable model and that have low pairwise R-squared values. These are sentinel SNPs, optimally detecting the independent risk-altering association signals of the starting SNP set. In what follows, all regressions are logistic and use multiplicative (additive genetic) models; {it:depvar} is an indicator variable that identifies cases and controls. {it:#d}, {it:#n} and {it:#p} are the values passed to the program by the {opt delta()}, {opt r2values()} and {opt pvalue()} options. The algorithm identifies bins of SNPs that are correlated with each other at successively lower R-squared thresholds. "Selected" means kept for possible consideration in the final sentinel model.
A selected SNP is "marked" if its association with disease is sufficient to keep it from being deleted in the next step. Not all marked SNPs will make it into the final model. Once a SNP is deleted, however, it is permanently excluded from further consideration for inclusion in the final model.

{pstd}Step 1:

{p 8 8 2}Set R2 = 1. Identify bins of SNPs that are perfectly correlated with each other (R-squared = 1). Select one SNP from each bin and delete all other SNPs in each bin from further consideration. Bins of size 1 are allowed. Regress {it:depvar} against all selected SNPs in a multivariable logistic regression model. If this regression converges, then mark all selected SNPs with P <= {it:#p} for further consideration and designate those with P > {it:#p} as unmarked. If the regression does not converge, then all selected SNPs are unmarked but remain as candidates for further evaluation. Set R2 = 1 - {it:#d}. Proceed to Step 2 with the selected SNPs, each categorized as either marked or unmarked.

{pstd}Step {it:i}, for {it:i} = 2 to {it:#n}:

{p 8 8 2}Identify bins of selected SNPs from Step {it:i} - 1 whose squared correlation coefficient is >= R2. For each bin:

{p 12 12 2}a) Identify the SNP with the greatest association with disease using simple logistic regression. This SNP is denoted {it:best-in-bin}.

{p 12 12 2}b) Regress {it:depvar} against all of the SNPs in the bin. The {it:best-in-bin} SNP plus any SNP in the multivariable regression for this bin that has P <= {it:#p} are selected, together with all SNPs that were marked in Step {it:i} - 1. Delete from further consideration all SNPs in the bin that have not been selected.

{p 8 8 2}After the selections and deletions from each bin have been made, regress {it:depvar} against all of the remaining selected SNPs in a multivariable logistic regression model. If this regression converges, then mark all SNPs with P <= {it:#p} and designate those with P > {it:#p} as unmarked.
Any SNP that was previously marked will become unmarked if it no longer meets this P-value threshold. If the model instead fails to converge, then retain the modeled SNPs but designate them as unmarked unless they were marked at the previous step. Subtract {it:#d} from R2 and increment {it:i} by 1. If {it:i} <= {it:#n}, repeat Step {it:i}.

{pstd}The final sentinel SNPs identified by this algorithm are those that were marked in Step {it:#n}. In the application of this algorithm to the set of 183 genome-wide significant variants described in Dupont et al. (2020), the only multivariable model that actually failed to converge was at Step 1 (SNPs representing bins of R2 = 1).

{marker examples}{...}
{title:Examples}

{phang}{cmd:. use testSNPs.dta}{p_end}
{phang}{cmd:. sentinel case_hpc snp8_128104117 rs6983267_T snp8_128191672}{p_end}

{phang}{cmd:. ds case_hpc, not}{p_end}
{phang}{cmd:. local snplist `r(varlist)'}{p_end}
{phang}{cmd:. * When the input list of SNPs is large it is less tedious}{p_end}
{phang}{cmd:. * to use a local macro to enter them into the sentinel program}{p_end}
{phang}{cmd:. sentinel case_hpc `snplist', delta(.05)}{p_end}

{title:Stored results}

{synoptset 15 tabbed}{...}
{p2col 5 15 19 2: Locals}{p_end}
{synopt:{cmd:r(sentinel)}}local macro consisting of the names of the sentinel SNPs selected by this program{p_end}

{title:Author}

{pstd}William D. Dupont{p_end}
{pstd}Dale Plummer{p_end}
{pstd}Department of Biostatistics{p_end}
{pstd}Vanderbilt University School of Medicine{p_end}

{pstd}Jeffrey R. Smith{p_end}
{pstd}Division of Genetic Medicine{p_end}
{pstd}Vanderbilt University Medical Center{p_end}

{pstd}Email {browse "mailto:william.dupont@vumc.org":william.dupont@vumc.org}{p_end}
{pstd}Email {browse "mailto:dale.plummer@vumc.org":dale.plummer@vumc.org}{p_end}
{pstd}Email {browse "mailto:jeffrey.smith@vumc.org":jeffrey.smith@vumc.org}{p_end}

{marker Reference}{...}
{title:Reference}

{pstd}Dupont WD, Breyer JP, Plummer WD et al.
8q24 genetic variation and comprehensive haplotypes altering familial risk of prostate cancer. {it:Nature Communications} {bf:11}, 1523 (2020). {browse "https://doi.org/10.1038/s41467-020-15122-1"}{p_end}

{pstd}(A PDF of this paper is posted at {browse "https://www.nature.com/articles/s41467-020-15122-1.pdf"}.){p_end}
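
{title:Example: using the stored results}

{pstd}The following is a minimal sketch, not part of the {cmd:sentinel} program itself. It assumes the {cmd:testSNPs.dta} dataset and the {cmd:case_hpc} outcome used in the Examples section, and that at least two sentinel SNPs are selected. The follow-up commands ({cmd:logistic} and {cmd:pwcorr}) are standard Stata commands chosen here for illustration. The sketch refits the sentinel model to inspect each variant's significance under mutual adjustment, then displays the pairwise correlation coefficients of the selected SNPs, which the Description section leads one to expect to be low in absolute value.{p_end}

{phang}{cmd:. use testSNPs.dta}{p_end}
{phang}{cmd:. ds case_hpc, not}{p_end}
{phang}{cmd:. local snplist `r(varlist)'}{p_end}
{phang}{cmd:. sentinel case_hpc `snplist', delta(.05) pvalue(.01)}{p_end}
{phang}{cmd:. * Copy r(sentinel) to a local macro before another}{p_end}
{phang}{cmd:. * r-class command overwrites the stored results}{p_end}
{phang}{cmd:. local sentinels `r(sentinel)'}{p_end}
{phang}{cmd:. logistic case_hpc `sentinels'}{p_end}
{phang}{cmd:. pwcorr `sentinels'}{p_end}

{pstd}{cmd:pwcorr} reports correlation coefficients rather than R-squared values, so square its entries before comparing them with the R-squared thresholds described in the Algorithm section.{p_end}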