Stepwise hapipf routine to identify the parsimonious model to describe the Haplotype block pattern
swblock [varlist] [, mv(string) pvalue(#) stop noise acc(#) ipfacc(#) store replace ]
Description
This command systematically fits a series of hapipf log-linear models that model the LD structure within a set of loci.
The log-linear model is fitted using iterative proportional fitting, implemented in the ipf command (version 1.36 or later), which is available from ssc. The user will also have to install hapipf (version 1.44 or later). This algorithm can handle very large contingency tables and converges to maximum likelihood estimates even when the likelihood is badly behaved.
If you are connected to the Web you can install the latest version by typing ssc install hapipf
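Both prerequisites can be installed and checked from the command line. The following is a minimal sketch, assuming both packages are still distributed through ssc under the names given above:

```stata
* Install the two prerequisite packages from SSC
ssc install ipf
ssc install hapipf

* Report which copy of each command Stata finds, with its version line
which ipf
which hapipf
```

If which reports a version older than the one required, repeating the ssc install command with the replace option should fetch the current copy.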
The varlist consists of paired variables representing the alleles at each locus. If phase is known then the paired variables are in fact the genotypes. When phase is unknown the algorithm assumes Hardy-Weinberg Equilibrium, so that models are based on chromosomal data and not genotypic data.
This algorithm can handle missing alleles at the loci by using the mv() option.
Options
mv(string) specifies how the missing data will be handled; the default is mv. If the string is mv, i.e. mv(mv), then the missing data will be assumed to be missing at random (MAR) and the EM algorithm expands the unknown phase to consider all possible values for the missing value. The main assumption of this algorithm is that the missing data can only take the alleles observed at a given locus. Relaxing this assumption would make little difference, because alleles that are never observed usually give expected frequencies that are close to 0; it would, however, increase the number of cells and hence reduce power. The only other string this option takes is mvdel, i.e. mv(mvdel). Here the missing data are assumed to be missing completely at random (MCAR) and subjects are deleted when they contain any missing data at any locus. Under this assumption complete subjects are representative of the whole dataset and hence deletion will give unbiased estimates.
stop specifies that the search should stop when none of the LD terms of the order just added significantly changes the log likelihood. For example, if none of the third order LD terms included in the model were significant then the algorithm will not fit the fourth order terms.
acc(#) specifies the tolerance of hapipf convergence. The default is 0.0001.
ipfacc(#) specifies the tolerance of ipf convergence. The default is 1.000e-07.
pvalue(#) specifies the significance level for inclusion to the model; terms with p>pvalue() are not eligible for inclusion.
noise specifies that the test statistic values are included in the output.
store specifies that all the model output is saved to a file called fresults.dta
replace specifies that the old fresults.dta can be overwritten.
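Before choosing between mv(mv) and mv(mvdel), it can help to see how many subjects mv(mvdel) would discard. The following is a rough sketch, assuming the alleles are numeric variables l1_1 to l7_2 (the names used in the Examples section) and that missing alleles are stored as Stata missing values; nmiss is a hypothetical variable name, not something swblock creates:

```stata
* Count missing alleles per subject across all allele variables
egen byte nmiss = rowmiss(l1_1-l7_2)

* Number of subjects that contain at least one missing allele
count if nmiss > 0

* Distribution of the number of missing alleles per subject
tabulate nmiss
```

If a large share of subjects has nmiss > 0, deletion under mv(mvdel) may cost considerable sample size, and the MCAR assumption deserves closer scrutiny.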
Examples
Take a dataset with 7 loci, where the pairs of alleles at locus i are the variables li_1 and li_2.
.swblock l1_1-l7_2, mv(mvdel)
mvdel was specified as the missing data mechanism and all subjects with any missing data are deleted.
The following command changes the inclusion significance level to 1%:
.swblock l1_1-l7_2, mv(mvdel) pvalue(0.01)
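The remaining search options can be combined in the same way. The commands below are a sketch that uses only options listed in the syntax diagram, applied to the same 7-locus dataset; the tolerance values are arbitrary illustrations, not recommendations:

```stata
* Stop once a whole order of LD terms is non-significant,
* and print the test statistics for each term
swblock l1_1-l7_2, mv(mvdel) pvalue(0.01) stop noise

* Same search with tighter convergence tolerances than the defaults
swblock l1_1-l7_2, mv(mvdel) pvalue(0.01) stop acc(0.00001) ipfacc(1e-8)
```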
To store the results in a Stata dataset, type
.swblock l1_1-l7_2, mv(mvdel) pvalue(0.01) store replace
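The saved results can then be opened like any other Stata dataset. The following is a sketch, assuming fresults.dta was written to the current working directory; the variables it contains are not documented here, so describe is used to inspect them:

```stata
* Load the stored model output; clear discards the data in memory,
* so save the allele data first if it has unsaved changes
use fresults, clear

* Show the variables and number of observations in the results file
describe
```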
Author
Adrian Mander, Glaxo Smithkline, Harlow, UK. Email adrian.p.mander@gsk.com
Also see
On-line: Help for hapipf (MUST be installed), ipf (MUST be installed), hapblock (if installed).