-------------------------------------------------------------------------------
help for hapipf
-------------------------------------------------------------------------------

Haplotype frequency using an EM algorithm and log-linear modelling

hapipf [varlist] [using] [if exp] , ipf(string) [ ldim(varlist) start display known phase(varname) acc(#) ipfacc(#) nolog model(#) lrtest(#,#) convars(string) confile(string) usew(string) savew(string) mv mvdel menu rare(#) cc(varname) ]

Description

This command calculates allele/haplotype frequencies using log-linear modelling embedded within an EM algorithm. The EM algorithm handles the phase uncertainty and the log-linear modelling allows testing for linkage disequilibrium and disease association. These tests can be controlled for confounders using a stratified analysis specified by the log-linear model. The log-linear model can also model the relationship between loci and hence can group similar haplotypes.

The log-linear model is fitted using iterative proportional fitting which is implemented in the STB command ipf.ado (the user will have to install this function first). This algorithm can handle very large contingency tables and converges to maximum likelihood estimates even when the likelihood is badly behaved.

The varlist consists of paired variables representing the alleles at each locus. If phase is known then the pairs are the genotypes. When phase is unknown the algorithm assumes Hardy Weinberg Equilibrium so that models are based on chromosomal data and not genotypic data.

This algorithm can handle missing alleles at the loci by using the mv option.

Options

menu specifies that the commands syntax is determined using the window interface. In the window interface select the loci and disease variables and then press the APPLY button. There should be a choice of four models when there some loci and the disease variable are selected. Either press the "put in review window" button so the syntax is placed in the review window OR press "run command" button and that model will be fitted.

rare(#) specifies that all haplotypes below this frequency (a number between 0 and 1) are grouped into one category. NOTE that you must specify the saturated mode in the ipf option and that you do not include the dependent variable in this model. This option will automatically fit firstly the saturated model to calculate haplotype frequencies and hence identify the groupings. Then with the grouped categories the independence and association models are fitted and a likelihood ratio test comparing the two is produced. The user must specify the outcome variable in the cc option.

cc(varname) specifies the outcome variable when using the rare option.

mv specifies that the algorithm should replace missing data (".") with a copy of each of the possible alleles at this locus. This is performed at the same stage as the handling of the missing phase when the dataset is expanded into all possible observations. If this option is not specified but some of the alleles do contain missing data the algorithm sees the symbol "." as another allele.

mvdel specifies that people with missing alleles are deleted.

ldim(varlist) specifies the variables that determine the dimension of the contingency table. By default the variables contained in the ipf() option define the dimension.

ipf(string) specifies the log-linear model. It requires special syntax of the form l1*l2+l3. l1*l2 allows all the interactions between the first two loci and locus 3 is independent of them. This syntax is used in most books on Log-linear modelling.

start specifies that the starting posterior weights of the EM algorithm are chosen at random. The default is that every haplotype has equal weight.

display specifies whether the expected and imputed haplotype frequencies are shown on the screen.

known specifies that phase is known.

phase(varname) specifies a variable that contains 1's where phase is known and 0's where phase is unknown.

acc(real) specifies the convergence threshold of the change of the full log-likelihood.

ipfacc(real) specifies the convergence theshold of the change in the log-likeli > hood of the log-linear model.

nolog specifies whether the log-likelihood is displayed at each iteration.

model(#) specifies a label for the log-linear model being fitted. This label is used in the lrtest() option.

lrtest(#,#) performs a likelihood ratio test using two models that have been labelled in the model() option.

convars(string) specifies a list of variables in the constraints file.

confile(string) specifies the name of the constraints file.

usew(string) specifies that the starting weights come from an external data fil > e.

savew(string) specifies that the final set of weights is saved to a data file. Thus hapipf can be run once with the savew() option and then to replicate the same analysis quicker the user can specify the usew() option with the savew() datafile.

Examples

Take a dataset with 3 loci, the pairs of alleles at locus 1 are the variables ass1 and ass2, the pairs of alleles at locus 2 are the variables bss1 and bss2 and the pairs of alleles at locus 3 are the variables drss1 and drss2. Note the ipf() option requires the log-linear model that contains lj terms which represent locus j.

The indicator variable for whether a person is a case or control is caco. To test whether the haplotypes are associated with disease is the likelihood ratio test comparing the models l1*l2*l3*caco and l1*l2*l3+caco. The following stata commands perform this test.

.hapipf ass1 ass2 bss1 bss2 drss1 drss2, ipf(l1*l2*l3*caco) model(0) display .hapipf ass1 ass2 bss1 bss2 drss1 drss2, ipf(l1*l2*l3+caco) model(1) lrtest(0,1 > ) display

If there are too many rare haplotypes they can be grouped by the following command and the likelihood ratio test is calculated for the two models above.

.hapipf ass1 ass2 bss1 bss2 drss1 drss2, ipf(l1*l2*l3) cc(caco) rare(0.05)

To identify haplotype blocks. Lets say we have 5 loci and that the first block contains the first three loci and the second block contains the last two loci then you can compare the following models

.hapipf a1 a2 b1 b2 c1 c2 d1 d2 e1 e2, ipf(l1*l2*l3*l4*l5) model(0) display .hapipf a1 a2 b1 b2 c1 c2 d1 d2 e1 e2, ipf(l1*l2*l3+l4*l5) model(1) lrtest(0,1) > display

Author

Adrian Mander, Glaxo Smithkline, Harlow, UK.

Email adrian.p.mander@gsk.com

Also see

On-line: help for Quantitative Trait qhapipf (if installed), Profile likelihoods profhap (if installed), Log-linear modelling ipf (if installed), Haplotype blocks hapblock (if installed), Haplotype blocks swblock (if installed).