------------------------------------------------------------------------------- help for genass -------------------------------------------------------------------------------

Genetic Case-control Association tests

genass [varlist] [if exp] [in range] , group(casevar) id(study_id) output(filename) [replace text all hw allelic genotypic dominant recessive trend pwld(statistic) map(marker map) table graph]


genass is a wrapper for various genetic analysis routines which test for Hardy-Weinberg Equilibrium, perform allelic and genotypic tests of association (general, recessive and dominantmodels), and testing for trends across genotypes (additive/multiplicative models). It will carry out the tests selected for the given range of SNPs, and accumulate the results into one Stata formatted data set for subsequent review.


By virtue of being a wrapper command genass has a number of user-written ado-fies that are required for functioning. These dependencies are gencc, genhw, gtab, labmask, listtex, and pwld. If you do not have these installed then genass will not work. See findit for information on how to locate and install user written commands.


genass assumes that your genotype data is numerically encoded, with the wild-type allele as 1 and the mutant (rarer) allele as 2. This assumption is erroneos as, in some genes, the mutations (even though they cause disease) are selectively neutral (because fecundity is not affected), and the frequency of the alleles are therefore determined under the Neutral Model of Molecular Evolution (Kimura 1985). If you believe that the wild-type allele is in fact allele 2 and the mutant allele is 1, then the results under the recessive model become the dominant results, and vice-versa.


group(casevar) specifies the variable that defines case-control status. Any observations without a case-control status will cause the program to crash.

id(study_id) specifies the variable that uniquely identifies an observatio within your dataset.

output(filename) specifies the variable that uniquely identifies an observatio within your dataset.

text specifies that the data is to be written to a tab-delimited text file as well as being saved as a Stata formatted data set.

replace will over-write the output dataset if it already exists.

hw specifies that Hardy-Weinberg is to be tested in both cases and controls.

all specifies that all statistics are to be calculated (the default option).

allelic specifies that allelic association tests are to be performed. Note that the confidence intervals that are reported are exact and will differ from those obtained using the gencc command which uses the Cornfield approximation.

genotypic specifies that a general genotypic association tests are to be performed. This is a Chi-squared test on a 2x3 table.

dominant specifies that a dominant test of genotypic association tests are to be performed. The Odds-Ratio and 95% CI for the dominant model are also reported.

recessive specifies that a recessive genotypic association tests are to be performed. The Odds-Ratio and 95% CI for the recessive model are also reported.

trend specifies that a Trend Test for Proportions is to be performed. This is similar to the Mantel-Haenszel test for trend in Odds-Ratio across ordered groups, and is a valid test of association even if samples are not in Hardy-Weinberg equilibrium (Sassini 1997).

pwld(statistic) specifies that pair-wise linkage disequilibrium is to be calculated. By default D' is calculated, however it is possible to calculate any measure described in Devlin & Risch (1995). For further details of statistics please see pwld. Results are saved as a Stata formatted data set with the same name as specified for saving results, prefixed with pwld_.

map(filename) specifies the tab-delimited ASCII text file that contains the map information. The file should have two columns, the first row should contain a header (this defines the variables that are in the file). The column of marker names should be called locus, whilst the column containing the markers position in base-pairs should be called pos. Each subsequent row should contain the marker name (as referred to in your data set) followed by the position in base-pairs (the two columns should be seperated by a tab).

table specifies that a html page consisting of ten tables (one for each group of tests) is to be generated. These tables can then be inserted into web-sites or copied into word-processing software for publication (depends upon Roger Newsons listtex).

graph specifies graphs of the results are to be plotted. You MUST specify the marker map which should be a tab-delimited ASCII text file where the first column contains the locus names, and the second column contains their position in base-pairs. All graphs are generated as Portable Network Graphics format (.png) and are saved in the sub-directory graphs (which if it does not exist is created).


The results data set is saved in the file specified under the output option and is a Stata formatted data set with all specified statistics collated into one data set. All variables are labelled with a description, to see a description of the statistics that have been calculated use the describe command.

If you have specified the graph option then a number of graphs will have been generated and saved as Protable Network Graphics files (.png), and if you have specified the pwld() option then you will have a file containing the pair-wise linkage disequilibrium statistics.

There are three classes of graphs, one that plots the allele frequencies, a second group which plots the odds-ratios, and a third set which plots the -log10(p-values). If you are not familiar with interpreting such graphs then the easiest way to do so is to take your calculator and determine what -log10() of the numbers 0.05, 0.1, 0.01, 0.001 0.0001 and 0.00001 and you will see a pattern emerging.

The graphs have been saved to the sub-directory graphs, which if it did not already exist has been created. The table below details the filenames of the graphs that have been generated and what they are displaying.

File | Description > --------------------+---------------------------------------------------------- ------- allele_frq.png | Allele frequencies in cases and controls > hw_eqm.png | P-values for Hardy-Weinberg equilibrium in cases and cont > rols allele1_or.png | Odds-Ratio & 95% CI for allele 1 > allele2_or.png | Odds-Ratio & 95% CI for allele 1 > genotype_ass.png | Genotypic, dominant and recessive p-values > dominant_or.png | Odds-Ratio & 95% CI for dominant model > recessive_or.png | Odds-Ratio & 95% CI for recessive model > trend.png | P-values for trend test >


It should be noted that the p-values that are reported by this program are uncorrected. The exact p-values reported for Hardy-Weinberg equilibrium are calculated using the method proposed by Guo and Thompson (1992). All other exact p-values are calculated using Fisher's (1970) method which mitigates against asymptotic test statistics caused by low cell counts, but does not correct for multiple testing. The issue of correcting for multiple testing can be approached in a number of ways, but is particularly problematic when correcting multiple tests of association with SNPs, which because of their syntenic nature will often demonstrate a degree of correlation.

One of the appealing features of generating a dataset of your results is that you can merge in details of a genetic map (using the map() option) to aid in the graphing of results. In fact if you specify the graph option as well the graphs will be generated for you.

However, because of a limitation of 80 characters in local macros, it is NOT possible to autmoatically generate the graphs with the x-axis labelled with the marker names. If you wish to generate such graphs you can used the code within the genass command as a basis, but you will need to explicity speficy the location of each of the marker labels under the xlabel() options.

If you are recieving an error message saying no; data in memory would be lost then it means that you have modified your data set prior to running the genass command. To rectify this simply insert a line in your do-file to save, replace your data, and re-run your do-file.


The example below shows how to run your analysis on a range of SNPs, then load your results and list the Hardy-Weinberg Equilibrium statistics for those markers that show significant deviation from Hardy-Weinberg (at the 5% level of significance).

. genass snp*, group(status) id(status) output(gen_res) map(map.txt) graph

. use gen_res

. list hw_* if(hw_controls_p < 0.05 | hw_cases_p < 0.05)

You can of course substitute the conditions and variables that are listed to any of your choice. This provides a quick and efficent way of determining which markers show association, deviate from Hardy-Weinberge Equilibrium etc.

The following example includes the pwld() option and calculates the statistic R^2 (r-squared), the Stata file that contains the pairwise LD statistics that were calculated is then loaded into memory, sorted and listed.

. genass snp*, group(status) id(status) output(gen_res) pwld(R2) map(map.txt) graph

. use pwld_gen_res

. order col row

. sort col row

. list


Altman D.G. (1999) Practical Statistics for Medical Research Chapman & Hall/CRC

Armitage P (1955) Tests for Linear Trends in Proportions and Frequencies. Biometrics 3:375-386

Fisher R.A. (1970) Statistical Methods for Research Workers. 14th Edition Oxford University Press

Guo S.W., Thomspon E.A. (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361-372

Hardy G (1908) Mendelian proportions in a mixed population Science 28:49-50

Kimura M (1985) The Neutral Theory of Molecular Evolution Cambridge University Press

Sassini P (1997) From Genotypes to Genes: Doubling the Sample Size. Biometrics 53:1253-1261

Weinberg W (1908) On the demonstration of heredity in man Naturkunde in Wurttemberg, Stuttgart 64:368-382


Neil Shephard, ARC Epidemiology Unit

The University of Manchester


Please email nshephard@gmail.com if you encounter problems with this program.

Also see

Online: help for describe epitab, gencc, genhw, graph bar, graph twoway,