-------------------------------------------------------------------------------
help for trimmean
-------------------------------------------------------------------------------

Trimmed means as descriptive statistics

trimmean varname [if exp] [in range] , percent(numlist) [ format(format) ceiling generate(newvar) ]

Description

trimmean calculates trimmed means as descriptive statistics for varname.

Remarks

The order statistics of a sample of n values of x are defined by

    x(1) <= x(2) <= ... <= x(n-1) <= x(n)

so that x(1) is the smallest and x(n) is the largest.

The idea of a trimmed mean is quite old. For some related history, see Stigler (1973) and Barnett and Lewis (1994). The term "trimmed mean" appears to have been introduced by Tukey (1962).

As implemented in trimmean, the recipe is to set aside some fraction of the lowest order statistics and the same fraction of the highest order statistics and then to calculate the mean of what remains, thus providing some protection against possible stretched tails or outliers in a sample. For example, suppose n = 100 and we set aside 5% in each tail, namely x(1),...,x(5) and x(96),...,x(100). We can then take the mean of x(6),...,x(95). For such a definition, see (for example) Bickel (1965, p.848), Hampel et al. (1986, p.178), Barnett and Lewis (1994, p.79), David and Nagaraja (2003, p.213), or Jurečková and Picek (2006, p.67). The review by Rosenberger and Gasko (1983) remains very clear and helpful on both specific details and wider context.

By courtesy, or as a limiting case, the 50% trimmed mean is taken to be the median. The 0% trimmed mean is just the usual mean. The 25% trimmed mean has sometimes been called the midmean (i.e. the mean of the middle half of the data).

The more general rule implemented by default is that the lowest value included in the calculation of the p% trimmed mean is x(r), where r = 1 + floor(n * p/100) and the highest value included is thus x(n - r + 1). The ceiling option specifies use of ceil() rather than floor(). See Cox (2003) for more discussion and further references on those functions.
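The rank rule above can be sketched in a few lines. What follows is an illustrative Python sketch, not the Stata code of trimmean: the names kept_ranks, trimmed_mean and use_ceiling are hypothetical, missing values are not handled, and raising an error when nothing is left to average is an assumption about sensible behaviour rather than a statement of what trimmean does.

```python
import math
from statistics import mean, median


def kept_ranks(n, percent, use_ceiling=False):
    """Return (r, n - r + 1), the 1-based ranks of the lowest and highest
    order statistics kept, where r = 1 + floor(n * percent / 100)
    (or ceil() instead of floor() if use_ceiling is True)."""
    rounder = math.ceil if use_ceiling else math.floor
    r = 1 + rounder(n * percent / 100)
    return r, n - r + 1


def trimmed_mean(values, percent, use_ceiling=False):
    """Mean of x(r), ..., x(n - r + 1) with weights that are 1 or 0."""
    if not 0 <= percent <= 50:
        raise ValueError("percent must lie between 0 and 50")
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("no values supplied")
    if percent == 50:
        # By courtesy, the 50% trimmed mean is taken to be the median.
        return median(x)
    lo, hi = kept_ranks(n, percent, use_ceiling)
    kept = x[lo - 1:hi]  # ranks lo..hi inclusive, shifted to 0-based slicing
    if not kept:
        # Assumption of this sketch: refuse rather than return a value.
        raise ValueError("trimming leaves no values to average")
    return mean(kept)
```

With n = 100 and 5% trimming, kept_ranks(100, 5) gives (6, 95), matching the example of averaging x(6),...,x(95). With n = 74 and 5% trimming, floor() keeps ranks 4 to 71 and ceil() keeps ranks 5 to 70.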

Some authors propose a yet more elaborate definition in which some values may be given fractional weights. See (for example) Andrews et al. (1972, p.7), Rosenberger and Gasko (1983, p.311), Barnett and Lewis (1994, p.79) or Huber and Ronchetti (2009, pp.57-58). More precisely, whenever p/100 is not a multiple of 1/n, floor(n * p/100) values are removed in each tail, and the smallest and largest remaining values are assigned weight 1 + floor(n * p/100) - n * p/100. So for n = 74 and p = 5, n * p/100 = 3.7. Rounding down gives 3 and so we work with x(4),...,x(71). However, x(4) and x(71) are assigned weight 1 + 3 - 3.7 = 0.3 and x(5),...,x(70) weight 1. Then a weighted mean is taken.
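For comparison only, the fractional-weight definition can be sketched as follows. This is an illustrative Python sketch of the rule as described in this paragraph, not something trimmean implements; the names edge_weight and weighted_trimmed_mean are hypothetical, and the error raised when no values remain (as happens at 50% with even n) is an assumption of the sketch.

```python
import math


def edge_weight(n, percent):
    """Weight given to the smallest and largest remaining values:
    1 + floor(n * percent / 100) - n * percent / 100."""
    g = n * percent / 100
    return 1 + math.floor(g) - g


def weighted_trimmed_mean(values, percent):
    """Trimmed mean in which the two extreme remaining values receive a
    fractional weight and all values between them receive weight 1."""
    x = sorted(values)
    n = len(x)
    k = math.floor(n * percent / 100)  # values removed outright in each tail
    kept = x[k:n - k]
    if not kept:
        # Assumption of this sketch: nothing sensible to return.
        raise ValueError("trimming leaves no values to average")
    weights = [1.0] * len(kept)
    weights[0] = weights[-1] = edge_weight(n, percent)
    total = sum(w * v for w, v in zip(weights, kept))
    return total / sum(weights)
```

For n = 74 and p = 5, edge_weight(74, 5) is 0.3 up to floating-point rounding, and the weights sum to 66 + 2 * 0.3 = 66.6, which is exactly 90% of 74. When n * p/100 is an integer, the edge weight is 1 and the result coincides with the ordinary 1-or-0 definition.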

The idea underlying this alternative definition is twofold: that p% should mean precisely that, and that the result of trimming should vary as smoothly as possible with p. This alternative definition is not implemented here. Whatever its merits, always using weights that are 1 or 0 is appealingly simple and appears entirely adequate for the descriptive and exploratory uses for which this command is intended. Moreover, any fine structure that results from the inclusion and exclusion of particular values as the trimming proportion varies is likely to be trivial or part of what we are watching for, so there is no loss either way.

Note that the user-written program iqr by Hamilton (1991) calculates the 10% trimmed mean (only) as a sideline to other aims. His definition is the mean of values greater than the 10th percentile and less than the 90th percentile as calculated by summarize, so results may often differ at least slightly from those calculated by trimmean.

Options

percent(numlist) specifies percents of trimming for one or more trimmed means. Percents must be integers between 0 and 50 but otherwise can be specified as a numlist. This is a required option.

format(format) specifies a numeric format for displaying trimmed means. The default is the display format of varname.

ceiling specifies use of ceil() rather than floor() in the calculation of ranks to be included.

generate(newvar) specifies that an indicator (a.k.a. dummy) variable be generated with value 1 if an observation was included in the last trimmed mean calculated and 0 otherwise. The trimmed mean with the highest trimming percent is always produced last, regardless of the order in which percents are specified.
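The logic of such an indicator can be sketched as below. This is an illustrative Python sketch and not the Stata code of trimmean: the name inclusion_indicator is hypothetical, missing values are not handled, ties are broken by original order (an assumption), and at 50% the sketch assumes that the one or two middle values defining the median are the ones marked.

```python
import math


def inclusion_indicator(values, percents, use_ceiling=False):
    """Return a list of 1s and 0s, in the original order of values, marking
    the values averaged in the trimmed mean with the highest percent."""
    n = len(values)
    p = max(percents)  # the highest percent is the one that counts
    if p == 50:
        # Assumption: mark the middle value (n odd) or two middle values (n even).
        lo, hi = (n + 1) // 2, n // 2 + 1
    else:
        rounder = math.ceil if use_ceiling else math.floor
        lo = 1 + rounder(n * p / 100)
        hi = n - lo + 1
    # Positions of the values in ascending order; sorted() is stable, so
    # tied values keep their original relative order.
    order = sorted(range(n), key=lambda i: values[i])
    indicator = [0] * n
    for rank, i in enumerate(order, start=1):
        if lo <= rank <= hi:
            indicator[i] = 1
    return indicator
```

For the values 10, 1, 5, 100, 7 and percents 0 and 20, the lowest and highest values (1 and 100) are set aside at 20%, so the indicator is 1, 0, 1, 0, 1 whichever order the percents are given in.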

Examples

    . sysuse auto
    . trimmean mpg, p(0(5)50)
    . trimmean mpg if foreign, p(0(5)50)
    . trimmean price if foreign, p(0(5)50) format(%6.1f)

Saved results

results Stata matrix with columns percents, number averaged and trimmed means

Author

Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk

Acknowledgments

Rebecca Pope and Ariel Linden both found a typo in the help. Ariel suggested an option for generating indicator variables.

References

Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H. and Tukey, J.W. 1972. Robust estimates of location: Survey and advances. Princeton, NJ: Princeton University Press.

Barnett, V. and Lewis, T. 1994. Outliers in statistical data. Chichester: John Wiley.

Bickel, P.J. 1965. On some robust estimates of location. Annals of Mathematical Statistics 36: 847-858.

Cox, N.J. 2003. Stata tip 2: Building with floors and ceilings. Stata Journal 3: 446-447. http://www.stata-journal.com/sjpdf.html?articlenum=dm0002

David, H.A. and Nagaraja, H.N. 2003. Order statistics. Hoboken, NJ: John Wiley.

Hamilton, L.C. 1991. Resistant normality check and outlier identification. Stata Technical Bulletin 3: 15-18. http://www.stata.com/products/stb/journals/stb3.pdf

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. 1986. Robust statistics: The approach based on influence functions. New York: John Wiley.

Huber, P.J. and Ronchetti, E.M. 2009. Robust statistics. Hoboken, NJ: John Wiley.

Jurečková, J. and Picek, J. 2006. Robust statistical methods with R. Boca Raton, FL: Chapman and Hall/CRC.

Rosenberger, J.L. and Gasko, M. 1983. Comparing location estimators: trimmed means, medians, and trimean. In Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (Eds) Understanding robust and exploratory data analysis. New York: John Wiley, 297-338.

Stigler, S.M. 1973. Simon Newcomb, Percy Daniell, and the history of robust estimation 1885-1920. Journal of the American Statistical Association 68: 872-879.

Tukey, J.W. 1962. The future of data analysis. Annals of Mathematical Statistics 33: 1-67.

Also see

Online: summarize, means, trimplot (if installed), hsmode (if installed), shorth (if installed)