-------------------------------------------------------------------------------
help for winsor
-------------------------------------------------------------------------------

Winsorizing a variable

winsor varname [if exp] [in range] , generate(newvar) { p(#) | h(#) } [ { highonly | lowonly } ]

Description

winsor takes the non-missing values of a variable x ordered such that

x_1 <= ... <= x_n

and generates a new variable y identical to x except that the h highest and h lowest values are replaced by the next value counting inwards from the extremes:

y_1, ... , y_h = y_(h + 1)

y_n, ... , y_(n - h + 1) = y_(n - h)
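
As a worked illustration (the values are invented for this example and are not taken from any dataset), take n = 10 ordered values and h = 2:

x: 1 2 3 4 5 6 7 8 9 100

y: 3 3 3 4 5 6 7 8 8 8

The two lowest values are replaced by x_3 = 3 and the two highest by x_8 = 8; the six values in between are unchanged.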

h can be specified directly or indirectly by specifying a fraction p of the number of observations n:

h = [ p n ]

where [ ] denotes the integer part, so that p n is rounded down to a whole number. This transformation is named after the biostatistician Charles P. Winsor (1895-1951): see, for example, Tukey (1962). For more discussion and references, see Barnett and Lewis (1994).

Charles (Charlie) Winsor was educated at Harvard as an engineer and then worked for the New England Telephone and Telegraph Company, but his interests shifted to biological research and biostatistics. After further study at Johns Hopkins and Harvard, he held posts at Iowa State College and Johns Hopkins; in between, in the Second World War, he did government work at Princeton.

Options

generate(newvar) specifies the name of the new variable. It is a required option.

p(#) specifies the fraction of the observations to be modified in each tail. p should be greater than 0 and less than 0.5, and should imply a value of h that satisfies the conditions given for h() just below.

h(#) specifies the number of observations to be modified in each tail. h should be at least 1 and less than half the number of non-missing observations.

Exactly one of p() and h() should be specified.
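
As a sketch of how p() translates into h, assume n = 74 non-missing observations (an illustrative figure, chosen to match the auto data shipped with Stata):

p(0.1) implies h = [ 0.1 * 74 ] = [ 7.4 ] = 7

p(0.01) implies h = [ 0.01 * 74 ] = [ 0.74 ] = 0

The first is equivalent to specifying h(7). The second does not satisfy the condition that h be at least 1, so a larger p would be needed.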

highonly and lowonly specify that Winsorizing should be one-sided, referring only to the tail with the highest values or only to the tail with the lowest values, respectively. These options should not be specified together.
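
As a worked illustration of the one-sided options (the values are invented for this example), take the ten ordered values 1 2 3 4 5 6 7 8 9 100 and h = 2:

highonly: 1 2 3 4 5 6 7 8 8 8

lowonly: 3 3 3 4 5 6 7 8 9 100

With highonly, only the two highest values are replaced (by 8); with lowonly, only the two lowest values are replaced (by 3).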

Examples

. winsor mpg, gen(Wmpg) h(3)

. winsor mpg, gen(Wmpg2) p(0.1)
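
The following further examples are a sketch, assuming (as the use of mpg above suggests) that the auto data shipped with Stata are in memory. The new variable names are arbitrary. The examples show one-sided Winsorizing and a quick comparison of the results with the original variable.

. sysuse auto, clear

. winsor mpg, gen(Wmpg_hi) p(0.05) highonly

. winsor mpg, gen(Wmpg_lo) h(3) lowonly

. summarize mpg Wmpg_hi Wmpg_lo

Comparing the minimum and maximum reported by summarize for each variable is a simple check on which tail was modified.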

References

Anonymous. 1951. In memoriam: Charles P. Winsor. Biometrics 7: 221.

Barnett, V. and Lewis, T. 1994. Outliers in statistical data. Chichester: John Wiley. [Previous editions 1978, 1984.]

Tukey, J.W. 1962. The future of data analysis. Annals of Mathematical Statistics 33: 1-67.

Author

Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk