-------------------------------------------------------------------------------
help for skewplot
-------------------------------------------------------------------------------

Skewness plots 

        skewplot varname [if exp] [in range] [, skew by(byvar) missing
                 scatter_options]

        skewplot varlist [if exp] [in range] [, skew scatter_options]


Description

    By default, skewplot plots the midsummary versus the spread for each
    variable in varlist; this is known as the mid versus spread plot.
    With the skew option, it produces a plot of the skewness function versus
    the spread function. Such plots convey both the general character and the
    fine structure of the symmetry or skewness of data sets, and can be used
    to compare distributions or to assess whether transformations are
    necessary or effective.


Remarks 

    Order n data values for a variable x and label them such that x_(1) <=
    ... <= x_(n). In a perfectly symmetric set of data, the midsummaries

        (x_(1) + x_(n)) / 2, 
        (x_(2) + x_(n - 1)) / 2, 
        etc. 

    would all be identical, and equal to the median. A plot of each
    midsummary

        (x_(i) + x_(n - i + 1)) / 2

    versus each difference or spread or quasi-range

        x_(n - i + 1) - x_(i) 

    would yield a horizontal straight line. Conversely, skewness in sets of
    data will be reflected by departures from horizontality.
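
    As an illustration, these quantities can be computed by hand. What
    follows is a minimal sketch, not skewplot's own code; the five data
    values and the variable names x, mid and spread are made up for the
    example.

```stata
* Sketch only: computes the plotted quantities by hand, not via skewplot.
* The data and the names x, mid and spread are illustrative assumptions.
clear
input x
1
2
4
8
16
end
sort x
* pair the i-th smallest value with the i-th largest
gen double mid    = (x + x[_N - _n + 1]) / 2
gen double spread = x[_N - _n + 1] - x
* keep to the half of the results with spread >= 0
list x mid spread if spread >= 0
```

    For these values the median is 4 and the midsummaries are 8.5, 5 and 4
    at spreads 15, 6 and 0. The midsummary rises with the spread, as
    expected for right-skewed data. Typing scatter mid spread if spread >= 0
    afterwards would draw the corresponding graph.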

    Apart from the divisor of 2, this plot was suggested by J.W. Tukey (Wilk
    and Gnanadesikan 1968). See also Gnanadesikan (1977 or 1997, Ch.6.2) or
    Fisher (1983). The form used here and the name `mid versus spread plot'
    are found in Hoaglin (1985). It is usual to plot only that half of the
    sample results for which spread is >= 0.

    The skew option produces an alternative form promoted by Benjamini and
    Krieger (1996, 1999). The identity

        x_(n - i + 1) = median  

                      + (x_(n - i + 1) - x_(i)) / 2 

                      + (x_(i) + x_(n - i + 1) - 2 * median) / 2 

                      = median + spread function + skewness function    

    for x_(i) in the lower half of the sample leads to a plot of the skewness
    function versus the spread function, known as the skewness versus spread
    plot. Note that the skewness function is midsummary - median, and will be
    constant and zero for a perfectly symmetric distribution, and that the
    spread function is half the spread of the mid versus spread plot.
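
    The identity can be checked numerically. The following is a sketch
    under assumptions: the data values, the scalar med and the variable
    names upper, spreadfn and skewfn are invented for illustration and are
    not part of skewplot.

```stata
* Sketch only: verifies the identity by hand on made-up data.
clear
input x
1
2
4
8
16
end
sort x
quietly summarize x, detail
scalar med = r(p50)
* upper is the value paired with x, i.e. x_(n - i + 1)
gen double upper    = x[_N - _n + 1]
gen double spreadfn = (upper - x) / 2
gen double skewfn   = (x + upper - 2 * scalar(med)) / 2
* upper value = median + spread function + skewness function
assert upper == scalar(med) + spreadfn + skewfn
list x upper spreadfn skewfn if spreadfn >= 0
```

    With these values the skewness function is 4.5, 1 and 0 at spread
    function values 7.5, 3 and 0. It is not constant at zero, so the data
    are not symmetric.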

    In addition, the ratio of the skewness and spread functions or

        x_(i) + x_(n - i + 1) - 2 * median
        ----------------------------------                        
              x_(n - i + 1) - x_(i)

    is a measure of skewness (in the traditional sense) originally suggested
    for quartiles by Bowley (1902) and generalised to this form by David and
    Johnson (1956). It varies between -1 and 1. A similar general measure was
    used by Parzen (1979). On the skewness versus spread plot, this measure
    is the slope of the line joining the origin (0,0) to each data point.
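
    The ratio can likewise be computed directly. This is a minimal sketch
    on made-up data; the names med, upper and ratio are assumptions for
    the example. The ratio is left missing where the spread is not
    positive.

```stata
* Sketch only: the skewness-to-spread ratio for each pair of order statistics.
clear
input x
1
2
4
8
16
end
sort x
quietly summarize x, detail
scalar med = r(p50)
* upper is the value paired with x, i.e. x_(n - i + 1)
gen double upper = x[_N - _n + 1]
gen double ratio = (x + upper - 2 * scalar(med)) / (upper - x) if upper > x
* the measure should lie between -1 and 1 wherever it is defined
assert inrange(ratio, -1, 1) if !missing(ratio)
list x upper ratio if upper > x
```

    Here the ratio is 0.6 for the pair (1, 16) and 1/3 for the pair (2, 8).
    Both are positive, in line with the right skewness of these values.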

    See Benjamini and Krieger (1996, 1999) and Groeneveld (1998) for concise
    reviews tracing such ideas from late 19th century antecedents to recent
    work, and for further details on the interpretation of the skewness
    versus spread plot.


Options 

    skew specifies the skewness versus spread plot, not the default mid
        versus spread plot.

    by(byvar) specifies that calculations are to be carried out separately
        for each group defined by byvar. by() is allowed only with a single
        varname.

    missing, used only with by(), permits the use of non-missing values of
        varname corresponding to missing values for the variable named by
        by(). The default is to ignore such values.

    scatter_options refers to options of graph twoway scatter.


Examples

    . webuse citytemp
    . describe
    . skewplot *dd
    . skewplot *dd, skew
    . skewplot cooldd, by(region)
    . skewplot cooldd, by(region) ms(i i i i) c(l l l l)
    . skewplot temp*


References 

    Benjamini, Y. and Krieger, A.M. 1996. Concepts and measures for skewness
        with data-analytic implications. Canadian Journal of Statistics 24:
        131-140.

    Benjamini, Y. and Krieger, A.M. 1999. Skewness - concepts and measures.
        In Kotz, S., Read, C.B. and Banks, D.L. (eds) Encyclopedia of
        Statistical Sciences Update Volume 3. New York: John Wiley, 663-670.

    Bowley, A.L. 1902. Elements of statistics. London: P.S. King.  (2nd
        edition: see p.331.)

    David, F.N. and Johnson, N.L. 1956. Some tests of significance with
        ordered variables. Journal, Royal Statistical Society B 18: 1-20.

    Fisher, N.I. 1983. Graphical methods in nonparametric statistics: a
        review and annotated bibliography. International Statistical Review
        51: 25-58.

    Gnanadesikan, R. 1977 (2nd edition 1997).  Methods for statistical data
        analysis of multivariate observations.  New York: John Wiley.

    Groeneveld, R. 1998. Skewness, Bowley's measures of. In Kotz, S., Read,
        C.B. and Banks, D.L. (eds) Encyclopedia of Statistical Sciences
        Update Volume 2. New York: John Wiley, 619-621.

    Hoaglin, D.C. 1985. Using quantiles to study shape. In Hoaglin, D.C.,
        Mosteller, F. and Tukey, J.W. (eds) Exploring data tables, trends,
        and shapes. New York: John Wiley, 417-460.

    Parzen, E. 1979. Nonparametric statistical data modeling. Journal,
        American Statistical Association 74: 105-131.

    Wilk, M.B. and Gnanadesikan, R. 1968. Probability plotting methods for
        the analysis of data. Biometrika 55: 1-17.


Author

    Nicholas J. Cox, University of Durham
    n.j.cox@durham.ac.uk


Acknowledgments 

    Richard Groeneveld tracked down the Bowley reference.


Also see 

    On-line: graph, symplot
    Manual: [G] graph, [R] diagnostic plots