{smcl}
{* *! NJC 10aug2016/25aug2016}{...}
{cmd:help lvalues}
{hline}

{title:Title}

{p 8 8 2}Letter value calculation


{title:Syntax}

{p 8 12 2}
{cmd:lvalues}
{varlist}
{ifin}
[
{cmd:,}
{opt a(#)}
{opt by(byvarlist)}
{c -(}
{opt gen:erate(newvarlist)}
{c |}
{opt display:only}
{c )-}
{opt l:ist}
{it:list_options}
]


{title:Description}

{pstd}
{cmd:lvalues} calculates letter values as defined by Tukey (1977) and
Hoaglin (1983) for each variable in {varlist}. By default letter values
are stored in new variables. Optionally, letter values may be displayed
only, without generation of new variables.


{title:Remarks}

{pstd}
Consider a set of {it:n} values ordered, smallest first, so that they
have ranks 1 to {it:n}. The ordered values are often called
{it:order statistics} or (particularly in statistical graphics) the
(sample) {it:quantiles}. In ranking, tied values are here assigned
distinct (unique) ranks, so each integer from 1 to {it:n} is used just
once as a rank.

{pstd}
The {it:depth} associated with rank {it:i} is the smaller of {it:i} and
{it:n} - {it:i} + 1. Hence the extremes (minimum and maximum) with ranks
1 and {it:n} both have depth 1, the second smallest and second largest
values both have depth 2, and so on. Think of depth as giving the number
of values counted inwards from the extremes.

{pstd}
The conventional rule for calculating a median can be stated in terms of
a depth (1 + {it:n})/2. If {it:n} is odd, then the result is an integer;
and if {it:n} is even, then the result is a half-integer. So if
{it:n} = 75, the depth is 38, which means that the median is the single
value which has rank 38; if {it:n} = 74, the depth is 37.5, which is
interpreted as the mean of, or midpoint between, the values with ranks
37 and 38. The median may be tagged with the letter M. The median is a
{it:letter value}, in Tukey's terminology.

{pstd}
Further letter values are calculated by extending this idea to mark
successively smaller tail fractions of a sample.
Fourths (approximate quartiles) (tagged F, say) both have a depth which
is (1 + floor(depth of median))/2; eighths (approximate octiles)
(tagged E, say) have depth (1 + floor(depth of fourths))/2; and so on.
See Hoaglin (1983) for a systematic account. In each case integer and
half-integer depths imply selecting single values and averaging adjacent
ordered values respectively.

{pstd}
Note that Tukey (1970) discussed medians M, hinges H, eighths E and in
passing sixteenths defined in this way. Tukey (1977) used further letter
values D (for sixteenths), C, B, A, Z, Y, X, and so on, as needed,
stopping when the extremes are reached at depth 1 (each is tagged 1).
These letter tags are used in the output of the {help lv} command. The
labels M, F, E are pleasantly mnemonic and those and other tags help to
simplify tabular displays. However, memorising the meanings of other
tags is harder work. Knowing or using the tags is less important than
keeping an eye on the depths, ranks and plotting positions associated
with each letter value.

{pstd}
The term {it:letter values} historically was closely tied to particular
letter value displays, which could be produced with relatively little
effort from small datasets using only sorting, averaging pairs of
numbers, and subtraction (e.g. Tukey 1977; Mosteller and Tukey 1977;
Velleman and Hoaglin 1981). {help lv} is the standard Stata
implementation. Despite the advent of larger datasets and ubiquitous
computing facilities, interest in letter values continues (e.g. Hofmann
{it:et al.} 2011). In essence, the letter values are interesting and
useful as a parsimonious but informative reduction of a sample
distribution based on order statistics (quantiles), with detail in the
tails. Hence they are pertinent to data screening and exploratory data
analysis, including determination of distribution location, scale and
shape; identification of problematic data points; and consideration of
transformations.

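{pstd}
As a rough check on the depth rule described above, the first few depths
can be worked out by hand with {cmd:display}. The sample size 74 is an
illustrative assumption only (it happens to be the number of
observations in the auto dataset). The three lines should give 37.5 for
the median, 19 for the fourths and 10 for the eighths.{p_end}

{phang2}{cmd:. display (1 + 74)/2}{p_end}
{phang2}{cmd:. display (1 + floor(37.5))/2}{p_end}
{phang2}{cmd:. display (1 + floor(19))/2}{p_end}

{pstd}
Here the half-integer depth 37.5 means averaging the values with ranks
37 and 38, while the integer depth 19 selects the single values with
ranks 19 and 74 - 19 + 1 = 56.{p_end}
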
{pstd}
By default, {cmd:lvalues} calculates new variables as follows. For every
variable in {varlist} there is a new variable containing its letter
values for the observations included in the calculation. In addition,
variables give ranks, depths and plotting positions
{bind:({it:i} - {it:a})/({it:n} - 2{it:a} + 1)} for some {it:a}. The
default variable names for {it:k} variables in {varlist} are {cmd:_lv1}
to {cmd:_lv}{it:k} and {cmd:_rank}, {cmd:_depth} and {cmd:_ppos}. If any
of those names is in use, and alternatives not in use are not suggested
through the {cmd:generate()} option, then the command will fail. Unlike
{help lv}, {cmd:lvalues} will not overwrite existing variables.

{pstd}
As no letter value necessarily corresponds uniquely to any single data
value, and as many letter values are means of (midpoints between) data
values, the values of any new variables are (contrary to usual Stata
practice) not to be considered as aligned with values of other variables
in the same observations. However, if the {cmd:by()} option is used,
values of any new variables will be placed in observations with
corresponding values of the {it:byvarlist} specified. Positively, it is
always true that letter value results are aligned with depths, ranks and
plotting positions.

{pstd}
The number of letter values for {it:n} values is
{bind:1 + 2 * ceil(log_2 {it:n})}. For {it:n} = 1, that is 1, so the
single letter value (median) is just the single data value. For
{it:n} = 2, 3, 4, 5, 6, 7 the number of letter values is 3, 5, 5, 7, 7,
7, i.e. in some cases there are more letter values than data values. For
{it:n} <= 7, {cmd:lvalues} just returns the ordered values. With that
small a sample size, looking at all the values is both feasible and
sensible.

{pstd}
Here is a handle on the number of letter values: for {it:n} = 1000,
1 million, 1 billion, there are 21, 41, 61 letter values. Note that
{help lv} will not display or save more than 21 letter values.

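{pstd}
The count of letter values and the plotting-position formula given above
can be evaluated directly as a sanity check. This is only a sketch using
{cmd:display}: base-2 logarithms are computed as ln({it:n})/ln(2), and
the sample sizes 1000 and 75 are illustrative choices. The first line
should give 21 letter values for {it:n} = 1000. The second gives the
plotting position of the median (rank 38) of 75 values with the default
{it:a} = 1/3, which should be 0.5 up to rounding.{p_end}

{phang2}{cmd:. display 1 + 2 * ceil(ln(1000)/ln(2))}{p_end}
{phang2}{cmd:. display (38 - 1/3)/(75 - 2/3 + 1)}{p_end}

{pstd}
For odd {it:n} the median's plotting position is 0.5 whatever the choice
of {it:a}, because the formula is symmetric about the middle rank.{p_end}
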
{pstd}
See also Tukey (1977) and Hoaglin (1985) for more on using letter values
in study of distributions. See Cox (2004) for discussion of related
skewness plots.


{title:Options}

{phang}
{opt a()} specifies the constant {it:a} in calculating plotting
positions. The default is 1/3, as suggested by Hoaglin (1983) in a
detailed discussion of letter values and plotting positions. A
particular advantage of this choice is that it corresponds closely to
the position of the median of the sampling distribution of each order
statistic. See also Cox (2014) on plotting positions in a Stata context.

{phang}
{opt by()} specifies one or more variables defining distinct groups for
which letter values are to be calculated separately.

{phang}
{opt generate()} specifies new variable names as alternatives to the
default, up to as many as the number of variables plus 3. If fewer
variable names are suggested, as many of {cmd:_lv1} up, {cmd:_rank},
{cmd:_depth} and {cmd:_ppos} are used as needed, but those default names
used must still be new in the dataset.

{phang}
{opt displayonly} specifies display of the letter values only, with no
generation of new variables. Here "display" implies {cmd:list}, as just
below.

{phang}
{opt list} specifies that the letter values be listed. {cmd:list} may be
specified by itself or together with options of {help list}. The default
options include {cmd:sep(0) noobs} or (with the {cmd:by()} option)
{cmd:sepby(}{it:byvar}{cmd:) noobs}. Plotting positions are shown to 3
decimal places (but stored as {cmd:double} variables).


{title:Examples}

{pstd}Setup{p_end}
{phang2}{cmd:. sysuse auto}

{pstd}Calculate letter values for {cmd:mpg}{p_end}
{pstd}(default variable names {cmd:_lv1}, {cmd:_rank}, {cmd:_depth}, {cmd:_ppos}):{p_end}
{phang2}{cmd:. lvalues mpg}

{pstd}Calculate letter values for {cmd:mpg} with new names:{p_end}
{phang2}{cmd:. lvalues mpg, generate(lv_mpg rank depth ppos)}

{pstd}Calculate letter values for {cmd:mpg} with new names, separately by groups of {cmd:foreign}:{p_end}
{phang2}{cmd:. lvalues mpg, generate(lv_mpgf rankf depthf pposf) by(foreign)}

{pstd}Different variables and different groups at once; other options:{p_end}
{phang2}{cmd:. sysuse auto, clear}{p_end}
{phang2}{cmd:. lvalues mpg weight, gen(lv_mpg lv_weight rank depth ppos) by(foreign) list a(0.5)}

{pstd}Display only:{p_end}
{phang2}{cmd:. lvalues headroom trunk weight length displacement, displayonly}


{title:Author}

{pstd}Nicholas J. Cox, Durham University{break}
n.j.cox@durham.ac.uk


{title:Acknowledgment}

{pstd}David Hoaglin rekindled my interest in letter values by a comment
at the Chicago Stata Conference in 2016 and provided helpful
encouragement thereafter.


{title:References}

{phang}
Cox, N. J. 2004.
Graphing distributions.
{it:Stata Journal} 4: 66{c -}88.
{browse "http://www.stata-journal.com/sjpdf.html?articlenum=gr0003":http://www.stata-journal.com/sjpdf.html?articlenum=gr0003}

{phang}
Cox, N. J. 2014.
Calculating percentile ranks or plotting positions.
{browse "http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/":http://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/}

{phang}
Hoaglin, D. C. 1983.
Letter values: A set of selected order statistics.
In {it:Understanding Robust and Exploratory Data Analysis},
ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 33{c -}57.
New York: John Wiley.

{phang}
Hoaglin, D. C. 1985.
Using quantiles to study shape.
In {it:Exploring Data Tables, Trends, and Shapes},
ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, 417{c -}460.
New York: John Wiley.

{phang}
Hofmann, H., K. Kafadar, and H. Wickham. 2011.
Letter-value plots: Boxplots for large data.
{browse "http://vita.had.co.nz/papers/letter-value-plot.pdf":http://vita.had.co.nz/papers/letter-value-plot.pdf}

{phang}
Mosteller, F. and J. W. Tukey. 1977.
{it:Data Analysis and Regression}.
Reading, MA: Addison-Wesley.

{phang}
Tukey, J. W. 1970.
{it:Exploratory data analysis. Limited Preliminary Edition. Volume I.}
Reading, MA: Addison-Wesley.

{phang}
Tukey, J. W. 1977.
{it:Exploratory Data Analysis}.
Reading, MA: Addison-Wesley.

{phang}
Velleman, P. F. and D. C. Hoaglin. 1981.
{it:Applications, Basics, and Computing of Exploratory Data Analysis.}
Boston: Duxbury.
{browse "https://ecommons.cornell.edu/retrieve/91/A-B-C_of_EDA_040127.pdf":https://ecommons.cornell.edu/retrieve/91/A-B-C_of_EDA_040127.pdf}


{title:Also see}

{psee}
Manual: {manlink R lv}

{psee}
{space 2}Help: {manhelp diagnostic_plots R:diagnostic plots},
{manhelp stem R},
{manhelp summarize R},
{help qplot} (if installed),
{help skewplot} (if installed),
{help stripplot} (if installed)
{p_end}