{smcl}
{* 18dec2022}{...}
{hline}
help for {hi:find_denom}
{hline}
{title:Finding the denominator: minimum sample size from percentages}
{p 8 8 2}{cmd:find_denom} #1 [#2 ...] {cmd:,} {opt eps:ilon(precision)}
{title:Description}
{pstd}
{cmd:find_denom} reports minimum sample size and minimum frequencies
given one or more percentages rounded to some precision or resolution.
{title:Options}
{pstd}
{cmd:epsilon()} is a required option indicating half the perceived
precision or resolution. Thus if percentages are rounded to integers,
specify {cmd:epsilon(0.5)}; if rounded to 1 decimal place, specify
{cmd:epsilon(0.05)}. The thinking is that a report of # means that the
true value is brtween # - epsilon and # + epsilon.
{title:Remarks}
{pstd}
An old joke with many variants has the following flavour. A naive
researcher is reporting on a rather small project: 33% of the sample
said A, 33% said B, but the other person refused to answer. It is
immediate that the sample size is 3. Only a twist more challenging: What
denominator or sample size underlies a percentage breakdown of 40, 40,
20? That breakdown is consistent with a sample size of 5, with 2, 2, 1
as class frequencies. It is also consistent with any multiple of 5 and,
dependent on amount of rounding, reportably consistent with other
percentage breakdowns too. Thus 2001, 1999, 1000 is exactly 40.02,
39.98, 20.00 as a percentage breakdown and so rounds to 40.0, 40.0, 20.0
to 1 decimal place, as would 2002, 1998 and 1000, and as would many
other possibilities.
{pstd}
Every researcher should know that sample size should always be reported.
Every researcher with any experience knows that does not always happen,
and the culprits are not confined to advertising, journalism, or
politics. Having flagged that this is an ethical issue, we now
concentrate on the technicalities of trying to guess the minimum sample
size consistent with a reported percentage breakdown. We assume honest
and accurate reporting, other than the sample size being suppressed.
{pstd}
The problem was discussed by Wallis and Roberts (1956, pp.185{c -}189)
(hereafter WR) and in much more technical detail by Becker, Chambers,
and Wilks (1988) (hereafter BCW). Two ideas arise immediately. First, a
complete set of percentages is not needed to say something about minimum
sample size. Thus a single percentage reported as 33% implies that the
sample size cannot be 2 and must be at least 3. Second, the smallest
percentage reported, or if smaller the smallest positive difference
between two percentages reported, gives another handle on the minimum
sample size. Thus with a percentage breakdown of 40, 30, 30, the
smallest positive difference is 10 and equivalently 100/10 = 10 is the
minimum sample size.
{pstd}
WR (p.186) report a fictitious percentage breakdown
{space 4}23.1
{space 4}15.4
{space 4}30.8
{space 4}19.2
{space 4} 7.7
{space 4} 3.8
{pstd}-- from which both the smallest percentage and the smallest positive
difference are 3.8, suggesting a minimum sample size of 100/3.8, which
rounds as an integer to 26. The implied frequencies are thus
{space 4}6
{space 4}4
{space 4}8
{space 4}5
{space 4}2
{space 4}1
{pstd}WR (1956, pp.187{c -}188) report percentage breakdowns of movie
ratings from {it:Consumer Reports} August 1949, p.383. The categories
are in turn percentages reporting Excellent, Good, Fair, and Poor. Some
examples are
{space 4}Alias Nick Beal 6 27 47 20
{space 4}Bride of Vengeance 11 22 56 11
{pstd}BCW (p.272) report these percentages for considering vendor for
1986 from a personal computer magazine:
{space 4}Ours 14.6
{space 4}A 12.2
{space 4}B 12.2
{space 4}C 7.3
{space 4}D 7.3
{pstd}They report an algorithm and S code with this recipe for
proportions (my wording). The idea is just to bump up the sample size
until implied percentages are all consistent with the stated precision.
{space 4}f <- vector of proportions
{space 4}eps <- precision
{space 4}n <- 1
{space 4}repeat {c -(}
{space 4}i <- f * n rounded to integers
{space 8}if each i is in [(# - eps) * this i, (# + eps) * this i]
{space 12}break with result
{space 8}n <- n + 1
{space 4}{c )-}
{pstd}It is this algorithm, translated from S to Stata, but adapted for
percentage input, that is implemented here.
{pstd}BCS (pp.274{c -}277) further discuss speeding-up computations
and allowing a certain number of outliers, in essence percentages that do
not fit, say because they were reported incorrectly. These elaborations
are not implemented here, but should be of interest for a deeper study.
{pstd}
On the problem of how often rounded percentages sum to exactly 100, see
Mosteller, Youtz, and Zahn (1967) and Diaconis and Freedman (1979).
{title:Examples}
{p 4 8 2}{cmd:. find_denom 23.1 15.4 30.8 19.2 7.7 3.8, eps(0.05)}{p_end}
{p 4 8 2}{cmd:. find_denom 6 27 47 20, eps(0.5)}{p_end}
{p 4 8 2}{cmd:. find_denom 11 22 56 11, eps(0.5)}{p_end}
{p 4 8 2}{cmd:. find_denom 14.6 12.2 12.2 7.3 7.3, eps(0.05)}{p_end}
{title:Author}
{p 4 4 2}Nicholas J. Cox, Durham University{break}
n.j.cox@durham.ac.uk
{title:References}
{phang}
Becker, R. A., J. M. Chambers and A. R. Wilks. 1988.
{it:The New S Language: A Programming Environment for Data Analysis and Graphics.}
Pacific Grove, CA: Wadsworth & Brooks-Cole.
{phang}
Diaconis, P. and D. Freedman.
1979.
On rounding percentages.
{it:Journal of the American Statistical Association} 74: 359{c -}364.
{phang}
Mosteller, F., C. Youtz and D. Zahn. 1967.
The distribution of sums of rounded percentages.
{it:Demography} 4: 850{c -}858.
Reprinted in Fienberg, S. E. and D. C. Hoaglin (eds) 2006.
{it:Selected Papers of Frederick Mosteller}. New York: Springer, 399{c -}411.
{phang}
Wallis, W. A. and H. V. Roberts. 1956.
{it:Statistics: A New Approach.}
Glencoe, IL: Free Press.