{smcl}
{* *! version 1.3.1 21January2015 Dirk Enzmann}{...}
{hi:help divcat}
{hline}
{title:Title}
{pstd}{hi:divcat} {hline 2} calculates five measures of diversity for multiple categories: Generalized
variance (GV), entropy (H), its normalized counterparts (NGV, NH), and polarization (RQ).
{title:Syntax}
{p 8 15 2}
{cmd:divcat} {varname} {ifin} {weight} [{cmd:,} {it:options} ]
{synoptset 20 tabbed}{...}
{synopthdr:options}
{synoptline}
{synopt :{opt t:ableout}}display a frequency table of {cmd:{it:varname}}
{p_end}
{synopt :{opt b:ase(#)}}use the logarithm to base # when calculating entropy (H) (default: 2)
{p_end}
{synopt :{opt nol:abel}}omit labels of subgroups specified using the {help by} prefix
{p_end}
{synopt :{opt g:v}}show the generalized variance (GV) and its normalized counterpart (NGV) in a separate table (default: show
all diversity measures in a common table)
{p_end}
{synopt :{opt e:ntropy}}show the entropy (H) and its normalized counterpart (NH) in a separate table (default: show
all diversity measures in a common table)
{p_end}
{synopt :{opt r:q}}show the polarization measure (RQ) in a separate table (default: show
all diversity measures in a common table)
{p_end}
{synopt :{opt nod:etail}}omit common table of all diversity measures
{p_end}
{synopt :{opt gen_gv(newvar)}}generate a new variable {cmd:{it:newvar}} containing the generalized variance (GV)
{p_end}
{synopt :{opt gen_ngv(newvar)}}generate a new variable {cmd:{it:newvar}} containing the normalized generalized variance (NGV)
{p_end}
{synopt :{opt gen_h(newvar)}}generate a new variable {cmd:{it:newvar}} containing the entropy (H)
{p_end}
{synopt :{opt gen_nh(newvar)}}generate a new variable {cmd:{it:newvar}} containing the normalized entropy (NH)
{p_end}
{synopt :{opt gen_rq(newvar)}}generate a new variable {cmd:{it:newvar}} containing the polarization measure (RQ)
{p_end}
{synopt :{opt replace}}replace the contents of {cmd:{it:newvar}} if {cmd:{it:newvar}} exists already
{p_end}
{synoptline}
{p2colreset}{...}
{pstd}
{hi:by} is allowed (see {help by});{p_end}
{pstd}
{opt aweight}s, {opt fweight}s, and {opt iweight}s are allowed (see {help weight}).{p_end}
{title:Description}
{pstd} {cmd:divcat} calculates five measures of diversity of a categorical
variable (i.e. for multiple categories): generalized variance (GV), entropy
(H), its normalized counterparts (NGV and NH, resp.), and polarization
(RQ). The formulas are given in Budescu and Budescu (2012) (GV, NGV, H, and
NH) and in Montalvo & Reynal-Querol (2002, 2008) (RQ). {cmd:divcat} allows to
generate new variables containing these measures, which is especially useful
when calculating diversity measures separately for subgroups (specified using
the {help by} prefix).
{pstd} There is a large variety of diversity measures. Related concepts are
variance, heterogeneity, inequality, entropy, or concentration. Depending
on the field of study, the same measures are known under different
names:
{p 4 6 2} - The {bf:GV} is also known as the Blau Index (Blau, 1977) or the Hirschman-Herfindahl
Index (HHI) (Hirschman, 1945; Herfindahl, 1950), although some equate
the HHI with the Simpson Index (SI) (Simpson, 1949), whereas GV is actually
1-SI (compare -ineq-: {net "describe ineq, from(http://fmwww.bc.edu/RePEc/bocode/i)":click here}). The
GV can be interpreted as the probability that two randomly paired members of
a population belong to two different subgroups.{p_end}
{p 4 6 2} - The {bf:NGV} is the normalized GV bounded by 0 and 1. Another name for the NGV is
the Index of Qualitative Variation (IQV) (Mueller, Schuessler, & Costner, 1970). The
index can be interpreted as the proportion of the observed variation to the
maximum possible variation. The normalizing transforms GV to a relative measure allowing
comparisons with diversity measures from studies with a different number of categories because
its size does not depend on the number of categories (as in the case of GV).{p_end}
{p 4 6 2} - The entropy measure {bf:H} shares many properties of GV (the abbreviation actually
stands for the Greek letter Eta); an attractive property is its additivity (see Budescu &
Budescu, 2012). Early on, it has been described by Shannon (1948); its formulas differ as to
the base of the logarithm used: The Shannon formula uses the logarithm to base {it:e} (see
also -ineq-: {net "describe ineq, from(http://fmwww.bc.edu/RePEc/bocode/i)":click here}), whereas
Budescu and Budescu (2012) are using the logarithm to base 2 (which is the default
of {cmd:divcat}). In the case of two groups, when using the logarithm to the base 2, H is
equal to its normalized counterpart NH.{p_end}
{p 4 6 2} - The {bf:NH} is the normalized H bounded by 0 and 1. As in the case of GV and its
normalized counterpart NGV, NH is a relative measure: its size no longer depends on
the number of categories (as in the case of H).{p_end}
{p 4 6 2} - {bf:RQ} differs from the previous measures as it is a polarization
measure for discrete variables (see Montalvo & Reynal-Querol, 2002, 2008). As do the
normalized measures, it is bounded between 0 and 1. However, in contrast to the other
measures of diversity that reach a maximum if the cases are distributed equally across all categories (or
if all groups are of the same size), RQ reaches a maximum if there are two (large) groups of equal size (and
all other groups are small). This makes RQ attractive for the study of social conflicts.{p_end}
{pstd} A helpful discussion of the properties of the "fractionalization" measures GV and H as well as
its normalized counterparts can be found in Budescu and Budescu (2012).
{title:Options}
{dlgtab:Main}
{phang}
{opt t:ableout} displays a frequency table of the categorical variable {cmd:{it:varname}}
specified with {cmd:divcat}. If subgroups are specified using the {help by} prefix, a frequency
table for each subgroup will be produced.
{phang}
{opt b:ase(#)} sets the base of the logarithm used when calculating entropy (H) (default: 2). Possible
alternatives are base 10 or the natural logarithm (base {it:e}). Note that the latter
can be specified using the option {opt base(e)}.
{phang}
{opt nol:abel} omits labels of subgroups specified using the {help by} prefix. Note that the maximum
width of row labels of the results table is 32 characters. {opt nol:abel} can help to make the values
separating the subgroups visible in spite of this space restriction.
{phang}
{opt g:v} shows the generalized variance (GV) and its normalized counterpart (NGV) in a separate
table (default: show all diversity measures in a common table)
{phang}
{opt e:ntropy} shows the entropy (H) and its normalized counterpart (NH) in a separate
table (default: show all diversity measures in a common table)
{phang}
{opt r:q} shows the polarization measure (RQ) in a separate table (default: show all diversity
measures in a common table)
{phang}
{opt nod:etail} suppresses the common table of all diversity measures (default: show all diversity
measures in a common table)
{phang}
{opt gen_gv(newvar)} generates a new variable {cmd:{it:newvar}} containing the generalized variance (GV)
{phang}
{opt gen_ngv(newvar)} generates a new variable {cmd:{it:newvar}} containing the normalized generalized
variance (NGV)
{phang}
{opt gen_h(newvar)} generates a new variable {cmd:{it:newvar}} containing the entropy (H)
{phang}
{opt gen_nh(newvar)} generates a new variable {cmd:{it:newvar}} containing the normalized entropy (NH)
{phang}
{opt gen_rq(newvar)} generates a new variable {cmd:{it:newvar}} containing the polarization measure (RQ)
{phang}
{opt replace} replaces the variable specified by {opt gen_gv()}, {opt gen_ngv()}, {opt gen_h()},
{opt gen_nh()}, or {opt gen_rq()} if {cmd:{it:newvar}} exists already.
{title:Examples}
{pstd} Example 1 shows how to calculate the diversity of "rep78" over the subgroups of "foreign", and
how to save GV into the new variable "gv" (to replicate example 1, copy and paste the two command lines
into Stata's command window):
{cmd:sysuse auto, clear}
{cmd:bys foreign: divcat rep78, gen_gv(gv)}
{pstd} Examples 2 and 3 demonstrate that the "fractionalization" measures (GV to NH) reach a maximum
if all cases are distributed equally across {it:all} categories of "cat" (first set of input data), whereas the
polarization measure (RQ) moves towards a maximum if the majority of cases is distributed equally
across only {it:two} categories of "cat" (second set of input data) (to replicate the examples, copy and paste the
command lines into Stata's command window):
{cmd:clear}
{cmd:input cat cases}
1 33
2 33
3 34
{cmd:end}
{cmd:divcat cat [fw = cases], t base(e)}
{cmd:clear}
{cmd:input cat cases}
1 48
2 48
3 4
{cmd:end}
{cmd:divcat cat [fw = cases], t base(e)}
{title:Saved Results}
{pstd} {cmd:divcat} saves the following in {cmd:r()}: {p_end}
{synoptset 14 tabbed}{...}
{p2col 5 14 18 2: Scalars}{p_end}
{synopt:{cmd:r(N_total)}}total number of cases{p_end}
{synopt:{cmd:r(bygroups)}}number of groups defined by the variables specified with prefix {help by}{p_end}
{synopt:{cmd:r(categs)}}number of categories of {cmd:{it:varname}} (of last {help by} group){p_end}
{synopt:{cmd:r(N)}}number of cases (of last {help by} group){p_end}
{synopt:{cmd:r(GV)}}generalized variance GV (of last {help by} group){p_end}
{synopt:{cmd:r(NGV)}}normalized generalized variance NGV (of last {help by} group){p_end}
{synopt:{cmd:r(H)}}entropy H (of last {help by} group){p_end}
{synopt:{cmd:r(NH)}}normalized entropy NH (of last {help by} group){p_end}
{synopt:{cmd:r(RQ)}}polarization measure RQ (of last {help by} group){p_end}
{synoptset 14 tabbed}{...}
{p2col 5 14 18 2: Macros}{p_end}
{synopt:{cmd:r(base)}}base of logarithm used to calculate the entropy H{p_end}
{synopt:{cmd:r(by)}}group variables specified using the {help by} prefix (if used){p_end}
{synopt:{cmd:r(wgt)}}weights (if used){p_end}
{synoptset 14 tabbed}{...}
{p2col 5 14 18 2: Matrices}{p_end}
{synopt:{cmd:r(div)}}matrix of diversity measures (over {help by} groups){p_end}
{title:References}
{p 4 7 2}Blau, P. M. (1977). {it:Inequality and Heterogeneity}. New York:
Free Press.{p_end}
{p 4 7 2}Budescu, D. V. & Budescu,
M. (2012). {browse "http://psycnet.apa.org/journals/met/17/2/215/":How to measure diversity when you must}.
{it:Psychological Methods}, {it:17}, 215-227.{p_end}
{p 4 7 2}Herfindahl, O. C. (1950). {it:Concentration in the U.S. Steel Industry} (unpublished
doctoral dissertation). New York, NY: Columbia University.{p_end}
{p 4 7 2}Hirschman, A. O. (1945). {it:National Power and the Structure of Foreign Trade.} Berkeley,
CA: University of California Press.{p_end}
{p 4 7 2}Montalvo, J. G. & Reynal-Querol,
M. (2002). {it:Why ethnic fractionalization? Polarization, ethnic conflict, and growth}
(mimeo). [URL: {browse "https://ideas.repec.org/p/upf/upfgen/660.html":https://ideas.repec.org/p/upf/upfgen/660.html}].{p_end}
{p 4 7 2}Montalvo, J. G. & Reynal-Querol, M. (2008). {browse "http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0297.2008.02193.x/abstract":Discrete polarization with an application to the determinants of genocides}. {it:The Economic Journal},
{it:118}, 1835-1865.{p_end}
{p 4 7 2}Mueller, J. H., Schuessler, H. L., & Costner, H. L. (1970). {it:Statistical Reasoning in Sociology}. Boston: Houghton
Mifflin.{p_end}
{p 4 7 2}Shannon, C. (1948). A mathematical theory of communications. {it:Bell System Technical Journal}, {it:27},
397-423, 623-656.{p_end}
{p 4 7 2}Simpson, E. H. (1949). Measurement of diversity. {it:Nature}, {it:163}, 688.{p_end}
{title:Author}
{phang}Dirk Enzmann{p_end}
{phang}Institute of Criminal Sciences, Hamburg{p_end}
{phang}email: {browse "mailto:dirk.enzmann@uni-hamburg.de"}{p_end}