{smcl}
{* *! version 2.12 12-Jun-20233, Dirk Enzmann}{...}
{hi:help nb_adjust}
{hline}
{title:Title}
{pstd}{hi:nb_adjust} {hline 2} adjust or remove outliers of a variable assumed
to have a negative binomial distribution
{title:Syntax}
{p 8 15 2}
{cmd:nb_adjust} {varname} {ifin} [{cmd:,} {it:options} ]
{synoptset 20 tabbed}{...}
{synopthdr:options}
{synoptline}
{synopt :{opth g:enerate(newvar)}}generate {cmd:{it:newvar}} with outliers
adjusted or removed
{p_end}
{synopt :{opt sm:all(#)}}smallest value to define outliers (default: 0)
{p_end}
{synopt :{opt la:rge(#)}}value assumed to be large and no outlier (default: 0)
{p_end}
{synopt :{opt th:reshold(#)}}fix threshold to define outliers (default: not
fixed)
{p_end}
{synopt :{opt li:mit(#)}}values beyond {cmd:limit} are extremes and will be
removed (default: none)
{p_end}
{synopt :{opt seed(#)}}initial value of random-number {help seed} (default: not
set)
{p_end}
{synopt :{opt rep:licates(#)}}number of replicates of random numbers (default: 250)
{p_end}
{synopt :{opt cen:sor}}censor outliers instead of adjustment (default: adjust)
{p_end}
{synopt :{opt rem:ove}}remove outliers instead of adjustment (default: adjust)
{p_end}
{synopt :{opt nod:etail}}suppress details (default: show details)
{p_end}
{synopt :{opt replace}}replace contents of {cmd:{it:newvar}} if {cmd:{it:newvar}}
exists already
{p_end}
{synoptline}
{pstd}
{hi:by} is allowed (see {help by})
{title:Description}
{pstd} {cmd:nb_adjust} identifies and adjusts (or removes) outliers
of {cmd:{it:varname}} assuming that the values of {cmd:{it:varname}} have a
negative binomial distribution. Per default a value is defined as an outlier
if its expected frequency is less than 0.5 (rule-based outlier definition).
{pstd} {cmd: nb_ajdust} calculates the threshold to define an outlier by
estimating the mean and overdispersion parameter of {cmd:{it:varname}} and by
using the parameters {cmd:mu} = exp(_b[_cons]) and {cmd:size} = mu/e(delta)
obtained by -{help nbreg} {cmd:{it:varname}}, dispersion(constant)- as follows:
{pmore} {cmd:threshold} = max({cmd:counts}) of {cmd:counts} such that
{pmore} {help round}({cmd:n} * {help nbinomialp}({cmd:size},{cmd:counts},{cmd:prob})) > 0
{pstd} with {cmd:counts} = 0..max({cmd:{it:varname}}), {cmd:n} = sample size,
and {cmd:prob} = {cmd:size}/({cmd:size} + {cmd:mu}).
{pstd} Alternatively, the user can fix the threshold defining an outlier by
using the option {opt th:reshold(#)}.
{pstd} Per default outliers are values greater 0 which are greater than the
rule-based or user-fixed threshold. When using the rule-based definition of
outliers it is possible that values will be defined as outliers which the user
nevertheless wants to treat as "normal". By using the option {opt sm:all(#)}
(default: 0) the user can restrict the value to define outliers to a
minimum: When using this option, only values greater than the maximum of
({cmd:small}, {cmd:threshold}) are defined as outliers.
{pstd} {cmd:nb_adjust} adjusts outliers by replacing its values by random draws
from a negative binomial distribution with parameters {cmd:mu} and {cmd:size} as
estimated using the original values of {cmd:{it:varname}}. Replacement values
are ordered by size such as to preserve the rank order of outlying cases. To
make sure that the randomly drawn values have a minimal size, the user can apply
the option {opt la:rge(#)} (default: 0): Using this option, the lower bound of
replacement values is the minimum of ({cmd:large}, {cmd:threshold}) (i.e. in
cases of {cmd:threshold} < {cmd:large} the minimum of replacement values is
{cmd:threshold} instead of {cmd:large}). Note, however, that the absolute
minimum of replacement values is always {cmd:small}. The upper bound of
replacement values is the original value.
{pstd} The adjustment of outliers and the rule-based definition of the threshold
defining outliers are based on the observed values of {cmd:{it:varname}}. Using
the option {opt li:mit(#)}, extreme values greater than {cmd:limit} can be
eliminated from the observed values before estimating the mean and overdispersion
parameter of {cmd:{it:varname}}. Excluding extremes by using the option
{opt li:mit(#)}, only the remaining values are used to define the rule-based
outlier threshold and to define the negative binomial distribution from which to
draw replacement values of outliers. If the option {opth g:enerate(newvar)} is
used, values of {cmd:{it:varname}} > {cmd:limit} will be set to the extended
{help missing} value .r in variable {cmd:{it:newvar}}.
{pstd} To allow a replication of the random draws, the user can set the initial
value of the random-number {help seed} to any number between between 0
and 2^31-1 (2,147,483,647) by using the option {opt seed(#)}.
{pstd} Per default the random draws for the adjustment of outliers will be
replicated 250 times and averages over the replications rounded to the next
integer will be used as replacement values. The option {opt rep:licates(#)} allows
to specify the number of replicates (minimum 0 = no replicates).
{pstd} Instead of adjusting outlying values, outliers can be censored (i.e. set
to the outlier threshold) using the option {opt cen:sor} or removed (i.e. set to
the extended {help missing} value .o in variable {cmd:{it:newvar}}) by using the
option {opt rem:ove}. In this case, the option {opt la:rge(#)} will have no
effect. Note, however, that values to be censored or removed must always
be > {cmd:small}.
{pstd} An example showing how to apply outlier detection and outlier adjustment of
negative binomial distributed counts for the estimation of rates of reporting
criminal victimizations to the police is shown in Section 3 of Enzmann ({browse "http://dx.doi.org/10.13140/RG.2.2.20133.68328":2023}).
{title:Options}
{dlgtab:Main}
{phang}
{opth g:enerate(newvar)} specifies the name {cmd:{it:newvar}} of the variable
with outliers of {cmd:{it:varname}} adjusted or removed.
{phang}
{opt sm:all(#)} restricts the value to define outliers to a minimum
(default: 0). If the rule-based outlier threshold is less than {cmd:small}, only
values > {cmd:small} will be treated as outliers. Note that the user-fixed
outlier threshold must not be less than {cmd:small}.
{phang}
{opt la:rge(#)} makes sure that the randomly drawn values to replace outliers
have a minimal size. Using this option the lower bound of replacement values is
the minimum of ({cmd:large}, {cmd:threshold}). In cases of {cmd:threshold} <
{cmd:large} the minimal size of replacement values is {cmd:threshold} instead of
{cmd:large}.
{phang}
{opt th:reshold(#)} sets the threshold of outliers to a user-fixed value (default:
rule-based threshold). Using this option overrides the default of {cmd:nb_adjust}
which determines the threshold of outliers such that the expected frequency of an
outlying value is less than 0.5. Note that the user-fixed value of {cmd:threshold}
must not be smaller than the value specified with the option {opt sm:all(#)}.
{phang}
{opt li:mit(#)} serves to exclude extreme values > {cmd:limit} when determining
the mean and overdispersion parameter of {cmd:{it:varname}}. Values > {cmd:limit}
will temporarily be set to missing and will be set to the extended {help missing}
value .r when generating {cmd:{it:newvar}}. This option can be used to eliminate
the influence of extreme values on the rule-based definition of outliers and
their randomly drawn replacement values.
{phang}
{opt seed(#)} specifies the initial value of the random-number {help seed} (default: not
set). To enable a replication of the random numbers drawn when adjusting outliers,
{cmd:seed} can be set to any number between between 0 and 2^31-1 (2,147,483,647).
{phang}
{opt rep:licates(#)} specifies the number of replicates for the adjustment of
outliers (default: 250). Replacement values are the averages over the replications
rounded to the next integer. The minium value of {opt rep:licates(#)} is
0 (no replicates).
{phang}
{opt cen:sor} sets all outliers to the constant value of the outlier threshold instead
of adjusting outlying values by random draws from a negative binomial distribution. Note
that only values greater than the maximum of ({cmd:threshold}, {cmd:small}) (see option
{opt sm:all(#)}) will be censored.
{phang}
{opt rem:ove} sets outliers to the extended {help missing} value .o instead of
adjusting outlying values by random draws from a negative binomial distribution. Note
that only values greater than the maximum of ({cmd:threshold}, {cmd:small}) (see option
{opt sm:all(#)}) will be removed by setting them to missing.
{phang}
{opt nod:etail} allows to reduce details of the output.
{phang}
{opt replace} replaces the variable specified by {opth g:enerate(newvar)} if
{cmd:{it:newvar}} exists already.
{title:Example}
{pstd} Using an open response question, 12 year old students were asked to
indicate the number of times their family had moved to a different place. The
frequency distribution of the count variable "moves" shows at least one case
with a rather implausible value of 32.
{pstd} To replicate the example, copy and paste the 26 lines between -clear- and
-tab moves, missing- into Stata's command window:
{cmd:clear}
{cmd:input moves freq}
0 486
1 763
2 315
3 281
4 163
5 88
6 40
7 27
8 9
9 5
10 2
11 2
12 1
13 2
15 2
17 1
18 2
24 1
25 1
32 1
. 12
{cmd:end}
{cmd:expand freq}
{cmd:tab moves, missing}
{pstd}{cmd:nb_adjust} is used to create two new variabels ("moves_adj" and
"moves_rem") with outlying values adjusted and removed, resp. Before using the
default rule of {cmd:nb_adjust} to determine the threshold of outliers and to
randomly draw replacement values, extremely implausible values > 30 will be
removed. A number of three moves are assumed to be "normal", thus the option
{opt sm:all(3)} will be used to restrict outliers to values > 3. Fifteen moves
are assumed to be "large", therefore replacement values of outliers are
constrained to be at least 15.
{pstd}Thus, three options specifiying small values, large values, and the limit
to extreme values will be used for adjusting outliers (large values need not be
specified when removing outliers). The option {opt seed(#)} will be set to the
arbitrary value of 4 to enable a replication of the random draws of replacement
values:
{stata nb_adjust moves, g(moves_adj) sm(3) la(15) li(30) seed(4)}
{stata nb_adjust moves, g(moves_rem) sm(3) li(30) rem}
{pstd}The default rule of {cmd:nb_adjust} determined 12 to be the rule-based
threshold of outliers (overall, 9 values or 0.4% of sample are > 12 and thus
defined as outliers). Because 12 is greater than the minimal size of values
defined as "normal" by using option {opt sm:all(3)}, all 9 cases will have their
values adjusted by random draws from a negative binomial distribution or will be
removed by setting them to the extended missing value .o.
{pstd}Because the outlier threshold of 12 is smaller than the minimal size of
adjusted values specified by the option {opt la:rge(15)}, the lower bound
of adjusted values was reduced to the outlier threshold of 12. The commands
{cmd:fre} (if necessary, {net "describe fre, from(http://fmwww.bc.edu/RePEc/bocode/f)":fre}
will be installed by the first call to {cmd:nb_adjust}) and {help summarize:sum} show
the effect of adjusting and removing the 9 outliers:
{stata fre moves* }
{stata sum moves* }
{title:Saved Results}
{pstd} {cmd:nb_adjust} saves the following in {cmd:r()}: {p_end}
{synoptset 14 tabbed}{...}
{p2col 5 14 18 2: Scalars}{p_end}
{synopt:{cmd:r(small)}}minimum value to define outliers{p_end}
{synopt:{cmd:r(large)}}specified minimal size of replacement values{p_end}
{synopt:{cmd:r(limit)}}limit to define extreme values (0 = not used){p_end}
{synopt:{cmd:r(seed)}}initial value of the random number seed (-1 = not used){p_end}
{synopt:{cmd:r(repl)}}number of replicates of the random draws{p_end}
{synopt:{cmd:r(N)}}number of valid cases (cases not greater than {cmd:limit}) of
{cmd:{it:varname}}{p_end}
{synopt:{cmd:r(mu)}}mean of {cmd:{it:varname}}{p_end}
{synopt:{cmd:r(size)}}mu/delta of {cmd:{it:varname}} as determined by
-{help nbreg} {cmd:{it:varname}}, dispersion(constant)-{p_end}
{synopt:{cmd:r(threshold)}}threshold defining values as outliers{p_end}
{synopt:{cmd:r(nout)}}number of outliers of {cmd:{it:varname}}{p_end}
{synopt:{cmd:r(percout)}}percentage of outliers of {cmd:{it:varname}}{p_end}
{synopt:{cmd:r(low)}}lower bound used for adjustment values{p_end}
{synopt:{cmd:r(nadj)}}number of outliers adjusted (or removed){p_end}
{synoptset 14 tabbed}{...}
{p2col 5 14 18 2: Macros}{p_end}
{synopt:{cmd:r(varname)}}name of variable used{p_end}
{synopt:{cmd:r(newvar)}}name of new variable with values adjusted or removed{p_end}
{synopt:{cmd:r(adj)}}handling of outliers (adjusted, censored, or removed){p_end}
{synopt:{cmd:r(method)}}method to define outliers (rule-based or fixed){p_end}
{title:References}
{p 4 7 2}Enzmann, D. (2023). {it:Reporting Rates as an Indicator of Ignorance: Issues of Measurement and Design}. Hamburg: University of Hamburg, Institute of
Criminal Sciences. [{browse "http://dx.doi.org/10.13140/RG.2.2.20133.68328":http://dx.doi.org/10.13140/RG.2.2.20133.68328}]{p_end}
{title:Requires}
{pstd} {cmd:nb_adjust} requires the SSC packages {net "describe fre, from(http://fmwww.bc.edu/RePEc/bocode/f)":fre},
{net "describe elabel, from(http://fmwww.bc.edu/RePEc/bocode/e)":elabel},
and {net "describe moremata, from(http://fmwww.bc.edu/RePEc/bocode/m)":moremata}. If
necessary they will be installed by the first call of {bf:nb_adjust}.
{title:Author}
{phang}Dirk Enzmann{p_end}
{phang}Institute of Criminal Sciences, Hamburg{p_end}
{phang}email: {browse "mailto:dirk.enzmann@uni-hamburg.de"}{p_end}