{smcl}
{* 11aug2020}{...}
{cmd:help mata mm_ddens()}
{hline}
{title:Title}
{pstd}
{bf:mm_ddens() -- Density estimation by diffusion}
{title:Syntax}
{p 8 23 2}
{it:real matrix}{bind: }
{cmd:mm_ddens(}{it:X} [{cmd:,} {it:w}{cmd:,} {it:minmax}{cmd:,} {it:n}{cmd:,} {it:h}{cmd:,} {it:qui}]{cmd:)}
{p 4 8 2}
where
{p 14 18 2}{it:X}: {it:real colvector} containing data (missing values not allowed)
{p 14 18 2}{it:w}: {it:real colvector} containing weights (missing or negative
values not allowed); default is {it:w} = 1 (no weights)
{p 9 18 2}{it:minmax}: {it:real vector} specifying the lower and upper bounds
of the support of {it:X}; specify one value (lower boundary only) or
two values (lower and upper boundary); default is {cmd:(.,.)} (unbounded support;
in this case, grid points will be chosen such that they extend the observed data
range by 10 percent on each side)
{p 14 18 2}{it:n}: {it:real scalar} specifying the grid size (number of
evaluation points); will be rounded up to the next power of 2; default is
{it:n} = 2^14 = 16384; specify {cmd:.} (missing) to select the default; reducing the
grid size will reduce the accuracy of the results
{p 14 18 2}{it:h}: {it:real scalar} specifying the bandwidth (kernel halfwidth); the
default is to determine the bandwidth automatically; set {it:h} to {cmd:.} (missing)
to select the default; the obtained bandwidth will be returned in {it:h} (that is,
{it:h} will be replaced)
{p 12 18 2}{it:qui}: {it:qui}!=0 suppresses the warning message that is displayed
if the bandwidth estimation failed and the oversmoothed rule is used to determine
the bandwidth
{title:Description}
{pstd}
{cmd:mm_ddens()} implements (non-adaptive) kernel density estimation by
diffusion as suggested by Botev et al. (2010). The estimator is equivalent
to gaussian kernel density estimation with reflection at the boundaries of
the support. The optimal bandwidth is determined automatically without relying
on preliminary assumptions about distributional shape.
{pstd}
{cmd:mm_ddens()} returns a real matrix containing the estimated density in
the first column and the evaluation points in the second column. The bandwidth
is returned in argument {it:h}.
{pstd}
{cmd:mm_ddens()} is based on {browse "https://www.r-project.org/":R}
code provided by Z.I. Botev in file {browse "http://web1.maths.unsw.edu.au/~zdravkobotev/kde.R":kde.R}
at {browse "https://web.maths.unsw.edu.au/~zdravkobotev/"}.
{title:Examples}
{pstd}
Density estimation with default settings:
. {stata "mata:"}
: {stata x = rnormal(50000,1, 0, 1) \ rnormal(50000,1, 4, 1.5)}
: {stata D = mm_ddens(x)}
: {stata mm_plot(D, "line")}
: {stata end}
{pstd}
{cmd:mm_ddens()} evaluates the density function across a regular grid of
evaluation points. The grid must be of sufficient size for the results to be
accurate. Use linear interpolation if you want to
obtain a density estimate at less points or at irregularly-spaced points. In
the following example, we obtain the density
at different quantiles of X:
. {stata "mata:"}
: {stata x = rnormal(1000,1, 0, 1) \ rnormal(1000,1, 4, 1.5)}
: {stata at = mm_quantile(x,1,(.25, .5, .75)')}
: {stata D = mm_ddens(x)}
: {stata mm_ipolate(D[,2], D[,1], at), at}
: {stata end}
{pstd}
The estimator implemented by {cmd:mm_ddens()} performs well on variables that
have bounded support (equivalent to the so-called reflection estimator). Simply specify
the values of the bounds of the support. Example:
. {stata "mata:"}
: {stata x = rchi2(1000,1,2)}
: {stata D = mm_ddens(x,1, (0,.))}
: {stata mm_plot(D, "line")}
: {stata end}
{pstd}
The bandwidth used by the estimator is returned in argument {it:h}:
. {stata "mata:"}
: {stata x = rnormal(1000,1, 0, 1) \ rnormal(1000,1, 4, 1.5)}
: {stata (void) mm_ddens(x, 1, ., ., h=.)}
: {stata h}
: {stata end}
{pstd}
The returned bandwidth is appropriate for use with a gaussian kernel. To use
the bandwidth with a different kernel, for example, in Stata command
{helpb kdensity}, you need to rescale the bandwidth by the ratio of canonical
bandwidths of the kernels:
. {stata webuse trocolen}
. {stata "mata:"}
: {stata x = st_data(., "length")}
: {stata (void) mm_ddens(x, 1, ., ., h=.)}
: {stata h}
: {stata h = h / mm_kdel0_gaussian() * mm_kdel0_parzen()}
: {stata h}
: {stata stata(sprintf("kdensity length, kernel(parzen) n(100) bwidth(%g)", h))}
: {stata end}
{pstd}
Weights, if specified, are interpreted as (possibly non-integer) frequency
weights. That is, the sample size is equal to the sum of weights. If the weights are
not frequency weights, normalize them such that their sum is equal to the number of observations:
. {stata webuse highschool}
. {stata "mata:"}
: {stata x = st_data(., "weight")}
: {stata w = st_data(., "sampwgt")}
: {stata w = w / sum(w) * rows(x)} // normalize weights
: {stata D = mm_ddens(x, w)}
: {stata mm_plot(D, "line")}
: {stata end}
{pstd}
Furthermore, if the weights are sampling weights, it may make sense
to adjust the bandwidth depending on the the variability of the weights (see
section 8.4 in {browse "http://boris.unibe.ch/69421/2/kdens.pdf":Jann (2007)}:
. {stata webuse highschool}
. {stata "mata:"}
: {stata x = st_data(., "weight")}
: {stata w = st_data(., "sampwgt")}
: {stata w = w / sum(w) * rows(x)} // normalize weights
: {stata (void) mm_ddens(x, w, ., ., h = .)}
: {stata h}
: {stata "h = h * (sum(w:^2)/rows(x))^.2"} // adjust bandwidth
: {stata h}
: {stata D = mm_ddens(x, w, ., ., h)}
: {stata mm_plot(D, "line")}
: {stata end}
{title:Source code}
{pstd}
{help moremata_source##mm_ddens:mm_ddens.mata}
{title:References}
{phang}
Botev, Z.I., J.F. Grotowski, and D.P. Kroese (2010). Kernel density
estimation via diffusion. Annals of Statistics
38(5): 2916-2957. DOI: {browse "http://doi.org/10.1214/10-AOS799":10.1214/10-AOS799}.
{p_end}
{phang}
Jann, B. (2007). Univariate kernel density
estimation. DOI: {browse "http://boris.unibe.ch/69421/2/kdens.pdf":10.7892/boris.69421}.
{p_end}
{title:Author}
{pstd} Ben Jann, University of Bern, ben.jann@soz.unibe.ch
{title:Also see}
{psee}
Online: help for
{helpb mf_mm_density:mm_density()},
{helpb mf_mm_ecdf:mm_ecdf()},
{helpb mf_mm_histogram:mm_histogram()},
{helpb moremata},
{helpb kdensity},
{helpb histogram}
{p_end}