```help mata kdens()
-------------------------------------------------------------------------------

Title

kdens() -- Univariate kernel density estimation

Contents

Information on functions (syntax, description, remarks, conformability,
diagnostics) can be found below under the following headings:

Wrappers

Elementary functions

Dependencies

moremata is required. Type

. ssc describe moremata

Methods and Formulas

See http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.

Source code

kdens.mata

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Aknowledgements

Some of the code is loosely based on R code from the "KernSmooth" package
(R port by Brian Ripley, S original by Matt Wand) and the "sm" package (R
port by Brian Ripley, S original by Adrian W. Bowman and Adelchi
Azzalini).

Also see

Online:  help for kdens, moremata

-------------------------------------------------------------------------------
Wrappers
-------------------------------------------------------------------------------

Syntax

d = kdens(x, w, g [, bw, k, a, lb, ub, btype, lbwf])

d = _kdens(x, w, g [, bw, k, a, ll, ul, btype, lbwf])

v = kdens_var(d, x, w, g [, bw, k, pw, lb, ub, btype, lbwf])

v = _kdens_var(d, x, w, g [, bw, k, pw, ll, ul, btype, lbwf])

bw = kdens_bw(x [, w, method, k, m, ll, ul, ldpi])

g = kdens_grid(x [, w, bw, k, m, min, max])

where

d:  real colvector containing density estimate

v:  real colvector containing variance estimate

bw:  real scalar containing bandwidth; kdens() and kdens_grid()
may replace bw if it is too small

g:  real colvector containing equally-spaced grid points at which
to estimate the density

x:  real colvector containing data points

w:  real colvector containing weights

method:  string scalar containing "silverman" (default),
"normalscale", "oversmoothed", "sjpi", or "dpi"

k:  string scalar containing "epanechnikov", "epan2" (default),
"biweight", "triweight", "cosine", "gaussian", "parzen",
"rectangle" or "triangle"

a:  real scalar specifying number of iterations for the adaptive
kernel density estimator (default 0)

pw:  real scalar indicating that the weights are pweights
(normalized to the number of observations)

m:  real scalar containing size of evaluation grid (number of
evaluation points) (default is 512)

lb:  real scalar indicating that the support of x is lower bounded
at g

ub:  real scalar indicating that the support of x is upper bounded
at g[rows(g)]

btype:  real scalar specifying the method to be used for boundary
correction; btype==0: renormalization, btype==1:
reflection, btype==2: linear combination

ll:  real scalar containing lower limit of support of x

ul:  real scalar containing upper limit of support of x

ldpi:  real scalar specifying the number of stages of functional
estimation for the dpi method (default is 2)

min:  real scalar specifying minimum value of evaluation grid; will
be ignored if min is missing or larger than min(x)

max:  real scalar specifying maximum value of evaluation grid; will
be ignored if max is missing or smaller than max(x)

lwbf:  will be replaced by the local bandwidth factors in
[_]kdens(); real colvector containing local bandwidth
factors in [_]kdens_var()

Description

kdens() returns the binned approximation kernel density estimate of x
using evaluation grid g. g must be a regular grid of equidistant points
covering the whole range of x. Use kdens_grid() to produce g. The default
kernel used by kdens() is the gaussian kernel function. Specify k as
indicated above to use another kernel function. bw is the bandwidth. If
bw is omitted, the optimal of Silverman is used. Note that kdens()
imposes a minimum bandwidth (see help kdens for details). If bw is
smaller than the minimum bandwidth, it is reset to this minimum. Specify
a as an integer larger than zero to obtain the adaptive bandwidth kernel
density, where a indicates the number of iterations applied to determine
the local bandwidth factors. kdens() supports density estimation for
bounded variables. Specify lb!=0 and ub!=0 to indicate that the support
of x is bounded. The default method for estimation at the boundaries is
the normalization method. Alternatively, btype==1 causes the reflection
method to be used and btype==2 causes the linear combination method to be
used. If specified, lwbf will be replaced by the local bandwidth factors
(or set to 1 if a=0).

_kdens() returns the exact kernel density estimate. ll and ul specify the
lower and upper limits of the data support.

kdens_var() returns asymptotic point-wise variance estimates for the
binned approximation density estimate. Use kdens_var() after kdens(), but
not after _kdens(). pw!=1 specifies that the weights w are (normalized)
sampling weights. If d has been derived by the adaptive kernel density
method (a>=1 in kdens()), the local bandwidth factors, lwbf, should be
provided to kdens_var().

kdens_var() returns asymptotic point-wise variance estimates for the
exact density estimate. Use _kdens_var() after _kdens(), but not after
kdens().

kdens_bw() returns an estimate of the "optimal" bandwidth given the data
and kernel function. Available methods are "silverman" (optimal of
Silverman; the default), "normalscale" (the normal scale rule),
"oversmoothed" (the oversmoothed rule), "sjpi" (the Sheather-Jones
plug-in estimate) and "dpi" (a variant of the Sheather-Jones plug-in
estimate called the direct plug-in bandwidth estimate). If the method is
"sjpi" or "dpi" you might want to set m, the number of evaluation points
used to estimate the density functionals, and for bounded variables the
lower and upper limits, ll and ul.  Furthermore, with the "dpi" method
you may specify the the number of stages of functional estimation, ldpi
(default is 2).

kdens_grid() returns a grid of m equally-spaced points over the range of
x. The default grid size is m=512. The default range of the grid is
[min(x)-bw*tau, max(x)+bw*tau], where bw is the bandwidth and tau is the
halfwidth of the support of kernel k (in the case of the gaussian kernel,
tau is set to 3). Alternatively, if min<=min(x) is specified, the lower
limit of the grid is set to min. If .>max>=max(x) is specified, the upper
limit of the grid is set to max. Note that, similar to kdens(),
kdens_grid() imposes a minimum bandwidth and resets bw if it is too small
(see above).

Remarks

Suppose, x is a data vector. To estimate the density of x using a
gaussian kernel you could type, for example:

: bw = kdens_bw(x, 1, "sjpi")
: g  = kdens_grid(x, 1, bw)
: d  = kdens(x, 1, g, bw)
: v  = kdens_var(d, x, 1, g, bw)

g and d could then be used, e.g., to plot the density function. v could
be used to construct point-wise confidence intervals.

If the adaptive estimator is used and the variance be estimated, the
local bandwidth factors have to be passed to kdens_var(). Example:

: bw = kdens_bw(x, 1, "sjpi")
: g  = kdens_grid(x, 1, bw)
: d  = kdens(x, 1, g, bw, "", 1, 0, 0, 0, l=.)
: v  = kdens_var(d, x, 1, g, bw, "", 0, 0, 0, 0, l)

Conformability

kdens(x, w, g, bw, k, a, lb, ub, btype, lbwf),
_kdens(x, w, g, bw, k, a, ll, ul, btype, lbwf),
kdens_var(d, x, w, g, bw, k, pw, lb, ub, btype, lbwf),
_kdens_var(d, x, w, g, bw, k, pw, ll, ul, btype, lbwf),
kdens_grid(x, w, bw, k, m, min, max):
result:  m x 1

kdens_bw(x, w, method, k, m, ll, ul, ldpi):
result:  1 x 1

where

x:  n x 1
w:  n x 1 or 1 x 1
g:  m x 1
d:  m x 1
bw:  1 x 1
method:  1 x 1
k:  1 x 1
a:  1 x 1
pw:  1 x 1
m:  1 x 1
lb:  1 x 1
ub:  1 x 1
btype:  1 x 1
ll:  1 x 1
ul:  1 x 1
ldpi:  1 x 1
min:  1 x 1
max:  1 x 1
lwbf:  n x 1

Diagnostics

kdens() and kdens_var() return invalid results if the grid g is not
equally-spaced.

kdens_bw() aborts with error if ll>min(x) or ul<max(x) and method is
"sjpi" or "dpi".  _kdens() and _kdens_var() abort with error if ll>min(x)
or ul<max(x).

The functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e.
sum of weights = number of observations).

-------------------------------------------------------------------------------
Elementary functions
-------------------------------------------------------------------------------

Syntax

real colvector kdens_gen(x, w, g, h [, k, ll, ul, btype])

real colvector kdens_bin(g, gc, h [, k, lb, ub, btype])

real colvector kdens_dd(g, gc, h, drv [, lb, ub])

real colvector kdens_df(g, gc, h, drv [, lb, ub])

real colvector kdens_avar(x, w, g, h, d [, k, pw, ll, ul])

real colvector kdens_evar(x, w, g, h, d [, k, pw, ll, ul, btype])

real scalar    kdens_bw_simple(x, w [, rule, scale])

real scalar    kdens_bw_sjpi(x, w [, m, scale, ll, ul])

real scalar    kdens_bw_dpi(x, w [, m, scale, ll, ul, ldpi])

real scalar    kdens_lbwf(x, w, g, d)

where

x:  real colvector containing data points

w:  real colvector containing weights

g:  real colvector containing grid points at which to estimate
the density

gc:  real colvector containing grid counts

h:  real colvector containing (local) bandwidth

d:  real colvector containing preliminary density estimate

k:  string scalar containing "epanechnikov", "epan2" (default),
"biweight", "triweight", "cosine", "gaussian", "parzen",
"rectangle" or "triangle"

drv:  real scalar specifying the order of derivative

pw:  real scalar indicating that the weights are pweights

m:  real scalar specifying the number of equally spaced grid
points (default: 401)

rule:  string scalar containing "silverman" (default),
"normalscale", or "oversmoothed"

scale:  string scalar containing "minim" (default), "stddev", "iqr"

lb:  real scalar indicating that the support of x is lower bounded
at g

ub:  real scalar indicating that the support of x is upper bounded
at g[rows(g)]

btype:  real scalar specifying the method to be used for boundary
correction; btype==0: renormalization, btype==1:
reflection, btype==2: linear combination

ll:  real scalar containing lower boundary of support of x

ul:  real scalar containing upper boundary of support of x

ldpi:  real scalar specifying the number of stages of functional
estimation (default is 2)

Description

kdens_gen() returns the density estimate of x at the points g using
bandwidth h. If h is a scalar, kdens_gen() returns a fixed bandwidth
kernel density estimate. If h is vector, kdens_gen() returns an adaptive
kernel density estimate. Furthermore, specify ll<. and ul<. if the
support of x is bounded. The default method for estimation at the
boundaries is the normalization method. Alternatively, btype==1 causes
the reflection method to be used and btype==2 causes the linear
combination method to be used.

kdens_bin() returns a density estimate based on binned data. g contains a
grid of equidistant evaluation points and gc contains the grid counts.
Use mm_makegrid() and mm_fastlinbin() from the moremata package to
produce g and gc. If possible, kdens_bin() computes the estimate as the
convolution of fast Fourier transforms.

kdens_dd() returns the drvth density derivative estimate based on binned
data using the gaussian kernel function.  kdens_dd() is used by
kdens_df(). drv should be a nonnegative integer. kdens_dd() supports
bounded variables using the reflection method.

kdens_df() returns a density functional estimate based on binned data
using the gaussian kernel. kdens_df() is used by kdens_bw_sjpi() and
kdens_bw_dpi().  drv specifies the derivative in the functional and
should be a nonnegative integer. kdens_df() supports bounded variables
using the reflection method.

kdens_avar() returns approximate point-wise variance estimates for the
density estimate in d.

kdens_evar() returns exact point-wise variance estimates for the density
estimate in d.

kdens_bw_simple() returns a quick and simple bandwidth estimate
(standardized, see below). The available estimators are the optimal of
Silverman (default), the normal scale rule, and the oversmoothed rule.
The default estimator conforms to the automatic bandwidth selection in
official Stata's kdensity. Use scale to determine the scale estimate that
is used in bandwidth estimation. The default is to use the minimum of the
standard deviation and the inter-quartile range/1.349.

kdens_bw_sjpi() returns the Sheather-Jones plug-in bandwidth estimate
(standardized, see below). kdens_bw_sjpi() supports bounded variables
using the reflection method. If the Sheather-Jones plug-in estimate is
larger than the oversmoothed bandwidth estimate (see above), the latter
is returned (this may rarely happen with bounded variables). Missing is
returned if the algorithm does not converge (for example, because the
estimate is getting to small given the size of the evaluation grid).

kdens_bw_dpi() returns a variant of the Sheather-Jones plug-in estimate
called the direct plug-in bandwidth estimate (standardized, see below).
ldpi in {0,1,...} specifies the number of stages of functional
estimation. level=2 is the default. kdens_bw_dpi() supports bounded
variables using the reflection method.

Note that the bandwidth estimates returned by kdens_bw_simple(),
kdens_bw_sjpi(), or kdens_bw_dpi() are standardized estimates. They
should be multiplied by the kernel's canonical bandwidth before being
used for density estimation. For example,
kdens_bw_sjpi(...)*mm_kdel0_epan2() returns the Sheather-Jones plug-in
bandwidth scaled for use with the epan2 kernel.

kdens_lbwf() returns the local bandwidth factors to be used for adaptive
kernel density estimation based on a preliminary density estimate.

Remarks

Suppose, x is the data vector (or the variable in the dataset) for which
the density be estimated. The commands

: h = kdens_bw_simple(x, 1) * mm_kdel0_gaussian()
: g = mm_makegrid(x, 50, h)
: d = kdens_gen(x, 1, g, h, "epanechnikov")

produce a density estimate equivalent to

. kdensity x

Adaptive kernel density estimation can be implemented as

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
: g = mm_makegrid(x, 50, h)
: d = kdens_gen(x, 1, g, h)
: l = kdens_lbwf(x, 1, g, d)
: d = kdens_gen(x, 1, g, h*l)

The binned approximation estimator is

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
: g = mm_makegrid(x, 512, h)
: gc = mm_fastlinbin(x, 1, g)
: d = kdens_bin(g, gc, h)

Conformability

kdens_gen(x, w, g, h, k, ll, ul, btype),
kdens_avar(x, w, g, h, d, k, pw, ll, ul),
kdens_evar(x, w, g, h, d, k, pw, ll, ul, btype):
h:  n x 1 or 1 x 1
result:  m x 1

kdens_bin(g, gc, h, k, lb, ub, btype):
h:  m x 1 or 1 x 1
result:  m x 1

kdens_dd(g, gc, h, drv, lb, ub):
h:  1 x 1
result:  m x 1

kdens_df(g, gc, h, drv, lb, ub):
h:  1 x 1
result:  1 x 1

kdens_bw_simple(x, w, rule, scale),
kdens_bw_sjpi(x, w, m, scale, ll, ul),
kdens_bw_dpi(x, w, m, scale, ll, ul, level):
result:  1 x 1

kdens_lbwf(x, w, g, d):
result:  n x 1

where

x:  n x 1
w:  n x 1 or 1 x 1
g:  m x 1
gc:  m x 1
d:  m x 1
k:  1 x 1
drv:  1 x 1
pw:  1 x 1
m:  1 x 1
rule:  1 x 1
scale:  1 x 1
lb:  1 x 1
ub:  1 x 1
btype:  1 x 1
ll:  1 x 1
ul:  1 x 1
ldpi:  1 x 1

Diagnostics

The functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e.
```