help mata kdens()
-------------------------------------------------------------------------------

Title

kdens() -- Univariate kernel density estimation

Contents

Information on functions (syntax, description, remarks, conformability, diagnostics) can be found below under the following headings:

Wrappers

Elementary functions

Dependencies

moremata is required. Type

. ssc describe moremata

Methods and Formulas

See http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.

Source code

kdens.mata

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Acknowledgements

Some of the code is loosely based on R code from the "KernSmooth" package (R port by Brian Ripley, S original by Matt Wand) and the "sm" package (R port by Brian Ripley, S original by Adrian W. Bowman and Adelchi Azzalini).

Also see

Online: help for kdens, moremata

------------------------------------------------------------------------------- Wrappers -------------------------------------------------------------------------------

Syntax

d = kdens(x, w, g [, bw, k, a, lb, ub, btype, lbwf])

d = _kdens(x, w, g [, bw, k, a, ll, ul, btype, lbwf])

v = kdens_var(d, x, w, g [, bw, k, pw, lb, ub, btype, lbwf])

v = _kdens_var(d, x, w, g [, bw, k, pw, ll, ul, btype, lbwf])

bw = kdens_bw(x [, w, method, k, m, ll, ul, ldpi])

g = kdens_grid(x [, w, bw, k, m, min, max])

where

d: real colvector containing density estimate

v: real colvector containing variance estimate

bw: real scalar containing bandwidth; kdens() and kdens_grid() may replace bw if it is too small

g: real colvector containing equally-spaced grid points at which to estimate the density

x: real colvector containing data points

w: real colvector containing weights

method: string scalar containing "silverman" (default), "normalscale", "oversmoothed", "sjpi", or "dpi"

k: string scalar containing "epanechnikov", "epan2" (default), "biweight", "triweight", "cosine", "gaussian", "parzen", "rectangle" or "triangle"

a: real scalar specifying number of iterations for the adaptive kernel density estimator (default 0)

pw: real scalar indicating that the weights are pweights (normalized to the number of observations)

m: real scalar containing size of evaluation grid (number of evaluation points) (default is 512)

lb: real scalar indicating that the support of x is lower bounded at g[1]

ub: real scalar indicating that the support of x is upper bounded at g[rows(g)]

btype: real scalar specifying the method to be used for boundary correction; btype==0: renormalization, btype==1: reflection, btype==2: linear combination

ll: real scalar containing lower limit of support of x

ul: real scalar containing upper limit of support of x

ldpi: real scalar specifying the number of stages of functional estimation for the dpi method (default is 2)

min: real scalar specifying minimum value of evaluation grid; will be ignored if min is missing or larger than min(x)

max: real scalar specifying maximum value of evaluation grid; will be ignored if max is missing or smaller than max(x)

lbwf: will be replaced by the local bandwidth factors in [_]kdens(); real colvector containing local bandwidth factors in [_]kdens_var()

Description

kdens() returns the binned approximation kernel density estimate of x using evaluation grid g. g must be a regular grid of equidistant points covering the whole range of x. Use kdens_grid() to produce g. The default kernel used by kdens() is the epan2 kernel function. Specify k as indicated above to use another kernel function. bw is the bandwidth. If bw is omitted, the optimal bandwidth of Silverman is used. Note that kdens() imposes a minimum bandwidth (see help kdens for details). If bw is smaller than the minimum bandwidth, it is reset to this minimum. Specify a as an integer larger than zero to obtain the adaptive kernel density estimate, where a indicates the number of iterations applied to determine the local bandwidth factors. kdens() supports density estimation for bounded variables. Specify lb!=0 and/or ub!=0 to indicate that the support of x is bounded below at g[1] and/or above at g[rows(g)]. The default method for estimation at the boundaries is the renormalization method (btype==0). Alternatively, btype==1 causes the reflection method to be used and btype==2 causes the linear combination method to be used. If specified, lbwf will be replaced by the local bandwidth factors (or set to 1 if a=0).

_kdens() returns the exact kernel density estimate. ll and ul specify the lower and upper limits of the data support.

kdens_var() returns asymptotic point-wise variance estimates for the binned approximation density estimate. Use kdens_var() after kdens(), but not after _kdens(). pw!=0 specifies that the weights w are (normalized) sampling weights. If d has been derived by the adaptive kernel density method (a>=1 in kdens()), the local bandwidth factors, lbwf, should be provided to kdens_var().

_kdens_var() returns asymptotic point-wise variance estimates for the exact density estimate. Use _kdens_var() after _kdens(), but not after kdens().

kdens_bw() returns an estimate of the "optimal" bandwidth given the data and kernel function. Available methods are "silverman" (optimal of Silverman; the default), "normalscale" (the normal scale rule), "oversmoothed" (the oversmoothed rule), "sjpi" (the Sheather-Jones plug-in estimate) and "dpi" (a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate). If the method is "sjpi" or "dpi", you might want to set m, the number of evaluation points used to estimate the density functionals, and, for bounded variables, the lower and upper limits, ll and ul. Furthermore, with the "dpi" method you may specify the number of stages of functional estimation, ldpi (default is 2).

kdens_grid() returns a grid of m equally-spaced points over the range of x. The default grid size is m=512. The default range of the grid is [min(x)-bw*tau, max(x)+bw*tau], where bw is the bandwidth and tau is the halfwidth of the support of kernel k (in the case of the gaussian kernel, tau is set to 3). Alternatively, if min<=min(x) is specified, the lower limit of the grid is set to min. If .>max>=max(x) is specified, the upper limit of the grid is set to max. Note that, similar to kdens(), kdens_grid() imposes a minimum bandwidth and resets bw if it is too small (see above).
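The boundary options described above can be combined as in the following sketch for a variable bounded below at zero. The simulated data and the choice of the reflection method are assumptions made for illustration only; the argument positions follow the syntax diagrams above.

```mata
// Illustrative nonnegative data (an assumption): chi-squared(3) draws
x  = rchi2(500, 1, 3)

bw = kdens_bw(x, 1, "silverman")

// min = 0 <= min(x), so the grid is documented to start at the boundary
g  = kdens_grid(x, 1, bw, "epan2", 512, 0)

// a = 0 (no adaptive iterations), lb = 1 (bounded below at g[1]),
// ub = 0 (unbounded above), btype = 1 (reflection method)
d  = kdens(x, 1, g, bw, "epan2", 0, 1, 0, 1)
```

Because lb refers to g[1], the grid has to begin exactly at the boundary, which is why min is passed to kdens_grid() here.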

Remarks

Suppose x is a data vector. To estimate the density of x using the default (epan2) kernel you could type, for example:

: bw = kdens_bw(x, 1, "sjpi")
: g = kdens_grid(x, 1, bw)
: d = kdens(x, 1, g, bw)
: v = kdens_var(d, x, 1, g, bw)

g and d could then be used, e.g., to plot the density function. v could be used to construct point-wise confidence intervals.
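As a minimal sketch of the point-wise confidence intervals mentioned above, the limits can be built from d and v with a normal approximation. The simulated data, the 95% level, and the names z, se, and ci are assumptions introduced for this illustration.

```mata
// Illustrative data (an assumption; substitute your own column vector)
x  = rnormal(500, 1, 0, 1)

// Same sequence of calls as in the example above
bw = kdens_bw(x, 1, "sjpi")
g  = kdens_grid(x, 1, bw)
d  = kdens(x, 1, g, bw)
v  = kdens_var(d, x, 1, g, bw)

// Point-wise 95% limits based on the normal approximation
z  = invnormal(0.975)
se = sqrt(v)
ci = (d - z*se, d + z*se)     // column 1: lower limit, column 2: upper limit
```

The lower limit from this approximation can fall below zero where the density is close to zero; truncating it at zero is a common but optional adjustment.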

If the adaptive estimator is used and the variance is to be estimated, the local bandwidth factors have to be passed to kdens_var(). Example:

: bw = kdens_bw(x, 1, "sjpi")
: g = kdens_grid(x, 1, bw)
: d = kdens(x, 1, g, bw, "", 1, 0, 0, 0, l=.)
: v = kdens_var(d, x, 1, g, bw, "", 0, 0, 0, 0, l)

Conformability

kdens(x, w, g, bw, k, a, lb, ub, btype, lbwf),
_kdens(x, w, g, bw, k, a, ll, ul, btype, lbwf),
kdens_var(d, x, w, g, bw, k, pw, lb, ub, btype, lbwf),
_kdens_var(d, x, w, g, bw, k, pw, ll, ul, btype, lbwf),
kdens_grid(x, w, bw, k, m, min, max):
result: m x 1

kdens_bw(x, w, method, k, m, ll, ul, ldpi):
result: 1 x 1

where

x: n x 1
w: n x 1 or 1 x 1
g: m x 1
d: m x 1
bw: 1 x 1
method: 1 x 1
k: 1 x 1
a: 1 x 1
pw: 1 x 1
m: 1 x 1
lb: 1 x 1
ub: 1 x 1
btype: 1 x 1
ll: 1 x 1
ul: 1 x 1
ldpi: 1 x 1
min: 1 x 1
max: 1 x 1
lbwf: n x 1

Diagnostics

kdens() and kdens_var() return invalid results if the grid g is not equally-spaced.

kdens_bw() aborts with error if ll>min(x) or ul<max(x) and method is "sjpi" or "dpi". _kdens() and _kdens_var() abort with error if ll>min(x) or ul<max(x).

The functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e. sum of weights = number of observations).
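Since missing values and unnormalized weights lead to invalid results, the inputs may need some preparation before any of the functions above are called. The following is a minimal sketch using built-in Mata functions only; the example values and the name keep are assumptions made for illustration.

```mata
// Illustrative raw data and weights (assumptions)
x = (1.2 \ . \ 0.7 \ 3.4 \ 2.1)
w = (2 \ 1 \ . \ 4 \ 3)

// Keep only rows in which both x and w are nonmissing
keep = selectindex(rowmissing((x, w)) :== 0)
x = x[keep]
w = w[keep]

// Rescale so that the weights sum to the number of observations
w = w * (rows(x) / sum(w))
```

After these steps x and w are conformable, free of missing values, and the weights satisfy the normalization assumed above.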

------------------------------------------------------------------------------- Elementary functions -------------------------------------------------------------------------------

Syntax

real colvector kdens_gen(x, w, g, h [, k, ll, ul, btype])

real colvector kdens_bin(g, gc, h [, k, lb, ub, btype])

real colvector kdens_dd(g, gc, h, drv [, lb, ub])

real colvector kdens_df(g, gc, h, drv [, lb, ub])

real colvector kdens_avar(x, w, g, h, d [, k, pw, ll, ul])

real colvector kdens_evar(x, w, g, h, d [, k, pw, ll, ul, btype])

real scalar kdens_bw_simple(x, w [, rule, scale])

real scalar kdens_bw_sjpi(x, w [, m, scale, ll, ul])

real scalar kdens_bw_dpi(x, w [, m, scale, ll, ul, ldpi])

real colvector kdens_lbwf(x, w, g, d)

where

x: real colvector containing data points

w: real colvector containing weights

g: real colvector containing grid points at which to estimate the density

gc: real colvector containing grid counts

h: real colvector containing (local) bandwidth

d: real colvector containing preliminary density estimate

k: string scalar containing "epanechnikov", "epan2" (default), "biweight", "triweight", "cosine", "gaussian", "parzen", "rectangle" or "triangle"

drv: real scalar specifying the order of derivative

pw: real scalar indicating that the weights are pweights

m: real scalar specifying the number of equally spaced grid points (default: 401)

rule: string scalar containing "silverman" (default), "normalscale", or "oversmoothed"

scale: string scalar containing "minim" (default), "stddev", "iqr"

lb: real scalar indicating that the support of x is lower bounded at g[1]

ub: real scalar indicating that the support of x is upper bounded at g[rows(g)]

btype: real scalar specifying the method to be used for boundary correction; btype==0: renormalization, btype==1: reflection, btype==2: linear combination

ll: real scalar containing lower boundary of support of x

ul: real scalar containing upper boundary of support of x

ldpi: real scalar specifying the number of stages of functional estimation (default is 2)

Description

kdens_gen() returns the density estimate of x at the points g using bandwidth h. If h is a scalar, kdens_gen() returns a fixed bandwidth kernel density estimate. If h is a vector, kdens_gen() returns an adaptive kernel density estimate. Furthermore, specify ll<. and/or ul<. if the support of x is bounded. The default method for estimation at the boundaries is the renormalization method (btype==0). Alternatively, btype==1 causes the reflection method to be used and btype==2 causes the linear combination method to be used.

kdens_bin() returns a density estimate based on binned data. g contains a grid of equidistant evaluation points and gc contains the grid counts. Use mm_makegrid() and mm_fastlinbin() from the moremata package to produce g and gc. If possible, kdens_bin() computes the estimate as the convolution of fast Fourier transforms.

kdens_dd() returns the drvth density derivative estimate based on binned data using the gaussian kernel function. kdens_dd() is used by kdens_df(). drv should be a nonnegative integer. kdens_dd() supports bounded variables using the reflection method.

kdens_df() returns a density functional estimate based on binned data using the gaussian kernel. kdens_df() is used by kdens_bw_sjpi() and kdens_bw_dpi(). drv specifies the derivative in the functional and should be a nonnegative integer. kdens_df() supports bounded variables using the reflection method.

kdens_avar() returns approximate point-wise variance estimates for the density estimate in d.

kdens_evar() returns exact point-wise variance estimates for the density estimate in d.

kdens_bw_simple() returns a quick and simple bandwidth estimate (standardized, see below). The available estimators are the optimal of Silverman (default), the normal scale rule, and the oversmoothed rule. The default estimator conforms to the automatic bandwidth selection in official Stata's kdensity. Use scale to determine the scale estimate that is used in bandwidth estimation. The default is to use the minimum of the standard deviation and the inter-quartile range/1.349.

kdens_bw_sjpi() returns the Sheather-Jones plug-in bandwidth estimate (standardized, see below). kdens_bw_sjpi() supports bounded variables using the reflection method. If the Sheather-Jones plug-in estimate is larger than the oversmoothed bandwidth estimate (see above), the latter is returned (this may happen, although rarely, with bounded variables). Missing is returned if the algorithm does not converge (for example, because the estimate is getting too small given the size of the evaluation grid).

kdens_bw_dpi() returns a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate (standardized, see below). ldpi in {0,1,...} specifies the number of stages of functional estimation. ldpi=2 is the default. kdens_bw_dpi() supports bounded variables using the reflection method.

Note that the bandwidth estimates returned by kdens_bw_simple(), kdens_bw_sjpi(), or kdens_bw_dpi() are standardized estimates. They should be multiplied by the kernel's canonical bandwidth before being used for density estimation. For example, kdens_bw_sjpi(...)*mm_kdel0_epan2() returns the Sheather-Jones plug-in bandwidth scaled for use with the epan2 kernel.
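Because kdens_bw_sjpi() may return missing when the algorithm does not converge, it can be useful to guard the scaling step. The following is a sketch only: falling back to kdens_bw_simple() is an assumption about what a user might reasonably do, not documented behavior of the package, and the simulated data are illustrative.

```mata
// Illustrative data (an assumption)
x  = rnormal(500, 1, 0, 1)

// Standardized Sheather-Jones plug-in estimate
hs = kdens_bw_sjpi(x, 1)

// Fall back to the simple rule if the plug-in estimate is missing
if (hs >= .) hs = kdens_bw_simple(x, 1)

// Scale by the canonical bandwidth of the kernel that will be used
h  = hs * mm_kdel0_epan2()
```

The resulting h is on the scale of the epan2 kernel and could then be passed to kdens_gen() or kdens_bin().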

kdens_lbwf() returns the local bandwidth factors to be used for adaptive kernel density estimation based on a preliminary density estimate.

Remarks

Suppose x is the data vector (or the variable in the dataset) for which the density is to be estimated. The commands

: h = kdens_bw_simple(x, 1) * mm_kdel0_epanechnikov()
: g = mm_makegrid(x, 50, h)
: d = kdens_gen(x, 1, g, h, "epanechnikov")

produce a density estimate equivalent to

. kdensity x

Adaptive kernel density estimation can be implemented as

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
: g = mm_makegrid(x, 50, h)
: d = kdens_gen(x, 1, g, h)
: l = kdens_lbwf(x, 1, g, d)
: d = kdens_gen(x, 1, g, h*l)
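The adaptive step above can be repeated so that each pass derives new local bandwidth factors from the previous estimate. The sketch below shows one way this might look; the number of passes a and the simulated data are assumptions, and this help file does not confirm that the loop reproduces kdens() with a>1 exactly.

```mata
// Illustrative data (an assumption)
x = rnormal(500, 1, 0, 1)

h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
g = mm_makegrid(x, 50, h)

// Fixed-bandwidth pilot estimate
d = kdens_gen(x, 1, g, h)

// Hypothetical number of refinement passes
a = 3
for (i=1; i<=a; i++) {
    l = kdens_lbwf(x, 1, g, d)      // local factors from current estimate
    d = kdens_gen(x, 1, g, h*l)     // re-estimate with local bandwidths
}
```

Here h is a scalar and l is n x 1, so h*l is the n x 1 vector of local bandwidths that kdens_gen() accepts.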

The binned approximation estimator is

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
: g = mm_makegrid(x, 512, h)
: gc = mm_fastlinbin(x, 1, g)
: d = kdens_bin(g, gc, h)

Conformability

kdens_gen(x, w, g, h, k, ll, ul, btype),
kdens_avar(x, w, g, h, d, k, pw, ll, ul),
kdens_evar(x, w, g, h, d, k, pw, ll, ul, btype):
h: n x 1 or 1 x 1
result: m x 1

kdens_bin(g, gc, h, k, lb, ub, btype):
h: m x 1 or 1 x 1
result: m x 1

kdens_dd(g, gc, h, drv, lb, ub):
h: 1 x 1
result: m x 1

kdens_df(g, gc, h, drv, lb, ub):
h: 1 x 1
result: 1 x 1

kdens_bw_simple(x, w, rule, scale),
kdens_bw_sjpi(x, w, m, scale, ll, ul),
kdens_bw_dpi(x, w, m, scale, ll, ul, ldpi):
result: 1 x 1

kdens_lbwf(x, w, g, d):
result: n x 1

where

x: n x 1
w: n x 1 or 1 x 1
g: m x 1
gc: m x 1
d: m x 1
k: 1 x 1
drv: 1 x 1
pw: 1 x 1
m: 1 x 1
rule: 1 x 1
scale: 1 x 1
lb: 1 x 1
ub: 1 x 1
btype: 1 x 1
ll: 1 x 1
ul: 1 x 1
ldpi: 1 x 1

Diagnostics

The functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e.