help mata kdens()-------------------------------------------------------------------------------

Title

kdens() -- Univariate kernel density estimation

ContentsInformation on functions (syntax, description, remarks, conformability, diagnostics) can be found below under the following headings:

Wrappers

Elementary functions

Dependencies

morematais required. Type. ssc describe moremata

Methods and FormulasSee http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.

Source codekdens.mata

AuthorBen Jann, ETH Zurich, jann@soz.gess.ethz.ch

AknowledgementsSome of the code is loosely based on R code from the "KernSmooth" package (R port by Brian Ripley, S original by Matt Wand) and the "sm" package (R port by Brian Ripley, S original by Adrian W. Bowman and Adelchi Azzalini).

Also seeOnline: help for

kdens,moremata

-------------------------------------------------------------------------------

Wrappers-------------------------------------------------------------------------------

Syntax

d =kdens(x,w,g[,bw,k,a,lb,ub,btype,lbwf])

d =_kdens(x,w,g[,bw,k,a,ll,ul,btype,lbwf])

v =kdens_var(d,x,w,g[,bw,k,pw,lb,ub,btype,lbwf])

v =_kdens_var(d,x,w,g[,bw,k,pw,ll,ul,btype,lbwf])

bw =kdens_bw(x[,w,method,k,m,ll,ul,ldpi])

g =kdens_grid(x[,w,bw,k,m,min,max])

where

d:real colvectorcontaining density estimate

v:real colvectorcontaining variance estimate

bw:real scalarcontaining bandwidth;kdens()andkdens_grid()may replacebwif it is too small

g:real colvectorcontaining equally-spaced grid points at which to estimate the density

x:real colvectorcontaining data points

w:real colvectorcontaining weights

method:string scalarcontaining"silverman"(default),"normalscale","oversmoothed","sjpi", or"dpi"

k:string scalarcontaining"epanechnikov","epan2"(default),"biweight","triweight","cosine","gaussian","parzen","rectangle"or"triangle"

a:real scalarspecifying number of iterations for the adaptive kernel density estimator (default 0)

pw:real scalarindicating that the weights arepweights (normalized to the number of observations)

m:real scalarcontaining size of evaluation grid (number of evaluation points) (default is 512)

lb:real scalarindicating that the support ofxis lower bounded atg[1]

ub:real scalarindicating that the support ofxis upper bounded atg[rows(g)]

btype:real scalarspecifying the method to be used for boundary correction;btype==0: renormalization,btype==1: reflection,btype==2: linear combination

ll:real scalarcontaining lower limit of support ofx

ul:real scalarcontaining upper limit of support ofx

ldpi:real scalarspecifying the number of stages of functional estimation for thedpimethod (default is 2)

min:real scalarspecifying minimum value of evaluation grid; will be ignored ifminis missing or larger thanmin(x)

max:real scalarspecifying maximum value of evaluation grid; will be ignored ifmaxis missing or smaller thanmax(x)

lwbf: will be replaced by the local bandwidth factors in [_]kdens();real colvectorcontaining local bandwidth factors in [_]kdens_var()

Description

kdens()returns the binned approximation kernel density estimate ofxusing evaluation gridg.gmust be a regular grid of equidistant points covering the whole range ofx. Usekdens_grid()to produceg. The default kernel used bykdens()is the gaussian kernel function. Specifykas indicated above to use another kernel function.bwis the bandwidth. Ifbwis omitted, the optimal of Silverman is used. Note thatkdens()imposes a minimum bandwidth (see helpkdensfor details). Ifbwis smaller than the minimum bandwidth, it is reset to this minimum. Specifyaas an integer larger than zero to obtain the adaptive bandwidth kernel density, whereaindicates the number of iterations applied to determine the local bandwidth factors.kdens()supports density estimation for bounded variables. Specifylb!=0 andub!=0 to indicate that the support ofxis bounded. The default method for estimation at the boundaries is the normalization method. Alternatively,btype==1 causes the reflection method to be used andbtype==2 causes the linear combination method to be used. If specified,lwbfwill be replaced by the local bandwidth factors (or set to 1 ifa=0).

_kdens()returns the exact kernel density estimate.llandulspecify the lower and upper limits of the data support.

kdens_var()returns asymptotic point-wise variance estimates for the binned approximation density estimate. Usekdens_var()afterkdens(), but not after_kdens().pw!=1 specifies that the weightsware (normalized) sampling weights. Ifdhas been derived by the adaptive kernel density method (a>=1 inkdens()), the local bandwidth factors,lwbf, should be provided tokdens_var().

kdens_var()returns asymptotic point-wise variance estimates for the exact density estimate. Use_kdens_var()after_kdens(), but not afterkdens().

kdens_bw()returns an estimate of the "optimal" bandwidth given the data and kernel function. Availablemethods are"silverman"(optimal of Silverman; the default),"normalscale"(the normal scale rule),"oversmoothed"(the oversmoothed rule),"sjpi"(the Sheather-Jones plug-in estimate) and"dpi"(a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate). If themethodis"sjpi"or"dpi"you might want to setm, the number of evaluation points used to estimate the density functionals, and for bounded variables the lower and upper limits,llandul. Furthermore, with the"dpi"method you may specify the the number of stages of functional estimation,ldpi(default is 2).

kdens_grid()returns a grid ofmequally-spaced points over the range ofx. The default grid size ism=512. The default range of the grid is [min(x)-bw*tau, max(x)+bw*tau], wherebwis the bandwidth and tau is the halfwidth of the support of kernelk(in the case of the gaussian kernel, tau is set to 3). Alternatively, ifmin<=min(x) is specified, the lower limit of the grid is set tomin. If .>max>=max(x) is specified, the upper limit of the grid is set tomax. Note that, similar tokdens(),kdens_grid()imposes a minimum bandwidth and resetsbwif it is too small (see above).

RemarksSuppose,

xis a data vector. To estimate the density ofxusing a gaussian kernel you could type, for example:: bw = kdens_bw(x, 1, "sjpi") : g = kdens_grid(x, 1, bw) : d = kdens(x, 1, g, bw) : v = kdens_var(d, x, 1, g, bw)

ganddcould then be used, e.g., to plot the density function.vcould be used to construct point-wise confidence intervals.If the adaptive estimator is used and the variance be estimated, the local bandwidth factors have to be passed to

kdens_var(). Example:: bw = kdens_bw(x, 1, "sjpi") : g = kdens_grid(x, 1, bw) : d = kdens(x, 1, g, bw, "", 1, 0, 0, 0, l=.) : v = kdens_var(d, x, 1, g, bw, "", 0, 0, 0, 0, l)

Conformability

kdens(x,w,g,bw,k,a,lb,ub,btype,lbwf),_kdens(x,w,g,bw,k,a,ll,ul,btype,lbwf),kdens_var(d,x,w,g,bw,k,pw,lb,ub,btype,lbwf),_kdens_var(d,x,w,g,bw,k,pw,ll,ul,btype,lbwf),kdens_grid(x,w,bw,k,m,min,max):result:m x1

kdens_bw(x,w,method,k,m,ll,ul,ldpi):result: 1x1where

x:n x1w:n x1 or 1x1g:m x1d:m x1bw: 1x1method: 1x1k: 1x1a: 1x1pw: 1x1m: 1x1lb: 1x1ub: 1x1btype: 1x1ll: 1x1ul: 1x1ldpi: 1x1min: 1x1max: 1x1lwbf:n x1

Diagnostics

kdens()andkdens_var()return invalid results if the gridgis not equally-spaced.

kdens_bw()aborts with error ifll>min(x) orul<max(x) andmethodis"sjpi"or"dpi"._kdens()and_kdens_var()abort with error ifll>min(x) orul<max(x).The functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e. sum of weights = number of observations).

-------------------------------------------------------------------------------

Elementary functions-------------------------------------------------------------------------------

Syntax

real colvectorkdens_gen(x,w,g,h[,k,ll,ul,btype])

real colvectorkdens_bin(g,gc,h[,k,lb,ub,btype])

real colvectorkdens_dd(g,gc,h,drv[,lb,ub])

real colvectorkdens_df(g,gc,h,drv[,lb,ub])

real colvectorkdens_avar(x,w,g,h,d[,k,pw,ll,ul])

real colvectorkdens_evar(x,w,g,h,d[,k,pw,ll,ul,btype])

real scalarkdens_bw_simple(x,w[,rule,scale])

real scalarkdens_bw_sjpi(x,w[,m,scale,ll,ul])

real scalarkdens_bw_dpi(x,w[,m,scale,ll,ul,ldpi])

real scalarkdens_lbwf(x,w,g,d)

where

x:real colvectorcontaining data points

w:real colvectorcontaining weights

g:real colvectorcontaining grid points at which to estimate the density

gc:real colvectorcontaining grid counts

h:real colvectorcontaining (local) bandwidth

d:real colvectorcontaining preliminary density estimate

k:string scalarcontaining"epanechnikov","epan2"(default),"biweight","triweight","cosine","gaussian","parzen","rectangle"or"triangle"

drv:real scalarspecifying the order of derivative

pw:real scalarindicating that the weights arepweights

m:real scalarspecifying the number of equally spaced grid points (default: 401)

rule:string scalarcontaining"silverman"(default),"normalscale", or"oversmoothed"

scale:string scalarcontaining"minim"(default),"stddev","iqr"

lb:real scalarindicating that the support ofxis lower bounded atg[1]

ub:real scalarindicating that the support ofxis upper bounded atg[rows(g)]

btype:real scalarspecifying the method to be used for boundary correction;btype==0: renormalization,btype==1: reflection,btype==2: linear combination

ll:real scalarcontaining lower boundary of support ofx

ul:real scalarcontaining upper boundary of support ofx

ldpi:real scalarspecifying the number of stages of functional estimation (default is 2)

Description

kdens_gen()returns the density estimate ofxat the pointsgusing bandwidthh. Ifhis a scalar,kdens_gen()returns a fixed bandwidth kernel density estimate. Ifhis vector,kdens_gen()returns an adaptive kernel density estimate. Furthermore, specifyll<. andul<. if the support ofxis bounded. The default method for estimation at the boundaries is the normalization method. Alternatively,btype==1 causes the reflection method to be used andbtype==2 causes the linear combination method to be used.

kdens_bin()returns a density estimate based on binned data.gcontains a grid of equidistant evaluation points andgccontains the grid counts. Usemm_makegrid()andmm_fastlinbin()from themorematapackage to producegandgc. If possible,kdens_bin()computes the estimate as the convolution of fast Fourier transforms.

kdens_dd()returns thedrvth density derivative estimate based on binned data using the gaussian kernel function.kdens_dd()is used bykdens_df().drvshould be a nonnegative integer.kdens_dd()supports bounded variables using the reflection method.

kdens_df()returns a density functional estimate based on binned data using the gaussian kernel.kdens_df()is used bykdens_bw_sjpi()andkdens_bw_dpi().drvspecifies the derivative in the functional and should be a nonnegative integer.kdens_df()supports bounded variables using the reflection method.

kdens_avar()returns approximate point-wise variance estimates for the density estimate ind.

kdens_evar()returns exact point-wise variance estimates for the density estimate ind.

kdens_bw_simple()returns a quick and simple bandwidth estimate (standardized, see below). The available estimators are the optimal of Silverman (default), the normal scale rule, and the oversmoothed rule. The default estimator conforms to the automatic bandwidth selection in official Stata'skdensity. Usescaleto determine the scale estimate that is used in bandwidth estimation. The default is to use the minimum of the standard deviation and the inter-quartile range/1.349.

kdens_bw_sjpi()returns the Sheather-Jones plug-in bandwidth estimate (standardized, see below).kdens_bw_sjpi()supports bounded variables using the reflection method. If the Sheather-Jones plug-in estimate is larger than the oversmoothed bandwidth estimate (see above), the latter is returned (this may rarely happen with bounded variables). Missing is returned if the algorithm does not converge (for example, because the estimate is getting to small given the size of the evaluation grid).

kdens_bw_dpi()returns a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate (standardized, see below).ldpiin {0,1,...} specifies the number of stages of functional estimation.level=2 is the default.kdens_bw_dpi()supports bounded variables using the reflection method.Note that the bandwidth estimates returned by

kdens_bw_simple(),kdens_bw_sjpi(), orkdens_bw_dpi()are standardized estimates. They should be multiplied by the kernel's canonical bandwidth before being used for density estimation. For example,kdens_bw_sjpi(...)*mm_kdel0_epan2()returns the Sheather-Jones plug-in bandwidth scaled for use with theepan2kernel.

kdens_lbwf()returns the local bandwidth factors to be used for adaptive kernel density estimation based on a preliminary density estimate.

RemarksSuppose,

xis the data vector (or the variable in the dataset) for which the density be estimated. The commands: h = kdens_bw_simple(x, 1) * mm_kdel0_gaussian() : g = mm_makegrid(x, 50, h) : d = kdens_gen(x, 1, g, h, "epanechnikov")

produce a density estimate equivalent to

. kdensity x

Adaptive kernel density estimation can be implemented as

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2() : g = mm_makegrid(x, 50, h) : d = kdens_gen(x, 1, g, h) : l = kdens_lbwf(x, 1, g, d) : d = kdens_gen(x, 1, g, h*l)

The binned approximation estimator is

: h = kdens_bw_simple(x, 1) * mm_kdel0_epan2() : g = mm_makegrid(x, 512, h) : gc = mm_fastlinbin(x, 1, g) : d = kdens_bin(g, gc, h)

Conformability

kdens_gen(x,w,g,h,k,ll,ul,btype),kdens_avar(x,w,g,h,d,k,pw,ll,ul),kdens_evar(x,w,g,h,d,k,pw,ll,ul,btype):h:n x1 or 1x1result:m x1

kdens_bin(g,gc,h,k,lb,ub,btype):h:m x1 or 1x1result:m x1

kdens_dd(g,gc,h,drv,lb,ub):h: 1x1result: mx1

kdens_df(g,gc,h,drv,lb,ub):h: 1x1result: 1x1

kdens_bw_simple(x,w,rule,scale),kdens_bw_sjpi(x,w,m,scale,ll,ul),kdens_bw_dpi(x,w,m,scale,ll,ul,level):result: 1x1

kdens_lbwf(x,w,g,d):result:n x1where

x:n x1w:n x1 or 1x1g:m x1gc:m x1d:m x1k: 1x1drv: 1x1pw: 1x1m: 1x1rule: 1x1scale: 1x1lb: 1x1ub: 1x1btype: 1x1ll: 1x1ul: 1x1ldpi: 1x1

DiagnosticsThe functions return invalid results if the data contain missing values.

Weights are assumed to be normalized to the number of observations (i.e.