help kdens
-------------------------------------------------------------------------------

Title

kdens -- Univariate kernel density estimation

Syntax

kdens varname [if] [in] [weight] [, kdens_options graph_options]

_kdens varname [if] [in] [weight], generate(d [x]) [kdens_options]

twoway kdens varname [if] [in] [weight] [, twoway_kdens_options]

kdens_options          Description
-------------------------------------------------------------------------
Main
  kernel(kernel)       type of kernel function, where kernel is
                         epanechnikov, epan2 (the default), biweight,
                         triweight, cosine, gaussian, parzen, rectangle,
                         or triangle
  exact                use the exact estimator
  n(#)                 estimate density using # points; default is n(512)
  n2(#)                interpolate density estimate to # points
* generate(d [x])      store the density estimate in newvar d and the
                         estimation points in newvar x
  at(var_x)            estimate density at the values in var_x
  range(# #)           range of estimation points, minimum and maximum
  replace              overwrite existing variables

Bandwidth
  bw(#|type)           set bandwidth to #, # > 0, or specify automatic
                         bandwidth selector where type is silverman (the
                         default), normalscale, oversmoothed, sjpi, or
                         dpi[(#)]
  adjust(#)            scale bandwidth by #, # > 0
  adaptive[(#)]        use the adaptive kernel density estimator

Boundary correction
  ll(#)                value of lower boundary
  ul(#)                value of upper boundary
  reflection | lc      use the reflection method or the linear combination
                         method for boundary correction; only one of
                         reflection and lc is allowed; the default method
                         is renormalization

Confidence intervals
  ci[(stub|lo up)]     draw (or store) pointwise confidence intervals
  vce(vcetype)         vcetype may be bootstrap or jackknife plus options;
                         see vce() below for details
  usmooth[(#)]         apply undersmoothing for confidence interval
                         estimation
  variance(V)          store variance estimate in newvar V
  level(#)             set confidence level; default is level(95)
-------------------------------------------------------------------------
* generate() is required for _kdens

graph_options              Description
-------------------------------------------------------------------------
Main
  nograph                  suppress graph

Kernel plot
  cline_options            affect rendition of the plotted kernel density
                             estimate
  ciopts(area_options)     affect rendition of the plotted confidence
                             interval

Density plots
  histogram[(#)]           add a histogram to the graph; # specifies the
                             number of bars
  histopts(twoway_hist)    affect rendition of the histogram
  normal                   add normal density to the graph
  normopts(cline_options)  affect rendition of normal density
  student(#)               add Student's t density with # degrees of
                             freedom to the graph
  stopts(cline_options)    affect rendition of the Student's t density

Add plot
  addplot(plot)            add other plots to the generated graph

Y-Axis, X-Axis, Title, Caption, Legend, Overall
  twoway_options           any options other than by() documented in
                             [G] twoway_options
-------------------------------------------------------------------------

twoway_kdens_options   Description
-------------------------------------------------------------------------
  kernel(kernel)       type of kernel function, as specified above
  exact                use the exact estimator
  n(#)                 estimate density using # points; default is n(512)
  n2(#)                interpolate density estimate to # points
  at(var_x)            estimate density at the values in var_x
  range(# #)           range of estimation points, minimum and maximum
  bw(#|type)           set bandwidth to # or specify automatic bandwidth
                         selector where type is silverman (the default),
                         normalscale, oversmoothed, sjpi, or dpi[(#)]
  adjust(#)            scale bandwidth by #, # > 0
  adaptive[(#)]        use the adaptive kernel density estimator

  ll(#)                value of lower boundary
  ul(#)                value of upper boundary
  reflection | lc      use the reflection method or the linear combination
                         method for boundary correction; the default
                         method is renormalization

  horizontal           graph horizontally

  cline_options        change the look of the line

  axis_choice_options  associate plot with alternative axis

  twoway_options       any options documented in [G] twoway_options
-------------------------------------------------------------------------

fweights, aweights, and pweights are allowed; see weight.

Description

kdens produces univariate kernel density estimates and graphs the result. kdens supplements official Stata's kdensity and also incorporates and extends some of the capabilities of various previous user add-ons such as adgakern (STB-16 snp6), bandw (STB-27 snp6_2), and varwiker (SJ 3-2 st0036) by Salgado-Ugarte et al., akdensity by Van Kerm (SJ 3-2 st0037), and asciker/bsciker by Fiorio (SJ 4-2 st0064).

Main features are:

o kdens is fast. It employs an approximation algorithm based on linearly binned data over a regular grid of estimation points. The algorithm produces very accurate results as long as the grid size is not too small (see the n() option). Alternatively, specify the exact option to use the slow exact estimator.

o Several automatic bandwidth selectors, including the Sheather-Jones plug-in estimate, are available. See the bw() option. In addition, adaptive (variable bandwidth) kernel density estimation is supported (see the adaptive option).

o Optionally, kdens computes pointwise confidence intervals (see the ci and usmooth options), using either asymptotic formulas or replication techniques (see the vce() option).

o Boundary correction for variables with bounded domain is supported. See the ll() and ul() options.

_kdens is the engine used by kdens. The heavy lifting is done in Mata. See mata kdens().
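
To get a feel for how close the binned approximation is to the exact estimator, one could compare the two at the same evaluation points. This is only a sketch: it assumes the trocolen.dta data from the Examples section, and the variable names d_binned, d_exact, and absdiff are illustrative.

```stata
* Sketch: binned approximation versus exact estimator (illustrative names)
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* default estimator: binned approximation; evaluation points stored in x
_kdens length, generate(d_binned x)

* exact estimator, evaluated directly at the points stored in x
_kdens length, generate(d_exact) at(x) exact

* absolute difference between the two estimates
generate double absdiff = abs(d_binned - d_exact)
summarize absdiff
```

Small values of absdiff relative to the density itself would indicate that the default grid size is sufficient for the data at hand.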

Dependencies

kdens requires the moremata package. Type

. ssc describe moremata

Options (density estimation)

    +------+
----+ Main +-------------------------------------------------------------

kernel(kernel) specifies the kernel function. kernel may be epanechnikov (Epanechnikov kernel function), epan2 (alternative Epanechnikov kernel function; the default), biweight (biweight kernel function), triweight (triweight kernel function), cosine (cosine trace), gaussian (Gaussian kernel function), parzen (Parzen kernel function), rectangle (rectangle kernel function), or triangle (triangle kernel function). The different kernel functions usually produce very similar results.
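
The similarity of results across kernels can be checked visually by overlaying a few estimates. The following is a sketch that assumes the trocolen.dta data from the Examples section; the choice of the three kernels and the legend labels are arbitrary.

```stata
* Sketch: overlay density estimates from three kernel functions
use http://www.stata-press.com/data/r7/trocolen.dta, clear

twoway kdens length, kernel(epan2)       ///
    || kdens length, kernel(gaussian)    ///
    || kdens length, kernel(triangle)    ///
    legend(order(1 "epan2" 2 "gaussian" 3 "triangle"))
```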

exact causes the exact kernel density estimator to be used instead of the binned approximation estimator. The exact estimator can be slow in large datasets.

n(#), where # > 2, specifies the "evaluation grid size", i.e., the number of (equally spaced) points at which the density estimate is to be evaluated. The default grid size is 512. This should be enough for the binned approximation estimator to be accurate in most situations (see Hall and Wand 1996). Note that n() also sets the number of estimation points for the sjpi and dpi bandwidth selectors (see the bw() option below).

n2(#), where # must be smaller than or equal to the value of n(), specifies the "output grid size". If n2() is equal to n() (the default), the "evaluation" grid and the "output" grid coincide and the density estimate is returned as is. If n2() is smaller than n(), the density estimate is linearly interpolated from the "evaluation" grid to the "output" grid. Note that n2() is reset to _N, the number of observations in the dataset, if _N is smaller than n2(). n2() has no effect if at() is specified.

generate(d [x]) stores the results of the estimation. newvar d will contain the density estimate. newvar x will contain the points at which the density is evaluated. The results are written to the first n() observations in the dataset in ascending order of evaluation points. Alternatively, if at(var_x) is specified, the density estimate is written to the observations identified by var_x. x must be omitted in this case.
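
As an illustration of how n(), n2(), and generate() interact, the following sketch evaluates the density on a finer grid and stores a coarser, interpolated version. It assumes the trocolen.dta data from the Examples section; the grid sizes 1024 and 100 and the variable names d and x are arbitrary.

```stata
* Sketch: fine evaluation grid, coarse output grid, results stored in d and x
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* evaluate at 1024 points, then interpolate to 100 output points
_kdens length, n(1024) n2(100) generate(d x)

* plot the stored estimate
line d x
```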

at(var_x) specifies a variable that contains the values at which the density is to be estimated. This option makes it easier to obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison. With the binned approximation estimator, the density is first estimated using an equally spaced grid of evaluation points (see the n() option) and is then linearly interpolated to the values of var_x. With the exact estimator, the density is directly estimated at the values of var_x (unless the adaptive option is specified).
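
A possible workflow for comparing subsamples with at() is sketched below. It assumes the trocolen.dta data from the Examples section; the random grouping variable g (constructed as in the Examples section) and the variable names d_all, d_g1, and d_g0 are illustrative.

```stata
* Sketch: estimate subsample densities at a common set of points
use http://www.stata-press.com/data/r7/trocolen.dta, clear
generate byte g = uniform() < .5          // hypothetical grouping variable

* full-sample run, used here to obtain common evaluation points in x
_kdens length, generate(d_all x)

* subsample estimates at the values of x (no x in generate() when at() is used)
_kdens length if g==1, generate(d_g1) at(x)
_kdens length if g==0, generate(d_g0) at(x)

line d_g1 d_g0 x
```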

range(# #) specifies the range of values (minimum and maximum) at which the density is to be estimated. The default range of the evaluation grid is [min(x) - h*tau, max(x) + h*tau], where h is the bandwidth and tau is the halfwidth of the kernel support (for the gaussian kernel, tau is set to 3). This allows the density estimate to become (approximately) zero on both sides of the observed data. Specifying ll(#), ul(#), or at(var_x) may also change the evaluation range.

As with the at() option, range() only affects the "output grid". Internally, the density is estimated over the full data range. The exception is again the exact estimator (unless the adaptive option is specified).
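
The default range formula can be worked through with made-up numbers, and range() can then be used to override it. In the sketch below, the values min(x) = 10, max(x) = 50, and h = 2 are hypothetical, as are the limits 300 and 500 passed to range(); the data are the trocolen.dta data from the Examples section.

```stata
* Worked example with hypothetical values: min(x)=10, max(x)=50, h=2,
* gaussian kernel (tau=3)
display 10 - 2*3      // lower end of the default grid: 4
display 50 + 2*3      // upper end of the default grid: 56

* Sketch: restrict the output to an arbitrary range
use http://www.stata-press.com/data/r7/trocolen.dta, clear
kdens length, range(300 500)
```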

replace permits kdens to overwrite existing variables.

    +-----------+
----+ Bandwidth +--------------------------------------------------------

bw(#|type) may be used to determine the bandwidth of the kernel, that is, the halfwidth of the density window around each evaluation point. bw(#), where # > 0, sets the bandwidth to #. Alternatively, specify bw(type) to choose the automatic bandwidth selector that determines the "optimal" bandwidth. Choices are silverman (optimal of Silverman), normalscale (normal scale rule), oversmoothed (oversmoothed rule), sjpi (Sheather-Jones plug-in estimate), and dpi[(#)] (a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate). The # in dpi() specifies the desired number of stages of functional estimation and should be a nonnegative integer (the default is 2; dpi(0) is equivalent to normalscale). bw(silverman) is the default.

Note that automatic bandwidth estimates are rescaled depending on the canonical bandwidth of the kernel function. As a consequence, density estimates from the different kernel functions are directly comparable. For example, identical results are computed for epanechnikov and epan2 (apart from round-off error), because the two kernel functions are just scaled versions of one another. No bandwidth rescaling is applied if a specific bandwidth value, i.e., bw(#), is specified.

Furthermore, note that kdens imposes a minimum bandwidth. Let d denote the distance between two consecutive points on the evaluation grid. The minimum bandwidth is then h_min = d/2 * cb_k / cb_r, where cb_k is the canonical bandwidth of the applied kernel and cb_r is the canonical bandwidth of the rectangular kernel. If the bandwidth is smaller than h_min, it is reset to h_min.
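
The different ways of specifying bw() might be tried as follows. This is a sketch that assumes the trocolen.dta data from the Examples section; the fixed bandwidth of 20 and the choice of three stages in dpi() are arbitrary.

```stata
* Sketch: automatic selectors versus a fixed bandwidth
use http://www.stata-press.com/data/r7/trocolen.dta, clear

kdens length, bw(sjpi)       // Sheather-Jones plug-in estimate
kdens length, bw(dpi(3))     // direct plug-in estimate, 3 stages
kdens length, bw(20)         // fixed bandwidth; no rescaling is applied
```

Because no rescaling is applied to a fixed bandwidth, the same bw(#) value will generally imply a different amount of smoothing for different kernels.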

adjust(#), where # > 0, causes the bandwidth to be multiplied by #. The default is adjust(1).

adaptive[(#)] specifies that the adaptive kernel density estimator be applied. The adaptive estimator has less bias than the ordinary estimator. # is the desired number of iterations used to determine the local bandwidth factors. The default is 1 (additional iterations usually do not significantly change the density estimate).
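
A quick way to see the effect of these two options is sketched below, assuming the trocolen.dta data from the Examples section. The scaling factor .5 and the two iterations are arbitrary choices.

```stata
* Sketch: rescaled bandwidth and adaptive estimation
use http://www.stata-press.com/data/r7/trocolen.dta, clear

kdens length, adjust(.5)     // half the automatically selected bandwidth
kdens length, adaptive(2)    // adaptive estimator, two iterations
```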

    +---------------------+
----+ Boundary correction +----------------------------------------------

ll(#) and ul(#) specify the lower and upper boundary of the domain of the variable. Note that ll(#) must be lower than or equal to the minimum observed value and ul(#) must be larger than or equal to the maximum observed value. The default method used by kdens for density estimation near the boundaries is the renormalization method.

reflection causes the reflection technique to be used for boundary correction instead of the renormalization method.

lc causes the linear combination technique to be used for boundary correction instead of the renormalization method.

Only one of reflection and lc is allowed. The renormalization method and the reflection method have comparable properties with respect to bias and variance. However, note that the reflection method forces the slope of the density estimate to be zero at the boundary. The linear combination technique is better than the other methods in terms of bias, but has larger variance (and the density estimate may become negative in some situations).
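
The three boundary correction methods can be compared on a variable that is bounded at zero. The sketch below assumes the trocolen.dta data and reuses the length2 construction from the Examples section.

```stata
* Sketch: three boundary correction methods for a variable bounded at 0
use http://www.stata-press.com/data/r7/trocolen.dta, clear
generate length2 = abs(length-417)

kdens length2, ll(0)               // renormalization (the default)
kdens length2, ll(0) reflection    // reflection method
kdens length2, ll(0) lc            // linear combination method
```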

    +----------------------+
----+ Confidence intervals +---------------------------------------------

ci[(stub|lo up)] plots pointwise confidence intervals. If ci(stub) is specified, the results are stored in newvar stub_lo and newvar stub_up. Alternatively, specify ci(lo up) to save the results in newvar lo and newvar up. If ci is specified without arguments, but generate(d [x]) is specified, the confidence intervals are stored in newvar d_lo and newvar d_up.
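
Stored confidence limits can be used to build a custom graph. The following sketch assumes the trocolen.dta data from the Examples section; the variable names d and x are illustrative, and d_lo and d_up follow from the naming rule described above.

```stata
* Sketch: store the estimate and its confidence limits, then plot them
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* ci without arguments plus generate(d x) stores the limits in d_lo and d_up
_kdens length, generate(d x) ci

twoway rarea d_lo d_up x || line d x
```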

vce(vcetype [, vceopts]) indicates that the confidence intervals be estimated using replication techniques. If vce() is omitted, analytic formulas are used to compute the confidence intervals. vcetype may be bootstrap or jackknife. fweights and aweights are not allowed if vce() is specified.

Common vceopts:

strata(varname) specifies a variable that identifies strata. If this option is specified, bootstrap samples are taken independently within each stratum (bootstrap) or stratified jackknife estimates are produced (jackknife).

cluster(varname) specifies a variable that identifies sample clusters. If this option is specified, the sample drawn during each bootstrap replication is a sample of clusters (bootstrap) or entire clusters are left out (jackknife).

nodots suppresses display of the replication dots. By default, a single dot character is displayed for each successful replication. A single red 'x' is displayed if a replication is not successful.

mse indicates that the variances be computed using deviations of the replicates from the density estimate based on the entire dataset. By default, variances are computed using deviations from the average of the replicates.

Additional vceopts for vce(jackknife):

subpop(varname) specifies that estimates be computed for the single subpopulation for which varname!=0.

fpc(varname) requests a finite population correction for the variance estimates. The values in varname are interpreted as stratum sampling rates. The values must be in [0,1] and are assumed to be constant within each stratum.

Additional vceopts for vce(bootstrap):

reps(#) specifies the number of bootstrap replications to be performed. The default is 50. More replications are usually required to get reliable results.

normal computes normal approximation confidence intervals.

percentile computes percentile confidence intervals.

bc computes bias-corrected confidence intervals.

bca computes bias-corrected and accelerated confidence intervals.

t computes percentile-t confidence intervals. The default analytic formulas are used for standard error estimation within the bootstrap replicates.

Only one of normal, percentile, bc, bca, and t is allowed. See [R] bootstrap for methodological details. For the percentile-t method, see help for mm_bs().
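
Replication-based confidence intervals might be requested as in the sketch below, which assumes the trocolen.dta data from the Examples section. The seed, the 500 replications, and the choice of percentile intervals are arbitrary.

```stata
* Sketch: bootstrap and jackknife confidence intervals
use http://www.stata-press.com/data/r7/trocolen.dta, clear

set seed 12345                                        // arbitrary seed, for reproducibility
kdens length, ci vce(bootstrap, reps(500) percentile)

kdens length, ci vce(jackknife, nodots)               // suppress replication dots
```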

usmooth[(#)] specifies that confidence intervals be based on an undersmoothed density estimate in order to reduce the bias. # specifies the degree of undersmoothing and should be between .2 and 1. The default value is 1/4 = .25. Higher values result in stronger undersmoothing. A value of 1/5 = .2 results in no undersmoothing. (See Fiorio 2004.)

variance(V) specifies that the pointwise variance be stored in newvar V.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.
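
These options can be combined, for example as in the following sketch. It assumes the trocolen.dta data from the Examples section; the undersmoothing value .3, the 90% level, and the variable names d, x, and V are arbitrary.

```stata
* Sketch: undersmoothed 90% confidence intervals, with the variance stored
use http://www.stata-press.com/data/r7/trocolen.dta, clear

kdens length, ci usmooth(.3) level(90)

* store the pointwise variance alongside the estimate
_kdens length, generate(d x) variance(V)
```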

Options (graph)

    +------+
----+ Main +-------------------------------------------------------------

nograph suppresses the graph. Instead of specifying nograph, you can use _kdens directly.

    +-------------+
----+ Kernel plot +------------------------------------------------------

cline_options affect the rendition of the plotted kernel density estimate. See connect_options.

ciopts(area_options) specifies details about the rendition of the plotted confidence interval. See area_options.

    +---------------+
----+ Density plots +----------------------------------------------------

histogram[(#)] requests that a histogram of the data be added to the graph. The histogram will be placed in the background, behind the density estimate. # specifies the number of bins to be used.

histopts(options) specifies details about the rendition of the histogram, such as the look of the bars. See twoway histogram.

normal requests that a normal density be overlaid on the density estimate for comparison.

normopts(cline_options) specifies details about the rendition of the normal curve, such as the color and style of line used. See connect_options.

student(#) specifies that a Student's t density with # degrees of freedom be overlaid on the density estimate for comparison.

stopts(cline_options) affect the rendition of the Student's t density. See connect_options.

    +----------+
----+ Add plot +---------------------------------------------------------

addplot(plot) provides a way to add other plots to the generated graph. See addplot_option.

    +-------------------------------------------------+
----+ Y-Axis, X-Axis, Title, Caption, Legend, Overall +------------------

twoway_options are any of the options documented in twoway_options, excluding by(). These include options for titling the graph (see title_options) and options for saving the graph to disk (see saving_option).
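
Several of the graph options can be combined in one call. The sketch below assumes the trocolen.dta data from the Examples section; the number of bins, the degrees of freedom, the line patterns, and the title text are arbitrary.

```stata
* Sketch: density estimate with histogram, reference densities, and a title
use http://www.stata-press.com/data/r7/trocolen.dta, clear

kdens length, histogram(20) normal normopts(lpattern(dash)) ///
    title("Density of length")

kdens length, student(5) stopts(lpattern(dot)) ///
    ci ciopts(recast(rline) pstyle(p2))
```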

Examples

. use http://www.stata-press.com/data/r7/trocolen.dta

. kdens length

. kdens length, bw(sjpi)

. kdens length, adaptive

. kdens length, ci usmooth

. kdens length, ci vce(jackknife)

. kdens length, ci vce(bootstrap, reps(200))

. _kdens length, kernel(parzen) gen(parzen x) replace
. _kdens length, kernel(cosine) gen(cosine) at(x)
. line parzen cosine x

. gen length2 = abs(length-417)
. kdens length2, ll(0) ci

. kdens length, histogram ciopts(recast(rline) pstyle(p2) lp(dash))

. generate byte g = uniform()<.5
. twoway kdens length if g==1 || kdens length if g==0

Methods and Formulas

See http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.

References

Fiorio, C. V. 2004. Confidence intervals for kernel density estimation. The Stata Journal 4: 168-179.

Hall, P. and M. P. Wand. 1996. On the accuracy of binned kernel density estimators. Journal of Multivariate Analysis 56: 165-184.

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Thanks for citing this software as follows:

Jann, B. (2005). kdens: Stata module for univariate kernel density estimation. Available from http://ideas.repec.org/c/boc/bocode/s456410.html.

Also see

Online: mata kdens(), kdensity, graph, histogram, lowess