help kdens
-------------------------------------------------------------------------------

Title

kdens -- Univariate kernel density estimation

Syntax

kdens varname [if] [in] [weight] [, kdens_options graph_options ]

_kdens varname [if] [in] [weight] , generate(d [x]) [ kdens_options ]

twoway kdens varname [if] [in] [weight] [, twoway_kdens_options ]

kdens_options             Description
-------------------------------------------------------------------------
Main
  kernel(kernel)          type of kernel function, where kernel is
                            epanechnikov, epan2 (the default), biweight,
                            triweight, cosine, gaussian, parzen,
                            rectangle, or triangle
  exact                   use the exact estimator
  n(#)                    estimate density using # points; default is
                            n(512)
  n2(#)                   interpolate density estimate to # points
* generate(d [x])         store the density estimate in newvar d and the
                            estimation points in newvar x
  at(var_x)               estimate density at the values in var_x
  range(# #)              range of estimation points, minimum and maximum
  replace                 overwrite existing variables

Bandwidth
  bw(#|type)              set bandwidth to #, # > 0, or specify automatic
                            bandwidth selector where type is silverman
                            (the default), normalscale, oversmoothed,
                            sjpi, or dpi[(#)]
  adjust(#)               scale bandwidth by #, # > 0
  adaptive[(#)]           use the adaptive kernel density estimator

Boundary correction
  ll(#)                   value of lower boundary
  ul(#)                   value of upper boundary
  reflection | lc         use the reflection method or the linear
                            combination method for boundary correction;
                            only one of reflection and lc is allowed; the
                            default method is renormalization

Confidence intervals
  ci[(stub|lo up)]        draw (or store) pointwise confidence intervals
  vce(vcetype)            vcetype may be bootstrap or jackknife plus
                            options; see vce() below for details
  usmooth[(#)]            apply undersmoothing for confidence interval
                            estimation
  variance(V)             store variance estimate in newvar V
  level(#)                set confidence level; default is level(95)
-------------------------------------------------------------------------
* generate() is required for _kdens

graph_options             Description
-------------------------------------------------------------------------
Main
  nograph                 suppress graph

Kernel plot
  cline_options           affect rendition of the plotted kernel density
                            estimate
  ciopts(area_options)    affect rendition of the plotted confidence
                            interval

Density plots
  histogram[(#)]          add a histogram to the graph; # specifies the
                            number of bars
  histopts(twoway_hist)   affect rendition of the histogram
  normal                  add normal density to the graph
  normopts(cline_options) affect rendition of normal density
  student(#)              add Student's t density with # degrees of
                            freedom to the graph
  stopts(cline_options)   affect rendition of the Student's t density

Add plot
  addplot(plot)           add other plots to the generated graph

Y-Axis, X-Axis, Title, Caption, Legend, Overall
  twoway_options          any options other than by() documented in
                            [G] twoway_options
-------------------------------------------------------------------------

twoway_kdens_options      Description
-------------------------------------------------------------------------
  kernel(kernel)          type of kernel function, as specified above
  exact                   use the exact estimator
  n(#)                    estimate density using # points; default is
                            n(512)
  n2(#)                   interpolate density estimate to # points
  at(var_x)               estimate density at the values in var_x
  range(# #)              range of estimation points, minimum and maximum
  bw(#|type)              set bandwidth to # or specify automatic
                            bandwidth selector where type is silverman
                            (the default), normalscale, oversmoothed,
                            sjpi, or dpi[(#)]
  adjust(#)               scale bandwidth by #, # > 0
  adaptive[(#)]           use the adaptive kernel density estimator

  ll(#)                   value of lower boundary
  ul(#)                   value of upper boundary
  reflection | lc         use the reflection method or the linear
                            combination method for boundary correction;
                            the default method is renormalization

  horizontal              graph horizontally

  cline_options           change the look of the line

  axis_choice_options     associate plot with alternative axis

  twoway_options          any options documented in [G] twoway_options
-------------------------------------------------------------------------

fweights, aweights, and pweights are allowed; see weight.

Description

kdens produces univariate kernel density estimates and graphs the result. kdens supplements official Stata's kdensity and also incorporates and extends some of the capabilities of various previous user add-ons such as adgakern (STB-16 snp6), bandw (STB-27 snp6_2), and varwiker (SJ 3-2 st0036) by Salgado-Ugarte et al., akdensity by Van Kerm (SJ 3-2 st0037), and asciker/bsciker by Fiorio (SJ 4-2 st0064).

Main features are:

o kdens is fast. It employs an approximation algorithm based on linearly binned data over a regular grid of estimation points. The algorithm produces very accurate results as long as the grid size is not too small (see the n() option). Alternatively, specify the exact option to use the slow exact estimator.

o Several automatic bandwidth selectors including the Sheather-Jones plug-in estimate are available. See the bw() option. In addition, adaptive (variable bandwidth) kernel density estimation is supported (see the adaptive option).

o Optionally, kdens computes pointwise confidence intervals (see the ci and usmooth options), either using asymptotic formulas or replication techniques (see the vce() option).

o Boundary correction for variables with bounded domain is supported. See the ll() and ul() options.

_kdens is the engine used by kdens. The heavy lifting is done in Mata. See mata kdens().

Dependencies

kdens requires the moremata package. Type

. ssc describe moremata
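
A minimal installation sketch follows. It assumes that both kdens and moremata are obtained from the SSC archive and that Stata has internet access; ssc install and which are official Stata commands.

```stata
* Install the moremata dependency and kdens itself from SSC
ssc install moremata, replace
ssc install kdens, replace

* Confirm that Stata can locate the installed command
which kdens
```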

Options (density estimation)

    +------+
----+ Main +-------------------------------------------------------------

kernel(kernel) specifies the kernel function. kernel may be epanechnikov (Epanechnikov kernel function), epan2 (alternative Epanechnikov kernel function; the default), biweight (biweight kernel function), triweight (triweight kernel function), cosine (cosine trace), gaussian (Gaussian kernel function), parzen (Parzen kernel function), rectangle (rectangle kernel function), or triangle (triangle kernel function). The different kernel functions usually produce very similar results. The default is epan2.

exact causes the exact kernel density estimator to be used instead of the binned approximation estimator. The exact estimator can be slow in large datasets.
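
One way to judge the accuracy of the binned approximation on a given dataset is to evaluate both estimators at the same points and compare them. The sketch below uses the dataset from the Examples section; the variable names d_bin, d_exact, and absdiff are arbitrary, and kdens and moremata are assumed to be installed.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Binned approximation on the default grid; x receives the evaluation points
_kdens length, generate(d_bin x) replace

* Exact estimator, evaluated at the same points
_kdens length, exact generate(d_exact) at(x) replace

* Absolute difference between the two estimates
generate double absdiff = abs(d_bin - d_exact)
summarize absdiff
```

If the maximum reported by summarize is not negligible relative to the density values, a larger n() is worth trying before switching to exact.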

n(#), where # > 2, specifies the "evaluation grid size", i.e., the number of (equally spaced) points at which the density estimate is to be evaluated. The default is n(512). This should be enough for the binned approximation estimator to be accurate in most situations (see Hall and Wand 1996). Note that n() also sets the number of estimation points for the sjpi and dpi bandwidth selectors (see the bw() option below).

n2(#), where # must be equal to the value of n() or smaller, specifies the "output grid size". If n2() is equal to n() (the default), then the "evaluation" grid and the "output" grid coincide and the density estimate is returned as is. However, if n2() is smaller than n(), the density estimate will be linearly interpolated from the "evaluation" grid to the "output" grid. Note that n2() will be reset to _N, the number of observations in the dataset, if _N is smaller than n2(). n2() has no effect if at() is specified.
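
As an illustration of the two grids, the following sketch estimates the density on a fine evaluation grid and stores it on a coarser output grid. The grid sizes 1024 and 50 are arbitrary choices for this example, and kdens and moremata are assumed to be installed.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Evaluate on 1024 points, then interpolate to 50 output points
_kdens length, n(1024) n2(50) generate(d x) replace

* Number of observations that received a density value
count if !missing(d)

* Plot the stored estimate
line d x
```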

generate(d [x]) stores the results of the estimation. newvar d will contain the density estimate. newvar x will contain the points at which the density is evaluated. The results are written to the first n() observations in the dataset in ascending order of evaluation points. Alternatively, if at(var_x) is specified, the density estimate is written to the observations identified by var_x. x must be omitted in this case.

at(var_x) specifies a variable that contains the values at which the density is to be estimated. This option makes it easier to obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison. With the binned approximation estimator, the density is first estimated using an equally spaced grid of evaluation points (see the n() option) and is then linearly interpolated to the values of var_x. With the exact estimator, the density is directly estimated at the values of var_x (unless the adaptive option is specified).
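
The subsample use case can be sketched as follows: a common grid is created once and then reused through at(). The grouping variable g, the seed, and the names d_all, d1, and d0 are made up for this example. The sketch also assumes, in line with the description above, that the if qualifier restricts only the estimation sample and not the observations in which x is read.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Hypothetical random split into two subsamples
set seed 12345
generate byte g = runiform() < .5

* Full-sample estimate; x becomes the common grid of evaluation points
_kdens length, generate(d_all x) replace

* Subsample estimates, interpolated to the common grid
_kdens length if g==1, generate(d1) at(x) replace
_kdens length if g==0, generate(d0) at(x) replace

* Overlay the two subsample densities
line d1 d0 x
```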

range(# #) specifies the range of values (minimum and maximum) at which the density is to be estimated. The default range of the evaluation grid is defined as [min(x)-h*tau, max(x)+h*tau], where h is the bandwidth and tau is the halfwidth of the kernel support (in the case of the gaussian kernel, tau is set to 3). This allows the density estimate to become (approximately) zero on both sides of the observed data. Specifying ll(#), ul(#), or at(var_x) may also change the evaluation range.

As with the at() option, range() only affects the "output grid". Internally, the density will be estimated over the full data range. An exception is again the exact estimator (unless the adaptive option is specified).
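
For instance, to report the density over the observed data range only, rather than over the default range that extends h*tau beyond the data on each side, the limits can be taken from summarize. This is a sketch that assumes kdens and moremata are installed.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Observed minimum and maximum are left behind in r(min) and r(max)
summarize length

* Restrict the output grid to the observed range
kdens length, range(`r(min)' `r(max)')
```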

replace permits kdens to overwrite existing variables.

    +-----------+
----+ Bandwidth +--------------------------------------------------------

bw(#|type) sets the bandwidth of the kernel, that is, the halfwidth of the density window around each evaluation point. bw(#), where # > 0, sets the bandwidth to #. Alternatively, specify bw(type) to choose the automatic bandwidth selector that determines the "optimal" bandwidth. Choices are silverman (optimal of Silverman), normalscale (normal scale rule), oversmoothed (oversmoothed rule), sjpi (Sheather-Jones plug-in estimate), and dpi[(#)] (a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate). The # in dpi() specifies the desired number of stages of functional estimation and should be a nonnegative integer (the default is 2; dpi(0) is equivalent to normalscale). bw(silverman) is the default.
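
To see how much the choice of selector matters for a given variable, the estimates can be overlaid with twoway kdens. The selection of selectors, the legend labels, and the value dpi(3) are arbitrary choices for this sketch.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Overlay density estimates from three automatic bandwidth selectors
twoway kdens length, bw(silverman)  ///
    || kdens length, bw(sjpi)       ///
    || kdens length, bw(dpi(3))     ///
    ||, legend(order(1 "silverman" 2 "sjpi" 3 "dpi(3)"))
```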

Note that automatic bandwidth estimates are rescaled depending on the canonical bandwidth of the kernel function. A consequence of this is that density estimates from the different kernel functions are directly comparable. For example, identical results are computed for epanechnikov and epan2 (apart from round-off error), because the two kernel functions are just scaled versions of one another. No bandwidth rescaling is applied if a specific bandwidth value, i.e. bw(#), is specified.

Furthermore, note that kdens imposes a minimum bandwidth. Let d denote the distance between two consecutive points on the evaluation grid. The minimum bandwidth then is h_min = d/2 * cb_k / cb_r, where cb_k is the canonical bandwidth of the applied kernel and cb_r is the canonical bandwidth of the rectangular kernel. If the bandwidth is smaller than h_min, it is reset to h_min.
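
The arithmetic behind h_min can be worked through with scalars. The grid below (512 points on [0, 100]) is hypothetical, and the canonical bandwidths are computed from the standard definition (R(K)/mu2(K)^2)^(1/5) for kernels with support [-1, 1]; whether kdens uses exactly these values internally is an assumption of this sketch.

```stata
* Hypothetical evaluation grid: 512 equally spaced points on [0, 100]
scalar d = (100 - 0) / (512 - 1)

* Canonical bandwidths under the standard definition (assumed, see above)
scalar cb_k = 15^(1/5)     // Epanechnikov kernel on [-1, 1]: R = 3/5, mu2 = 1/5
scalar cb_r = 4.5^(1/5)    // rectangular kernel on [-1, 1]: R = 1/2, mu2 = 1/3

* Minimum bandwidth as defined in the text
scalar h_min = d/2 * cb_k / cb_r
display "grid spacing d = " d
display "h_min          = " h_min
```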

adjust(#), where # > 0, causes the bandwidth to be multiplied by #. The default is adjust(1).

adaptive[(#)] specifies that the adaptive kernel density estimator be applied. The adaptive estimator has less bias than the ordinary estimator. # is the desired number of iterations used to determine the local bandwidth factors. The default is 1 (additional iterations usually do not significantly change the density estimate).
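
Whether extra iterations matter for a particular variable can be checked by overlaying the estimates. The choice of three iterations and the legend labels are arbitrary in this sketch.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Fixed-bandwidth estimate versus adaptive estimates with 1 and 3 iterations
twoway kdens length               ///
    || kdens length, adaptive     ///
    || kdens length, adaptive(3)  ///
    ||, legend(order(1 "fixed" 2 "adaptive" 3 "adaptive(3)"))
```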

    +---------------------+
----+ Boundary correction +----------------------------------------------

ll(#) and ul(#) specify the lower and upper boundary of the domain of the variable. Note that ll(#) must be lower than or equal to the minimum observed value and ul(#) must be larger than or equal to the maximum observed value. The default method used by kdens for density estimation near the boundaries is the renormalization method.

reflection causes the reflection technique to be used for boundary correction instead of the renormalization method.

lc causes the linear combination technique to be used for boundary correction instead of the renormalization method.

Only one of reflection and lc is allowed. The renormalization method and the reflection method have comparable properties with respect to bias and variance. Note, however, that the reflection method forces the slope of the density estimate to be zero at the boundary. The linear combination technique has less bias than the other methods but larger variance, and its density estimate may become negative in some situations.
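
The three boundary-correction methods can be compared on a variable that is bounded below at zero. The sketch reuses the transformed variable from the Examples section; the legend labels are added for this illustration.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Nonnegative variable with a lower boundary at 0
generate length2 = abs(length - 417)

* Renormalization (default), reflection, and linear combination
twoway kdens length2, ll(0)             ///
    || kdens length2, ll(0) reflection  ///
    || kdens length2, ll(0) lc          ///
    ||, legend(order(1 "renormalization" 2 "reflection" 3 "linear combination"))
```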

    +----------------------+
----+ Confidence intervals +---------------------------------------------

ci[(stub|lo up)] plots pointwise confidence intervals. If ci(stub) is specified, the results are stored in newvar stub_lo and newvar stub_up. Alternatively, specify ci(lo up) to save the results in newvar lo and newvar up. If ci is specified without arguments, but generate(d [x]) is specified, the confidence intervals are stored in newvar d_lo and newvar d_up.
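
The naming rule for stored limits can be illustrated with _kdens. In this sketch ci is given without arguments, so, following the rule above, the limits are expected in d_lo and d_up; the plot styling is an arbitrary choice.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Store the estimate in d, the grid in x, and the limits in d_lo and d_up
_kdens length, generate(d x) ci replace
describe d x d_lo d_up

* Draw the stored interval as a band behind the density curve
twoway rarea d_lo d_up x, color(gs13) || line d x
```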

vce(vcetype [, vceopts]) indicates that the confidence intervals be estimated using replication techniques. If vce() is omitted, analytic formulas are used to compute the confidence intervals. vcetype may be bootstrap or jackknife. fweights and aweights are not allowed if vce() is specified.

Common vceopts:

strata(varname) specifies a variable that identifies strata. If this option is specified, bootstrap samples are taken independently within each stratum (bootstrap) or stratified jackknife estimates are produced (jackknife).

cluster(varname) specifies a variable that identifies sample clusters. If this option is specified, the sample drawn during each bootstrap replication is a sample of clusters (bootstrap) or entire clusters are left out in turn (jackknife).

nodots suppresses display of the replication dots. By default, a single dot character is displayed for each successful replication. A single red 'x' is displayed if a replication is not successful.

mse indicates that the variances be computed using deviations of the replicates from the density estimate based on the entire dataset. By default, variances are computed using deviations from the average of the replicates.

Additional vceopts for vce(jackknife):

subpop(varname) specifies that estimates be computed for the single subpopulation for which varname!=0.

fpc(varname) requests a finite population correction for the variance estimates. The values in varname are interpreted as stratum sampling rates. The values must be in [0,1] and are assumed to be constant within each stratum.

Additional vceopts for vce(bootstrap):

reps(#) specifies the number of bootstrap replications to be performed. The default is 50. More replications are usually required to get reliable results.

normal computes normal approximation confidence intervals.

percentile computes percentile confidence intervals.

bc computes bias-corrected confidence intervals.

bca computes bias-corrected and accelerated confidence intervals.

t computes percentile-t confidence intervals. The default analytic formulas are used for standard error estimation within the bootstrap replicates.

Only one of normal, percentile, bc, bca, and t is allowed. See [R] bootstrap for methodological details. For the percentile-t method, see help for mm_bs().
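
The following sketch combines vce() with some of the suboptions listed above. The seed and the number of replications are arbitrary; 500 is used only because the default of 50 is described as usually too small.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Make the bootstrap draws reproducible
set seed 2005

* Percentile bootstrap intervals from 500 replications, without dots
kdens length, ci vce(bootstrap, reps(500) percentile nodots)

* Jackknife intervals with variances taken around the full-sample estimate
kdens length, ci vce(jackknife, mse nodots)
```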

usmooth(#) specifies that confidence intervals be based on an undersmoothed density estimate in order to reduce the bias. # specifies the degree of undersmoothing and should be between .2 and 1. The default value is 1/4 = .25. Higher values result in stronger undersmoothing. A value of 1/5 = .2 results in no undersmoothing. (See Fiorio 2004.)

variance(V) specifies that the pointwise variance be stored in newvar V.

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level.
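
The remaining confidence-interval options can be combined in a single call. In this sketch the undersmoothing degree .3, the 90% level, and the variable names V and se are arbitrary choices.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* 90% intervals from an undersmoothed estimate; pointwise variance goes to V
_kdens length, generate(d x) ci usmooth(.3) variance(V) level(90) replace

* Pointwise standard errors derived from the stored variance
generate double se = sqrt(V)
list x d se d_lo d_up in 1/5
```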

Options (graph)

    +------+
----+ Main +-------------------------------------------------------------

nograph suppresses the graph. Instead of specifying nograph you might as well use _kdens directly.

    +-------------+
----+ Kernel plot +------------------------------------------------------

cline_options affect the rendition of the plotted kernel density estimate. See connect_options.

ciopts(area_options) specifies details about the rendition of the plotted confidence interval. See area_options.

    +---------------+
----+ Density plots +----------------------------------------------------

histogram[(#)] requests that a histogram of the data be added to the graph. The histogram will be placed in the background, behind the density estimate. # specifies the number of bins to be used.

histopts(options) specifies details about the rendition of the histogram, such as the look of the bars. See twoway histogram.

normal requests that a normal density be overlaid on the density estimate for comparison.

normopts(cline_options) specifies details about the rendition of the normal curve, such as the color and style of line used. See connect_options.

student(#) specifies that a Student's t density with # degrees of freedom be overlaid on the density estimate for comparison.

stopts(cline_options) specifies details about the rendition of the Student's t density. See connect_options.

    +----------+
----+ Add plot +---------------------------------------------------------

addplot(plot) provides a way to add other plots to the generated graph. See addplot_option.

    +-------------------------------------------------+
----+ Y-Axis, X-Axis, Title, Caption, Legend, Overall +------------------

twoway_options are any of the options documented in twoway_options, excluding by(). These include options for titling the graph (see title_options) and options for saving the graph to disk (see saving_option).
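
Several of the graph options above can be combined in one call, as in the following sketch. The number of bins, the degrees of freedom, the line patterns, the fill color, and the title text are arbitrary choices for illustration.

```stata
use http://www.stata-press.com/data/r7/trocolen.dta, clear

* Density estimate over a 20-bin histogram, with normal and t(5) references
kdens length, histogram(20) histopts(color(gs14))  ///
    normal normopts(lpattern(dash))                ///
    student(5) stopts(lpattern(dot))               ///
    title("Kernel density estimate of length")
```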

Examples

. use http://www.stata-press.com/data/r7/trocolen.dta

. kdens length

. kdens length, bw(sjpi)

. kdens length, adaptive

. kdens length, ci usmooth

. kdens length, ci vce(jackknife)

. kdens length, ci vce(bootstrap, reps(200))

. _kdens length, kernel(parzen) gen(parzen x) replace
. _kdens length, kernel(cosine) gen(cosine) at(x)
. line parzen cosine x

. gen length2 = abs(length-417)
. kdens length2, ll(0) ci

. kdens length, histogram ciopts(recast(rline) pstyle(p2) lp(dash))

. generate byte g = uniform()<.5
. twoway kdens length if g==1 || kdens length if g==0

Methods and Formulas

See http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.

References

Fiorio, C. V. 2004. Confidence intervals for kernel density estimation. The Stata Journal 4: 168-179.

Hall, P. and M. P. Wand. 1996. On the Accuracy of Binned Kernel Density Estimators. Journal of Multivariate Analysis 56: 165-184.

Author

Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch

Thanks for citing this software as follows:

Jann, B. (2005). kdens: Stata module for univariate kernel density estimation. Available from http://ideas.repec.org/c/boc/bocode/s456410.html.

Also see

Online: mata kdens(), kdensity, graph, histogram, lowess