{smcl} {* 19feb2008}{...} {cmd:help kdens} {hline} {title:Title} {pstd}{hi:kdens} {hline 2} Univariate kernel density estimation {title:Syntax} {p 8 17 2} {cmd:kdens} {varname} {ifin} {weight} [{cmd:,} {help kdens##1:{it:kdens_options}} {help kdens##2:{it:graph_options}} ] {p 8 17 2} {cmd:_kdens} {varname} {ifin} {weight} {cmd:,} {opt g:enerate(d [x])} [ {help kdens##1:{it:kdens_options}} ] {p 8 17 2} {cmdab:tw:oway} {cmd:kdens} {varname} {ifin} {weight} [{cmd:,} {help kdens##3:{it:twoway_kdens_options}} ] {synoptset 25 tabbed}{...} {marker 1}{synopthdr:kdens_options} {synoptline} {syntab :Main} {synopt :{opt k:ernel(kernel)}}type of kernel function, where {it:kernel} is {opt e:panechnikov}, {opt epan2} (the default), {opt b:iweight}, {opt triw:eight}, {opt c:osine}, {opt g:aussian}, {opt p:arzen}, {opt r:ectangle} or {opt t:riangle}. {p_end} {synopt :{opt exact}}use the exact estimator {p_end} {synopt :{opt n(#)}}estimate density using {it:#} points; default is {cmd:n(512)} {p_end} {synopt :{opt n2(#)}}interpolate density estimate to {it:#} points {p_end} {p2coldent :* {opt g:enerate(d [x])}}store the density estimate in {it:{help newvar}} {it:d} and the estimation points in {it:{help newvar}} {it:x} {p_end} {synopt :{opt at(var_x)}}estimate density at the values in {it:var_x} {p_end} {synopt :{opt ra:nge(# #)}}range of estimation points, minimum and maximum {p_end} {synopt :{opt r:eplace}}overwrite existing variables {p_end} {syntab :Bandwidth} {synopt :{opt bw(#|type)}}set bandwidth to {it:#}, {it:#} > 0, or specify automatic bandwidth selector where {it:type} is {cmdab:s:ilverman} (the default), {cmdab:n:ormalscale}, {cmdab:o:versmoothed}, {opt sj:pi}, or {cmdab:d:pi}[{cmd:(}{it:#}{cmd:)}] {p_end} {synopt :{opt adj:ust(#)}}scale bandwidth by {it:#}, {it:#} > 0 {p_end} {synopt :{cmdab:a:daptive}[{cmd:(}{it:#}{cmd:)}]}use the adaptive kernel density estimator {p_end} {syntab :Boundary correction} {synopt :{opt ll(#)}}value of lower boundary{p_end} {synopt :{opt 
ul(#)}}value of upper boundary{p_end} {synopt :{opt refl:ection} | {opt lc}}use the reflection method or the linear combination method for boundary correction; only one of {opt reflection} and {opt lc} is allowed; the default method is renormalization{p_end} {syntab :Confidence intervals} {synopt :{cmd:ci}[{cmd:(}{it:stub}|{it:lo up}{cmd:)}]}draw (or store) pointwise confidence intervals {p_end} {synopt :{cmd:vce(}{it:{help kdens##vce:vcetype}}{cmd:)}}{it:vcetype} may be {opt boot:strap} or {opt jack:knife} plus options; see {helpb kdens##vce:vce()} below for details{p_end} {synopt :{cmdab:us:mooth}[{cmd:(}{it:#}{cmd:)}]}apply undersmoothing for confidence interval estimation {p_end} {synopt :{opt var:iance(V)}}store variance estimate in {it:{help newvar}} {it:V}{p_end} {synopt :{opt l:evel(#)}}set confidence level; default is {cmd:level(95)}{p_end} {synoptline} {p 4 6 2}* {opt generate()} is required for {cmd:_kdens}{p_end} {synoptset 25 tabbed}{...} {marker 2}{synopthdr:graph_options} {synoptline} {syntab :Main} {synopt :{opt nogr:aph}}suppress graph{p_end} {syntab :Kernel plot} {synopt :{it:{help cline_options}}}affect rendition of the plotted kernel density estimate{p_end} {synopt :{opth ciopts(area_options)}}affect rendition of the plotted confidence interval{p_end} {syntab :Density plots} {synopt :{cmdab:hist:ogram}[{cmd:(}{it:#}{cmd:)}]}add a histogram to the graph; {it:#} specifies the number of bars{p_end} {synopt :{opth histopts(twoway_hist)}}affect rendition of the histogram{p_end} {synopt :{opt nor:mal}}add normal density to the graph{p_end} {synopt :{opth normopts(cline_options)}}affect rendition of normal density{p_end} {synopt :{opt stu:dent(#)}}add Student's t density with {it:#} degrees of freedom to the graph{p_end} {synopt :{opth stopts(cline_options)}}affect rendition of the Student's t density{p_end} {syntab :Add plot} {synopt :{opth "addplot(addplot_option:plot)"}}add other plots to the generated graph{p_end} {syntab :Y-Axis, X-Axis, Title, 
Caption, Legend, Overall} {synopt :{it:{help twoway_options}}}any options other than {opt by()} documented in {bind:{bf:[G] {it:twoway_options}}}{p_end} {synoptline} {synoptset 25 tabbed}{...} {marker 3}{synopthdr:twoway_kdens_options} {synoptline} {synopt :{opt k:ernel(kernel)}}type of kernel function, as specified above {p_end} {synopt :{opt exact}}use the exact estimator {p_end} {synopt :{opt n(#)}}estimate density using {it:#} points; default is {cmd:n(512)} {p_end} {synopt :{opt n2(#)}}interpolate density estimate to {it:#} points {p_end} {synopt :{opt at(var_x)}}estimate density at the values in {it:var_x} {p_end} {synopt :{opt ra:nge(# #)}}range of estimation points, minimum and maximum {p_end} {synopt :{opt bw(#|type)}}set bandwidth to {it:#} or specify automatic bandwidth selector where {it:type} is {cmdab:s:ilverman} (the default), {cmdab:n:ormalscale}, {cmdab:o:versmoothed}, {opt sj:pi}, or {cmdab:d:pi}[{cmd:(}{it:#}{cmd:)}] {p_end} {synopt :{opt adj:ust(#)}}scale bandwidth by {it:#}, {it:#} > 0 {p_end} {synopt :{cmdab:a:daptive}[{cmd:(}{it:#}{cmd:)}]}use the adaptive kernel density estimator {p_end} {synopt :{opt ll(#)}}value of lower boundary{p_end} {synopt :{opt ul(#)}}value of upper boundary{p_end} {synopt :{opt refl:ection} | {opt lc}}use the reflection method or the linear combination method for boundary correction; the default method is renormalization{p_end} {synopt :{opt hor:izontal}}graph horizontally {p_end} {synopt :{it:{help cline_options}}}change the look of the line{p_end} {synopt :{it:{help axis_choice_options}}}associate plot with alternative axis{p_end} {synopt :{it:{help twoway_options}}}any options documented in {bind:{bf:[G] {it:twoway_options}}}{p_end} {synoptline} {pstd} {cmd:fweight}s, {cmd:aweight}s, and {cmd:pweight}s are allowed; see {help weight}. {title:Description} {pstd} {cmd:kdens} produces univariate kernel density estimates and graphs the result. 
{cmd:kdens} supplements official Stata's {helpb kdensity} and also incorporates and extends some of the capabilities of various previous user add-ons such as {cmd:adgakern} (STB-16 {net "stb 16 snp6":snp6}), {cmd:bandw} (STB-27 {net "stb 27 snp6_2":snp6_2}), and {cmd:varwiker} (SJ 3-2 {net "sj 3-2 st0036":st0036}) by Salgado-Ugarte et al., {cmd:akdensity} by Van Kerm (SJ 3-2 {net "sj 3-2 st0037":st0037}), and {cmd:asciker}/{cmd:bsciker} by Fiorio (SJ 4-2 {net "sj 4-2 st0064":st0064}). {pstd}Main features are: {phang2}{space 1}o{space 2}{cmd:kdens} is fast. It employs an approximation algorithm based on linearly binned data over a regular grid of estimation points. The algorithm produces very accurate results as long as the grid size is not too small (see the {opt n()} option). Alternatively, specify the {cmd:exact} option to use the slow exact estimator. {phang2}{space 1}o{space 2}Several automatic bandwidth selectors including the Sheather-Jones plug-in estimate are available. See the {cmd:bw()} option. In addition, adaptive (variable bandwidth) kernel density estimation is supported (see the {cmd:adaptive} option). {phang2}{space 1}o{space 2}Optionally, {cmd:kdens} computes pointwise confidence intervals (see the {cmd:ci} and {cmd:usmooth} options), either using asymptotic formulas or replication techniques (see the {cmd:vce()} option). {phang2}{space 1}o{space 2}Boundary correction for variables with bounded domain is supported. See the {cmd:ll()} and {cmd:ul()} options. {pstd}{cmd:_kdens} is the engine used by {cmd:kdens}. The heavy lifting is done in Mata. See {helpb mf_kdens:mata kdens()}. {title:Dependencies} {pstd} {cmd:kdens} requires the {cmd:moremata} package. Type {com}. {net "describe moremata, from(http://fmwww.bc.edu/repec/bocode/m/)":ssc describe moremata}{txt} {title:Options (density estimation)} {dlgtab:Main} {phang} {opt kernel(kernel)} specifies the kernel function. 
{it:kernel} may be {opt epanechnikov} (Epanechnikov kernel function), {opt epan2} (alternative Epanechnikov kernel function; the default), {opt biweight} (biweight kernel function), {opt triweight} (triweight kernel function), {opt cosine} (cosine trace), {opt gaussian} (Gaussian kernel function), {opt parzen} (Parzen kernel function), {opt rectangle} (rectangle kernel function) or {opt triangle} (triangle kernel function). The different kernel functions usually produce very similar results. By default, {opt epan2}, specifying the Epanechnikov kernel, is used. {phang}{cmd:exact} causes the exact kernel density estimator to be used instead of the binned approximation estimator. The exact estimator can be slow in large datasets. {phang} {opt n(#)}, where {it:#} > 2, specifies the "evaluation grid size", i.e. the number of (equally spaced) points at which the density estimate is evaluated. The default grid size is 512. This should be enough for the binned approximation estimator to be accurate in most situations (see Hall and Wand 1996). Note that {opt n()} also sets the number of estimation points for the {cmd:sjpi} and {cmd:dpi} bandwidth selectors (see the {cmd:bw()} option below). {phang} {opt n2(#)}, where {it:#} must be equal to the value of {cmd:n()} or smaller, specifies the "output grid size". If {opt n2()} is equal to {opt n()} (the default), then the "evaluation" grid and the "output" grid coincide and the density estimate is returned as is. However, if {opt n2()} is smaller than {opt n()}, the density estimate will be linearly interpolated from the "evaluation" grid to the "output" grid. Note that {opt n2()} will be reset to {helpb _N}, the number of observations in the dataset, if {helpb _N} is smaller than {opt n2()}. {opt n2()} has no effect if {opt at()} is specified. {phang} {opt generate(d [x])} stores the results of the estimation. {it:{help newvar}} {it:d} will contain the density estimate. 
{it:{help newvar}} {it:x} will contain the points at which the density is evaluated. The results are written to the first {opt n()} observations in the dataset in ascending order of evaluation points. Alternatively, if {opt at(var_x)} is specified, the density estimate is written to the observations identified by {it:var_x}. {it:x} must be omitted in this case. {phang} {opt at(var_x)} specifies a variable that contains the values at which the density is to be estimated. This option makes it easier to obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison. With the binned approximation estimator, the density is first estimated using an equally-spaced grid of evaluation points (see the {opt n()} option) and is then linearly interpolated to the values of {it:var_x}. With the exact estimator, the density is directly estimated at the values of {it:var_x} (unless the {cmd:adaptive} option is specified). {phang}{opt range(# #)} specifies the range of values (minimum and maximum) at which the density is to be estimated. The default range of the evaluation grid is defined as [min(x)-h*tau, max(x)+h*tau], where h is the bandwidth and tau is the halfwidth of the kernel support (in the case of the Gaussian kernel, tau is set to 3). This allows the density estimate to become (approximately) zero on both sides of the observed data. Specifying {opt ll(#)}, {opt ul(#)}, or {opt at(var_x)} may also change the evaluation range. {pmore}As with the {cmd:at()} option, {cmd:range()} only affects the "output grid". Internally, the density will be estimated over the full data range. An exception is again the exact estimator (unless the {cmd:adaptive} option is specified). {phang} {opt replace} permits {cmd:kdens} to overwrite existing variables. 
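
{pmore}
As a minimal sketch of how {opt generate()}, {opt at()}, {opt exact}, and {opt replace} can work together, the commands below compare the binned approximation with the exact estimator at the same evaluation points. The dataset is the one used in the Examples section; the variable names {cmd:d_binned}, {cmd:d_exact}, {cmd:x}, and {cmd:absdiff} are illustrative choices, not names created by {cmd:kdens} itself.

        {com}. {stata "use http://www.stata-press.com/data/r7/trocolen.dta"}
        . {stata "_kdens length, generate(d_binned x) replace"}
        . {stata "_kdens length, exact generate(d_exact) at(x) replace"}
        . {stata "generate double absdiff = abs(d_binned - d_exact)"}
        . {stata "summarize absdiff"}{txt}

{pmore}
The first call stores the binned estimate and its grid in {cmd:d_binned} and {cmd:x}; the second call evaluates the exact estimator at the values in {cmd:x}, so the two estimates can be compared observation by observation. How small the differences turn out to be will depend on the data and on the grid size set by {opt n()}.
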
{dlgtab:Bandwidth} {phang} {opt bw(#|type)} may be used to determine the bandwidth of the kernel, the halfwidth of the density window around each evaluation point. {opt bw(#)}, where # > 0, sets the bandwidth to #. Alternatively, specify {opt bw(type)} to choose the automatic bandwidth selector determining the "optimal" bandwidth. Choices are {opt silverman} (optimal of Silverman), {opt normalscale} (normal scale rule), {opt oversmoothed} (oversmoothed rule), {opt sjpi} (Sheather-Jones plug-in estimate) and {cmd:dpi}[{cmd:(}{it:#}{cmd:)}] (a variant of the Sheather-Jones plug-in estimate called the direct plug-in bandwidth estimate). The {it:#} in {opt dpi()} specifies the desired number of stages of functional estimation and should be a nonnegative integer (the default is 2; {cmd:dpi(0)} is equivalent to {opt normalscale}). {cmd:bw(silverman)} is the default. {pmore}Note that automatic bandwidth estimates are rescaled depending on the canonical bandwidth of the kernel function. A consequence of this is that density estimates from the different kernel functions are directly comparable. For example, identical results are computed for {cmd:epanechnikov} and {cmd:epan2} (apart from round-off error), because the two kernel functions are just scaled versions of one another. No bandwidth rescaling is applied if a specific bandwidth value, i.e. {opt bw(#)}, is specified. {pmore}Furthermore, note that {cmd:kdens} imposes a minimum bandwidth. Let d denote the distance between two consecutive points on the evaluation grid. The minimum bandwidth then is h_min = d/2 * cb_k / cb_r, where cb_k is the canonical bandwidth of the applied kernel and cb_r is the canonical bandwidth of the rectangular kernel. If the bandwidth is smaller than h_min, it is reset to h_min. {phang} {opt adjust(#)}, where {it:#} > 0, causes the bandwidth to be multiplied by #. Default is {cmd:adjust(1)}. 
{phang} {opt adaptive}[{cmd:(}{it:#}{cmd:)}] specifies that the adaptive kernel density estimator be applied. The adaptive estimator has less bias than the ordinary estimator. {it:#} is the desired number of iterations used to determine the local bandwidth factors. The default is 1 (additional iterations usually do not significantly change the density estimate). {dlgtab:Boundary correction} {phang} {opt ll(#)} and {opt ul(#)} specify the lower and upper boundary of the domain of the variable. Note that {opt ll(#)} must be lower than or equal to the minimum observed value and {opt ul(#)} must be larger than or equal to the maximum observed value. The default method used by {cmd:kdens} for density estimation near the boundaries is the renormalization method. {phang} {opt reflection} causes the reflection technique to be used for boundary correction instead of the renormalization method. {phang} {opt lc} causes the linear combination technique to be used for boundary correction instead of the renormalization method. {pmore} Only one of {opt reflection} and {opt lc} is allowed. The renormalization method and the reflection method have comparable properties with respect to bias and variance. However, note that the reflection method implies the slope of the density to be zero at the boundary. The linear combination technique is better than the other methods in terms of bias, but has larger variance (and the density estimate may get negative in some situations). {dlgtab:Confidence intervals} {phang}{cmd:ci}[{cmd:(}{it:stub}|{it:lo up}{cmd:)}] plots pointwise confidence intervals. If {opt ci(stub)} is specified, the results are stored in {it:{help newvar}} {it:stub}{cmd:_lo} and {it:{help newvar}} {it:stub}{cmd:_up}. Alternatively, specify {opt ci(lo up)} to save the results in {it:{help newvar}} {it:lo} and {it:{help newvar}} {it:up}. 
If {opt ci} is specified without arguments, but {opt generate(d [x])} is specified, the confidence intervals are stored in {it:{help newvar}} {it:d}{cmd:_lo} and {it:{help newvar}} {it:d}{cmd:_up}. {marker vce}{phang}{cmd:vce(}{it:vcetype} [{cmd:,} {it:vceopts}]{cmd:)} indicates that the confidence intervals be estimated using replication techniques. If {cmd:vce()} is omitted, analytic formulas are used to compute the confidence intervals. {it:vcetype} may be {cmd:bootstrap} or {cmd:jackknife}. {cmd:fweight}s and {cmd:aweight}s are not allowed if {cmd:vce()} is specified. {pmore}Common {it:vceopts}: {phang2} {opth str:ata(varname)} specifies a variable that identifies strata. If this option is specified, bootstrap samples are taken independently within each stratum / stratified jackknife estimates are produced. {phang2} {opth cl:uster(varname)} specifies a variable that identifies sample clusters. If this option is specified, the sample drawn during each bootstrap replication is a sample of clusters / clusters are left out for jackknife estimation. {phang2} {opt nod:ots} suppresses display of the replication dots. By default, a single dot character is displayed for each successful replication. A single red 'x' is displayed, if a replication is not successful. {phang2} {opt mse} indicates that the variances be computed using deviations of the replicates from the density estimate based on the entire dataset. By default, variances are computed using deviations from the average of the replicates. {pmore}Additional {it:vceopts} for {cmd:vce(jackknife)}: {phang2} {opth sub:pop(varname)} specifies that estimates be computed for the single subpopulation for which {varname}!=0. {phang2} {opth fpc(varname)} requests a finite population correction for the variance estimates. The values in {it:varname} are interpreted as stratum sampling rates. The values must be in [0,1] and are assumed to be constant within each stratum. 
{pmore}Additional {it:vceopts} for {cmd:vce(bootstrap)}: {phang2} {opt r:eps(#)} specifies the number of bootstrap replications to be performed. The default is 50. More replications are usually required to get reliable results. {phang2} {opt n:ormal} computes normal approximation confidence intervals. {phang2} {opt p:ercentile} computes percentile confidence intervals. {phang2} {opt bc} computes bias-corrected confidence intervals. {phang2} {opt bca} computes bias-corrected and accelerated confidence intervals. {phang2} {opt t} computes percentile-t confidence intervals. The default analytic formulas are used for standard error estimation within the bootstrap replicates. {pmore}Only one of {cmd:normal}, {cmd:percentile}, {cmd:bc}, {cmd:bca}, and {cmd:t} is allowed. See {bf:[R] bootstrap} for methodological details. For the percentile-t method see help for {helpb mf_mm_bs##r3:mm_bs()}. {phang} {cmd:usmooth}[{cmd:(}{it:#}{cmd:)}] specifies that confidence intervals be based on an undersmoothed density estimate in order to reduce the bias. {it:#} specifies the degree of undersmoothing and should be between .2 and 1. The default value is 1/4 = .25. Higher values result in stronger undersmoothing. A value of 1/5 = .2 results in no undersmoothing. (See Fiorio 2004.) {phang} {opt variance(V)} specifies that the pointwise variance be stored in {it:{help newvar}} {it:V}. {phang} {opt level(#)} specifies the confidence level, as a percentage, for confidence intervals. The default is {cmd:level(95)} or as set by {helpb level:set level}. {title:Options (graph)} {dlgtab:Main} {phang} {opt nograph} suppresses the graph. Instead of specifying {opt nograph} you might as well use {cmd:_kdens} directly. {dlgtab:Kernel plot} {phang} {it:cline_options} affect the rendition of the plotted kernel density estimate. See {it:{help connect_options}}. {phang} {opth ciopts(area_options)} specifies details about the rendition of the plotted confidence interval. See {it:{help area_options}}. 
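
{pmore}
If more control over the plot is needed than {opt ciopts()} provides, one possible approach is to store the estimate and its confidence bounds with {cmd:_kdens} and draw them by hand. The following is only a sketch under that assumption; the variable names {cmd:d}, {cmd:x}, {cmd:lo}, and {cmd:up} and the choice of {cmd:level(90)} are illustrative.

        {com}. {stata "use http://www.stata-press.com/data/r7/trocolen.dta"}
        . {stata "_kdens length, generate(d x) ci(lo up) usmooth level(90) replace"}
        . {stata "twoway rarea lo up x || line d x"}{txt}

{pmore}
Here {cmd:ci(lo up)} stores the pointwise bounds instead of plotting them, and {helpb twoway rarea} and {helpb line} then rebuild the graph from the stored variables, so any {it:{help twoway_options}} can be applied to it.
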
{dlgtab:Density plots} {phang} {cmd:histogram}[{cmd:(}{it:#}{cmd:)}] requests that a histogram of the data be added to the graph. The histogram will be placed in the background, behind the density estimate. {it:#} specifies the number of bins to be used. {phang} {opt histopts(options)} specifies details about the rendition of the histogram, such as the look of the bars. See {helpb twoway histogram}. {phang} {opt normal} requests that a normal density be overlaid on the density estimate for comparison. {phang} {opt normopts(cline_options)} specifies details about the rendition of the normal curve, such as the color and style of line used. See {it:{help connect_options}}. {phang} {opt student(#)} specifies that a Student's t density with {it:#} degrees of freedom be overlaid on the density estimate for comparison. {phang} {opt stopts(cline_options)} affects the rendition of the Student's t density. See {it:{help connect_options}}. {dlgtab:Add plot} {phang} {opt addplot(plot)} provides a way to add other plots to the generated graph. See {it:{help addplot_option}}. {dlgtab:Y-Axis, X-Axis, Title, Caption, Legend, Overall} {phang} {it:twoway_options} are any of the options documented in {it:{help twoway_options}}, excluding {opt by()}. These include options for titling the graph (see {it:{help title_options}}) and options for saving the graph to disk (see {it:{help saving_option}}). {title:Examples} {com}. {stata "use http://www.stata-press.com/data/r7/trocolen.dta"} . {stata "kdens length"} . {stata "kdens length, bw(sjpi)"} . {stata "kdens length, adaptive"} . {stata "kdens length, ci usmooth"} . {stata "kdens length, ci vce(jackknife)"} . {stata "kdens length, ci vce(bootstrap, reps(200))"} . {stata "_kdens length, kernel(parzen) gen(parzen x) replace"} . {stata "_kdens length, kernel(cosine) gen(cosine) at(x)"} . {stata "line parzen cosine x"} . {stata "gen length2 = abs(length-417)"} . {stata "kdens length2, ll(0) ci"} . 
{stata "kdens length, histogram ciopts(recast(rline) pstyle(p2) lp(dash))"} . {stata "generate byte g = uniform()<.5"} . {stata "twoway kdens length if g==1 || kdens length if g==0"}{txt} {title:Methods and Formulas} {pstd} See {browse "http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf"}. {title:References} {phang} Fiorio, C. V. 2004. Confidence intervals for kernel density estimation. The Stata Journal 4: 168-179. {phang} Hall, P. and M. P. Wand. 1996. On the Accuracy of Binned Kernel Density Estimators. Journal of Multivariate Analysis 56: 165-184. {title:Author} {pstd} Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch {pstd}Thanks for citing this software as follows: {pmore} Jann, B. (2005). kdens: Stata module for univariate kernel density estimation. Available from http://ideas.repec.org/c/boc/bocode/s456410.html. {title:Also see} {psee} Online: {helpb mf_kdens:mata kdens()}, {helpb kdensity}, {helpb graph}, {helpb histogram}, {helpb lowess} {p_end}