help mata kdens()
-------------------------------------------------------------------------------

Title

    kdens() -- Univariate kernel density estimation


Contents

    Information on functions (syntax, description, remarks, conformability,
    diagnostics) can be found below under the following headings:

        Wrappers

        Elementary functions


Dependencies

    moremata is required. Type

        . ssc describe moremata


Methods and Formulas

    See http://fmwww.bc.edu/RePEc/bocode/k/kdens.pdf.


Source code

    kdens.mata


Author

    Ben Jann, ETH Zurich, jann@soz.gess.ethz.ch


Aknowledgements

    Some of the code is loosely based on R code from the "KernSmooth" package
    (R port by Brian Ripley, S original by Matt Wand) and the "sm" package (R
    port by Brian Ripley, S original by Adrian W. Bowman and Adelchi
    Azzalini).


Also see

    Online:  help for kdens, moremata


-------------------------------------------------------------------------------
Wrappers
-------------------------------------------------------------------------------

Syntax

         d = kdens(x, w, g [, bw, k, a, lb, ub, btype, lbwf])

         d = _kdens(x, w, g [, bw, k, a, ll, ul, btype, lbwf])

         v = kdens_var(d, x, w, g [, bw, k, pw, lb, ub, btype, lbwf])

         v = _kdens_var(d, x, w, g [, bw, k, pw, ll, ul, btype, lbwf])

        bw = kdens_bw(x [, w, method, k, m, ll, ul, ldpi])

         g = kdens_grid(x [, w, bw, k, m, min, max])


    where

            d:  real colvector containing density estimate

            v:  real colvector containing variance estimate

           bw:  real scalar containing bandwidth; kdens() and kdens_grid()
                  may replace bw if it is too small

            g:  real colvector containing equally-spaced grid points at which
                  to estimate the density

            x:  real colvector containing data points

            w:  real colvector containing weights

       method:  string scalar containing "silverman" (default),
                  "normalscale", "oversmoothed", "sjpi", or "dpi"

            k:  string scalar containing "epanechnikov", "epan2" (default),
                  "biweight", "triweight", "cosine", "gaussian", "parzen",
                  "rectangle" or "triangle"

            a:  real scalar specifying number of iterations for the adaptive
                  kernel density estimator (default 0)

           pw:  real scalar indicating that the weights are pweights
                  (normalized to the number of observations)

            m:  real scalar containing size of evaluation grid (number of
                  evaluation points) (default is 512)

           lb:  real scalar indicating that the support of x is lower bounded
                  at g[1]

           ub:  real scalar indicating that the support of x is upper bounded
                  at g[rows(g)]

        btype:  real scalar specifying the method to be used for boundary
                  correction; btype==0: renormalization, btype==1:
                  reflection, btype==2: linear combination

           ll:  real scalar containing lower limit of support of x

           ul:  real scalar containing upper limit of support of x

         ldpi:  real scalar specifying the number of stages of functional
                  estimation for the dpi method (default is 2)

          min:  real scalar specifying minimum value of evaluation grid; will
                  be ignored if min is missing or larger than min(x)

          max:  real scalar specifying maximum value of evaluation grid; will
                  be ignored if max is missing or smaller than max(x)

         lwbf:  will be replaced by the local bandwidth factors in
                  [_]kdens(); real colvector containing local bandwidth
                  factors in [_]kdens_var()


Description

    kdens() returns the binned approximation kernel density estimate of x
    using evaluation grid g. g must be a regular grid of equidistant points
    covering the whole range of x. Use kdens_grid() to produce g. The default
    kernel used by kdens() is the gaussian kernel function. Specify k as
    indicated above to use another kernel function. bw is the bandwidth. If
    bw is omitted, the optimal of Silverman is used. Note that kdens()
    imposes a minimum bandwidth (see help kdens for details). If bw is
    smaller than the minimum bandwidth, it is reset to this minimum. Specify
    a as an integer larger than zero to obtain the adaptive bandwidth kernel
    density, where a indicates the number of iterations applied to determine
    the local bandwidth factors. kdens() supports density estimation for
    bounded variables. Specify lb!=0 and ub!=0 to indicate that the support
    of x is bounded. The default method for estimation at the boundaries is
    the normalization method. Alternatively, btype==1 causes the reflection
    method to be used and btype==2 causes the linear combination method to be
    used. If specified, lwbf will be replaced by the local bandwidth factors
    (or set to 1 if a=0).

    _kdens() returns the exact kernel density estimate. ll and ul specify the
    lower and upper limits of the data support.

    kdens_var() returns asymptotic point-wise variance estimates for the
    binned approximation density estimate. Use kdens_var() after kdens(), but
    not after _kdens(). pw!=1 specifies that the weights w are (normalized)
    sampling weights. If d has been derived by the adaptive kernel density
    method (a>=1 in kdens()), the local bandwidth factors, lwbf, should be
    provided to kdens_var().

    kdens_var() returns asymptotic point-wise variance estimates for the
    exact density estimate. Use _kdens_var() after _kdens(), but not after
    kdens().

    kdens_bw() returns an estimate of the "optimal" bandwidth given the data
    and kernel function. Available methods are "silverman" (optimal of
    Silverman; the default), "normalscale" (the normal scale rule),
    "oversmoothed" (the oversmoothed rule), "sjpi" (the Sheather-Jones
    plug-in estimate) and "dpi" (a variant of the Sheather-Jones plug-in
    estimate called the direct plug-in bandwidth estimate). If the method is
    "sjpi" or "dpi" you might want to set m, the number of evaluation points
    used to estimate the density functionals, and for bounded variables the
    lower and upper limits, ll and ul.  Furthermore, with the "dpi" method
    you may specify the the number of stages of functional estimation, ldpi
    (default is 2).

    kdens_grid() returns a grid of m equally-spaced points over the range of
    x. The default grid size is m=512. The default range of the grid is
    [min(x)-bw*tau, max(x)+bw*tau], where bw is the bandwidth and tau is the
    halfwidth of the support of kernel k (in the case of the gaussian kernel,
    tau is set to 3). Alternatively, if min<=min(x) is specified, the lower
    limit of the grid is set to min. If .>max>=max(x) is specified, the upper
    limit of the grid is set to max. Note that, similar to kdens(),
    kdens_grid() imposes a minimum bandwidth and resets bw if it is too small
    (see above).


Remarks

    Suppose, x is a data vector. To estimate the density of x using a
    gaussian kernel you could type, for example:

        : bw = kdens_bw(x, 1, "sjpi")
        : g  = kdens_grid(x, 1, bw)
        : d  = kdens(x, 1, g, bw)
        : v  = kdens_var(d, x, 1, g, bw)

    g and d could then be used, e.g., to plot the density function. v could
    be used to construct point-wise confidence intervals.

    If the adaptive estimator is used and the variance be estimated, the
    local bandwidth factors have to be passed to kdens_var(). Example:

        : bw = kdens_bw(x, 1, "sjpi")
        : g  = kdens_grid(x, 1, bw)
        : d  = kdens(x, 1, g, bw, "", 1, 0, 0, 0, l=.)
        : v  = kdens_var(d, x, 1, g, bw, "", 0, 0, 0, 0, l)


Conformability

    kdens(x, w, g, bw, k, a, lb, ub, btype, lbwf),
    _kdens(x, w, g, bw, k, a, ll, ul, btype, lbwf),
    kdens_var(d, x, w, g, bw, k, pw, lb, ub, btype, lbwf),
    _kdens_var(d, x, w, g, bw, k, pw, ll, ul, btype, lbwf),
    kdens_grid(x, w, bw, k, m, min, max):
        result:  m x 1

    kdens_bw(x, w, method, k, m, ll, ul, ldpi):
        result:  1 x 1

    where

             x:  n x 1
             w:  n x 1 or 1 x 1
             g:  m x 1
             d:  m x 1
            bw:  1 x 1
        method:  1 x 1
             k:  1 x 1
             a:  1 x 1
            pw:  1 x 1
             m:  1 x 1
            lb:  1 x 1
            ub:  1 x 1
         btype:  1 x 1
            ll:  1 x 1
            ul:  1 x 1
          ldpi:  1 x 1
           min:  1 x 1
           max:  1 x 1
          lwbf:  n x 1


Diagnostics

    kdens() and kdens_var() return invalid results if the grid g is not
    equally-spaced.

    kdens_bw() aborts with error if ll>min(x) or ul<max(x) and method is
    "sjpi" or "dpi".  _kdens() and _kdens_var() abort with error if ll>min(x)
    or ul<max(x).

    The functions return invalid results if the data contain missing values.

    Weights are assumed to be normalized to the number of observations (i.e.
    sum of weights = number of observations).


-------------------------------------------------------------------------------
Elementary functions
-------------------------------------------------------------------------------

Syntax


        real colvector kdens_gen(x, w, g, h [, k, ll, ul, btype])

        real colvector kdens_bin(g, gc, h [, k, lb, ub, btype])

        real colvector kdens_dd(g, gc, h, drv [, lb, ub])

        real colvector kdens_df(g, gc, h, drv [, lb, ub])

        real colvector kdens_avar(x, w, g, h, d [, k, pw, ll, ul])

        real colvector kdens_evar(x, w, g, h, d [, k, pw, ll, ul, btype])

        real scalar    kdens_bw_simple(x, w [, rule, scale])

        real scalar    kdens_bw_sjpi(x, w [, m, scale, ll, ul])

        real scalar    kdens_bw_dpi(x, w [, m, scale, ll, ul, ldpi])

        real scalar    kdens_lbwf(x, w, g, d)


    where

            x:  real colvector containing data points

            w:  real colvector containing weights

            g:  real colvector containing grid points at which to estimate
                  the density

           gc:  real colvector containing grid counts

            h:  real colvector containing (local) bandwidth

            d:  real colvector containing preliminary density estimate

            k:  string scalar containing "epanechnikov", "epan2" (default),
                  "biweight", "triweight", "cosine", "gaussian", "parzen",
                  "rectangle" or "triangle"

          drv:  real scalar specifying the order of derivative

           pw:  real scalar indicating that the weights are pweights

            m:  real scalar specifying the number of equally spaced grid
                  points (default: 401)

         rule:  string scalar containing "silverman" (default),
                  "normalscale", or "oversmoothed"

        scale:  string scalar containing "minim" (default), "stddev", "iqr"

           lb:  real scalar indicating that the support of x is lower bounded
                  at g[1]

           ub:  real scalar indicating that the support of x is upper bounded
                  at g[rows(g)]

        btype:  real scalar specifying the method to be used for boundary
                  correction; btype==0: renormalization, btype==1:
                  reflection, btype==2: linear combination

           ll:  real scalar containing lower boundary of support of x

           ul:  real scalar containing upper boundary of support of x

         ldpi:  real scalar specifying the number of stages of functional
                  estimation (default is 2)


Description

    kdens_gen() returns the density estimate of x at the points g using
    bandwidth h. If h is a scalar, kdens_gen() returns a fixed bandwidth
    kernel density estimate. If h is vector, kdens_gen() returns an adaptive
    kernel density estimate. Furthermore, specify ll<. and ul<. if the
    support of x is bounded. The default method for estimation at the
    boundaries is the normalization method. Alternatively, btype==1 causes
    the reflection method to be used and btype==2 causes the linear
    combination method to be used.

    kdens_bin() returns a density estimate based on binned data. g contains a
    grid of equidistant evaluation points and gc contains the grid counts.
    Use mm_makegrid() and mm_fastlinbin() from the moremata package to
    produce g and gc. If possible, kdens_bin() computes the estimate as the
    convolution of fast Fourier transforms.

    kdens_dd() returns the drvth density derivative estimate based on binned
    data using the gaussian kernel function.  kdens_dd() is used by
    kdens_df(). drv should be a nonnegative integer. kdens_dd() supports
    bounded variables using the reflection method.

    kdens_df() returns a density functional estimate based on binned data
    using the gaussian kernel. kdens_df() is used by kdens_bw_sjpi() and
    kdens_bw_dpi().  drv specifies the derivative in the functional and
    should be a nonnegative integer. kdens_df() supports bounded variables
    using the reflection method.

    kdens_avar() returns approximate point-wise variance estimates for the
    density estimate in d.

    kdens_evar() returns exact point-wise variance estimates for the density
    estimate in d.

    kdens_bw_simple() returns a quick and simple bandwidth estimate
    (standardized, see below). The available estimators are the optimal of
    Silverman (default), the normal scale rule, and the oversmoothed rule.
    The default estimator conforms to the automatic bandwidth selection in
    official Stata's kdensity. Use scale to determine the scale estimate that
    is used in bandwidth estimation. The default is to use the minimum of the
    standard deviation and the inter-quartile range/1.349.

    kdens_bw_sjpi() returns the Sheather-Jones plug-in bandwidth estimate
    (standardized, see below). kdens_bw_sjpi() supports bounded variables
    using the reflection method. If the Sheather-Jones plug-in estimate is
    larger than the oversmoothed bandwidth estimate (see above), the latter
    is returned (this may rarely happen with bounded variables). Missing is
    returned if the algorithm does not converge (for example, because the
    estimate is getting to small given the size of the evaluation grid).

    kdens_bw_dpi() returns a variant of the Sheather-Jones plug-in estimate
    called the direct plug-in bandwidth estimate (standardized, see below).
    ldpi in {0,1,...} specifies the number of stages of functional
    estimation. level=2 is the default. kdens_bw_dpi() supports bounded
    variables using the reflection method.

    Note that the bandwidth estimates returned by kdens_bw_simple(),
    kdens_bw_sjpi(), or kdens_bw_dpi() are standardized estimates. They
    should be multiplied by the kernel's canonical bandwidth before being
    used for density estimation. For example,
    kdens_bw_sjpi(...)*mm_kdel0_epan2() returns the Sheather-Jones plug-in
    bandwidth scaled for use with the epan2 kernel.

    kdens_lbwf() returns the local bandwidth factors to be used for adaptive
    kernel density estimation based on a preliminary density estimate.


Remarks

    Suppose, x is the data vector (or the variable in the dataset) for which
    the density be estimated. The commands

        : h = kdens_bw_simple(x, 1) * mm_kdel0_gaussian()
        : g = mm_makegrid(x, 50, h)
        : d = kdens_gen(x, 1, g, h, "epanechnikov")

    produce a density estimate equivalent to

        . kdensity x

    Adaptive kernel density estimation can be implemented as

        : h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
        : g = mm_makegrid(x, 50, h)
        : d = kdens_gen(x, 1, g, h)
        : l = kdens_lbwf(x, 1, g, d)
        : d = kdens_gen(x, 1, g, h*l)

    The binned approximation estimator is

        : h = kdens_bw_simple(x, 1) * mm_kdel0_epan2()
        : g = mm_makegrid(x, 512, h)
        : gc = mm_fastlinbin(x, 1, g)
        : d = kdens_bin(g, gc, h)


Conformability

    kdens_gen(x, w, g, h, k, ll, ul, btype),
    kdens_avar(x, w, g, h, d, k, pw, ll, ul),
    kdens_evar(x, w, g, h, d, k, pw, ll, ul, btype):
             h:  n x 1 or 1 x 1
        result:  m x 1

    kdens_bin(g, gc, h, k, lb, ub, btype):
             h:  m x 1 or 1 x 1
        result:  m x 1

    kdens_dd(g, gc, h, drv, lb, ub):
             h:  1 x 1
        result:  m x 1

    kdens_df(g, gc, h, drv, lb, ub):
             h:  1 x 1
        result:  1 x 1

    kdens_bw_simple(x, w, rule, scale),
    kdens_bw_sjpi(x, w, m, scale, ll, ul),
    kdens_bw_dpi(x, w, m, scale, ll, ul, level):
        result:  1 x 1

    kdens_lbwf(x, w, g, d):
        result:  n x 1

    where

             x:  n x 1
             w:  n x 1 or 1 x 1
             g:  m x 1
            gc:  m x 1
             d:  m x 1
             k:  1 x 1
           drv:  1 x 1
            pw:  1 x 1
             m:  1 x 1
          rule:  1 x 1
         scale:  1 x 1
            lb:  1 x 1
            ub:  1 x 1
         btype:  1 x 1
            ll:  1 x 1
            ul:  1 x 1
          ldpi:  1 x 1


Diagnostics

    The functions return invalid results if the data contain missing values.

    Weights are assumed to be normalized to the number of observations (i.e.