help spkde Version 1.0.0 -------------------------------------------------------------------------------

Title

spkde -- Kernel estimation of density and intensity functions for two-dimensional spatial point patterns

Syntax

spkde [varlist] [if] [in] using gridpoints [, options]

options Description ------------------------------------------------------------------------- Main * xcoord(xvar) variable containing the x-coordinate of data points * ycoord(yvar) variable containing the y-coordinate of data points kernel(kf) kernel function, where kf is one of the following: quartic (default), uniform, normal, negexp, triangular, epanechnikov truncated(t) truncation parameter for kernel functions normal and negexp, where t is a positive number * bandwidth(method) method for setting the value of the kernel bandwidth, where method is one of the following: fbw, ndp, mixed

Estimation parameters fbw(adq|b) fixed kernel bandwidth, where q is positive integer and b is a positive number ndp(d) minimum (weighted) number of data points to be used for kernel estimation at each grid point, where d is a positive number ndpw(ndpwvar) weight data points by variable ndpwvar when searching for the minimum number of data points to be used for kernel estimation edgecorrection apply approximate edge correction

Reporting dots display job progression dots noverbose suppress display of job progression

Saving results * saving(kernest [, replace]) save results to Stata dataset kernest ------------------------------------------------------------------------- * Required option

Description

spkde implements a variety of kernel estimators of both the probability density function and the intensity function of two-dimensional spatial point patterns.

A two-dimensional spatial point pattern S can be defined as a set of data points s_i (i = 1, ..., n) located in a two-dimensional study region R at coordinates (s_i1, s_i2). Each data point s_i represents the location in R of one or more “objects” of some kind: people, events, sites, buildings, plants, cases of a disease, etc.

In the analysis of spatial point patterns we are often interested in determining whether the distribution of the objects of interest within R exhibits some form of clustering, as opposed to being random. To explore the possibility of clustering, it may be useful to describe the spatial point pattern of interest by means of its probability density function p(s) and/or its intensity function l(s).

The probability density function p(s) defines the probability of observing an object at location s in R. In turn, the intensity function l(s) defines the expected number of objects per unit area at location s in R. Therefore, p(s) and l(s) differ only by a constant of proportionality (Bailey and Gatrell 1995; Waller and Gotway 2004).

Both the probability density function p(s) and the intensity function l(s) of a given two-dimensional spatial point pattern can be easily estimated by means of nonparametric estimators, e.g., kernel estimators. Kernel estimators are used to generate a spatially smooth estimate of p(s) and/or l(s) at a fine grid of points s_g (g = 1, ..., G) covering the study region R (Bailey and Gatrell 1995; Waller and Gotway 2004).

spkde computes kernel estimates of both p(s) and l(s) at a grid of points generated by the user-written Stata program spgrid. Expressly, for each grid point s_g, spkde computes first the kernel estimate of the intensity l(s_g), and then the kernel estimate of the density p(s_g).

The intensity l(s_g) at each grid point s_g is estimated as follows:

^ c n (1) l(s_g) = ----- · SUM k(d_ig / h_g) · y_i A_g i=1

where k(·) is the kernel function - usually a unimodal symmetrical bivariate probability density function; d_ig is the Euclidean distance between data point s_i and grid point s_g; h_g is the kernel bandwidth - i.e., the radius of the kernel function - at grid point s_g; y_i is the number of objects located at data point s_i; A_g is the area of the subregion of R over which the kernel function is evaluated at grid point s_g, possibly corrected for edge effects; and c is a constant of proportionality.

In turn, the density p(s_g) at each grid point s_g is estimated as follows:

^ ^ l(s_g) (2) p(s_g) = ------------ G ^ SUM l(s_j) j=1

The estimates of l(s_g) and p(s_g), along with several other auxiliary variables, are saved to Stata dataset kernest and can be visualized using the user-written Stata program spmap.

Kernel function

As shown by Equation 1 above, for each grid point s_g, kernel estimators compute a weighted sum of the objects making up the spatial point pattern of interest. The weight k(d_ig / h_g) applied to each object is a function of the ratio between the object's distance from s_g (d_ig) and the value of the kernel bandwidth h_g. In turn, the way weights depend on d_ig and h_g is determined by the form of the kernel function k(·) used in the analysis.

spkde allows to choose among six basic types of kernel function: uniform, normal, negative exponential, quartic (the default), triangular, and Epanechnikov. Moreover, through specification of option truncated(t), it is possible to use a truncated version of the normal and negative exponential kernel functions.

Let z = d_ig / h_g denote the argument of the kernel function, and t>0 denote a truncation parameter specified with option truncated(t). Then, the kernel functions made available by spkde can be described as follows:

----------------------------------------------------------------------- Name Formula ----------------------------------------------------------------------- Uniform k(z) = 1 if z < 1 k(z) = 0 otherwise

Normal k(z) = exp(-z²/2)

Truncated normal k(z) = exp(-z²/2) if d_ig < h_g·t k(z) = 0 otherwise

Negative exponential k(z) = exp(-3z)

Truncated negative exponential k(z) = exp(-3z) if d_ig < h_g·t k(z) = 0 otherwise

Quartic k(z) = (1-z²)² if z < 1 k(z) = 0 otherwise

Triangular k(z) = (1-z) if z < 1 k(z) = 0 otherwise

Epanechnikov k(z) = (1-z²) if z < 1 k(z) = 0 otherwise -----------------------------------------------------------------------

Kernel bandwidth

While the precise form of the kernel function has a moderate influence on the estimates of p(s) and l(s), the kernel bandwidth plays a major role in kernel estimation, since it determines the degree of smoothing with which p(s) and l(s) are estimated: large bandwidths may result in oversmoothing, small bandwidths may retain too much local deatail and exhibit spikes at isolated event locations (Bailey and Gatrell 1995; Waller and Gotway 2004).

spkde allows to choose among three methods for setting the value of the kernel bandwidth h_g at each grid point s_g: fixed bandwidth, minimum (weighted) number of data points, and mixed.

The fixed bandwidth method sets h_g = h at all grid points, where h is a positive number expressed in the same unit of measurement (miles, kilometers, meters, pixels, etc.) as the grid and data points coordinates.

The minimum (weighted) number of data points method sets h_g = r_g(ndp), where r_g(ndp) is the radius of the smallest circle centered on s_g that circumscribes at least ndp (weighted) data points. The quantity ndp can express either an unweighted or a weighted number of data points. In the latter case, the weight associated with each data point must be stored in variable ndpwvar and specified with option ndpw(ndpwvar). For discussion and illustration of this adaptive method and its applications see, e.g., Bailey and Gatrell (1995), Brundson (1995), Talbot et al. (2000).

The mixed method is a combination of the other two methods. Expressly, it sets h_g = h if the circle centered on s_g and having radius h circumscribes at least ndp (weighted) data points, and h_g = r_g(ndp) otherwise.

Edge correction

For bounded and truncated kernel functions, the nominal value of the area over which the kernel function is evaluated at each grid point s_g (see Equation 1 above) equals the area of the kernel window, i.e., of the circle centered on s_g and having radius h_g (for bounded functions) or h_g·t (for truncated functions, t being the truncation parameter). Formally: A_g = 3.1416 · h_g² for bounded functions, and A_g = 3.1416 · (h_g·t)² for truncated functions.

For grid points located near the edges of the study region, a greater or lesser portion of their kernel window may lie outside the study region, so that the densities/intensities computed with the standard formula at such grid points may result underestimated (Bailey and Gatrell 1995).

Although an accurate assessment of the impact of edge effects and the development of proper corrections remain an open area of research (Waller and Gotway 2004), some adjustments are possible. spkde implements an approximate edge correction that consists in rescaling the area of the kernel window at each grid point s_g by a factor ec_g that approximately equals the proportion of the kernel window lying within the study region. After this correction, A_g = 3.1416 · h_g² · ec_g for bounded functions, and A_g = 3.1416 · (h_g·t)² · ec_g for truncated functions.

In spkde, the approximate edge correction described above can be applied to the estimation of l(s) and p(s) only when a bounded or truncated kernel function is used. This includes the uniform, quartic, triangular, Epanechnicov, truncated normal, and truncated negative exponential functions.

Input datasets

spkde makes use of two input datasets: datapoints and gridpoints.

datapoints is the dataset that resides in memory when spkde is run. It consists of n observations, one for each data point making up the spatial point pattern of interest. At the minimum, it must include xvar, a numeric variable that contains the x-coordinate of data points, and yvar, a numeric variable that contains the y-coordinate of data points. When each data point “hosts” more than one object of J>0 different types, then spkde must be run specifying a varlist made of J numeric variables, each containing the number of objects of a given type located at each data point; in this case, dataset datapoints must also include the J variables specified in varlist. Finally, dataset datapoints can optionally include ndpwvar, a numeric variable sometimes used to set the value of the kernel bandwidth.

gridpoints is the using dataset invoked by spkde at run time. It is generated by the user-written Stata program spgrid and contains the spatial grid used by spkde to compute the kernel estimates of p(s_g) and l(s_g).

Output dataset

spkde routinely generates kernest, a Stata dataset that consists of G observations - one for each grid point s_g - and includes the following variables:

o spgrid_id is a numeric variable that uniquely identifies the cells making up the grid.

o spgrid_xcoord is a numeric variable that contains the x-coordinate of each grid point.

o spgrid_ycoord is a numeric variable that contains the y-coordinate of each grid point.

o bandwidth is a numeric variable that contains the values of h_g (see above).

o ndp is a numeric variable that contains the number of data points used for kernel estimation at each grid point.

o A is a numeric variable that contains the values of A_g (see above).

If no varlist is specified, three additional variables are included in dataset kernest:

o c is a numeric variable that contains the value of c (see above).

o lambda is a numeric variable that contains the kernel estimate of l(s_g) (see above).

o p is a numeric variable that contains the kernel estimate of p(s_g) (see above).

On the other hand, if a varlist is specified, for each variable varname in varlist three additional variables are included in dataset kernest:

o varname_c is a numeric variable that contains the value of c for objects of type varname.

o varname_lambda is a numeric variable that contains the kernel estimate of l(s_g) for objects of type varname.

o varname_p is a numeric variable that contains the kernel estimate of p(s_g) for objects of type varname.

When option ndpw(ndpwvar) is specified, an additional variable is included in dataset kernest:

o wndp is a numeric variable that contains the weighted number of data points used for kernel estimation at each grid point.

Finally, when option edgecorrection is specified, an additional variable is included in dataset kernest:

o edgecorrect is a numeric variable that contains the edge correction factor ec_g (see above).

Visualization of kernel estimates

The kernel estimates of l(s_g) and p(s_g) generated by spkde can be visualized using the user-written Stata program spmap.

To this purpose, two datasets are needed:

o kernest is the output dataset routinely generated by spkde (see section Output dataset above).

o gridcells is one of the two output datasets generated by the user-written Stata program spgrid.

To visualize the kernel estimates of interest, spmap must be run with kernest as the master dataset, and gridcells as the basemap dataset.

Alternative applications

Although spkde has been designed to carry out kernel estimation of density and intensity functions for two-dimensional spatial point patterns, it can be used also for estimating the joint probability density function p(x,y) of any pair of quantitative variables X and Y (see section Examples 2 - Alternative applications below).

Options

+------+ ----+ Main +-------------------------------------------------------------

xcoord(xvar) specifies the name of the variable containing the x-coordinate of each data point s_i.

ycoord(yvar) specifies the name of the variable containing the y-coordinate of each data point s_i.

kernel(kf) specifies the basic type of kernel function to be used in the estimation of p(s) and l(s) (for details, see section Kernel function above).

kernel(quartic) is the default and requests that the quartic kernel function be used.

kernel(uniform) requests that the uniform kernel function be used.

kernel(normal) requests that the normal kernel function be used.

kernel(negexp) requests that the negative exponential kernel function be used.

kernel(triangular) requests that the triangular kernel function be used.

kernel(epanechnikov) requests that the Epanechnikov kernel function be used.

truncated(t) applies only when option kernel(normal) or kernel(negexp) is specified. It requests that, for each grid point s_g, the selected kernel function be evaluated only within a distance h_g·t from s_g, where t is a positive number.

bandwidth(method) specifies the method to be used for setting the value of the kernel bandwidth h_g at each grid point s_g (for details, see section Kernel bandwidth above).

bandwidth(fbw) requests that the fixed bandwidth method be used.

bandwidth(ndp) requests that the minimum (weighted) number of data points method be used.

bandwidth(mixed) requests that the mixed method be used.

+-----------------------+ ----+ Estimation parameters +--------------------------------------------

fbw(adq|b) sets the value of h as defined in section Kernel bandwidth above.

fbw(adq) sets h = adq, where adq equals the average distance between each data point s_i and its q nearest data points.

fbw(b) sets h = b.

ndp(d) sets the value of ndp as defined in section Kernel bandwidth above, namely ndp = d.

ndpw(ndpwvar) specifies that ndp denotes a weighted number of data points and the weights are stored in variable ndpwvar.

edgecorrection requests that an approximate edge correction be applied to the estimation of p(s) and l(s) (for details, see section Edge correction above).

+-----------+ ----+ Reporting +--------------------------------------------------------

dots requests that job progression dots be displayed.

noverbose requests that the display of every indicator of job progression be suppressed.

+----------------+ ----+ Saving results +---------------------------------------------------

saving(kernest [, replace])} requests that the kernel estimates of p(s_g) and l(s_g) (g = 1, ..., G), along with a set of auxiliary variables, be saved to dataset kernest. If specified, suboption replace requests that dataset kernest be overwritten if already existing.

Examples 1 - Standard applications

. spgrid using "Italy-OutlineCoordinates.dta", /// resolution(w10) unit(kilometers) /// cells("GridCells.dta") /// points("GridPoints.dta") /// replace compress dots

. use "Italy-DataPoints.dta", clear . spkde using "GridPoints.dta", /// xcoord(xcoord) ycoord(ycoord) /// bandwidth(fbw) fbw(100) dots /// saving("Kde.dta", replace) . use "Kde.dta", clear . spmap lambda using "GridCells.dta", /// id(spgrid_id) clnum(20) /// fcolor(Rainbow) ocolor(none ..) /// legend(off) /// point(data("Italy-DataPoints.dta") /// x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear . spkde using "GridPoints.dta", /// xcoord(xcoord) ycoord(ycoord) /// bandwidth(fbw) fbw(100) dots /// edgecorrect /// saving("Kde.dta", replace) . use "Kde.dta", clear . spmap lambda using "GridCells.dta", /// id(spgrid_id) clnum(20) /// fcolor(Rainbow) ocolor(none ..) /// legend(off) /// point(data("Italy-DataPoints.dta") /// x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear . spkde using "GridPoints.dta", /// xcoord(xcoord) ycoord(ycoord) /// kernel(normal) /// bandwidth(fbw) fbw(ad5) dots /// saving("Kde.dta", replace) . use "Kde.dta", clear . spmap lambda using "GridCells.dta", /// id(spgrid_id) clnum(20) /// fcolor(Rainbow) ocolor(none ..) /// legend(off) /// point(data("Italy-DataPoints.dta") /// x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear . spkde using "GridPoints.dta", /// xcoord(xcoord) ycoord(ycoord) /// bandwidth(ndp) ndp(4) dots /// saving("Kde.dta", replace) . use "Kde.dta", clear . spmap lambda using "GridCells.dta", /// id(spgrid_id) clnum(20) /// fcolor(Rainbow) ocolor(none ..) /// legend(off) /// point(data("Italy-DataPoints.dta") /// x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear . spkde dcvd95 pop95 using "GridPoints.dta", /// xcoord(xcoord) ycoord(ycoord) /// bandwidth(fbw) fbw(100) dots /// saving("Kde.dta", replace) . use "Kde.dta", clear . generate ratio = dcvd95_lambda / pop95_lambda * 1000 . spmap ratio using "GridCells.dta", /// id(spgrid_id) clnum(20) /// fcolor(Rainbow) ocolor(none ..) /// legend(off)

Examples 2 - Alternative applications

As mentioned above, spkde can be used also for estimating the joint probability density function of any pair of quantitative variables (for an alternative, see Stata program kdens2, written by Christopher F. Baum and available from the Boston SSC Archive). spmap can then be used to generate the corresponding density plot. To this purpose, it is advised to make use of mylabels, a Stata program written by Nicholas J. Cox and available from the Boston SSC Archive.

As an example, let us estimate and plot the bivariate probability density function for two of the variables included in the auto dataset: mpg and price. This can be done in four steps as follows:

1. Normalize variables in the range [0,1]

. sysuse "auto.dta", clear . summarize price mpg . clonevar x = mpg . clonevar y = price . replace x = (x-0) / (50-0) . replace y = (y-0) / (20000-0) . mylabels 0(10)50, myscale((@-0) / (50-0)) local(XLAB) . mylabels 0(5000)20000, myscale((@-0) / (20000-0)) local(YLAB) . keep x y . save "xy.dta", replace

2. Generate a 100x100 grid

. spgrid, shape(hexagonal) xdim(100) /// xrange(0 1) yrange(0 1) /// dots replace /// cells("2D-GridCells.dta") /// points("2D-GridPoints.dta")

3. Estimate the bivariate probability density function

. spkde using "2D-GridPoints.dta", /// xcoord(x) ycoord(y) /// bandwidth(fbw) fbw(0.1) dots /// saving("2D-Kde.dta", replace)

4. Display the density plot

. use "2D-Kde.dta", clear . recode lambda (.=0) . spmap lambda using "2D-GridCells.dta", /// id(spgrid_id) clnum(20) fcolor(Rainbow) /// ocolor(none ..) legend(off) /// point(data("xy.dta") x(x) y(y)) /// freestyle aspectratio(1) /// xtitle(" " "Mileage (mpg)") /// xlab(`XLAB') /// ytitle("Price" " ") /// ylab(`YLAB', angle(0))

Acknowledgments

I wish to thank Nick Cox for helpful suggestions.

Author

Maurizio Pisati Department of Sociology and Social Research University of Milano Bicocca - Italy maurizio.pisati@unimib.it

References

Bailey, T.C. and A.C. Gatrell. 1995. Interactive Spatial Data Analysis. Harlow: Longman.

Brundson, C. 1995. Estimating Probability Surfaces for Geographical Point Data: An Adaptive Kernel Algorithm. Computers & Geosciences 21: 877-894.

Talbot, T.O., Kulldorff, M., Forand, S.P. and V.B. Haley. 2007. Evaluation of Spatial Filters To Create Smoothed Maps of Health Data. Statistics in Medicine 19: 2399-2408.

Waller, L.A. and C.A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken NJ: Wiley.

Also see

Online: spgrid (if installed), spmap (if installed), mylabels (if installed)