help spkdeVersion 1.0.0 -------------------------------------------------------------------------------

Title

spkde-- Kernel estimation of density and intensity functions for two-dimensional spatial point patterns

spkde[varlist] [if] [in] usinggridpoints[,options]

optionsDescription ------------------------------------------------------------------------- Main *xcoord(xvar)variable containing the x-coordinate of data points *ycoord(yvar)variable containing the y-coordinate of data pointskernel(kf)kernel function, wherekfis one of the following:quartic(default),uniform,normal,negexp,triangular,epanechnikovtruncated(t)truncation parameter for kernel functionsnormalandnegexp, wheretis a positive number *bandwidth(method)method for setting the value of the kernel bandwidth, wheremethodis one of the following:fbw,ndp,mixedEstimation parameters

fbw(adq|b)fixed kernel bandwidth, whereqis positive integer andbis a positive numberndp(d)minimum (weighted) number of data points to be used for kernel estimation at each grid point, wheredis a positive numberndpw(ndpwvar)weight data points by variablendpwvarwhen searching for the minimum number of data points to be used for kernel estimationedgecorrectionapply approximate edge correctionReporting

dotsdisplay job progression dotsnoverbosesuppress display of job progressionSaving results *

saving(kernest[, replace])save results to Stata datasetkernest------------------------------------------------------------------------- * Required option

spkdeimplements a variety of kernel estimators of both the probability density function and the intensity function of two-dimensional spatial point patterns.A two-dimensional spatial point pattern

Scan be defined as a set ofdatapointss_i(i= 1, ...,n) located in a two-dimensional study regionRat coordinates (s_i1,s_i2). Each data points_irepresents the location inRof one or more “objects” of some kind: people, events, sites, buildings, plants, cases of a disease, etc.In the analysis of spatial point patterns we are often interested in determining whether the distribution of the objects of interest within

Rexhibits some form of clustering, as opposed to being random. To explore the possibility of clustering, it may be useful to describe the spatial point pattern of interest by means of itsprobability density functionp(s) and/or itsintensity functionl(s).The probability density function

p(s) defines the probability of observing an object at locationsinR. In turn, the intensity functionl(s) defines the expected number of objects per unit area at locationsinR. Therefore,p(s) andl(s) differ only by a constant of proportionality (Bailey and Gatrell 1995; Waller and Gotway 2004).Both the probability density function

p(s) and the intensity functionl(s) of a given two-dimensional spatial point pattern can be easily estimated by means of nonparametric estimators, e.g., kernel estimators.Kernel estimatorsare used to generate a spatially smooth estimate ofp(s) and/orl(s) at a fine grid of pointss_g(g= 1, ...,G) covering the study regionR(Bailey and Gatrell 1995; Waller and Gotway 2004).

spkdecomputes kernel estimates of bothp(s) andl(s) at a grid of points generated by the user-written Stata program spgrid. Expressly, for each grid points_g,spkdecomputes first the kernel estimate of the intensityl(s_g), and then the kernel estimate of the densityp(s_g).The intensity

l(s_g) at each grid points_gis estimated as follows:^

cn(1)l(s_g) = ----- · SUMk(d_ig/h_g) ·y_iA_gi=1where

k(·) is the kernel function - usually a unimodal symmetrical bivariate probability density function;d_igis the Euclidean distance between data points_iand grid points_g;h_gis the kernel bandwidth - i.e., the radius of the kernel function - at grid points_g;y_iis the number of objects located at data points_i;A_gis the area of the subregion ofRover which the kernel function is evaluated at grid points_g, possibly corrected for edge effects; andcis a constant of proportionality.In turn, the density

p(s_g) at each grid points_gis estimated as follows:^ ^

l(s_g) (2)p(s_g) = ------------G^ SUMl(s_j)j=1The estimates of

l(s_g) andp(s_g), along with several other auxiliary variables, are saved to Stata datasetkernestand can be visualized using the user-written Stata program spmap.

As shown by Equation 1 above, for each grid point

s_g, kernel estimators compute a weighted sum of the objects making up the spatial point pattern of interest. The weightk(d_ig/h_g) applied to each object is a function of the ratio between the object's distance froms_g(d_ig) and the value of the kernel bandwidthh_g. In turn, the way weights depend ond_igandh_gis determined by the form of the kernel functionk(·) used in the analysis.

spkdeallows to choose among six basic types of kernel function: uniform, normal, negative exponential, quartic (the default), triangular, and Epanechnikov. Moreover, through specification of optiontruncated(t), it is possible to use a truncated version of the normal and negative exponential kernel functions.Let

z=d_ig/h_gdenote the argument of the kernel function, andt>0 denote a truncation parameter specified with optiontruncated(t). Then, the kernel functions made available byspkdecan be described as follows:----------------------------------------------------------------------- Name Formula ----------------------------------------------------------------------- Uniform

k(z) = 1 ifz< 1k(z) = 0 otherwiseNormal

k(z) = exp(-z²/2)Truncated normal

k(z) = exp(-z²/2) ifd_ig<h_g·tk(z) = 0 otherwiseNegative exponential

k(z) = exp(-3z)Truncated negative exponential

k(z) = exp(-3z) ifd_ig<h_g·tk(z) = 0 otherwiseQuartic

k(z) = (1-z²)² ifz< 1k(z) = 0 otherwiseTriangular

k(z) = (1-z) ifz< 1k(z) = 0 otherwiseEpanechnikov

k(z) = (1-z²) ifz< 1k(z) = 0 otherwise -----------------------------------------------------------------------

While the precise form of the kernel function has a moderate influence on the estimates of

p(s) andl(s), the kernel bandwidth plays a major role in kernel estimation, since it determines the degree of smoothing with whichp(s) andl(s) are estimated: large bandwidths may result in oversmoothing, small bandwidths may retain too much local deatail and exhibit spikes at isolated event locations (Bailey and Gatrell 1995; Waller and Gotway 2004).

spkdeallows to choose among three methods for setting the value of the kernel bandwidthh_gat each grid points_g: fixed bandwidth, minimum (weighted) number of data points, and mixed.The

fixed bandwidthmethod setsh_g=hat all grid points, wherehis a positive number expressed in the same unit of measurement (miles, kilometers, meters, pixels, etc.) as the grid and data points coordinates.The

minimum (weighted) number of data pointsmethod setsh_g=r_g(ndp), wherer_g(ndp) is the radius of the smallest circle centered ons_gthat circumscribes at leastndp(weighted) data points. The quantityndpcan express either an unweighted or a weighted number of data points. In the latter case, the weight associated with each data point must be stored in variablendpwvarand specified with optionndpw(ndpwvar). For discussion and illustration of this adaptive method and its applications see, e.g., Bailey and Gatrell (1995), Brundson (1995), Talbotet al.(2000).The

mixedmethod is a combination of the other two methods. Expressly, it setsh_g=hif the circle centered ons_gand having radiushcircumscribes at leastndp(weighted) data points, andh_g=r_g(ndp) otherwise.

For bounded and truncated kernel functions, the nominal value of the area over which the kernel function is evaluated at each grid point

s_g(see Equation 1 above) equals the area of the kernel window, i.e., of the circle centered ons_gand having radiush_g(for bounded functions) orh_g·t(for truncated functions,tbeing the truncation parameter). Formally:A_g= 3.1416 ·h_g² for bounded functions, andA_g= 3.1416 · (h_g·t)² for truncated functions.For grid points located near the edges of the study region, a greater or lesser portion of their kernel window may lie outside the study region, so that the densities/intensities computed with the standard formula at such grid points may result underestimated (Bailey and Gatrell 1995).

Although an accurate assessment of the impact of edge effects and the development of proper corrections remain an open area of research (Waller and Gotway 2004), some adjustments are possible.

spkdeimplements an approximate edge correction that consists in rescaling the area of the kernel window at each grid points_gby a factorec_gthat approximately equals the proportion of the kernel window lying within the study region. After this correction,A_g= 3.1416 ·h_g² ·ec_gfor bounded functions, andA_g= 3.1416 · (h_g·t)² ·ec_gfor truncated functions.In

spkde, the approximate edge correction described above can be applied to the estimation ofl(s) andp(s) only when a bounded or truncated kernel function is used. This includes the uniform, quartic, triangular, Epanechnicov, truncated normal, and truncated negative exponential functions.

spkdemakes use of two input datasets:datapointsandgridpoints.

datapointsis the dataset that resides in memory whenspkdeis run. It consists ofnobservations, one for each data point making up the spatial point pattern of interest. At the minimum, it must includexvar, a numeric variable that contains the x-coordinate of data points, andyvar, a numeric variable that contains the y-coordinate of data points. When each data point “hosts” more than one object ofJ>0 different types, thenspkdemust be run specifying avarlistmade ofJnumeric variables, each containing the number of objects of a given type located at each data point; in this case, datasetdatapointsmust also include theJvariables specified invarlist. Finally, datasetdatapointscan optionally includendpwvar, a numeric variable sometimes used to set the value of the kernel bandwidth.

gridpointsis the using dataset invoked byspkdeat run time. It is generated by the user-written Stata program spgrid and contains the spatial grid used byspkdeto compute the kernel estimates ofp(s_g) andl(s_g).

spkderoutinely generateskernest, a Stata dataset that consists ofGobservations - one for each grid points_g- and includes the following variables:o

spgrid_idis a numeric variable that uniquely identifies the cells making up the grid.o

spgrid_xcoordis a numeric variable that contains the x-coordinate of each grid point.o

spgrid_ycoordis a numeric variable that contains the y-coordinate of each grid point.o

bandwidthis a numeric variable that contains the values ofh_g(see above).o

ndpis a numeric variable that contains the number of data points used for kernel estimation at each grid point.o

Ais a numeric variable that contains the values ofA_g(see above).If no

varlistis specified, three additional variables are included in datasetkernest:o

cis a numeric variable that contains the value ofc(see above).o

lambdais a numeric variable that contains the kernel estimate ofl(s_g) (see above).o

pis a numeric variable that contains the kernel estimate ofp(s_g) (see above).On the other hand, if a

varlistis specified, for each variablevarnameinvarlistthree additional variables are included in datasetkernest:o

varname_cis a numeric variable that contains the value ofcfor objects of typevarname.o

varname_lambdais a numeric variable that contains the kernel estimate ofl(s_g) for objects of typevarname.o

varname_pis a numeric variable that contains the kernel estimate ofp(s_g) for objects of typevarname.When option

ndpw(ndpwvar)is specified, an additional variable is included in datasetkernest:o

wndpis a numeric variable that contains the weighted number of data points used for kernel estimation at each grid point.Finally, when option

edgecorrectionis specified, an additional variable is included in datasetkernest:o

edgecorrectis a numeric variable that contains the edge correction factorec_g(see above).

Visualization of kernel estimatesThe kernel estimates of

l(s_g) andp(s_g) generated byspkdecan be visualized using the user-written Stata program spmap.To this purpose, two datasets are needed:

o

kernestis the output dataset routinely generated byspkde(see section Output dataset above).o

gridcellsis one of the two output datasets generated by the user-written Stata program spgrid.To visualize the kernel estimates of interest, spmap must be run with

kernestas themasterdataset, andgridcellsas thebasemapdataset.

Although

spkdehas been designed to carry out kernel estimation of density and intensity functions for two-dimensional spatial point patterns, it can be used also for estimating the joint probability density functionp(x,y) of any pair of quantitative variablesXandY(see section Examples 2 - Alternative applications below).

+------+ ----+ Main +-------------------------------------------------------------

xcoord(xvar)specifies the name of the variable containing the x-coordinate of each data points_i.

ycoord(yvar)specifies the name of the variable containing the y-coordinate of each data points_i.

kernel(kf)specifies the basic type of kernel function to be used in the estimation ofp(s) andl(s) (for details, see section Kernel function above).

kernel(quartic)is the default and requests that the quartic kernel function be used.

kernel(uniform)requests that the uniform kernel function be used.

kernel(normal)requests that the normal kernel function be used.

kernel(negexp)requests that the negative exponential kernel function be used.

kernel(triangular)requests that the triangular kernel function be used.

kernel(epanechnikov)requests that the Epanechnikov kernel function be used.

truncated(t)applies only when optionkernel(normal)orkernel(negexp)is specified. It requests that, for each grid points_g, the selected kernel function be evaluated only within a distanceh_g·tfroms_g, wheretis a positive number.

bandwidth(method)specifies the method to be used for setting the value of the kernel bandwidthh_gat each grid points_g(for details, see section Kernel bandwidth above).

bandwidth(fbw)requests that thefixedbandwidth method be used.

bandwidth(ndp)requests that the minimum (weighted)number ofdatapoints method be used.

bandwidth(mixed)requests that the mixed method be used.+-----------------------+ ----+ Estimation parameters +--------------------------------------------

fbw(adq|b)sets the value ofhas defined in section Kernel bandwidth above.

fbw(adq)setsh=adq, whereadqequals the average distance between each data points_iand itsqnearest data points.

fbw(b)setsh=b.

ndp(d)sets the value ofndpas defined in section Kernel bandwidth above, namelyndp=d.

ndpw(ndpwvar)specifies thatndpdenotes a weighted number of data points and the weights are stored in variablendpwvar.

edgecorrectionrequests that an approximate edge correction be applied to the estimation ofp(s) andl(s) (for details, see section Edge correction above).+-----------+ ----+ Reporting +--------------------------------------------------------

dotsrequests that job progression dots be displayed.

noverboserequests that the display of every indicator of job progression be suppressed.+----------------+ ----+ Saving results +---------------------------------------------------

saving(kernest[, replace])} requests that the kernel estimates ofp(s_g) andl(s_g) (g= 1, ...,G), along with a set of auxiliary variables, be saved to datasetkernest. If specified, suboptionreplacerequests that datasetkernestbe overwritten if already existing.

-Examples 1Standard applications

. spgrid using "Italy-OutlineCoordinates.dta", ///resolution(w10) unit(kilometers) ///cells("GridCells.dta") ///points("GridPoints.dta") ///replace compress dots

. use "Italy-DataPoints.dta", clear. spkde using "GridPoints.dta", ///xcoord(xcoord) ycoord(ycoord) ///bandwidth(fbw) fbw(100) dots ///saving("Kde.dta", replace). use "Kde.dta", clear. spmap lambda using "GridCells.dta", ///id(spgrid_id) clnum(20) ///fcolor(Rainbow) ocolor(none ..) ///legend(off) ///point(data("Italy-DataPoints.dta") ///x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear. spkde using "GridPoints.dta", ///xcoord(xcoord) ycoord(ycoord) ///bandwidth(fbw) fbw(100) dots ///edgecorrect ///saving("Kde.dta", replace). use "Kde.dta", clear. spmap lambda using "GridCells.dta", ///id(spgrid_id) clnum(20) ///fcolor(Rainbow) ocolor(none ..) ///legend(off) ///point(data("Italy-DataPoints.dta") ///x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear. spkde using "GridPoints.dta", ///xcoord(xcoord) ycoord(ycoord) ///kernel(normal) ///bandwidth(fbw) fbw(ad5) dots ///saving("Kde.dta", replace). use "Kde.dta", clear. spmap lambda using "GridCells.dta", ///id(spgrid_id) clnum(20) ///fcolor(Rainbow) ocolor(none ..) ///legend(off) ///point(data("Italy-DataPoints.dta") ///x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear. spkde using "GridPoints.dta", ///xcoord(xcoord) ycoord(ycoord) ///bandwidth(ndp) ndp(4) dots ///saving("Kde.dta", replace). use "Kde.dta", clear. spmap lambda using "GridCells.dta", ///id(spgrid_id) clnum(20) ///fcolor(Rainbow) ocolor(none ..) ///legend(off) ///point(data("Italy-DataPoints.dta") ///x(xcoord) y(ycoord))

. use "Italy-DataPoints.dta", clear. spkde dcvd95 pop95 using "GridPoints.dta", ///xcoord(xcoord) ycoord(ycoord) ///bandwidth(fbw) fbw(100) dots ///saving("Kde.dta", replace). use "Kde.dta", clear. generate ratio = dcvd95_lambda / pop95_lambda * 1000. spmap ratio using "GridCells.dta", ///id(spgrid_id) clnum(20) ///fcolor(Rainbow) ocolor(none ..) ///legend(off)

-Examples 2Alternative applicationsAs mentioned above,

spkdecan be used also for estimating the joint probability density function of any pair of quantitative variables (for an alternative, see Stata program kdens2, written by Christopher F. Baum and available from the Boston SSC Archive). spmap can then be used to generate the corresponding density plot. To this purpose, it is advised to make use of mylabels, a Stata program written by Nicholas J. Cox and available from the Boston SSC Archive.As an example, let us estimate and plot the bivariate probability density function for two of the variables included in the

autodataset:mpgandprice. This can be done in four steps as follows:1. Normalize variables in the range [0,1]

. sysuse "auto.dta", clear. summarize price mpg. clonevar x = mpg. clonevar y = price. replace x = (x-0) / (50-0). replace y = (y-0) / (20000-0). mylabels 0(10)50, myscale((@-0) / (50-0)) local(XLAB). mylabels 0(5000)20000, myscale((@-0) / (20000-0)) local(YLAB). keep x y. save "xy.dta", replace2. Generate a 100x100 grid

. spgrid, shape(hexagonal) xdim(100) ///xrange(0 1) yrange(0 1) ///dots replace ///cells("2D-GridCells.dta") ///points("2D-GridPoints.dta")3. Estimate the bivariate probability density function

. spkde using "2D-GridPoints.dta", ///xcoord(x) ycoord(y) ///bandwidth(fbw) fbw(0.1) dots ///saving("2D-Kde.dta", replace)4. Display the density plot

. use "2D-Kde.dta", clear. recode lambda (.=0). spmap lambda using "2D-GridCells.dta", ///id(spgrid_id) clnum(20) fcolor(Rainbow) ///ocolor(none ..) legend(off) ///point(data("xy.dta") x(x) y(y)) ///freestyle aspectratio(1) ///xtitle(" " "Mileage (mpg)") ///xlab(`XLAB') ///ytitle("Price" " ") ///ylab(`YLAB', angle(0))

AcknowledgmentsI wish to thank Nick Cox for helpful suggestions.

AuthorMaurizio Pisati Department of Sociology and Social Research University of Milano Bicocca - Italy maurizio.pisati@unimib.it

ReferencesBailey, T.C. and A.C. Gatrell. 1995.

Interactive Spatial DataAnalysis. Harlow: Longman.Brundson, C. 1995. Estimating Probability Surfaces for Geographical Point Data: An Adaptive Kernel Algorithm.

Computers & Geosciences21: 877-894.Talbot, T.O., Kulldorff, M., Forand, S.P. and V.B. Haley. 2007. Evaluation of Spatial Filters To Create Smoothed Maps of Health Data.

Statistics in Medicine19: 2399-2408.Waller, L.A. and C.A. Gotway. 2004.

Applied Spatial Statisticsfor PublicHealth Data. Hoboken NJ: Wiley.

Also seeOnline:

spgrid(if installed),spmap(if installed),mylabels(if installed)