help geonear-------------------------------------------------------------------------------

Title

geonear-- Finds nearest neighbors using geodetic distances.

Syntax

geonearbaseid baselat baselonusingnborfile,neighbors(nboridnborlat nborlon)[options]

optionsDescription ------------------------------------------------------------------------- Mainwidereturn nearest neighbors in wide form (the default)longreturn nearest neighbors in long formnearcount(#)find the#nearest neighborsignoreselfignore distance to self (whenbaseid == nborid)ops(#)specify a maximum#of pairwise combinations ofbaseidandnboridper region

Wide Form Onlygenstub(str)prefix nearest neighbor variable names withstr

Long Form Onlywithin(#)return neighbors within a distance of#from eachbaseidlimit(#)return no more than#nearest neighbors for eachbaseidDistances

milesreturn distances in milesradius(#)custom radius#(in km) for spherical distancesellipsoiduse ellipsoidal distancesa(#)custom ellipsoid; semi-major axis parameter#f(#)custom ellipsoid; flattening parameter#Reporting

report(#)display a status report line every#seconds-------------------------------------------------------------------------

DescriptionFor each location identified by

baseidand with coordinatesbaselatbaselon,geonearfinds innborfile, a Stata-format dataset, the nearest neighbor(s), identified bynboridand located atnborlatnborlon. Coordinates must be in signed decimal degrees, positive for north and east, and negative for south and west. Latitudes (baselatandnborlat) range from -90 to 90 and longitudes (baselonandnborlon) from -180 to 180.

geonearcomputes geodetic distances, i.e. the length of the shortest curve between two points along the surface of a mathematical model of the earth. geodist is a standalone implementation ofgeonear's distance routines and is available from SSC. By default,geonearcalculates distances on a sphere but ellipsoidal distances can be requested (slower). All distances are in kilometers and all distance variables are prefixed with "km_to_" unless themilesoption is specified (the prefix changes to "mi_to_").

geonearuses a divide and conquer strategy to significantly reduce the number of distances that must be computed. The approach involves splitting base locations into progressively smaller geographic regions while at the same time safely reducing the set of potential neighbors for each region. Distances are calculated only when the number of pairwise combinations ofbaseidandnboridfalls below a threshold value (adjustable using theops(#)option). Generally, there is about the same number of base locations per region which means thatgeonearwill predictably find the nearest neighbors in a linear time even though the size of the overall problem increases exponentially. In other words doubling the size of the base location setanddoubling the size of the neighbor location set simply doubles the number of regions and thus doubles the run time.By default,

geonearoperates in wide form mode (seewideoption). In this mode,geonearcreates two variables for each neighbor requested. The first variable identifies the nearest neighbor (usingnborididentifiers). The second variable stores the distance to that neighbor from thebaseidlocation.In long form mode (see

longoption),geonearreturns one observation per neighbor found. Thenearcount(#)option can be used to request a specific number of neighbors perbaseid. Thewithin(#)option can be used to request all neighbors within a distance of#from eachbaseid. You can combinewithin(#)withnearcount(#)andgeonearwill return anynboridthat satisfies either condition.

Options+------+ ----+ Main +-------------------------------------------------------------

wideis used to request thatgeonearreturn results in wide form. In this mode,geonearcreates variables to identify neighbors and their distance to thebaseid. Ifnearcount(#)is omitted, the nearest neighbor is identified by thenidandkm_to_nidvariables. Otherwise, the naming follows the following pattern:

nid1 km_to_nid1 nid2 km_to_nid2 [...] nid# km_to_nid#The

genstub(nidstub)option can be used to specify an alternative stubname tonid.

longis used to request results in long form, one observation per neighbor found.geonearreturns the variablesbaseid,nboridandkm_to_nborid, with as many observations perbaseidas there are neighbors found. Because of the output format,baseidandnboridmust have different names. Bothwithin(#)andnearcount(#)can be used to target neighbors.

nearcount(#)is used to request a specific number of nearestnboridperbaseid. If omitted,geonearfinds the nearest neighbor for eachbaseid. If there is a tie for the #thnboridposition, the neighbor is chosen by ascending order ofnborid.

ignoreselfis used whenbaseidandnborididentify the same set of locations. With this option,geonearignores distances whennborid == baseid(as such distance would be zero) and thus ensures that the nearest neighbor is a different entity.

ops(#)specifies the maximum#number of distances to be calculated per region. Within a region, the number of distances to be calculated is determined by(nbases * nnbors), wherenbasesis the count of base locations within the region andnnborsis the number of potential neighbors. The default is 15,000 for results in wide form and 7,000 for results in long form.

genstub(str)is only available in wide form mode. It is used to specify an alternative stubname tonidfor the nearest neighbor variables.

within(#)is only available in long form mode. It is used to request all neighbors within a distance#of eachbaseid. If themilesoption is specified,within(#)is considered specified in miles. This option can be combined withnearcount(#)to request a minimum number of nearest neighbors perbaseid. Combined withnearcount(0),baseids are excluded from the results if they have no neighbors within the specified distance.

limit(#)is only available in long form mode. It can be used to limit the number of neighbor observations perbaseidwhenwithin(#)is used. This option is offered mostly as a convenience. It does not affect the number of distances calculated; it simply drops neighbor observations that exceed the count.+-----------+ ----+ Distances +--------------------------------------------------------

milesrequests that distances be returned in miles instead of kilometers. It also indicates thatwithin(#)is in miles.

radius(#)allows for a user-specified radius of the earth in km. If omitted, a radius of 6371km is used. This is the earth's mean radius (see http://en.wikipedia.org/wiki/Earth_radius#Mean_radii).

ellipsoidindicates that distances should be calculated using the WGS 1984 reference ellipsoid. Calculating ellipsoidal distances requires more computational power; you should expect a significant penalty in terms of total run time.

a(#)is used to specify the length of the semi-major axis parameter in meters for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specifya(6377397.155).

f(#)is used to specify the flattening ratio parameter for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specifyf(299.1528128).+-----------+ ----+ Reporting +--------------------------------------------------------

report(#)is used to control the frequency of status reporting. By default, a status report line is displayed every 10 seconds. Each report line includes current region count, number of base locations within the region, number of neighbor locations considered, total number of distances calculated (ops = base locs x nbor locs), cumulative number of distances, percent done, and remaining time.

Accuracy and limitationsBy default,

geonearcomputes great-circle distances using the Haversine formula.geonearuses double precision arithmetic throughout and the formula provides submillimeter accuracy for all distances except within 10 meters of antipodal locations.geonearuses the Vincenty (1975) equations when computing distances on a reference ellipsoid. According to Thomas and Featherstone (2005), these are thought to maintain submillimeter accuracy between all locations. However, the Vincenty algorithm cannot accurately calculate distances for near-antipodal points. In such cases, the distances are undefined and set to missing. See http://geographiclib.sourceforge.net/cgi-bin/Geod for a calculator that does not have this limitation. In either case, sincegeonearseeks nearest neighbors, the issue of near-antipodal locations is not likely to matter.Most longitude and latitude datasets are based on the WGS 1984 datum, which is what GPS devices and Google Earth/Map use. There are other datums and it is important to understand that distances will not be accurate if the base and neighbor locations are not expressed in the same reference system. Here is a photo of my GPS lying on the Prime Meridian line at the Royal Observatory in Greenwich, England. If you use a distance calculator, you will see that it is about 100 meters to the west of the Reference Meridian of WGS 1984 (geodist 51.47803 -0.00145 51.47803 0).

ExamplesSimulate a dataset of 2000 Census Block Group centroids for Colorado.

.clear.set seed 123456.set obs 3278.gen bgid = _n.gen double bglat = 37 + (41 - 37) * uniform().gen double bglon = -109 + (109 - 102) * uniform().tempfile bg.save "`bg'" Create a new dataset that contains locations of cell towers in the State of Colorado.

.clear.set obs 1000.gen ctid = _n.gen double ctlat = 37 + (41 - 37) * uniform().gen double ctlon = -109 + (109 - 102) * uniform().tempfile cell.save "`cell'"To find the nearest neighbor for each of these cell towers:

.geonear ctid ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ignorese > lfTo find all towers within 50km of each tower:

.use "`cell'", clear.rename ctid ctid0.geonear ctid0 ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ign lon > g within(50) near(0)To find all cell towers within 5km from each Census Block Group centroids:

.use "`bg'", clear.geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5 > ) near(0) longTo find the nearest cell tower as well as all cell towers within 5km from each Census Block Group centroids:

.use "`bg'", clear.geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5 > ) long.by bgid: gen N = _N.tab NTo find the nearest 3 Census Block Group centroids for each cell tower in wide form:

.use "`cell'", clear.geonear ctid ctlat ctlon using "`bg'", n(bgid bglat bglon) wide near( > 3) genstub(bgid) The following example finds all Census Block Group centroids that fall within each cell tower's Voronoi (a.k.a. Thiessen) polygon.

.use "`bg'", clear.geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) genstub( > ctid).sort ctid bgid.by ctid: gen N = _N.tab N

References and acknowledgementsMany thanks to Chris Veness for the best web pages on how to compute geodetic distances:

http://www.movable-type.co.uk/scripts/latlong-vincenty.html http://www.movable-type.co.uk/scripts/latlong.html The definition of the World Geodetic System 1984 is available from

http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf See Appendix A.1 for a list of reference ellipsoids.

C. F. F. Karney, Geodesics on an ellipsoid of revolution, Feb. 2011; preprint arxiv:1102.1215.

R. W. Sinnott, "Virtues of the Haversine",

Sky and Telescope68 (2), 159 (1984). Thanks to the University of Michigan's Shapiro Science Library.C. M. Thomas and W. E. Featherstone, Validation of Vincenty's Formulas for the Geodesic Using a New Fourth-Order Extension of Kivioja's Formula,

J. Surv. Engrg.Volume 131, Issue 1, pp. 20-26 (February 2005), available for download from:http://www.cage.curtin.edu.au/~will/thomas-featherstone.pdf

Vincenty, T. (1975) Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations,

Survey Review22(176): 88-93 is available from:http://www.ngs.noaa.gov/PUBS_LIB/inverse.pdf

AuthorRobert Picard picard@netbox.com