Title
geonear -- Finds nearest neighbors using geodetic distances.
Syntax
geonear baseid baselat baselon using nborfile , neighbors(nborid nborlat nborlon) [options]
options Description ------------------------------------------------------------------------- Main wide return nearest neighbors in wide form (the default) long return nearest neighbors in long form nearcount(#) find the # nearest neighbors ignoreself ignore distance to self (when baseid == nborid) ops(#) specify a maximum # of pairwise combinations of baseid and nborid per region
Wide Form Only genstub(str) prefix nearest neighbor variable names with str
Long Form Only within(#) return neighbors within a distance of # from each baseid limit(#) return no more than # nearest neighbors for each baseid
Distances miles return distances in miles radius(#) custom radius # (in km) for spherical distances ellipsoid use ellipsoidal distances a(#) custom ellipsoid; semi-major axis parameter # f(#) custom ellipsoid; flattening parameter #
Reporting report(#) display a status report line every # seconds
-------------------------------------------------------------------------
Description
For each location identified by baseid and with coordinates baselat baselon, geonear finds in nborfile, a Stata-format dataset, the nearest neighbor(s), identified by nborid and located at nborlat nborlon. Coordinates must be in signed decimal degrees, positive for north and east, and negative for south and west. Latitudes (baselat and nborlat) range from -90 to 90 and longitudes (baselon and nborlon) from -180 to 180.
geonear computes geodetic distances, i.e. the length of the shortest curve between two points along the surface of a mathematical model of the earth. geodist is a standalone implementation of geonear's distance routines and is available from SSC. By default, geonear calculates distances on a sphere but ellipsoidal distances can be requested (slower). All distances are in kilometers and all distance variables are prefixed with "km_to_" unless the miles option is specified (the prefix changes to "mi_to_").
geonear uses a divide and conquer strategy to significantly reduce the number of distances that must be computed. The approach involves splitting base locations into progressively smaller geographic regions while at the same time safely reducing the set of potential neighbors for each region. Distances are calculated only when the number of pairwise combinations of baseid and nborid falls below a threshold value (adjustable using the ops(#) option). Generally, there is about the same number of base locations per region which means that geonear will predictably find the nearest neighbors in a linear time even though the size of the overall problem increases exponentially. In other words doubling the size of the base location set and doubling the size of the neighbor location set simply doubles the number of regions and thus doubles the run time.
By default, geonear operates in wide form mode (see wide option). In this mode, geonear creates two variables for each neighbor requested. The first variable identifies the nearest neighbor (using nborid identifiers). The second variable stores the distance to that neighbor from the baseid location.
In long form mode (see long option), geonear returns one observation per neighbor found. The nearcount(#) option can be used to request a specific number of neighbors per baseid. The within(#) option can be used to request all neighbors within a distance of # from each baseid. You can combine within(#) with nearcount(#) and geonear will return any nborid that satisfies either condition.
Options
+------+ ----+ Main +------------------------------------------------------------- wide is used to request that geonear return results in wide form. In this mode, geonear creates variables to identify neighbors and their distance to the baseid. If nearcount(#) is omitted, the nearest neighbor is identified by the nid and km_to_nid variables. Otherwise, the naming follows the following pattern:
nid1 km_to_nid1 nid2 km_to_nid2 [...] nid# km_to_nid#
The genstub(nidstub) option can be used to specify an alternative stubname to nid.
long is used to request results in long form, one observation per neighbor found. geonear returns the variables baseid, nborid and km_to_nborid, with as many observations per baseid as there are neighbors found. Because of the output format, baseid and nborid must have different names. Both within(#) and nearcount(#) can be used to target neighbors.
nearcount(#) is used to request a specific number of nearest nborid per baseid. If omitted, geonear finds the nearest neighbor for each baseid. If there is a tie for the #th nborid position, the neighbor is chosen by ascending order of nborid.
ignoreself is used when baseid and nborid identify the same set of locations. With this option, geonear ignores distances when nborid == baseid (as such distance would be zero) and thus ensures that the nearest neighbor is a different entity.
ops(#) specifies the maximum # number of distances to be calculated per region. Within a region, the number of distances to be calculated is determined by (nbases * nnbors), where nbases is the count of base locations within the region and nnbors is the number of potential neighbors. The default is 15,000 for results in wide form and 7,000 for results in long form.
genstub(str) is only available in wide form mode. It is used to specify an alternative stubname to nid for the nearest neighbor variables.
within(#) is only available in long form mode. It is used to request all neighbors within a distance # of each baseid. If the miles option is specified, within(#) is considered specified in miles. This option can be combined with nearcount(#) to request a minimum number of nearest neighbors per baseid. Combined with nearcount(0), baseids are excluded from the results if they have no neighbors within the specified distance.
limit(#) is only available in long form mode. It can be used to limit the number of neighbor observations per baseid when within(#) is used. This option is offered mostly as a convenience. It does not affect the number of distances calculated; it simply drops neighbor observations that exceed the count.
+-----------+ ----+ Distances +-------------------------------------------------------- miles requests that distances be returned in miles instead of kilometers. It also indicates that within(#) is in miles.
radius(#) allows for a user-specified radius of the earth in km. If omitted, a radius of 6371km is used. This is the earth's mean radius (see http://en.wikipedia.org/wiki/Earth_radius#Mean_radii).
ellipsoid indicates that distances should be calculated using the WGS 1984 reference ellipsoid. Calculating ellipsoidal distances requires more computational power; you should expect a significant penalty in terms of total run time.
a(#) is used to specify the length of the semi-major axis parameter in meters for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specify a(6377397.155).
f(#) is used to specify the flattening ratio parameter for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specify f(299.1528128).
+-----------+ ----+ Reporting +-------------------------------------------------------- report(#) is used to control the frequency of status reporting. By default, a status report line is displayed every 10 seconds. Each report line includes current region count, number of base locations within the region, number of neighbor locations considered, total number of distances calculated (ops = base locs x nbor locs), cumulative number of distances, percent done, and remaining time.
Accuracy and limitations
By default, geonear computes great-circle distances using the Haversine formula. geonear uses double precision arithmetic throughout and the formula provides submillimeter accuracy for all distances except within 10 meters of antipodal locations. geonear uses the Vincenty (1975) equations when computing distances on a reference ellipsoid. According to Thomas and Featherstone (2005), these are thought to maintain submillimeter accuracy between all locations. However, the Vincenty algorithm cannot accurately calculate distances for near-antipodal points. In such cases, the distances are undefined and set to missing. See http://geographiclib.sourceforge.net/cgi-bin/Geod for a calculator that does not have this limitation. In either case, since geonear seeks nearest neighbors, the issue of near-antipodal locations is not likely to matter.
Most longitude and latitude datasets are based on the WGS 1984 datum, which is what GPS devices and Google Earth/Map use. There are other datums and it is important to understand that distances will not be accurate if the base and neighbor locations are not expressed in the same reference system. Here is a photo of my GPS lying on the Prime Meridian line at the Royal Observatory in Greenwich, England. If you use a distance calculator, you will see that it is about 100 meters to the west of the Reference Meridian of WGS 1984 (geodist 51.47803 -0.00145 51.47803 0).
Examples
Simulate a dataset of 2000 Census Block Group centroids for Colorado.
. clear . set seed 123456 . set obs 3278 . gen bgid = _n . gen double bglat = 37 + (41 - 37) * uniform() . gen double bglon = -109 + (109 - 102) * uniform() . tempfile bg . save "`bg'" Create a new dataset that contains locations of cell towers in the State of Colorado.
. clear . set obs 1000 . gen ctid = _n . gen double ctlat = 37 + (41 - 37) * uniform() . gen double ctlon = -109 + (109 - 102) * uniform() . tempfile cell . save "`cell'"
To find the nearest neighbor for each of these cell towers:
. geonear ctid ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ignorese > lf
To find all towers within 50km of each tower:
. use "`cell'", clear . rename ctid ctid0 . geonear ctid0 ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ign lon > g within(50) near(0)
To find all cell towers within 5km from each Census Block Group centroids:
. use "`bg'", clear . geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5 > ) near(0) long
To find the nearest cell tower as well as all cell towers within 5km from each Census Block Group centroids:
. use "`bg'", clear . geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5 > ) long . by bgid: gen N = _N . tab N
To find the nearest 3 Census Block Group centroids for each cell tower in wide form:
. use "`cell'", clear . geonear ctid ctlat ctlon using "`bg'", n(bgid bglat bglon) wide near( > 3) genstub(bgid) The following example finds all Census Block Group centroids that fall within each cell tower's Voronoi (a.k.a. Thiessen) polygon.
. use "`bg'", clear . geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) genstub( > ctid) . sort ctid bgid . by ctid: gen N = _N . tab N
References and acknowledgements
Many thanks to Chris Veness for the best web pages on how to compute geodetic distances:
http://www.movable-type.co.uk/scripts/latlong-vincenty.html http://www.movable-type.co.uk/scripts/latlong.html The definition of the World Geodetic System 1984 is available from
http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf See Appendix A.1 for a list of reference ellipsoids.
C. F. F. Karney, Geodesics on an ellipsoid of revolution, Feb. 2011; preprint arxiv:1102.1215.
R. W. Sinnott, "Virtues of the Haversine", Sky and Telescope 68 (2), 159 (1984). Thanks to the University of Michigan's Shapiro Science Library.
C. M. Thomas and W. E. Featherstone, Validation of Vincenty's Formulas for the Geodesic Using a New Fourth-Order Extension of Kivioja's Formula, J. Surv. Engrg. Volume 131, Issue 1, pp. 20-26 (February 2005), available for download from:
http://www.cage.curtin.edu.au/~will/thomas-featherstone.pdf
Vincenty, T. (1975) Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations, Survey Review 22(176): 88-93 is available from:
http://www.ngs.noaa.gov/PUBS_LIB/inverse.pdf
Author
Robert Picard picard@netbox.com