{smcl} {* *! version 2.0.0 21feb2012}{...} {cmd:help geonear} {hline} {title:Title} {phang} {bf:geonear} {hline 2} Finds nearest neighbors using geodetic distances. {title:Syntax} {p 8 16 2} {cmd:geonear} {it:baseid baselat baselon} {cmd:using} {it:nborfile} {cmd:,} {opt n:eighbors(nborid nborlat nborlon)} [{it:options}] {synoptset 20 tabbed}{...} {synopthdr:options} {synoptline} {syntab:Main} {synopt:{opt wid:e}}return nearest neighbors in wide form (the default){p_end} {synopt:{opt lo:ng}}return nearest neighbors in long form{p_end} {synopt:{opt nea:rcount(#)}}find the {it:#} nearest neighbors{p_end} {synopt:{opt i:gnoreself}}ignore distance to self (when {it: {bind:baseid == nborid}}){p_end} {synopt:{opt o:ps(#)}}specify a maximum {it:#} of pairwise combinations of {it:baseid} and {it:nborid} per region{p_end} {synopt:{it:Wide Form Only}}{p_end} {synopt:{opt g:enstub(str)}}prefix nearest neighbor variable names with {it:str}{p_end} {synopt:{it:Long Form Only}}{p_end} {synopt:{opt wit:hin(#)}}return neighbors within a distance of {it:#} from each {it:baseid}{p_end} {synopt:{opt li:mit(#)}}return no more than {it:#} nearest neighbors for each {it:baseid}{p_end} {syntab:Distances} {synopt:{opt mi:les}}return distances in miles{p_end} {synopt:{opt ra:dius(#)}}custom radius {it:#} (in km) for spherical distances{p_end} {synopt:{opt e:llipsoid}}use ellipsoidal distances{p_end} {synopt:{opt a(#)}}custom ellipsoid; semi-major axis parameter {it:#}{p_end} {synopt:{opt f(#)}}custom ellipsoid; flattening parameter {it:#} {p_end} {syntab:Reporting} {synopt:{opt re:port(#)}}display a status report line every {it:#} seconds{p_end} {synoptline} {p2colreset}{...} {title:Description} {pstd} For each location identified by {it:baseid} and with coordinates {it:baselat} {it:baselon}, {cmd:geonear} finds in {it:nborfile}, a Stata-format dataset, the nearest neighbor(s), identified by {it:nborid} and located at {it:nborlat} {it:nborlon}. Coordinates must be in signed decimal degrees, positive for north and east, and negative for south and west. Latitudes ({it:baselat} and {it:nborlat}) range from -90 to 90 and longitudes ({it:baselon} and {it:nborlon}) from -180 to 180. {pstd} {cmd:geonear} computes geodetic distances, i.e. the length of the shortest curve between two points along the surface of a mathematical model of the earth. {stata ssc des geodist:geodist} is a standalone implementation of {cmd:geonear}'s distance routines and is available from SSC. By default, {cmd:geonear} calculates distances on a sphere but ellipsoidal distances can be requested (slower). All distances are in kilometers and all distance variables are prefixed with "{it:km_to_}" unless the {opt mi:les} option is specified (the prefix changes to "{it:mi_to_}"). {pstd} {cmd:geonear} uses a divide and conquer strategy to significantly reduce the number of distances that must be computed. The approach involves splitting base locations into progressively smaller geographic regions while at the same time safely reducing the set of potential neighbors for each region. Distances are calculated only when the number of pairwise combinations of {it:baseid} and {it:nborid} falls below a threshold value (adjustable using the {opt o:ps(#)} option). Generally, there is about the same number of base locations per region which means that {cmd:geonear} will predictably find the nearest neighbors in a linear time even though the size of the overall problem increases exponentially. In other words doubling the size of the base location set {it:{hi:and}} doubling the size of the neighbor location set simply doubles the number of regions and thus doubles the run time. {pstd} By default, {cmd:geonear} operates in wide form mode (see {opt wid:e} option). In this mode, {cmd:geonear} creates two variables for each neighbor requested. The first variable identifies the nearest neighbor (using {it:nborid} identifiers). The second variable stores the distance to that neighbor from the {it:baseid} location. {pstd} In long form mode (see {opt lo:ng} option), {cmd:geonear} returns one observation per neighbor found. The {opt near:count(#)} option can be used to request a specific number of neighbors per {it:baseid}. The {opt wit:hin(#)} option can be used to request all neighbors within a distance of {it:#} from each {it:baseid}. You can combine {opt wit:hin(#)} with {opt near:count(#)} and {cmd:geonear} will return any {it:nborid} that satisfies either condition. {title:Options} {dlgtab:Main} {phang} {opt wid:e} is used to request that {cmd:geonear} return results in wide form. In this mode, {cmd:geonear} creates variables to identify neighbors and their distance to the {it:baseid}. If {opt nea:rcount(#)} is omitted, the nearest neighbor is identified by the {it:nid} and {it:km_to_nid} variables. Otherwise, the naming follows the following pattern: {p 12 12 2} {it:nid1{space 2}km_to_nid1{space 2}nid2{space 2}km_to_nid2{space 2}[...]{space 2}nid#{space 2}km_to_nid#} {pmore} The {opt g:enstub(nidstub)} option can be used to specify an alternative stubname to {it:nid}. {phang} {opt lo:ng} is used to request results in long form, one observation per neighbor found. {cmd:geonear} returns the variables {it:baseid}, {it:nborid} and {it:km_to_nborid}, with as many observations per {it:baseid} as there are neighbors found. Because of the output format, {it:baseid} and {it:nborid} must have different names. Both {opt wit:hin(#)} and {opt near:count(#)} can be used to target neighbors. {phang} {opt nea:rcount(#)} is used to request a specific number of nearest {it:nborid} per {it:baseid}. If omitted, {cmd:geonear} finds the nearest neighbor for each {it:baseid}. If there is a tie for the #th {it:nborid} position, the neighbor is chosen by ascending order of {it:nborid}. {phang} {opt ign:oreself} is used when {it:baseid} and {it:nborid} identify the same set of locations. With this option, {cmd:geonear} ignores distances when {it:{bind:nborid == baseid}} (as such distance would be zero) and thus ensures that the nearest neighbor is a different entity. {phang} {opt o:ps(#)} specifies the maximum {it:#} number of distances to be calculated per region. Within a region, the number of distances to be calculated is determined by {bind:{it:(nbases * nnbors)}}, where {it:nbases} is the count of base locations within the region and {it:nnbors} is the number of potential neighbors. The default is 15,000 for results in wide form and 7,000 for results in long form. {phang} {opt g:enstub(str)} is only available in wide form mode. It is used to specify an alternative stubname to {it:nid} for the nearest neighbor variables. {phang} {opt wit:hin(#)} is only available in long form mode. It is used to request all neighbors within a distance {it:#} of each {it:baseid}. If the {opt mi:les} option is specified, {opt wit:hin(#)} is considered specified in miles. This option can be combined with {opt nea:rcount(#)} to request a minimum number of nearest neighbors per {it:baseid}. Combined with {opt nea:rcount(0)}, {it:baseid}s are excluded from the results if they have no neighbors within the specified distance. {phang} {opt li:mit(#)} is only available in long form mode. It can be used to limit the number of neighbor observations per {it:baseid} when {opt wit:hin(#)} is used. This option is offered mostly as a convenience. It does not affect the number of distances calculated; it simply drops neighbor observations that exceed the count. {dlgtab:Distances} {phang} {opt mi:les} requests that distances be returned in miles instead of kilometers. It also indicates that {opt wit:hin(#)} is in miles. {phang} {opt ra:dius(#)} allows for a user-specified radius of the earth in km. If omitted, a radius of 6371km is used. This is the earth's mean radius (see {browse "http://en.wikipedia.org/wiki/Earth_radius#Mean_radii"}). {phang} {opt e:llipsoid} indicates that distances should be calculated using the WGS 1984 reference ellipsoid. Calculating ellipsoidal distances requires more computational power; you should expect a significant penalty in terms of total run time. {phang} {opt a(#)} is used to specify the length of the semi-major axis parameter in meters for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specify {opt a(6377397.155)}. {phang} {opt f(#)} is used to specify the flattening ratio parameter for a different ellipsoid. For example, to use the Bessel 1841 reference ellipsoid, specify {opt f(299.1528128)}. {dlgtab:Reporting} {phang} {opt rep:ort(#)} is used to control the frequency of status reporting. By default, a status report line is displayed every 10 seconds. Each report line includes current region count, number of base locations within the region, number of neighbor locations considered, total number of distances calculated {bind:(ops = base locs x nbor locs),} cumulative number of distances, percent done, and remaining time. {title:Accuracy and limitations} {pstd} By default, {cmd:geonear} computes great-circle distances using the Haversine formula. {cmd:geonear} uses double precision arithmetic throughout and the formula provides submillimeter accuracy for all distances except within 10 meters of antipodal locations. {cmd:geonear} uses the Vincenty (1975) equations when computing distances on a reference ellipsoid. According to Thomas and Featherstone (2005), these are thought to maintain submillimeter accuracy between all locations. However, the Vincenty algorithm cannot accurately calculate distances for near-antipodal points. In such cases, the distances are undefined and set to missing. See {browse "http://geographiclib.sourceforge.net/cgi-bin/Geod"} for a calculator that does not have this limitation. In either case, since {cmd:geonear} seeks nearest neighbors, the issue of near-antipodal locations is not likely to matter. {pstd} Most longitude and latitude datasets are based on the WGS 1984 datum, which is what GPS devices and Google Earth/Map use. There are other datums and it is important to understand that distances will not be accurate if the base and neighbor locations are not expressed in the same reference system. Here is a {browse "http://robertpicard.com/stata/_MG_2342.jpg":photo of my GPS} lying on the Prime Meridian line at the Royal Observatory in Greenwich, England. If you use a distance calculator, you will see that it is about 100 meters to the west of the Reference Meridian of WGS 1984 ({stata geodist 51.47803 -0.00145 51.47803 0}). {title:Examples} {pstd} Simulate a dataset of 2000 Census Block Group centroids for Colorado. {cmd:.} {stata clear} {cmd:.} {stata set seed 123456} {cmd:.} {stata set obs 3278} {cmd:.} {stata gen bgid = _n} {cmd:.} {stata gen double bglat = 37 + (41 - 37) * uniform()} {cmd:.} {stata gen double bglon = -109 + (109 - 102) * uniform()} {cmd:.} {stata tempfile bg} {cmd:.} {stata save "`bg'"} {pstd} Create a new dataset that contains locations of cell towers in the State of Colorado. {cmd:.} {stata clear} {cmd:.} {stata set obs 1000} {cmd:.} {stata gen ctid = _n} {cmd:.} {stata gen double ctlat = 37 + (41 - 37) * uniform()} {cmd:.} {stata gen double ctlon = -109 + (109 - 102) * uniform()} {cmd:.} {stata tempfile cell} {cmd:.} {stata save "`cell'"} {pstd} To find the nearest neighbor for each of these cell towers: {cmd:.} {stata geonear ctid ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ignoreself} {pstd} To find all towers within 50km of each tower: {cmd:.} {stata use "`cell'", clear} {cmd:.} {stata rename ctid ctid0} {cmd:.} {stata geonear ctid0 ctlat ctlon using "`cell'", n(ctid ctlat ctlon) ign long within(50) near(0)} {pstd} To find all cell towers within 5km from each Census Block Group centroids: {cmd:.} {stata use "`bg'", clear} {cmd:.} {stata geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5) near(0) long} {pstd} To find the nearest cell tower as well as all cell towers within 5km from each Census Block Group centroids: {cmd:.} {stata use "`bg'", clear} {cmd:.} {stata geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) within(5) long} {cmd:.} {stata "by bgid: gen N = _N"} {cmd:.} {stata tab N} {pstd} To find the nearest 3 Census Block Group centroids for each cell tower in wide form: {cmd:.} {stata use "`cell'", clear} {cmd:.} {stata geonear ctid ctlat ctlon using "`bg'", n(bgid bglat bglon) wide near(3) genstub(bgid)} {pstd} The following example finds all Census Block Group centroids that fall within each cell tower's Voronoi (a.k.a. Thiessen) polygon. {cmd:.} {stata use "`bg'", clear} {cmd:.} {stata geonear bgid bglat bglon using "`cell'", n(ctid ctlat ctlon) genstub(ctid)} {cmd:.} {stata sort ctid bgid} {cmd:.} {stata "by ctid: gen N = _N"} {cmd:.} {stata tab N} {title:References and acknowledgements} {pstd} Many thanks to Chris Veness for the best web pages on how to compute geodetic distances: {browse "http://www.movable-type.co.uk/scripts/latlong-vincenty.html"} {browse "http://www.movable-type.co.uk/scripts/latlong.html"} {pstd} The definition of the World Geodetic System 1984 is available from {browse "http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf"} {pstd} See Appendix A.1 for a list of reference ellipsoids. {pstd} C. F. F. Karney, Geodesics on an ellipsoid of revolution, Feb. 2011; preprint {browse "http://arxiv.org/abs/1102.1215":arxiv:1102.1215.} {pstd} R. W. Sinnott, "Virtues of the Haversine", {it:Sky and Telescope} 68 (2), 159 (1984). Thanks to the University of Michigan's Shapiro Science Library. {pstd} C. M. Thomas and W. E. Featherstone, Validation of Vincenty's Formulas for the Geodesic Using a New Fourth-Order Extension of Kivioja's Formula, {it:J. Surv. Engrg.} Volume 131, Issue 1, pp. 20-26 (February 2005), available for download from: {browse "http://www.cage.curtin.edu.au/~will/thomas-featherstone.pdf"} {pstd} Vincenty, T. (1975) Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations, {it:Survey Review} 22(176): 88-93 is available from: {browse "http://www.ngs.noaa.gov/PUBS_LIB/inverse.pdf"} {title:Author} {pstd}Robert Picard{p_end} {pstd}picard@netbox.com{p_end}