-------------------------------------------------------------------------------
help for nearstat 
-------------------------------------------------------------------------------

Title

nearstat --- Calculates distances, generates distance-based variables, and exports distance matrix to files.

    +-------------------+
----+ Table of Contents +-----------------------------------------------

    Syntax
    General description
    Description of the options
    Saved scalars
    Examples
    References
    Citation
    Author information
-------------------------------------------------------------------------------

Syntax

nearstat varlist1 [if] [in], near(varlist2) distvar(newvar1) [ Other options ]

options                Description
-------------------------------------------------------------------------
Options

near(varlist2) indicate latitudes and longitudes for the neighboring observations or areal units; this option is required

distvar(newvar1) calculate distance to nearest neighbors; this option is required

kth(#) specify the order of the nearest neighbors to which distance needs to be calculated; default is kth(1)

nid(varname newvar2) request the identity or name of the specified nearest neighbors

cart request distance calculation for Cartesian coordinates.

r(#) indicate the earth radius value to be used in case of spherical coordinates; default is r(6371.009), i.e., 6371.009 km

contvar(varlist3) specify a list of variables to be used in calculation of descriptive statistics

statvar(newvarlist) specify a list of new variable names to hold calculated descriptive statistics or to hold the variable values corresponding to nearest neighbors

statname(stats) indicate the descriptive statistic to be calculated, where stats is min, max, mean, std, or sum

knn(#) request statistics for the # nearest neighbors, where # >= 2

dband(numlist) specify a distance band when requesting statistics, a neighbor count variable, or a dummy variable for neighbors falling in the specified distance band

allnei request descriptive statistics for all neighbors

dname(newvar3) request a dummy variable equal to 1 if the specified nearest neighbors are within the distance band specified with dband()

ncount(newvar4) request a variable holding the number of neighbors falling in ring dband()

incdist(newvar5) request a variable holding incremental distance to reach a metropolitan area with a specified population threshold

atpop(#) specify the metropolitan area population threshold

alpha(#) request distance weighted statistics

iidist(newvar6) request a variable holding one-to-one distance between input and near features

minmaxd(newvar7 mmtype) request a variable holding the minimum or maximum distance for each observation, where mmtype must be min or max

expdist(distname) export a matrix containing the distances between each input feature and all near features

expto(Stata|tab|csv|Mata) specify a file format for the distance matrix

sparse(tab|csv|Mata) export the distance matrix in sparse form

nozero remove zeros from sparse distance matrix

des(des_option,[des_suboption]) display descriptive statistics for the distances between input and near features, distance to the kth-nearest neighbors (k=1,..., N-1), and neighbor count within a distance band, where des_option=stat requests Min, Mean, Std, Max, and Sum and des_suboption=quart requests quartiles

replace overwrite existing variables and files

favor(speed|space) favor speed or space

-------------------------------------------------------------------------

    +-------------+
----+ Description +------------------------------------------------------

Using Mata, Stata's matrix programming language, nearstat calculates distances, generates distance-based variables, and exports the distances to a Stata matrix, a Mata file, or a text file. To generate the variables, nearstat performs a series of computational tasks for each observation in a Stata dataset. These include, but are not limited to: calculating distance-weighted descriptive statistics over all neighbors, the nearest neighbors, or the neighbors falling in a specified distance band; calculating distance to nearest neighbors; counting the number of neighbors falling in a specified distance band; and determining whether a specified (e.g., first, second, third) nearest neighbor falls within a certain distance band. Distance is calculated as the Great Circle distance when spherical coordinates are supplied and as the crow-fly (straight-line) distance when Cartesian coordinates are supplied.

Spherical coordinates can be obtained from any shapefile using the Stata command shp2dta, written by Kevin Crow (see shp2dta if installed). If it is not installed, shp2dta can be obtained from the SSC archive by typing ssc install shp2dta.

Topologically Integrated Geographic Encoding and Referencing (TIGER/Line) shapefiles for metropolitan areas, counties, census tracts, etc., are available from the U.S. Census Bureau's website (see the references below). One way to obtain Cartesian or projected coordinates is to project the shapefiles in the ArcGIS software using an appropriate projected coordinate system.

nearstat requires Stata 10.1 or higher.

    +---------+
----+ Options +----------------------------------------------------------

near(varlist2) specifies the variables holding latitudes and longitudes for the observations or areal units to which distance needs to be calculated. This option is required.

varlist1 holds latitudes and longitudes for the observations or areal units from which distance needs to be calculated.

Observations or areal units in varlist1 are referred to as input features and those in varlist2 are referred to as near features.

Note 1: varlist1 and varlist2 may contain the same areal units. For example, you might want to calculate the distance from each county to its nearest neighboring county or to obtain for each county the average per capita income for the eight nearest counties. In this case, the two variables supplied in varlist1 would be exactly the same as those supplied in varlist2.

Different areal units are also allowed. For instance, you might want to calculate the distance from each rural county to its nearest metropolitan area. In this case, varlist1 would hold the latitudes and longitudes of the rural counties while varlist2 would contain the population weighted latitudes and longitudes of the metropolitan areas. Although the areal units supplied in varlist1 and varlist2 may be different, their coordinates must be of the same type.

distvar(newvar1) specifies the name of a variable for holding distance from each input feature to its nearest neighbor specified with the kth(#) option. This option is required.

kth(#) indicates the order of the nearest neighbors to which distance is to be calculated. For example, specifying kth(2) indicates the second nearest neighbors. The default is to calculate distance to the first nearest neighbors, i.e., kth(1).

nid(varname newvar2) requests the identification numbers or names of the nearest neighbors specified with kth(#). This option requires two variable names. The first is the name of the identifier variable for the near features. The second is the name of a new variable to hold the requested identification numbers or names. If varlist1 = varlist2, there is only one identifier variable for both input and near features. If an observation has two or more equidistant neighbors at the order considered, nearstat reports the first one encountered.

cart indicates that coordinates are projected and that crow-fly distance should be calculated using the Pythagorean formula: dij=sqrt((xj-xi)^2+(yj-yi)^2). When cart is specified, the distance unit is the same as that of the projected coordinates. Option cart may be specified if the coordinates are in arbitrary digitizing units.
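The Pythagorean formula above can be checked by hand with ordinary Stata commands. The sketch below is only an illustration of the arithmetic: the variable names xi, yi, xj, yj and the three coordinate pairs are hypothetical, and the code does not call nearstat.

```stata
* Hypothetical projected coordinates for three input/near pairs (arbitrary units)
clear
input xi yi xj yj
0 0 3 4
1 1 1 6
2 5 8 13
end

* Crow-fly distance: dij = sqrt((xj-xi)^2 + (yj-yi)^2)
generate double dij = sqrt((xj - xi)^2 + (yj - yi)^2)
list
```

The resulting distances are in the same unit as the coordinates, which is why no radius is involved when cart is specified.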

By default, distance is calculated for spherical (non-projected) coordinates. In that case, nearstat calculates the "Great Circle" distance using the Haversine formula, which is numerically better conditioned for small distances than the spherical Law of Cosines.

The Haversine formula for the distance between two points is as follows:

Haversine Formula

    dlong = long2 - long1
    dlat  = lat2 - lat1
    z     = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlong/2)
    c     = 2 * arcsin(min(1,sqrt(z)))
    dist  = r * c,

where c is the Great Circle distance in radians and r is the radius of the earth. dist is in the same unit as r. By default, r is set to 6371.009 km, which the International Union of Geodesy and Geophysics (IUGG) takes as the Earth's mean radius. The corresponding IUGG value in miles is 3958.761, which users can supply with the r() option to obtain distance in miles.
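As a rough check of the Haversine formula, the sketch below evaluates it with Stata scalars for two hypothetical points one degree of longitude apart on the equator. The scalar names (radius, ang, d2r, and so on) are illustrative assumptions, and the degree-to-radian conversion is made explicit because Stata's trigonometric functions expect radians. This is not nearstat's own code, which runs in Mata.

```stata
* Two hypothetical points in decimal degrees
scalar lat1  = 0
scalar long1 = 0
scalar lat2  = 0
scalar long2 = 1

* Default Earth radius used by nearstat, in kilometers
scalar radius = 6371.009

* Convert degrees to radians before applying sin() and cos()
scalar d2r   = _pi/180
scalar dlat  = (lat2 - lat1)*d2r
scalar dlong = (long2 - long1)*d2r

* Haversine terms: z, then the central angle in radians, then distance
scalar z    = sin(dlat/2)^2 + cos(lat1*d2r)*cos(lat2*d2r)*sin(dlong/2)^2
scalar ang  = 2*asin(min(1, sqrt(z)))
scalar dist = radius*ang
display dist
```

For this pair the result should equal radius*_pi/180, about 111.195 km. Stata resolves a bare name to a variable before a scalar, so the sketch is best run with no like-named variables in memory.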

Spherical coordinates must be measured in decimal degrees. If your coordinates are in a degrees, minutes, and seconds format, you can convert them into decimal degrees using the following formula:

Decimal value = Degrees + (Minutes/60) + (Seconds/3600)

For instance, a longitude of 122 degrees 45 minutes 45 seconds east is equal to 122.7625 degrees east.
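The conversion can be applied to a whole column of coordinates with generate. In this sketch the variable names deg, mnt, sec, and decdeg are hypothetical, and the single data row reproduces the 122 degrees 45 minutes 45 seconds case.

```stata
* Hypothetical coordinate stored as degrees, minutes, seconds
clear
input deg mnt sec
122 45 45
end

* Decimal value = Degrees + (Minutes/60) + (Seconds/3600)
generate double decdeg = deg + mnt/60 + sec/3600
list
```

Storing the result as double avoids losing precision in the fractional part of the coordinate.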

r(#) indicates the value to be used for the Earth radius in the case of spherical coordinates. Because the Earth is not a perfect sphere, it has no single true radius, and the term Earth radius may refer to any of several fixed or mean radii. Fortunately, these radii differ from one another by well under one percent, making the choice of # less of a concern.

r(#) and cart may not be combined.

contvar(varlist3) specifies the variables to be used in calculating the statistics or the variables whose values associated with the nearest neighbors need to be reported.

Note 2: varlist3 must have the same number of valid (non-missing) observations as varlist2.

statvar(newvarlist) provides a list of names for the variables to hold the calculated descriptive statistics or to hold the values of the variables in varlist3 corresponding to the nearest neighbors. Specify one variable name for each variable in varlist3.

Note 3: Options contvar() and statvar() must be combined.

statname(stats) indicates the statistic to be calculated. stats may be min, max, mean, std, or sum. When the variables listed in varlist3 are dummies, mean gives the proportion of neighbors with a value of one, or the percentage if multiplied by 100.

Note 4: If contvar() and statvar() are specified, but statname() is not, then each variable listed in newvarlist will contain the values of the corresponding variable listed in varlist3 associated with the nearest neighbors, given the order specified with kth() (see examples 2 and 4).

knn(#) indicates the number of nearest neighbors to be used when calculating the descriptive statistics. # cannot be less than 2 or greater than the number of valid observations contained in varlist2.
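To make the idea of a statistic over the # nearest neighbors concrete, the sketch below computes the mean of a variable over the three nearest neighbors of a single input feature. The distances and income values are made up, and the sort-then-subset approach is just one way to reproduce the calculation by hand; it is not taken from nearstat's source.

```stata
* Hypothetical neighbors of one input feature: distance and income
clear
input dist income
50 30
10 40
30 20
80 90
20 60
end

* Order neighbors from nearest to farthest, then average over the first 3
sort dist
summarize income in 1/3, meanonly
display r(mean)
```

With knn(3) and statname(mean), nearstat would store a value of this kind for every input feature rather than for one.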

dband(numlist) indicates the distance band to be used with option dname() and/or ncount(), or requests that statistics be calculated for near features falling in the ring specified with dband().

Note 5: When dband() is specified, by default, the distance unit is assumed to be kilometers, but that can be overridden with option cart or r(#).

allnei requests that statistics be calculated using all near features. allnei and knn() may not be combined.

alpha(#) requests distance-weighted statistics. For instance, if alpha(1) is specified, nearstat will divide the values of the variables listed in varlist3 by distance prior to calculating the statistics. Specifying alpha(2) entails dividing by distance squared.
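The weighting described above can be sketched with two generate statements. The variable names and values are hypothetical, and the sketch assumes the weighting is a plain division by distance raised to the power alpha, as the description states; any further normalization nearstat may apply is not documented here.

```stata
* Hypothetical neighbor values and their distances from one input feature
clear
input value dist
10 1
20 2
30 5
end

* alpha(1): divide by distance; alpha(2): divide by distance squared
generate double w1 = value/dist
generate double w2 = value/dist^2

* Mean of the weighted values under each choice of alpha
summarize w1, meanonly
display r(mean)
summarize w2, meanonly
display r(mean)
```

A larger alpha shrinks the contribution of distant neighbors more quickly, so the second mean is smaller than the first for these data.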

dname(newvar3) provides the name for a dummy variable equal to one if a nearest neighbor specified with kth() is within the distance band specified with dband() and zero otherwise.

ncount(newvar4) specifies the name of a variable to hold, for each observation, the number of neighbors falling in the distance band specified with dband().
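The count stored by ncount() can be pictured as follows for one input feature. The distances are hypothetical, and the band limits are treated as exclusive at the lower end and inclusive at the upper end, following the notation 0<dij<=9 used in Example 12; that reading is an assumption about the boundaries.

```stata
* Hypothetical distances from one input feature to five near features
clear
input dist
0
40
120
150
310
end

* Number of neighbors in the band dband(0 150), read as 0 < dist <= 150
count if dist > 0 & dist <= 150
display r(N)
```

The zero distance (the feature itself, when input and near features coincide) falls outside the band under this reading.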

Note 6: When allnei or knn(#) is specified, specifying dband() implies a request for a neighbor count variable or for a dummy variable. As a result, either ncount() or dname() must be specified.

incdist(newvar5) specifies the name of a variable for holding incremental distance to reach the threshold population specified with atpop() (see Partridge and Rickman, 2008). incdist() and statvar() may not be combined.

atpop(#) specifies a metropolitan area population threshold for which incremental distance needs to be calculated. atpop() and incdist() must be combined.
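One common reading of incremental distance, following the usage in Partridge and Rickman (2008), is the extra distance beyond the nearest metropolitan area of any size that is needed to reach the nearest one meeting the population threshold. The sketch below works through that reading for a single county with made-up data; the variable and scalar names are hypothetical, and whether nearstat computes the quantity in exactly this way is an assumption.

```stata
* Hypothetical metropolitan areas seen from one county
clear
input str8 metro dist pop
"A" 30 120000
"B" 70 650000
"C" 95 2100000
end

* Distance to the nearest metropolitan area of any size
summarize dist, meanonly
scalar d_any = r(min)

* Distance to the nearest metropolitan area with at least 500,000 people
summarize dist if pop >= 500000, meanonly
scalar d_big = r(min)

* Incremental distance under this reading
scalar incd = d_big - d_any
display incd
```

Under this reading, a county whose nearest metropolitan area already meets the threshold would have an incremental distance of zero.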

iidist(newvar6) specifies the name of a variable to hold one-to-one distance between input and near features when they are different but have the same number of non-missing observations. Essentially, this variable holds the diagonal elements of the distance matrix.

minmaxd(newvar7 mmtype) requests that a variable for holding the minimum or maximum distance from each observation to its neighbors be generated. mmtype should be equal to min or max to request the minimum or maximum distance respectively.

expdist(distname) requests that distance between input and near features be exported as a matrix to the permanent file or temporary matrix distname.

expto(Stata|tab|csv|Mata) indicates whether the distance matrix should be exported to a Stata matrix loaded in memory or to a file in a tab delimited, csv, or Mata format. Note that if expto(Stata) is specified, a Stata matrix loaded in memory will be created only if the matrix size does not exceed the matsize limit of your Stata flavor.

sparse(tab|csv|Mata) specifies that the distance matrix be written as a three-column matrix (row, column, value) to a tab delimited, a csv, or a Mata file. expto() and sparse() may not be combined, but you must specify one of them when expdist() is specified.
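The three-column layout can be illustrated by unrolling a small matrix by hand. The 3 x 3 distance matrix below is hypothetical, and the row-by-row ordering and the skipping of zero entries (as with nozero) are assumptions made for the illustration, not a description of the files nearstat writes.

```stata
* Hypothetical symmetric distance matrix for three features
matrix D = (0, 5, 9 \ 5, 0, 4 \ 9, 4, 0)

* Print each nonzero cell as: row  column  value
local kept = 0
forvalues i = 1/3 {
    forvalues j = 1/3 {
        if D[`i',`j'] != 0 {
            display "`i'" _col(6) "`j'" _col(12) D[`i',`j']
            local ++kept
        }
    }
}
display "entries kept: `kept'"
```

The sparse form stores one line per retained cell, which is convenient when the matrix is to be read by other software as a list of pairs.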

nozero specifies that the diagonal zeros be removed from the sparse distance matrix. By default, the diagonal zeros are not removed.

des(des_option,[des_suboption]) requests that descriptive statistics for the distances between input and near features and the distances to the nearest neighbors specified with kth(#) be displayed. des_option is required when des() is specified.

If des_suboption is not specified, then statistics include number of location pairs, minimum, mean, standard deviation, and maximum distance. Otherwise, lower quartile, median or second quartile, and upper quartile will be displayed as well. With or without the des() option specified, these statistics are returned as saved scalars.
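Statistics of the same kind can be reproduced from any list of distances with summarize. The sketch below uses five made-up distances and Stata's built-in percentile definition; whether nearstat uses the same quartile definition is an assumption, so small differences from its displayed quartiles are possible.

```stata
* Hypothetical distances between location pairs
clear
input dist
10
20
30
40
50
end

* detail adds percentiles to the usual N, mean, sd, min, and max
summarize dist, detail
display r(p25) "  " r(p50) "  " r(p75)
```

After summarize, detail the quartiles are available in r(p25), r(p50), and r(p75), alongside r(N), r(mean), r(sd), r(min), and r(max).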

Descriptive statistics for the number of neighbors falling in ring dband() will also be displayed if ncount() is specified.

replace overwrites existing variables newvar1, newvar2, newvar3, newvar4, newvar5, newvar6, newvar7 and any variables listed in newvarlist and existing file distname.

favor(speed|space) instructs nearstat to favor speed or space when performing all the calculations. favor(speed) is the default. This option provides a trade-off between speed and memory use. See [M-3] mata set.

    +---------------+
----+ Saved scalars +----------------------------------------------------

    r(nearest_min)  = Minimum of the distance to the kth nearest neighbors
    r(nearest_max)  = Maximum of the distance to the kth nearest neighbors
    r(nearest_mean) = Average of the distance to the kth nearest neighbors
    r(n_near)       = Number of near features
    r(n_input)      = Number of input features
    r(max_dist)     = Maximum distance between input and near features
    r(Q3_dist)      = Upper quartile distance between input and near features
    r(Q2_dist)      = Median or middle quartile distance between input and near features
    r(mean_dist)    = Average distance between input and near features
    r(Q1_dist)      = Lower quartile distance between input and near features
    r(min_dist)     = Minimum distance between input and near features
    r(Obs)          = Number of location pairs between which distance was calculated

    +----------+
----+ Examples +---------------------------------------------------------

1) Calculate average test score and proportion of nonwhite for the first 3 nearest neighbors using Cartesian coordinates

. nearstat latitude longitude, near(latitude longitude) distvar(distname) ///
      cart contvar(testscore nonwhite) statvar(avtest pctnwhite) knn(3) ///
      statname(mean)

Note that in this case, varlist1 is the same as varlist2

-----------------------------------------------------------------------------

2) Determine the identification number, test score, and race of each observation's nearest neighbor

. nearstat latitude longitude, near(latitude longitude) distvar(distname) ///
      cart contvar(testscore nonwhite) statvar(near_score near_race) ///
      nid(id nei_id) replace

Here option replace is specified to replace the variable distname already created.

-----------------------------------------------------------------------------

3) Calculate distance from each county to the nearest metropolitan area using spherical coordinates

a) First, load the county level data

. use mycountydata

b) Second, merge your metropolitan level data

. merge using mymetrodata

. drop _merge

c) Now you are ready to run nearstat

. nearstat latvar1 longvar1, near(latvar2 longvar2) distvar(distmetro)

Here latvar1 and longvar1 hold latitudes and longitudes of the counties and latvar2 and longvar2 contain population weighted latitudes and longitudes of the metropolitan areas.

-----------------------------------------------------------------------------

4) Calculate distance from each rural county to its nearest metropolitan area and record population of the nearest metropolitan area using spherical coordinates

. nearstat latvar1 longvar1, near(latvar2 longvar2) distvar(distmetro) ///
      contvar(popmetro) statvar(popnear)

Here popmetro is the variable holding population in each metropolitan area and popnear is the name of a variable to hold population in the nearest metropolitan area.

-----------------------------------------------------------------------------

5) Calculate incremental distance to reach a metropolitan area with a population of at least 500,000

. nearstat latvar1 longvar1, near(latvar2 longvar2) distvar(distmetro) ///
      contvar(popmetro) incdist(incd5) atpop(500000)

Here incd5 is the name of a variable for holding the calculated incremental distance

-----------------------------------------------------------------------------

6) Create a variable (nearmetro) holding the population of a county if the county is part of a defined metropolitan area or the population of the nearest metropolitan area if the county is a non-metropolitan one.

In addition to the variables in Example 4, you need a variable holding county population and a dummy variable equal to 1 if the county is part of a metropolitan area and zero otherwise.

a) First, set the variable nearmetro equal to the county population variable:

. gen nearmetro=pop2000 // where pop2000 is a variable holding county population in 2000

b) Second, calculate population in the nearest metropolitan area as in Example 4

c) Third, replace nearmetro values with popnear values if the county is non-metro.

. replace nearmetro=popnear if metro==0 // where popnear is a variable holding population in the nearest metropolitan area

-----------------------------------------------------------------------------

7) Calculate average income for 200 nearest neighbors of each county using spherical coordinates

. nearstat lat long, near(lat long) distvar(distname) contvar(income) ///
      statvar(avincome) statname(mean) knn(200) replace

-----------------------------------------------------------------------------

8) Obtain a dummy variable (dum1_150) equal to 1 if the first nearest neighbor is within 150 kilometers and zero otherwise

. nearstat lat long, near(lat long) distvar(distname) dname(dum1_150) ///
      dband(0 150) replace

-----------------------------------------------------------------------------

9) Obtain a dummy variable (dum3_150m) equal to one if the third nearest neighbor is within 150 miles and zero otherwise

. nearstat lat long, near(lat long) distvar(distvar3) kth(3) r(3958.761) ///
      dname(dum3_150m) dband(0 150)

-----------------------------------------------------------------------------

10) Request a variable (nbnei) that holds (for each observation) the number of neighbors located within a two-mile radius

. nearstat lat long, near(lat long) distvar(mydist) ncount(nbnei) dband(0 2) ///
      r(3958.761)

-----------------------------------------------------------------------------

11) Display descriptive statistics for distance (in miles) between input and near features, assuming spherical coordinates

. nearstat lat long, near(lat long) distvar(mydist) des(stat) r(3958.761)

This command displays a table containing two rows. The second row reports, for example, the maximum distance to the first nearest neighbor, which is the minimum distance (or distance cut-off) needed for every observation to have at least one neighbor.
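The cut-off interpretation can be verified with a few lines of Stata. In the sketch below, d1 is a hypothetical variable of nearest-neighbor distances of the kind stored by distvar(), and the values are made up.

```stata
* Hypothetical nearest-neighbor distances for four observations
clear
input d1
12.4
3.1
27.9
8.0
end

* The largest nearest-neighbor distance is the smallest usable cut-off
summarize d1, meanonly
scalar cutoff = r(max)

* Every observation has at least one neighbor within the cut-off
count if d1 <= scalar(cutoff)
display r(N) " of " _N " observations have a neighbor within " scalar(cutoff)
```

Any cut-off below this value would leave at least one observation without a neighbor, which matters when a distance band is used to define neighborhoods.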

-----------------------------------------------------------------------------

12) Display descriptive statistics for the distances between input and near features and for the number of neighbors falling in the distance band: 0<dij<=9

. nearstat latitude longitude, near(latitude longitude) distvar(distname) ///
      des(stat) db(0 9) ncount(neicount) replace

-----------------------------------------------------------------------------

13) Calculate for each county the proportion of surrounding counties with high poverty rate (poverty rate >=20%) in 2000

a) Create a dummy variable (pov20_00) equal to one if a county has a poverty rate >=20% and zero otherwise

. gen pov20_00=(povrt00>=20)

b) Calculate the proportion variable (neipov00) for which eight neighbors are considered.

. nearstat latitude longitude, near(latitude longitude) distvar(nearestnei) ///
      contvar(pov20_00) statvar(neipov00) statname(mean) knn(8)

-----------------------------------------------------------------------------

14) Calculate distance from each observation to the District of Columbia (DC), for example to analyze housing values

a) First, create a one-observation dataset with the latitude and longitude of DC:

. clear

. set obs 1

. gen lat_dc=38.8964

. gen lon_dc=-77.0262

. save dc_coord

b) Second, load your housing value dataset:

. use mydataset, clear

c) Third, merge your data with the DC coordinates:

. merge using dc_coord

d) Finally, calculate distance from each observation to DC:

. nearstat lat long, near(lat_dc lon_dc) distvar(distodc) // where lat and long are variables holding the housing coordinates

-----------------------------------------------------------------------------

15) Generate a variable (called dmax) for holding the maximum distance for each observation

. nearstat latitude longitude, near(latitude longitude) distvar(nearestnei) minmaxd(dmax max)

-----------------------------------------------------------------------------

References

de Smith, M.J., M.F. Goodchild, and P.A. Longley. 2007. Geospatial Analysis: A Comprehensive Guide to Principles, Techniques, and Software Tools. Matador: Leicester, UK. http://www.spatialanalysisonline.com

Gould, W. 2007. "Mata Matters: Subscripting". The Stata Journal 7: 106-116.

Gould, W. 2006. "Mata Matters: Creating New Variables--Sounds Boring, Isn't". The Stata Journal 6: 112-123. Available at http://www.stata-journal.com/article.html?article=pr0021

Jeanty, P.W., M. Partridge, and E. Irwin. 2010. Estimation of a Spatial Simultaneous Equation Model of Population Migration and Housing Price Dynamics. Regional Science and Urban Economics 40(5): 343-352.

Partridge, M. and D.S. Rickman. 2008. Distance from Urban Agglomeration Economies and Rural Poverty. Journal of Regional Science 48(2): 285-310.

U.S. Census Bureau Geographic Information Systems FAQ. What is the best way to calculate the distance between 2 points? http://www.movable-type.co.uk/scripts/gis-faq-5.1.html

U.S. Census Bureau. 2012. Cartographic Boundary Files. http://www.census.gov/geo/www/cob/bdy_files.html

U.S. Census Bureau. 2011. Using the TIGER/Line Shapefiles and Census Data. http://www.census.gov/geo/www/tiger/wwtl/wwtl.html

U.S. Census Bureau. 2011. TIGER Products. http://www.census.gov/geo/www/tiger/index.html#tl

U.S. Census Bureau. 2010. Census 2000 Gazetteer Files. http://www.census.gov/geo/www/gazetteer/places2k.html

Wikipedia. 2012. Great-Circle Distance. http://en.wikipedia.org/wiki/Great-circle_distance

---------. 2012. Earth Radius. http://en.wikipedia.org/wiki/Earth_radius#Mean_radii

Citation

Thanks for citing nearstat as follows:

Jeanty, P.W., 2010. nearstat: Stata module to calculate distances, generate distance-based variables, and export distance matrix to text files. Available from http://ideas.repec.org/c/boc/bocode/s457110.html.

Author

P. Wilner Jeanty, The Kinder Institute for Urban Research/Hobby Center for the Study of Texas, Rice University, Houston, Texas

Email to pwjeanty@rice.edu

N.B.: Previous versions of nearstat were written when the author was a Research Economist with the Dept. of Agricultural, Environmental, and Development Economics, The Ohio State University

Also see

Online: vincenty, nearest, distmatch (if installed)