Multivariate Yatchew Test
Let $D$ be a vector of $K$ random variables and let $g(D)=E[Y|D]$. Denote with $\|.,.\|$ the Euclidean distance between two vectors. The null hypothesis of the multivariate test is $g(D)=\alpha_0+A'D$, with $A=(\alpha_1,...,\alpha_K)$, for $K+1$ real numbers $\alpha_0, \alpha_1, ..., \alpha_K$. This means that, under the null, $g(.)$ is linear in $D$. Following the same logic as the univariate case, in a dataset with $N$ i.i.d. realisations of $(Y,D)$ we can approximate the first difference $\Delta\varepsilon$ by $\Delta Y$, provided that $g(.)$ is evaluated at consecutive observations whose values of $D$ are close to each other. The program therefore runs a nearest neighbor algorithm to find a sequence of observations in which the Euclidean distance between consecutive positions is as small as possible.
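To see why the ordering matters, a short derivation under the null may help. It is only a sketch: the error term $\varepsilon = Y - g(D)$ and the subscripts $i$ and $i-1$ for consecutive observations in the sorted data are notation assumed here, and $\|A\|$ stands for the Euclidean norm of $A$.

$$
\Delta Y_i = Y_i - Y_{i-1} = g(D_i) - g(D_{i-1}) + \Delta\varepsilon_i = A'(D_i - D_{i-1}) + \Delta\varepsilon_i,
\qquad
|A'(D_i - D_{i-1})| \le \|A\| \, \|D_i, D_{i-1}\|.
$$

The second expression is the Cauchy-Schwarz inequality: the closer two consecutive rows of $D$ are, the smaller the gap between $\Delta Y_i$ and $\Delta\varepsilon_i$.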
The program follows a very simple nearest neighbor approach:
- collect the Euclidean distances between all possible unique pairs of rows of $D$ in the matrix $M$, where $M_{n,m}=\|D_n,D_m\|$ with $n,m\in\{1,...,N\}$;
- set up the queue $Q=\{1,...,N\}$, the (empty) path vector $I=\{\}$ and the starting index $i=1$;
- remove $i$ from $Q$ and find the column index $j$ of $M$ such that $M_{i,j}=\min_{c\in Q}M_{i,c}$;
- append $j$ to $I$ and repeat the previous step with $i=j$ until $Q$ is empty.
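The four steps above can be sketched in Python as follows. This is a minimal illustration, not the program's own code: the function name `nearest_neighbor_path`, the 0-based indices, the tie-breaking rule and the choice to record the starting row as the first entry of the path are all assumptions.

```python
import math


def nearest_neighbor_path(rows, start=0):
    """Greedy nearest-neighbor ordering of the rows of D (0-based indices)."""
    n = len(rows)
    # Full matrix of pairwise Euclidean distances between rows.
    dist = [[math.dist(rows[a], rows[b]) for b in range(n)] for a in range(n)]
    queue = set(range(n))
    # Assumption: the starting row is stored as the first stop of the path.
    path = [start]
    i = start
    queue.remove(i)
    while queue:
        # Closest row still in the queue; ties go to the lowest index.
        j = min(queue, key=lambda c: (dist[i][c], c))
        path.append(j)
        queue.remove(j)
        i = j
    return path


# Toy data with K = 2: two rows near the origin, two near (1, 1).
points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (0.9, 1.0)]
order = nearest_neighbor_path(points)
print(order)
```

Because each step only looks at the closest remaining row, the path is a greedy heuristic and need not be the shortest possible route through all rows.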
To improve efficiency, the program collects only the $N(N-1)/2$ Euclidean distances corresponding to the lower triangle of the matrix $M$ and chooses $j$ such that $M_{i,j}=\min_{c\in Q}\left(1\{c<i\}M_{i,c}+1\{c>i\}M_{c,i}\right)$. The output of the algorithm, i.e. the vector $I$, is a sequence of row numbers such that the distance between consecutive rows $D_i$ is minimized. The program also uses two refinements suggested in Appendix A of Yatchew (1997):
- The entries in $D$ are normalized to $[0,1]$;
- The algorithm is applied to sub-cubes, i.e. partitions of the $[0,1]^K$ space, and the full path is obtained by joining the endpoints of the subpaths.
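The lower-triangle storage and the $[0,1]$ normalization can be illustrated together with a small Python sketch. All names here (`normalize_columns`, `lower_triangle`, `lookup`, `path_from_triangle`) are hypothetical, and min-max rescaling of each column is an assumption about how the normalization is done; the sub-cube refinement is left out.

```python
import math


def normalize_columns(rows):
    """Rescale each column of D to [0, 1] (assumed: min-max rescaling)."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - min(col) for col in cols]
    return [
        tuple((x - lo) / span if span > 0 else 0.0
              for x, lo, span in zip(row, lows, spans))
        for row in rows
    ]


def lower_triangle(rows):
    """Keep only the N(N-1)/2 distances between rows a and b with a > b."""
    return {(a, b): math.dist(rows[a], rows[b])
            for a in range(len(rows)) for b in range(a)}


def lookup(tri, i, c):
    """Distance between rows i and c, read from the lower triangle."""
    return tri[(i, c)] if c < i else tri[(c, i)]


def path_from_triangle(rows, start=0):
    """Greedy nearest-neighbor path that only reads the lower triangle."""
    tri = lower_triangle(rows)
    queue = set(range(len(rows))) - {start}
    path, i = [start], start
    while queue:
        j = min(queue, key=lambda c: (lookup(tri, i, c), c))
        path.append(j)
        queue.remove(j)
        i = j
    return path


raw = [(10.0, 200.0), (20.0, 400.0), (11.0, 200.0), (19.0, 400.0)]
scaled = normalize_columns(raw)
tri = lower_triangle(scaled)
path = path_from_triangle(scaled)
print(len(tri), path)
```

Rescaling first keeps a variable measured on a large scale (the second column of `raw`) from dominating the Euclidean distance.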
By convention, the program computes $(2\lceil\log_{10}N\rceil)^K$ sub-cubes, where each univariate partition is defined by grouping observations in $2\lceil\log_{10}N\rceil$ quantile bins. If $K=2$, the user can visualize in a graph the exact path across the normalized $D_i$s by running the command with the option path_plot.
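As a rough illustration of the sub-cube convention, the Python sketch below computes the number of bins per dimension and assigns each row to a sub-cube. The function names are hypothetical, and the rank-based equal-count binning is an assumption about how the quantile bins are formed; the sketch also assumes $N>1$.

```python
import math


def bins_per_dimension(n_obs):
    """Number of quantile bins along each dimension: 2 * ceil(log10(N))."""
    return 2 * math.ceil(math.log10(n_obs))


def quantile_bin(column, n_bins):
    """Assign each value to one of n_bins equal-count bins by rank.

    Assumption: ties are broken by the position of the observation.
    """
    order = sorted(range(len(column)), key=lambda r: (column[r], r))
    bins = [0] * len(column)
    for rank, r in enumerate(order):
        bins[r] = rank * n_bins // len(column)
    return bins


def subcube_ids(rows):
    """Label each row with a tuple of K bin indices, one per dimension."""
    n_bins = bins_per_dimension(len(rows))
    per_dim = [quantile_bin(col, n_bins) for col in zip(*rows)]
    return list(zip(*per_dim))


# With N = 500 and K = 2 the convention gives 6 bins per dimension.
n_bins_500 = bins_per_dimension(500)
n_cubes_500 = n_bins_500 ** 2

rows = [(0.3, 0.9), (0.1, 0.2), (0.4, 0.1), (0.2, 0.8)]
ids = subcube_ids(rows)
print(n_bins_500, n_cubes_500, ids)
```

Rows that share the same tuple of bin indices fall in the same sub-cube, so the nearest-neighbor search can be run within each group before the subpaths are joined.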
Once the dataset is sorted by I, the program resumes from step (2) of the univariate case.
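Sorting by the path and taking first differences of the outcome can be sketched as below. The helper name `first_differences_along_path` and the 0-based path are assumptions, and the sketch stops at the differences $\Delta Y$; what the univariate routine does with them afterwards is not reproduced here.

```python
def first_differences_along_path(y, path):
    """Reorder y by the path I and difference consecutive entries."""
    ordered = [y[r] for r in path]
    return [b - a for a, b in zip(ordered, ordered[1:])]


y = [1.0, 5.0, 1.5, 4.0]
path = [0, 2, 3, 1]  # example output of a nearest-neighbor ordering
delta_y = first_differences_along_path(y, path)
print(delta_y)
```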