Multivariate Yatchew Test
Let $\textbf{D}$ be a vector of $K$ random variables. Let $g(\textbf{D}) = E[Y|\textbf{D}]$. Denote by $||\cdot,\cdot||$ the Euclidean distance between two vectors. The null hypothesis of the multivariate test is $g(\textbf{D}) = \alpha_0 + A'\textbf{D}$, with $A = (\alpha_1,..., \alpha_K)'$, for $K+1$ real numbers $\alpha_0$, $\alpha_1$, ..., $\alpha_K$. This means that, under the null, $g(.)$ is linear in $\textbf{D}$. Following the same logic as in the univariate case, in a dataset with $N$ i.i.d. realisations of $(Y, \textbf{D})$ we can approximate the first difference of the errors, $\Delta \varepsilon$, by $\Delta Y$, because $g(.)$ evaluated at consecutive, nearby observations approximately cancels out in the difference. The program runs a nearest neighbor algorithm to find the sequence of observations such that the Euclidean distance between consecutive observations is minimized.
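Spelling out this approximation, with $\varepsilon = Y - g(\textbf{D})$ and indices referring to the data after the nearest-neighbor reordering described below:

$$
\Delta Y_i = Y_i - Y_{i-1} = \underbrace{g(\textbf{D}_i) - g(\textbf{D}_{i-1})}_{\approx\, 0 \text{ if } g \text{ is smooth and } ||\textbf{D}_i,\textbf{D}_{i-1}|| \text{ is small}} + \varepsilon_i - \varepsilon_{i-1} \;\approx\; \Delta \varepsilon_i.
$$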
The program follows a simple nearest neighbor approach (a minimal code sketch follows the list):
1. collect all the Euclidean distances between all possible unique pairs of rows of $\textbf{D}$ in the matrix $M$, where $M_{n,m} = ||\textbf{D}_n,\textbf{D}_m||$ with $n,m \in \lbrace 1, ..., N\rbrace$;
2. set the queue to $Q = \lbrace 1, ..., N\rbrace$, the (empty) path vector to $I = \lbrace\rbrace$, and the starting index to $i = 1$;
3. remove $i$ from $Q$ and find the column index $j$ of $M$ such that $M_{i,j} = \min_{c \in Q} M_{i,c}$;
4. append $j$ to $I$ and repeat from step 3 with $i = j$ until $Q$ is empty.
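The following Python sketch illustrates these steps. It is not the program's actual source: the function name `greedy_path` is made up for illustration, the full $N \times N$ distance matrix is built for clarity, and the starting row is kept in the path so that the output is a full permutation of the rows.

```python
import numpy as np

def greedy_path(D, start=0):
    """Greedy nearest-neighbor ordering of the rows of D (illustrative sketch).
    D is an (N, K) array, assumed already normalized to [0, 1]. Returns a list
    of row indices such that consecutive rows are close in Euclidean distance."""
    N = D.shape[0]
    # Step 1: all pairwise Euclidean distances (full N x N matrix for clarity;
    # the program only stores the lower triangle, see below).
    M = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Step 2: queue of unvisited rows; the starting row is kept in the path here.
    queue = set(range(N))
    queue.remove(start)
    path = [start]
    i = start
    # Steps 3-4: repeatedly move to the closest unvisited row.
    while queue:
        remaining = np.fromiter(queue, dtype=int)
        j = int(remaining[np.argmin(M[i, remaining])])
        queue.remove(j)
        path.append(j)
        i = j
    return path
```

For example, calling `greedy_path(D)` on a normalized $(N, K)$ array returns an ordering that can then be used to sort the data before differencing.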
To improve efficiency, the program collects only the $N(N-1)/2$ Euclidean distances corresponding to the lower triangle of the matrix $M$ and chooses $j$ such that $M_{i,j} = \min_{c \in Q} \left(1\lbrace c < i\rbrace M_{i,c} + 1\lbrace c > i\rbrace M_{c,i}\right)$. The output of the algorithm, i.e. the vector $I$, is a sequence of row numbers such that the distance between consecutive rows of $\textbf{D}$ is minimized. The program also uses two refinements suggested in Appendix A of Yatchew (1997):
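As a hedged illustration of this storage scheme, the sketch below uses scipy's condensed pairwise-distance vector, which enumerates the unique pairs in upper-triangle row-major order rather than by lower triangle but stores exactly the same $N(N-1)/2$ distances; the helper `condensed_lookup` is purely illustrative and not part of the package.

```python
import numpy as np
from scipy.spatial.distance import pdist

def condensed_lookup(dists, n, i, c):
    """Distance between rows i and c when only the n*(n-1)/2 unique pairwise
    distances are stored (condensed vector as returned by scipy's pdist).
    Equivalent to reading M[i, c] or M[c, i] from the full matrix, whichever
    falls in the stored triangle."""
    a, b = (i, c) if i < c else (c, i)                 # order the pair
    return dists[n * a - a * (a + 1) // 2 + (b - a - 1)]

# Example: 5 points in [0, 1]^2 -> only 5*4/2 = 10 distances are stored.
D = np.random.default_rng(1).uniform(size=(5, 2))
dists = pdist(D)
print(condensed_lookup(dists, 5, 3, 1))  # same value as the full-matrix entry M[3, 1]
```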
- The entries in $\textbf{D}$ are normalized to $[0,1]$;
- The algorithm is applied to sub-cubes, i.e. partitions of the $[0,1]^K$ space, and the full path is obtained by joining the extrema (endpoints) of the sub-paths.
By convention, the program computes $(2\lceil \log_{10} N \rceil)^K$ sub-cubes, where each univariate partition is defined by grouping the observations into $2\lceil \log_{10} N \rceil$ quantile bins. If $K = 2$, the user can visualize the exact path across the normalized $\textbf{D}_i$'s in a graph by running the command with the path_plot option.
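A rough sketch of the sub-cube construction, assuming quantile binning is done coordinate by coordinate (the exact grouping and tie-breaking rules of the program may differ, and the function name `subcube_labels` is made up for illustration):

```python
import numpy as np

def subcube_labels(D):
    """Assign each row of D (already normalized to [0, 1]) to a sub-cube by
    quantile-binning each of the K coordinates into 2*ceil(log10(N)) bins.
    Illustrative sketch only."""
    N, K = D.shape
    n_bins = 2 * int(np.ceil(np.log10(N)))
    labels = np.zeros((N, K), dtype=int)
    for k in range(K):
        # Interior quantile cut points for coordinate k.
        edges = np.quantile(D[:, k], np.linspace(0, 1, n_bins + 1)[1:-1])
        labels[:, k] = np.searchsorted(edges, D[:, k], side="right")
    return labels  # each distinct row of `labels` identifies one sub-cube
```

In this sketch, the greedy path from the earlier snippet would then be computed within each group of rows sharing the same label, and the resulting sub-paths concatenated by joining their extrema, as described above.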
Once the dataset is sorted by $I$, the program resumes from step (2) of the univariate case.