Multivariate Yatchew Test
Let $\textbf{D}$ be a vector of $K$ random variables. Let $g(\textbf{D}) = E[Y|\textbf{D}]$. Denote by $||\cdot,\cdot||$ the Euclidean distance between two vectors. The null hypothesis of the multivariate test is $g(\textbf{D}) = \alpha_0 + A'\textbf{D}$, with $A = (\alpha_1,..., \alpha_K)'$, for $K+1$ real numbers $\alpha_0$, $\alpha_1$, ..., $\alpha_K$. This means that, under the null, $g(.)$ is linear in $\textbf{D}$. Following the same logic as in the univariate case, in a dataset with $N$ i.i.d. realisations of $(Y, \textbf{D})$ we can approximate the first difference of the errors, $\Delta \varepsilon$, by $\Delta Y$, because $g(.)$ evaluated at consecutive, nearby observations approximately cancels out in the difference. The program runs a nearest neighbor algorithm to find the sequence of observations such that the Euclidean distance between consecutive observations is minimized.
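Spelling out this approximation, with $\varepsilon = Y - g(\textbf{D})$ and indices referring to the data after the nearest-neighbor reordering described below:

$$
\Delta Y_i = Y_i - Y_{i-1} = \underbrace{g(\textbf{D}_i) - g(\textbf{D}_{i-1})}_{\approx\, 0 \text{ if } g \text{ is smooth and } ||\textbf{D}_i,\textbf{D}_{i-1}|| \text{ is small}} + \varepsilon_i - \varepsilon_{i-1} \;\approx\; \Delta \varepsilon_i.
$$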
The program follows a simple nearest neighbor approach (a minimal code sketch follows the list):
1. collect all the Euclidean distances between all possible unique pairs of rows of $\textbf{D}$ in the matrix $M$, where $M_{n,m} = ||\textbf{D}_n,\textbf{D}_m||$ with $n,m \in \lbrace 1, ..., N\rbrace$;
2. set the queue to $Q = \lbrace 1, ..., N\rbrace$, the (empty) path vector to $I = \lbrace\rbrace$, and the starting index to $i = 1$;
3. remove $i$ from $Q$ and find the column index $j$ of $M$ such that $M_{i,j} = \min_{c \in Q} M_{i,c}$;
4. append $j$ to $I$ and repeat from step 3 with $i = j$ until $Q$ is empty.
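The following Python sketch illustrates these steps. It is not the program's actual source: the function name `greedy_path` is made up for illustration, the full $N \times N$ distance matrix is built for clarity, and the starting row is kept in the path so that the output is a full permutation of the rows.

```python
import numpy as np

def greedy_path(D, start=0):
    """Greedy nearest-neighbor ordering of the rows of D (illustrative sketch).
    D is an (N, K) array, assumed already normalized to [0, 1]. Returns a list
    of row indices such that consecutive rows are close in Euclidean distance."""
    N = D.shape[0]
    # Step 1: all pairwise Euclidean distances (full N x N matrix for clarity;
    # the program only stores the lower triangle, see below).
    M = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Step 2: queue of unvisited rows; the starting row is kept in the path here.
    queue = set(range(N))
    queue.remove(start)
    path = [start]
    i = start
    # Steps 3-4: repeatedly move to the closest unvisited row.
    while queue:
        remaining = np.fromiter(queue, dtype=int)
        j = int(remaining[np.argmin(M[i, remaining])])
        queue.remove(j)
        path.append(j)
        i = j
    return path
```

For example, calling `greedy_path(D)` on a normalized $(N, K)$ array returns an ordering that can then be used to sort the data before differencing.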
To improve efficiency, the program collects only the $N(N-1)/2$ Euclidean distances corresponding to the lower triangle of the matrix $M$ and chooses $j$ such that $M_{i,j} = \min_{c \in Q} \left(1\lbrace c < i\rbrace M_{i,c} + 1\lbrace c > i\rbrace M_{c,i}\right)$. The output of the algorithm, i.e. the vector $I$, is a sequence of row numbers such that the distance between consecutive rows of $\textbf{D}$ is minimized. The program also uses two refinements suggested in Appendix A of Yatchew (1997):
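As a hedged illustration of this storage scheme, the sketch below uses scipy's condensed pairwise-distance vector, which enumerates the unique pairs in upper-triangle row-major order rather than by lower triangle but stores exactly the same $N(N-1)/2$ distances; the helper `condensed_lookup` is purely illustrative and not part of the package.

```python
import numpy as np
from scipy.spatial.distance import pdist

def condensed_lookup(dists, n, i, c):
    """Distance between rows i and c when only the n*(n-1)/2 unique pairwise
    distances are stored (condensed vector as returned by scipy's pdist).
    Equivalent to reading M[i, c] or M[c, i] from the full matrix, whichever
    falls in the stored triangle."""
    a, b = (i, c) if i < c else (c, i)                 # order the pair
    return dists[n * a - a * (a + 1) // 2 + (b - a - 1)]

# Example: 5 points in [0, 1]^2 -> only 5*4/2 = 10 distances are stored.
D = np.random.default_rng(1).uniform(size=(5, 2))
dists = pdist(D)
print(condensed_lookup(dists, 5, 3, 1))  # same value as the full-matrix entry M[3, 1]
```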
- The entries in $\textbf{D}$ are normalized to $[0,1]$;
- The algorithm is applied to sub-cubes, i.e. partitions of the $[0,1]^K$ space, and the full path is obtained by joining the extrema (endpoints) of the sub-paths.
By convention, the program computes $(2\lceil \log_{10} N \rceil)^K$ sub-cubes, where each univariate partition is defined by grouping the observations into $2\lceil \log_{10} N \rceil$ quantile bins. If $K = 2$, the user can visualize the exact path across the normalized $\textbf{D}_i$'s in a graph by running the command with the path_plot option.
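A rough sketch of the sub-cube construction, assuming quantile binning is done coordinate by coordinate (the exact grouping and tie-breaking rules of the program may differ, and the function name `subcube_labels` is made up for illustration):

```python
import numpy as np

def subcube_labels(D):
    """Assign each row of D (already normalized to [0, 1]) to a sub-cube by
    quantile-binning each of the K coordinates into 2*ceil(log10(N)) bins.
    Illustrative sketch only."""
    N, K = D.shape
    n_bins = 2 * int(np.ceil(np.log10(N)))
    labels = np.zeros((N, K), dtype=int)
    for k in range(K):
        # Interior quantile cut points for coordinate k.
        edges = np.quantile(D[:, k], np.linspace(0, 1, n_bins + 1)[1:-1])
        labels[:, k] = np.searchsorted(edges, D[:, k], side="right")
    return labels  # each distinct row of `labels` identifies one sub-cube
```

In this sketch, the greedy path from the earlier snippet would then be computed within each group of rows sharing the same label, and the resulting sub-paths concatenated by joining their extrema, as described above.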
Once the dataset is sorted by $I$, the program resumes from step (2) of the univariate case.