egen function to generate a variable for mlabvpos()
egen [type] newvar = mlabvpos(yvar xvar) [if exp] [in range] [, log polynom(#) matrix(5x5 matrix)]
where yvar is the variable to be plotted on the Y-axis of a scatterplot and xvar is the variable that forms the X-axis of that scatterplot. The option matrix() may be abbreviated to mat().
Description
_gmlabvpos attempts to automatically generate a variable holding the clock positions of marker labels in scatterplots. That is, the command generates a variable that can be passed to the scatter option mlabvpos().
Note that the program does not attempt to prevent marker labels from overplotting, which is quite likely in datasets with many observations. In such situations you might be better off simply generating randomized clock positions:
. gen clock = int(uniform()*12)+1
The general idea behind _gmlabvpos is to pull the marker labels away from the data region. For example, marker symbols in the lower left corner of the data region are labeled at clock position 7 or 8, and marker symbols in the upper right corner are labeled at clock position 1 or 2, etc. More precisely, if you consider the following rectangle as the data region of a scatterplot, then marker labels of symbols in the indicated areas get the following clock positions:
+--------------+
|11 12 12 12  1|
|10 11 12  1  2|
| 9  9 12  3  3|
| 8  7  6  5  4|
| 7  6  6  6  5|
+--------------+
If yvar and xvar are highly correlated, then the clock positions are generated as follows (following the same general idea):
+--------------+
|      12  1  3|
|   12 12  3  4|
|11 11 12  5  5|
|10  9  6  6   |
| 9  7  6      |
+--------------+
To calculate the clock positions, the program first categorizes the x-axis into 5 equal-sized intervals around the mean of xvar. Afterwards, the residuals of a linear regression of yvar on xvar are categorized into 5 equal-sized intervals. Both categorized variables are then used to form the clock positions according to the rule of the first table above. The rule can be changed with the option matrix().
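The steps just described can be sketched by hand as follows. This is only an illustration under stated assumptions, not the program's actual code: it reads "equal sized" as equal-width bins over the observed range (ignoring the centering around the mean), it assumes that row 1 of the rule matrix corresponds to the largest residuals, and it uses Stata's shipped auto dataset. The names xcat, rcat, res, R and clock_sketch are hypothetical.
. sysuse auto, clear
. * Assumption: "equal sized" is read as 5 equal-width bins over the observed range
. quietly summarize weight
. generate byte xcat = min(5, 1 + floor(5*(weight - r(min))/(r(max) - r(min))))
. quietly regress mpg weight
. predict double res, residuals
. quietly summarize res
. generate byte rcat = min(5, 1 + floor(5*(res - r(min))/(r(max) - r(min))))
. * Rule matrix copied from the first table above
. matrix R = (11, 12, 12, 12, 1 \ 10, 11, 12, 1, 2 \ 9, 9, 12, 3, 3 \ 8, 7, 6, 5, 4 \ 7, 6, 6, 6, 5)
. * Assumption: row 1 of R is the top of the data region (largest residuals)
. generate byte clock_sketch = R[6 - rcat, xcat]
. scatter mpg weight, mlabel(make) mlabvpos(clock_sketch)
Comparing clock_sketch with the variable returned by egen's mlabvpos() on the same data is a quick way to see where the assumptions above depart from the program.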
Options
log is used if you want to calculate the residuals from the regression of yvar on a logarithmic version of xvar. This might be useful if the scatter shows a strong curvilinear relationship.
polynom(#) is used if you want to calculate the residuals from the regression of yvar on polynomial terms of xvar. For example, use polynom(2) if the scatter shows a U-shaped relationship.
matrix(5x5 matrix) is used to change the general rule for the clock positions. The clock positions are specified by a 5x5 matrix, where the upper left cell refers to the clock position of marker labels in the upper left part of the data region, and so on.
Examples
. sysuse auto, clear
. egen clock = mlabvpos(mpg weight)
. sc mpg weight, mlab(make) mlabvpos(clock)
. egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \ 10 2 12 10 2 \ 9 3 12 9 3 \ 8 4 6 8 4 \ 7 5 6 7 5)
. sc mpg weight, mlab(make) mlabvpos(clock2)
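The options log and polynom() might be used in the same way. The lines below are a sketch that assumes _gmlabvpos is installed and that Stata's shipped auto dataset is used, as the variable names in the examples above suggest. The variable names clocklog and clockpoly are hypothetical.
. sysuse auto, clear
. * Residuals taken from a regression of mpg on the logarithm of weight
. egen clocklog = mlabvpos(mpg weight), log
. scatter mpg weight, mlabel(make) mlabvpos(clocklog)
. * Residuals taken from a regression of mpg on polynomial terms of weight
. egen clockpoly = mlabvpos(mpg weight), polynom(2)
. scatter mpg weight, mlabel(make) mlabvpos(clockpoly)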
Also see
Online: help for scatter
Author
Ulrich Kohler, WZB, kohler@wz-berlin.de