Draw JK-, SQ-, and GH-biplots
biplot varlist [weight] [if exp] [in range] [, [jk|sq|gh|mixed(jk|sq|gh jk|sq|gh)] dimensions(# #) [obsonly|varonly] covariance rv mahalanobis subpop(varname) flip(x|y|xy) stretch(#) jitter(relativesize) generate(varname1 varname2) scatter_options line_options twoway_options ]
aweights and fweights are allowed; see help weights. However, no weights are allowed with option rv, and aweights are not allowed with options sq or gh.
Description
biplot draws biplots of the data matrix defined by varlist. By default, a JK-biplot of the standardized values is drawn. Biplots are useful for visual inspection of data matrices, allowing the eye to identify patterns, regularities, and outliers. In a biplot, variables (columns) are shown as arrows from the origin and observations (rows) are shown as points.
The configuration of the arrows reflects the relations among the variables. The cosine of the angle between two arrows approximates the correlation between the variables they represent. If the variables are not standardized, the length of each arrow reflects the standard deviation of the variable it represents.
The scatter of points shows the relations among the observations. The distance between two points approximates the Euclidean distance between the corresponding two observations in the data matrix. The point at which a perpendicular dropped from an observation meets an arrow approximates the value of that observation on the variable the arrow represents.
Options
jk|sq|gh specifies the type of biplot. jk specifies a JK-biplot, the default. The JK-biplot approximates the Euclidean distances between observations more closely than the other types. gh specifies a GH-biplot, which represents the relations between variables more closely than the other types. sq specifies an SQ-biplot (symmetric biplot).
mixed() can be used instead of a single biplot type to combine the relative advantages of the different types. Inside the parentheses, first state the biplot type for the observations and then the type for the variables. The plot positions of the observations and of the variables are then calculated accordingly. Gabriel (2002), for example, proposes a "correspondence analysis" that uses a JK-biplot for the observations and a GH-biplot for the variables. This can be achieved with mixed(jk gh).
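As a minimal sketch, the combination described above could be requested as follows. The auto dataset shipped with Stata and the variable list from the Examples section are assumptions made only for illustration.

```stata
* Load the example data shipped with Stata (illustrative choice)
sysuse auto, clear

* JK coordinates for the observations, GH coordinates for the variables
biplot mpg weight length turn, mixed(jk gh)
```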
Note: In Intercooled Stata, "matsize too small" is a likely error message with types gh and sq, even with small samples, because matsize has to be at least the number of observations + 1. With Intercooled Stata, SQ- and GH-biplots are therefore recommended only for data with few observations and are possible only for up to 799 observations.
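One way to avoid the error is to raise matsize before drawing the plot. The sketch below assumes the auto dataset (74 observations); the value 100 is an arbitrary illustrative choice that satisfies the "observations + 1" rule.

```stata
sysuse auto, clear

* Show the number of observations (74 in the auto data)
count

* matsize must be at least the number of observations + 1
set matsize 100

biplot mpg weight length turn, gh
```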
dimensions(# #) specifies which dimensions are shown on the graph axes. The default is to use the coordinates that correspond to the two largest eigenvalues. For JK-biplots these are the first two principal components. dimensions() lets you choose arbitrary axes. A JK-biplot with dim(3 4), for example, plots all values in the space of the 3rd and 4th principal components.
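The following sketch contrasts the default axes with a user-chosen pair. It assumes the auto dataset and the four variables from the Examples section, so four dimensions are available.

```stata
sysuse auto, clear

* Default: the dimensions belonging to the two largest eigenvalues
biplot mpg weight length turn

* The same data in the space of the 3rd and 4th principal components
biplot mpg weight length turn, dimensions(3 4)
```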
obsonly|varonly suppress the plotting of either the variables or the observations. A JK-biplot with obsonly is a component score plot, and a JK-biplot with options varonly and stretch(1) is a plot of the PCA coefficients.
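The two special cases just mentioned could be drawn like this; the dataset and variables are again illustrative assumptions.

```stata
sysuse auto, clear

* Observations only: a component score plot
biplot mpg weight length turn, obsonly

* Variables only, at their original length: a plot of the PCA coefficients
biplot mpg weight length turn, varonly stretch(1)
```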
covariance uses original instead of standardized values.
rv produces relative variation diagrams. Relative variation diagrams are biplots for compositional data, that is, data with constant row sums and only positive values (for example, the row percentages of two-way frequency tables). To obtain a relative variation diagram, the data matrix must be transformed before the biplot is produced; option rv performs this transformation for you.
mahalanobis can be used with GH-biplots to rescale the graph so that the distances between the observations approximate the Mahalanobis distances.
generate(varname1 varname2) stores the coordinates of the observations and of the variables as new variables in the dataset. The y-axis coordinates of the observations are stored in varname1_y and the x-axis coordinates of the observations in varname1_x. Accordingly, the coordinates of the variables are stored in varname2_y and varname2_x.
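A short sketch of how the stored coordinates might be inspected. The stub names obs and var are hypothetical and chosen only for this example; the resulting variable names follow the naming rule described above.

```stata
sysuse auto, clear

* "obs" and "var" are hypothetical stub names for the new variables
biplot mpg weight length turn, generate(obs var)

* Inspect the stored coordinates
describe obs_x obs_y var_x var_y
summarize obs_x obs_y
```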
subpop(varname) highlights observations from different subpopulations with different plot symbols. Note that by default a legend is drawn to identify the subpopulations. The legend, however, changes the aspect ratio of the biplot. If you don't like this, you can turn the legend off or adjust the aspect ratio with xsize(). Another way to highlight subpopulations is the option mlabel(), which is described below.
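As an illustration, the sketch below marks domestic and foreign cars in the auto dataset (an assumed example dataset) and then repeats the plot with the legend turned off through the standard twoway option legend(off).

```stata
sysuse auto, clear

* Different plot symbols for domestic and foreign cars
biplot mpg weight length turn, subpop(foreign)

* Same plot without the legend, leaving the aspect ratio unchanged
biplot mpg weight length turn, subpop(foreign) legend(off)
```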
flip(x|y|xy) reverses the signs of the axes. flip(x) and flip(y) reverse the signs of the indicated axis; flip(xy) reverses both. flip() is seldom used but can be useful if you want to compare your results with those of other software packages.
stretch(#) draws longer (or, if needed, shorter) lines for the variables. By default, stretch() is set to a value that improves readability. You can set it to any positive real number. With stretch(1) the lines are drawn at their original length, and with stretch(2) they are drawn twice as long. stretch() is seldom used.
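Both options could be tried as follows, again assuming the auto dataset purely for illustration.

```stata
sysuse auto, clear

* Reverse the sign of the x axis, e.g. to match another package's output
biplot mpg weight length turn, flip(x)

* Draw the variable lines twice as long as their original length
biplot mpg weight length turn, stretch(2)
```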
jitter(relativesize) adds spherical random noise to the plot symbols of observations. This is useful when plotting data which otherwise would result in points plotted on top of each other. Commonly specified are jitter(5) or jitter(6); jitter(0) is the default. See help relativesize for a description of relative sizes.
scatter_options are the following subset of the options allowed with scatter:
----------------------------------------------------------------------
msymbol(symbolstylelist)       shape of marker
mcolor(colorstylelist)         color of marker, inside and out
msize(markersizestylelist)     size of marker
mlabel(varlist)                specify marker variables
mlabposition(clockposlist)     where to locate label
mlabvposition(varlist)         where to locate label 2
mlabgap(relativesizelist)      gap between marker and label
mlabsize(textsizestylelist)    size of label
mlabcolor(colorstylelist)      color of label
----------------------------------------------------------------------
You can specify up to two elements within each option. The first element refers to the display of the observations, the second to the display of the variables. Note that the default plot symbol for the positions of the variables is invisible; that is, the default for msymbol() is msymbol(oh i). The lines for the variables are changed with the line_options.
line_options are the following subset of the options allowed with line:
----------------------------------------------------------------------
clpattern(linepatternstylelist)  whether line solid, dashed, etc.
clwidth(linewidthstylelist)      thickness of line
clcolor(colorstylelist)          color of line
----------------------------------------------------------------------
Note that the line_options only refer to the display of the variable vectors.
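The sketch below combines two-element scatter_options with line_options. The particular symbols, colors, and line styles are arbitrary illustrative choices, and the auto dataset is assumed.

```stata
sysuse auto, clear

* In msymbol() and mcolor() the first element applies to the observations
* and the second to the variables; clpattern() and clwidth() affect only
* the variable vectors
biplot mpg weight length turn, msymbol(oh o) mcolor(blue red) clpattern(dash) clwidth(thick)
```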
twoway_options are those allowed with graph twoway; see help twoway_options.
Examples
. biplot mpg weight length turn
. biplot mpg weight length turn, gh mlabel(make)
. biplot mpg weight length turn, gh mlabel(make) msymbol(oh o)
Also see
Online: help for twoway, graph, scatter,
Author
Ulrich Kohler, WZB, kohler@wz-berlin.de
References
Gabriel, K.R. 1971. The biplot graphical display of matrices with application to principal component analysis. Biometrika 58, 453-467.
Gower, J.C. and Hand, D.J. 1996. Biplots. London: Chapman and Hall.
Gabriel, K.R. 2002. Goodness of fit of biplots and correspondence analysis. Biometrika 89, 423-436.