Title
usesome -- Use subset of Stata dataset
Syntax
usesome [varspec] using filename [, clear not s-options options]
where varspec is
[varlist] [(numlist) [(numlist) ...]]
The general rules apply to varlist, with the exception that variable names may not be abbreviated. In numlist, integers in the range 0 < # <= k and the character k are allowed, where k is the number of variables in the dataset to be used; the character k will be replaced with that number. Note that each numlist is limited to 1600 elements. Parentheses around numlist are required.
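As a small sketch of what k stands for, official Stata's describe can report the number of variables in a file without loading it. The auto dataset from the Examples section is used here as an assumed stand-in for filename.

```stata
* Report the number of variables in a file without loading it.
* The URL is the example dataset used elsewhere in this help file.
describe using "http://www.stata-press.com/data/r11/auto", short
display r(k)
```

With this value, a specification such as (k-2/k) in the Examples would presumably refer to the last three variables in the file.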
Description
usesome loads a subset of a Stata dataset into memory. The program is similar to official Stata's use but enhances its functionality in three ways:
1. the user may specify variables that are not to be used
2. the user may specify variable characteristics instead of variable names
3. the user may specify variable indices instead of variable names
These enhancements are intended for use with flavors of Stata in which the maximum number of variables allowed in a dataset is rather small (2,047 for Stata IC, 99 for Small Stata).
Options
clear allows the data in memory to be replaced.
not loads variables not specified in varspec from filename.
s-options select variables by characteristics. These are the Advanced ds options has(spec) or not(spec), and insensitive. All variables that ds returns in r(varlist) are added to the variables specified in varspec. All ds options (except not) are allowed. However, since usesome does not show any ds output, specifying Main options will merely slow down execution.
in(range) specifies observations to be used from filename. Note that it is not possible to indicate observations not to be used.
nolabel is the same as nolabel used with use.
findname[not] uses user-written findname (Cox 2010, 2012) instead of ds to select variables by characteristics. findnamenot finds, and adds to varspec, variables that do not have the specified properties. As with ds, all findname options (except not) are allowed, but not all of them are useful. It is not allowed to mix ds and findname options.
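The following is a hedged sketch of how in(range) and nolabel might be combined with a varlist. It assumes usesome is installed and again uses the auto dataset from the Examples section; the variable names price and foreign belong to that dataset.

```stata
* Assumes usesome is installed; variable names are those of the auto dataset.
* Load two variables, the first ten observations only, without value labels.
usesome price foreign using "http://www.stata-press.com/data/r11/auto", in(1/10) nolabel clear
```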
Remarks
usesome is not to be understood as an alternative to use, as its enhancements may come with severe speed penalties (see Technical remarks). Whenever use can be used, doing so is probably faster than using usesome, although slightly more typing might be involved. I will consider three situations where usesome comes in handy, before discussing circumstances under which use might be preferred.
Suppose we want to load a subset of a Stata dataset, but we do not know the names of all the variables we want to use. Alternatively, suppose we know the names, but there are a couple of hundred of them and, unfortunately, the variable order in the dataset does not permit us to represent these variables with only a few variable lists using the dash (-) character. Also, the names differ considerably, so representing them using wildcards (* and/or ?) would still be rather burdensome.
Using usesome
Suppose, in a first situation, we know that all variable names we do not want to use end with _xyz, but none of the variables we want to use does. We can load the subset of the dataset typing
usesome *_xyz using filename ,not
In a similar situation, suppose we know that all variables we want to use have variable labels containing xyz, while none of the variables we do not want to use has. We load the subset of the dataset typing
usesome using filename ,has(varlabel *xyz*)
In a third situation, suppose we do not know the names of the variables we want to use. We do, however, know the variables' positions in the dataset. Suppose we want to load variables 1-50, 100-200, 500, 510 and 520. We do so typing
usesome (1/50 100/200 500(10)520) using filename
Using use
If the number of variables in filename, in the situations described above, does not exceed the limits of our Stata version (see maxvar), we can use official Stata's use to load the entire dataset into memory and select the subset afterwards. Doing so, in the first situation we type the two lines
use filename
drop *_xyz
In the second situation we type three lines of code.
use filename
ds ,has(varlabel *xyz*)
keep `r(varlist)'
The third situation requires two lines, including one line of Mata code. The code is
use filename
mata : st_keepvar((1..50, 100..200, 500, 510, 520))
If, however, the number of variables in filename exceeds the limit of our Stata version, use can only be directly applied if we know (and are willing to specify) the names of all variables we wish to load. Whenever this "keep-logic" is convenient, use is convenient.
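A minimal illustration of this keep-logic with official use, assuming the auto dataset from the Examples section as filename and three of its variable names:

```stata
* Load only the named variables; the rest of the file is never read into memory.
use make price mpg using "http://www.stata-press.com/data/r11/auto", clear
```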
Technical remarks
As stated, usesome may be rather slow, and it will be if s-options are specified. To load a subset of filename, usesome first calls describe to obtain a list of all variables in filename. If no s-options are specified, the variables indicated by varspec are selected from this list and loaded. If s-options are specified, the variable list is split into parts small enough not to hit the limit imposed by maxvar. usesome then loads each part of filename into memory in turn, selects the variables indicated by s-options, and finally loads the specified subset of filename.
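The describe-then-select idea for the case without s-options can be sketched with official commands only. This is an illustration under stated assumptions, not usesome's actual code: the local macro names allvars, wanted and v are hypothetical, and the auto dataset stands in for filename.

```stata
* Sketch only: obtain the variable list of a file without loading the data,
* then load the variables at positions 1 to 3.
describe using "http://www.stata-press.com/data/r11/auto", short varlist
local allvars `r(varlist)'

* Build the list of wanted names from their positions in the file.
local wanted
forvalues i = 1/3 {
    local v : word `i' of `allvars'
    local wanted `wanted' `v'
}

* Load just those variables with official use.
use `wanted' using "http://www.stata-press.com/data/r11/auto", clear
```

Because only names are read in the first step, this part stays within the limit set by maxvar as long as the selected subset does.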
Examples
. usesome foreign using http://www.stata-press.com/data/r11/auto ,not
. usesome (1/3 k-2/k) using http://www.stata-press.com/data/r11/auto ,clear
. usesome using http://www.stata-press.com/data/r11/auto ,has(vallabel) clear
References
Cox, N. J. 2012. Update: Finding variable names. Stata Journal volume 12, number 1. (dm0048_2)
Cox, N. J. 2010. Update: Finding variable names. Stata Journal volume 10, number 4. (dm0048_1)
Cox, N. J. 2010. Speaking Stata: Finding variables. Stata Journal volume 10, number 2. (dm0048)
Author
Daniel Klein, University of Bamberg, klein.daniel.81@gmail.com
Also see
Online: use, describe, ds, drop, nlist
if installed: findname, chunky, savesome