------------------------------------------------------------------------------- help for convert_top_lines -------------------------------------------------------------------------------

Convert the first one or two observations to variable names and labels.

convert_top_lines [, line2labels list drop]

Description

convert_top_lines takes the values in the first observation and uses them as variable names. Optionally, it takes the values in the second observation and uses them as variable labels. This works only when all variables are of string type and the existing names are v1, v2, etc.

Options

line2labels specifies that values in the second observation are to be taken as variable labels.

drop specifies that the first observation, and possibly the second, are to be dropped after the names and labels are extracted from them. With this option, the first observation is always dropped; when combined with line2labels, the second observation is also dropped. Typically you would want to specify this option: if you need this program at all, then these particular observations do not contain "regular" data and don't belong with the others.

list specifies that the first three observations will be listed (after the renaming, but prior to the optional drop operation), so you can see the information that has been converted to names and labels. Typically, in the first observation, you will see values equaling the names, and the third observation would typically contain regular data values.

Remarks

This is intended to aid in clearing up some problems that may occur with insheet.

Often, comma-separated-value (csv) files have the variable names in the first line. If the data follow, starting on the second line, then insheet knows what to do; it uses the values in the first line as the variable names, and collects the regular data beginning with the second.

But sometimes, csv raw data files come with descriptive information in the second line, in addition to variable names in the first. This descriptive information is often suitable as variable labels. But insheet is not able to handle that situation, and will...

a. use default names, v1, v2, v3, etc.

b. use long string datatypes, such as str68.

insheet, as it stands at the time of this writing, is not set up to recognize this situation, and it invokes its "take everything as string" mode. Thus, it selects datatypes that can accommodate the values in all the lines, including the second, where those long descriptions dwell. Often, in this situation, the datatype is tailored to that one longest value, and all other "actual" data values are much shorter, and possibly numeric.
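As an illustration only, a file of the kind described might look like the following. The file name example.csv and its contents are hypothetical, made up for this help file:

    id,income,state
    Respondent identifier,Annual household income in dollars,State of residence
    1,52000,MD
    2,48500,VA

Reading such a file and inspecting the result should show the default names and string datatypes described above:

    . insheet using example.csv, clear
    . describe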

This program is meant to partly remedy that situation. It will first check that all the variables are named v1, v2, etc. Then it renames them to the values contained in the first line, with these names converted to lower case. With the line2labels option, the values in the second line become variable labels.
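The following is a rough sketch of the same idea done by hand for a single variable. It is not the program's actual code; it assumes that v1 is a string variable whose first value is a legal Stata name, and the macro names newname and newlabel are arbitrary:

    . local newname = lower(v1[1])
    . rename v1 `newname'
    . local newlabel = `newname'[2]
    . label variable `newname' `"`newlabel'"'

convert_top_lines saves you from repeating these steps for every variable.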

Typically, you would want the line2labels option, because, if you don't have descriptive information in the second line, then insheet probably would have succeeded at taking variable names from the first line (if they exist therein), and you wouldn't be needing this program. But this feature was made optional for the sake of generality and to give you more control.

Some truncation may occur when the values in the second line are read into the variables during insheet, and possibly when these values are converted to variable labels. (The latter would occur with Stata SE only.) Whenever this possibility is detected (when the length of the value is >=80), a note is added to the variable, indicating the possibility of truncation.
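To see which variables were flagged, the standard notes command can be used after the conversion. This assumes the flag is stored as an ordinary variable note, as the paragraph above suggests:

    . convert_top_lines, line2labels drop
    . notes

Typed with no arguments, notes lists all notes attached to the dataset and its variables.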

Examples

. convert_top_lines

. convert_top_lines, line2labels list drop

Additional Remarks

In the situation where you would want to use this, all datatypes are initially string, which may not be appropriate after the operation is completed. But it is beyond the scope of convert_top_lines to try to remedy that situation. Thus, you may want to follow this with some changes to datatypes, such as with compress and destring.
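A typical follow-up might look like this; it is a sketch, and the right choices depend on your data:

    . convert_top_lines, line2labels drop
    . destring, replace
    . compress
    . describe

Here destring, replace converts every string variable that contains only numeric values and leaves the others as strings; compress then shrinks the remaining variables to the smallest datatypes that hold their values.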

--------------------------------------------------------------------
Technical note: While the presence of string types may be ultimately undesirable for most variables, it makes the operations within convert_top_lines possible, as variable names and labels are string values.
--------------------------------------------------------------------

convert_top_lines always converts the names to lower case, which would be a problem if some names are distinguished only by case. If users find they need to be able to control that, they should contact the author.

If you are using insheet on a large csv file with descriptive information in the second line, then you may need a large amount of memory just for the insheet operation. Once convert_top_lines is done, followed by appropriate changes to datatypes, the dataset will be much smaller, and you can return to using less memory. See memory.
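In Stata versions where memory is set manually, the sequence might look like the following. The file names and memory sizes are placeholders, not recommendations:

    . clear
    . set memory 200m
    . insheet using example.csv
    . convert_top_lines, line2labels drop
    . destring, replace
    . compress
    . save example, replace
    . clear
    . set memory 20m
    . use example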

Author

David Kantor, Institute for Policy Studies, Johns Hopkins University. Email dkantor@jhu.edu if you observe any problems.

Also see

insheet, datatypes, destring, compress