Read text files into string variables in the memory (without losing blanks)
intext using filename , generate(prefix) [ length(#) clear ]
tfconcat filename_list , generate(prefix) [ length(#) tfid(newvarname) tfname(newvarname) obsseq(newvarname) ]
where filename_list is a list of filenames separated by spaces.
Description
intext inputs a single text file into a set of generated string variables in memory, generating as many string variables as are necessary to store the longest records in full. Unlike infix, it does not trim leading and trailing blanks. tfconcat takes, as input, a list of filenames, assumed to belong to text files, and concatenates them (without losing blanks) to create a new data set in memory, overwriting any pre-existing data. The new data set contains one observation for each record in each text file, ordered primarily by source text file and secondarily by order of record within source text file, and contains a set of generated string variables containing the text, as created by intext. Optionally, tfconcat creates new variables specifying, for each observation, the input text file of origin and/or the sequential order of the observation in its input text file of origin.
Options for intext and tfconcat
generate(prefix) is not optional. It specifies a prefix for the names of the new string variables generated, which will be named prefix1, ..., prefixn, where n is the number of string variables required to contain the longest text record in any input data set, with length as specified by the length option.
length(#) specifies the maximum length of the generated text variables. If absent, it is set to 80.
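As a rough sketch of how length() determines the number of generated variables, assume a hypothetical file notes.txt (not part of this package) whose longest record has 200 characters. With the default length of 80, three variables would be needed to hold that record, because 200/80 rounded up is 3. The prefix txt is likewise only an illustrative choice.
. * number of variables needed for a 200-character record at length 80
. display ceil(200/80)
. * read the hypothetical file; txt1, txt2 and txt3 are expected under the assumption above
. intext using notes.txt, generate(txt) length(80) clear
. describe txt*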
Options for intext only
clear specifies that any existing data set in the memory is to be removed before the generated text variables are created. If clear is absent, then intext attempts to add the generated variables to the existing data set, failing if there is an existing variable with the same name as one of the generated variables. (tfconcat always removes any existing data set before generating new variables.)
Options for tfconcat only
tfid(newvarname) specifies a new integer variable to be created, containing, for each observation in the new data set, the sequential order, in the filename_list, of the input text file of origin of the observation. If possible, tfconcat creates a value label for newvarname with the same name, assigning, to each positive integer i from 1 to the number of input file names in the list, a label equal to the filename of the ith input text file, truncated if necessary to the maximum label length in the version of Stata being used (e.g. 80 characters for Small or Intercooled Stata 7). If a value label of that name already exists in one of the input data sets, and nolabel is not specified, then dsconcat adds new labels, but does not replace existing labels.
tfname(newvarname) specifies a new string variable containing, for each observation in the new data set, the name of the input text file of origin of that observation, truncated if necessary to the maximum string length in the version of Stata being used (e.g. 80 characters for Small or Intercooled Stata 7, or 244 for Stata 7 SE).
obsseq(newvarname) specifies a new integer variable containing, for each observation in the new data set, the sequential order of that observation as a text record in its input text data set of origin.
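A minimal sketch of the three tfconcat-only options used together, assuming two hypothetical text files part1.txt and part2.txt in the current directory. The variable names fileseq, fname and recnum are illustrative choices, not names required by tfconcat.
. * concatenate both files, recording file order, file name and record order
. tfconcat part1.txt part2.txt, generate(txt) tfid(fileseq) tfname(fname) obsseq(recnum)
. * one row per text record, identified by its file of origin and position in that file
. list fileseq fname recnum txt1
. * number of records contributed by each input file
. tabulate fileseq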
Remarks
intext is an inverse of outfile with the runtogether option. That is to say, if the user inputs a text file into a list of generated string variables using intext and then outputs them to a second text file using outfile with the runtogether option, then the second text file will be identical to the first text file. tfconcat is similar to dsconcat (downloadable from SSC), but it concatenates text files, instead of Stata data sets, into memory. tfconcat works by calling intext multiple times to create a data set for each text file, and concatenating these data sets in memory. Both programs make it possible to use Stata for text file processing, especially when the text files may be indented Stata programs. This cannot be done properly using infix, which uses fixed-field input but trims leading and trailing blanks from strings. Therefore, the intext package enables Stata programs to read Stata programs, just as outfile with the runtogether option enables Stata programs to write Stata programs.
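One way the round trip described above might be checked is with Stata's checksum command, which returns r(checksum) and r(filelen) for a file. This is only a sketch: myprog.do and copy.do are hypothetical filenames, and the final assert will fail if the two files differ in any byte, including line terminators.
. * read the hypothetical program file and write it back out
. intext using myprog.do, generate(txt) clear
. outfile txt* using copy.do, runtogether replace
. * compare the checksums of the original and the copy
. checksum myprog.do
. local sum1 = r(checksum)
. checksum copy.do
. assert r(checksum) == `sum1'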
Examples
. intext using intext.ado,gene(sect) clear
. tfconcat auto1.txt auto2.txt auto3.txt auto4.txt,gene(piece) tfid(tfseq) obs(recnum)
. sort tfseq recnum
The following example is equivalent to the command copy tfconcat.ado trash1.txt,text replace
. intext using tfconcat.ado,gene(slice) clear
. outfile slice* using trash1.txt,runtogether replace
The following advanced example works under Windows, and might be used if the user has a library of Stata ado-files in the current directory. It inputs the ado-files into the memory and lists the lines beginning with "*!", which are echoed by the which command. The vallist package, written by Nick Cox and downloadable from SSC, is assumed to be installed.
. tempfile dirf
. shell dir/b *.ado > `dirf'
. intext using `dirf',gene(fn) clear
. vallist fn1,quote
. tfconcat `r(list)',gene(line) tfid(adofile) obs(lseq)
. list adofile line* if substr(line1,1,2)=="*!"
Author
Roger Newson, King's College, London, UK. Email: roger.newson@kcl.ac.uk
Also see
Manual: [U] 24 Commands to input data, [R] infix, [R] append, [R] outfile
On-line: help for infiling, infix, append, outfile, file; help for dsconcat if installed