Title
chunky8 -- Text file chunking utility
Syntax
chunky8 using filename [,options]
    options        Description
    -------------------------------------------------------------------------
    index(#)       Starting line in file
    chunk(#)       Size of chunk in lines
    keepfirst      Keep the first line (e.g. with variable names) of the file
                   with each chunk
    list           List the lines as they are being processed (used for
                   peeking at the first couple of lines or in debugging)
    -------------------------------------------------------------------------
    saving options
    saving         Filename to save chunk
    replace        Replace previously saved file
Note that if your filename contains embedded spaces, remember to enclose it in double quotes.
Description
chunky8 is typically used in one of two ways:
(1) When a text file is too large for infiling, it can be broken into chunks that can be imported individually.
(2) Sometimes one needs to get a sense of what the variable names and structure of a text file are by pulling off just the first few lines.
Options
        +------+
    ----+ Main +-------------------------------------------------------------
index(#) Thinking of a text file as having lines numbered 1 to N, index() indicates the starting line number for the chunk.
chunk(#) The chunk is the number of lines to be returned. Ideally the chunksize would be the largest that could be used in the subsequent infiling process.
keepfirst Comma Separated Value (CSV) files frequently come with the variable names in the first line of the file. For this type of file, the first line of names should be retained for all file chunks. keepfirst writes out the first line of the using file at the beginning of each chunk. This allows a subsequent insheet to be done easily on each chunk.
list Listing the first few lines of a text file can be useful. You can use the type command but the list option is an alternative and can be set to display a single line.
        +--------+
    ----+ saving +-----------------------------------------------------------
saving("path & filename") Filename to save chunk
replace Replace previously saved file
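As an illustration of how keepfirst and saving() combine, the sketch below pulls one chunk from the middle of a CSV file and imports it. The file names and the index() and chunk() values are placeholders chosen for the illustration, not defaults of the command; the option syntax follows the Examples section. The /// line continuation assumes the commands are run from a do-file.

```stata
* Sketch only: ReallyBig.csv and chunk2.csv are placeholder file names
* Extract 10,000 lines starting at line 10,001, preceded by the header line
chunky8 using ReallyBig.csv, index(10001) chunk(10000) keepfirst ///
    saving(chunk2.csv, replace)

* Because the header line was kept, insheet can pick up the variable names
insheet using chunk2.csv, comma names clear
describe
```

Without keepfirst, a chunk taken from the middle of the file would begin with a data line, and insheet with the names option would presumably treat that line as variable names.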
Examples
    . chunky8 using ReallyBig.csv, index(1) chunk(10000) saving(chunk1.csv, replace)

    . chunky8 using ReallyBig.csv, list

    . chunky8 using ReallyBig.csv, index(100001) chunk(100000) keepfirst saving(chunk2.csv, replace)
Saved results
chunky8 saves the following in r():
    Scalars
      r(eof)       end of file indicator; r(eof)=1 if end of file reached
      r(index)     line number of the next line after the last line read,
                   i.e. the starting index for the next chunk
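A minimal sketch of how these results might be used to chain two calls by hand is shown here. The file names are placeholders, and copying r() into locals straight away is a precaution assumed for this illustration, since a later r-class command would overwrite the results. The /// line continuation assumes a do-file.

```stata
* Sketch only: ReallyBig.csv, part1.csv and part2.csv are placeholder names
chunky8 using ReallyBig.csv, index(1) chunk(10000) saving(part1.csv, replace)

* Copy the saved results before another r-class command replaces them
local next = r(index)
local done = r(eof)

display "next chunk starts at line `next'; eof flag = `done'"

* Request the following chunk only if the end of file was not reached
if `done' != 1 {
    chunky8 using ReallyBig.csv, index(`next') chunk(10000) keepfirst ///
        saving(part2.csv, replace)
}
```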
Notes
When used on a very large file, chunky8 would normally be employed in a loop that reads in and saves chunks as serially numbered files. The program listing below demonstrates a typical way in which chunky8 would be employed.
chunky8 is a deprecated version of chunky, maintained for users of pre-Mata releases of Stata and for users who require a line-indexed method of extracting part of a text file. The current version of chunky, for Stata release 9 and above, uses different logic and syntax, and its file I/O is written with efficient Mata routines. See chunky if installed, or read its description from the net.
    // chunky8.ado demo
    // Make sure you have a minimum of 50M of memory to run this demo

    // Set up a test comma separated value file
    sysuse auto, clear
    expand 10000
    describe
    outsheet using test.csv, names comma replace

    // If you want to look at an example of the file's
    // variables and first few lines:
    chunky8 using test.csv, list

    // Set up chunksize and locals for the loop
    local part 0            // Counter for parts
    local index 1           // Keeps running line index
    local chunksize 250000  // # of lines in each chunk
    tempfile chunkfile      // Temporary filename for chunks

    // Loop until end of file is reached
    while r(eof)!=1 {
        chunky8 using test.csv, ///
            index(`index') chunk(`chunksize') ///
            saving("`chunkfile'", replace) keepfirst
        if r(eof) {
            // break out when end of file reached
            continue, break
        }
        else {
            // get starting index for next chunk
            local index `r(index)'
        }
        insheet using "`chunkfile'", clear comma names
        // keep fields you want
        keep make mpg headroom weight length displacement
        // save part of file and increment chunk count
        save test_`++part', replace
    }

    // Append parts together
    use test_1, clear
    forvalues i=2/`part' {
        append using test_`i'
        erase test_`i'.dta
    }
    describe
Author
David C. Elliott, Nova Scotia Department of Health, Halifax, Nova Scotia, Canada