help chunky
-------------------------------------------------------------------------------

Title

chunky -- Large text file chunking utility

Syntax

chunky using filename [, [peek(#) analyze] | [chunksize(#.#) header(string) stub(string) replace] ]

    options              Description
    -------------------------------------------------------------------------
    pre-chunking options
      peek(#)            list the first # observations
      analyze            checks the composition of filename in terms of
                           letters, numbers, and special characters (which
                           can cause infiling problems)
    -------------------------------------------------------------------------
    chunking options
      chunksize(#.#)     size of chunk in bytes
      header(string)     whether filename has a header and to include or
                           skip it
    -------------------------------------------------------------------------
    saving options
      stub(string)       filename stub for chunks
      replace            replace previously saved chunk filenames
    -------------------------------------------------------------------------

Description

Some users, especially those using 32-bit versions of Stata, may find themselves faced with a huge data download from a database that is too large for infiling. In this situation, the huge file must be broken into smaller chunks that can be imported individually.

chunky provides the user with two tools:

1. In preparation for chunking, one sometimes just wants to get a sense of the file structure and variable names (if present) by peeking at the first few lines. chunky can display the first n lines of the file. It can also provide a more complete analysis of the file, including the number of observations, average line lengths, and the presence of special characters that could be problematic for import.

2. Once the user has determined a chunking strategy, chunky will break the huge file into chunks of a size specified by the user and save them in serially numbered files.

chunky returns the list of chunk filenames in s(filelist) for subsequent processing.
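For example, the returned macro can be inspected immediately after chunking; a minimal sketch (the filename and options are illustrative):

    . chunky using ReallyBig.csv, chunksize(100m) stub(part) replace
    . sreturn list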

Options

    +--------------+
----+ Pre-chunking +-----------------------------------------------------

peek(#) Listing the first few lines of a text file can be useful. You can use the type command, but the peek option is an alternative that can be set to display as little as a single line. It also displays the end-of-line characters (for reference: EOL characters 0d0a (CRLF) indicate Windows, 0a (LF) Unix, and 0d (CR) Mac; 09 is the TAB character).

analyze This option allows detailed examination of the input file, looking for problems that may cause difficulty in chunking or with subsequent import of the chunks. analyze uses the hexdump routine, which can identify the file format as binary or ASCII and which operating system wrote the file (based on the end-of-line characters used). A small table is produced that gives rough approximations of the number of chunks that would be created at various chunksizes and the number of observations in each chunk. This may help in planning one's chunking strategy.
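Because analyze builds on Stata's own hexdump command, a comparable report can be obtained directly; a minimal sketch (the filename is illustrative):

    . hexdump ReallyBig.csv, analyze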

Note: The peek(#) and analyze options are intended to be used prior to chunking. They may be used together, but use of either takes precedence over any chunking options: those may still be specified, but they will not run, and a warning will be generated.

    +----------+
----+ Chunking +---------------------------------------------------------

chunksize(#.# [[k|kb]|[m|mb]|[g|gb]]) The size of the chunk in bytes. For convenience, standard power-of-ten, case-insensitive, one- or two-letter multiplier abbreviations are allowed. When the multiplier form is used, decimal numbers are allowed and a space may appear between the number and the multiplier, e.g., 5000Kb = 5m = .005 GB.
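For instance, the following calls all request the same chunk size; a minimal sketch (the filename and stub are illustrative):

    . chunky using big.csv, chunksize(5000kb) stub(part) replace
    . chunky using big.csv, chunksize(5m) stub(part) replace
    . chunky using big.csv, chunksize(.005 gb) stub(part) replace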

header(none|include|skip) Comma Separated Value (CSV) files frequently come with the variable names in the first line of the file. For this type of file, the first line of names should be retained for all file chunks. header(include) writes out the first line of the using file at the beginning of each chunk. This allows a subsequent insheet to be done easily on each chunk. header(skip) tells chunky that a header is present but to omit it. Finally, one may specify header(none) to indicate the absence of a header row. This is the default if header is not specified. The header options may be minimally abbreviated as shown. e.g. h(s) h(i)
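As a sketch of the intended workflow (the filename and stub are hypothetical), header(include) lets each chunk be read with insheet on its own; the exact chunk names are returned in s(filelist):

    . chunky using sales.csv, chunksize(50m) header(include) stub(sales_) replace
    . local first_chunk : word 1 of `s(filelist)'
    . insheet using `"`first_chunk'"', comma clear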

    +--------+
----+ Saving +-----------------------------------------------------------

stub(string) Filename stub to use for individual chunks. The stub may contain a directory path, allowing chunks to be saved to a different directory (the default is the working directory). Chunks will be numbered consecutively as stub0001, stub0002, stub0003, ... Obviously, this naming convention imposes a maximum of 9999 chunks.

replace Replace previously saved chunk files.

Note that if your filename or stub contains embedded spaces, it must be enclosed in double quotes.

Examples

. chunky using ReallyBig.csv, peek(5)

. chunky using ReallyBig.csv, analyze

. chunky using ReallyBig.csv, chunksize(100m) header(include) stub(part) replace

.chunky using "c:\rawfiles\dump_07_09.raw", chunksize(.5 GB) header(none) st > ub("07-09 data/import")

Cautions

This routine has not been tested in a Mac environment. Stata appears able to read and write files coming from Unix and Mac systems in a Windows environment, but cross-OS testing has not been done at this point. The author welcomes any feedback in this regard.

Notes

This version of chunky has been extensively rewritten and replaces the previous version, which has been deprecated and renamed chunky8. The routine now handles the consecutive naming of chunks and removes the need for the user to write the looping. It uses a much more efficient chunking strategy and employs Mata functions for the file I/O. The speed improvements on very large files and over network connections are considerable.

These changes have necessitated a complete change in the command syntax, but a single command now replaces what previously required a block of programming to loop through the chunks. If a user still requires breaking a large file apart according to a fixed number of lines per chunk, chunky8 or Roy Wada's chewfile may be appropriate.

Once the chunks have been created, it may be convenient to use an infiling method appropriate to the storage format. The returned s(filelist) can easily be processed:

    foreach in_fn in `s(filelist)' {
        insheet using `"`in_fn'"' [, options ]
        // create a saving filename based on the input filename minus the extension
        local save_fn = cond(regexm(`"`in_fn'"',"(.*)[.].*$"),regexs(1),"")
        if `"`save_fn'"' != "" {
            save `"`save_fn'"' [, replace ]
        }
        else {
            display `"{err: Cannot extract savename from `in_fn'}"'
            error 198    // 198 = invalid syntax
        }
    }

This will result in a number of individual data files being created. Subsets of these can then be appended together to create a larger working dataset. A very useful tool for this purpose is Roger Newson's dsconcat (dsconcat if installed, or ssc describe dsconcat).

Users are reminded that they can obtain a macrolist by creating an appropriate filemask with wildcards for use with the dir extended function:

local my_filelist : dir . files "stub*.ext"

Similarly, Nick Cox's fs (fs if installed, or ssc describe fs) can be used; it returns a macrolist in r(files).
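If dsconcat is not installed, a plain append loop over such a macrolist works as well. A minimal sketch, assuming the chunks were imported and saved as Stata datasets named part0001.dta, part0002.dta, and so on:

    local dta_files : dir . files "part*.dta"
    local first = 1
    foreach f of local dta_files {
        if `first' {
            use `"`f'"', clear      // the first dataset starts the combined file
            local first = 0
        }
        else {
            append using `"`f'"'    // remaining datasets are appended
        }
    }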

Acknowledgements

I would like to thank Amresh Hanchate and Dan Blanchette for helpful feedback and beta-testing the routine.

Author

David C. Elliott, Nova Scotia Department of Health, Halifax