help insheetjson
-------------------------------------------------------------------------------

Title

insheetjson: Importing tablular data from JSON sources

Syntax

insheetjson varlist using, columns(string) [options] insheetjson using, showresponse [flatten]

options Description ------------------------------------------------------------------------- Required...

varlist are the string variables to store the retreived data > into

using "URL or file path" specifies the source of the JSON data

columns(string) specifies the column selectors associated with each specified variable

showresponse Make no changes to the Stata environment and instead just print the JSON reponse to the console

Optionals...

tableselector(string) specifies the selector to use in finding the table burried in the json object tree

replace Necessary if you need to replace existing data in varlist

printonly Make no changes to the Stata environment and instead just print the results to the console

flatten Output results as key(selector)-value pairs

savecontents(filename) Copies the contents of the raw response to the specified file

limit(#) specifies the maximum number of observations to add to the stat environment

followurl(selector [#]) specifies the selector to use to find the follow-on url, and the maximum number of times to follow it

offset(#) Observation to begin filling in the results at

-------------------------------------------------------------------------

Description

There are many data sources out on the web that operate by a URL query, and which then respond with data formatted in JavaScript Object Notation (JSON). This command is designed to support querying those sources, assuming that the user has some idea of what the JSON reponse will look like (or can puzzle it out. See examples below.) Through the application of data selectors, the JSON response can be examined and the data extracted in the format of string data. This data can then be stored in string variables in the Stata enviroment for later processing by the user.

To use this command, you will need to know at least the URL of the JSON data source.

Required Library To use this command, you will need to have installed libjson, which is also available from SSC via the following link

ssc install libjson

Latest Version

The latest version is always kept on the SSC website. To install the latest version click on the following link

ssc install insheetjson, replace

Options

+----------+ ----+ Required +---------------------------------------------------------

varlist is a list of string stata variables for the retureived data to be stored into. Note that if the string variable is not large enough to hold the string, it will be truncated. If a non-string variable is specified, then the data destined for that variable will quiety be discarded.

using(URL or file path) specifies where to get the JSON data from There are two types of sources: local (file) and web (URL). Normally, you want to give a URL that includes any needed query options (the things which follow a question mark), and a URL must have the leading "http:\\" to distiguish them from file paths.

columns(string) specifies the column selectors associated with each specified variable (in order of appearance in varlist)

showresponse makes no changes to the Stata environment and instead just print the JSON reponse to the console. Useful for exploring the data source to figure out what your selectors should be.

+-----------+ ----+ Optionals +--------------------------------------------------------

tableselector(string) specifies the selector to use in finding the table burried in the json object tree. The default is that the table starts immediately as an array of objects and contains no meta information.

replace is necessary if you need to overrite existing data in varlist.

printonly makes no changes to the Stata environment and instead just prints the results to the console. Useful for making "dry runs" to see what you will get without commiting the results to Stata.

showresponse makes no changes to the Stata environment and instead just (pretty) print the actual JSON reponse to the console. Useful for exploring the data source to figure out what your selectors should be.

flatten Cause the results to be converted into key(selector)-value pairs. Useful for figuring out what the exact (flattented) selector is for a given piece of data in the JSON response. For complex JSON responses, this can result is a prodigeous amount of output.

savecontents(filename) causes the raw contents (usually a JSON string) of the web response to the specified file. Useful when you want to cache the results locally for some reason, or if you are want to see the actual server response for debugging your scripts.

limit(#) specifies the maximum number of observations to add to the Stata environment.

followurl(selector [#]) is for those servers which spread their results over more than one page, and give a url to follow for the next block. This option specifies the selector to use to find this url, as well the maximum number of times to follow the link. If not specified, the default is 9 to to avoid hammering servers when the followurl() is used (for a total of 10 page loads). A limit of zero (or less) signals that there is no limit to the number of pages to be loaded, but this can be very dangerous with some servers.

offset(#) indicates the observation number to begin filling in the results at, and defaults to 1. Useful when a single dataset request must be broken up over multiple distinct queries and the server does not provide a follow-on URLs.

Examples

Load a single page of a twitter feed (about 15 observations) into three variables: . gen str240 tw_fu="" . gen str240 tw_uid ="" . gen str240 tw_geo ="" . insheetjson tw_fu tw_uid tw_geo using "http://search.twitter.com/s > earch.json?q=stata", table(results) col("from_user" "from_user_id_str" "geo:c > oordinates")

Load a single page of a twitter feed (about 15 observations), but don't copy the obervations into Stata: . gen str240 tw_fu="" . gen str240 tw_uid ="" . gen str240 tw_geo ="" . insheetjson tw_fu tw_uid tw_geo using "http://search.twitter.com/s > earch.json?q=stata", table(results) col("from_user" "from_user_id_str" "geo:c > oordinates") print

Load exactly 1000 entries from a twitter feed (possibly tossing some away): . gen str240 tw_fu="" . gen str240 tw_uid ="" . gen str240 tw_geo ="" . insheetjson tw_fu tw_uid tw_geo using "http://search.twitter.com/s > earch.json?q=stata", table(results) col("from_user" "from_user_id_str" "geo:c > oordinates") followurl(next_page) limit(1000)

Load an entire twitter feed (this may take some time): . gen str240 tw_fu="" . gen str240 tw_uid ="" . gen str240 tw_geo ="" . insheetjson tw_fu tw_uid tw_geo using "http://search.twitter.com/s > earch.json?q=stata", table(results) col("from_user" "from_user_id_str" "geo:c > oordinates") followurl(next_page) pagelimit(0)

Examine the json source to figure out what columns are possible: . insheetjson using "http://search.twitter.com/search.json?q=stata", > showresponse

Selectors Selectors are a series of named (or implicitly named in the case of arrays, which start at index "1") branches to take, starting from the given node (usually the root node). All column names in the {opt:columns()} are flattened Selectors.

Given the following example JSON object: { "foo" : "1", "bar": { "bar2":"2" }, "foobar": [ "bar1","bar2"] } the results of the following selectors would be... ("foo") --> "1" ("foo","bar","bar2") --> "2"

("foobar","2") --> "bar2"

("foo","bar") --> Depends. If a node is expected, then the nod > e is selected. If a Scalar (string, real) is expected, then is NOT FOUND and > considered missing.

Flattened Selectors A "flattened" selector is a single string with a colon inserted between each selector. For example, ("foo":"bar") --> "foo:bar". Selectors must be quoted if a space appears in one of the sub-selectors (ie. "foo bar:bar2"), though such selectors are rare.

Conversion Notes Be aware that all values are ultimately converted into strings, so any further conversions are up to the user. Also, for a given position in the table (after the column selector has been applied), the following are attempted in this order: (1) Is it a string or other un-quoted literal (such as 'null')? If yes, > use it. (2) Is it an Array? If yes, combine the array values (scalars only) into > a single string in the familiar "[ , ... ]" format.

Missing libjson This command uses the generic libjson library (libjson.mlib), and must be available in the Stata search path. The latest library file can be downloaded from SSC (Boston College Archive) via the following link:

ssc install libjson, replace

Author

Erik Lindsley, Ph.D. (ssc@holocron.org)

See also

Required library: libjson

Special Thanks

To my alpha tester: Andrew Dyck ( http://www.andrewdyck.com/ )