High performance JSON streaming in R: Part 1

November 6, 2014

The jsonlite stream_in and stream_out functions implement line-by-line processing of JSON data over a connection, such as a socket, url, file or pipe. Thereby we can construct a data processing pipeline that can handle large (or unlimited) amounts of data with limited memory. This post will walk through some examples from the help pages.

The json streaming format

Because parsing huge JSON strings is difficult and inefficient, JSON streaming is done using lines of minified JSON records. This is pretty standard: JSON databases such as MongoDB use the same format to import/export large datasets. Note that this means that the total stream combined is not valid JSON itself; only the individual lines are.

x <- iris[1:3,]
stream_out(x, con = stdout())
# {"Sepal.Length":5.1,"Sepal.Width":3.5,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.9,"Sepal.Width":3,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.7,"Sepal.Width":3.2,"Petal.Length":1.3,"Petal.Width":0.2,"Species":"setosa"}

Also note that because line-breaks are used as separators, prettified JSON is not permitted: the JSON lines must be minified. In this respect, the format is a bit different from fromJSON and toJSON where all lines are part of a single JSON structure with optional line breaks.

Streaming to/from a file

The nycflights13 package contains a dataset with about 5 million values. To stream this to a file:

stream_out(flights, con = file("~/flights.json"))

Running this code will open the file connection, write json to the connection in batches of 500 rows, and afterwards close the connection. Status messages will be printed to the console while writing output. The entire process should take a few seconds and generate a json file of about 7MB.

We use the same file to illustrate how to stream the json back into R. The following code will stream-parse the json in batches of 500 lines. Afterward we verify that the output is indeed identical to the original one:

flights2 <- stream_in(file("~/flights.json"))
all.equal(flights2, as.data.frame(flights))
# [1] TRUE

Because the data is read in small batches, this require much less memory than when we would try to parse a huge json blob all at once. The pagesize argument in stream_in and stream_out can be used to specify the number of rows that will be read/written per iteration.

Streaming from a URL

We can use the standard url function in R to stream from a HTTP connection.

diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))

If the data source is gzipped, simply wrap the connection in gzcon.

flights3 <- stream_in(gzcon(url("http://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights3, as.data.frame(flights))

Because R currently does not support SSL, we use a curl pipe to stream over HTTPS:

flights4 <- stream_in(gzcon(pipe("curl https://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights4, as.data.frame(flights))

For this to work, the curl executable needs to be installed and available in the search path, which requires cygwin on Windows. Unfortunately the RCurl package does not seem to support binary streaming at this point.

Next up

These examples illustrate basic line-by-line json streaming of data frames from/to a connection, which allows for importing/exporting large json datasets.

In the next blog post we will make the step to full JSON IO streaming by defining a custom handler function. This allows for constructing a json data processing pipeline in R that can handle an infinite data stream. Impatient readers can have a look at the examples in the stream_in help page.