New jsonlite gets a major speed boost!

September 6, 2014

The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types, which allows for converting objects to JSON and back without manual data restructuring. This is ideal for interacting with web APIs, or for building pipelines where data flow seamlessly in and out of R through JSON. The quickstart vignette gives a brief introduction, or just try:

fromJSON(toJSON(mtcars))
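To make the bidirectional mapping a little more concrete, here is a minimal sketch of a round trip on a tiny data frame. The data frame `df` and its column names are made up for illustration; only `toJSON` and `fromJSON` come from jsonlite.

```r
library(jsonlite)

# A small, hypothetical data frame used only for illustration
df <- data.frame(
  name = c("alice", "bob"),
  score = c(9.5, 7),
  passed = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Encode to JSON, then decode straight back into a data frame
json <- toJSON(df)
df2 <- fromJSON(json)

# The column names and column types should survive the round trip
names(df2)
sapply(df2, class)
```

Note that numbers are rounded to a limited number of digits when encoding (see the `digits` argument of `toJSON`), so a round trip is not guaranteed to be bit-for-bit identical for arbitrary doubles.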

Or use some data from the web:

# Latest commits in r-base
r_source <- fromJSON("https://api.github.com/repos/wch/r-source/commits")

# Pretty print:
committer <- format(r_source$commit$author$name)
date <- as.Date(r_source$commit$committer$date)
message <- sub("\n\n.*","", r_source$commit$message)
paste(date, committer, message)

New in 0.9.11: performance!

Version 0.9.11 has a few minor bugfixes, but most of the work in this release has gone into improving performance. The implementation of toJSON has been optimized in many ways, and with a little help from Winston Chang, the most CPU-intensive bottleneck has been ported to C code. The result is quite impressive: compared with the previous release, encoding data frames to row-based JSON is about 3x faster, and encoding data frames to column-based JSON is nearly 10x faster.
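For readers unfamiliar with the two layouts, the sketch below encodes the same small data frame both ways. The data frame is a made-up example; the `dataframe` argument is the one used in the benchmarks in this post, spelled out in full here as `"columns"`.

```r
library(jsonlite)

# Hypothetical two-row data frame, for illustration only
df <- data.frame(x = c(1, 2), y = c("a", "b"), stringsAsFactors = FALSE)

# Row-based (the default): a JSON array with one object per record
rows <- toJSON(df)

# Column-based: a single JSON object with one array per column
cols <- toJSON(df, dataframe = "columns")

cat(rows, "\n")
cat(cols, "\n")
```

The row-based form repeats every column name for each record, which is one plausible reason it is both larger and slower to produce than the column-based form.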

The diamonds dataset from the ggplot2 package has about 0.5 million values, which makes it a nice benchmark. On my MacBook, jsonlite takes on average 1.18s to encode it to row-based JSON, and 0.34s to column-based JSON:

library(jsonlite)
library(microbenchmark)
data("diamonds", package="ggplot2")
microbenchmark(json_rows <- toJSON(diamonds), times=10)
# Unit: seconds
#              expr     min       lq   median       uq     max neval
#  toJSON(diamonds) 1.12773 1.140724 1.175872 1.180354 1.21786    10

microbenchmark(json_columns <- toJSON(diamonds, dataframe="col"), times=10)
# Unit: milliseconds
#                                 expr      min      lq   median       uq      max neval
#  toJSON(diamonds, dataframe = "col") 333.9494 334.799 338.0843 340.0929 350.3026    10

Parsing and simplification performance

The performance of fromJSON has been improved as well. The parser itself, a high-performance C++ library borrowed from RJSONIO, has not changed. However, the simplification code used to reduce deeply nested lists into nice vectors and data frames has been tweaked in many places and is on average 3 to 5 times faster than before (depending on what the JSON data look like). For the diamonds example, the row-based data gets parsed in about 2.32s and the column-based data in 1.25s.

microbenchmark(fromJSON(json_rows), times=10)
# Unit: seconds
#                 expr      min       lq   median       uq      max neval
#  fromJSON(json_rows) 2.178211 2.278337 2.319519 2.376085 2.423627    10

microbenchmark(fromJSON(json_columns), times=10)
# Unit: seconds
#                    expr     min       lq   median       uq      max neval
#  fromJSON(json_columns) 1.17289 1.252284 1.253999 1.265763 1.306357    10

For comparison, we can also disable simplification, in which case parsing takes 0.70 and 0.39 seconds respectively for these data. However, without simplification we end up with a big nested list of lists, which is often not very useful.

microbenchmark(fromJSON(json_rows, simplifyVector=F), times=10)
# Unit: milliseconds
#                                     expr      min       lq   median       uq      max neval
#  fromJSON(json_rows, simplifyVector = F) 635.5767 648.4693 704.6996 720.0335 727.8869    10

microbenchmark(fromJSON(json_columns, simplifyVector=F), times=10)
# Unit: milliseconds
#                                        expr      min       lq   median       uq      max neval
#  fromJSON(json_columns, simplifyVector = F) 385.3224 388.4772 395.1916 409.3432 463.9695    10
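To illustrate what simplification buys you, here is a small sketch that parses the same JSON string with and without it. The JSON string and its field names are invented for the example.

```r
library(jsonlite)

# Hypothetical row-based JSON with two records
json <- '[{"id":1,"tag":"a"},{"id":2,"tag":"b"}]'

# Default: the array of records is simplified into a data frame
simplified <- fromJSON(json)

# Without simplification: a plain list with one named list per record
nested <- fromJSON(json, simplifyVector = FALSE)

class(simplified)
class(nested)

# Fields must be reached record by record in the nested form
nested[[1]]$tag
simplified$tag
```

The nested form is cheaper to produce, but most downstream R code would have to rebuild vectors or data frames from it by hand.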