Publishing dynamic data on ocpu.io

February 16, 2014


Suppose you would like to publish some data, for example to accompany a journal article. One way would be to put a CSV file on your website and share the URL with your colleagues. However, CSV has many limitations: it only works for tabular structures, has limited type safety (pretty much everything gets coerced into strings), and can lose numeric precision.

There are many alternative data interchange formats, each with its own benefits and limitations. For example, JSON is widely supported and can be parsed in almost any language, but it can be verbose and slow. A binary format such as Protocol Buffers is more efficient, but many users might not know how to parse it. You could even use save or saveRDS in R to share native R structures, but this limits your audience to R users.

Retrieving dynamic data

What we really need is a method to publish the data itself rather than some representation of the data in a particular format. With OpenCPU you can publish R objects (including datasets) in a way that lets the clients select the format and formatting options for retrieving the dataset. This is implemented using native R functionality to include arbitrary data/objects in packages, and standard R functions for exporting these data. For example, the CRAN package MASS includes a dataset called bacteria:

library(MASS)
data(bacteria)
print(bacteria)

Via OpenCPU, the dataset can be downloaded by anyone, in any of several formats:

Format            Export function           URL (short)
text              print                     cran.ocpu.io/MASS/data/bacteria/print
CSV               write.csv                 cran.ocpu.io/MASS/data/bacteria/csv
TSV               write.table               cran.ocpu.io/MASS/data/bacteria/tab
JSON              jsonlite::toJSON          cran.ocpu.io/MASS/data/bacteria/json
Protocol Buffers  RProtoBuf::serialize_pb   cran.ocpu.io/MASS/data/bacteria/pb
RData             save                      cran.ocpu.io/MASS/data/bacteria/rda
RDS               saveRDS                   cran.ocpu.io/MASS/data/bacteria/rds
ascii R           dput                      cran.ocpu.io/MASS/data/bacteria/ascii
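
As a rough illustration of the table above, the same dataset can be pulled straight into R from two of these URLs. This is only a sketch: the helper name ocpu_data_url is hypothetical, and both the https:// scheme and the availability of the public server are assumptions rather than guarantees.

```r
# Hypothetical helper: build the URL of a dataset published on ocpu.io.
# `user` is the subdomain ("cran" for packages from CRAN).
# The https:// scheme is an assumption; the table lists schemeless URLs.
ocpu_data_url <- function(pkg, object, format, user = "cran") {
  sprintf("https://%s.ocpu.io/%s/data/%s/%s", user, pkg, object, format)
}

csv_url <- ocpu_data_url("MASS", "bacteria", "csv")
rds_url <- ocpu_data_url("MASS", "bacteria", "rds")

# The downloads need network access and a reachable public server,
# so they only run in an interactive session.
if (interactive()) {
  from_csv <- read.csv(csv_url)             # column types are re-inferred from text
  from_rds <- readRDS(gzcon(url(rds_url)))  # native R object, types preserved
  str(from_rds)
}
```

Comparing str(from_csv) with str(from_rds) is a quick way to see what a text format loses and a native format keeps.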

The client can also control formatting options by passing HTTP parameters. These parameters map directly to the arguments of the respective export function in the table above. A few examples:

Export call in R                                Equivalent URL on public OpenCPU server
write.csv(bacteria, row.names=TRUE)             cran.ocpu.io/MASS/data/bacteria/csv?row.names=true
jsonlite::toJSON(Boston, digits=4)              cran.ocpu.io/MASS/data/Boston/json?digits=4
jsonlite::toJSON(Boston, dataframe="columns")   cran.ocpu.io/MASS/data/Boston/json?dataframe=columns
jsonlite::toJSON(Boston, pretty=FALSE)          cran.ocpu.io/MASS/data/Boston/json?pretty=false
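
The mapping from function arguments to HTTP parameters shown in these examples can be sketched in a few lines of R. The helper ocpu_query is hypothetical, and the https:// scheme is an assumption. The code only mirrors the pattern visible in the URLs above (logicals become lowercase true/false) and is not part of OpenCPU itself.

```r
# Hypothetical helper: turn R-style named arguments into an HTTP query string.
ocpu_query <- function(...) {
  args <- list(...)
  if (length(args) == 0) return("")
  values <- vapply(args, function(x) {
    # R logicals become lowercase "true"/"false", as in the example URLs
    if (is.logical(x)) tolower(as.character(x)) else as.character(x)
  }, character(1))
  # Percent-encode the values so special characters survive in a URL
  encoded <- vapply(values, utils::URLencode, character(1), reserved = TRUE)
  paste0("?", paste(names(args), encoded, sep = "=", collapse = "&"))
}

# Equivalent of jsonlite::toJSON(Boston, digits = 4, pretty = FALSE)
boston_url <- paste0("https://cran.ocpu.io/MASS/data/Boston/json",
                     ocpu_query(digits = 4, pretty = FALSE))
print(boston_url)
```

Keeping the argument names identical on both sides means the R documentation for each export function doubles as documentation for the URL parameters.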

Creating a data package

To start publishing your own dynamic data, you need to put your data objects in an R package, following the standard guidelines documented in section 1.1.6 of Writing R Extensions. This might sound cumbersome, but once you get the hang of it, it only takes a few seconds. You’ll realize that packages are actually a beautiful, standardized and well-tested container format for R objects and much more. Have a look at the data folder in the opencpu/appdemo package for some examples.
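
To make the layout concrete, here is a minimal sketch that creates a data-only package skeleton in a temporary directory. The names mypackage and myobject, the sample data frame and all DESCRIPTION field values are placeholders, and the empty NAMESPACE file is assumed to be sufficient for a package that contains only data.

```r
# Build a minimal data package skeleton in a temporary directory.
# "mypackage" and "myobject" are placeholder names.
pkgdir <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkgdir, "data"), recursive = TRUE, showWarnings = FALSE)

# Minimal DESCRIPTION file; every field value here is a placeholder.
writeLines(c(
  "Package: mypackage",
  "Version: 0.1",
  "Title: Example Data Package",
  "Description: Holds datasets for publishing through OpenCPU.",
  "Author: Your Name",
  "Maintainer: Your Name <you@example.com>",
  "License: CC0"
), file.path(pkgdir, "DESCRIPTION"))

# Assumption: an empty NAMESPACE is enough for a data-only package.
file.create(file.path(pkgdir, "NAMESPACE"))

# Any R object can be stored; by convention the file is named after the object.
myobject <- data.frame(id = 1:3, value = c(0.1, 0.25, 1/3))
save(myobject, file = file.path(pkgdir, "data", "myobject.rda"))

# Installing from source is left as a manual step:
# install.packages(pkgdir, repos = NULL, type = "source")
```

Because the object is saved in R's native format, the numeric value 1/3 and the integer type of id are stored exactly rather than as text.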

After creating your package and installing it in your local R library, test it using the OpenCPU single-user server:

library(opencpu)
opencpu$browse("/library/mypackage/data")
opencpu$browse("/library/mypackage/data/myobject")

Publishing dynamic data on ocpu.io

To make your data available through the public OpenCPU server and ocpu.io, all you need to do is put your package up on GitHub. OpenCPU requires the name of the GitHub repository to match the name of the R package it contains. Use devtools to test whether your package installs correctly:

library(devtools)
# recent devtools versions take a single "username/pkgname" argument
install_github("username/pkgname")

If this succeeds, you’re good to go. Navigate to username.ocpu.io/pkgname/data, where username is your GitHub login. By default, the OpenCPU public server updates packages installed from GitHub every 24 hours. However, the GitHub webhook can be used to update the package immediately every time a commit is pushed to GitHub.
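
The naming requirement mentioned above (repository name equals package name) is easy to get wrong, so a quick local check before pushing can help. The sketch below is a hypothetical helper, not part of OpenCPU or devtools, and it assumes that your local clone directory has the same name as the GitHub repository.

```r
# Hypothetical helper: does the repository directory name match the
# Package field in its DESCRIPTION file?
repo_matches_package <- function(repo_dir) {
  desc <- read.dcf(file.path(repo_dir, "DESCRIPTION"), fields = "Package")
  identical(unname(desc[1, "Package"]), basename(repo_dir))
}

# Demonstration with a throwaway directory standing in for a local clone.
demo_dir <- file.path(tempdir(), "pkgname")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Package: pkgname", "Version: 0.1"),
           file.path(demo_dir, "DESCRIPTION"))
print(repo_matches_package(demo_dir))
```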

Publishing dynamic data on your own server

OpenCPU does not lock you into a commercial hosting service. Your data is stored on GitHub in a standard format under your control. The ocpu.io public server is there for your convenience. You can also install your own OpenCPU cloud server to publish data at e.g. http://opencpu.yourserver.com/ocpu/library/pkgname/data/myobject. In that case there is no need to put anything on GitHub: just install the package in R on the server.