OpenCPU

OpenCPU 2.1 Release: Scalable R Services

2018-11-22T00:00:00+00:00

OpenCPU provides a mature and robust system for hosting R based services. The server exposes a simple HTTP API for calling R functions and scripts and for managing data. The Cloud Server is completely free and scales up to many concurrent users. This provides a reliable foundation for integrating R into any environment.

The 2.1 branch is the new major release of OpenCPU. The changes in this version are mostly internal, and make the server a bit lighter and faster. The built-in CI system has switched to the lightweight remotes package for installing packages from GitHub. Moreover the opencpu-server package has been tweaked to work better inside docker. Also we now target R 3.5 on server installations.

The user facing features are unchanged; see the opencpu 2.0 announcement post for a brief overview.
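As a rough illustration of what calling an R function over HTTP looks like, here is a minimal sketch in R. The helper name ocpu_fun_url is made up for this example, and the /ocpu/library/{pkg}/R/{fun}/{format} path layout is taken from the OpenCPU API documentation rather than from this post, so treat it as an assumption. The request itself only runs in an interactive session and needs the httr package and a reachable server.

```r
# Hypothetical helper: build the endpoint for a function in an installed package.
# Assumes the /ocpu/library/{pkg}/R/{fun}/{format} layout of the OpenCPU API.
ocpu_fun_url <- function(server, pkg, fun, format = "json") {
  server <- sub("/+$", "", server)  # drop trailing slashes
  sprintf("%s/ocpu/library/%s/R/%s/%s", server, pkg, fun, format)
}

url <- ocpu_fun_url("https://cloud.opencpu.org", "stats", "rnorm")
print(url)

# Only contact the server when running interactively.
if (interactive()) {
  # Function arguments are sent as a JSON body: rnorm(n = 3)
  res <- httr::POST(url, body = list(n = 3), encode = "json")
  print(httr::content(res))
}
```

Swapping the server argument for the address of your own opencpu-server should be all that is needed to target a private installation.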

Upgrading

Version 2.1.0 is available from CRAN, Launchpad, Dockerhub, OBS and the server archive.

The recommended platform for running the server is Ubuntu 18.04 or 16.04, where it can be installed directly from the PPA. For Fedora and CentOS you can download installers from the server archive. All binaries from the archive have been built on dockerhub and depend on the current version of R from Fedora / EPEL.

The easiest way to get started is by deploying your packages on the public cloud server by enabling the opencpu webhook in your GitHub repository.

Docker

Another easy way to get started is using docker, which also runs on Windows these days. Images based on various platforms are published on dockerhub. The opencpu/rstudio image is recommended for development: it runs both opencpu-server and rstudio-server, which are very powerful together.

# Run server as executable
docker run --name mybox -t -p 80:80 opencpu/rstudio

# OR: if port 80 is taken use port 8004
docker run --name mybox -t -p 8004:8004 opencpu/rstudio

Now simply open http://localhost/ocpu/ and http://localhost/rstudio/ in your browser! Login via rstudio with user: opencpu (passwd: opencpu) to build and install packages.

To connect to a running container (e.g. for installing system libraries) get a root shell:

# Replace 'mybox' with the container name or id
docker exec -i -t mybox /bin/bash

Use the opencpu/base image for deployments. Also see the docker readme.

Why Use Docker with R? A DevOps Perspective

2017-10-16T00:00:00+00:00

There have been several blog posts going around about why one would use Docker with R. In this post I’ll try to add a DevOps point of view and explain how containerizing R is used in the context of the OpenCPU system for building and deploying R servers.

Has anyone in the #rstats world written really well about the *why* of their use of Docker, as opposed to the the *how*?
— Jenny Bryan (@JennyBryan) September 29, 2017

1: Easy Development

The flagship of the OpenCPU system is the OpenCPU server: a mature and powerful Linux stack for embedding R in systems and applications. Because OpenCPU is completely open source we can build and ship on DockerHub. A ready-to-go linux server with both OpenCPU and RStudio can be started using the following (use port 8004 or 80):

docker run -t -p 8004:8004 opencpu/rstudio

Now simply open http://localhost:8004/ocpu/ and http://localhost:8004/rstudio/ in your browser! Login via rstudio with user: opencpu (passwd: opencpu) to build or install apps. See the readme for more info.

Docker makes it easy to get started with OpenCPU. The container gives you the full flexibility of a Linux box, without the need to install anything on your system. You can install packages or apps via rstudio server, or use docker exec to get a root shell on the running server:

# Lookup the container ID
docker ps

# Drop a shell
docker exec -i -t eec1cdae3228 /bin/bash

From the shell you can install additional software in the server, customize the apache2 httpd config (auth, proxies, etc), tweak R options, optimize performance by preloading data or packages, etc.

2: Shipping and Deployment via DockerHub

The most powerful use of Docker is shipping and deploying applications via DockerHub. To create a fully standalone application container, simply use a standard opencpu image and add your app.

For the purpose of this blog post I have wrapped up some of the example apps as docker containers by adding a very simple Dockerfile to each repository. For example the nabel app has a Dockerfile that contains the following:

FROM opencpu/base

RUN R -e 'devtools::install_github("rwebapps/nabel")'

It takes the standard opencpu/base image and then installs the nabel app from the GitHub repository. The result is a completely isolated, standalone application. The application can be started by anyone using e.g.:

docker run -d -p 8004:8004 rwebapps/nabel

The -d flag runs the container in the background (daemonized), with the server published on port 8004. Now open the app via: http://localhost:8004/ocpu/library/nabel. Obviously you can tweak the Dockerfile to install whatever extra software or settings you need for your application.

Containerized deployment shows the true power of docker: it allows for shipping fully self-contained applications that work out of the box, without installing any software or relying on paid hosting services. If you do prefer professional hosting, there are many companies that will gladly host docker applications for you on scalable infrastructure.

3: Cross Platform Building

There is a third way Docker is used for OpenCPU. At each release we build the opencpu-server installation package for half a dozen operating systems, which get published on https://archive.opencpu.org. This process has been fully automated using DockerHub. The following images automatically build the entire stack from source:

DockerHub automatically rebuilds these images when a new release is published on GitHub. All that is left to do is run a script which pulls down the images and copies the opencpu-server binaries to the archive server.

Announcing OpenCPU 2.0: Building and Deploying Scalable R Apps and Services

2017-07-14T00:00:00+00:00

OpenCPU 2.0 provides the most robust system available today for building and deploying R based apps and services. The server exposes a simple HTTP API for calling R functions and scripts and for managing data, which provides a very solid basis for integrating R into any environment. The OpenCPU 2.0 cloud server naturally scales up to many concurrent users and is entirely available under the business friendly Apache2 license, at no extra cost.

The 2.0 branch is the biggest upgrade to the system since the 1.0 release 4 years ago. The server API is backwards compatible so that existing clients and apps will keep working. Internals have been rewritten to make development easier and further enhance the performance and robustness of the server system.

Version 2.0.3 is available from CRAN, Launchpad, Dockerhub, OBS and the server archive. Below is a brief overview of improvements in OpenCPU 2.0!

OpenCPU Apps

The 2.0 version makes it even easier to build and deploy R webapps. An app in OpenCPU is simply an R package which may include a web frontend that interacts with R functions from the same package via the OpenCPU API. By using the R package format as a container for shipping web applications, OpenCPU apps natively support dependencies, namespaces, embedded data, documentation, etc.

Apps can be run or deployed in many ways.

Run or develop locally using the single user server in R using opencpu::ocpu_start_app()
Deploy for free on <yourname>.ocpu.io or cloud.opencpu.org using the CI webhook
Host your own opencpu-server, either internally or on the internet
Ship and deploy apps in docker containers

Several example apps are available from rwebapps Github repository. You can try each app on the public cloud server or you can run it locally in R using the single-user server.

Single User server

The OpenCPU single-user server allows for running OpenCPU inside an interactive R session on any platform. To install the latest version in R:

install.packages("opencpu")

Version 2.0 has made it much easier to run and develop OpenCPU apps using the single user server. For example to run the rwebapps/stockapp app:

opencpu::ocpu_start_app("rwebapps/stockapp")

Or try the very cool rwebapps/markdownapp:

opencpu::ocpu_start_app("rwebapps/markdownapp")

Also try any of the other rwebapps. Each of these apps can also be used on https://rwebapps.ocpu.io/<app>.

Cloud Server and OCPU.IO

The new version makes it super easy to publish your apps and packages on the public cloud server via the Github CI. All you need to do is set the OpenCPU webhook in your Github repository or Github organization.

Upon your next git push, your package will immediately become available on a fancy private subdomain https://<yourname>.ocpu.io/<pkg> named after your github username or organization.

Note again that in OpenCPU an app is just an R package. You can start deploying any R package on ocpu.io to call it remotely or just for fun, even if the package does not contain any special web front-end.

Dependency Remotes

Your app or package might depend on other CRAN packages as specified in the package DESCRIPTION file according to the standard R mechanics. However sometimes your package depends on an R package which is not on CRAN, for example from Github.

To deploy packages on OpenCPU which have non-CRAN dependencies, specify the Remotes field in the DESCRIPTION according to the devtools vignette. Internally the OpenCPU webhook simply uses devtools::install_github() to install your package, so it supports everything that install_github does.
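To make this concrete, the sketch below writes a small DESCRIPTION file with a Remotes field and reads it back with base R. The package name myapp, the dependency mypkg and the GitHub account someuser are placeholders invented for this example.

```r
# Hypothetical DESCRIPTION contents; all names below are placeholders.
desc <- c(
  "Package: myapp",
  "Version: 0.1.0",
  "Title: Example OpenCPU App",
  "Imports: jsonlite, mypkg",
  "Remotes: someuser/mypkg"
)
path <- file.path(tempdir(), "DESCRIPTION")
writeLines(desc, path)

# read.dcf() parses the file into a character matrix, one column per field.
fields <- read.dcf(path)
print(fields[1, "Imports"])
print(fields[1, "Remotes"])
```

The dependency is still listed under Imports as usual; the Remotes entry only tells the installer where to find a package that is not on CRAN.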

You can even pass custom arguments to install_github by adding them to the webhook URL as http parameters.

Improved Data Interchange

The most difficult part of building R apps and services is data interchange: getting complex structures efficiently and reliably in and out of R. A lot of energy in OpenCPU 2.0 has gone into further optimizing this critical part of the system.

The three major data formats in OpenCPU are now fully implemented by myself in highly optimized C/C++ packages:

json: opencpu uses jsonlite::fromJSON() for reading and jsonlite::toJSON() for writing json.
protobuf: opencpu uses protolite::serialize_pb() and protolite::unserialize_pb() to convert between objects and protocol buffers.
multipart/form-data: (POST only) opencpu uses webutils::parse_multipart() for parsing multipart.

Obviously these packages are not limited to OpenCPU; they may be used by other systems as well.
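As a small standalone illustration, the two serialization packages listed above can be tried outside of OpenCPU. This is only a sketch with a made-up data frame, and it assumes both jsonlite and protolite are installed.

```r
library(jsonlite)
library(protolite)

# A small made-up data frame
df <- data.frame(
  id = 1:3,
  score = c(0.5, 1.5, 2.5),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# JSON: text based, readable by practically every client
json <- toJSON(df)
df_json <- fromJSON(json)

# Protocol Buffers: binary, serialized into a raw vector
msg <- serialize_pb(df)
df_pb <- unserialize_pb(msg)

# Compare both roundtrips with the original
print(all.equal(df, df_json))
print(all.equal(df, df_pb))
```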

DataFrames

A special role in R is reserved for Data Frames, the common data structure for storing tabular data sets. OpenCPU adds additional output types for retrieving data frames in NDJSON, SPSS, SAS or STATA format.

For example the following URLs retrieve the “diamonds” dataset from the “ggplot2” package in various formats:

https://cran.ocpu.io/ggplot2/data/diamonds/csv
https://cran.ocpu.io/ggplot2/data/diamonds/json
https://cran.ocpu.io/ggplot2/data/diamonds/ndjson
https://cran.ocpu.io/ggplot2/data/diamonds/pb
https://cran.ocpu.io/ggplot2/data/diamonds/feather
https://cran.ocpu.io/ggplot2/data/diamonds/rda
https://cran.ocpu.io/ggplot2/data/diamonds/rds
https://cran.ocpu.io/ggplot2/data/diamonds/spss
https://cran.ocpu.io/ggplot2/data/diamonds/sas
https://cran.ocpu.io/ggplot2/data/diamonds/stata

This also shows an additional use case for OpenCPU: publishing datasets in a format-agnostic way using the “lazydata” feature from the R packaging format.

It is completely valid to create an R package which contains only a dataset (no functions) and deploy it on OCPU.IO to make it available in a dozen formats at once!
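To get a feel for the NDJSON format mentioned above without contacting a server, the following sketch writes a data frame as newline-delimited JSON and reads it back using jsonlite's streaming functions. It uses the built-in iris data as a stand-in; how the server produces its ndjson output internally is not shown here.

```r
library(jsonlite)

tmp <- tempfile(fileext = ".ndjson")

# Write one JSON record per line
stream_out(iris, file(tmp), verbose = FALSE)
lines <- readLines(tmp)
print(lines[1])

# Read the records back into a data frame
iris2 <- stream_in(file(tmp), verbose = FALSE)
print(dim(iris2))
```

Because every line is a complete JSON object, clients can process such a file record by record instead of parsing one large array.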

Server Binaries

OpenCPU 2.0 has further improved opencpu-server, the highly configurable multi-user server implementation, to run on various distributions as well as docker. This makes installing (and uninstalling) an opencpu production server easy for users or system administrators.

The recommended platform is still Ubuntu 16.04 (Xenial) because it supports AppArmor. This is also the platform we use to host cloud.opencpu.org and ocpu.io. Installation is easy:

# Requires Ubuntu 16.04 (Xenial)
sudo add-apt-repository -y ppa:opencpu/opencpu-2.0
sudo apt-get update 
sudo apt-get upgrade

# Installs OpenCPU server
sudo apt-get install -y opencpu-server

# Optional: installs rstudio in http://yourhost/rstudio
sudo apt-get install -y rstudio-server 

New in version 2.0 is that we provide binary installation packages for Debian 9, Fedora 25, CentOS 6 and 7. These binaries are built on dockerhub:opencpu and can also be downloaded from https://archive.opencpu.org.

Docker

We now provide several docker images for running opencpu-server, for both development and deployment. The opencpu/rstudio docker image runs both opencpu-server and rstudio-server, which is nice for development. To start the docker container on port 80 with name “mybox” you would run:

docker run --name mybox -t -p 80:80 opencpu/rstudio

If port 80 is taken on your machine you can also use 8004. Once this runs you can navigate to http://localhost/ocpu and http://localhost/rstudio in your browser to get started. You can log in to rstudio with username/password: opencpu/opencpu.

To get a root shell on the server (for example to install system libraries needed by certain R packages) simply run:

# Replace 'mybox' with the --name above
docker exec -i -t mybox /bin/bash

From the shell you can easily install R packages or apt-get install system libraries or modify the server configuration in /etc/opencpu.

Roadmap

OpenCPU 2.0 server is a major step forward towards a robust system for building and deploying R based apps and services. We will keep improving the server implementations based on our experiences and feedback from users and developers.

Next up is updating the documentation to explain some of the powerful new features that were introduced in the 2.0 branch. We will also be updating the opencpu.js JavaScript client and build some cool new R webapps, which is what OpenCPU was built for in the first place!

New in jsonlite 0.9.22: distinguish between double and integer

2016-06-15T00:00:00+00:00

Today a new version of the jsonlite package was released to CRAN. This update includes a few internal enhancements and one new feature.

Doubles vs integers

The new always_decimal parameter forces formatting of doubles in decimal notation, that is, with at least one digit to the right of the decimal point. This allows us to distinguish them from integers, if you need this.

x <- 1:5
y <- as.numeric(x)
(json_x <- jsonlite::toJSON(x, always_decimal = TRUE))
# [1,2,3,4,5] 

(json_y <- jsonlite::toJSON(y, always_decimal = TRUE))
# [1.0,2.0,3.0,4.0,5.0] 

By formatting doubles this way they naturally get parsed back into doubles. So we can roundtrip numbers between R and json without losing type:

identical(x, jsonlite::fromJSON(json_x))
# TRUE

identical(y, jsonlite::fromJSON(json_y))
# TRUE

You should only use this if you really need it. The json format itself does not specify number types, hence there is no guarantee that an arbitrary json parser will distinguish between integers and doubles. Indeed, most json parsers might simply parse any number into a double, which is totally correct as well.
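For contrast, here is a short sketch of what happens without the parameter. It only uses toJSON and fromJSON from jsonlite and shows that whole-number doubles come back as integers by default.

```r
library(jsonlite)

y <- as.numeric(1:5)

# Default: whole-number doubles are written without a decimal point ...
json_default <- toJSON(y)
print(json_default)

# ... so they are parsed back as integers and the original type is lost
back_default <- fromJSON(json_default)
print(class(back_default))

# With always_decimal = TRUE the type survives the roundtrip
back_decimal <- fromJSON(toJSON(y, always_decimal = TRUE))
print(class(back_decimal))
```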

Also setting always_decimal = TRUE introduces some performance overhead.

Numbers in MongoDB and Mongolite

The main motivation for this feature was to insert data from R into MongoDB using the mongolite package. Several users of mongolite had requested a way to retain number types, especially when reading the data from MongoDB back into a strongly typed language such as C++.

The latest version of mongolite automatically takes advantage of this feature:

# Get latest mongolite
devtools::install_github("jeroenooms/mongolite")

# Assuming you have a local `mongod` running
library(mongolite)
df <- data.frame(x = 1:5, y = as.numeric(1:5))
m <- mongo("testnum")
m$insert(df)
out <- m$find()
identical(out, df)
# TRUE

This makes it even more seamless to use MongoDB as a backend for storing data frames in R!

OpenCPU release 1.6

2016-05-20T00:00:00+00:00

Following a few weeks of testing, OpenCPU 1.6 has been released. OpenCPU is a production-ready system for embedded statistical computing with R. It provides a neat API for remotely calling R functions over HTTP via e.g. JSON or Protocol Buffers. The OpenCPU server implementation is stable and has been thoroughly tested. It runs on all major Linux distributions and plays nicely with the RStudio server IDE (demo).

Similarly to shiny, OpenCPU can run as a single-user development server within the interactive R session, and as a multi-user (cloud) server for deployments on Linux. Unlike shiny however, the cloud server comes at no extra cost. On the contrary: you are encouraged to take advantage of the cloud server which is much faster and includes cool features like user libraries, concurrent sessions, continuous integration, customizable security policies, etc.

Improvements: protolite and feather

The OpenCPU API has not changed from the 1.4 and 1.5 branches. The version bump indicates that this version targets R 3.3 and supports the new Ubuntu 16.04. Furthermore the underlying stack of bundled R packages has been upgraded. Navigate to /ocpu/info on your OpenCPU server to inspect the exact versions of all packages used by the system.

This version introduces two major improvements for binary data interchange. First the RProtoBuf dependency has been replaced by the much smaller protolite package, which has an optimized version of protobuf object serialization. OpenCPU already had an API for exporting data to Protocol Buffers; it’s just much faster now.

library(httr)
library(protolite)
req <- GET("https://demo.ocpu.io/ggplot2/data/diamonds/pb")
mydiamonds <- unserialize_pb(content(req))

New in this version is the feather output format which can be parsed/generated with the new feather package.

library(curl)
library(feather)
curl_download("https://demo.ocpu.io/ggplot2/data/diamonds/feather", "diamonds.feather")
mydiamonds <- read_feather("diamonds.feather")

Both pb and feather are a binary alternative to the text based json format:

library(curl)
library(jsonlite)
con <- curl("https://demo.ocpu.io/ggplot2/data/diamonds/json")
mydiamonds <- fromJSON(con)

Installation and upgrading

The download page has instructions for installing the opencpu server on various distributions, either from source or using precompiled binaries. To upgrade an existing installation of opencpu on ubuntu, simply run:

sudo add-apt-repository ppa:opencpu/opencpu-1.6
sudo apt-get update
sudo apt-get dist-upgrade

Note that this will also upgrade the version of R to 3.3.0 (if you have not already done so) which might require that you reinstall some of your R packages.

You can also install opencpu-server on any version of Debian/Ubuntu/Fedora/CentOS/RHEL by building the deb/rpm installation package from source. This is really easy, see the readme for deb or rpm.

Getting started

For those completely new to OpenCPU there are several resources to get started. The presentation from last year’s useR conference gives a broad overview of the system including some basic demos. The example apps and jsfiddle scripts show how to use the opencpu.js JavaScript client. The server manual contains documentation on configuring your opencpu cloud server (although installation should work out of the box).

Finally this paper from my thesis describes more generally the challenges of embedded scientific computing, and the benefits (both technical and human) of decoupling your statistical computing from your front-end or application layer.

The public demo server

To deploy your OpenCPU apps on the public server, simply push your R package to Github and configure the webhook in your repository. Whenever you push an update to Github the package will be reinstalled on the server and can directly be used remotely by anyone on the internet. You can either use the full url or the ocpu.io shorthand url:

https://cloud.opencpu.org/ocpu/github/{username}/{package}/
https://{username}.ocpu.io/{package}/

These urls are fully equivalent. Simply replace {username} with your github username, and {package} with your package name. Note that the package name must be identical to the github repository name (as is usually the case).
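The mapping between the two forms can be written down as a tiny helper. This is just a sketch: the function name ocpu_github_urls is made up for this example, and rwebapps/stockapp is one of the example apps mentioned on this blog.

```r
# Hypothetical helper: build both URL forms for a package deployed from GitHub
ocpu_github_urls <- function(username, package) {
  c(
    full  = sprintf("https://cloud.opencpu.org/ocpu/github/%s/%s/", username, package),
    short = sprintf("https://%s.ocpu.io/%s/", username, package)
  )
}

urls <- ocpu_github_urls("rwebapps", "stockapp")
print(urls)
```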

On writing packages

One prerequisite for using OpenCPU is knowing how to create an R package. There is no way around this; packages are the natural container format for shipping and deploying code/data/manuals in R, and the OpenCPU API assumes this format. Luckily, writing R packages is super easy these days and can be done in less than 10 seconds using for example RStudio.

The good thing is that once you have passed this little hurdle, the full power and flexibility of R and its packaging system become available to your applications and APIs. Hadley’s latest book on writing R packages gives a nice overview of the R packaging system, and the OpenCPU API provides an easy HTTP interface to all of these features.

Faster arrays and matrices in jsonlite 0.9.20

2016-05-11T00:00:00+00:00

Yesterday a new version of the jsonlite package was released to CRAN. This update includes no new features, it only introduces performance optimizations.

Large Matrices

The jsonlite package was already highly optimized for converting vectors and data frames to json. However Gregory Jefferis and Duncan Murdoch had found that conversion of tall matrices as used by rglwidget was slower than expected.

It turned out this was indeed an edge case that I had overlooked. The new version of jsonlite fixes this problem and matrix conversion should be about 200 times faster than before. Technical details follow below; first a benchmark:

# Old version!
> system.time(j<-toJSON(matrix(1L, ncol = 3, nrow = 50000)))
   user  system elapsed
  4.715   0.015   4.729

# New version!
> system.time(j<-toJSON(matrix(1L, ncol = 3, nrow = 50000)))
   user  system elapsed
  0.022   0.002   0.023

This artificial example (every field has the number 1) highlights the improvement. The relative improvement might be less for matrices with actual data because of additional time spent on number formatting double/integer values (which was already optimized in jsonlite a while ago).

Technical Details

So what was the problem? The previous version of jsonlite had an elegant solution that would recurse through the dimensions of a matrix/array and apply json conversion on each of its elements. E.g. for a matrix (2D array) it would convert each row to json, and then combine the results. However it turns out that the apply call below is really slow.

# Technical example, don't use this code !
x <- matrix(1L, ncol = 3, nrow = 50000)
rows <- apply(x, 1, jsonlite:::asJSON)
json <- jsonlite:::collapse(rows, indent = NA)

The new version exploits the fact that matrices and arrays are homogeneous (i.e. all elements have the same type). It first removes the dimensions from the array using c(x) and converts all of the individual elements to json with a single call to asJSON. This results in a significant speedup because asJSON is only called once rather than n times.

# Technical example, don't use this code !
str <- jsonlite:::asJSON(c(x), collapse = FALSE)
dim(str) <- dim(x)
rows <- apply(str, 1, jsonlite:::collapse, indent = NA)
json <- jsonlite:::collapse(rows, indent = NA)

Things get a bit more complicated for higher dimensional arrays, especially with toJSON(x, pretty = TRUE) but this illustrates the core issue.
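The same idea can be reproduced in base R without touching jsonlite internals. In the sketch below as.character() stands in for the element-wise json conversion, which is only adequate for a plain integer matrix; it is meant to illustrate the flatten, convert, reshape pattern, not to replace toJSON.

```r
x <- matrix(1:6, nrow = 2)

# 1. Drop the dimensions and format every element in one vectorized call
str <- as.character(c(x))

# 2. Restore the dimensions so the strings line up with the original cells
dim(str) <- dim(x)

# 3. Collapse each row, then collapse the rows into one outer array
rows <- apply(str, 1, function(r) paste0("[", paste(r, collapse = ","), "]"))
json <- paste0("[", paste(rows, collapse = ","), "]")
print(json)
# [1] "[[1,3,5],[2,4,6]]"
```

The expensive conversion step runs once on a flat vector, and the remaining apply call only pastes strings together.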

You might be thinking: can we avoid apply altogether? Yes! For the important case of 2 dimensional arrays jsonlite has a complete C implementation which makes toJSON on matrices extra fast. For higher dimensional arrays it currently still uses the solution above, which performs quite well. We might be able to further optimize this case by porting this to C as well, but working with high dimensional arrays in C makes my head hurt.

Stemming and Spell Checking in R

2016-03-21T00:00:00+00:00

Last week we introduced the new hunspell R package. This week a new version was released which adds support for additional languages and text analysis features.

Additional languages

By default hunspell uses the US English dictionary en_US but the new version allows for checking and analyzing in other languages as well. The ?hunspell help page has detailed instructions on how to install additional dictionaries.

> library(hunspell)
> hunspell_info("ru_RU")
$dict
[1] "/Users/jeroen/workspace/hunspell/tests/testdict/ru_RU.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] NA

> hunspell("чёртова карова", dict = "ru_RU")[[1]]
[1] "карова"

It turned out this feature was much more difficult to implement than I expected. Much of the Hunspell library dates from before UTF-8 became popular and therefore many dictionaries use local 8 bit character encodings such as ISO-8859-1 for English and KOI8-R for Russian. To spell check in these languages, the character encoding of the document text has to match that of the dictionary. However R only supports latin1 and UTF-8 so we need to convert strings in C with iconv, which opens up a new can of worms. Anyway it should all work now.
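The kind of conversion involved can be illustrated from R itself with the base iconv() function. This is only a sketch of the encoding issue, not the C code used inside the package, and it assumes that the iconv implementation on your platform knows the KOI8-R encoding.

```r
# A Russian word, written with \u escapes so the script is plain ASCII
x <- "\u043a\u043e\u0440\u043e\u0432\u0430"

# Convert from UTF-8 to the 8 bit KOI8-R encoding used by some dictionaries
koi <- iconv(x, from = "UTF-8", to = "KOI8-R")

# UTF-8 needs two bytes per Cyrillic letter, KOI8-R only one
print(nchar(x, type = "bytes"))    # 12
print(nchar(koi, type = "bytes"))  # 6

# Converting back restores the original bytes
back <- iconv(koi, from = "KOI8-R", to = "UTF-8")
print(identical(charToRaw(back), charToRaw(x)))
```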

@opencpu hunspell_stem could be very useful in interpretation issues of e.g. #wordclouds.
— Jelle Geertsma (@rdatasculptor) March 14, 2016

Text analysis and wordclouds

In last week’s post we showed how to parse and spell check a latex file:

# Check an entire latex document
library(hunspell)
setwd(tempdir())
download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

The new version also exposes the parser directly, so you can easily extract words and derive the stems to summarize some text, for example to display in a wordcloud.

# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)

Hunspell: Spell Checker and Text Parser for R

2016-03-14T00:00:00+00:00

Hunspell is the spell checker library used in LibreOffice, OpenOffice, Mozilla Firefox, Google Chrome, Mac OS X, InDesign, and a few more. Base R has some spell checking functionality via the aspell function which wraps the aspell or hunspell command line program on supported systems. The new hunspell R package on the other hand directly links to the hunspell C++ library and works on all platforms without installing additional dependencies.

Basic tools

The hunspell_check function takes a vector of words and checks each individual word for correctness.

library(hunspell)
words <- c("beer", "wiskey", "wine")
hunspell_check(words)
## [1]  TRUE FALSE  TRUE

The hunspell function takes a character vector with text (in plain, latex or man format) and returns a list with incorrect words for each line.

bad_words <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad_words)
## [1] "neccessairy" "langauge"    "ninja's"    

Finally hunspell_suggest is used to suggest correct alternatives for each (incorrect) input word.

hunspell_suggest(bad_words[[1]])
## [[1]]
## [1] "necessary"    "necessarily"  "necessaries"  "recessionary" "accessory"    "incarcerate" 
##
## [[2]]
## [1] "language"  "Langeland" "Lagrange"  "Lange"     "gaugeable" "linkage"   "Langland" 
##
## [[3]]
## [1] "ninjas"   "Janina's" "Nina's"   "ninja"    "Janine's" "meninx"   "nark's"

Parsing text

The first challenge in spell-checking is extracting individual words from formatted text. The hunspell function supports three parsers via the format parameter: plain text, latex and man. For example to check the OpenCPU paper for spelling errors we use the latex source code:

download.file("http://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
words <- hunspell(text, format = "latex")
sort(unique(unlist(words)))

Base R also has a few filters to extract words from R, Sweave or Rd code, see RdTextFilter, SweaveTeXFilter in tools. For example to check your R package manual for typos (assuming you are in the pkg source dir)

for(file in list.files("man", full.names = TRUE)){
  cat("\nFile", file, ":\n  ")
  txt <- tools::RdTextFilter(file, keepSpacing = FALSE)
  cat(sQuote(sort(unique(unlist(hunspell(txt))))), sep =", ")
}

Morphological analysis

A cool feature in hunspell is the morphological analysis. The hunspell_analyze function will show you how a word breaks down into a valid stem plus affix. Hunspell uses a special dictionary format to determine if a stem+affix combination is valid in a given language.

For example suppose we take a few variations of the word love. To get the possible stems+affix for each word:

hunspell_analyze(c("love", "loving", "lovingly", "loved", "lover", "lovely", "love"))
## [1] " st:love"
## [1] " st:loving"    " st:love fl:G"
## [1] " st:lovingly"
## [1] " st:loved"     " st:love fl:D"
## [1] " st:lover"     " st:love fl:R"
## [1] " st:lovely"    " st:love fl:Y"
## [1] " st:love"

Alternatively hunspell_stem returns only the stem. Not sure how you would use this but it’s certainly cool.
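A minimal sketch of calling it on the same words is shown below. It assumes the default en_US dictionary; the exact stems returned depend on that dictionary, so no output is shown.

```r
library(hunspell)

words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")

# One character vector of candidate stems per input word
stems <- hunspell_stem(words)
names(stems) <- words
print(stems)

# Count how often each candidate stem occurs across all words
print(sort(table(unlist(stems)), decreasing = TRUE))
```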

Thanks!

Thanks to Daniel Falbel for suggesting this package on the rOpenSci forums!

OpenCPU Server Release 1.5.4

2016-02-05T00:00:00+00:00

Version 1.5.4 of the OpenCPU server has been released to Launchpad (Ubuntu) and OBS (Fedora). This update does not introduce any changes to the OpenCPU API itself; it improves the deb/rpm installation packages and upgrades the bundled opencpu system R package library.

Installing and Updating

Existing Ubuntu and Fedora servers that are already running the 1.5 branch will automatically update the next time they run apt-get update or yum update. Alternatively, to install OpenCPU server on a fresh Ubuntu 14.04 machine:

sudo add-apt-repository -y ppa:opencpu/opencpu-1.5
sudo apt-get update 
sudo apt-get install -y opencpu

Or to install it on Fedora 22 or 23 from OBS:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/home:jeroenooms:opencpu-1.5/Fedora_23/home:jeroenooms:opencpu-1.5.repo
yum install opencpu

To install OpenCPU server on other distributions, simply follow the instructions to build the deb (Debian/Ubuntu) or rpm (Fedora/CentOS/RHEL) packages from source, which is very easy.

The OpenCPU Package Library

Because OpenCPU is implemented completely in R, the server stack ships with a private library of R packages needed by the system in /usr/lib/opencpu/library. The isolated library allows you to freely install/upgrade/uninstall your own R packages on your server without accidentally breaking the OpenCPU server. This is critical to guarantee the system is stable at all times and unaffected by whatever crazy things are happening in R.

However a side effect of this design is that for these system packages, the user might see a different package version when calling R via the OpenCPU API than when running R from the terminal on the same server. This is unfortunate because OpenCPU is meant to provide a transparent HTTP API to the system’s R installation. One solution would be to add the opencpu library to your .libPaths() but this is unnecessarily annoying and complicated.
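For reference, that manual workaround would look roughly like the sketch below. It uses the library path mentioned above and appends it after the existing paths, so your own packages keep precedence; on a machine without OpenCPU it simply prints a message.

```r
ocpu_lib <- "/usr/lib/opencpu/library"

# .libPaths() silently drops directories that do not exist, so check first
if (dir.exists(ocpu_lib)) {
  .libPaths(c(.libPaths(), ocpu_lib))
} else {
  message("No OpenCPU system library found at ", ocpu_lib)
}

print(.libPaths())
```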

To make this easier, the OpenCPU rpm/deb packages now automatically create symlinks to the OpenCPU system library in the global R package library. Thereby the OpenCPU system library is still safely isolated, but the packages are also visible when running R in the terminal, hence we don’t need to install them again. Hopefully this makes managing packages on your OpenCPU server a little easier.

Commonmark: Super Fast Markdown Rendering in R

2016-02-03T00:00:00+00:00

A few months ago I first announced the commonmark R package. Since then there have been a few more releases… time for an update!

What is CommonMark?

Markdown is used in many places these days, however the original spec actually leaves some ambiguity which makes it difficult to optimize and leads to inconsistencies between implementations. Commonmark is an initiative led by John MacFarlane at UC Berkeley (also the author of pandoc) to standardize the markdown syntax. Besides a specification, the commonmark team provides reference implementations for C (cmark) and JavaScript (commonmark.js).

The commonmark R package wraps around cmark which converts markdown text into various formats, including html, latex and groff man. This makes commonmark very suitable for e.g. writing manual pages which are often stored in exactly these formats. In addition the package exposes the markdown parse tree in xml format to support customized output handling.

# Load library
library(commonmark)

# Render some markdown
md <- readLines(curl::curl("https://raw.githubusercontent.com/yihui/knitr/master/NEWS.md"))
html <- markdown_html(md)
man <- markdown_man(md)
tex <- markdown_latex(md)

# Syntax tree
xml <- markdown_xml(md)

# Back to (standardized) markdown
cm <- markdown_commonmark(md)

Currently, commonmark only specifies the original markdown elements: italic, bold, headings, links, images, quotes, paragraphs, lists, horizontal rule, and code blocks. Extensions from pandoc that were introduced later on such as tables are not supported.

CommonMark is fast

The cmark library is written in elegant C code and highly optimized. It renders a Markdown version of War and Peace in the blink of an eye (127 milliseconds on a ten year old laptop, vs. 100-400 milliseconds for an eye blink). A simple benchmark in R confirms that our example above is converted to any of the formats in only a few milliseconds.

library(microbenchmark)
microbenchmark(
  markdown_html = markdown_html(md),
  markdown_man = markdown_man(md),
  markdown_latex = markdown_latex(md)
)
# Unit: milliseconds
#            expr      min       lq     mean   median       uq      max neval
#   markdown_html 3.228492 3.243339 3.318437 3.263184 3.359420 3.902745   100
#    markdown_man 5.768978 5.803062 5.885971 5.862607 5.942159 6.177985   100
#  markdown_latex 5.906757 5.946995 6.049409 6.001677 6.107563 7.619014   100

The main benefit, besides Tolstoy saving some time on typesetting, is that cmark allows for shipping documents such as help pages in native markdown format and rendering them on-the-fly in html/latex/man without noticeable performance overhead. This is very nice for editing and maintaining any sort of portable, dynamic documentation.

Markdown in R documentation

Several people have independently had the idea to add support for markdown to R documentation which would be super awesome. Gábor has started a package called maxygen which might get merged into roxygen2 at some point. This allows for inserting emphasis, boldface, codeblocks, lists, links, and images in your roxygen fields using simple markdown notation rather than the ugly Rd format.

There has also been some discussion on the r-devel mailing list about extending support for markdown in R and CRAN, but that mostly seems to concern NEWS and README files.

New in V8: Calling R, from JavaScript, from R, from Javascript...

2016-02-02T00:00:00+00:00

The V8 package provides an R interface to Google’s open source JavaScript engine. The package is completely self contained and requires no runtime dependencies, making it very easy to execute JavaScript code from R. A handful of CRAN packages use V8 to provide R bindings to useful JavaScript libraries. Have a look at the v8 vignette to get started.

Callback To R

New in version 0.10 is the ability to call back to R from within JavaScript using the console.r API. This is most easily demonstrated via V8’s interactive JavaScript console:

ctx <- V8::v8()
ctx$console()

From JavaScript we can read/write R objects via console.r.get and console.r.assign, analogous to get and assign in R. The final argument is an optional list with arguments passed to toJSON or fromJSON which are used behind the scenes to convert objects between R and JavaScript.

// read the iris object into JS
var iris = console.r.get("iris")
var iris_col = console.r.get("iris", {dataframe : "col"})

//write an object back to the R session
console.r.assign("iris2", iris)
console.r.assign("iris3", iris, {simplifyVector : false})

Use console.r.call to call R functions. The first argument should be a string which evaluates to a function. The second argument contains a list of arguments passed to the function, similar to do.call in R. Both named and unnamed lists are supported. The return object is returned to JavaScript via JSON.

//named arguments: calls rnorm(n=2, mean=10, sd=5)
var out = console.r.call('rnorm', {n: 2, mean: 10, sd: 5})

//unnamed arguments: calls rnorm(2, 20, 5)
var out = console.r.call('rnorm', [2, 20, 5])

//anonymous function
var out = console.r.call('function(x){x^2}', {x:12})
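
For comparison, here is roughly what the calls above correspond to on the R side, written with do.call. This is only an analogy in base R: that V8 turns the function string into a function via eval(parse()) is an assumption for illustration, not something taken from the package source.

```r
# Named arguments, like console.r.call('rnorm', {n: 2, mean: 10, sd: 5})
out1 <- do.call("rnorm", list(n = 2, mean = 10, sd = 5))

# Unnamed (positional) arguments, like console.r.call('rnorm', [2, 20, 5])
out2 <- do.call("rnorm", list(2, 20, 5))

# A string that evaluates to an anonymous function
fun <- eval(parse(text = "function(x){x^2}"))
out3 <- do.call(fun, list(x = 12))
out3
## [1] 144
```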

There is also a console.r.eval function, which evaluates raw R code. It takes only a single argument (the string to evaluate) and does not return anything. Output is printed to the console.

console.r.eval('sessionInfo()')

Besides automatically converting objects, V8 also propagates exceptions between R, C++ and JavaScript up and down the stack. Hence you can catch R errors as JavaScript exceptions when calling an R function from JavaScript or vice versa. If nothing gets caught, exceptions bubble all the way up as R errors in your top-level R session.

//raise an error in R
console.r.call('stop("ouch!")')

//catch error from JavaScript
try {
   console.r.call('stop("ouch!")')
} catch (e) {
   console.log("Uhoh R had an error: " + e)
}

Thanks to Barret Schloerke for suggesting this feature and Dirk for pointing me in the right direction on how to call R functions from Rcpp (which is surprisingly easy).

Using webp in R: A New Format for Lossless and Lossy Image Compression

2016-01-25T00:00:00+00:00

A while ago I blogged about brotli, a new general purpose compression algorithm promoted by Google as an alternative to gzip. The same company also happens to be working on a new format for images called webp, which is actually a derivative of the VP8 video format. Google claims webp provides superior compression for both lossless (png) and lossy (jpeg) bitmaps, and even though the format is currently only supported in Google Chrome, it seems indeed promising.

The webp R package allows for reading/writing webp bitmap arrays so that we can convert between other bitmap formats. For example, let’s take this photo of a delicious and nutritious feelgoodbyfood spelt-pancake with coconut sprinkles and homemade espresso (see here for 7 other healthy winter breakfasts!)

We read the jpeg file into a bitmap and then write it to webp:

library(webp)
library(jpeg)
library(curl)
curl_download("https://www.opencpu.org/images/pancake.jpg", "pancake.jpg")
bitmap <- readJPEG("pancake.jpg")
write_webp(bitmap, "pancake.webp")

# Only works in Google Chrome
browseURL("pancake.webp")

Of course it works the other way around as well. To read the webp image back into a bitmap and write it to png:

library(png)
bitmap2 <- read_webp("pancake.webp")
writePNG(bitmap2, "pancake.png")
browseURL("pancake.png")

Rendering graphics to webp

The best way to write plots in webp format is using an svg device and then render to bitmap with the rsvg package:

# create an svg image
library(svglite)
library(ggplot2)
svglite("plot.svg", width = 10, height = 7)
qplot(mpg, wt, data = mtcars, colour = factor(cyl))
dev.off()

# render it into a high definition bitmap image
library(rsvg)
rsvg_webp("plot.svg", "plot.webp", width = 1920)
browseURL("plot.webp")

The write_webp function has a quality parameter (integer between 1 and 100) which can be used to tune the quality-size trade-off for lossy compression. A quality=100 equals lossless compression; the default quality=80 provides considerable size reduction with negligible loss of quality.

library(rsvg)
library(webp)
tiger <- rsvg("http://dev.w3.org/SVG/tools/svgweb/samples/svg-files/tiger.svg", height = 720)
write_webp(tiger, "tiger100.webp", quality = 100)
write_webp(tiger, "tiger80.webp", quality = 80)
write_webp(tiger, "tiger50.webp", quality = 50)

Unfortunately webp will probably not become mainstream until it gets implemented by all browsers. But performance seems pretty good so perhaps it could actually be useful for large image compression in scientific applications.

The 'rsvg' Package: High Quality Image Rendering in R

2016-01-25T00:00:00+00:00

The new rsvg package renders (vector based) SVG images into high-quality bitmap arrays. The resulting image is an array of 3 dimensions: height * width * 4 (RGBA) and can be written to png, jpeg or webp format:

# create an svg image
library(svglite)
library(ggplot2)
svglite("plot.svg", width = 10, height = 7)
qplot(mpg, wt, data = mtcars, colour = factor(cyl))
dev.off()

# render it into a bitmap array
library(rsvg)
bitmap <- rsvg("plot.svg")
dim(bitmap)
## [1] 504 720   4

# write to format
png::writePNG(bitmap, "bitmap.png")
jpeg::writeJPEG(bitmap, "bitmap.jpg", quality = 1)
webp::write_webp(bitmap, "bitmap.webp", quality = 100)

The advantage of storing your plots in svg format is they can be rendered later into an arbitrary resolution and format without loss of quality! Each rendering function takes a width and height parameter. When neither width nor height is set, the bitmap resolution matches that of the input svg. When either width or height is specified, the image is scaled proportionally. When both width and height are specified, the image is stretched into the requested size. For example suppose we need to render the plot into ultra HD so that it is crisp as toast when printed on a conference poster:

# render it into a bitmap array
bitmap <- rsvg("plot.svg", width = 3840)
png::writePNG(bitmap, "plot_4k.png", dpi = 144)
browseURL("plot_4k.png")
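
The three sizing rules above come down to a bit of arithmetic. The helper below is a hypothetical illustration in base R, not part of rsvg, and it ignores whatever rounding librsvg applies internally. It uses the 720 x 504 plot from the earlier example.

```r
# Hypothetical helper mirroring the sizing rules described above.
target_size <- function(svg_width, svg_height, width = NULL, height = NULL) {
  if (is.null(width) && is.null(height)) {
    # neither set: keep the native svg size
    c(width = svg_width, height = svg_height)
  } else if (is.null(height)) {
    # only width set: scale the height by the same factor
    c(width = width, height = svg_height * width / svg_width)
  } else if (is.null(width)) {
    # only height set: scale the width by the same factor
    c(width = svg_width * height / svg_height, height = height)
  } else {
    # both set: stretch to exactly the requested size
    c(width = width, height = height)
  }
}

native <- target_size(720, 504)
uhd <- target_size(720, 504, width = 3840)
stretched <- target_size(720, 504, width = 1000, height = 1000)
uhd
## width height
##  3840   2688
```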

Rather than actually dealing with the bitmap array in R, rsvg also allows you to directly render the image to various output formats, which is slightly faster.

# render straight to output format
rsvg_pdf("plot.svg", "out.pdf")
rsvg_ps("plot.svg", "out.ps")
rsvg_svg("plot.svg", "out.svg")
rsvg_png("plot.svg", "out.png")
rsvg_webp("plot.svg", "out.webp")

Added bonus is that librsvg does not only do a really good job rendering, it is also super fast. It would even be fast enough to render the svg tiger on the fly at 10~20fps!

download.file("http://dev.w3.org/SVG/tools/svgweb/samples/svg-files/tiger.svg", "tiger.svg")
system.time(bin <- rsvg_raw("tiger.svg"))
#   user  system elapsed
#  0.048   0.003   0.057
system.time(rsvg_webp("tiger.svg", "tiger.webp"))
#    user  system elapsed
#  0.097   0.006   0.115

Note the webp format is the new high-quality image format by Google which I will talk about in another post.

Compression Benchmarks: brotli, gzip, xz, bz2

2015-11-27T00:00:00+00:00

Brotli is a new compression algorithm optimized for the web, in particular small text documents. Brotli decompression is at least as fast as for gzip while significantly improving the compression ratio. The price we pay is that compression is much slower than gzip. Brotli is therefore most effective for serving static content such as fonts and html pages.

The brotli package is now on CRAN and supports both compression and decompression of the brotli format. Let’s benchmark the available compression formats in R using some example text data from the COPYING file.

library(brotli)
library(ggplot2)

# Example data
myfile <- file.path(R.home(), "COPYING")
x <- readBin(myfile, raw(), file.info(myfile)$size)

# The usual suspects
y1 <- memCompress(x, "gzip")
y2 <- memCompress(x, "bzip2")
y3 <- memCompress(x, "xz")
y4 <- brotli_compress(x)

Confirm that all algorithms are indeed lossless:

stopifnot(identical(x, memDecompress(y1, "gzip")))
stopifnot(identical(x, memDecompress(y2, "bzip2")))
stopifnot(identical(x, memDecompress(y3, "xz")))
stopifnot(identical(x, brotli_decompress(y4)))

Compression ratio

If we compare compression ratios, we can see Brotli significantly outperforms the competition for this example.

# Combine data
alldata <- data.frame (
  algo = c("gzip", "bzip2", "xz (lzma2)", "brotli"),
  ratio = c(length(y1), length(y2), length(y3), length(y4)) / length(x)
)

ggplot(alldata, aes(x = algo, fill = algo, y = ratio)) + 
  geom_bar(color = "white", stat = "identity") +
  xlab("") + ylab("Compressed ratio (less is better)")

Decompression speed

Perhaps the most important performance dimension for internet formats is decompression speed. Clients should be able to decompress quickly, even with limited resources such as on browsers and mobile devices.

library(microbenchmark)
bm <- microbenchmark(
  memDecompress(y1, "gzip"),
  memDecompress(y2, "bzip2"),
  memDecompress(y3, "xz"),
  brotli_decompress(y4),
  times = 1000
)

alldata$decompression <- summary(bm)$median
ggplot(alldata, aes(x = algo, fill = algo, y = decompression)) + 
  geom_bar(color = "white", stat = "identity") +
  xlab("") + ylab("Decompression time (less is better)")

We see that brotli is very similar to gzip in decompression speed. We also see why bzip2 and xz have never replaced gzip as the standard compression method on the internet, even though they have better compression ratio: they are several times slower to decompress.

Compression speed

So far Brotli showed the best compression ratio, with decompression performance comparable to gzip. But there is no such thing as a free pastry in Switzerland. Here is the caveat: compressing data with brotli is complex and slow:

library(microbenchmark)
bm <- microbenchmark(
  memCompress(x, "gzip"),
  memCompress(x, "bzip2"),
  memCompress(x, "xz"),
  brotli_compress(x),
  times = 20
)

alldata$compression <- summary(bm)$median
ggplot(alldata, aes(x = algo, fill = algo, y = compression)) + 
  geom_bar(color = "white", stat = "identity") +
  xlab("") + ylab("Compression time (less is better)")

Hence we can conclude that Brotli is mostly nice for clients, with decompression performance comparable to gzip while significantly improving the compression ratio. These are powerful properties for serving static content such as fonts and html pages.

However compression performance, at least for the current implementation, is considerably slower than gzip, which makes Brotli unsuitable for on-the-fly compression in http servers or other data streams.

Sodium: A Modern and Easy-to-Use Crypto Library

2015-10-19T00:00:00+00:00

This week a new package called sodium was released on CRAN. This package implements bindings to libsodium: a modern, easy-to-use software library for encryption, decryption, signatures, password hashing and more.

Libsodium is actually a portable fork of Daniel Bernstein’s famous NaCL crypto library, which provides core operations needed to build higher-level cryptographic tools. It is not intended for implementing standardized protocols such as TLS, SSH or GPG; you still need something like OpenSSL for that. Sodium only supports a limited set of state-of-the-art elliptic curve methods, resulting in a simple but very powerful tool-kit for building secure applications.

Getting started with Sodium

The package includes two nice vignettes to get you started:

Introduction to Sodium for R: basic hands-on introduction to the sodium R package. Gives an overview of the available encryption methods and examples of how to use them
How does cryptography work: a conceptual intro on cryptographic methods with examples from Sodium

If you always wanted to understand how encryption works without getting a degree in computer science, check out the latter. The basic techniques are easy to understand because cryptographers have done a great job at abstracting the mathematical details into simple hash functions and Diffie-Hellman functions.

Installing Sodium

On Windows or OSX simply install the binary packages from CRAN:

install.packages("sodium")

On Linux you need the sodium shared library, which is called libsodium-dev on Debian/Ubuntu and libsodium-devel on Fedora/EPEL. Because this library is relatively young, it is only available for recent versions of these distributions. For Ubuntu 12.04 and 14.04 there are backports available from Launchpad:

sudo add-apt-repository ppa:chris-lea/libsodium
sudo apt-get update
sudo apt-get install libsodium-dev

On CentOS/RHEL you need to activate EPEL before installing libsodium-devel.

Curl 0.9.2: tweaks and proxies for windows

2015-08-10T00:00:00+00:00

Version 0.9.2 of curl has been released to CRAN. The curl package implements a modern and flexible web client for R and is the foundation for the popular httr package. This update includes mostly tweaks for Windows.

Faster downloading

Alex Deng from Microsoft had diagnosed a problem with curl_fetch_memory (which is used by httr) being slower than expected on Windows. After some testing it turned out that the implementation of realloc (to grow the buffer that holds downloaded data) is poorly optimized on Windows. It basically copies the entire memory block every time the size is increased, which results in a lot of copying for large downloads.

The new release includes a tweak to increase the buffer size exponentially, which solves the problem. This fix is wrapped in an #ifdef _WIN32 because usually the operating system does a better job in optimizing memory allocation than the programmer. But Windows needs a little help sometimes.
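
To get a feel for why exponential growth helps, here is a rough simulation in base R. It is not the C code from the package: it simply assumes that every reallocation copies the whole buffer (the worst case described above) and that "exponential" means doubling the capacity, which is my assumption. The function name bytes_copied is made up for this sketch.

```r
# Count bytes copied while a buffer grows to hold `total` bytes arriving in
# `chunk`-sized pieces, assuming each reallocation copies the entire buffer.
bytes_copied <- function(total, chunk, exponential = FALSE) {
  capacity <- 0
  used <- 0
  copied <- 0
  while (used < total) {
    needed <- used + chunk
    if (needed > capacity) {
      # reallocation: the existing contents are copied to the new block
      copied <- copied + used
      capacity <- if (exponential) max(needed, 2 * capacity) else needed
    }
    used <- min(needed, total)
  }
  copied
}

# 1000 chunks of 1 byte each
linear <- bytes_copied(1000, 1)
doubling <- bytes_copied(1000, 1, exponential = TRUE)
c(linear = linear, doubling = doubling)
##   linear doubling
##   499500     1023
```

Growing by exactly what is needed copies a number of bytes that is quadratic in the download size, whereas doubling keeps the total roughly proportional to it.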

Updated libcurl

This release uses the latest build of libcurl and its dependencies from the rwinlib repository. These include:

libcurl 7.43.0
openssl 1.0.2d
libssh2 1.6.0
libiconv 1.14-5
libidn 1.31-1

The libcurl changelog lists the new features and bug fixes from this release.

Working with proxies

The new version includes two functions specifically for Windows to lookup system proxy settings. This can be used to configure curl to use the same proxy server, which is required to connect to the internet on some networks.

The ie_proxy_info function looks up your current proxy settings as configured in Internet Explorer. In the case of a dynamic proxy, the ie_get_proxy_for_url function shows if and which proxy should be used to connect to a particular URL. If you have an “automatic configuration script” this involves downloading and executing a PAC file.

You should be able to use the address returned by ie_get_proxy_for_url as the proxy option in the curl handle to automatically use the correct proxy server for a given URL. However I do not have access to a network with a proxy server so I cannot actually test this feature. If you are on such a network, please help test it.

curl_proxy <- function(url, verbose = TRUE){
  proxy <- ie_get_proxy_for_url(url)
  h <- new_handle(verbose = verbose, proxy = proxy)
  curl(url, handle = h)
}

con <- curl_proxy("https://httpbin.org/get")
readLines(con)

I also created a gist with some more details to test this feature. If it doesn’t work immediately, try fiddling around with some of the other libcurl proxy options and let me know what works!

Mongolite 0.5: authentication and iterators

2015-07-29T00:00:00+00:00

A new version of the mongolite package has appeared on CRAN. Mongolite builds on jsonlite to provide a simple, high-performance MongoDB client for R, which makes storing small or large data in a database as easy as converting it to/from JSON. Have a look at the vignette or useR2015 slides to get started with inserting, json queries, aggregation and map-reduce.

Authentication and mongolabs

This release fixes an issue with the authentication mechanism that was reported by Dean Attali. The new version should properly authenticate to secured mongodb servers.

Try running the code below to grab some flights data from my mongolabs server:

# load the package
library(mongolite)
stopifnot(packageVersion("mongolite") >= "0.5")

# Connect to the 'flights' dataset
flights <- mongo("flights", url = "mongodb://readonly:test@ds043942.mongolab.com:43942/jeroen_test")

# Count data for query
flights$count('{"day":1,"month":1}')

# Get data for query
jan1_flights <- flights$find('{"day":1,"month":1}')

While debugging this, I found that mongolab is actually very cool. You can sign up for your own free (up to 500MB) mongodb server and easily create data collections with one or more read-only and/or read-write user accounts. This provides a pretty neat way to publish some data (read-only) or sync and collaborate with colleagues (read-write).

Iterators

Another feature request from some early adopters was to add support for iterators. Usually you want to use the mongo$find() method which automatically converts data from a query into a dataframe. However sometimes you need finer control over the individual documents.

The new version adds a mongo$iterate() method to manually iterate over the individual records from a query without any automatic simplification. Using the same example query as above:

# Connect to the 'flights' dataset
flights <- mongo("flights", url = "mongodb://readonly:test@ds043942.mongolab.com:43942/jeroen_test")

# Create iterator
iter <- flights$iterate('{"day":1,"month":1}')

# Iterate over individual records
while(!is.null(doc <- iter$one())){
	# do something with the row here
	print(doc)
}

Currently the iterator has 3 methods: one(), batch(n = 1000) and page(n = 1000). The iter$one method will pop one document from iterator (it would be called iter$next() if that was not a reserved keyword in R). Both iter$batch(n) and iter$page(n) pop n documents at once. The difference is that iter$batch returns a list of at most length n whereas iter$page returns a data frame with at most n rows.

Once the iterator is exhausted, its methods will only return NULL.
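
If you do not have a MongoDB server at hand, the difference between the three methods can be illustrated with a toy iterator over a plain R list. This is a stand-in written in base R, not mongolite's implementation; the make_iterator function and the carrier and dep_delay fields are invented for the example.

```r
# Toy iterator over a list of records; NOT how mongolite is implemented.
make_iterator <- function(docs) {
  pos <- 0L
  # pop up to n records, or return NULL once everything has been consumed
  take <- function(n) {
    if (pos >= length(docs)) return(NULL)
    idx <- seq(pos + 1L, min(pos + n, length(docs)))
    pos <<- max(idx)
    docs[idx]
  }
  list(
    one = function() {
      out <- take(1L)
      if (is.null(out)) NULL else out[[1]]
    },
    # a list of at most n records
    batch = function(n = 1000) take(n),
    # a data frame with at most n rows
    page = function(n = 1000) {
      out <- take(n)
      if (is.null(out)) NULL else do.call(rbind, lapply(out, as.data.frame))
    }
  )
}

docs <- list(
  list(carrier = "UA", dep_delay = 2),
  list(carrier = "AA", dep_delay = 4),
  list(carrier = "B6", dep_delay = -1)
)
iter <- make_iterator(docs)
first <- iter$one()     # a single record (named list)
rest <- iter$page(10)   # the remaining two records as a data frame
done <- iter$one()      # NULL, because the iterator is exhausted
```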

OpenCPU release 1.5

2015-07-05T00:00:00+00:00

Following a few weeks of testing, OpenCPU 1.5 has been released. OpenCPU is a production-ready framework for embedded statistical computing with R. The system provides a neat API for remotely calling R functions over HTTP via e.g. JSON or Protocol Buffers. The OpenCPU server implementation is very stable and has been thoroughly tested. It runs on all major Linux distributions and plays nicely with the RStudio server IDE (demo).

Similarly to shiny, OpenCPU has a single-user/development edition that runs within the interactive R session, and a multi-user (cloud) server for deployments on Linux. Unlike shiny however, the cloud server comes at no extra cost. On the contrary: you are encouraged to take advantage of the cloud server which is much faster and includes cool features like user libraries, concurrent sessions, continuous integration, customizable security policies, etc.

New in OpenCPU 1.5

The OpenCPU API itself has not changed from the 1.4 branch, but the entire underlying stack has been upgraded, hence the version bump. The server now builds on:

R 3.2.1
stringi 0.5-5
jsonlite 0.9.16
devtools 1.8.0
RStudio 0.99 (optional)

Navigate to /ocpu/info on your OpenCPU server to inspect the exact versions of all packages used by the system.

In addition to an upgraded package library, this version includes many small tweaks for the deb/rpm installation packages and docker files. Redhat distributions like Fedora and CentOS are now automatically configured with the required SELinux policies.

Installation and upgrading

sudo add-apt-repository ppa:opencpu/opencpu-1.5
sudo apt-get update
sudo apt-get dist-upgrade

Note that this will also upgrade the version of R to 3.2.1 (if you have not already done so) which might require that you reinstall some of your R packages.

Getting started

The public demo server

https://cloud.opencpu.org/ocpu/github/{username}/{package}/
https://{username}.ocpu.io/{package}/

On writing packages

Secure password hashing in R with bcrypt

2015-06-19T00:00:00+00:00

The new package bcrypt provides an R interface to the OpenBSD ‘blowfish’ password hashing algorithm described in A Future-Adaptable Password Scheme by Niels Provos. The implementation is derived from the py-bcrypt module for Python which is a wrapper for the OpenBSD implementation.

Bcrypt is used for secure password hashing. The main difference with regular digest algorithms such as md5 / sha256 is that the bcrypt algorithm is specifically designed to be cpu intensive in order to protect against brute force attacks. This means that hashing with bcrypt is terribly slow, which is a feature. The complexity of the algorithm is configurable via the log_rounds parameter.

The API from the R package is exactly the same as the one from python: the hashpw function calculates a hash from a password using a random salt. Validating the hash is done by rehashing the password using the hash as a salt.

# Load the package
library(bcrypt)

# Secret message as a string
passwd <- "supersecret"

# Create the hash
hash <- hashpw(passwd)
hash
## [1] "$2a$12$1G8N3Xnp11oHt0RJf7SCMeWib7DpEOgpE5lXwjE2BATHJqFFxci6u"

# To validate the hash
identical(hash, hashpw(passwd, hash))
## TRUE

# Wrapper that does the same
checkpw(passwd, hash)
## TRUE

The gensalt function generates a salt for use with hashpw and specifies the complexity of the algorithm via the log_rounds parameter. The first few characters in the salt string hold the bcrypt version and value for log_rounds. The remainder stores 16 bytes of base64 encoded randomness for seeding the hashing algorithm.

# Use varying complexity:
hash11 <- hashpw(passwd, gensalt(11))
hash12 <- hashpw(passwd, gensalt(12))
hash13 <- hashpw(passwd, gensalt(13))

# Takes longer to verify (or crack)
system.time(checkpw(passwd, hash11))
##   user  system elapsed 
##  0.155   0.000   0.156 
system.time(checkpw(passwd, hash12))
##   user  system elapsed 
##  0.312   0.000   0.312 
system.time(checkpw(passwd, hash13))
##   user  system elapsed 
##  0.640   0.002   0.642
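
The structure of the hash string described above can be inspected with a few lines of base R. The sketch below takes apart the example hash from earlier in this post. The split into a 22 character salt followed by a 31 character digest is the usual bcrypt layout rather than something the package documents here, so treat that part as an assumption.

```r
# Example hash from above: "$<version>$<log_rounds>$<salt><digest>"
hash <- "$2a$12$1G8N3Xnp11oHt0RJf7SCMeWib7DpEOgpE5lXwjE2BATHJqFFxci6u"

# Splitting on "$" yields an empty first element, then version, cost, payload
parts <- strsplit(hash, "$", fixed = TRUE)[[1]]
version <- parts[2]
log_rounds <- as.integer(parts[3])

# 16 bytes of salt take up 22 characters in bcrypt's base64 encoding
salt <- substr(parts[4], 1, 22)
digest <- substr(parts[4], 23, nchar(parts[4]))

# The work factor doubles with every increment of log_rounds
iterations <- 2^log_rounds
c(version = version, log_rounds = log_rounds, iterations = iterations)
```

The doubling of the work factor matches the timings above, where each step from 11 to 12 to 13 roughly doubles the time needed by checkpw.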

HTTPS for CRAN: how and why

2015-06-14T00:00:00+00:00

Correction (June 18): An earlier version of this post stated that currently no CRAN mirrors support https. Martin has pointed out that this is incorrect. As of writing, 7 of the official CRAN mirrors already have full https support.

R gained some basic support for https in version 3.2.0 (see NEWS) via the method = "libcurl" argument in base functions download.file and url. The global option download.file.method is used to make this the default.

Unfortunately the implementation has a few limitations: there is no way to set request options (authentication, proxy, headers, TLS options, etc) and the functions do not expose an http status code or response headers. Because they also do not raise an error when the request fails with an http error (as do the other download methods), this leaves you to guess if the retrieved content is what you were expecting or an error page.

# Raises an error
download.file("http://httpbin.org/status/418", tempfile(), method = "internal")

# Does not raise an error
download.file("http://httpbin.org/status/418", tempfile(), method = "libcurl")

# What it should do
library(curl)
curl_download("http://httpbin.org/status/418", tempfile())

Anyway it is good enough for downloading static files from public servers, which is all we need for now.

CRAN and libcurl

Because install.packages and friends wrap around download.file, we can use this new feature to download R packages from CRAN via https. ~~None of the currently available CRAN servers seems to support https, so~~ I created a demo server at https://cran.opencpu.org. This is not a real mirror, it is just a https proxy to the US mirror. See below for a list of other CRAN servers that support https.

# Install a package over https
install.packages("ggplot2", repos = "https://cran.opencpu.org", method = "libcurl")

Use a script like this to opt-in globally on machines where libcurl is available:

# Enable CRAN https everywhere
if(capabilities("libcurl")){
  options(repos = "https://cran.opencpu.org", download.file.method = "libcurl")
}

Hopefully the admins in Vienna will at some point enable https for the main cran server in the same way they have done for r-forge (which is literally the neighboring ip address).

Update: CRAN servers with https

As Martin has pointed out in his comment, some CRAN mirrors do already support https without advertising it. Below is a script that tests each available server from the mirror list for https:

# Script to list CRAN servers with https
library(curl)
h <- new_handle(timeout_ms = 30000, connecttimeout_ms = 5000)
mirrors <- read.csv(curl("https://svn.r-project.org/R/trunk/doc/CRAN_mirrors.csv"))
mirrors$SSL <- vapply(mirrors$URL, function(url){
  https_url <- paste0(sub("^http://", "https://", url), "src/contrib/PACKAGES")
  cat("Trying", https_url, "\n")
  identical(200L, try(curl_fetch_memory(https_url, handle = h)$status))
}, logical(1))
subset(mirrors, SSL == TRUE, select = c("Name","URL"))

It turns out that there are currently 7 servers that have properly set up https:

                Name                                       URL
China (Beijing 4) https://mirrors.tuna.tsinghua.edu.cn/CRAN/
   China (Hefei)          https://mirrors.ustc.edu.cn/CRAN/
 Colombia (Cali)             https://www.icesi.edu.co/CRAN/
     Switzerland                 https://stat.ethz.ch/CRAN/
    UK (Bristol)            https://www.stats.bris.ac.uk/R/
        USA (KS)            https://rweb.quant.ku.edu/cran/
        USA (TN)         https://mirrors.nics.utk.edu/cran/

Hopefully more will follow soon.

Why CRAN and https?

Using https can stop some, but not all, MITM attacks. Encrypting the connection with the CRAN server prevents intermediate parties such as your ISP, (anti)virus, or any other user on your network from snooping or tampering with the connection. When it comes to CRAN, security is probably more of a concern than privacy, especially when using public networks on e.g. airports, coffee shops or campuses. It is easy for hackers or viruses to hijack wifi connections and inject malicious code or executables into unencrypted traffic. Using https guarantees that at least the connection between you and your CRAN mirror is secure.

Of course this does not fully guarantee the integrity of your download. You are basically putting your faith in the hands of your CRAN mirror (or the owner of the domain to be more specific). If the mirror server gets hacked, or somebody manages to tamper with the mirroring process itself (which is done using rsync without any encryption) packages can still get infected.

Linux distributions solve this problem by making package authors sign the checksum of the package with a private key. This signature is used to automatically verify the integrity of a download from the author’s public key before installation, regardless of how the package was obtained. Simon has implemented some of this for R in PKI but unfortunately this was never adopted by CRAN. But at least with https we can somewhat safely install R packages from within a coffee shop now, which solves the most urgent problem.
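
The checksum half of that scheme is easy to try in base R. The sketch below only compares checksums and does not do any signing, so it shows the integrity check but not the part that proves who published the checksum. The file is a stand-in created on the spot, verify_checksum is a made-up helper, and md5 is used purely because tools::md5sum ships with R.

```r
# Create a stand-in "package" file and record its checksum at publication time
pkg <- tempfile(fileext = ".tar.gz")
writeLines("pretend this is a package tarball", pkg)
published <- unname(tools::md5sum(pkg))

# Hypothetical helper: does the file still match the published checksum?
verify_checksum <- function(path, expected) {
  identical(unname(tools::md5sum(path)), expected)
}

ok_before <- verify_checksum(pkg, published)

# Simulate tampering somewhere between the author and the user
cat("malicious code\n", file = pkg, append = TRUE)
ok_after <- verify_checksum(pkg, published)

c(before = ok_before, after = ok_after)
## before  after
##   TRUE  FALSE
```

In a real setup the published checksum itself would have to be signed with the author's private key, otherwise an attacker who can modify the package can just as easily replace the checksum.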

The curl package: a modern R interface to libcurl

2015-06-09T00:00:00+00:00

TL;DR: Check out the vignette or the development version of httr.

The package I put most time and effort in this year is curl. Last week version 0.8 was published on CRAN which fixes the last outstanding bug for Solaris. The package is pretty much done at this point: stable, well tested, and does everything it needs to; nothing more nothing less…

From the description:

The curl() and curl_download() functions provide highly configurable drop-in replacements for base url() and download.file() with better performance, support for encryption (https://, ftps://), ‘gzip’ compression, authentication, and other ‘libcurl’ goodies. The core of the package implements a framework for performing fully customized requests where data can be processed either in memory, on disk, or streaming via the callback or connection interfaces.

The initial motivation of the package was to implement a connection interface with SSL (https) support, something R has always been lacking (see also json streaming). But since then the package has matured into a full featured HTTP client. By now it has become exactly what I promised it would not be: a complete replacement of RCurl.

What about RCurl?

Good question. The RCurl package by all-star R-core member Duncan Temple-Lang is one of the most widely used R packages. The first CRAN release was about 11 years ago and it has since then been the standard networking client for R. The paper shows that Duncan was (as with most of his work) ahead of his time, describing tools and technology that are now part of the standard data-science workflow.

The RCurl package was also the basis of Hadley’s popular httr package, which started to reveal some shortcomings, including memory leaks, build problems, performance regressions and mysterious errors. Now a bug or two we can fix, but from the RCurl source code it becomes obvious that a lot has changed over the past 10 years. Both R and libcurl have matured a lot, and the internet has largely converged to (REST style) HTTP and SSL, with other protocols slowly being phased out. Also Duncan is a busy guy and seems to have largely moved on to other projects. And so we are going to need a rewrite from scratch…

The curl package is inspired by the good parts of RCurl but with an implementation that takes advantage of modern features in R such as the connection interface and external pointers with proper finalizers. This allows for a much simpler interface to libcurl that has better performance, supports streaming, and provides handles that automatically clean up after themselves. Moreover curl is deliberately very minimal and only contains the essential foundations for interacting with libcurl. High-level logic and utilities can be provided by other packages that build on curl, such as httr. The result is a small, clean and powerful package that takes 2 seconds to compile and will hopefully prove to be reliable and low maintenance.
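
A small sketch of the three ways of processing a response (in memory, on disk, streaming) might look as follows. To keep it runnable without network access it fetches a temporary local file through a file:// URL, which is an assumption of this example and is only expected to work on Unix-like systems; any http(s):// URL can be substituted. The user agent string is made up.

```r
library(curl)

# Write a small local file; the file:// URL stands in for any http(s):// URL
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)
url <- paste0("file://", normalizePath(path, winslash = "/"))

# A handle holds the request configuration and is freed by the garbage collector
h <- new_handle()
handle_setopt(h, useragent = "curl-example")

# In memory: the response body arrives as a raw vector
res <- curl_fetch_memory(url, handle = h)
body <- rawToChar(res$content)

# On disk: save the response to a file
dest <- tempfile()
curl_download(url, dest, handle = h)

# Streaming: read from the connection a few lines at a time
con <- curl(url, open = "r")
first <- readLines(con, n = 1)
rest <- readLines(con)
close(con)
```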

Getting started with curl and httr

The best introduction to the curl package is the vignette which has some nice examples to get you started. Moreover the development version of httr has already been migrated from RCurl to curl. To install using devtools:

library(devtools)
install_github("hadley/httr")

Note that devtools itself depends on httr so you might need to restart R after updating httr. If you are seeing some ERROR: loading failed error (especially on Windows) just restart R and try again.

New package commonmark: yet another markdown parser?

2015-06-03T00:00:00+00:00

Last week the commonmark package was released on CRAN. The package implements some very thin R bindings to John MacFarlane’s (author of pandoc) cmark library. From the cmark readme:

cmark is the C reference implementation of CommonMark, a rationalized version of Markdown syntax with a spec. It provides a shared library (libcmark) with functions for parsing CommonMark documents to an abstract syntax tree (AST), manipulating the AST, and rendering the document to HTML, groff man, CommonMark, or an XML representation of the AST.

Each of the R wrapping functions parses markdown and renders it to one of the output formats:

md <- "
## Test
My list:
  - foo
  - bar"

The markdown_html function converts markdown to HTML:

library(commonmark)
cat(markdown_html(md))

<h2>Test</h2>
<p>My list:</p>
<ul>
<li>foo</li>
<li>bar</li>
</ul>

Obviously the dynamic content rendered from markdown is not a full HTML document in itself. To create a full HTML page you would insert one or more of these snippets in an HTML template with static header and footer content and possibly some css/js to make the page more exciting.

The markdown_xml function gives the parse tree in xml format:

cat(markdown_xml(md))

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document>
  <header level="2">
    <text>Test</text>
  </header>
  <paragraph>
    <text>My list:</text>
  </paragraph>
  <list type="bullet" tight="true">
    <item>
      <paragraph>
        <text>foo</text>
      </paragraph>
    </item>
    <item>
      <paragraph>
        <text>bar</text>
      </paragraph>
    </item>
  </list>
</document>

Most of the value in commonmark is probably in the latter. There already exist a few nice markdown converters for R including the popular rmarkdown package, which uses pandoc to convert markdown to several presentation formats.

The formal commonmark spec makes markdown suitable for more strict documentation purposes, where we might currently be inclined to use json or xml. For example we could use it to parse NEWS.md files from R packages in a way that allows for archiving and indexing individual news items, without ambiguity over indentation rules and such.
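
As a rough sketch of the NEWS.md idea, the snippet below pulls the individual list items out of the XML parse tree. The changelog text is hypothetical, and the string splitting is only a stand-in for a real XML parser, which would be the more robust choice.

```r
library(commonmark)

# A hypothetical two-item changelog
news <- "## Version 1.1\n\n- Fixed a bug in the parser\n- Added a new option\n"
xml <- markdown_xml(news)

# Split the XML on list items; the first piece is everything before the first item
pieces <- strsplit(xml, "<item>", fixed = TRUE)[[1]][-1]

# Collect the text nodes inside each item
items <- vapply(pieces, function(piece) {
  body <- strsplit(piece, "</item>", fixed = TRUE)[[1]][1]
  nodes <- regmatches(body, gregexpr("<text[^>]*>[^<]*</text>", body))[[1]]
  paste(gsub("<[^>]+>", "", nodes), collapse = "")
}, character(1), USE.NAMES = FALSE)
```

Each element of items is now one news entry that could be stored or indexed separately.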

Getting started with MongoDB in R

2015-05-15T00:00:00+00:00

The first stable version of the new mongolite package has appeared on CRAN. Mongolite builds on jsonlite to provide a simple, high-performance MongoDB client for R, which makes storing and accessing small or large data as easy as converting it to/from JSON. The package vignette has some examples to get you started with inserting, json queries, aggregation and map-reduce. MongoDB itself is open source and installation is easy (e.g. brew install mongodb).

If you use, or (think) you might want to use MongoDB with R, please get in touch. I am interested to hear about your problems and use cases to make this package fit everyone’s needs. I will also be presenting this and related work at UseR 2015 and the annual French R Meeting.

Upcoming talks about jsonlite and mongolite

2015-05-01T00:00:00+00:00

This summer I will be giving an invited talk at the annual French R Meeting in Grenoble as well as a shorter talk at UseR 2015 in Aalborg. The presentations will feature some recent R packages in the json/web space (curl, jsonlite, mongolite, V8), and show how these tools can be combined for building interoperable data pipelines with R.

Below is the official abstract.

Abstract: jsonlite and mongolite

The jsonlite package provides a powerful JSON parser and generator that has become one of the standard methods for getting data in and out of R. We discuss some recent additions to the package, in particular support for streaming (large) data over http(s) connections. We then introduce the new mongolite package: a high-performance MongoDB client based on jsonlite. MongoDB (from “humongous”) is a popular open-source document database for storing and manipulating very big JSON structures. It includes a JSON query language and an embedded V8 engine for in-database aggregation and map-reduce. We show how mongolite makes inserting and retrieving R data to/from a database as easy as converting it to/from JSON, without the bureaucracy that comes with traditional databases. Users that are already familiar with the JSON format might find MongoDB a great companion to the R language and will enjoy the benefits of using a single format for both serialization and persistency of data.
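
The streaming support mentioned in the abstract can be tried without a database or network. The sketch below assumes a temporary local file in place of an http(s) connection and uses the built-in cars dataset; it writes one JSON record per line and reads the records back, either all at once or page by page.

```r
library(jsonlite)

# Write a data frame as newline-delimited JSON, one record per line
tmp <- tempfile(fileext = ".json")
stream_out(cars, file(tmp), verbose = FALSE)

# Read everything back in one go
everything <- stream_in(file(tmp), verbose = FALSE)

# Or process the stream page by page without holding all rows in memory
rows_seen <- 0
stream_in(file(tmp), handler = function(page) {
  rows_seen <<- rows_seen + nrow(page)
}, pagesize = 10, verbose = FALSE)
```

With a handler the pages are handed to the callback as small data frames, so the full dataset never has to fit in memory.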

JSON serialization now even faster and prettier

2015-04-13T00:00:00+00:00

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. This week version 0.9.16 appeared on CRAN which has a new prettifying system, improved performance and some additional tweaks for the new mongolite package.

Improved Performance

Everyone’s favorite feature of jsonlite: performance. We found a way to significantly speed up toJSON for data frames for the cases of dataframe="rows" (the default) or dataframe="values". On my macbook I now get these results:

data(diamonds, package="ggplot2")
system.time(toJSON(diamonds, dataframe = "rows"))
#   user  system elapsed
#  0.133   0.003   0.136
system.time(toJSON(diamonds, dataframe = "columns"))
#   user  system elapsed
#  0.070   0.003   0.072
system.time(toJSON(diamonds, dataframe = "values"))
#   user  system elapsed
#  0.094   0.005   0.099

A somewhat larger dataset:

data(flights, package="nycflights13")
system.time(toJSON(flights, dataframe = "rows"))
#   user  system elapsed
#  1.506   0.072   1.578
system.time(toJSON(flights, dataframe = "columns"))
#   user  system elapsed
#  0.585   0.024   0.608
system.time(toJSON(flights, dataframe = "values"))
#   user  system elapsed
#  0.873   0.039   0.912

That is pretty darn fast for a text based serialization format. By comparison, we easily beat write.csv which is actually a much simpler output format:

system.time(write.csv(diamonds, file="/dev/null"))
#   user  system elapsed
#  0.361   0.003   0.364
system.time(write.csv(flights, file="/dev/null"))
#   user  system elapsed
#  3.284   0.033   3.318

Pretty even prettier

Yihui has pushed for a new prettifying system that inserts indentation directly in the R code rather than making yajl prettify the entire JSON blob at the end. As a result we can use different indentation rules for different R classes. See the PR for details. The main difference is that atomic vectors are now prettified without linebreaks:

x <- list(foo = 1:3, bar = head(cars, 2))
toJSON(x, pretty=TRUE)

{
  "foo": [1, 2, 3],
  "bar": [
    {
      "speed": 4,
      "dist": 2
    },
    {
      "speed": 4,
      "dist": 10
    }
  ]
}

toJSON(x, pretty=T, dataframe = "col")

{
  "foo": [1, 2, 3],
  "bar": {
    "speed": [4, 4],
    "dist": [2, 10]
  }
}

This can be helpful for manually inspecting or debugging your JSON data. The prettify function still uses yajl, so if you prefer this style, simply use prettify(toJSON(x)).
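
A quick way to convince yourself that the two styles only differ in whitespace is to parse both back into R. This is a small sketch that reuses the example object from above; the count_lines helper is made up for the illustration.

```r
library(jsonlite)

x <- list(foo = 1:3, bar = head(cars, 2))

# New style: indentation inserted by jsonlite itself
new_style <- toJSON(x, pretty = TRUE)

# Old style: let yajl prettify the compact JSON string
yajl_style <- prettify(toJSON(x))

# Both strings describe the same data
same_data <- identical(fromJSON(new_style), fromJSON(yajl_style))

# Hypothetical helper to compare how many lines each style needs
count_lines <- function(json) length(strsplit(as.character(json), "\n")[[1]])
line_counts <- c(new = count_lines(new_style), yajl = count_lines(yajl_style))
```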

New mongolite package

There were some additional internal enhancements to support the new mongolite package, which will be announced later this month. This package will extend the concepts and power of jsonlite to the in-database JSON documents. Have a look at the git repository for a sneak preview.

Improved memory usage and RJSONIO compatibility in jsonlite 0.9.15

2015-03-31T00:00:00+00:00

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. Last week version 0.9.15 appeared on CRAN which improves memory usage and compatibility with other packages.

Migrating to jsonlite

The upcoming release of shiny will switch from RJSONIO to jsonlite. To make the transition painless for shiny users, Winston Chang has added some compatibility options to jsonlite that mimic the (legacy) behavior of RJSONIO. The following wrapper results in the same output as RJSONIO::toJSON for the majority of cases. Hopefully this will make it easier for other package authors to make the transition to jsonlite as well.

# RJSONIO compatibility wrapper
toJSON_legacy <- function(x, ...) {
  jsonlite::toJSON(I(x), dataframe = "columns", null = "null", na = "null",
   auto_unbox = TRUE, use_signif = TRUE, force = TRUE,
   rownames = FALSE, keep_vec_names = TRUE, ...)
}

However be aware that the RJSONIO defaults can sometimes result in unexpected behavior and odd edge cases (which is why jsonlite was created in the first place). Therefore it is still recommended to switch to the jsonlite defaults when possible (see jsonlite paper for a discussion on the mapping). One exception is perhaps the auto_unbox argument, which many people seem to prefer to set to TRUE for encoding relatively simple static data structures.
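
The effect of auto_unbox is easiest to see on a small list. The config object below is made up for the example.

```r
library(jsonlite)

# A hypothetical static configuration object
config <- list(name = "demo", retries = 3, tags = c("a", "b"))

# Default: every R vector becomes a JSON array, even those of length one
boxed <- toJSON(config)
# {"name":["demo"],"retries":[3],"tags":["a","b"]}

# auto_unbox = TRUE: length-one vectors become JSON scalars
unboxed <- toJSON(config, auto_unbox = TRUE)
# {"name":"demo","retries":3,"tags":["a","b"]}
```

The catch is that with auto_unbox the JSON type of a field depends on the length of the vector, so a tags field with a single element would silently turn from an array into a string.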

Memory usage

The new version should use less memory when parsing JSON, especially from a file or URL. This is mostly due to a new push-parser implementation that can incrementally parse JSON in little pieces, which eliminates overhead of copying gigantic JSON strings. In addition, jsonlite now uses the new curl package for retrieving data via a connection interface.

mydata1 <- jsonlite::fromJSON("https://jeroenooms.github.io/data/dmd.json")

The call above results in the same output as the call below, but it should consume less memory, especially for very large json files.

library(httr)
req <- GET("https://jeroenooms.github.io/data/dmd.json")
mydata2 <- jsonlite::fromJSON(content(req, "text"))

None of this changes anything in the API, these changes are all internal.

OpenCPU server update for R 3.1.3

2015-03-12T00:00:00+00:00

Following the release of R 3.1.3, I have pushed a new build of the OpenCPU server to launchpad, dockerhub and OBS. This update has no changes in OpenCPU itself, but includes updated versions of R, RStudio and R packages used by OpenCPU.

To upgrade your OpenCPU server:

sudo apt-get update
sudo apt-get dist-upgrade

If you are running OpenCPU in production and you do not want to receive automatic updates, make sure to remove or comment-out the opencpu repository in /etc/apt/sources.list.d/opencpu-opencpu-1_4-trusty.list on your server. The opencpu-1.4 repo now contains:

OpenCPU 1.4.6
R 3.1.3
RStudio Server 0.98.1103
Rcpp 0.11.5

To list the versions of other R packages included with the cloud server have a look at the opencpu-lib directory on Github or navigate to /ocpu/info on your opencpu server.

Compiling CoffeeScript in R with the js package

2015-02-27T00:00:00+00:00

A new release of the js package has made its way to CRAN. This version adds support for compiling Coffee Script. Along with the uglify and jshint tools already in there, the package now provides a very complete suite for compiling, validating, reformatting, optimizing and analyzing JavaScript code in R.

Coffee Script

According to its website, CoffeeScript is a little language that compiles into JavaScript. It is an attempt to expose the good parts of JavaScript in a simple way. The coffee_compile function binds to the coffee script compiler. A hello world example from the package vignette:

# Hello world
cat(coffee_compile("square = (x) -> x * x"))

This outputs the following JavaScript code:

(function() {
  var square;

  square = function(x) {
    return x * x;
  };

}).call(this);

Or to compile without the closure:

# Hello world
cat(coffee_compile("square = (x) -> x * x", bare = TRUE))

var square;

square = function(x) {
  return x * x;
};

The package vignette includes some more examples.

Why coffee script?

Coffee script is not some sort of widget factory or other “use JavaScript without learning JavaScript” tool kit. From the website:

The golden rule of CoffeeScript is: “It’s just JavaScript”. The code compiles one-to-one into the equivalent JS, and there is no interpretation at runtime. You can use any existing JavaScript library seamlessly from CoffeeScript (and vice-versa). The compiled output is readable and pretty-printed, will work in every JavaScript runtime, and tends to run as fast or faster than the equivalent handwritten JavaScript.

CoffeeScript is popular among web developers for writing JavaScript applications using a syntax that is more readable and less error prone, but without being constrained by some sort of framework. CoffeeScript is often used in conjunction with an HTML templating engine such as jade (see rjade) and a CSS pre-processor such as Less or SASS or Stylus.

Together, these tools are helpful in organizing and maintaining non-trivial web applications. Given the recent mass adoption of HTML/JavaScript based widgets and visualization in the R community, they can be a valuable addition to the R developer tool kit as well.

RMySQL version 0.10.2: Full SSL Support

2015-02-26T00:00:00+00:00

RMySQL version 0.10.2 has appeared on CRAN. This is a maintenance release to streamline the build process on various platforms. Most importantly, the Windows/OSX binary packages from CRAN are now built with full SSL support. On Linux, the configure script has been updated a bit to automatically find the mysql client library.

A big thanks to epoch.com for sponsoring the development of this important package.

How to install RMySQL

RMySQL is a very old package, and as a result there is a lot of outdated and incorrect information on the interwebs. Back in the day (up till version 0.9.3) you had to manually install mysql on your machine to make the package work. But since the 0.10 series earlier this year, the package is now entirely self contained. The recommended way to install RMySQL on Windows and OSX is simply:

install.packages("RMySQL")

On Linux the package still links against the system libmysqlclient. On most deb systems (Debian/Ubuntu) you need to install either libmysqlclient-dev or libmariadbclient-dev, and on rpm systems such as Fedora/CentOS/RHEL you need mariadb-devel. It should also work with less known variants of MySQL such as Percona but this doesn’t get a lot of testing coverage.

Using SSL with MySQL

MySQL is not always used with SSL because often the client and server run on the same machine, or within a private network. Moreover encryption introduces some performance overhead, which slows down your database connection a bit. But if you are connecting to a MySQL server over the internet, then enabling SSL is probably a good idea if you don’t want everyone to see your data.

Most MySQL servers have been built with SSL support. To configure RMySQL to connect to a server over SSL you need to set the certificates in your ~/.my.cnf file:

[client]
ssl-ca=c:/ssl_certs/ca-cert.pem
ssl-cert=c:/ssl_certs/client-cert.pem
ssl-key=c:/ssl_certs/client-key.pem

I’m not using this myself but others are so I’m taking their word that this works. If you’re experiencing any problems, open an issue on github.

Future Development

This is likely the final release of the 0.10 series. We (well mostly Hadley) are working on a full rewrite of the package based on Rcpp. The readme on Github contains instructions on how to install the latest version from source (it is really easy, even on Windows).

Past experiences have shown that problems in this package are often specific to the operating system and version of mysql. Therefore we really appreciate feedback and testing of the new version. If you use RMySQL, please check out the development version at some point so that we can make sure everything works as expected when it gets released. Report bugs or suggestions on the github page; please include your OS and RMySQL version.

Jade: a clean, whitespace-sensitive template language for writing HTML

2015-02-20T00:00:00+00:00

Jade is a high performance template engine heavily influenced by Haml. It is designed for writing HTML pages using a concise, modern syntax without the verbosity of old fashioned XML-like tags that we all want to forget about. The new rjade package implements convenient bindings from R to this popular JavaScript library.

An example template

Below is an example of a Jade template, taken from the jade homepage. Notice that the notation of tags, classes and ids much resembles CSS selectors. The template also includes one variable called youAreUsingJade, which we can use to control the rendering output.

doctype html
html(lang="en")
  head
    title= pageTitle
    script(type='text/javascript').
      if (foo) {
         bar(1 + 5)
      }
  body
    h1 Jade - node template engine
    #container.col
      if youAreUsingJade
        p You are amazing
      else
        p Get on it!
      p.
        Jade is a terse and simple
        templating language with a
        strong focus on performance
        and powerful features.

Converting a template to HTML text involves two steps. The first step compiles the template with some formatting options into a closure. The binding for this is implemented in jade_compile.

# Compile a Jade template in R
library(rjade)
text <- readLines(system.file("examples/test.jade", package = "rjade"))
tpl <- jade_compile(text, pretty = TRUE)

The second step calls the closure with optionally some local variables to render the output to HTML.

# Render the template
tpl()

The output looks like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title></title>
    <script type="text/javascript">
      if (foo) {
         bar(1 + 5)
      }
    </script>
  </head>
  <body>
    <h1>Jade - node template engine</h1>
    <div id="container" class="col">
      <p>Get on it!</p>
      <p>
        Jade is a terse and simple
        templating language with a
        strong focus on performance
        and powerful features.
      </p>
    </div>
  </body>
</html>

Note how the HTML output changes when setting local variables:

tpl(youAreUsingJade = TRUE)

<!DOCTYPE html>
<html lang="en">
  <head>
    <title></title>
    <script type="text/javascript">
      if (foo) {
         bar(1 + 5)
      }
    </script>
  </head>
  <body>
    <h1>Jade - node template engine</h1>
    <div id="container" class="col">
      <p>You are amazing</p>
      <p>
        Jade is a terse and simple
        templating language with a
        strong focus on performance
        and powerful features.
      </p>
    </div>
  </body>
</html>

That’s it. Head over to the jade website to learn about the full power of this amazing templating language.

Introducing js: tools for working with JavaScript in R

2015-02-17T00:00:00+00:00

A new package has appeared on CRAN called js. This package implements bindings to several popular JavaScript libraries for validating, reformatting, optimizing and analyzing JavaScript code. It builds on the V8 engine, the fully standalone JavaScript engine in R.

Syntax Validation

Several R packages allow the user to supply JavaScript code to be used as a callback function or configuration object within a visualization or web application. By validating in R that the JavaScript code is syntactically correct and of the right type before actually inserting it in the HTML, we can avoid many annoying bugs.

The js_typeof function simply calls the typeof operator on the given code. If the code is syntactically invalid, a SyntaxError will be raised.

callback <- 'function(x, y){
  var z = x*y ;
  return z;
}'
js_typeof(callback)
# [1] "function"

Same for objects:

conf <- '{
  foo : function (){},
  bar : 123
}'
js_typeof(conf)
# [1] "object"

Catch JavaScript typos:

js_typeof('function(x,y){return x + y}}')
# Error in eval(expr, envir, enclos): SyntaxError: Unexpected token }
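
A package that accepts user-supplied callbacks could wrap this check in a small helper. The check_callback function below is hypothetical and not part of js; it only combines js_typeof with ordinary R error handling.

```r
library(js)

# Hypothetical helper: accept only code that parses as a JavaScript function
check_callback <- function(code) {
  type <- tryCatch(js_typeof(code), error = function(e) {
    stop("invalid JavaScript: ", conditionMessage(e), call. = FALSE)
  })
  if (!isTRUE(type == "function")) {
    stop("expected a function but got: ", type, call. = FALSE)
  }
  invisible(code)
}

# A valid callback passes through unchanged
ok <- check_callback("function(x, y){ return x + y; }")

# A syntax error and an object of the wrong type are both rejected
bad_syntax <- inherits(try(check_callback("function(x, y){ return x + y }}"), silent = TRUE), "try-error")
wrong_type <- inherits(try(check_callback("{foo: 123}"), silent = TRUE), "try-error")
```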

Script Validation

A JavaScript program typically consists of a script with a collection of JavaScript statements. The js_validate_script function can be used to validate an entire script.

jscode <- readLines(system.file("js/uglify.min.js", package="js"), warn = FALSE)
js_validate_script(jscode)
# [1] TRUE

Note that JavaScript does not allow for defining anonymous functions in the global scope:

js_validate_script('function(x, y){return x + y}', error = FALSE)
# [1] FALSE

To validate individual functions or objects, use the js_typeof function.

Uglify: reformatting and optimization

One of the most popular and powerful libraries for working with JavaScript code is uglify-js. This package provides an extensive toolkit for manipulating the syntax tree of a piece of JavaScript code.

The uglify_reformat function parses a string with code and then feeds it to the uglify code generator which converts it back to a JavaScript text, with custom formatting options such as fixing whitespace, semicolons, etc.

code <- "function test(x, y){ x = x || 1; y = y || 1; return x*y;}"
cat(uglify_reformat(code, beautify = TRUE, indent_level = 2))
# function test(x, y) {
#   x = x || 1;
#   y = y || 1;
#   return x * y;
# }

The more impressive part of uglify-js is the compressor which refactors the entire syntax tree, effectively rewriting your code into a more compact but equivalent program. The uglify_optimize function in R is a simple wrapper which parses code and then feeds it to the compressor.

cat(code)
# function test(x, y){ x = x || 1; y = y || 1; return x*y;}
cat(uglify_optimize(code))
# function test(x,y){return x=x||1,y=y||1,x*y}

You can pass compressor options to uglify_optimize to control the various uglify optimization techniques.

JSHint: code analysis

JSHint will automatically detect errors and potential problems in JavaScript code. The jshint function in R will return a data frame where each row is a problem detected by the library (type, line and reason of error):

code <- "var foo = 123"
jshint(code)
#
#       id                raw code      evidence line character  scope             reason
# 1 (error) Missing semicolon. W033 var foo = 123    1        14 (main) Missing semicolon.

JSHint has many configuration options to control which types of code problems it will report on.

Minimist: an example of writing native JavaScript bindings in R

2015-02-16T00:00:00+00:00

A new package has appeared on CRAN called minimist, which implements an interface to the popular JavaScript library. This package has only one function, used for argument parsing. For example in RGui on OSX, the output of commandArgs() looks like this:

> commandArgs()
[1] "R" "--no-save" "--no-restore-data" "--gui=aqua"

Minimist turns that into this:

> library(minimist)
> minimist(commandArgs())
$`_`
[1] "R"

$save
[1] FALSE

$`restore-data`
[1] FALSE

$gui
[1] "aqua"

Note how it interprets the --no- prefix as FALSE and the --foo=bar as a key-value pair. It has some more of these rules, following the usual scripting argument syntax conventions. Cool, but not exactly ground breaking; there are already half a dozen packages on CRAN for parsing arguments (although this one is particularly nice :P).

Writing JavaScript bindings using V8

The main purpose of this new package is to exemplify how to write a package with bindings to a JavaScript library using V8. If you take a look at the package source, you might be surprised how small it is. The package consists of:

A copy of the minimist.js library in the package inst dir
Two lines of standard code to initiate the V8 engine and read minimist when loading the R package
A one-line wrapper function to call the JavaScript function from R

That’s it. To install this package from source no compiler is required. It will build out of the box, even on machines without Rtools or Xcode. Moreover, there are no external dependencies as is the case for e.g. Java code, where we need to install a JVM. Everything is self contained within R and V8. It’s fast too:

> system.time(minimist(commandArgs()))
   user  system elapsed 
  0.001   0.000   0.001
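
The three-part structure listed above can be sketched without the actual package. In the example below a one-function JavaScript snippet stands in for minimist.js, and parseFlag and parse_flag are made-up names; it also assumes a recent V8 release where v8() is available as the constructor (the posts here use new_context()).

```r
library(V8)

# 1. The JavaScript "library": a hypothetical stand-in for minimist.js
js_source <- 'function parseFlag(arg){
  var m = /^--([^=]+)=(.*)$/.exec(arg);
  return m ? {key: m[1], value: m[2]} : null;
}'

# 2. Start an engine and load the library (a package would do this in .onLoad)
ct <- v8()
ct$eval(js_source)

# 3. A one-line R wrapper that calls the JavaScript function
parse_flag <- function(arg) ct$call("parseFlag", arg)

flag <- parse_flag("--gui=aqua")
```

Arguments and return values are converted through JSON, so the JavaScript object comes back as a named R list.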

I’m working on several other packages to implement bindings to cool JavaScript libraries (see also yesterday’s post). If you have some suggestions for other JavaScript libraries that might be useful in R, get in touch.

V8 version 0.5: typed arrays and sql.js

2015-02-15T00:00:00+00:00

Earlier this month, V8 version 0.5 appeared on CRAN. This version adds support for typed arrays as specified in ECMA 6 in order to support high performance computing and libraries compiled with emscripten. A big thanks goes to Kenton Russell (@timelyportfolio) for suggesting these features.

Example: sql.js

These new features increase the number of JavaScript libraries that will run out-of-the-box on V8. For example, sql.js is a port of SQLite to JavaScript, by compiling the SQLite C code with Emscripten:

# Load V8
library(V8)
stopifnot(packageVersion("V8") >= "0.5")

# Create JavaScript context and load sql.js
ct <- new_context()
ct$source("https://raw.githubusercontent.com/kripken/sql.js/master/js/sql.js")
 
# Evaluate JavaScript code
ct$eval('
var db = new SQL.Database()
db.run("CREATE TABLE hello (person char, age int);")
db.run("INSERT INTO hello VALUES (\'jerry\', 34);")
db.run("INSERT INTO hello VALUES (\'mary\', 27);")
db.run("INSERT INTO hello VALUES (\'joe\', 65);")
db.run("INSERT INTO hello VALUES (\'anna\', 18);")

// query:
var out = []
var stmt = db.prepare("SELECT * FROM hello WHERE age < 40");
while (stmt.step()) out.push(stmt.getAsObject());
')
 
# Copy the object from JavaScript to R
data <- ct$get("out")
print(data)

More V8 fun

Several other examples are available on gist, for example cheerio (html parsing), turf.js (geojson), viz.js and KaTeX. I am working on several packages that implement actual bindings to JavaScript libraries using V8. The first ones have just landed on CRAN: minimist and js.

To learn more, have a look at the package vignettes.

Questions, suggestions? Find me on twitter or github.

V8 version 0.4: console.log and exception handling

2015-01-13T00:00:00+00:00

V8 version 0.4 has appeared on CRAN. This version introduces several new console functions (console.log, console.warn, console.error) and two vignettes.

I will talk more about using NPM in another blog post this week.

JavaScript Exceptions

Starting with V8 version 0.4 each context has a console object in the global namespace:

Object.keys(console)
log,warn,error

The console.log, console.warn and console.error functions can be used to generate stdout, warnings or errors in R from JavaScript. This allows for writing embedded JavaScript functions that propagate exceptions back to R, similar to what we would do for other foreign language interfaces such as C or C++:

library(V8)
ct <- new_context()
ct$eval('console.log("Bla bla")')
# Bla bla
ct$eval('console.warn("Heads up!")')
# Warning: Heads up!
ct$eval('console.error("Oh noes!")')
# Error: Oh noes!

For example you can use this to verify that external resources were loaded:

ct$source("https://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.11/crossfilter.min.js")
ct$eval('var cf = crossfilter || console.error("failed to load crossfilter!")')

Of course, in R you could use tryCatch or whatever you like to catch exceptions that were raised this way in your JavaScript code.
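
A minimal sketch of catching these conditions on the R side is shown below. It assumes a recent V8 release where v8() is available as the constructor (the examples above use new_context()).

```r
library(V8)
ct <- v8()

# console.error() surfaces as an R error, so tryCatch can intercept it
msg <- tryCatch(
  ct$eval('console.error("Oh noes!")'),
  error = function(e) conditionMessage(e)
)

# console.warn() surfaces as an R warning
warned <- tryCatch(
  ct$eval('console.warn("Heads up!")'),
  warning = function(w) conditionMessage(w)
)
```

Both handlers receive ordinary R condition objects, so the usual tools such as conditionMessage work on them.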

Interactive Console

The interactive console has been enhanced a bit as well. It no longer prints redundant “undefined” returns:

library(V8)
ct <- new_context()
ct$console()
# This is V8 version 3.14.5.10. Press ESC or CTRL+C to exit.

From here we can try our new functions:

console.log("Bla bla")
console.warn("Heads up!")
console.error("Oh noes!")

Bindings to JavaScript Libraries

V8 provides a JavaScript call interface, data interchange, exception handling and interactive debugging console. This is everything we need to embed JavaScript code and libraries in R.

If you are curious how this would work, I have started working on a new R package implementing bindings to some of the very best libraries available for working with JavaScript and HTML. I hope this package will make its way to CRAN soon, but until then it is available from github:

library(devtools)
install_github("jeroenooms/js")

Some silly example illustrating jshint:

library(js)
code = "var foo = 123\nvar bar = 456\nfoo + bar"
cat(code)
# var foo = 123
# var bar = 456
# foo + bar

jshint(code)[c("line", "reason")]
#  line                                                                 reason
#     1                                                     Missing semicolon.
#     2                                                     Missing semicolon.
#     3 Expected an assignment or function call and instead saw an expression.
#     3                                                     Missing semicolon.

Or the brilliant uglify-js:

uglify_reformat(code)
# [1] "var foo=123;var bar=456;foo+bar;"
uglify_optimize(code)
# Warning: Dropping side-effect-free statement [null:3,0]
# [1] "var foo=123,bar=456;"

curl 0.4 bugfix release

2015-01-11T00:00:00+00:00

This week curl version 0.4 appeared on CRAN. This release fixes a memory bug that was introduced in the previous version, and which could under some circumstances crash your R session. The new version is well tested and super stable. If you are using this package, updating is highly recommended.

What is curl again?

From the manual

The curl() function provides a drop-in replacement for base url() with better performance and support for http 2.0, ssl (https://, ftps://), gzip, deflate and other libcurl goodies. This interface is implemented using the RConnection API in order to support incremental processing of both binary and text streams.

Some examples from the help page illustrating https, gzip, redirects and other stuff that base url doesn’t do well:

library(curl)

# Read from a connection
con <- curl("https://httpbin.org/get")
readLines(con)

# HTTP error
curl("https://httpbin.org/status/418", "r")

# Follow redirects
readLines(curl("https://httpbin.org/redirect/3"))

# Error after redirect
curl("https://httpbin.org/redirect-to?url=http://httpbin.org/status/418", "r")

# Auto decompress Accept-Encoding: gzip / deflate (rfc2616 #14.3)
readLines(curl("http://httpbin.org/gzip"))
readLines(curl("http://httpbin.org/deflate"))

Streaming

The advantage of curl over RCurl and httr is that the connection interface allows for streaming. For example you can use readLines to download and process data line-by-line:

con <- curl("http://jeroenooms.github.io/data/diamonds.json", open = "r")
readLines(con, n = 3)
readLines(con, n = 3)
readLines(con, n = 3)
close(con)
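The same read-a-few-lines-at-a-time pattern applies to any R connection. A minimal sketch, using an in-memory textConnection as a stand-in for the network connection (the records and the chunk size of 3 are made up for illustration):

```r
# Stand-in for a remote line-delimited file; textConnection is base R
con <- textConnection(sprintf('{"id":%d}', 1:7))

chunks <- list()
repeat {
  lines <- readLines(con, n = 3)       # read at most 3 lines per iteration
  if (length(lines) == 0) break        # zero lines means end of stream
  chunks[[length(chunks) + 1]] <- lines
}
close(con)

# Number of lines received in each chunk
chunk_sizes <- lengths(chunks)
print(chunk_sizes)
```

Only one chunk is held in memory at a time, so the loop body is the place to filter or aggregate before moving on.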

We can combine this with stream_in from jsonlite to stream-parse sizable datasets:

library(jsonlite)
con <- gzcon(curl("https://jeroenooms.github.io/data/nycflights13.json.gz"))
nycflights <- stream_in(con)

New in openssl 0.3: hash functions

2015-01-10T00:00:00+00:00

This week version 0.3 of the openssl package appeared on CRAN. New in this release are bindings to the cryptographic hashing functions in OpenSSL. Not exactly groundbreaking (hashing functions have long been available from digest) but nice to have anyway. An overview from the new vignette:

Hashing functions

The functions sha1, sha256, sha512, md4, md5 and ripemd160 bind to the respective digest functions in OpenSSL’s libcrypto. Both binary and string inputs are supported and the output type will match the input type.

library(openssl)
md5("foo")
# [1] "acbd18db4cc2f85cedef654fccc4a4d8"
md5(charToRaw("foo"))
# [1] ac bd 18 db 4c c2 f8 5c ed ef 65 4f cc c4 a4 d8

Functions are fully vectorized for the case of character vectors: a vector with n strings will return n hashes.

# Vectorized for strings
md5(c("foo", "bar", "baz"))
# [1] "acbd18db4cc2f85cedef654fccc4a4d8" "37b51d194a7513e45b56f6524f2d51f2"
# [3] "73feffa4b7f6bb68e44cf984c85f6e88"

Besides character and raw vectors we can pass a connection object (e.g. a file, socket or url). In this case the function will stream-hash the binary contents of the connection.

# Stream-hash a file
myfile <- system.file("CITATION")
md5(file(myfile))
# Hashing....
# [1] e4 4f 1b 99 e3 2f 27 e0 a7 e6 a0 0a 36 07 0e 1b

Same for URLs. The hash of the R-3.1.1-win.exe below should match the one in md5sum.txt

# Stream-hash from a network connection
md5(url("http://cran.us.r-project.org/bin/windows/base/old/3.1.1/R-3.1.1-win.exe"))
# Hashing................................................................................................................
# [1] 0b 48 29 e8 92 10 eb 6d 13 71 24 8c d0 97 d1 fc
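To check a download against a published checksum, compare the hex digest with the expected string. A base R sketch of that comparison, assuming tools::md5sum on a temporary file instead of a network connection; the file contents are chosen so that the expected value is the md5 of "foo" shown earlier:

```r
# Write the three bytes "foo" (no trailing newline) to a temporary file
tmp <- tempfile()
writeBin(charToRaw("foo"), tmp)

# tools::md5sum ships with R and returns a hex string named by the file path
actual <- unname(tools::md5sum(tmp))
expected <- "acbd18db4cc2f85cedef654fccc4a4d8"

checksum_ok <- identical(actual, expected)
print(checksum_ok)
unlink(tmp)
```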

Compare to digest

Similar functionality is also available in the digest package, but with a slightly different interface:

# Compare to digest
library(digest)
digest("foo", "md5", serialize = FALSE)
# [1] "acbd18db4cc2f85cedef654fccc4a4d8"

# Other way around
digest(cars, skip = 0)
# [1] "81919836edd7b5a422700ac32bbccd7d"
md5(serialize(cars, NULL))
# [1] 81 91 98 36 ed d7 b5 a4 22 70 0a c3 2b bc cd 7d

OpenCPU release 1.4.6: gzip and systemd

2014-12-30T00:00:00+00:00

OpenCPU server version 1.4.6 has been released to launchpad, OBS, and dockerhub (more about docker in a future blog post). I also updated the instructions to install the server or build from source for rpm or deb. If you have a running deployment, you should be able to upgrade with apt-get upgrade or yum update respectively.

Compression

This release enables gzip compression in the default apache2 configuration for ocpu, which was suggested by several smart users. As was explained in an earlier post about the curl package:

Support for compression can make a huge difference when streaming large data. Text based formats such as json are popular because they are human readable, but the main downside of plain-text is inefficiency for storing numbers. However when gzipped, json payloads are often comparable to binary formats, giving you the best of both worlds.

The nice thing about http is that compression is handled entirely on the level of the protocol so it works for all content types and you don’t have to do anything to take advantage of it. Client and server will automatically negotiate a method of compression that they both support via the Accept-Encoding header.

Try playing around with the ocpu test page by looking at the Content-Encoding response header, or just use curl with the --compressed flag (use -v to see headers):

curl https://demo.ocpu.io/MASS/data/Boston/json -v > /dev/null
curl https://demo.ocpu.io/MASS/data/Boston/json --compressed -v > /dev/null
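To get a feel for how well number-heavy JSON text compresses, here is a small base R sketch using memCompress. The payload is made-up data, and the exact ratio will differ for real datasets:

```r
# Build a repetitive, number-heavy JSON-like payload (made-up data)
payload <- paste(sprintf('{"x":%d,"y":%.4f}', 1:1000, (1:1000) / 7), collapse = "\n")
raw_bytes <- charToRaw(payload)

# Compress in memory with gzip-style deflate (base R)
gz_bytes <- memCompress(raw_bytes, type = "gzip")
ratio <- length(gz_bytes) / length(raw_bytes)
print(round(ratio, 2))

# Decompression restores the original text exactly
restored <- memDecompress(gz_bytes, type = "gzip", asChar = TRUE)
```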

As usual, I also updated the library of R packages included with the server, including the latest jsonlite 0.9.14 which allows for controlling prettify indentation.

Support for systemd and docker

Apart from enabling compression and updating the R package library, this release has some internal changes to support systemd on Debian 8 (Jessie), on which the r-base docker images are based.

The introduction of systemd has been quite controversial in the Debian community, to say the least, which is perhaps why things are not working as smoothly yet as in Fedora. My current init scripts definitely did not work out of the box with systemd (as advertised) and getting them fixed was quite painful.

However I did figure everything out eventually, and learned a lot about systemd while debugging it. I can see it being a very powerful system, definitely a big improvement over the old style init scripts. The way services are specified has a lot in common with how docker does it, which I’m sure is not a coincidence. I look forward to taking full advantage of it once it has landed in all major distributions.

I really hope the Debian folks will resolve their differences sooner rather than later though, because the current state of Jessie is not very good. Even popular packages such as nginx are currently broken due to the chaos and uncertainty surrounding the transition to systemd, which is not helping anyone. On the other hand, I do admire the Debian tradition of transparent and democratic decision making (even when messy) which is something the R community seems to be missing sometimes…

Interactive JavaScript in R with V8: a crossfilter example

2014-12-24T00:00:00+00:00

In last week's blog post introducing the new V8 package I showed how you can use context$eval and context$source to execute commands and scripts of JavaScript in R.

It turns out that typing context$eval() for each JavaScript command gets annoying very quickly, so the new V8 version 0.3 adds an interactive console feature that works very similarly to the one in Chrome developer tools or Firebug. Playing in the interactive console is a nice way to debug a session, or just to learn JavaScript.

# Load stuff
library(V8)
data(diamonds, package="ggplot2")

# Create JavaScript session
ct <- new_context()
ct$assign("diamonds", diamonds)

# Load CrossFilter JavaScript library
ct$source("http://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.11/crossfilter.min.js")

The code above loads the diamonds dataset from the ggplot2 package and assigns it to a new JavaScript context. We also load the crossfilter JavaScript library. We can now use the console method to enter an interactive JavaScript console for this session:

ct$console()
# This is V8 version 3.14.5.10. Press ESC or CTRL+C to exit.
# ~

The ~ prompt indicates that we are in V8 now and can start typing JavaScript. For example to filter the 10 diamonds with the highest depth in the price range between 2000 and 3000:

// now we are in JavaScript :)
var cf = crossfilter(diamonds)
var price = cf.dimension(function(x){return x.price})
var depth = cf.dimension(function(x){return x.depth})
price.filter([2000, 3000])
output = depth.top(10)

You’ll notice that crossfilter is pretty fast! To inspect the data in JavaScript we can convert it to JSON:

JSON.stringify(output)

But easier might be to read the data in R. Exit the prompt by pressing ESC, which gives you back R’s default > prompt. From there we can retrieve the output object using ct$get:

# Pressing ESC
# Exiting V8 console.
output <- ct$get("output")
print(output)

All of this will work seamlessly in most editors too. For example if you load this script in RStudio, you can execute it by selecting the code and pressing the Run button in the script editor, and it does exactly what you would expect!

However, the console is of course mostly for debugging and interactive use. If you plan to share your R script, the most elegant way to include some JavaScript code is by putting it in a separate file myscript.js and then loading it from R using ct$source("myscript.js").

Introducing V8: An Embedded JavaScript Engine for R

2014-12-17T00:00:00+00:00

JavaScript is a fantastic language for building applications. It runs on browsers, servers and databases, making it possible to design an entire web stack in a single language.

The OpenCPU JavaScript client already allows for calling R functions from JavaScript (see jsfiddles and apps). With the new V8 package we can now do the reverse as well: run JavaScript inside R!

The V8 Engine

V8 is Google’s open source, high performance JavaScript engine. It is written in C++ and implements ECMAScript as specified in ECMA-262, 5th edition. The V8 R package builds on this C++ library to provide a completely standalone JavaScript engine within R:

library(V8)

# Create a new context
ct <- new_context()

# Evaluate some code
ct$eval("foo=123")
ct$eval("bar=456")
ct$eval("foo+bar")
# [1] "579"

However note that V8 by itself is just the naked JavaScript engine. Currently, there is no DOM, no network or disk IO, not even an event loop. Which is fine because we already have all of those in R. In this sense V8 resembles other foreign language interfaces such as Rcpp or rJava, but then for JavaScript.

A major advantage over the other foreign language interfaces is that V8 requires no compilers, external executables or other run-time dependencies to execute JavaScript. The entire engine is contained within a 6MB R package (2MB when zipped) and works on all major platforms.

ct$eval("JSON.stringify({x:Math.random()})")
# [1] "{\"x\":0.08649904327467084}"
ct$eval("(function(x){return x+1;})(123)")
# [1] "124"

Sounds promising? There is more!

V8 + jsonlite = awesome

The native data structure in JavaScript is basically JSON, hence we can use jsonlite for seamless data interchange between V8 and R:

ct$assign("mydata", mtcars)
out <- ct$get("mydata")
all.equal(out, mtcars)
# TRUE

Because jsonlite stores data in its natural structure, we can plug it straight into existing JavaScript libraries:

# Use a JavaScript library
ct$source("http://underscorejs.org/underscore-min.js")
ct$call("_.filter", mtcars, I("function(x){return x.mpg < 15}"))
#                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
# Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
# Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
# Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
# Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4

JavaScript Libraries

JavaScript libraries specifically written for the browser (such as jQuery or D3) or libraries for Node that depend on disk/network functionality might not work in plain V8, but many of them actually do.

For example, crossfilter is a high performance data filtering library that I have used for creating D3 data dashboards in the browser, but crossfilter itself is just vanilla JavaScript:

ct$source("http://cdnjs.cloudflare.com/ajax/libs/crossfilter/1.3.11/crossfilter.min.js")

I’ll continue here in the next blog post later this week. Have a look at the (very short) package manual in the meantime.

New features in jsonlite 0.9.14

2014-12-05T00:00:00+00:00

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. This week version 0.9.14 appeared on CRAN which adds some handy new features.

Significant Digits

By default, the digits argument in toJSON specifies the number of decimal digits to print:

toJSON(pi, digits=3)
# [3.142]

A feature requested by Winston Chang was to control precision of number formatting. You can now specify the number of significant digits, analogous to the signif function in base R. Either set use_signif = TRUE or specify the digits argument using I():

toJSON(pi, digits = 3, use_signif = TRUE)
# [3.14]

toJSON(pi, digits = I(3))
# [3.14]
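The difference between decimal digits and significant digits is easiest to see with base R's round and signif, which follow the same two conventions. A small sketch with arbitrary example values:

```r
x <- c(pi, 1234.5678, 0.00012345)

# Decimal digits: keep 3 digits after the decimal point
decimals <- round(x, 3)

# Significant digits: keep 3 digits counted from the first non-zero digit
significant <- signif(x, 3)

print(format(decimals, scientific = FALSE))
print(format(significant, scientific = FALSE))
```

With decimal digits the small value is rounded away to zero, while significant digits keep its leading digits and truncate the large value instead.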

Prettify Indent

A feature requested by Yihui Xie was to control the number of spaces to indent prettified json. The default is still 4 spaces:

toJSON(pi, pretty = TRUE)
# [
#     3.1416
# ]

The number of indent spaces can be changed by setting the pretty argument to an integer. For example to indent by only 2 spaces:

toJSON(pi, pretty = 2)
# [
#   3.1416
# ]

Support for 64bit integers in toJSON

Another new feature is support for 64bit integers from the bit64 package. R does not support 64 bit integers by default, and doubles have limited precision:

x <- 2^60 + 1:3
toJSON(x)
# [1.15292150460685e+18,1.15292150460685e+18,1.15292150460685e+18]

But when the number is stored as 64 bit integer, jsonlite will print the full integer in the JSON output:

library(bit64)
x <- as.integer64(2)^60 + 1:3
toJSON(x)
# [1152921504606846977,1152921504606846978,1152921504606846979]

Currently this is only supported in toJSON. The parser in fromJSON still uses doubles for very large integers.
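The precision limit of doubles can be checked directly in base R without bit64. A short sketch: above 2^53 neighbouring integers start to collapse onto the same double, which is why the three values in the first example print identically.

```r
# Doubles carry 53 bits of mantissa, so above 2^53 not every integer is representable
big <- 2^60
collapsed <- big + 1:3

# All three values round to the same double
n_distinct <- length(unique(collapsed))
print(n_distinct)

# Around 2^52 consecutive integers are still distinguishable
small_distinct <- length(unique(2^52 + 1:3))
print(small_distinct)
```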

New package: curl. High performance http(s) streaming in R

2014-11-22T00:00:00+00:00

A while ago I blogged about new streaming features in jsonlite:

library(jsonlite)
diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))

In the same blog post it was also mentioned that R currently does not support https connections. The RCurl package does support https, but does not have a connection interface. This bothered me so I decided to write one. The result is the new curl package.

Encryption, compression and more

From the package description:

The curl() function provides a drop-in replacement for base url() with better performance and support for http 2.0, ssl (https, ftps), gzip, deflate and other libcurl goodies. This interface is implemented using the RConnection API in order to support incremental processing of both binary and text streams.

What this means is that curl() should be able to do anything that url() does, but better. The same example as above, but now with https:

library(curl)
library(jsonlite)
diamonds2 <- stream_in(curl("https://jeroenooms.github.io/data/diamonds.json"))

That was easy. Switching to curl has other benefits as well. For example it automatically advertises gzip and deflate support via the Accept-Encoding request header and decompresses the response:

readLines(curl("http://httpbin.org/gzip"))
readLines(curl("http://httpbin.org/deflate"))

Support for compression can make a huge difference when streaming large data. Text based formats such as json are popular because they are human readable, but the main downside of plain-text is inefficiency for storing numbers. However when gzipped, json payloads are often comparable to binary formats, giving you the best of both worlds.

Performance

One thing that did surprise me a bit is the difference in performance. Especially the implementation of readLines for url connections seems to be inefficient in base R.

con2 <- curl("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con2))
#   user  system elapsed
#  0.238   0.096   0.334

con1 <- url("http://jeroenooms.github.io/data/diamonds.json")
system.time(readLines(con1))
#   user  system elapsed
#  0.236   0.113   3.858

I’m not quite sure why this is. Maybe the base R version does some additional character recoding that I am not aware of, although I have not observed such behavior. Also measuring performance is tricky in this case because it depends on the connection bandwidth, caching settings, etc.

OpenCPU release 1.4.5: configurable webhooks

2014-11-10T00:00:00+00:00

OpenCPU 1.4.5 is a patch release that improves performance by taking advantage of latest versions of jsonlite, devtools, knitr, openssl, etc. Also new in this release is the option to pass build parameters for deploying on ocpu.io (or your own opencpu server) using the github webhook.

As usual, server binaries for Ubuntu, Fedora and Suse are available from Launchpad and Build Service. There should not be any breaking changes, but perhaps double check that all is OK next time you run apt-get upgrade on your server. If you are in production and do not want to upgrade, make sure to comment-out the opencpu-1.4 ppa in the /etc/apt/sources.list.d/ conf files.

The opencpu-1.4 repository now ships with:

OpenCPU 1.4.5
R 3.1.2
Rcpp 0.11.3
RApache 1.2.5
RStudio-Server 0.98.1087

For Debian/CentOS users, instructions to build opencpu-server packages from source are on github: rpm and deb.

Configurable webhooks

Any R package on Github can automatically be deployed to https://yourname.ocpu.io/yourpkg by setting the ocpu webhook in your github repository. It takes about 15 seconds to set up, and is a great way to continuously publish and test code, data, documentation and vignettes from your package. You will also get notified by email if your package fails to build. If you are not using ocpu.io yet, now would be a good time to add the webhook :-)

New in this release is that http parameters added to the webhook URL will be passed to install_github. For example if you want to build vignettes of your package, use this webhook URL:

https://cloud.opencpu.org/ocpu/webhook?build_vignettes=true

Or if your package is in a subdir in the repo:

https://cloud.opencpu.org/ocpu/webhook?build_vignettes=true&subdir=pkgdir

In addition to parameters for install_github, there is currently one extra parameter sendmail (true/false) which specifies if the server sends an email with the build status.
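If you manage several repositories it can be handy to generate these URLs rather than type them. A minimal base R sketch; webhook_url is a hypothetical helper written for this example (it is not part of opencpu), and it only assembles the query string from the parameters described above:

```r
# Hypothetical helper: build a webhook URL from named parameters
webhook_url <- function(..., base = "https://cloud.opencpu.org/ocpu/webhook") {
  params <- list(...)
  if (length(params) == 0) return(base)
  # Logical values become "true"/"false"; everything else is converted to text
  values <- vapply(params, function(v) {
    if (is.logical(v)) tolower(as.character(v)) else as.character(v)
  }, character(1))
  # Percent-encode each value so special characters survive in the query string
  encoded <- vapply(values, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste0(names(values), "=", encoded, collapse = "&"))
}

url <- webhook_url(build_vignettes = TRUE, subdir = "pkgdir")
print(url)
```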

High performance JSON streaming in R: Part 1

2014-11-06T00:00:00+00:00

The jsonlite stream_in and stream_out functions implement line-by-line processing of JSON data over a connection, such as a socket, url, file or pipe. Thereby we can construct a data processing pipeline that can handle large (or unlimited) amounts of data with limited memory. This post will walk through some examples from the help pages.

The json streaming format

Because parsing huge JSON strings is difficult and inefficient, JSON streaming is done using lines of minified JSON records. This is pretty standard: JSON databases such as MongoDB use the same format to import/export large datasets. Note that this means that the total stream combined is not valid JSON itself; only the individual lines are.

library(jsonlite)
x <- iris[1:3,]
stream_out(x, con = stdout())
# {"Sepal.Length":5.1,"Sepal.Width":3.5,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.9,"Sepal.Width":3,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}
# {"Sepal.Length":4.7,"Sepal.Width":3.2,"Petal.Length":1.3,"Petal.Width":0.2,"Species":"setosa"}

Also note that because line-breaks are used as separators, prettified JSON is not permitted: the JSON lines must be minified. In this respect, the format is a bit different from fromJSON and toJSON where all lines are part of a single JSON structure with optional line breaks.
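The point that each line is valid JSON while the stream as a whole is not can be checked with jsonlite's validate function. A small sketch, assuming a current jsonlite and writing to a temporary file:

```r
library(jsonlite)

# Write three records as line-delimited JSON to a temporary file
tmp <- tempfile(fileext = ".json")
stream_out(iris[1:3, ], file(tmp), verbose = FALSE)
lines <- readLines(tmp)

# Every individual line parses as a JSON object ...
line_ok <- vapply(lines, function(l) isTRUE(validate(l)), logical(1), USE.NAMES = FALSE)

# ... but the concatenated stream is not a single valid JSON document
stream_ok <- isTRUE(validate(paste(lines, collapse = "\n")))
unlink(tmp)

print(line_ok)
print(stream_ok)
```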

Streaming to/from a file

The nycflights13 package contains a dataset with about 5 million values. To stream this to a file:

library(nycflights13)
stream_out(flights, con = file("~/flights.json"))

Running this code will open the file connection, write json to the connection in batches of 500 rows, and afterwards close the connection. Status messages will be printed to the console while writing output. The entire process should take a few seconds and generate a json file of about 7MB.

We use the same file to illustrate how to stream the json back into R. The following code will stream-parse the json in batches of 500 lines. Afterward we verify that the output is indeed identical to the original one:

flights2 <- stream_in(file("~/flights.json"))
all.equal(flights2, as.data.frame(flights))
# [1] TRUE

Because the data is read in small batches, this requires much less memory than trying to parse a huge json blob all at once. The pagesize argument in stream_in and stream_out can be used to specify the number of rows that will be read/written per iteration.
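To see the batching at work, the sketch below combines pagesize with the handler argument of stream_in, so that each page is reduced to a running total instead of being kept in memory. The tiny dataset and the page size of 2 are arbitrary choices for illustration:

```r
library(jsonlite)

# Write a small data frame as line-delimited JSON (5 records)
tmp <- tempfile(fileext = ".json")
stream_out(iris[1:5, ], file(tmp), verbose = FALSE)

# Process the file in pages of at most 2 rows, keeping only running totals
total_rows <- 0
total_sepal <- 0
stream_in(file(tmp), handler = function(df) {
  total_rows <<- total_rows + nrow(df)
  total_sepal <<- total_sepal + sum(df$Sepal.Length)
}, pagesize = 2, verbose = FALSE)
unlink(tmp)

print(total_rows)
```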

Streaming from a URL

We can use the standard url function in R to stream from an HTTP connection.

diamonds2 <- stream_in(url("http://jeroenooms.github.io/data/diamonds.json"))

If the data source is gzipped, simply wrap the connection in gzcon.

flights3 <- stream_in(gzcon(url("http://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights3, as.data.frame(flights))

Because R currently does not support SSL, we use a curl pipe to stream over HTTPS:

flights4 <- stream_in(gzcon(pipe("curl https://jeroenooms.github.io/data/nycflights13.json.gz")))
all.equal(flights4, as.data.frame(flights))

For this to work, the curl executable needs to be installed and available in the search path, which requires cygwin on Windows. Unfortunately the RCurl package does not seem to support binary streaming at this point.

Next up

These examples illustrate basic line-by-line json streaming of data frames from/to a connection, which allows for importing/exporting large json datasets.

In the next blog post we will make the step to full JSON IO streaming by defining a custom handler function. This allows for constructing a json data processing pipeline in R that can handle an infinite data stream. Impatient readers can have a look at the examples in the stream_in help page.

Parsing multipart/form-data with webutils

2014-11-01T00:00:00+00:00

As part of a larger effort to clean up and rewrite the opencpu package, some of the more general utilities will be moved into a new, separate package called webutils. The first release of webutils is now on CRAN.

The package contains a simple http request body parser that supports application/x-www-form-urlencoded, multipart/form-data, and application/json. The multipart parser is written in pure R but surprisingly fast. Furthermore, two demo functions are included that illustrate how to host and parse simple HTML forms (with file uploads) using either rhttpd or httpuv.

library(webutils)
demo_rhttpd()
demo_httpuv()
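For readers unfamiliar with the simplest of these formats, here is a rough base R sketch of what decoding application/x-www-form-urlencoded involves. This is not how webutils implements it; parse_urlencoded is a hypothetical helper for illustration only:

```r
# Hypothetical helper: decode a urlencoded request body into a named list
parse_urlencoded <- function(body) {
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  decode <- function(x) {
    x <- gsub("+", " ", x, fixed = TRUE)   # '+' encodes a space in form data
    vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  }
  # Split each pair on the first '=' only, so encoded values may contain '='
  keys <- decode(sub("=.*$", "", pairs))
  vals <- decode(ifelse(grepl("=", pairs, fixed = TRUE), sub("^[^=]*=", "", pairs), ""))
  stats::setNames(as.list(vals), keys)
}

form <- parse_urlencoded("name=Ada+Lovelace&lang=R%26JS&q=a%3Db")
str(form)
```

Multipart bodies are considerably more involved (boundaries, per-part headers, binary content), which is where a dedicated parser earns its keep.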

Nothing ground breaking in a time of interactive graphics and restful data science as a service, but sometimes all you need is a simple form. I had a hard time finding a decent multipart parser for R, and this one does the job quite nicely.

jsonlite 0.9.13: high performance number formatting

2014-10-25T00:00:00+00:00

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. This week version 0.9.13 appeared on CRAN which is the third release in a relatively short period focusing on performance optimization.

Fast number formatting

Versions 0.9.11 and 0.9.12 had already introduced major speedups by porting critical bottlenecks to C code and switching to a better JSON parser. The current release focuses on number formatting and incorporates C code from modp_numtoa which is several times faster than as.character, formatC or sprintf for converting doubles and integers to strings (your mileage may vary depending on platform and precision).

library(ggplot2)
nrow(diamonds)
# [1] 53940
system.time(jsonlite::toJSON(diamonds, dataframe = "row"))
#   user  system elapsed
#  0.319   0.007   0.325
system.time(jsonlite::toJSON(diamonds, dataframe = "col"))
#   user  system elapsed
#  0.073   0.002   0.075

Using the same benchmark from previous posts, time to convert the diamonds data to row-based json has gone down from 0.619s to 0.325s on my machine (about 2x speedup from jsonlite 0.9.12), and converting to column-based json has gone down from 0.330s to 0.075s (about 4x speedup).

Comparing to other JSON packages

When comparing JSON packages, it should be noted that the comparison is never entirely fair because different packages use different settings and defaults for missing values, number of digits, etc. Both rjson and RJSONIO only support the column based format for encoding data frames. Using their default settings:

system.time(rjson::toJSON(diamonds))
#   user  system elapsed
#  0.279   0.004   0.281
system.time(RJSONIO::toJSON(diamonds))
#   user  system elapsed
#  0.918   0.027   0.944

For this particular dataset, jsonlite is about 3.5x faster than rjson and about 12x faster than RJSONIO (on my machine) to generate column-based JSON. These differences are relatively large because 7 out of the 10 columns in the diamonds dataset are numeric.

Generating secure random numbers with openssl

2014-10-24T00:00:00+00:00

I started working on a new R package with bindings for OpenSSL. The initial release is now available from CRAN. To install the package on Linux you need libssl-dev (Debian/Ubuntu) or openssl-devel (Fedora, RHEL, CentOS). For Mac and Windows, precompiled binaries are available from CRAN as usual. The Mac version is compiled against the version of OpenSSL that is included with OSX. See the comments in Makevars if you want to compile against a more recent version of OpenSSL.

Secure random numbers

The initial release of openssl implements bindings to the OpenSSL random number generator, which will be used to generate session keys in the upcoming version of the OpenCPU system. This feature was requested by Ruben Arslan who noted that the default RNG in R is not suitable for this because it is predictable and lack of entropy can lead to collisions. I’m not a crypto expert but it seems like everyone uses OpenSSL for secure RNG, hence this new package. For implementation details, see the respective OpenSSL documentation pages.

The rand_bytes and rand_pseudo_bytes functions return a raw vector with random bytes:

library(openssl)
rand_bytes(10)
# [1] 3b a7 0f 85 e7 c6 cd 15 cb 5f

To convert them to integers (0-255) simply use as.numeric:

as.numeric(rand_bytes(10))
# [1]  15 149 231  77  18  29 219 191 165 112

Or convert bits to booleans:

rnd <- rand_bytes(1)
as.logical(rawToBits(rnd))
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
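Note that rawToBits returns the least-significant bit first. A base R sketch with a fixed byte instead of a random one makes the ordering easy to verify:

```r
# The byte 0xe4 is 11100100 in binary
byte <- as.raw(0xe4)

# rawToBits returns the bits least-significant first
bits <- as.logical(rawToBits(byte))
print(bits)

# Reassemble the integer value from the bits: bit i has weight 2^(i-1)
value <- sum(bits * 2^(0:7))
print(value)
```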

Probability distributions

Mapping random bytes to a continuous distribution requires a bit of math. For example to combine four 8bit bytes into a single double with 32 bits of randomness from the standard uniform distribution:

rand_unif <- function(n){
  x <- matrix(as.numeric(openssl::rand_bytes(n*4)), ncol = 4)
  as.numeric(x %*% 256^-(1:4))
}
rand_unif(5)
# [1] 0.8094907 0.8180394 0.0743821 0.6031131 0.8488938
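The matrix product simply treats the four bytes as base-256 digits after the decimal point. A small base R sketch with fixed bytes shows the arithmetic; bytes_to_unif is a helper name made up for this example:

```r
# Map four bytes (0-255) to a number in [0, 1):
# b1/256 + b2/256^2 + b3/256^3 + b4/256^4
bytes_to_unif <- function(bytes) sum(bytes * 256^-(1:4))

lowest  <- bytes_to_unif(c(0, 0, 0, 0))          # all bits zero
half    <- bytes_to_unif(c(128, 0, 0, 0))        # only the top bit set
highest <- bytes_to_unif(c(255, 255, 255, 255))  # all bits set

print(c(lowest, half, highest))
```

The largest possible value is 1 - 2^-32, so the result never reaches 1.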

And from U(0,1) we can map into draws from a probability distribution using its inverse CDF (the quantile function):

rand_norm <- function(n, ...){
  qnorm(rand_unif(n), ...)
}
rand_norm(5, mean = 100, sd = 15)
# [1] 101.86120 123.84420  70.15235  81.50505  86.46514
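The same inverse transform trick works for any distribution with a quantile function. A base R sketch for the exponential distribution, where the quantile function has a closed form that we can check by hand; runif with a fixed seed stands in for the secure uniforms here so that the example is reproducible:

```r
set.seed(1)                      # reproducible stand-in for secure uniforms
u <- runif(1000)

# Inverse transform sampling: apply the quantile function to U(0,1) draws
draws <- qexp(u, rate = 2)

# For the exponential distribution the quantile function is -log(1 - u) / rate
manual <- -log(1 - u) / 2

print(isTRUE(all.equal(draws, manual)))
```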

However note that the native R random number generators are much faster and have better numeric properties. Also the OpenSSL RNG is not intended for generating large sequences of random numbers as often used in statistics. It is mainly useful in situations where it is critical to create a little bit of secure randomness that cannot be manipulated. Typical applications include encryption keys, drinking games, or raffle drawings at your local R user group.

More fun stuff

OpenSSL has a lot of other useful stuff which we could add to the R package in future versions. In particular public key methods to sign and verify packages are something that R and CRAN could really benefit from. Simon Urbanek is working on something similar as well in the PKI package, which also builds on OpenSSL.

If you would like to see some other OpenSSL functionality in the R package, feel free to send a pull request with bindings on github. It would be great to have people involved who have a better understanding of cryptographic methods.

jsonlite 0.9.12: now even lighter and faster

2014-09-29T00:00:00+00:00

The jsonlite package implements a robust, high performance JSON parser and generator for R, optimized for statistical data and the web. This week version 0.9.12 appeared on CRAN which includes a completely rewritten json parser and more optimized C code for json generation. The new parser is based on yajl which is smaller and faster than libjson, and much easier to compile.

Error handling

My favorite feature of yajl is that it gives helpful error messages when parsing invalid JSON, for example:

fromJSON('[1,2,falsse,4]')
# Error in parseJSON(txt) : lexical error: invalid string in json text.
#                               [1,2,falsse,4]
#                     (right here) ------^

fromJSON('["foo", "bla\nbla"]')
# Error in parseJSON(txt) : lexical error: invalid character inside string.
#                            ["foo", "bla bla"]
#                     (right here) ------^

fromJSON('[1,2,3,4] {}')
# Error in parseJSON(txt) : parse error: trailing garbage
#                             [1,2,3,4] {}
#                     (right here) ------^

This makes debugging much easier, especially when dealing with fast-changing dynamic data from the web.
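When parsing data from the web in a script, it is often convenient to catch these errors rather than let them abort the run. A minimal sketch; safe_parse is a hypothetical wrapper for this example and not part of jsonlite:

```r
library(jsonlite)

# Hypothetical helper: return NULL and report the parser message on invalid JSON
safe_parse <- function(txt) {
  tryCatch(fromJSON(txt), error = function(e) {
    message("Invalid JSON: ", conditionMessage(e))
    NULL
  })
}

good <- safe_parse('[1,2,3,4]')
bad  <- safe_parse('[1,2,falsse,4]')
```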

Unicode parsing

The yajl parser always correctly converts escaped unicode sequences into UTF-8 characters:

fromJSON('["\\u5bff\\u53f8","Z\\u00fcrich"]')
# [1] "寿司"   "Zürich"

Escaped unicode was already supported in the previous version of jsonlite, however it was expensive and not enabled by default. With yajl we get this for free :-)

Integer parsing

Another cool feature is that yajl parses numbers into integers when possible:

class(fromJSON('[13,14,15]'))
# [1] "integer"

Performance

Performance of both parsing and generating JSON has again tremendously improved in this version. Some benchmarks:

library(jsonlite)
library(microbenchmark)
data(diamonds, package="ggplot2")
json_rows <- toJSON(diamonds)
json_columns <- toJSON(diamonds, dataframe = "columns")
microbenchmark(
   toJSON(diamonds),
   toJSON(diamonds, dataframe = "columns"),
   fromJSON(json_rows),
   fromJSON(json_columns),
   times=10
)
# Unit: milliseconds
#                                    expr      min       lq   median       uq       max neval
#                        toJSON(diamonds) 587.6984 591.3231 619.1590 630.3588  661.5118    10
# toJSON(diamonds, dataframe = "columns") 317.6793 325.3809 330.6444 339.9898  343.7466    10
#                     fromJSON(json_rows) 890.9832 899.3334 939.3230 979.6338 1059.9770    10
#                  fromJSON(json_columns) 188.5764 201.8463 238.1272 279.7607  293.1195    10

If we compare this to the previous blog post we can see that generating row-based JSON from data frames (the default) is approx 2x faster than in the previous version. Parsing row-based json is about 2.5x faster, and parsing column-based json is almost 5x faster!

Streaming JSON

Version 0.9.12 introduces some cool streaming functionality. This is a topic in itself and I will blog about this later in the week. Have a look at examples from the stream_in and stream_out manual pages till then.

New jsonlite gets a major speed boost!

2014-09-06T00:00:00+00:00

The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types, which allows for converting objects to JSON and back without manual data restructuring. This is ideal for interacting with web APIs, or to build pipelines where data seamlessly flow in and out of R through JSON. The quickstart vignette gives a brief introduction, or just try:

fromJSON(toJSON(mtcars))

Or use some data from the web:

# Latest commits in r-base
r_source <- fromJSON("https://api.github.com/repos/wch/r-source/commits")

# Pretty print:
committer <- format(r_source$commit$author$name)
date <- as.Date(r_source$commit$committer$date)
message <- sub("\n\n.*","", r_source$commit$message)
paste(date, committer, message)

New in 0.9.11: performance!

Version 0.9.11 has a few minor bugfixes, but most of the work of this release has gone into improving performance. The implementation of toJSON has been optimized in many ways, and with a little help from Winston Chang, the most CPU intensive bottleneck has been ported to C code. The result is quite impressive: encoding data frames to row-based JSON format is about 3x faster, and encoding data frames to column-based JSON format is nearly 10x faster in comparison with the previous release.

The diamonds dataset from the ggplot2 package has about 0.5 million values which makes a nice benchmark. On my macbook it takes jsonlite on average 1.18s to encode it to row-based JSON, and 0.34s for column-based json:

library(jsonlite)
library(microbenchmark)
data("diamonds", package="ggplot2")
microbenchmark(json_rows <- toJSON(diamonds), times=10)
# Unit: seconds
#              expr     min       lq   median       uq     max neval
#  toJSON(diamonds) 1.12773 1.140724 1.175872 1.180354 1.21786    10

microbenchmark(json_columns <- toJSON(diamonds, dataframe="col"), times=10)
# Unit: milliseconds
#                                 expr      min      lq   median       uq      max neval
#  toJSON(diamonds, dataframe = "col") 333.9494 334.799 338.0843 340.0929 350.3026    10

Parsing and simplification performance

The performance of fromJSON has been improved as well. The parser itself was already a high-performance C++ library borrowed from RJSONIO, and it has not changed. However, the simplification code used to reduce deeply nested lists into nice vectors and data frames has been tweaked in many places and is on average 3 to 5 times faster than before (depending on what the JSON data look like). For the diamonds example, the row-based data gets parsed in about 2.32s and the column-based data in 1.25s.

microbenchmark(fromJSON(json_rows), times=10)
# Unit: seconds
#                 expr      min       lq   median       uq      max neval
#  fromJSON(json_rows) 2.178211 2.278337 2.319519 2.376085 2.423627    10

microbenchmark(fromJSON(json_columns), times=10)
# Unit: seconds
#                    expr     min       lq   median       uq      max neval
#  fromJSON(json_columns) 1.17289 1.252284 1.253999 1.265763 1.306357    10

For comparison, we can also disable simplification in which case parsing takes respectively 0.70 and 0.39 seconds for these data. However without simplification we end up with a big nested list of lists which is often not very useful.

microbenchmark(fromJSON(json_rows, simplifyVector=F), times=10)
# Unit: milliseconds
#                                     expr      min       lq   median       uq      max neval
#  fromJSON(json_rows, simplifyVector = F) 635.5767 648.4693 704.6996 720.0335 727.8869    10

microbenchmark(fromJSON(json_columns, simplifyVector=F), times=10)
# Unit: milliseconds
#                                        expr      min       lq   median       uq      max neval
#  fromJSON(json_columns, simplifyVector = F) 385.3224 388.4772 395.1916 409.3432 463.9695    10
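To make the difference concrete, the sketch below parses a tiny hand-written JSON array (two made-up records, not taken from the diamonds data) with and without simplification:

```r
library(jsonlite)

json <- '[{"carat":0.23,"cut":"Ideal"},{"carat":0.21,"cut":"Premium"}]'

# Default: records are simplified into a data frame
simplified <- fromJSON(json)
class(simplified)

# Without simplification: a list of lists, one element per record
nested <- fromJSON(json, simplifyVector = FALSE)
class(nested)
nested[[1]]$cut
```

With the nested form, every field has to be extracted by hand, which is why the simplified data frame is usually worth the extra parsing time.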

New in OpenCPU 1.4.4: session namespaces

2014-08-25T00:00:00+00:00

The OpenCPU system exposes an HTTP API for embedded scientific computing with R. This provides reliable and scalable foundations for integrating R based analysis and visualization modules into pipelines, web applications or big data infrastructures.

This week version 1.4.4 was released on Launchpad (Ubuntu), OBS (Fedora, SUSE) and CRAN.

New: session namespaces

A new feature in this version is support for session namespaces. Clients can now refer to objects within a temporary session using sessionid::name. This makes it easier to reuse objects that were created from a script. For example let’s execute the ch01.R script which is included with the MASS package:

>> curl https://cloud.opencpu.org/ocpu/library/MASS/scripts/ch01.R -X POST
/ocpu/tmp/x05af9fe89a/R/dd
/ocpu/tmp/x05af9fe89a/R/m
/ocpu/tmp/x05af9fe89a/R/std.dev
/ocpu/tmp/x05af9fe89a/R/t.stat
/ocpu/tmp/x05af9fe89a/R/t.test.p
/ocpu/tmp/x05af9fe89a/R/v
/ocpu/tmp/x05af9fe89a/R/z
/ocpu/tmp/x05af9fe89a/stdout
/ocpu/tmp/x05af9fe89a/source
/ocpu/tmp/x05af9fe89a/console
/ocpu/tmp/x05af9fe89a/info
/ocpu/tmp/x05af9fe89a/files/ch01.pdf

The x05af9fe89a is the temporary session ID, which will be different for every execution. From the output we can see that this script stored 7 objects in the session namespace. To retrieve the z object in json format, use:

https://cloud.opencpu.org/ocpu/tmp/x05af9fe89a/R/z/json?pretty=FALSE

But what if we want to reuse the z object in a subsequent function call? We can now do this using the session namespace. For example, to calculate stats::sd(x = z), we need to refer to x05af9fe89a::z as shown below:

curl https://cloud.opencpu.org/ocpu/library/stats/R/sd/json -d x=x05af9fe89a::z
[
	1.9368
]

This way, we can chain script executions and function calls by passing output objects as arguments to subsequent requests.
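A client has to derive these references from the response body. The following is a rough sketch in R of how that could be done with plain string handling; the response vector is an abbreviated copy of the output shown earlier, and the variable names are only illustrative.

```r
# A few of the lines returned by the POST request to ch01.R
response <- c(
  "/ocpu/tmp/x05af9fe89a/R/dd",
  "/ocpu/tmp/x05af9fe89a/R/z",
  "/ocpu/tmp/x05af9fe89a/stdout",
  "/ocpu/tmp/x05af9fe89a/files/ch01.pdf"
)

# The session ID is the path component that follows /ocpu/tmp/
session <- unique(sub("^/ocpu/tmp/([^/]+)/.*$", "\\1", response))

# Only entries under /R/ are objects in the session namespace
is_object <- grepl("^/ocpu/tmp/[^/]+/R/", response)
objects <- basename(response[is_object])

# Build sessionid::name references for use as arguments in later requests
refs <- paste0(session, "::", objects)
print(refs)
```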

Function calls

For remote function calls, you can still use the session id alone to refer to the return object of the function call. For example to calculate stats::rnorm(n = 5) we do:

>> curl https://cloud.opencpu.org/ocpu/library/stats/R/rnorm -d n=5
/ocpu/tmp/x009f9e7630/R/.val
/ocpu/tmp/x009f9e7630/stdout
/ocpu/tmp/x009f9e7630/source
/ocpu/tmp/x009f9e7630/console
/ocpu/tmp/x009f9e7630/info

To calculate the standard deviation of our newly created object, the client can either use x009f9e7630::.val or simply x009f9e7630:

curl https://cloud.opencpu.org/ocpu/library/stats/R/sd -d x=x009f9e7630
curl https://cloud.opencpu.org/ocpu/library/stats/R/sd -d x=x009f9e7630::.val

The above two requests are equivalent.

CRAN release jsonlite 0.9.10 (RC)

2014-08-20T00:00:00+00:00

The jsonlite package is a JSON parser/generator optimized for the web. It implements a bidirectional mapping between JSON data and the most important R data types. This is very powerful for interacting with web APIs, or to build pipelines where data seamlessly flows in and out of R through JSON without any manual serializing, parsing or data munging.

The jsonlite package is one of the pillars of the OpenCPU system, which provides an interoperable API to interact with R over HTTP+JSON. However since its release, jsonlite has been adopted by many other projects as well, mostly to grab JSON data from REST APIs in R.

New in this version

Version 0.9.10 includes two new vignettes to get you up and running with JSON and R in a few minutes.

These vignettes show how to get started analyzing data from Twitter, NY Times, Github, NYC CitiBike, ProPublica, Sunlight Foundation and much more, with 2 or 3 lines of R code.

There are also a few other improvements, most notably support for parsing escaped JSON unicode sequences, which could be important if you are from a country with a non-latin alphabet.
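As a short illustration (assuming a jsonlite version with this support), escaped sequences in the JSON text are decoded into regular UTF-8 strings. The two example strings are arbitrary.

```r
library(jsonlite)

# "\\u00e9" in R source is the six-character JSON escape \u00e9
x <- fromJSON('["caf\\u00e9", "\\u4f60\\u597d"]')
print(x)

# The escapes are decoded to single characters
nchar(x)
```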

Release candidate

This is the 10th CRAN version of jsonlite, and we are getting very close to a 1.0 release. By now the package does what it should do, has been tested by many users and all outstanding issues have been addressed. The mapping between JSON data and R classes is described in detail in the jsonlite paper, and unit tests are available to validate that implementations behave as prescribed for all data and edge cases. Once the version bumps to 1.0, we plan to switch gears and start focussing more on optimizing performance.

Running OpenCPU server on Fedora and Enterprise Linux

2014-08-15T00:00:00+00:00

Starting with version 1.4.4, the OpenCPU cloud server can run on Redhat distributions, i.e. Fedora and Enterprise Linux (CentOS/RHEL). This post explains how to install and use OpenCPU on these systems. But before continuing I should emphasize that the preferred distribution to run OpenCPU servers is still Ubuntu, which has better support for R than any other server OS. If you would like to run OpenCPU (or other R based software) on a server, you can save yourself lots of time and headaches down the road by wisely choosing your OS. But if you like Redhat, know what you are doing and want to try OpenCPU, this post is for you.

OpenCPU rpm packages

A spec file and instructions to build the opencpu-server rpm package from source are available from the rpm readme in the Github repository. The build process is very easy and I verified that it works out of the box on Fedora 19, 20 and CentOS 6. For recent versions of Fedora, prebuilt binaries are available from build service, so all you need to do is add the repository and run:

yum install opencpu-server

If you find any issues with the rpm packages please report them on the issues page.

OpenCPU and SELinux

In general, the opencpu-server rpm package is very similar to the deb one, and most information in the server manual applies to Fedora/EL the same way as it does to Ubuntu. However one aspect is completely different: security.

Because OpenCPU has no notion of users or privileges, the server relies on Mandatory Access Control (MAC) style security. On Debian and Ubuntu, MAC is available through AppArmor and the opencpu-server package includes customisable apparmor profiles defining policies designed specifically for R and OpenCPU (see also RAppArmor). Redhat distributions on the other hand use SELinux and do not support AppArmor. The SELinux system is more complex and requires a lot of manual effort from the system administrator to configure and maintain security policies on the server (a popular introduction is SELinux for Mere Mortals). This is perhaps very powerful if you’re a bank or government agency with a team of dedicated security experts, but otherwise it can be pretty painful.

Because the OpenCPU server builds on rApache (mod_R), it runs by default in the SELinux httpd_modules_t context. This standard SELinux policy is designed for Apache modules, and prevents most types of malicious use that you would expect from a web service. Running OpenCPU in this context is fine for internal use, but it is not recommended to expose your Fedora/EL OpenCPU server to the web without further fine tuning SELinux for your application. Furthermore, if you experience unexpected permission denied errors, you probably need to enable some of the httpd_ SELinux "booleans". A boolean in SELinux is the term for a global flag that enables/disables a particular privilege within a particular context. The httpd_selinux man page lists some important booleans for httpd that you might want to turn on/off.

Some more information is available in the earlier mentioned rpm readme, which I will be updating regularly.

About R in CentOS/RHEL

The above should get you started on Fedora, but on Enterprise Linux there is another catch. Officially, Enterprise Linux does not support R! The standard repositories for CentOS and RHEL do not include the R-core and R-devel packages that are available in Fedora. The workaround that is recommended by for example CRAN and RStudio is to add the EPEL (Extra Packages for Enterprise Linux) repository, which includes ports of many Fedora packages, including R-core and R-devel.

However it is important to realize that packages in EPEL are not frozen: they include whatever is latest on the most recent version of Fedora. This means that each time a new version of Fedora gets released (every 6 months), the latest development versions of all EPEL packages get pushed to your server the next time you run yum update. This is usually precisely what you want to avoid on servers. I stress this because I learned this the hard way, when yum accidentally upgraded R from 2.15 to 3.0, breaking every currently installed package, when all I wanted was security updates.

None of this is a problem on distributions which have native support for R, such as Ubuntu, Debian or Fedora. But if you do decide to use CentOS/RHEL for R based services/applications, make sure you either disable EPEL after installing R, or be very careful with yum update on long running servers.

Combining pages of JSON data with jsonlite and plyr

2014-07-25T00:00:00+00:00

The jsonlite package is a JSON parser/generator for R which is optimized for pipelines and web APIs. It is used by the OpenCPU system and many other packages to get data in and out of R using the JSON format.

A bidirectional mapping

One of the main strengths of jsonlite is that it implements a bidirectional mapping between data frames and JSON. Thereby it can convert nested collections of JSON records, as they often appear on the web, immediately into the appropriate R structures, without complicated manual data munging by the user. For example, if a journalist wants to grab some data from ProPublica, she can simply use something like:

library(jsonlite)
mydata <- fromJSON("http://projects.propublica.org/forensics/geos.json")
View(mydata$geo)

Here, the mydata$geo object is a data frame which can be used directly for modeling or visualization, without the need for advanced data manipulation skills.

Paging with jsonlite and plyr

A question that comes up frequently is how to combine pages of data. Most web APIs limit the amount of data that can be retrieved per request. If the client needs more data than fits in a single request, it needs to break down the data into multiple requests that each retrieve a fragment (page) of data, not unlike pages in a book. In practice this is often implemented using a page parameter in the API. Below is an example from the ProPublica Nonprofit Explorer API where we retrieve the first 3 pages of tax-exempt organizations in the USA, ordered by revenue:

baseurl <- "http://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
mydata0 <- fromJSON(paste0(baseurl, "&page=0"))
mydata1 <- fromJSON(paste0(baseurl, "&page=1"))
mydata2 <- fromJSON(paste0(baseurl, "&page=2"))

#The actual data is in the filings element
print(mydata0$filings)
print(mydata0$filings$organization)

To analyze or visualize these data, we need to combine the pages into a single dataset. This is best done using rbind.fill from the plyr package. However because rbind.fill does not support nested data frames, we need to flatten the JSON data by passing the flatten = TRUE argument to fromJSON.

#Note flatten=TRUE requires jsonlite >= 0.9.9
baseurl <- "http://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
mydata0 <- fromJSON(paste0(baseurl, "&page=0"), flatten = TRUE)
mydata1 <- fromJSON(paste0(baseurl, "&page=1"), flatten = TRUE)
mydata2 <- fromJSON(paste0(baseurl, "&page=2"), flatten = TRUE)

#Combine data pages
library(plyr)
filings <- rbind.fill(mydata0$filings, mydata1$filings, mydata2$filings)

#Check output
colnames(filings)
nrow(filings)
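The effect of flatten = TRUE can also be seen offline on a small inline example. The JSON below is made up for illustration and only mimics the nested organization element of the real API response.

```r
library(jsonlite)

json <- '[
  {"id": 1, "organization": {"name": "A", "state": "CA"}},
  {"id": 2, "organization": {"name": "B", "state": "NY"}}
]'

# Default: the organization column is itself a data frame
nested <- fromJSON(json)
names(nested)
is.data.frame(nested$organization)

# flatten = TRUE: nested columns are expanded into regular columns
flat <- fromJSON(json, flatten = TRUE)
names(flat)
```

The flattened version has only atomic columns, which is the shape that rbind.fill expects.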

Automatically combining many pages

We can write a simple loop that automatically downloads and combines many pages. For example to retrieve the first 20 pages with non-profits from the example above:

#requires jsonlite >= 0.9.9
library(jsonlite)

#store all pages in a list first
baseurl <- "http://projects.propublica.org/nonprofits/api/v1/search.json?order=revenue&sort_order=desc"
pages <- list()
#pages are numbered from 0, so 0:19 gives the first 20 pages
for(i in 0:19){
  message("Retrieving page ", i)
  mydata <- fromJSON(paste0(baseurl, "&page=", i), flatten=TRUE)
  pages[[i+1]] <- mydata$filings
}

#combine all into one 
library(plyr)
filings <- rbind.fill(pages)

#check output
nrow(filings)
colnames(filings)
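As an aside, more recent jsonlite releases include a helper called rbind_pages that combines a list of data frames in a similar way, filling columns that are missing in some pages. The sketch below assumes such a version is installed and uses two small made-up pages instead of the live API:

```r
library(jsonlite)

# Two hypothetical pages whose columns only partly overlap
page1 <- data.frame(foo = 1:2, bar = c("a", "b"))
page2 <- data.frame(foo = 3L, baz = TRUE)

# Rows are stacked; columns absent from a page are filled with NA
combined <- rbind_pages(list(page1, page2))
print(combined)
```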

From here, our journalist can go straight to analyzing the data without any further tedious, complicated and time consuming data manipulation.

Recording of OpenCPU talk at #useR2014

2014-07-09T00:00:00+00:00

A recording of the useR! 2014 presentation about OpenCPU is now available on Youtube. This talk gives a brief (20 minute) motivation and introduction to some of the high level concepts of the OpenCPU system. The video contains mostly screen recording, mixed with some AV footage provided by Timothy Phan (thanks!).

The future of R on the web at #user2014

2014-06-27T00:00:00+00:00

The schedule and abstracts for useR! 2014 have been posted on the conference website. Session 2 (Tuesday 1pm) of the Kaleidoscope track will feature a fantastic set of talks about R and the web, including RCloud (Gordon Woodhull, AT&T), OpenCPU (Jeroen Ooms, UCLA), Shiny (Joe Cheng, RStudio) and rOpenSci (Karthik Ram, UC Berkeley).

The presentation about OpenCPU will be a high level introduction and go over some of the concepts from the recent whitepaper. The abstract and slides are available from the website. Update: a recording of the presentation is available below.

Deploying a scoring engine for predictive analytics with OpenCPU

2014-06-23T00:00:00+00:00

TLDR/abstract: See the tvscore demo app or this jsfiddle for all of this in action.

This post explains how to use the OpenCPU system to set up a scoring engine for calculating real time predictions. In our example we use the predict.gam function from the mgcv package to make predictions based on a generalized additive model. The entire process consists of four steps:

Build a model
Create an R package containing the model and a scoring function
Install the package on your OpenCPU server
Remotely call the scoring function through the OpenCPU API

Let’s get started!

Step 1: creating a model

For this example, we use data from the General Social Survey, which is a very rich dataset on demographic characteristics and attitudes of United States residents. To load the data in R:

#Data info: http://www3.norc.org/GSS+Website/Download/SPSS+Format/
download.file("http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/2012_spss.zip", destfile="2012_spss.zip")
unzip("2012_spss.zip")
GSS <- foreign::read.spss("GSS2012.sav", to.data.frame=TRUE)

The GSS data has 1974 rows for 816 variables. To keep our example simple, we create a model with only 2 predictor variables. The code below fits a GAM which predicts the average number of hours per day that a person watches TV, based on their age and marital status. In these data tvhours and age are numeric variables, whereas marital is a categorical (factor) variable with levels MARRIED, SEPARATED, DIVORCED, WIDOWED and NEVER MARRIED.

#Variable info: http://www3.norc.org/GSS+Website/Browse+GSS+Variables/Mnemonic+Index/
library(mgcv)
mydata <- na.omit(GSS[c("age", "tvhours", "marital")])
tv_model <- gam(tvhours ~ s(age, by=marital), data = mydata)

The predict function is used to score data against the model. We test with some random cases:

newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
)

predict(tv_model, newdata = newdata)
       1        2        3        4 
3.022650 3.693640 1.556342 3.665077

All seems good, this completes step 1. But just to get a sense of what our example model actually looks like before we start scoring, a simple viz:

library(ggplot2)
qplot(age, predict(tv_model), color=marital, geom="line", data=mydata) +
  ggtitle("gam(tvhours ~ s(age, by=marital))") +
  ylab("Average hours of TV per day")

Seems like people that get married start watching less TV, who would have thought :-) In a real study we should probably tune the smoothing a bit and add parenting as predictor (also in the data), but for simplicity we’ll stick with this model for now.

Step 2: creating a package

In order to score cases via the OpenCPU API, we need to turn the model into an R package. Making R packages is very easy these days, especially when using RStudio. Our package needs to contain at least two things: the tv_model object that we created above, and a wrapper function that calls out to predict(tv_model, ...). You can make the wrapper as simple or sophisticated as you like, based on the type of input and output data that you want to send/receive from your scoring engine.

The tvscore package that is available from the opencpu github repository is an example of such a package. The important thing to note is that the tv_model object is included in the data directory of the package. Saving objects to a file is done using the save function in R:

#Store the model as a data object
save(tv_model, file="data/tv_model.rda")

To load the model with the package, we can either set LazyData=true in the package DESCRIPTION, or manually load it using the data() function in R. For details on including data in R packages, see section 1.1.6 of writing R extensions.

Finally the package contains a scoring function called tv, which calls out to predict.gam. The scoring function is what clients will call remotely through the OpenCPU API. We use a smart function that supports both data frames as well as CSV files for input:

tv <- function(input){
  #input can either be csv file or data	
  newdata <- if(is.character(input) && file.exists(input)){
  	read.csv(input)
  } else {
  	as.data.frame(input)
  }
  stopifnot("age" %in% names(newdata))
  stopifnot("marital" %in% names(newdata))
  
  newdata$age <- as.numeric(newdata$age)

  #tv_model is included with the package
  newdata$tv <- predict.gam(tv_model, newdata = newdata)
  return(newdata)
}

Note how the function does a bit of input validation by checking that the age and marital columns are present. As usual, the tv function is saved in the R directory of the source package. Install the package locally to verify that it works as expected in a clean R session. To install our example package from github, restart R and do:

#install the tv score package
library(devtools)
install_github("opencpu/tvscore")

First we test the tv function with data frame input:

#test scoring with data frame input
library(tvscore)
newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
)
tv(input = newdata)

And then we test if it works for CSV data:

#test scoring with CSV file input
setwd(tempdir())
write.csv(newdata, "testdata.csv", row.names = FALSE)
library(tvscore)
tv(input = "testdata.csv")

If all of this works as expected, the package is ready to be deployed on your OpenCPU server!

Step 3: Install the package on the server

To deploy your scoring engine, simply install the package on your OpenCPU server. If you are running the OpenCPU cloud server, make sure to install your package as root. For example if you built the package into a tar.gz archive:

sudo -i
R CMD INSTALL tvscore_0.1.tar.gz

To install our example package straight from R, either on an OpenCPU cloud server or OpenCPU single-user server:

#install the tv score package
library(devtools)
install_github("opencpu/tvscore")

If you are running the cloud server, you are done with this step. If you are running the single-user server, start OpenCPU using:

library(opencpu)
opencpu$browse()

To verify that the installation succeeded, open your browser and navigate to the /ocpu/library/tvscore path on the OpenCPU server. Also have a look at /ocpu/library/tvscore/R/tv and /ocpu/library/tvscore/man/tv.

Step 4: Scoring through the API

Once the package is installed on the server, we can remotely call the tv function via the OpenCPU API. In the examples below we use the public demo server: https://cloud.opencpu.org/. For example, to call the tv function with curl using basic JSON RPC:

curl https://cloud.opencpu.org/ocpu/library/tvscore/R/tv/json \
 -H "Content-Type: application/json" \
 -d '{"input" : [ {"age":26, "marital" : "MARRIED"}, {"age":41, "marital":"DIVORCED"}, {"age":53, "marital":"NEVER MARRIED"} ]}'

Note how the OpenCPU server automatically converts input and output data from/to JSON using jsonlite. See the API docs for more details on this process. Alternatively we can batch score by posting a CSV file (example data):

curl https://cloud.opencpu.org/ocpu/library/tvscore/R/tv -F "input=@testdata.csv"

The response to a successful HTTP POST request contains the location of the output data in the Location header. For example if the call returned a HTTP 201 with Location header /ocpu/tmp/x036bf30e82, the client can retrieve the output data in various formats using a subsequent HTTP GET request:

curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/csv
curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/json
curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/tab

This completes our scoring engine. Using these steps, clients from any language can remotely score cases by calling the tv function using standard HTTP and JSON libraries.
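To get a feel for the conversion step, the JSON body from the first curl example can be parsed locally with jsonlite. This is only a sketch of what the argument looks like once it reaches the tv function; the exact internals of the server are not shown here.

```r
library(jsonlite)

# The same JSON body that was posted with curl above
payload <- '{"input" : [ {"age":26, "marital" : "MARRIED"},
                         {"age":41, "marital":"DIVORCED"},
                         {"age":53, "marital":"NEVER MARRIED"} ]}'

# The array of records becomes a data frame with one row per case
args <- fromJSON(payload)
str(args$input)

# Output data is converted the other way around
toJSON(args$input)
```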

Extra credit: performance optimization

When using a scoring engine based on OpenCPU in production, it is worthwhile configuring your server to optimize performance. In particular, we can add our package to the preload field in the /etc/opencpu/server.conf file on the OpenCPU cloud server. This will automatically load (but not attach) the package when the OpenCPU server starts, which eliminates package loading time from the individual scoring requests. In our example this is important because tvscore depends on the mgcv package, which takes about 2 seconds to load.

Note that R does not load LazyData objects when the package loads. Hence, preload in combination with lazy loading of data might not have the desired effect. When using preload, make sure to design your package such that all data gets loaded when the package loads (example).

Finally in production you might want to tweak the timelimit.post (timeout), rlimit.as (mem limit), rlimit.fsize (disk limit) and rlimit.nproc (parallel process limit) options in /etc/opencpu/server.conf to fit your needs. Also see the server manual on this topic.
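As a rough sketch, and assuming the configuration file uses JSON syntax, a fragment with these options could be drafted and validated from R before copying it to the server. The option names are the ones mentioned above; the numeric values are placeholders chosen for illustration, not recommendations.

```r
library(jsonlite)

# Illustrative values only: adapt the numbers to your own server
conf <- list(
  preload = c("tvscore", "mgcv"),
  "timelimit.post" = 90,
  "rlimit.as" = 4e9,
  "rlimit.fsize" = 1e9,
  "rlimit.nproc" = 50
)

# Render the settings as a JSON fragment
json <- toJSON(conf, auto_unbox = TRUE, pretty = TRUE)
cat(json, "\n")

# Parse it back to confirm that the fragment is valid JSON
parsed <- fromJSON(json)
```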

Bonus: creating an OpenCPU app

By including web pages in the /inst/www/ directory of the source package, we can turn our scoring engine into a standalone web application. The tvscore example package contains a simple web interface that makes use of the opencpu.js JavaScript client to interact with R via OpenCPU in the browser. Navigate to /ocpu/library/tvscore/www/ on the public demo server to see it in action!

To install and run the same app in your local R session, use:

#Install the app
library(devtools)
install_github("opencpu/tvscore")

#Load the app
library(opencpu)
opencpu$browse("/library/tvscore/www")

We can also call the OpenCPU server from an external website using cross domain ajax requests (CORS). See this jsfiddle for a simple example that calls the public server using the ocpu.rpc function from opencpu.js.

OpenCPU whitepaper published on arXiv

2014-06-20T00:00:00+00:00

This week a new paper appeared on arXiv titled: The OpenCPU System: Towards a Universal Interface for Scientific Computing through Separation of Concerns. It is based on a chapter of my thesis and provides a conceptual introduction to embedded scientific computing and the OpenCPU system.

The article deliberately does not describe any software specifics. Instead, it takes a high-level view and discusses domain logic of scientific computing, the benefits of using a standardized application protocol to interface statistical methods, and the importance of clearly separating statistical computing from application and implementation logic. The R software and OpenCPU API are used to illustrate the advocated approach. However, it is emphasized that the API is designed to describe general logic of data analysis rather than that of a particular language, and the system should generalize quite naturally to other computational back-ends, such as Julia, Python or Matlab.

This paper is an accumulation of many experiences with building statistical web applications in academic and industry organizations over the past years. I hope it will be a good read for anyone who wishes to build stacks, applications, and pipelines with integrated analysis/visualization components, with or without OpenCPU.

Go and grab the (open access) pdf from arXiv!

OpenCPU Gem for Ruby

2014-05-22T00:00:00+00:00

The guys from roqua.nl are working on an OpenCPU wrapper Gem. This simple API client provides a pretty nice basis for building R web applications with Ruby. A minimal example from the readme:

client.execute :digest, :hmac, { key: 'foo', object: 'bar', algo: 'md5' }
# => ['0c7a250281315ab863549f66cd8a3a53']

Which performs the following JSON RPC request:

digest::hmac(key="foo", object="bar", algo="md5")

They are accepting pull requests!

OpenCPU release 1.3 and 1.4

2014-04-20T00:00:00+00:00

After a few months of testing we present OpenCPU versions 1.3 and 1.4. These releases do not introduce any major changes in the OpenCPU HTTP API but focus entirely on performance, reliability and security to support long running servers. The only minor API change is the switch to absolute URLs in the Location header. Upgrading from OpenCPU 1.2 should be painless and is recommended.

These and future releases of the OpenCPU cloud server will target Ubuntu 14.04 in order to take advantage of recent features in R, Apache2, AppArmor and nginx. Because this is a Long Term Support (LTS) Ubuntu release it includes 5 years of updates. Hence your OpenCPU server can run safely until April 2019 (or until you decide to upgrade).

Version 1.3 versus 1.4

OpenCPU versions 1.3 and 1.4 build on exactly the same version of the HTTP API and server code. The only difference is the version of R that is used in the cloud server. OpenCPU version 1.3 uses R 3.0.2 included with Ubuntu, whereas OpenCPU version 1.4 uses the current version: R 3.1.0.

If you have no preference, OpenCPU 1.4 is recommended because many of the packages on CRAN require the current version of R and will therefore only work with OpenCPU 1.4.

How to upgrade

Because of some internal cleanup and refactoring of configuration files, it is highly recommended to install the new version of OpenCPU on a fresh Ubuntu 14.04 server. Usually installing a new Ubuntu server is safer and quicker than upgrading an old server anyway. See the Server Manual for standard instructions on a clean installation.

However if for whatever reason you need to upgrade a previous installation, the safest way is to uninstall previous versions before installing the new one. This ensures that no old files keep lingering around.

# remove old versions
sudo apt-get purge opencpu-*
sudo apt-get autoremove --purge

# upgrade Ubuntu to 14.04 (if not done so yet)
sudo do-release-upgrade

# install new version on Ubuntu 14.04
sudo add-apt-repository ppa:opencpu/opencpu-1.4
sudo apt-get update
sudo apt-get install opencpu

OpenCPU and RStudio

Using OpenCPU together with RStudio is now even easier! The opencpu-1.3 and opencpu-1.4 repositories include a copy of rstudio server that you can install with a single line:

# install rstudio
sudo apt-get install rstudio-server

Both apache and nginx are preconfigured to proxy the /rstudio/ path to rstudio. Hence after installing both opencpu and rstudio-server they can be accessed directly through:

https://your.server.com/ocpu/
https://your.server.com/rstudio/

Appendix B of the OpenCPU Server Manual has some more details.

Questions

If you have any problems, questions, feedback or suggestions feel free to send an email on the mailing list or open an issue on github. As is the case for many open source projects, good software comes with terrible documentation. But if anything is not working or unclear please do let me know; it is probably something small.

Getting ready for OpenCPU 1.3

2014-03-17T00:00:00+00:00

The OpenCPU public demo server and ocpu.io have been upgraded to an early version of the upcoming OpenCPU 1.3 release. This release is scheduled for April 17 along with Ubuntu 14.04 (Trusty). By deploying it on the public demo server we get some testing before the actual release. Please report any problems.

New in OpenCPU 1.3

The improvements in this release are mostly internal. However there will be one subtle change: starting with version 1.3, all HTTP API responses with status code 201, 301 or 302 will use an absolute URL in the Location response header. For example, the response headers of a request could contain:

...
Date: Mon, 17 Mar 2014 06:59:26 GMT
Location: http://cloud.opencpu.org/ocpu/tmp/x0e28afb7/
Content-Length: 44
...

Whereas in previous versions, the same response would have looked like:

...
Date: Mon, 17 Mar 2014 06:59:26 GMT
Location: /ocpu/tmp/x0e28afb7/
Content-Length: 44
...

Relative URLs work as long as everything lives on a single host. However, to scale up to distributed environments, where resources can be hosted on various servers, we need to start using absolute URLs.

How to update my client/app?

Most HTTP clients natively understand both absolute and relative URLs, so you probably won’t notice the difference. For example the opencpu.js client library requires no changes or updates. However, those of you who implemented a custom OpenCPU client might want to double check that your code understands both absolute and relative URLs in the Location header, to make sure your application will be compatible with future versions of OpenCPU.
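For a custom client written in R, handling both forms could look roughly like the sketch below. The helper resolve_location is hypothetical (it is not part of OpenCPU or opencpu.js) and deliberately covers only the two cases shown above: a full absolute URL and a path that starts with a slash.

```r
# Hypothetical helper for illustration: resolve a Location header value
# against the URL of the original request. Only absolute URLs and
# root-relative paths are handled; other relative forms raise an error.
resolve_location <- function(request_url, location) {
  # Absolute URL: starts with a scheme such as http:// or https://
  if (grepl("^[A-Za-z][A-Za-z0-9+.-]*://", location)) {
    return(location)
  }
  # Root-relative path: prepend scheme and host of the original request
  if (startsWith(location, "/")) {
    origin <- sub("^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*$", "\\1", request_url)
    return(paste0(origin, location))
  }
  stop("unsupported Location format: ", location)
}

request <- "http://cloud.opencpu.org/ocpu/library/stats/R/rnorm"
old_style <- resolve_location(request, "/ocpu/tmp/x0e28afb7/")
new_style <- resolve_location(request, "http://cloud.opencpu.org/ocpu/tmp/x0e28afb7/")

# Both header styles should end up pointing at the same resource
identical(old_style, new_style)
```

A client that normalizes the header this way does not need to care which OpenCPU version produced the response.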

OpenCPU 1.2.3 release

2014-03-12T00:00:00+00:00

A new version of OpenCPU was released to CRAN and Launchpad. Besides some minor bugfixes, the single-user server has better support for configuration. By default, the single-user server will now load configuration from the following file:

path.expand("~/.opencpu.conf")

If this file does not exist, the default configuration is used.

Future plans

This is likely the final release in the 1.2 series. Future versions of OpenCPU will be targeting R 3.1 and Ubuntu 14.04 (both to be released in April), and the version number will be bumped to emphasize this.

No changes in the API are scheduled. Future work will focus on improving performance, documentation and client libraries.

Release of jsonlite 0.9.4

2014-03-02T00:00:00+00:00

A new version of the jsonlite package was released to CRAN. In addition to adding small new features, this release cleans up code and documentation. Some annoying compiler warnings inherited from RJSONIO are fixed and the reference manual is a bit more concise. Also some new examples of public JSON APIs were added to the package vignette. These are great to see the power of jsonlite in action when working with real world JSON structures.

What is jsonlite again?

The jsonlite package is a fork of RJSONIO. It builds on the same libjson c++ parser (although a more recent version), but implements a different system for converting between R objects and JSON structures. The most powerful feature is the option to automatically convert tabular JSON structures into R data frames and vice versa. Tabular structures are very common in JSON data, but usually difficult to read and manipulate. By automatically turning these into data frames jsonlite can save you many hours and bugs in getting your JSON data in and out of R. This blog post has some nice examples with data from the Github API.

New in this release

Two new functions were introduced in this release. The minify function is the opposite of prettify, and reduces the size of a JSON blob by removing all redundant whitespace.
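As a small illustration of how the two relate (the JSON string below is made-up sample data), prettifying and then minifying should preserve the content while only the whitespace changes:

```r
library(jsonlite)

# Made-up sample JSON for illustration
pretty <- prettify('{"foo":[1,2,3],"bar":"baz"}')
compact <- minify(pretty)

# The compact version is shorter but parses to the same R object
nchar(compact) < nchar(pretty)
identical(fromJSON(compact), fromJSON(pretty))
```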

The new unbox function was requested by several users. It can be used to force atomic vectors of length 1 to be encoded as a JSON scalar rather than an array. To understand why this should not be default behavior, see the vignette or this github issue. However it can be useful to do this for individual object elements:

> cat(toJSON(list(foo=123)))
{ "foo" : [ 123 ] }
> cat(toJSON(list(foo=unbox(123))))
{ "foo" : 123 }

In the context of a script or function, the unbox function should only be used for elements that are always exactly length 1; otherwise unbox will throw an error. This is to protect you from writing code that generates inconsistent JSON, i.e. an array one time and a scalar another time.
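A minimal sketch of that behavior is given below. The tryCatch wrapper is only there to show the error without aborting the script; the exact error message is not shown because it may differ between jsonlite versions.

```r
library(jsonlite)

# A length-1 vector can be unboxed into a JSON scalar
scalar_json <- toJSON(list(foo = unbox(123)))

# A longer vector cannot: unbox() signals an error instead of guessing
msg <- tryCatch(
  {
    unbox(c(1, 2))
    "no error"
  },
  error = function(e) "unbox failed"
)
msg
```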

The same unbox function can be used for data frames with exactly 1 row:

> mycar <- cars[23,]
> cat(toJSON(mycar))
[ { "speed" : 14, "dist" : 80 } ]
> cat(toJSON(unbox(mycar)))
{ "speed" : 14, "dist" : 80 }

But again this should be used sparingly and with care. When in doubt, always stick with the default toJSON encodings.

Publishing dynamic data on ocpu.io

2014-02-16T00:00:00+00:00

Suppose you would like to publish some data, for example to accompany a journal article. One way would be to put a CSV file on your website, and share the URL with your colleagues. However CSV has many limitations: it only works for tabular structures, has limited type safety (pretty much everything gets coerced into strings) and leads to loss of numeric precision.

There are many alternative data interchange formats, each with their own benefits and limitations. For example JSON is widely supported and can be parsed in almost any language, however it can be verbose and slow. A binary format such as Protocol Buffers is more efficient, but many users might not know how to parse it. You could even use save or saveRDS in R to share the native R structures, however this limits your audience to R users.

Retrieving dynamic data

What we really need is a method to publish the data itself rather than some representation of the data in a particular format. With OpenCPU you can publish R objects (including datasets) in a way that lets the clients select the format and formatting options for retrieving the dataset. This is implemented using native R functionality to include arbitrary data/objects in packages, and standard R functions for exporting these data. For example, the CRAN package MASS includes a dataset called bacteria:

library(MASS)
data(bacteria)
print(bacteria)

Via OpenCPU, the dataset can be downloaded by anyone, using one of many formats:

Format	Export Function	URL (short)
text	`print`	`cran.ocpu.io/MASS/data/bacteria/print`
CSV	`write.csv`	`cran.ocpu.io/MASS/data/bacteria/csv`
TSV	`write.table`	`cran.ocpu.io/MASS/data/bacteria/tab`
JSON	`jsonlite::toJSON`	`cran.ocpu.io/MASS/data/bacteria/json`
Protocol Buffers	`RProtoBuf::serialize_pb`	`cran.ocpu.io/MASS/data/bacteria/pb`
RData	`save`	`cran.ocpu.io/MASS/data/bacteria/rda`
RDS	`saveRDS`	`cran.ocpu.io/MASS/data/bacteria/rds`
ascii R	`dput`	`cran.ocpu.io/MASS/data/bacteria/ascii`

The client can also control formatting options by passing HTTP parameters. These parameters map directly to function arguments for the respective export function in the table above. Some random examples:

Output Format	Equivalent URL on Public OpenCPU Server
`write.csv(bacteria, row.names=TRUE)`	`cran.ocpu.io/MASS/data/bacteria/csv?row.names=true`
`jsonlite::toJSON(Boston, digits=4)`	`cran.ocpu.io/MASS/data/Boston/json?digits=4`
`jsonlite::toJSON(Boston, dataframe="columns")`	`cran.ocpu.io/MASS/data/Boston/json?dataframe=columns`
`jsonlite::toJSON(Boston, pretty=FALSE)`	`cran.ocpu.io/MASS/data/Boston/json?pretty=false`
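The mapping from function arguments to URL parameters in the table above can be sketched in R as follows. The helper ocpu_data_url is hypothetical and not part of OpenCPU, and the https:// prefix is an assumption, since the tables list the short form of the URLs without a scheme.

```r
# Hypothetical helper for illustration: build a data URL from the
# arguments you would otherwise pass to the export function.
ocpu_data_url <- function(host, pkg, object, format, ...) {
  args <- list(...)
  url <- paste0("https://", host, "/", pkg, "/data/", object, "/", format)
  if (length(args) == 0) {
    return(url)
  }
  # Logicals are written as true/false, as in the table above
  values <- vapply(args, function(x) {
    if (is.logical(x)) tolower(as.character(x)) else as.character(x)
  }, character(1))
  encoded <- vapply(values, utils::URLencode, character(1), reserved = TRUE)
  paste0(url, "?", paste0(names(values), "=", encoded, collapse = "&"))
}

# write.csv(bacteria, row.names=TRUE)
csv_url <- ocpu_data_url("cran.ocpu.io", "MASS", "bacteria", "csv", row.names = TRUE)

# jsonlite::toJSON(Boston, digits=4, dataframe="columns")
json_url <- ocpu_data_url("cran.ocpu.io", "MASS", "Boston", "json",
                          digits = 4, dataframe = "columns")
```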

Creating a data package

To start publishing your own dynamic data you need to put your data objects in an R package following the standard guidelines as documented in section 1.1.6 of Writing R Extensions. This might sound cumbersome, but once you get the hang of it, it only takes a few seconds. You’ll realize that packages are actually a beautiful, standardized and well-tested container format for R objects and much more. Have a look at the data folder in the opencpu/appdemo package for some examples.

After creating and installing your package on your local R, test it using the OpenCPU single user server:

library(opencpu)
opencpu$browse("/library/mypackage/data")
opencpu$browse("/library/mypackage/data/myobject")

Publishing dynamic data on ocpu.io

To make your data available through the public OpenCPU server and ocpu.io, all you need to do is put your package up on Github. OpenCPU requires the name of the Github repository to match the name of the R package it contains. Use devtools to test if your package is working:

library(devtools)
install_github("username/pkgname")

If this succeeds you’re good to go. Navigate to username.ocpu.io/pkgname/data where username is your Github login. By default the OpenCPU public server updates packages installed from Github every 24 hours. However, the Github webhook can be used to update the package immediately every time a commit is pushed to github.

Publishing dynamic data on your own server

OpenCPU does not lock you into some commercial hosting service. Your data is stored on Github in a standard format under your control. The ocpu.io public server is there for your convenience. You can also install your own OpenCPU cloud server to publish data at e.g. http://opencpu.yourserver.com/ocpu/library/pkgname/data/myobject. No need to put anything on Github, just install the package in R on the server.

Share and access R code, data, apps on ocpu.io

2014-02-12T00:00:00+00:00

ocpu.io is a new domain for publishing code, data and apps based on the OpenCPU system. Any R package on Github is directly available via yourname.ocpu.io. Thereby the package can be used remotely via the OpenCPU API to access data, perform remote function calls, reproduce results, publish webapps, and much more. The OpenCPU public server page explains how requests to ocpu.io map to the existing public demo server.

Examples

Action	URL (short)
List packages on CRAN	`cran.ocpu.io`
List packages on BioConductor	`bioc.ocpu.io`
Github repositories from: Hadley	`hadley.ocpu.io`
Package Info
MASS from CRAN	`cran.ocpu.io/MASS/`
plyr from CRAN	`cran.ocpu.io/plyr/`
plyr from Github	`hadley.ocpu.io/plyr/`
Package Contents
MASS datasets	`cran.ocpu.io/MASS/data/`
plyr datasets	`hadley.ocpu.io/plyr/data/`
plyr R objects	`hadley.ocpu.io/plyr/R/`
plyr help pages	`hadley.ocpu.io/plyr/man/`
plyr files	`hadley.ocpu.io/plyr/DESCRIPTION`
Datasets
mammals sleep data (print)	`hadley.ocpu.io/ggplot2/data/msleep/print`
mammals sleep data (csv)	`hadley.ocpu.io/ggplot2/data/msleep/csv`
mammals sleep data (json)	`hadley.ocpu.io/ggplot2/data/msleep/json`
mammals sleep data (json columns)	`hadley.ocpu.io/ggplot2/data/msleep/json?dataframe=columns`
Manual pages
msleep help (text)	`hadley.ocpu.io/ggplot2/man/msleep/text`
msleep help (html)	`hadley.ocpu.io/ggplot2/man/msleep/html`
msleep help (pdf)	`hadley.ocpu.io/ggplot2/man/msleep/pdf`
Example Apps
appdemo (src)	`opencpu.ocpu.io/appdemo/www`
stocks (src)	`opencpu.ocpu.io/stocks/www`
nabel (src)	`opencpu.ocpu.io/nabel/www`
markdownapp (src)	`opencpu.ocpu.io/markdownapp/www`
mapapp (src)	`opencpu.ocpu.io/mapapp/www`

How to use

To start publishing on ocpu.io you need to put your R functions, datasets, scripts and sweave/knitr documents into an R package and put it up on Github. This is not too difficult; there are many guides on how to do this. OpenCPU requires the name of the Github repository to match the name of the R package it contains. Use devtools to test if your package is working:

library(devtools)
install_github("username/pkgname")

If this succeeds you’re good to go. Navigate to username.ocpu.io/pkgname where username is your Github login. The API docs and JavaScript docs explain how to read objects, files and datasets, RPC functions and develop apps.

By default the OpenCPU public server updates packages installed from Github every 24 hours. However, the Github webhook can be used to update the package immediately every time a commit is pushed to github.

OpenCPU 1.2: Flexible and reliable R function RPC over HTTPS + JSON

2013-12-19T00:00:00+00:00

Earlier this week, OpenCPU 1.2 was released. This release uses the new jsonlite package for JSON conversion, which puts in place the final fundamental piece of the OpenCPU framework. This post describes what has changed, why this is important, and how to upgrade.

From here, no major changes in the OpenCPU API are planned for quite a while, so that we can shift focus towards optimizing performance, implementing client-libraries and developing applications.

HTTPS, JSON and OpenCPU

Let’s first explain why this piece is important. The OpenCPU API defines a mapping between HTTP requests and R function calls. This is easy for simple input and output, such as numbers or vectors:

curl https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json -d 'n=10&mean=5'

But what if the R function has a return value or arguments which require more advanced objects, such as a matrix or data frame? This is where jsonlite comes in. The jsonlite vignette defines a practical and consistent mapping between JSON data and R objects. This allows OpenCPU to automatically convert incoming JSON arguments into R objects using jsonlite::fromJSON, and convert output values back into JSON using jsonlite::toJSON. Thereby the cycle is complete, and we can call advanced R functions over http(s)+json without requiring clients to have any understanding of R.

An example: melting data frames

Examples with curl get a bit verbose with a large payload, but to get an idea, let’s melt some data using the melt function in the reshape2 package. This function has an argument data (data frame) and an argument id (character vector). It returns another data frame. In this example, we pass it the first three rows of the airquality dataset, very similar to the example in the melt manual page. The API docs explain that the JSON objects can be posted as HTTP parameters in one of the standard HTTP POST formats (i.e. multipart or x-www-form-urlencoded):

curl https://cloud.opencpu.org/ocpu/library/reshape2/R/melt/json \
-d 'data=[{"Ozone":41, "Solar.R":190, "Wind":7.4, "Temp":67, "Month":5, "Day":1}, 
{"Ozone":36, "Solar.R":118, "Wind":8, "Temp": 72, "Month":5, "Day":2}, 
{"Ozone":12, "Solar.R":149, "Wind":12.6, "Temp": 74, "Month":5, "Day":3}]&id=["Month", "Day"]'

Alternatively, we can do pure JSON RPC by setting the Content-Type: application/json header:

curl https://cloud.opencpu.org/ocpu/library/reshape2/R/melt/json \
-H 'Content-Type: application/json' \
-d '{
  "data": [
    {"Ozone":41, "Solar.R":190, "Wind":7.4, "Temp":67, "Month":5, "Day":1}, 
    {"Ozone":36, "Solar.R":118, "Wind":8, "Temp": 72, "Month":5, "Day":2}, 
    {"Ozone":12, "Solar.R":149, "Wind":12.6, "Temp": 74, "Month":5, "Day":3}
  ], 
  "id" :["Month", "Day"]
 }'

Note that if you use Windows, the curl examples might need to be modified to properly escape the quotes in the windows terminal. This is just a limitation of using the windows command line; it won’t be a problem for actual clients (e.g. a browser). If you don’t like curl, the same request can be performed using the ocpu test page.

The above RPC request is equivalent to the R code below. You can use this code as a template to see how your R functions would behave when called remotely over OpenCPU.

# Load required packages
library(jsonlite)
library(reshape2)

# Input arguments in JSON format
input <- '{
  "data": [
    {"Ozone":41, "Solar.R":190, "Wind":7.4, "Temp":67, "Month":5, "Day":1}, 
    {"Ozone":36, "Solar.R":118, "Wind":8, "Temp": 72, "Month":5, "Day":2}, 
    {"Ozone":12, "Solar.R":149, "Wind":12.6, "Temp": 74, "Month":5, "Day":3}
  ], 
  "id" :["Month", "Day"]
 }'

# The actual function call
args <- fromJSON(input)
result <- do.call(reshape2::melt, args)

# This is what you get back from OpenCPU
output <- toJSON(result, pretty=TRUE)
cat(output)

Upgrading to OpenCPU 1.2

It is recommended to update your servers and applications to version 1.2 sooner rather than later. The 1.0 branch will keep working, but it won’t get any new fixes or updates. We plan to stay on the 1.2 branch for quite a while.

The introduction of jsonlite does not affect the HTTP API itself, but existing applications that rely heavily on JSON to get data in and out of R might need some modification. For this reason we decided to bump the version to the 1.2 series. If you have existing OpenCPU clients/applications that use JSON, have a look at the post about jsonlite to get a better understanding of how JSON data map to R objects and vice versa. Installing or upgrading the OpenCPU single-user development server is business as usual:

update.packages(ask = FALSE)
install.packages("opencpu")

Servers running the OpenCPU 1.0 cloud server will not automatically receive the update to 1.2, to prevent existing applications from breaking. In order to update a previous installation of the OpenCPU cloud server, you need to add the new repository first:

sudo add-apt-repository ppa:opencpu/opencpu-1.2
sudo apt-get update
sudo apt-get upgrade

To see if the update was successful, navigate to /ocpu/library/opencpu on your server to check the currently installed version of the opencpu package.

New package: jsonlite. A smart(er) JSON encoder/decoder.

2013-12-06T00:00:00+00:00

This week we released a new package on CRAN: jsonlite. This package is a fork of RJSONIO by Duncan Temple Lang and builds on the same parser, but uses a different mapping between R objects and JSON data. The package vignette goes into great detail and has many examples on how JSON data are converted to R objects and vice versa. To try it:

#install
install.packages("jsonlite", repos="http://cran.r-project.org")

#load
library(jsonlite)

#convert object to json
myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)

#convert json back to object
iris2 <- fromJSON(myjson)
print(iris2)

So what’s new?

The jsonlite package implements functions toJSON and fromJSON similar to those in packages such as RJSONIO and rjson, but options and output are quite different. The primary goal in the design of jsonlite is to recognize and comply with conventional ways of encoding data in JSON (outside the R community), in particular (relational) datasets. This increases interoperability when dealing with external data from within R, or when reading/writing R objects from an external client (e.g. through OpenCPU). For example, consider structures as returned by the Github API:

Simple dataset: https://api.github.com/users/hadley/orgs
Nested dataset: https://api.github.com/users/hadley/repos

These JSON structures obviously represent data tables, or in R terminology: data frames. The first dataset is a single table; the second dataset has a relational structure with two tables: the owner property in the main table was generated from a foreign key that points to a record in a second table (owners). However, in their JSON representation these tables are structured by row, whereas R likes data frames by column. This is one example where jsonlite goes beyond other packages, and actually returns a data frame:

library(jsonlite)
library(httr)

#get data
data1 <- fromJSON("https://api.github.com/users/hadley/orgs")

#it's a data frame
names(data1)
data1$login
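The same row-to-column conversion can be tried offline. The sketch below uses a small made-up JSON array (not actual Github API output) to show that an array of records becomes a data frame, and that converting it back and parsing again yields the same data frame:

```r
library(jsonlite)

# Made-up records structured by row, as a typical JSON API would return them
rows_json <- '[{"login":"alice","id":1},{"login":"bob","id":2}]'

# fromJSON turns the array of records into a data frame (one column per field)
orgs <- fromJSON(rows_json)
is.data.frame(orgs)
orgs$login

# Encoding and decoding again gives back an equal data frame
roundtrip <- fromJSON(toJSON(orgs))
all.equal(roundtrip, orgs)
```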

The second example is a bit more complicated because of the relational structure. jsonlite tries to stay as close as possible to the original structure by returning a nested data frame:

data2 <- fromJSON("https://api.github.com/users/hadley/repos")

#it's a data frame...
names(data2)
data2$name

#...which has a nested data frame
names(data2$owner)
data2$owner$login

#these are equivalent :)
data2[1,]$owner$login
data2[1,"owner"]$login
data2$owner[1,"login"]
data2$owner[1,]$login
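For readers who want to experiment without calling the Github API, here is a sketch with made-up records that have the same shape (a nested owner object per row). It also shows jsonlite's flatten function, which is one way to turn the nested data frame into a regular one; whether you want that depends on your application.

```r
library(jsonlite)

# Made-up records with a nested "owner" object, mimicking the structure above
nested_json <- '[
  {"name":"pkgA","owner":{"login":"alice","id":1}},
  {"name":"pkgB","owner":{"login":"alice","id":1}}
]'

repos <- fromJSON(nested_json)

# The owner column is itself a data frame
is.data.frame(repos$owner)
repos$owner[1, "login"]

# flatten() spreads the nested columns into owner.login, owner.id
flat <- flatten(repos)
names(flat)
```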

The package vignette gives many more examples of how various structures map to R objects.

On correctness and performance

The initial emphasis in jsonlite has been on correctness: rather than rushing towards performance, we want to explicitly specify intended behavior covering all important structures. The complexity of this problem is easily underestimated, which can result in unexpected behavior, ambiguous edge cases and differences between implementations. A set of conventions for a consistent and practical mapping are proposed in the package vignette. If you are using JSON with R, feel free to join the discussion.

Premature optimization is the root of all evil.
Donald Knuth

We hope that a clear specification will make it much easier to optimize performance or write alternate implementations. The package vignette and package unit tests are intended to take away ambiguity on what exactly toJSON and fromJSON are supposed to do. From here we will start optimizing R code, port pieces to C++, or perhaps even write an entirely new implementation, without breaking software that depends on it.

If you would like to contribute to jsonlite, you can submit patches or pull requests on github, as long as they don’t alter the behavior of the functions. At a minimum, they should pass the package unit tests… or you should modify the unit tests that are overly strict :-)

library(testthat)
test_package("jsonlite")

Continuous Integration with OpenCPU

2013-11-27T00:00:00+00:00

Starting with version 1.0.7, the OpenCPU cloud server adds support for continuous integration (CI). This means that Github repositories can be configured to automatically install your package on an OpenCPU server, every time a commit is pushed. To take advantage of this feature, it is required that:

Your R source package is hosted on Github.
The name of the Github repository is identical to the name of the R package.
Your Github user account has a public email address.

To set up CI, add the following URL as a ‘WebHook’ in your Github repository:

https://cloud.opencpu.org/ocpu/webhook

Make sure to select payload version application/vnd.github.v3+form. To trigger a build, push a commit to the master branch. The build will show up under Recent Deliveries in your github project page and you should receive an email reporting whether the installation was successful. If it was, the package will directly be available for remote use through the OpenCPU API.

But why?

Continuous Integration in OpenCPU addresses several issues at once:

If you introduced a bug and your package fails to install, you get notified by email immediately.
Deploy packages/apps on OpenCPU public cloud servers without having to wait until the server synchronizes.
You can use CI without relying on a 3rd party service; installing your own OpenCPU server is easy.

Every active R package maintainer could benefit from some sort of CI environment, with or without OpenCPU. Earlier this year, Yihui had a cool blog post about Travis CI (also see r-travis). Simon Urbanek’s rforge.net is another service that provides some auto-building functionality. One way or another, it’s important to frequently check that all your packages still build, pass unit tests, haven’t introduced conflicts, etc. That way you catch problems immediately while the changes are still fresh in your memory.

Moreover, unexpected changes in R or dependencies are often beyond your control, but can cause your package to work one day, and break the next. The article on Possible Directions for Improving Dependency Versioning in R (The R Journal Vol. 5/1, June 2013) explained that CRAN requires all “current” packages to be compatible, which assumes that all package authors are constantly on the lookout for changes in dependencies and reverse dependencies, forever. This system is unsustainable and will eventually have to be revised, but continuous integration can at least help detect problems as soon as possible.

Final notes

Some final notes/disclaimers: this feature is currently being tested; please let me know if something is not working. To set up your own OpenCPU CI server, you need to configure an SMTP server, which is not yet documented in the PDF manual. Also note that currently only the default (master) branch will be deployed; pushes to other branches are ignored. Finally some packages might not build on the public demo server because of missing system dependencies. If your package needs any particular libraries, send me an email (or set up your own cloud server :-)

The RAppArmor Package: Enforcing Security Policies in R Using Dynamic Sandboxing on Linux

2013-11-14T00:00:00+00:00

An article called The RAppArmor Package: Enforcing Security Policies in R Using Dynamic Sandboxing on Linux has appeared in the latest volume of the Journal of Statistical Software: http://www.jstatsoft.org/v55/i07. The RAppArmor package is one of the foundations of the OpenCPU framework. It protects against malicious use and excessive use of hardware resources when executing arbitrary R code. From the abstract:

The increasing availability of cloud computing and scientific super computers brings great potential for making R accessible through public or shared resources. This allows us to efficiently run code requiring lots of cycles and memory, or embed R functionality into, e.g., systems and web services. However some important security concerns need to be addressed before this can be put in production. The prime use case in the design of R has always been a single statistician running R on the local machine through the interactive console. Therefore the execution environment of R is entirely unrestricted, which could result in malicious behavior or excessive use of hardware resources in a shared environment. Properly securing an R process turns out to be a complex problem. We describe various approaches and illustrate potential issues using some of our personal experiences in hosting public web services. Finally we introduce the RAppArmor package: a Linux based reference implementation for dynamic sandboxing in R on the level of the operating system.

Code, documentation, examples and videos are available from Github: https://github.com/jeroenooms/RAppArmor. A quick preview of what the package does is given below. The eval.secure function evaluates an expression in a sandboxed process. This way it is possible to set limits on hardware resources such as memory allocation, cpu usage, etc:

library(RAppArmor)

#sandboxed evaluation: setting 512MB memory limit
A <- eval.secure(rnorm(1e7), RLIMIT_AS = 512*1024*1024);
length(A)
> [1] 10000000

B <- eval.secure(rnorm(1e8), RLIMIT_AS = 512*1024*1024);
> Error: cannot allocate vector of size 762.9 Mb

RAppArmor can also set hard time limits to kill jobs that are not returning timely. These time limits always work, unlike e.g. R's built-in setTimeLimit which won't work for the example below:

cputest <- function() {
  A <- matrix(rnorm(1e7), 1e3)
  B <- svd(A)
}

system.time(x <- eval.secure(cputest(), timeout = 5))
> Error: R call did not return within 5 seconds. Terminating process.
> Timing stopped at: 0.003 0.006 5.008

But the most important feature is enforcing Mandatory Access Control policies by applying an AppArmor profile. In this profile you can specify exactly which files and resources on the system a process is allowed to access and which not. For example, the r-user profile used below does not have permission to list the contents of the root of the system:

> list.files("/")
 [1] "bin"            "boot"           "cdrom"          "dev"           
 [5] "etc"            "home"           "initrd.img"     "initrd.img.old"
 [9] "lib"            "lib64"          "lost+found"     "media"         
[13] "mnt"            "opt"            "proc"           "root"          
[17] "run"            "sbin"           "srv"            "sys"           
[21] "tmp"            "usr"            "var"            "vmlinuz"       
[25] "vmlinuz.old"   
> eval.secure(list.files("/"), profile="r-user")
character(0)

This and much more is described in detail in the Journal of Statistical Software: http://www.jstatsoft.org/v55/i07.

OpenCPU Release 1.0.4

2013-10-17T00:00:00+00:00

OpenCPU version 1.0.4 was released to CRAN and Launchpad this week. This release brings some bug fixes/improvements and no breaking changes so you can safely upgrade your 1.0.x installations. Upgrade an existing OpenCPU cloud server using:

sudo apt-get update
sudo apt-get upgrade

Or to install the latest version of the OpenCPU local single-user server in R:

update.packages(ask=FALSE)
install.packages("opencpu", repos="http://cran.r-project.org")

New in this release

One improvement in this release is the capturing of output from the package installation process. This is surprisingly difficult in R, but thanks to some helpful tips on r-devel, we found a way to implement it. This makes it much easier to diagnose the problem if a certain package fails to install on OpenCPU.

For example: as described in the API manual section on libraries, the /ocpu/cran/, /ocpu/bioc/ and /ocpu/github/ APIs represent remote libraries: when a client calls a package in any of these libraries for the first time, the OpenCPU server will attempt to install the current version of the corresponding package on the fly (if not already available), before processing the request. In a previous post we described how this allows anyone on the internet to use your R package without even installing R.

However, sometimes the installation of a package fails, for example because of a missing dependency or version conflict. To make it easier to diagnose the problem, the OpenCPU server now returns the output from the package installation process for failed installations. For example, here are two packages that fail to install, and now we know why :-)

Loading these pages can take a couple of seconds because we have to wait for the installation process to complete. However once a package installation has succeeded it is stored for 24 hours so that the next request/user will be able to use it instantaneously.

About Local and Remote libraries

It is important to note that the above only applies to the mentioned remote libraries. Packages in any of the local libraries such as /ocpu/library/ are already installed on the server. When running your own OpenCPU server, it is preferable to install your package on the server in the usual ways and call it via the local library API. The remote libraries are mostly intended to allow anyone to share and use arbitrary packages on public OpenCPU servers.

Remotely use R packages on Github through OpenCPU

2013-10-01T00:00:00+00:00

Any R package on Github can be used remotely on OpenCPU through the /ocpu/github/ API. Users on the internet can browse code, objects, help pages, or call functions in the package without having to learn R or install it on their local machine. Thereby you can make your method, algorithm, plot or DPU more accessible outside the R community.

For example: last time we discussed how OpenMHealth uses the geodistance function to calculate the total distance along a set of lon/lat coordinates using the Haversine formula. The geodistance function is included in the dpu.mobility R package and available on the openmhealth github repository. By putting the dpu.mobility package on Github, all functionality in the package can now be accessed directly through the OpenCPU cloud server. Try opening some of the URLs below in your browser (play around with the URL to get a sense of the API). The package help pages are available under /man/ (in several formats):

The R functions and objects in the package are available under /R/:

Any R function in the package can be called remotely using HTTP POST. For example to calculate the distance from NY to LA and back with curl:

#POST function call as url-encoded
curl https://cloud.opencpu.org/ocpu/github/openmhealth/dpu.mobility/R/geodistance/json -d \
 'long=[-74.0064,-118.2430,-74.0064]&lat=[40.7142,34.0522,40.7142]'

#POST equivalent call using json
curl https://cloud.opencpu.org/ocpu/github/openmhealth/dpu.mobility/R/geodistance/json \
 -H "Content-Type: application/json" -d '{"long":[-74.0064,-118.2430,-74.0064],"lat":[40.7142,34.0522,40.7142]}'

We use curl for illustration in this example, but any browser or web client could do the same thing, allowing anyone to embed your algorithms or plots in systems and applications.
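To give an idea of the kind of computation that sits behind such an endpoint, below is a minimal base R sketch of a total distance along a path using the Haversine formula. This is not the dpu.mobility source: the function name haversine_total, the mean Earth radius of 6371 km and the choice of kilometers as the unit are all assumptions made for illustration.

```r
# Hypothetical re-implementation for illustration; NOT the dpu.mobility source.
# Sums the great-circle distance between consecutive lon/lat points.
haversine_total <- function(long, lat, radius_km = 6371) {
  stopifnot(length(long) == length(lat), length(long) >= 2)
  to_rad <- pi / 180
  phi <- lat * to_rad       # latitudes in radians
  lambda <- long * to_rad   # longitudes in radians
  n <- length(phi)
  dphi <- diff(phi)
  dlambda <- diff(lambda)
  # Haversine formula for each consecutive pair of points
  a <- sin(dphi / 2)^2 + cos(phi[-n]) * cos(phi[-1]) * sin(dlambda / 2)^2
  sum(2 * radius_km * asin(pmin(1, sqrt(a))))
}

# Same coordinates as in the curl example above
one_way <- haversine_total(c(-74.0064, -118.2430), c(40.7142, 34.0522))
round_trip <- haversine_total(c(-74.0064, -118.2430, -74.0064),
                              c(40.7142, 34.0522, 40.7142))

# Going there and back should be twice the one-way distance
all.equal(round_trip, 2 * one_way)
```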

Try it yourself!

For an R package to be used remotely on OpenCPU, it must be installable with install_github and the R package name must be identical to the repository name. I.e. if this works on your local machine:

library(devtools)
install_github("hadley/plyr")

Then the package will be available remotely through:

/ocpu/github/hadley/plyr/

Try to see if you can access your own packages! Some of the usual suspects:

HTTP POST calls a function in any of these packages straight from github:

#from ?llply
curl https://cloud.opencpu.org/ocpu/github/hadley/plyr/R/llply/json -d '.data=baseball&.fun=summary'

#simple plot
curl https://cloud.opencpu.org/ocpu/github/hadley/ggplot2/R/qplot -d 'x=[1,2,3,4,5]&y=[2,3,2,4,2]'

Publishing OpenCPU apps

An OpenCPU app is an R package which includes some web page(s) that call the R functions in the package using the OpenCPU API. Some public example apps are published on the OpenCPU Github Repo, but you can just as easily develop and publish apps by putting them on your own Github repository. For example: Scott Chamberlain has an (old) version of the gitstats app on his personal github repo at github.com/SChamberlain/gitstats. We can access this version of the app directly on the OpenCPU cloud server using the corresponding url: /ocpu/github/SChamberlain/gitstats/www/

Final note

One final note: in the current implementation of OpenCPU, packages from Github are installed no more than once every 24 hours. So your most recent Github commits might not show up immediately. The recommended workflow is to use the OpenCPU local single user server to develop your package/app. Once it works locally, push your package to Github to make it available on the OpenCPU cloud server.

Calling R functions through AJAX using opencpu.js

2013-09-21T00:00:00+00:00

The opencpu.js library builds on jQuery to call R functions through AJAX, straight from the browser. This makes it easy to embed R based computation or graphics in apps. Moreover, asynchronous requests (which are native in Javascript) make parallelization a natural part of the application. This post introduces some of the basic features of the library.

Getting started with opencpu.js

The readme page for opencpu.js has some brief documentation, but perhaps the easiest way to get started with opencpu.js is by example. The opencpu apps page lists a couple of example apps that you can play around with. The source code for each app is available from the opencpu github organization, and each app is based on opencpu.js. The appdemo app contains some pages with minimal examples illustrating the basic opencpu.js functionality. Like all OpenCPU apps, you can either use it on the public cloud server, or install for local use:

#install the appdemo app
library(devtools)
install_github("appdemo", "opencpu")

#load the app
library(opencpu)
opencpu$browse("/library/appdemo/www")

Hello World: calling a function

The hello.html page demonstrates how to call an R function that is included with the R package containing the app. In this example we call the R function named hello. Navigate to the hello.html page in your favorite browser and look at the html source code to see what is going on. The magic happens in these lines of javascript:

//read the value for 'myname'
var myname = $("#namefield").val();

//perform the request
var req = ocpu.rpc("hello", {
  myname : myname
}, function(output){
  $("#output").text(output.message);
});

The first line is basic jQuery syntax and reads the value from the page element with id namefield down in the html. In the next line we use ocpu.rpc to call the R function hello (included in the app package) and pass the value to the myname argument of the R function. The final argument is the callback handler: a function to (asynchronously) process the output once the request has returned from the server. In this case our callback handler writes the message value returned by our R function (output.message) to the html field with id output.

The above is all that is needed to call R from Javascript in the browser. The remaining lines from this example:

//if R returns an error, alert the error message
req.fail(function(){
  alert("Server error: " + req.responseText);
});

//after request complete, re-enable the button 
req.always(function(){
  $("#submitbutton").removeAttr("disabled")
});

Web developers will immediately recognize this pattern: all functions in the opencpu.js library wrap around the jQuery $.ajax method and return the jqXHR object. Thereby you (the programmer) have full control over the request using all methods and properties from jQuery.ajax. So you can register additional handlers to deal with errors or to add additional behavior after the request has completed (in the example to re-enable a button).

Making a plot

The opencpu.js library also makes it easy to embed your R plots in a website. The plot.html page illustrates this with a very simple example. Again, look at the source of the HTML page:

//create the plot area on the plotdiv element
var req = $("#plotdiv").rplot("randomplot", {
  n : nfield,
  dist : distfield
})

The syntax is slightly different from calling a function as before: the plotting widget is implemented as a jQuery plugin and hence called on a dom element, usually an empty <div>. In this case we call the R function randomplot (included with the appdemo package) and pass arguments n and dist. Once completed, a png image of the plot is displayed in #plotdiv, along with links to pdf and svg versions.

Real world examples of apps using $.rplot are nabel, gitstats and stocks.

Uploading a File

In many statistical applications the user needs to provide some data, often in the form of a file. When using opencpu.js, calling an R function with a file works exactly the same as calling it with any other value. Look at the source code for upload.html to see this in action.

//arguments
var myheader = $("#header").val() == "true";
var myfile = $("#csvfile")[0].files[0];

//perform the request
var req = ocpu.rpc("readcsvnew", {
  file : myfile,
  header : myheader
}, function(session){
  alert("success:\n" + location.protocol + "//" + location.host + session.getLoc())  
});

Basically for any <input type="file"> HTML element we can pass the file to an R function using $("#id")[0].files[0] (note this requires HTML5 support). OpenCPU will then copy this file to the working directory of the R process and use the filename as the parameter value. The next section shows how we would actually use this object.

Simulating state by chaining function calls

Thus far all examples contained a single R function call and we would either grab the output or some plot to display on the page. However in practice your application might involve several steps: the user uploads some data, specifies variables, fits a model on the data, etc.

The OpenCPU API is stateless. Clients do not have a private R process and each call to the server is independent of the previous one. Instead, the way you can introduce state is by chaining function calls: the OpenCPU server stores the return object from a function call, and you can pass a reference to such an object as an argument to subsequent function calls. This might sound cumbersome at first, but it results in well organized, scalable applications and makes asynchronous parallel requests a native feature of your application.

A simple example of this concept which builds on the previous example is illustrated in chain.html. Because this example is a bit larger, the javascript code was placed in a separate file called chain.js. The example starts with:

//perform the request
var req = ocpu.call("readcsvnew", {
  file : file,
  header : header
}, function(session){
  //on success call printsummary()
  printsummary(session);
});

This looks very similar to before: ocpu.call is used to call the R function readcsvnew. However, this time the callback function calls another function, passing on the reference to the object returned by readcsvnew (which we called session in this example). The printsummary javascript function then uses this object for the argument mydata when calling the R function printsummary:

function printsummary(mydata){
  //perform the request
  var req = ocpu.call("printsummary", {
    mydata : mydata
  }, function(session){
    var url = session.getLoc() +  "console/text";
    downloadfile(url);
  }).fail(function(){
    alert("Server error: " + req.responseText);
  });        
}

This illustrates the concept of function chaining. We can keep going on and keep calling new functions and pass output from previous function calls as the argument. To see a real world example of this, try the mapapp OpenCPU app.
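The chaining idea can also be modeled without a server. The sketch below is a toy in-memory model written for illustration only: none of it is OpenCPU code, and the function names (readcsv, colmeans, call) and the session key format are made up. It shows how each call stays independent while a reference to a stored result is passed along as an argument.

```javascript
// Toy in-memory model of a stateless server that stores return values.
// Nothing here is OpenCPU code; names and key format are invented.
var store = {};      // session key -> stored return value
var counter = 0;

// Functions the toy "server" knows about
var functions = {
  readcsv: function (args) {
    // parse "a,b\n1,2\n3,4" into an array of row objects
    var lines = args.text.trim().split("\n");
    var names = lines[0].split(",");
    return lines.slice(1).map(function (line) {
      var row = {};
      line.split(",").forEach(function (value, i) { row[names[i]] = Number(value); });
      return row;
    });
  },
  colmeans: function (args) {
    var rows = args.mydata;
    var out = {};
    Object.keys(rows[0]).forEach(function (name) {
      var total = rows.reduce(function (sum, row) { return sum + row[name]; }, 0);
      out[name] = total / rows.length;
    });
    return out;
  }
};

// Each call is independent: resolve session references, run, store the result
function call(name, args) {
  var resolved = {};
  Object.keys(args).forEach(function (key) {
    var value = args[key];
    resolved[key] = (value && value.sessionKey) ? store[value.sessionKey] : value;
  });
  var key = "s" + (++counter);
  store[key] = functions[name](resolved);
  return { sessionKey: key };
}

var first = call("readcsv", { text: "a,b\n1,2\n3,4" });
var second = call("colmeans", { mydata: first });   // pass the reference, not the data
console.log(store[second.sessionKey]);
```

The client never holds the parsed data itself, only the key returned by the first call. That is why independent calls can run in parallel without sharing a process.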

Implementing a DPU with OpenCPU

2013-09-11T00:00:00+00:00

One of the prime use cases in the design of OpenCPU has been the “Data Processing Unit”, for short: DPU. A DPU is a modular, stateless data I/O unit which is called remotely by other software. In the OpenMHealth architecture a DPU must use JSON for data input and output, and is called over HTTPS. Below are two simple examples.

Basic example

Suppose your software needs to calculate a correlation between two vectors. In R we would use the cor function from the stats package to do this:

> cor(x=c(1,2,3,4,5), y=c(3,1,5,2,2));
[1] -0.1042572

Using OpenCPU we can perform the same function call remotely just as easily:

curl https://cloud.opencpu.org/ocpu/library/stats/R/cor/json -d 'x=[1,2,3,4,5]&y=[3,1,5,2,2]'
[
	-0.10426
]

We can go full JSON by specifying the request Content-type to be application/json. This is exactly the same request and will yield the same output.

curl https://cloud.opencpu.org/ocpu/library/stats/R/cor/json -H "Content-Type: application/json" -d '{"x":[1,2,3,4,5],"y":[3,1,5,2,2]}'

Note that curl is used here for illustration only; your actual application could use whatever HTTP client library is available for the programming language at hand.
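As a sanity check on the number returned above, the Pearson correlation can be computed by hand from its definition. This is a plain Javascript sketch for comparison, not how R's cor function is implemented.

```javascript
// Pearson correlation computed directly from its definition
function pearson(x, y) {
  var n = x.length;
  var meanX = x.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = y.reduce(function (a, b) { return a + b; }, 0) / n;
  var sxy = 0, sxx = 0, syy = 0;
  for (var i = 0; i < n; i++) {
    var dx = x[i] - meanX;
    var dy = y[i] - meanY;
    sxy += dx * dy;   // co-deviation of x and y
    sxx += dx * dx;   // squared deviation of x
    syy += dy * dy;   // squared deviation of y
  }
  return sxy / Math.sqrt(sxx * syy);
}

var r = pearson([1, 2, 3, 4, 5], [3, 1, 5, 2, 2]);
console.log(r.toFixed(7));   // -0.1042572
```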

Another example

One real application for OpenMHealth required calculation of the total distance between a set of longitude-latitude coordinates as recorded by a mobile device. Wikipedia tells us that the distance between two points on a sphere is calculated from their longitudes and latitudes using the Haversine formula. The geosphere package has a function distHaversine (help), (source) which implements this equation.

So we created the dpu.mobility package with a function geodistance (help), (source) that iterates over a set of locations to calculate the total distance among all points. We also added an option to smooth away outliers (caused by noisy GPS signal). Now to calculate the distance from LA to NYC and back:

curl https://cloud.opencpu.org/ocpu/library/dpu.mobility/R/geodistance/json -H "Content-Type: application/json" -d '{"long":[-74.0064,-118.2430,-74.0064],"lat":[40.7142,34.0522,40.7142]}'

Or in miles:

curl https://cloud.opencpu.org/ocpu/library/dpu.mobility/R/geodistance/json -H "Content-Type: application/json" -d '{"long":[-74.0064,-118.2430,-74.0064],"lat":[40.7142,34.0522,40.7142],"unit":"miles"}'
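For readers who want to see the calculation itself, here is a minimal Javascript sketch of the Haversine formula summed over consecutive points. It is not the dpu.mobility or geosphere code: the function names are hypothetical, the Earth radius is an assumed constant, and no outlier smoothing is applied, so the result may differ somewhat from what geodistance returns.

```javascript
// Great-circle distance between two points using the Haversine formula.
// The Earth radius (in meters) is an assumption made for this sketch.
var EARTH_RADIUS_M = 6378137;

function toRadians(deg) {
  return deg * Math.PI / 180;
}

function haversine(lon1, lat1, lon2, lat2) {
  var dLat = toRadians(lat2 - lat1);
  var dLon = toRadians(lon2 - lon1);
  var a = Math.pow(Math.sin(dLat / 2), 2) +
          Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) *
          Math.pow(Math.sin(dLon / 2), 2);
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Sum the distances between consecutive points (no outlier smoothing)
function totalDistance(long, lat) {
  var total = 0;
  for (var i = 1; i < long.length; i++) {
    total += haversine(long[i - 1], lat[i - 1], long[i], lat[i]);
  }
  return total;
}

var meters = totalDistance([-74.0064, -118.2430, -74.0064],
                           [40.7142, 34.0522, 40.7142]);
console.log((meters / 1000).toFixed(0) + " km");
```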

When to use a DPU

So how is this useful? Suppose you are building a system or application and would like to embed some statistical functionality. One solution is to implement the required statistical methods yourself in the language at hand. However, for complex methods this is time consuming and your code might not be as reliable as what is available in R. Another solution is to call R directly from the application language, using a bridge like RInside, JRI, rpy2, etc. This might work, but managing R sessions, error handling, security, data I/O, etc. can be painful. And if the R session crashes, so does your application. Furthermore this means that each installation of the application must have a local copy of R and all required packages installed, which quickly becomes a maintenance nightmare.

If what you are doing fits the DPU paradigm, this might make a more elegant design. Most programming languages these days know their way around http(s) and JSON. Implement your statistical methods simply as an R function and have OpenCPU deal with management of sessions, security, JSON, etc. A single OpenCPU cloud server serves all your application instances/users, which is cheap and easier to maintain. The cloud server might considerably improve performance by caching requests, and if your application becomes popular and you need to scale up to serve many simultaneous requests per second, you just install an HTTP load balancer with multiple back-end servers. No need to change any code :-)

Knitr/Markdown OpenCPU App

2013-08-30T00:00:00+00:00

A new little OpenCPU app allows you to knit and markdown in the browser. It has a fancy pants code editor which automatically updates the output after 3 seconds of inactivity. It uses the Ace web editor with mode-r.js (thanks to RStudio for making the latter available).

Like all OpenCPU apps, the source package lives in the opencpu app repo on github. You can try it out on the public cloud server, or run it locally:

#install the package
library(devtools)
install_github("markdownapp", "opencpu")

#open it in opencpu
library(opencpu)
opencpu$browse("/library/markdownapp/www")

The app uses the knitr R package and some standard javascript libraries. What remains is a few lines of javascript to call OpenCPU when the editor is inactive. The entire app was created in about an hour. Feel free to fork and modify :-)

OpenCPU 1.0 release!

2013-08-27T00:00:00+00:00

After more than 3 years of development, we release the first official version of the OpenCPU system. Based on feedback and experiences from the beta series, OpenCPU version 1.0 has been rewritten entirely from scratch. The result is a simple and flexible API that is easier to understand yet more powerful than before.

With the new release also comes a new website and blog in which we will post tutorials and examples over the upcoming weeks/months. This first post features some highlights to get your feet wet.

The package API

Try opening these URLs in your browser to explore objects and manuals (help pages) from a package:

https://cloud.opencpu.org/ocpu/library/
https://cloud.opencpu.org/ocpu/library/ggplot2/
https://cloud.opencpu.org/ocpu/library/ggplot2/R/    
https://cloud.opencpu.org/ocpu/library/ggplot2/R/diamonds
https://cloud.opencpu.org/ocpu/library/ggplot2/R/mpg/json    
https://cloud.opencpu.org/ocpu/library/ggplot2/R/mpg/csv
https://cloud.opencpu.org/ocpu/library/ggplot2/R/mpg/rda
https://cloud.opencpu.org/ocpu/library/ggplot2/R/qplot
https://cloud.opencpu.org/ocpu/library/ggplot2/man/
https://cloud.opencpu.org/ocpu/library/ggplot2/man/qplot/text
https://cloud.opencpu.org/ocpu/library/ggplot2/man/qplot/html
https://cloud.opencpu.org/ocpu/library/ggplot2/man/qplot/pdf

Or interface static files:

https://cloud.opencpu.org/ocpu/library/MASS/DESCRIPTION
https://cloud.opencpu.org/ocpu/library/MASS/NEWS
https://cloud.opencpu.org/ocpu/library/MASS/scripts/
https://cloud.opencpu.org/ocpu/library/MASS/scripts/ch01.R

External Repositories

The /ocpu/library/ API interfaces to packages which are installed in the global library on the server. Want to try another package? With a little extra patience, you can open any package straight from cran or github:

https://cloud.opencpu.org/ocpu/cran/JJcorr/
https://cloud.opencpu.org/ocpu/github/hadley/plyr/

Of course this will only work if the package installation is successful. When a package on an external repository is accessed for the first time, the request might take quite a while because it is installed on the fly. But once it is working, you can use it just like packages installed on the server.

https://cloud.opencpu.org/ocpu/github/jeroenooms/gitstats/www/

This way you can share your own packages and apps without hosting a personal OpenCPU cloud server.
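The URL layout shown in the examples above is regular enough to take apart programmatically. The following Javascript sketch splits a path into its components; parseOcpuPath is a hypothetical helper written for this post and only covers the library, cran and github layouts listed here.

```javascript
// Hypothetical helper: split an OpenCPU package path into its parts.
// Only covers the library, cran and github layouts shown in this post.
function parseOcpuPath(path) {
  var parts = path.split("/").filter(function (p) { return p.length > 0; });
  if (parts[0] !== "ocpu") {
    throw new Error("not an OpenCPU path: " + path);
  }
  var source = parts[1];
  var rest = parts.slice(2);
  var result = { source: source };
  if (source === "github") {
    result.user = rest.shift();     // github paths carry an extra user segment
  } else if (source !== "library" && source !== "cran") {
    throw new Error("unknown package source: " + source);
  }
  result.pkg = rest.shift();        // package name
  result.section = rest.shift();    // e.g. "R", "man", "www" or a static file
  result.name = rest.shift();       // object, help page or file name
  result.format = rest.shift();     // e.g. "json", "csv", "html"
  return result;
}

console.log(parseOcpuPath("/ocpu/library/ggplot2/R/mpg/json"));
console.log(parseOcpuPath("/ocpu/github/hadley/plyr/"));
```

Components that are absent from a path, such as the format in the second call, are simply left undefined.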

Running a function / script

The core feature of OpenCPU is the ability to call functions and run scripts (including sweave/knitr scripts). To get started, you can use the testing page to poke around in the API. Alternatively use curl to call OpenCPU from your command line:

#run a script
curl -X POST https://cloud.opencpu.org/ocpu/library/MASS/scripts/ch01.R

#call a function
curl https://cloud.opencpu.org/ocpu/library/stats/R/rnorm -d 'n=10&mean=5'

A successful POST will always return an HTTP 201 response indicating the location from which to retrieve the results of the execution (objects, graphics, files, stdout, etc).
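A client typically turns such a response into a list of URLs to fetch. The Javascript sketch below shows one way to do that, under the assumption that the response carries a Location header and a body listing one output path per line. The response object, session key and paths are invented sample values, and resultUrls is a hypothetical helper.

```javascript
// Invented sample response from a POST; the session key and paths are made up
var response = {
  status: 201,
  headers: { location: "https://cloud.opencpu.org/ocpu/tmp/x0123456789/" },
  body: "/ocpu/tmp/x0123456789/R/.val\n" +
        "/ocpu/tmp/x0123456789/stdout\n" +
        "/ocpu/tmp/x0123456789/graphics/1\n"
};

// Hypothetical helper: turn the response into absolute URLs for the results
function resultUrls(response) {
  if (response.status !== 201) {
    throw new Error("call failed with HTTP " + response.status);
  }
  var origin = new URL(response.headers.location).origin;
  return response.body
    .split("\n")
    .filter(function (line) { return line.length > 0; })
    .map(function (path) { return origin + path; });
}

resultUrls(response).forEach(function (url) {
  console.log(url);
});
```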

OpenCPU Apps

One of the major improvements in OpenCPU 1.0 is improved support for apps. An OpenCPU app is an R package which includes some web page(s) that call the R functions in the package using the OpenCPU API. This makes a convenient way to develop, package and ship standalone R web applications. Have a look at the example apps.

The single-user server

OpenCPU 1.0 is available both as a cloud server and as a single-user server. The latter will run inside an interactive R session and is used to run and develop local apps.

install.packages("opencpu")
library(opencpu)

After installing OpenCPU, we install apps just like we would install a package:

library(devtools)

#gitstats app
install_github("gitstats", "opencpu")
opencpu$browse("/library/gitstats/www")

#stocks app
install_github("stocks", "opencpu")
opencpu$browse("/library/stocks/www")

#nabel app
install_github("nabel", "opencpu")
opencpu$browse("/library/nabel/www")