Jeroen Ooms

Post doc hacker at UC Berkeley for rOpenSci

Deploying a scoring engine for predictive analytics with OpenCPU

June 23, 2014

TLDR/abstract: See the tvscore demo app or this jsfiddle for all of this in action.

This post explains how to use the OpenCPU system to set up a scoring engine for calculating real-time predictions. In our example we use the predict.gam function from the mgcv package to make predictions based on a generalized additive model. The entire process consists of four steps:

Build a model
Create an R package containing the model and a scoring function
Install the package on your OpenCPU server
Remotely call the scoring function through the OpenCPU API

Let’s get started!

Step 1: creating a model

For this example, we use data from the General Social Survey, which is a very rich dataset on demographic characteristics and attitudes of United States residents. To load the data in R:

#Data info: http://www3.norc.org/GSS+Website/Download/SPSS+Format/
download.file("http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/2012_spss.zip", destfile="2012_spss.zip")
unzip("2012_spss.zip")
GSS <- foreign::read.spss("GSS2012.sav", to.data.frame=TRUE)

The GSS data has 1974 rows for 816 variables. To keep our example simple, we create a model with only 2 predictor variables. The code below fits a GAM which predicts the average number of hours per day that a person watches TV, based on their age and marital status. In these data tvhours and age are numeric variables, whereas marital is a categorical (factor) variable with levels MARRIED, SEPARATED, DIVORCED, WIDOWED and NEVER MARRIED.

#Variable info: http://www3.norc.org/GSS+Website/Browse+GSS+Variables/Mnemonic+Index/
library(mgcv)
mydata <- na.omit(GSS[c("age", "tvhours", "marital")])
tv_model <- gam(tvhours ~ s(age, by=marital), data = mydata)

The predict function is used to score data against the model. We test with some random cases:

newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
)

predict(tv_model, newdata = newdata)
       1        2        3        4 
3.022650 3.693640 1.556342 3.665077

All seems good; this completes step 1. But just to get a sense of what our example model actually looks like before we start scoring, here is a simple viz:

library(ggplot2)
qplot(age, predict(tv_model), color=marital, geom="line", data=mydata) +
  ggtitle("gam(tvhours ~ s(age, by=marital))") +
  ylab("Average hours of TV per day")

Seems like people who get married start watching less TV, who would have thought :-) In a real study we should probably tune the smoothing a bit and add parenting as a predictor (also in the data), but for simplicity we’ll stick with this model for now.
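As a rough illustration of one such refinement, the sketch below fits the same kind of model on simulated stand-in data (the values are made up, not GSS data) and adds marital as a parametric term. The mgcv documentation notes that smooths with a factor by variable are centered, so the factor is usually included as a main effect as well. The names sim, sim_model and newcases are hypothetical. The sketch also asks predict for standard errors, which a scoring engine may want to return alongside the point estimates.

```r
library(mgcv)

# Simulated stand-in for the GSS variables (hypothetical values, not survey data)
set.seed(1)
n <- 400
sim <- data.frame(
  age = runif(n, 18, 89),
  marital = factor(sample(c("MARRIED", "DIVORCED", "NEVER MARRIED"), n, replace = TRUE))
)
sim$tvhours <- 2 + 0.02 * sim$age + (sim$marital == "DIVORCED") + rnorm(n)

# Same structure as the model above, plus a main effect for marital
sim_model <- gam(tvhours ~ marital + s(age, by = marital), data = sim)

# Score new cases and request standard errors
newcases <- data.frame(
  age = c(30, 60),
  marital = factor(c("MARRIED", "DIVORCED"), levels = levels(sim$marital))
)
scores <- predict(sim_model, newdata = newcases, se.fit = TRUE)
data.frame(newcases, fit = as.numeric(scores$fit), se = as.numeric(scores$se.fit))
```

Whether the extra term matters for the real GSS data is something to check by comparing the two fits; the sketch only shows the mechanics.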

Step 2: creating a package

In order to score cases via the OpenCPU API, we need to turn the model into an R package. Making R packages is very easy these days, especially when using RStudio. Our package needs to contain at least two things: the tv_model object that we created above, and a wrapper function that calls out to predict(tv_model, ...). You can make the wrapper as simple or sophisticated as you like, based on the type of input and output data that you want to send/receive from your scoring engine.

The tvscore package, available from the OpenCPU GitHub repository, is an example of such a package. The important thing to note is that the tv_model object is included in the data directory of the package. Saving objects to a file is done using the save function in R:

#Store the model as a data object
save(tv_model, file="data/tv_model.rda")

To load the model with the package, we can either set LazyData: true in the package DESCRIPTION file, or manually load it using the data() function in R. For details on including data in R packages, see section 1.1.6 of Writing R Extensions.
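To make the mechanics concrete, here is a minimal sketch of the save and load round trip outside of a package. It uses a small lm fit on the built-in cars data as a stand-in for the real model; the object name tv_model_demo and the temporary file location are assumptions for illustration only.

```r
# Stand-in for a fitted model (hypothetical; the real package stores tv_model)
tv_model_demo <- lm(dist ~ speed, data = cars)

# Save the object to an .rda file, as one would into the data/ directory
rda_file <- file.path(tempdir(), "tv_model_demo.rda")
save(tv_model_demo, file = rda_file)

# Restore it into a fresh environment; load() returns the restored object names
env <- new.env()
restored <- load(rda_file, envir = env)
restored

# The restored model has the same coefficients as the original
identical(coef(env$tv_model_demo), coef(tv_model_demo))
```

An .rda file restores objects under the names they were saved with, which is why the scoring function can refer to tv_model directly once the data is loaded.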

Finally, the package contains a scoring function called tv, which calls out to predict.gam. The scoring function is what clients will call remotely through the OpenCPU API. We use a smart function that supports both data frames and CSV files as input:

tv <- function(input){
  #input can be either a path to a CSV file or a data frame
  newdata <- if(is.character(input) && file.exists(input)){
    read.csv(input)
  } else {
    as.data.frame(input)
  }
  stopifnot("age" %in% names(newdata))
  stopifnot("marital" %in% names(newdata))
  
  newdata$age <- as.numeric(newdata$age)

  #tv_model is included with the package
  newdata$tv <- predict.gam(tv_model, newdata = newdata)
  return(newdata)
}

Note how the function does a bit of input validation by checking that the age and marital columns are present. As usual, the tv function is saved in the R directory of the source package. Install the package locally to verify that it works as expected in a clean R session. To install our example package from GitHub, restart R and do:

#install the tv score package
library(devtools)
install_github("opencpu/tvscore")
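The tv function shown earlier only checks that the two columns exist. For a public scoring engine it may be worth rejecting bad values early with a clear error message. The helper below is a hypothetical sketch, not part of the tvscore package: the name validate_tv_input and its marital_levels argument are assumptions, and the default levels are the five listed in step 1.

```r
# Hypothetical stricter validation for scoring input (not part of tvscore)
validate_tv_input <- function(newdata,
    marital_levels = c("MARRIED", "SEPARATED", "DIVORCED", "WIDOWED", "NEVER MARRIED")) {
  newdata <- as.data.frame(newdata)

  # Report all missing columns at once
  missing_cols <- setdiff(c("age", "marital"), names(newdata))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }

  # Coerce age to numeric and refuse values that cannot be parsed
  age <- suppressWarnings(as.numeric(as.character(newdata$age)))
  if (anyNA(age)) {
    stop("age must be numeric and non-missing")
  }

  # Refuse marital values that the model has never seen
  unknown <- setdiff(unique(as.character(newdata$marital)), marital_levels)
  if (length(unknown) > 0) {
    stop("Unknown marital value(s): ", paste(unknown, collapse = ", "))
  }

  newdata$age <- age
  newdata$marital <- factor(as.character(newdata$marital), levels = marital_levels)
  newdata
}

# Valid input passes through with cleaned column types
validate_tv_input(data.frame(age = c("24", "54"), marital = c("MARRIED", "DIVORCED")))
```

Errors raised with stop() are reported back to the HTTP client by OpenCPU, so a descriptive message here is what the caller will see.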

First we test the tv function with data frame input:

#test scoring with data frame input
library(tvscore)
newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
)
tv(input = newdata)

And then we test if it works for CSV data:

#test scoring with CSV file input
setwd(tempdir())
write.csv(newdata, "testdata.csv", row.names = FALSE)
library(tvscore)
tv(input = "testdata.csv")
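The same dual-input pattern can be tried without installing tvscore. The sketch below is a stand-in under stated assumptions: it uses an lm fit on the built-in cars data instead of the GAM, and the names demo_model, score and dist_hat are hypothetical. It checks that scoring a data frame and scoring the equivalent CSV file give the same predictions.

```r
# Stand-in model (hypothetical; tvscore uses the tv_model GAM instead)
demo_model <- lm(dist ~ speed, data = cars)

score <- function(input) {
  # input can be either a path to a CSV file or a data frame
  newdata <- if (is.character(input) && length(input) == 1 && file.exists(input)) {
    read.csv(input)
  } else {
    as.data.frame(input)
  }
  stopifnot("speed" %in% names(newdata))
  newdata$speed <- as.numeric(newdata$speed)
  newdata$dist_hat <- as.numeric(predict(demo_model, newdata = newdata))
  newdata
}

# Score the same cases through both input routes
cases <- data.frame(speed = c(10, 15, 20))
csv_path <- file.path(tempdir(), "cases.csv")
write.csv(cases, csv_path, row.names = FALSE)

from_df  <- score(cases)
from_csv <- score(csv_path)
all.equal(from_df$dist_hat, from_csv$dist_hat)
```

Writing the CSV without row names keeps the file limited to the predictor columns, so the scored output does not pick up an extra index column.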

If all of this works as expected, the package is ready to be deployed on your OpenCPU server!

Step 3: Install the package on the server

To deploy your scoring engine, simply install the package on your OpenCPU server. If you are running the OpenCPU cloud server, make sure to install your package as root. For example, if you built the package into a tar.gz archive:

sudo -i
R CMD INSTALL tvscore_0.1.tar.gz

To install our example package straight from R, either on an OpenCPU cloud server or OpenCPU single-user server:

#install the tv score package
library(devtools)
install_github("opencpu/tvscore")

If you are running the cloud server, you are done with this step. If you are running the single-user server, start OpenCPU using:

library(opencpu)
opencpu$browse()

To verify that the installation succeeded, open your browser and navigate to the /ocpu/library/tvscore path on the OpenCPU server. Also have a look at /ocpu/library/tvscore/R/tv and /ocpu/library/tvscore/man/tv.

Step 4: Scoring through the API

Once the package is installed on the server, we can remotely call the tv function via the OpenCPU API. In the examples below we use the public demo server: https://cloud.opencpu.org/. For example, to call the tv function with curl using basic JSON RPC:

curl https://cloud.opencpu.org/ocpu/library/tvscore/R/tv/json \
 -H "Content-Type: application/json" \
 -d '{"input" : [ {"age":26, "marital" : "MARRIED"}, {"age":41, "marital":"DIVORCED"}, {"age":53, "marital":"NEVER MARRIED"} ]}'

Note how the OpenCPU server automatically converts input and output data from/to JSON using jsonlite. See the API docs for more details on this process. Alternatively, we can batch score by posting a CSV file (example data):

curl https://cloud.opencpu.org/ocpu/library/tvscore/R/tv -F "input=@testdata.csv"

The response to a successful HTTP POST request contains the location of the output data in the Location header. For example, if the call returned an HTTP 201 with Location header /ocpu/tmp/x036bf30e82, the client can retrieve the output data in various formats using a subsequent HTTP GET request:

curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/csv
curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/json
curl https://cloud.opencpu.org/ocpu/tmp/x036bf30e82/R/.val/tab
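A client typically builds these retrieval URLs from the Location header it received. The helper below is a small sketch of that step in base R; the function name output_urls is hypothetical, and it assumes the header is either a path like the one above or a full URL.

```r
# Hypothetical helper: build output URLs from a Location header value
output_urls <- function(server, location, formats = c("csv", "json", "tab")) {
  base <- if (grepl("^https?://", location)) {
    # The header already holds a full URL
    sub("/+$", "", location)
  } else {
    # The header holds a path relative to the server root
    paste0(sub("/+$", "", server), "/", gsub("^/+|/+$", "", location))
  }
  urls <- paste0(base, "/R/.val/", formats)
  names(urls) <- formats
  urls
}

urls <- output_urls("https://cloud.opencpu.org", "/ocpu/tmp/x036bf30e82")
urls[["json"]]
```

The session key in the path differs for every request, so the value must always be taken from the response rather than hard-coded.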

This completes our scoring engine. Using these steps, clients from any language can remotely score cases by calling the tv function using standard HTTP and JSON libraries.
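As an illustration of such a client, the sketch below builds the same JSON payload as the curl example from an R data frame using jsonlite, and wraps the HTTP call in a function based on the httr package. The function name score_remote is hypothetical, the use of httr is an assumption (any HTTP client would do), and the call only works if the demo server is reachable and still hosts the tvscore package, so the function is defined here but not executed.

```r
library(jsonlite)

cases <- data.frame(
  age = c(26, 41, 53),
  marital = c("MARRIED", "DIVORCED", "NEVER MARRIED"),
  stringsAsFactors = FALSE
)

# Wrap the data frame in a named list so it binds to the `input` argument of tv
payload <- toJSON(list(input = cases))
payload

# Hypothetical client function: POST the payload and parse the JSON response
score_remote <- function(payload,
    url = "https://cloud.opencpu.org/ocpu/library/tvscore/R/tv/json") {
  resp <- httr::POST(url, body = as.character(payload), httr::content_type_json())
  httr::stop_for_status(resp)
  fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Calling score_remote(payload) should then return the scored cases as a data frame, given that the /json endpoint responds with the function's return value encoded as JSON.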

Extra credit: performance optimization

When using a scoring engine based on OpenCPU in production, it is worthwhile to configure your server for performance. In particular, we can add our package to the preload field in the /etc/opencpu/server.conf file on the OpenCPU cloud server. This will automatically load (but not attach) the package when the OpenCPU server starts, which eliminates package loading time from the individual scoring requests. In our example this is important because tvscore depends on the mgcv package, which takes about 2 seconds to load.

Note that R does not load LazyData objects when the package loads. Hence, preload in combination with lazy loading of data might not have the desired effect. When using preload, make sure to design your package such that all data gets loaded when the package loads (example).
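One possible way to do this is an .onLoad hook that reads the data set into the package namespace as soon as the package is loaded. The sketch below assumes the model is stored as data/tv_model.rda and that LazyData is not enabled for the package; it is an illustration of the idea rather than a definitive implementation. The file would live in the R directory of the source package.

```r
# Sketch of an .onLoad hook that loads the model when the package loads.
# Assumes the package ships data/tv_model.rda and does not set LazyData: true.
.onLoad <- function(libname, pkgname) {
  # parent.env(environment()) is the package namespace when R calls this hook
  utils::data(list = "tv_model", package = pkgname,
              envir = parent.env(environment()))
}
```

With a hook like this, preloading the package on the server also brings the model into memory, so the first scoring request does not pay for reading it from disk.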

Finally, in production you might want to tweak the timelimit.post (timeout), rlimit.as (memory limit), rlimit.fsize (disk limit) and rlimit.nproc (parallel process limit) options in /etc/opencpu/server.conf to fit your needs. Also see the server manual on this topic.

Bonus: creating an OpenCPU app

By including web pages in the /inst/www/ directory of the source package, we can turn our scoring engine into a standalone web application. The tvscore example package contains a simple web interface that makes use of the opencpu.js JavaScript client to interact with R via OpenCPU in the browser. Navigate to /ocpu/library/tvscore/www/ on the public demo server to see it in action!

To install and run the same app in your local R session, use:

#Install the app
library(devtools)
install_github("opencpu/tvscore")

#Load the app
library(opencpu)
opencpu$browse("/library/tvscore/www")

We can also call the OpenCPU server from an external website using cross-domain AJAX requests (CORS). See this jsfiddle for a simple example that calls the public server using the ocpu.rpc function from opencpu.js.