Deploying a scoring engine for predictive analytics with OpenCPU
June 23, 2014
TLDR/abstract: See the tvscore demo app or this jsfiddle for all of this in action.
This post explains how to use the OpenCPU system to set up a scoring engine for calculating real-time predictions. In our example we use the predict.gam function from the mgcv package to make predictions based on a generalized additive model. The entire process consists of four steps:
- Build a model
- Create an R package containing the model and a scoring function
- Install the package on your OpenCPU server
- Remotely call the scoring function through the OpenCPU API
Let’s get started!
Step 1: creating a model
For this example, we use data from the General Social Survey, which is a very rich dataset on demographic characteristics and attitudes of United States residents. To load the data in R:
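As a rough sketch of the loading step, assuming the survey data was downloaded as an SPSS file (the file name `GSS2012.sav` and the helper `load_gss` are hypothetical, not taken from this post), the `foreign` package that ships with R could be used like this:

```r
# Hypothetical helper: read an SPSS file into a data frame.
load_gss <- function(path) {
  # to.data.frame = TRUE returns a data frame instead of a list;
  # labelled SPSS variables become factors by default.
  foreign::read.spss(path, to.data.frame = TRUE)
}

# Usage, assuming the file was downloaded and unzipped locally
# ("GSS2012.sav" is an assumed file name):
# GSS <- load_gss("GSS2012.sav")
# dim(GSS)
```

If the data comes in another format, only the body of the helper needs to change; the rest of the steps just need a data frame with tvhours, age and marital columns.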
The GSS data has 1974 rows for 816 variables. To keep our example simple, we create a model with only 2 predictor variables. The code below fits a GAM which predicts the average number of hours per day that a person watches TV, based on their age and marital status. In these data tvhours and age are numeric variables, whereas marital is a categorical (factor) variable with levels MARRIED, SEPARATED, DIVORCED, WIDOWED and NEVER MARRIED.
The predict function is used to score data against the model. We test with some random cases:
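The fit and the test scoring could look roughly like the sketch below. To keep it runnable without the download, it uses a synthetic stand-in with the same three columns (this is not real survey data), and the model formula is one plausible choice rather than necessarily the exact one behind this post:

```r
library(mgcv)

# Synthetic stand-in for the GSS columns used here (NOT real survey data).
set.seed(1)
n <- 500
marital_levels <- c("MARRIED", "SEPARATED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
mydata <- data.frame(
  age = sample(18:89, n, replace = TRUE),
  marital = factor(sample(marital_levels, n, replace = TRUE), levels = marital_levels)
)
mydata$tvhours <- pmax(0, 3 + 0.02 * (mydata$age - 45) + rnorm(n))

# One smooth of age per marital status, plus a main effect for marital.
tv_model <- gam(tvhours ~ marital + s(age, by = marital), data = mydata)

# Score a few made-up cases against the model.
newdata <- data.frame(
  age = c(24, 54, 32, 75),
  marital = factor(c("MARRIED", "DIVORCED", "WIDOWED", "NEVER MARRIED"),
                   levels = marital_levels)
)
scores <- predict(tv_model, newdata = newdata)
print(scores)
```

With the real data, `mydata` would be the three relevant GSS columns instead; rows with missing values are dropped by the default `na.action` when fitting.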
All seems good, this completes step 1. But just to get a sense of what our example model actually looks like before we start scoring, a simple viz:
Seems like people that get married start watching less TV, who would have thought :-) In a real study we should probably tune the smoothing a bit and add parenting as predictor (also in the data), but for simplicity we’ll stick with this model for now.
Step 2: creating a package
In order to score cases via the OpenCPU API, we need to turn the model into an R package. Making R packages is very easy these days, especially when using RStudio. Our package needs to contain at least two things: the tv_model object that we created above, and a wrapper function that calls out to predict(tv_model, ...). You can make the wrapper as simple or sophisticated as you like, based on the type of input and output data that you want to send/receive from your scoring engine.
The tvscore package that is available from the opencpu github repository is an example of such a package. The important thing to note is that the tv_model object is included in the data directory of the package. Saving objects to a file is done using the save function in R:
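A minimal sketch of the save step follows. The package directory is a temporary folder and the model is a stand-in fitted on the built-in `cars` data, both assumptions made only so the snippet runs on its own; in practice you would save the tv_model from step 1 into the data directory of your package source:

```r
library(mgcv)

# Stand-in object so the sketch runs on its own; in practice this would be
# the tv_model fitted in step 1.
tv_model <- gam(dist ~ s(speed), data = cars)

# Hypothetical package source directory (a temporary folder here).
pkg_dir <- file.path(tempdir(), "tvscore")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Objects saved under data/ are shipped with the package.
model_file <- file.path(pkg_dir, "data", "tv_model.rda")
save(tv_model, file = model_file)

# Check the round trip by loading the file into a fresh environment.
env <- new.env()
loaded <- load(model_file, envir = env)
```

Note that `save` stores the object under its variable name, so the name used here is the name that becomes available once the data is loaded from the package.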
To load the model with the package, we can either set LazyData=true in the package DESCRIPTION, or manually load it using the data() function in R. For details on including data in R packages, see section 1.1.6 of Writing R Extensions.
Finally, the package contains a scoring function called tv, which calls out to predict.gam. The scoring function is what clients will call remotely through the OpenCPU API. We use a smart function that supports both data frames and CSV files as input:
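A scoring function along these lines could be sketched as follows. This is an illustration, not necessarily the code in the tvscore package: the stand-in model is fitted on synthetic data (not real survey data) so that the snippet runs on its own, whereas in the package tv_model would be the object shipped in the data directory.

```r
library(mgcv)

# Stand-in model on synthetic data so the sketch is self-contained.
set.seed(1)
lv <- c("MARRIED", "SEPARATED", "DIVORCED", "WIDOWED", "NEVER MARRIED")
train <- data.frame(age = sample(18:89, 500, replace = TRUE),
                    marital = factor(sample(lv, 500, replace = TRUE), levels = lv))
train$tvhours <- pmax(0, 3 + 0.02 * (train$age - 45) + rnorm(500))
tv_model <- gam(tvhours ~ marital + s(age, by = marital), data = train)

tv <- function(input) {
  # Accept either a path to a CSV file or something coercible to a data frame.
  newdata <- if (is.character(input) && length(input) == 1 && file.exists(input)) {
    read.csv(input, stringsAsFactors = FALSE)
  } else {
    as.data.frame(input)
  }

  # Basic input validation: both predictors must be present.
  stopifnot("age" %in% names(newdata), "marital" %in% names(newdata))

  # Coerce inputs to the types and factor levels the model was fitted with.
  newdata$age <- as.numeric(newdata$age)
  newdata$marital <- factor(as.character(newdata$marital),
                            levels = levels(tv_model$model$marital))

  # Append the predicted TV hours as a new column and return the scored data.
  newdata$tv <- as.vector(predict(tv_model, newdata = newdata))
  newdata
}

# Data frame input (made-up cases)
cases <- data.frame(age = c(26, 41), marital = c("MARRIED", "DIVORCED"))
scored_df <- tv(cases)

# CSV input: write the same cases to a temporary file and score that
csv_path <- tempfile(fileext = ".csv")
write.csv(cases, csv_path, row.names = FALSE)
scored_csv <- tv(csv_path)
```

Returning the input with an extra column keeps each prediction next to the case it belongs to, which is convenient when clients post a batch of rows.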
Note how the function does a bit of input validation by checking that the age and marital columns are present. As usual, the tv function is saved in the R directory of the source package. Install the package locally to verify that it works as expected in a clean R session. To install our example package from github, restart R and do:
First we test the tv function with data frame input:
And then we test if it works for CSV data:
If all of this works as expected, the package is ready to be deployed on your OpenCPU server!
Step 3: Install the package on the server
To deploy your scoring engine, simply install the package on your OpenCPU server. If you are running the OpenCPU cloud server, make sure to install your package as root. For example, if you built the package into a tar.gz archive:
To install our example package straight from R, either on an OpenCPU cloud server or OpenCPU single-user server:
If you are running the cloud server, you are done with this step. If you are running the single-user server, start OpenCPU using:
To verify that the installation succeeded, open your browser and navigate to the /ocpu/library/tvscore path on the OpenCPU server. Also have a look at /ocpu/library/tvscore/R/tv and /ocpu/library/tvscore/man/tv.
Step 4: Scoring through the API
Once the package is installed on the server, we can remotely call the tv function via the OpenCPU API. In the examples below we use the public demo server: https://cloud.opencpu.org/. For example, to call the tv function with curl using basic JSON RPC:
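To make the shape of such a request concrete, here is a small sketch in R that builds the endpoint URL and the JSON body for a few made-up cases. The JSON is assembled by hand only to keep the snippet dependency-free, and nothing is actually sent; the argument name `input` matches the parameter of the tv function.

```r
# Endpoint for the JSON RPC shortcut: /ocpu/library/{package}/R/{function}/json
server <- "https://cloud.opencpu.org"
url <- paste0(server, "/ocpu/library/tvscore/R/tv/json")

# Made-up cases to score.
cases <- data.frame(
  age = c(26, 41, 54),
  marital = c("MARRIED", "DIVORCED", "WIDOWED"),
  stringsAsFactors = FALSE
)

# One JSON object per row, wrapped in an array under the "input" argument.
rows <- sprintf('{"age":%g,"marital":"%s"}', cases$age, cases$marital)
body <- sprintf('{"input":[%s]}', paste(rows, collapse = ","))

# Print what a client would send.
cat("POST", url, "\n")
cat("Content-Type: application/json\n\n")
cat(body, "\n")
```

With curl, this body would be passed using the `-d` flag together with a `-H "Content-Type: application/json"` header.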
Note how the OpenCPU server automatically converts input and output data from/to JSON using jsonlite. See the API docs for more details on this process. Alternatively we can batch score by posting a CSV file (example data):
The response to a successful HTTP POST request contains the location of the output data in the Location header. For example, if the call returned an HTTP 201 with Location header /ocpu/tmp/x036bf30e82, the client can retrieve the output data in various formats using a subsequent HTTP GET request:
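As a sketch of how those follow-up URLs are put together, the hypothetical helper below appends the object name and the output format to the session path. It assumes the return value of the call is exposed under the name `.val`, as described in the OpenCPU API docs, and it only builds the URLs without fetching them:

```r
# Hypothetical helper: URL for the return value of a finished OpenCPU call.
# `location` is the path from the Location header; ".val" names the object
# returned by the function.
output_url <- function(server, location, format) {
  location <- sub("/+$", "", location)  # drop any trailing slash
  paste0(server, location, "/R/.val/", format)
}

server <- "https://cloud.opencpu.org"
location <- "/ocpu/tmp/x036bf30e82"

urls <- vapply(c("csv", "json", "rda"),
               function(f) output_url(server, location, f),
               character(1))
print(urls)
```

Each of these can then be retrieved with a plain HTTP GET, for example using curl or a browser.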
This completes our scoring engine. Using these steps, clients from any language can remotely score cases by calling the tv function using standard HTTP and JSON libraries.
Extra credit: performance optimization
When using a scoring engine based on OpenCPU in production, it is worthwhile to configure your server for performance. In particular, we can add our package to the preload field in the /etc/opencpu/server.conf file on the OpenCPU cloud server. This will automatically load (but not attach) the package when the OpenCPU server starts, which eliminates package loading time from the individual scoring requests. In our example this is important because tvscore depends on the mgcv package, which takes about 2 seconds to load.
Note that R does not load LazyData objects when the package loads. Hence, preload in combination with lazy loading of data might not have the desired effect. When using preload, make sure to design your package such that all data gets loaded when the package loads (example).
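One way to do this is an `.onLoad` hook that loads the model into the package namespace. The sketch below is an assumption about how such a hook could look, not necessarily what the linked example does, and it assumes LazyData is not enabled for the package:

```r
# Would live in a file under R/ in the package source, e.g. R/onLoad.R
# (the file name is an assumption).
.onLoad <- function(libname, pkgname) {
  # Load tv_model from the package's data directory into the package
  # namespace, so it is already in memory once the package is loaded.
  utils::data("tv_model", package = pkgname, lib.loc = libname,
              envir = parent.env(environment()))
}
```

Because R calls `.onLoad` whenever the namespace is loaded, the model is then read once at server start when the package is preloaded, rather than on the first scoring request.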
Finally, in production you might want to tweak the timelimit.post (timeout), rlimit.as (mem limit), rlimit.fsize (disk limit) and rlimit.nproc (parallel process limit) options in /etc/opencpu/server.conf to fit your needs. Also see the server manual on this topic.
Bonus: creating an OpenCPU app
By including web pages in the /inst/www/ directory of the source package, we can turn our scoring engine into a standalone web application. The tvscore example package contains a simple web interface that makes use of the opencpu.js JavaScript client to interact with R via OpenCPU in the browser. Navigate to /ocpu/library/tvscore/www/ on the public demo server to see it in action!
To install and run the same app in your local R session, use:
We can also call the OpenCPU server from an external website using cross-domain ajax requests (CORS). See this jsfiddle for a simple example that calls the public server using the ocpu.rpc function from opencpu.js.