This is part seven of a series on launching a data science project.
Up to this point, we’ve worked out a model which answers important business questions. Now our job is to get that model someplace where people can make good use of it. That’s what today’s post is all about: deploying a functional model.
Back in the day (by which I mean, say, a decade ago), one team would build a solution using an analytics language like R, SAS, or Matlab, but you'd almost never take that solution directly to production. These were analytical Domain-Specific Languages with a set of assumptions that worked well for a single practitioner but wouldn't scale to a broader solution. For example, R historically ran on a single CPU core and was prone to memory leaks. That didn't bother analysts much because desktops tended to be single-core anyway, and you could always restart R or reboot the machine. But it doesn't work so well for a server; you need something more robust.
So instead of using the analytics DSL directly in production, you'd use it indirectly. You'd use R (or SAS or whatever) to figure out the right algorithm and determine the weights and model structure, then toss those values over the wall to an implementation team, which would rewrite your model in some other language like C. The implementation team didn't need to understand all of the intricacies of the problem, but it did need enough practical statistics knowledge to understand what the researchers meant and translate their code into fast, efficient C (or C++ or Java or whatever). In this post, we'll look at a few changes that have led to a shift in deployment strategy, and then cover what this shift means for practitioners.
The first shift is the improvement in languages. There are good libraries for Java, C#, and other "production" languages, so that's a positive. But that's not one of the two positives I want to focus on today. The first positive is the general improvement in analytical DSLs like R. Over the past several years, R has gone from a shaky fit for production workloads to production-quality (although not without its foibles). Revolution Analytics (now owned by Microsoft) played a sizable role in that, focusing on building a stable, production-ready environment with multi-core support. The same goes for RStudio, another organization that has focused on making R more useful in the enterprise.
The other big positive is the introduction of Python as a key language for data science. With libraries like NumPy, scikit-learn, and Pandas, you can build quality models. And with Cython, a data scientist can compile those models down to C to make them much faster. I think the general acceptance of Python in this space has helped spur the developers of other languages (whether open-source like R or closed-source commercial languages like SAS) to improve.
The Era Of The Microservice
The other big shift is a move away from single, large services that try to solve all of the problems. Instead, we've entered the era of the microservice: a small service dedicated to providing a single answer to a single problem. A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than the integration problem. Although you can certainly configure other forms of interoperation, microservices are typically exposed via web calls, and that's the scenario I'll discuss today. The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, some .NET service can call yours, and nobody cares about the particular languages used because they all speak a common, known protocol.
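To make the idea concrete, here's a minimal sketch of such a prediction microservice using only the Python standard library. Everything here is illustrative: the endpoint, the feature names, and the "model" (a made-up linear scoring rule) are all stand-ins for whatever your real service hosts. The point is only that the caller needs nothing but HTTP and JSON:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Hypothetical "model": a fixed linear scoring rule standing in
    # for whatever R or Python model the service actually hosts.
    weights = {"age": 0.3, "income": 0.0001}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, and return JSON.
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

# To run standalone:
#   HTTPServer(("127.0.0.1", 8080), PredictionHandler).serve_forever()
```

A C#, Java, or R client would call this the same way: POST a JSON document of features, read back a JSON document with the prediction. That language-agnostic boundary is the whole appeal.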
One concern here is that you don't want to waste your analysts' time teaching them how to build web services, and that's where data science workbenches and deployment tools like DeployR come into play. These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.
But if you already have application development skills on your team, you can make use of other patterns. Let me give two examples of patterns that my team has used to solve specific problems.
Machine Learning Services
The first pattern involves using SQL Server Machine Learning Services as the core engine. We built a C# Web API which calls ML Services, passing in details on what we want to do (e.g., generate predictions for a specific set of inputs given an already-existing model). A SQL Server stored procedure accepts the inputs and calls ML Services, which farms out the request to a service which understands how to execute R code. The service returns results, which we interpret as a SQL Server result set, and we can pass that result set back up to C#, creating a return object for our users.
In this case, SQL Server is doing a lot of the heavy lifting, and that works well for a team with significant SQL Server experience. This also works well if the input data lives on the same SQL Server instance, reducing data transit time.
The second pattern that I'll cover is a bit more complex. We start once again with a C# Web API service. On the opposite end, we're using Keras in Python to make predictions against trained neural network models. To link the two together, we have a couple more layers: a Flask API, served by Gunicorn in production, with nginx standing in front of it to handle load balancing. The C# API makes requests to nginx, which feeds the request to Gunicorn, which runs the Keras code, returning results back up the chain.
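The nginx layer in that chain is a plain reverse proxy. As a rough sketch (the upstream name, ports, and worker count here are placeholders, not the configuration we actually ran), it might look something like this:

```nginx
# Hypothetical nginx config: round-robin across two Gunicorn workers.
upstream keras_api {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 80;

    location / {
        # Forward each request to one of the Gunicorn workers above.
        proxy_pass http://keras_api;
        proxy_set_header Host $host;
    }
}
```

Adding capacity then means starting more Gunicorn workers and listing them in the upstream block; the C# API never needs to know how many there are.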
So why have the C# service if we've already got nginx running? It lets me cache prediction results (on the assumption that the same inputs will keep producing the same outputs) and integrate easily with the C#-heavy codebase in our environment.
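The caching idea itself is language-agnostic. Here's an illustrative sketch in Python rather than C# (the function names and the stubbed-out downstream call are hypothetical): memoize on the serialized request so that identical inputs never hit the model service twice.

```python
import functools
import json

CALLS = {"downstream": 0}

def call_downstream_model(features):
    # Stand-in for the HTTP call to the nginx/Gunicorn/Keras chain;
    # the counter just makes the caching effect observable.
    CALLS["downstream"] += 1
    return sum(features.values())  # dummy "prediction"

@functools.lru_cache(maxsize=1024)
def cached_predict(payload_json):
    # Key the cache on the serialized request; repeat requests
    # are answered from memory instead of calling the model service.
    return call_downstream_model(json.loads(payload_json))

def predict(features):
    # Serialize with sorted keys so logically-equal requests share a cache key.
    return cached_predict(json.dumps(features, sort_keys=True))
```

In a real deployment you'd also want an eviction policy tied to model retraining, since a cached prediction is only valid as long as the model behind it hasn't changed.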
If you don’t need to run something as part of an automated system, another deployment option is to use notebooks like Jupyter, Zeppelin, or knitr. These notebooks tend to work with a variety of languages and offer you the ability to integrate formatted text (often through Markdown), code, and images in the same document. This makes them great for pedagogical purposes and for reviewing your work six months later, when you’ve forgotten all about it.
Interactive Visualization Products
Over the course of this post, I’ve looked at a few different ways of getting model results and data into the hands of end users, whether via other services (like using the microservice deployment model) or directly (using notebooks or interactive applications). For most scenarios, I think that we’re beyond the days of needing to have an implementation team rewrite models for production, and whether you’re using R or Python, there are good direct-to-production routes available.