Bringing machine learning models into production

Developing machine learning models and bringing them into production comes with many challenges, including model and attribute selection, dealing with missing values, normalization and more.

What I want to discuss here is a workflow that puts all the gears into motion: from data preprocessing and analysis, through building models and selecting the best performing one, to serving the model behind a real time API.

Life cycle of a machine learning model

The life cycle of a machine learning model is essentially an iteration of the following four steps.

Figure: Life cycle of a machine learning model

Each of these steps is under constant evaluation, especially when model performance can be enhanced by adding different data attributes or preprocessing methods.

For the presented approach we split the modeling process into two parts. Part one contains the four steps mentioned above and we call it Manual Run Modeling. Part two automates the steps of part one.


Manual Run Modeling

In the manual part we start by analysing the new task. After that we come up with a hypothesis we want to test.

Development and Prototyping Environment

First we set up a development environment for the new task by spinning up a Jupyter notebook server on Google Cloud AI Platform, which provides ready-to-run containers for Jupyter. The notebook approach lets us develop quickly and share results with the team through a browser. Since data can easily be visualized inline in a notebook, this approach is especially useful during data extraction and preprocessing.

Data preparation and visualization

Python provides some nice packages for quickly generating graphics from data, which speeds up prototyping in the notebook. We are especially fond of Seaborn.

We load the data identified for this model into a dataframe in the notebook. After that we start looking at each attribute and its values, often in combination with the other attributes. For a first overview we use a pairplot provided by Seaborn.

We use a combination of visualizations, such as a correlation matrix, to decide which attributes to use and how to handle outliers and missing values. Finally we one-hot encode the categorical attributes and normalize the continuous attributes to create the input for our models.

Figure: Pairplot of attributes
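As a rough sketch of what such a notebook cell can look like (the file name and column names are made-up placeholders, not our actual data):

```python
import pandas as pd
import seaborn as sns

# Load the raw data (path and column names are placeholders).
df = pd.read_csv("training_data.csv")

# Quick visual overview of pairwise relationships and distributions.
sns.pairplot(df)

# Correlation matrix as a heatmap for a first look at linear relationships.
sns.heatmap(df.select_dtypes("number").corr(), annot=True)

# One-hot encode a categorical attribute.
df = pd.get_dummies(df, columns=["category"])

# Min-max normalize a continuous attribute into the range [0, 1].
df["size"] = (df["size"] - df["size"].min()) / (df["size"].max() - df["size"].min())
```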

Model Selection and Evaluation

Once the data is ready we try out several models to find a solution for our problem. These can range from multiple linear regression and random forests to deep neural networks built with TensorFlow.

After splitting the data into training, evaluation and test sets, we decide on a measure each model has to optimize, e.g. mean squared error or precision, depending on the kind of problem. Once we have identified the best performing model we start transforming the code for Google Cloud AI Platform.
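A minimal sketch of such a comparison for a regression problem, assuming the preprocessed dataframe df and a target column from the previous step (both names are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features and target come out of the preprocessing step above (names are placeholders).
X = df.drop(columns=["target"])
y = df["target"]

# Hold out a test set; an evaluation set can be split off the training data the same way.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Train each candidate and compare them on the chosen measure, here mean squared error.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.4f}")
```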

Real Time Prediction Deployment

After the manual evaluation of preprocessing and modeling, we start automating training and deployment to bring the model into production. This splits into three tasks:

  • Training the model with hyperparameter optimization on Google Cloud AI Platform
  • Deploying the model on Google Cloud AI Platform
  • Deploying an API to access the model for real time predictions

Training on Google Cloud AI Platform

After deciding on a model to take into production, we rework our data extraction and preprocessing code to make it reusable and compliant with the Google Cloud AI Platform requirements. Essentially, this means turning the first three steps into a Python package.

A project could be set up as shown in the picture below.

Figure: Sample structure of the AI Platform package
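At the root of such a package sits a setup.py so that AI Platform can install the trainer code and its dependencies. A minimal sketch; the package name and dependency list are assumptions:

```python
from setuptools import find_packages, setup

# Minimal packaging file so AI Platform can install the trainer code and its dependencies.
setup(
    name="trainer",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "pandas",
        "scikit-learn",
    ],
    description="Data extraction, preprocessing and model training package.",
)
```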

This Python package is then deployed to Google Cloud AI Platform and executed there. If you rely on custom packages, there is an option to supply those as well. The training job itself is submitted with a gcloud command that points to the package and its training module.

One advantage of using Google Cloud AI Platform is its automated hyperparameter tuning. It lets us train the model automatically with different configurations and select the one that performs best on the measure defined in hptuning_config.yaml.

Figure: Hyperparameter tuning configuration (hptuning_config.yaml) example

In the AI Platform dashboard you can then see which combination of the values defined under params produced the best result for the configured hyperparameterMetricTag and goal.
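Inside the trainer, the values defined under params arrive as command line arguments, and the chosen measure has to be reported back so the trials can be compared. A minimal sketch, assuming the cloudml-hypertune helper package and made-up parameter names; train_and_evaluate stands in for the actual training code:

```python
import argparse

import hypertune  # from the cloudml-hypertune package

# Hyperparameters defined under `params` in hptuning_config.yaml are passed as CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--num-layers", type=int, default=2)
args = parser.parse_args()

# Placeholder for the actual model training; returns the validation error of this trial.
validation_mse = train_and_evaluate(args.learning_rate, args.num_layers)

# Report the measure named as hyperparameterMetricTag so AI Platform can rank the trials.
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="mean_squared_error",
    metric_value=validation_mse,
    global_step=1,
)
```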

The identified model is then ready to be deployed to the platform, where Google provides a URL to access it in real time.

Deploying the model on GCP

Deploying to production is done with a Jenkins job. We use a Jenkinsfile to define our jobs as part of our code. A model deployment consists of the following steps:

  • Copying the model to the correct GCP bucket (this differs between our three environments: development, staging and production)
  • Deploying the model to AI Platform using a gcloud command
  • Testing the model with a prepared test dataset

Figure: Deploying the model on AI Platform

If all of these steps are successful, the model is ready to use in the specified environment via a URL endpoint.

Deploying the Real Time API

With the model deployed and accessible via a URL endpoint, we now have to build a transformation API that takes the input data, transforms it into the format the model endpoint expects, and then calls the model.

To make the model easier to use for other services, the input format is JSON. This keeps the data human readable, and any change to the steps around the model, except a change to the number of attributes, can be made without creating dependencies on our client services.

REST Service

As the framework for our REST API we chose Flask, since it is lightweight, flexible, easy to use and written in Python. Because the API and the model code share the same language, we can reuse the preprocessing from the training package described above. The main work lies in adapting that code to handle a single event instead of the batch predictions used to validate results during training.
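A stripped-down version of such an endpoint might look like the following; the project and model names are placeholders, preprocess_single_event stands in for the shared preprocessing function, and the authorization and validation checks described below are left out for brevity:

```python
from flask import Flask, jsonify, request
from googleapiclient import discovery

app = Flask(__name__)

# Client for the AI Platform online prediction service.
ml_service = discovery.build("ml", "v1")
MODEL_NAME = "projects/my-project/models/my_model"  # placeholder

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()

    # Reuse the preprocessing from the training package, adapted to a single event.
    instance = preprocess_single_event(payload)  # stand-in for the shared preprocessing code

    response = ml_service.projects().predict(
        name=MODEL_NAME, body={"instances": [instance]}
    ).execute()

    if "error" in response:
        return jsonify({"success": False, "message": response["error"], "prediction": None}), 500
    return jsonify({"success": True, "message": "", "prediction": response["predictions"][0]})
```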

For stability and security reasons we added some additional checks:

  • JWT token authorization with Flask-JWT
  • Input format checks
    • all required fields in request
    • filling in default values for optional fields
    • checking values for validity (e.g. range or location checks), as sketched below
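A sketch of what these checks can look like; the field names, defaults and ranges are made up for illustration:

```python
REQUIRED_FIELDS = {"latitude", "longitude", "size"}
OPTIONAL_DEFAULTS = {"floor": 0}

def validate_request(payload: dict) -> dict:
    """Check required fields, fill in defaults and validate value ranges."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")

    # Fill in default values for optional fields that were not supplied.
    data = {**OPTIONAL_DEFAULTS, **payload}

    # Simple validity checks, e.g. a plausible location.
    if not (-90.0 <= data["latitude"] <= 90.0 and -180.0 <= data["longitude"] <= 180.0):
        raise ValueError("coordinates outside the valid range")
    return data
```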

We also created an extra package containing the transformation functions we use across several of our models, e.g. min-max normalization and distance calculation functions.
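Two such helpers might look roughly like this; the haversine formula is one common way to compute distances between coordinates, not necessarily the exact variant used in our package:

```python
import math

def min_max_normalize(value: float, min_value: float, max_value: float) -> float:
    """Scale a value into [0, 1] using the min and max observed during training."""
    return (value - min_value) / (max_value - min_value)

def haversine_distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))
```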

Speed is important in this component, so everything needed for enriching and transforming incoming requests is kept in a cache.

After receiving the prediction from the model, we qualify the result of regression models by adding a confidence value. This helps our clients better understand the results, especially if they are meant to be shown to end users.

Each response carries its own error code and message. The result is again in JSON format with the following fields (an example follows the list):

  • success: true or false, indicating whether the request was successful
  • message: an (error) message for the response
  • prediction: the prediction object
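An illustrative successful response, shown here as the equivalent Python dictionary before JSON encoding; the values and the fields inside the prediction object are made up:

```python
response_body = {
    "success": True,        # the request was processed without errors
    "message": "",          # empty unless an error occurred
    "prediction": {
        "value": 1234.5,    # the model output
        "confidence": 0.87, # added confidence value for regression models
    },
}
```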

Deployment of API

Deployment to our production system is then handled by a Jenkins job with the following steps:

  • Unit and integration testing of the Flask API
  • Building a Docker container for the Flask API
  • Pushing the container image to the GCP project repository
  • Deploying the container to Google Cloud Run

By using Cloud Run we do not need to worry about hardware configuration and can focus on optimizing the API and the model.

Conclusion

By following this process we keep the time spent on everything besides building the model to a minimum when bringing machine learning models into production, and we avoid managing hardware resources or worrying about availability.

The part after the manual data and model selection process in particular can be reused as a template to speed up deployment, thanks to the tools provided by Google and to extracting reusable functions into their own Python package.

Apache Zeppelin: Use with a remote Spark cluster and YARN

Apache Zeppelin is pretty useful for interactive programming in the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post.
But the real power of using Spark with Zeppelin lies in how easily it connects to your existing Spark cluster via YARN. The following steps are necessary:

  • Copy your Hadoop configuration files to your Zeppelin installation under $ZEPPELIN_HOME/conf
  • Restart your Zeppelin Notebook
  • Set the master field of the Spark interpreter to “yarn-client”, as shown in the picture below.

Figure: Spark interpreter settings with master set to yarn-client

After these steps you can use your notebooks with Spark running on a YARN cluster, making use of all the resources in the queue you assigned to Spark on your cluster.

Apache Zeppelin: Visualization and Spark data processing

Apache Zeppelin

Apache Zeppelin is a web-based notebook for interactive data analytics. It comes with features for all the steps of data analysis:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

Besides that feature set it also supports multiple languages in the backend, for example Scala, Python, SQL and shell.

You can also add your own interpreter to Zeppelin, which makes the tool really flexible.
Another feature is the built-in integration of Apache Spark. It ships with the following features and more:

  • Automatic SparkContext and SQLContext injection (see the short example after this list)
  • Runtime jar dependency loading from local filesystem or maven repository.
  • Canceling job and displaying its progress
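For example, a %pyspark paragraph can use the injected contexts directly; the data here is made up:

```python
%pyspark
# `sc` and `sqlContext` are injected by Zeppelin, no setup code is needed.
rdd = sc.parallelize(range(1, 1001))
print(rdd.sum())

# Register a DataFrame so it can also be queried from a %sql paragraph.
df = sqlContext.createDataFrame([(i, i * i) for i in range(10)], ["x", "x_squared"])
df.registerTempTable("squares")
```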

It also has built-in visualization, which I think is an improvement over using IPython notebooks. The visualization covers the most basic graph types, like:

  • Tables
  • Bar charts
  • Pie charts
  • Scatter plots
  • Line charts

These visualizations work the same way with all interpreters, so you can show data from Postgres and Spark in the same notebook using the same functions; there is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy if you embed a notebook in your own website.

Python vs. R for Data Science

In Data Science there are two languages that compete for users: on one side R, on the other Python. Both have a huge user base, but there is some discussion about which is better to use in a Data Science context. Let's explore both a bit:

R
R is a language and programming environment developed especially for statistical computing and graphics. It has been around for some time and offers several thousand packages for tackling statistical problems. With RStudio it also provides an interactive programming environment that makes analysing data pretty easy.

Python
Python is a general-purpose programming language, which makes it easy to integrate into company-wide systems. With packages such as NumPy, Pandas, scikit-learn and Matplotlib, in combination with IPython, it also provides a full suite for statistical computing and an interactive programming environment.

R was developed solely for statistical computing, so it has some advantages there: it is specialized and has been around for years. Python comes from the general programming side and is now moving into data analysis, bringing along everything else it can do, such as building websites and integrating easily with Hadoop Streaming or Apache Spark.
People who want the best of both worlds can always use the R-Python integration rpy2.
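A minimal example of calling R from Python with rpy2 might look like this:

```python
import rpy2.robjects as robjects

# Evaluate a small piece of R code and pull the result back into Python.
r_result = robjects.r("mean(c(1, 2, 3, 4))")
print(r_result[0])  # 2.5
```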

I have personally been working with Python lately for my ETL processes, including MapReduce, and for analysing data, which works great in combination with IPython as an interactive development tool.

Visualization: Enhancing the Palo Suite with NVD3.js

After my previous post How to visualize data? I was unsatisfied with the visualizations provided by Jedox's Palo Suite. This could have several reasons, not least that I may not have gotten the most out of it. But the quality of the resulting diagrams and their interactivity were lacking for my purposes, especially after working with Circos over the last few weeks.
So I went hunting for something that is easy to integrate into my Palo Suite.
Palo provides an interface for integrating “widgets” into its web reporting environment. The interface consists of a single, easy-to-use JavaScript function. This narrowed down the choice of library, but there are still a lot of charting libraries available.

I looked at quite a few of them, but at some point I had to decide on one. I settled on NVD3.js since I liked the look of its graphics and because it is based on Data Driven Documents (D3.js).
It already supports several types of graphs, and integrating them into the interface provided by Jedox got me quick results. Here is a quick look at the difference between Palo's built-in graphics and NVD3.js; both graphs are based on the same data.

Figure: Palo Suite web reporting graph

Figure: NVD3.js graph

For anyone interested, I uploaded the file here. It is just a quick hack and not very polished, but it shows how it works.

How to visualize data?

Data visualization is something of an art. How do you make the results of your data research easy to understand for management, business users or just everyone out there? A plain list of data, like an Excel sheet, is not what catches the eye. The art of visualization is shown perfectly on the site of Martin Wattenberg.
Now the question is: which tools are easy to use in a company environment to visualize your data?

There are several classes of tools you can use:

  • Beginner: These are tools that are widely known throughout the company, mainly MS Excel. You can explore data easily and create diagrams without too much hassle. Excel provides bar charts, lines, pies and combinations of those. It is also very easy to use for ad-hoc analysis and for making the data and graphs available to business users if necessary.
  • Online libraries: If you don’t want to be limited to Excel and instead use a web-based reporting / analysis tool, you may be able to integrate one of the available libraries. There are several for every purpose you can imagine:
      Google Charts: For dynamic charts it has everything you need, as long as you are not bothered by the Google look. The charts run in every browser that supports SVG, canvas or VML. But they are JavaScript based, which is a problem if they have to work offline or in browsers without JS.
      Circos is a great tool if you want to use circular layouts to visualize your data. It is written in Perl and produces PNG output.
      Visual.ly focuses more on the infographics side of things. It is mainly a marketplace, but you can create your own cartoon-like graphics with it.
      Kartograph is a tool for creating interactive vector maps, available as a JavaScript or Python library. It is a great tool, especially since most people love maps and love using them.
  • Professional tools: The opposite of Excel in terms of manipulating and analysing data. Some of these tools are pretty expensive, such as SAS and SPSS. But there are also open source, free-to-use tools that are sometimes more flexible and easier to use, since they have a strong user base.
      R: Besides its nearly unlimited supply of libraries for all kinds of analysis, R also has lots of packages for visualizing data and makes good use of them. It is one of the most complex tools mentioned here.
      Gephi is a graph-based tool for data exploration. It is most useful for exploring relations between nodes of all kinds.

These are just some examples, and I evaluated even more tools. There are many ways to visualize data, and which one you use depends on your environment and skills. I mostly use R for generating complex graphs, but only because I already use it for the analysis. I will be integrating Circos into our automated scripts soon, since they are all based on Perl anyway.