Bringing machine learning models into production

Developing machine learning models and bringing them into production is a task with many challenges, including model and attribute selection, dealing with missing values, and normalization.

Here I want to discuss a workflow that puts all the gears into motion: from data preprocessing and analysis, through building models and selecting the best-performing one, to serving the model behind a real-time API.

Life cycle of a machine learning model

The life cycle of a machine learning model is essentially an iteration of the following four steps.

Life cycle of machine learning models

Each of these steps is under constant evaluation, especially when model performance can be improved by adding different data attributes or preprocessing methods.

For the approach presented here we split the modeling process into two parts. Part one contains the four steps mentioned above and we call it Manual Run Modeling. Part two automates the steps of part one.


Manual Run Modeling

In the manual part we start by analysing the new task. After that we come up with a hypothesis we want to test.

Development and Prototyping Environment

First we set up a development environment for working on the new task. For this we spin up a Jupyter notebook server on Google Cloud AI Platform, which provides ready-to-run containers for Jupyter. The notebook approach lets us develop quickly and share results with the team through a browser. With the ability to easily visualize data inline in a notebook, this approach is especially useful during data extraction and preprocessing.

Data preparation and visualization

Python provides some nice packages for visualizing data and gaining quick insights, which speeds up prototyping in the notebook. We are especially fond of Seaborn.

We load the data identified for this model into a dataframe in the notebook. After that we start looking at each attribute and its values, often in combination with the other attributes. For a first overview we use a pairplot provided by Seaborn.

We use a combination of visualizations, such as a correlation matrix, and then decide which attributes to use and how to handle outliers and missing values. Finally we one-hot encode the categorical attributes and normalize the continuous attributes to create the input for our models.

pairplot of attributes
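
A minimal sketch of this exploration and preprocessing step with pandas and Seaborn; the file and column names are placeholders, not our actual data:

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# load the data identified for the model (file name is hypothetical)
df = pd.read_csv("training_data.csv")

# first overview of the pairwise relationships between attributes
sns.pairplot(df)

# correlation matrix as a heatmap to spot redundant attributes
sns.heatmap(df.select_dtypes("number").corr(), annot=True)

# one-hot encode categorical attributes ...
df = pd.get_dummies(df, columns=["category"])

# ... and scale continuous attributes to [0, 1]
numeric_cols = ["price", "distance"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])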

Model Selection and Evaluation

When the data is ready we choose several candidate models for the problem. These can range from multiple linear regression through random forests to deep neural networks built with TensorFlow.

After splitting the data into training, evaluation and test sets we decide on a measure each model has to optimize, e.g. mean squared error or precision, depending on the kind of problem. Once we have identified the best-performing model we start transforming the code for Google Cloud AI Platform.
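
A sketch of such a comparison for a regression problem, building on the preprocessed dataframe from the sketch above; the label column "target" and the candidate models are assumptions:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# "target" is a placeholder name for the label column
X = df.drop(columns=["target"])
y = df["target"]

# split into 70% training, 15% evaluation and 15% test data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# train each candidate and compare on the chosen measure (here: mean squared error)
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_eval, model.predict(X_eval))
    print(f"{name}: evaluation MSE = {mse:.4f}")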

Real Time Prediction Deployment

After manual evaluation of preprocessing and modeling, we start automating training and deployment to bring the model into production. This can be split into three tasks:

  • Training the model with hyperparameter optimization on Google Cloud AI Platform
  • Deploying the model on Google Cloud AI Platform
  • Deploying an API to access the model for real time predictions

Training on Google Cloud AI Platform

After deciding which model to take forward into production, we optimize our code for data extraction and preprocessing to make it reusable and compliant with the Google Cloud AI Platform requirements. In essence, we have to turn the first three steps into a Python package.

A project could be set up as shown in the picture below.

Sample structure of AI platform package
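
A typical layout for such a trainer package, with hypothetical file names, could look like this:

ai-platform-trainer/
├── setup.py              # declares the trainer package and its dependencies
├── hptuning_config.yaml  # hyperparameter tuning configuration (see below)
└── trainer/
    ├── __init__.py
    ├── util.py           # data extraction and preprocessing helpers
    ├── model.py          # model definition
    └── task.py           # entry point: argument parsing and training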

This Python package is then deployed to Google Cloud Platform and executed there. If you rely on custom packages, there is an option to supply those as well. An example call for training on the cloud could look like the following.
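
A hedged sketch of such a call; the job, bucket, region and custom package names are placeholders:

gcloud ai-platform jobs submit training price_model_training_001 \
  --package-path=trainer/ \
  --module-name=trainer.task \
  --staging-bucket=gs://my-training-bucket \
  --region=europe-west1 \
  --runtime-version=1.15 \
  --python-version=3.7 \
  --packages=dist/my_custom_transforms-0.1.tar.gz \
  --config=hptuning_config.yaml \
  -- \
  --train-data=gs://my-training-bucket/data/train.csv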

One advantage of using Google Cloud AI Platform is its automated hyperparameter tuning. It lets us train the model automatically with different configurations and then select the one that performs best on the measure defined in hptuning_config.yaml.

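
A sketch of what such a hptuning_config.yaml can look like; the metric name, parameter names and value ranges are assumptions:

trainingInput:
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: mean_squared_error
    maxTrials: 20
    maxParallelTrials: 4
    params:
      - parameterName: learning-rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden-units
        type: INTEGER
        minValue: 16
        maxValue: 256
        scaleType: UNIT_LINEAR_SCALE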

In the AI Platform dashboard you can then see which combination of the values defined under params produced the best result for the configured hyperparameterMetricTag and goal.

The identified model is then ready to be deployed to the platform, where Google provides a URL to access the model in real time.

Deploying the model on GCP

Deploying to production is done with a Jenkins job. We use a Jenkinsfile to define our jobs as part of our code. A model deployment consists of the following steps:

  • Copying the model to the correct GCP bucket (this differs between our three environments: development, staging and production)
  • Deploying the model to AI Platform using a gcloud command (sketched below)
  • Testing the model with a prepared test dataset
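
The corresponding commands could look roughly like this; bucket, model and version names are placeholders:

# copy the exported model artefacts to the environment-specific bucket
gsutil cp -r export/model gs://my-staging-models/price_model/v3/

# create the model resource (only needed once per environment)
gcloud ai-platform models create price_model --regions=europe-west1

# deploy a new version pointing at the copied artefacts
gcloud ai-platform versions create v3 \
  --model=price_model \
  --origin=gs://my-staging-models/price_model/v3/ \
  --runtime-version=1.15 \
  --framework=tensorflow \
  --python-version=3.7

# smoke test with a prepared test dataset
gcloud ai-platform predict --model=price_model --version=v3 \
  --json-instances=test_instances.json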

If all of these steps are successful, the model is ready for use in the specified environment via a URL endpoint.

Deploying the Real Time API

Since the model is deployed and accessible via a URL endpoint, we now build a transformation API that takes the input data, transforms it into the format the model endpoint expects, and then calls the model.

To make the model easier to use for other services, our input format is JSON. This keeps the data human readable, and changes to any step concerning the model, except changing the set of attributes, can be made without affecting our client services.

REST Service

As the framework for our REST API we chose Flask, since it is lightweight, flexible, easy to use and also written in Python. Because the API and the model are written in the same language, we can reuse the preprocessing from the training package described above. The main work lies in adapting the code to process a single event instead of the batch predictions used to validate results during training.

For stability and security reasons we added some additional checks (a minimal endpoint sketch follows the list):

  • JWT token authorization with Flask-JWT
  • Input format checks
    • all required fields in request
    • filling in default values for optional fields
    • checking values for validity (e.g. range or location checks)
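
A minimal sketch of such an endpoint; the project, model and field names are placeholders, and the JWT authorization and the enrichment cache are omitted:

from flask import Flask, jsonify, request
from googleapiclient import discovery

app = Flask(__name__)

# placeholders for the GCP project, model and expected input fields
PROJECT = "my-gcp-project"
MODEL = "price_model"
REQUIRED_FIELDS = {"price", "category", "location"}
DEFAULTS = {"currency": "EUR"}

# client for the AI Platform online prediction service
ml_service = discovery.build("ml", "v1")

def preprocess(payload):
    """Reuse the transformation functions from the training package
    for one single event instead of a whole batch (simplified stand-in)."""
    return [payload["price"]]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)

    # input format check: are all required fields present?
    missing = REQUIRED_FIELDS - set(payload)
    if missing:
        return jsonify(success=False, message=f"missing fields: {sorted(missing)}"), 400

    # fill in default values for optional fields
    payload = {**DEFAULTS, **payload}

    # call the model version deployed on AI Platform
    name = f"projects/{PROJECT}/models/{MODEL}"
    response = ml_service.projects().predict(
        name=name, body={"instances": [preprocess(payload)]}
    ).execute()

    return jsonify(success=True, message="", prediction=response["predictions"][0])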

We also created an extra package containing all transformation functions we use across several of our models, e.g. min-max normalization and distance calculation functions.
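
Two examples of what such shared functions could look like; this is a sketch, not the actual package:

from math import asin, cos, radians, sin, sqrt

def min_max_normalize(value, min_value, max_value):
    """Scale a single value to [0, 1] using the bounds seen during training."""
    return (value - min_value) / (max_value - min_value)

def haversine_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))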

Speed is important in this component, so we store all data for enriching and transforming the incoming data inside a cache.

After receiving the prediction from the model, we qualify the results of regression models by adding a confidence value. This helps our clients better understand the results, especially if they are meant to be shown to end users.

Each of our responses carries its own error code and message. The result is again in JSON format with the following fields:

  • success: true or false, indicating result of request
  • message: (error) message for response
  • prediction: the prediction object returned by the model
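
A response could then look like this; the fields inside the prediction object are only an example:

{
  "success": true,
  "message": "",
  "prediction": {
    "value": 42.7,
    "confidence": 0.83
  }
}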

Deployment of API

Deployment to our production system is then handled by a Jenkins job with the following steps:

  • Unit and integration testing of Flask API
  • Building a Docker container for the Flask API
  • Pushing the container image to the GCP project's container registry
  • Deploying the container to Google Cloud Run

By using Cloud Run we do not need to worry about hardware configuration and can focus on optimizing the API and the model.

Conclusion

By following this process we keep the time spent on everything besides building the model, but still necessary for bringing it into production, to a minimum, and we do not have to manage hardware resources or worry about availability.

In particular, the part after the manual data and model selection process can be reused as a template to speed up deployment. This is thanks to the tools provided by Google and to extracting reusable functions into their own Python package.

AVRO schema generation with reusable fields

Why use AVRO and AVRO Schema?

There are several serialization formats out there, so choosing the one best suited to your needs is crucial. This blog entry will not compare them; it just points out some advantages of AVRO and AVRO Schema for an Apache Hadoop™ based system.

  • Avro schemas can be written in JSON
  • The schema is always stored with the data, removing the need to know the schema before accessing the data
  • Small file size: since the schema is always present, less type information has to be stored
  • Schema evolution is possible by using a union field type with default values. This was explained here. Deleted fields also need to be defined with a default value.
  • Avro files are compressible and splittable by Hadoop MapReduce and other tools from the Hadoop universe.
  • Files can be compressed with Snappy and Deflate.

AVRO Schema generation

Generating Apache AVRO™ schemas is pretty straightforward. They can be written in JSON and are always stored with the data. There are field types for everything needed, even complex types such as maps and arrays. A schema can also contain a record, which is in itself an independent schema, as a field. This makes it possible to store data of almost unlimited complexity in AVRO. In case of very complex schema definitions, keep in mind that accessing deeply nested data structures can be expensive later on when transforming and working with the data. Here are some examples of the datatypes AVRO supports.

AVRO Datatypes

  • Primitive types such as null, boolean, int, long, float, double, string and bytes
  • Complex types such as records. These fields are basically complete schemas in their own right and consist of:
    • name
    • namespace
    • fields
  • Enums
  • Arrays
  • Maps
  • Fixed length fields
  • Logical datatypes
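
To illustrate the nested record type mentioned above, a schema could look like the following; the names are invented for illustration:

{
  "type": "record",
  "name": "Purchase",
  "namespace": "com.example.shop",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "customer", "type": {
      "type": "record",
      "name": "Customer",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }}
  ]
}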

Logical datatypes are special: they let you define further field types you might need. As you can see in the list above there is no datatype for date or datetime; these are implemented using logical datatypes. A logical type is defined like this:

{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Supported logical datatypes are decimal, date, time, timestamp and duration.

Downsides in Schema Generation

There is one downside though: individual fields are not reusable. This topic was addressed by Treselle Systems in this entry. They introduce a way to make fields in an AVRO schema reusable by working with placeholders and replacing them with previously defined subschemas. This comes in handy when you have fields that should be present in every AVRO schema, such as meta information for a message pushed into your system.
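
A small sketch of this placeholder idea in Python; the placeholder token and the metadata subschema are made up for illustration:

import json

# hypothetical reusable subschema, e.g. shared message metadata
META_SUBSCHEMA = {
    "type": "record",
    "name": "Meta",
    "fields": [
        {"name": "source", "type": "string"},
        {"name": "ingestion_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}

def resolve_placeholders(schema_text, subschemas):
    """Replace placeholder tokens such as "__META__" with predefined subschemas."""
    for placeholder, subschema in subschemas.items():
        schema_text = schema_text.replace(f'"{placeholder}"', json.dumps(subschema))
    return json.loads(schema_text)

raw_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "meta", "type": "__META__"},
    {"name": "payload", "type": "string"}
  ]
}
"""

print(json.dumps(resolve_placeholders(raw_schema, {"__META__": META_SUBSCHEMA}), indent=2))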

AVRO Schema Generator

To make AVRO schema generation more convenient, I worked on a project inspired by Treselle Systems’ article and combined it with other tools I use daily:

  • Jupyter Notebook
  • AVRO-Doc: a JS based server reformatting AVRO schemas into an easily readable HTML format.
  • AVRO schema repo server: a simple REST based server to publish schemas to and provide them for all parties that generate and consume the data stored in AVRO format

AVRO Schema Generator

This combination of several tools makes it possible to handle data more easily.

Schema generator

Schemas are written using a Jupyter notebook server. The project contains:

  • AVRO Schema Editor.ipynb: To create new schemas and adapt existing ones. You load the existing files into the notebook and then edit them before saving them to file again.
  • Avro Schema Generator.ipynb: This notebook checks the schema syntax and replaces the subschema placeholders in a generated version of the schema. Subschemas need to be defined before a final version of a schema is generated. The notebook also implements functions to upload the schemas to the repository server (see the sketch after this list).
  • A Dockerfile for setting up the schema repository server in docker_schema_repo. Make sure to set the correct URL before trying to upload the generated schemas.
  • A Dockerfile for setting up the avrodoc server, with a built-in Active Directory plugin in Nginx, in docker_avrodoc.
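
A sketch of what the validation and upload functions could look like; the repository URL, endpoint layout and file paths are assumptions and depend on the schema repo implementation:

import json

import fastavro
import requests

SCHEMA_REPO_URL = "http://schema-repo.example.com/schema-repo"  # hypothetical URL

def validate_schema(path):
    """Load a schema file and let fastavro check its syntax."""
    with open(path) as f:
        schema = json.load(f)
    fastavro.parse_schema(schema)  # raises an error if the schema is invalid
    return schema

def upload_schema(subject, schema):
    """Publish a validated schema to the repository server."""
    response = requests.put(
        f"{SCHEMA_REPO_URL}/{subject}/register",
        data=json.dumps(schema),
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()

schema = validate_schema("schemas/generated/event.avsc")
upload_schema("event", schema)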

The project contains an example schema for reference.

Schema repository

The schema repository provides a generally available schema store with built-in version control. It allows data sources to take their time adapting to a new version of a schema.

This asynchronicity is possible because every schema version is compatible with its previous versions. Under that constraint, different sources can push data for the same schema in different versions and the data can still be transformed by a single process. Values missing in a particular version of a schema are filled with the mandatory default value, which can even be null.

Conclusion

This project aims to help manage data definitions in Hadoop based systems. With the schema repository it provides a single source of truth for all data definitions, at least at data entry; if you decide to use AVRO schemas throughout your system, even after transformation, you can manage all data definitions here.

There are several other schema repositories out there that can be used, e.g. the one provided by Confluent or the one introduced by Hortonworks for Apache NiFi. The tools used here are just examples of how such a system can be set up and how reusable AVRO fields can be introduced into your schemas.

The code can be found in our repository.

Plumber: Getting R ready for production environments?

R Project and Production

Running R in production is a controversial topic, as is everything concerning R vs. Python. Lately there have been some additions to the R ecosystem that made me look into this again. Researching R and its usage in production environments, I came across several packages and projects that can be used for this:

There are several more, but these are the ones I found most interesting.

Plumber


For ease of use, and because it is not a hosted solution, I took a deeper look at Plumber. It felt quite natural, as Plumber uses function decorators for defining endpoints and parameters, similar to Spring Boot, which I normally use for programming REST APIs.
Using Plumber is really straightforward, as the example below shows:

#' return the text "hello"
#' @get /hello
function() {
  list(msg = "hello")
}

The #' @get decorator defines the endpoint for this request, in this case /hello, so the full URL on localhost is http://127.0.0.1:8001/hello. To pass in one or more parameters you can use the decorator #' @param parameter_name parameter_description. A more complicated example using Plumber is hosted on our Gitlab. This example was built with the help of Tidy Text Mining.

Production ready?

Plumber comes with Swagger, so the web server is automatically available. As the R instance is already running, processing the R code does not take long; if your model is complicated, this is of course reflected in the processing time. But since R is a single-threaded language, Plumber can only process one request at a time.
There are ways to work around this. You can run several instances of the service using a Docker image, as described here. There is also the option of using a web server to fork requests to several instances of R. Depending on the needs of the API, single-threaded processing can be fast enough. If the service has to be highly available, the Docker solution seems like a good choice, as it comes with a load balancer.

Conclusion

After testing Plumber I am surprised by its ease of use. This package makes deploying a REST API in R really easy. Depending on your business needs, it might even be enough for a production scenario, especially when used in combination with Docker.

Analytics Platform: An Evolution from Data Lake

Analytics Platform

Having built a Data Lake for your company’s analytical needs, new use cases will soon arise that cannot easily be covered by the Data Lake architecture I described in previous posts, such as Apache HAWQ™: Building an easily accessible Data Lake. You will need to adapt or extend your architecture to become more flexible. One way to achieve this flexibility is to transform your Data Lake into an Analytics Platform.

Definition of an Analytics Platform

An Analytics Platform is a platform that provides all kinds of services needed for building data products. This often exceeds the functionality of a pure RDBMS or even a Data Lake based on Apache HAWQ™. Some data products have requirements beyond a SQL interface: reporting and basic analysis are addressed by such a setup, but products dealing with predictions or recommendations often have different needs. An Analytics Platform provides flexibility in the tools used. There can be, for example, an Apache HAWQ™ setup and at the same time an environment for running TensorFlow applications.

Using existing parts: Multi-colored YARN

When you are running a Hadoop™ cluster, you are already familiar with a resource manager: YARN. With YARN you can already deploy Linux containers, and support for Docker containers has progressed quite far (YARN-3611). Building complex applications and managing them with YARN is what Hortonworks calls Multi-colored YARN.
Following through on this idea, you end up with a cluster where only a few central services are installed directly on bare metal. Everything else is deployed in containers, as shown in the images below.

Analytics Platform based on YARN and Docker

The example makes use of Kubernetes and Docker for virtualization and provides the following services on bare metal, since they are needed by most applications:

  • Ambari
  • Kubernetes
  • YARN
  • ZooKeeper
  • HDFS

HDFS in particular is important as a central service: it makes it possible for all applications to access the same data. The picture above shows that there can be several instances of a Hadoop distribution, even in different versions. The platform thus allows for multi-tenancy while all instances still process the same data.

Development changes

Having an Analytics Platform makes the development of data products easier. With separate development and staging systems, as described by me here, there was always the problem of developing a product on a sample of the data. In some cases these samples did not contain all possible combinations of data, which could result in errors after a deployment to the production environment, even after going through development and staging. The new approach allows you to deploy all three systems on the same data, so data-induced errors can already be accounted for on the development and staging systems.
You can even become more agile in your development process. The picture below shows an example deployment process that uses this setup.

Analytics Platform: Deployment Process

Conclusion

Moving from a pure Data Lake to an Analytics Platform gives you more flexibility and helps in the development of data products, especially since you can develop on the same data that is available in production. Of course it brings more complexity to an already complex environment. But since it is possible to keep YARN as the resource manager while moving to a more agile way of development and deployment, it might be worth considering. Once Multi-colored YARN is finished, it will be easier to make this happen.