Plumber: Getting R ready for production environments?

R Project and Production

Running R in production is a controversial topic, as is everything concerning R vs Python. Lately there have been some additions to the R ecosystem that made me look into this again. Researching R and its usage in production environments, I came across several packages and projects that can be used as a solution for this:

There are several more, but these are the ones I found most interesting.

Plumber


For reasons of ease of use, and because it is not a hosted service, I took a deeper look at Plumber. This felt quite natural, as it uses function decorators for defining endpoints and parameters, similar to Spring Boot, which I normally use for programming REST APIs.
Using Plumber is really straightforward, as the example below shows:

#' return text "Hello"
#' @get /hello
function() {
  list(msg = "hello")
}

The #' @get decorator defines the endpoint for this request. In this case it is /hello, so the full URL on localhost is http://127.0.0.1:8001/hello. To pass in one or more parameters you can use the decorator #' @param parameter_name parameter_description. A more complicated example using Plumber is hosted on our Gitlab. This example was built with the help of Tidy Textmining.
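As a small sketch of such a parameterized endpoint (the /echo route and the msg parameter are placeholders of my own, not part of the linked example):

#' Echo back a message passed as a query parameter
#' @param msg The message to echo back
#' @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

A request to http://127.0.0.1:8001/echo?msg=hi would then return the message wrapped in a JSON object.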

Production ready?

Plumber comes with its own web server and Swagger support, so the API and its documentation are available automatically. Because the R session is already running, processing the R code does not take long. If your model is complicated, then, of course, this is reflected in the processing time. But as R is a single-threaded language, Plumber can only process one request at a time.
There are ways to tweak this, of course. You can run several instances of the service using a Docker image, as described here. There is also the option of using a web server to forward requests to several R instances. Depending on the needs of the API, single-threaded processing can be fast enough. If the service has to be highly available, the Docker solution seems like a good choice, as it comes with a load balancer.
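As a minimal sketch of the multi-instance idea (assuming the decorated functions live in a file called api.R and the port comes from an environment variable; both names are placeholders of my own), each R process, whether running in a container or not, serves the same API on its own port and the web server or load balancer in front distributes the requests:

library(plumber)

# Each R process (or container) runs one instance of the API on its own port.
# Start this script several times with different PLUMBER_PORT values to scale out.
port <- as.integer(Sys.getenv("PLUMBER_PORT", "8001"))

# Parse the decorated functions in api.R into a router and start the web server.
pr <- plumb("api.R")
pr$run(host = "0.0.0.0", port = port)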

Conclusion

After testing Plumber I am surprised by its ease of use. This package makes deploying a REST API in R really easy. Depending on your business needs, it might even be enough for a production scenario, especially when used in combination with Docker.

Visualization: Enhancing the Palo Suite with NVD3.js

After my previous post How to visualize data? I was unsatisfied with the visualization capabilities of the Palo Suite by Jedox. There could be several reasons for this, not least that I may not have been able to get the most out of it. But the quality of the resulting diagrams and their interactivity were lacking for my purposes, especially after working with Circos over the last few weeks.
So I went hunting for something easy to integrate into my Palo Suite.
Palo provides an interface for integrating “widgets” into its web reporting environment. This interface consists of a single JavaScript function that is easy to use. That made the choice of library easier, but there are still a lot available. Here is a list of some I came across:

There are a lot more out there, and at some point I had to decide on one. I settled on NVD3.js since I liked the look of the graphics and because it is based on Data Driven Documents (D3.js).
It already supports several types of graphs, and integrating them into the interface provided by Jedox got me quick results. Here is a quick look at the difference between the Palo built-in graphics and NVD3.js. Both graphs are based on the same data.

[Figure: Palo Suite web reporting graph]

[Figure: NVD3.js graph]

For anyone interested, I uploaded the file here. It is just a quick hack and not very polished, but it shows how the integration works.

How to visualize data?

Data visualization is something of an art. How do you make the results of your data research easy to understand for management, business users, or just anyone out there? A list of data, like an Excel sheet, is not what catches the eye. The art of visualization is shown perfectly on the site of Martin Wattenberg.
Now the question is: what tools are easy to use in a company environment to visualize your data?

There are several classes of tools you can use:

  • Beginner: These are tools that are widely known throughout the company, mainly MS Excel. You can explore data easily and create diagrams without too much hassle. Excel provides bar charts, line charts, pie charts, and combinations of those. It is also very easy to use for ad-hoc analysis and for making the data and graphs available to business users, if necessary.
  • Online Libraries: If you don’t want to be limited to Excel and use a web-based reporting / analysis tool, you may be able to integrate one of the many libraries available. There are several for every purpose you can imagine:
      Google Charts: For dynamic charts it has everything you need, as long as you are not bothered by the Google look. The charts run in every browser that supports SVG, canvas, or VML. But they are JavaScript-based, so they are a problem if they are to be used offline or in browsers without JS.
      Circos is a great tool if you want to visualize your data in circular layouts. It is written in Perl and produces PNG output.
      Visual.ly focuses more on the infographics side of things. It is mainly a marketplace, but you can create your own cartoon-like graphics with it.
      Kartograph is a tool for creating interactive vector maps. It is available as a JavaScript or Python library. This is a great tool, especially since most people love maps and enjoy using them.
  • Professional tools: The opposite of Excel when it comes to manipulating and analyzing data. These tools can be pretty expensive, such as SAS and SPSS. But there are also open-source, free-to-use tools that are sometimes more flexible and easier to use, since they have a strong user base.
      R: Besides its nearly unlimited supply of libraries for all kinds of analysis, R also has lots of packages for visualizing data and makes good use of them (a small example follows after this list). It is one of the most complex tools mentioned here.
      Gephi is a graph-based tool for data exploration. It is most useful for exploring relationships between nodes of all kinds.

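To give an impression of how little code a basic graphic needs in R, here is a tiny sketch using the ggplot2 package and the mtcars data set that ships with R (this example is my own, not part of the tool list above):

library(ggplot2)

# Scatter plot of car weight vs. fuel consumption from the built-in mtcars data,
# coloured by the number of cylinders.
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")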
These are some examples, and I evaluated even more tools. There are many ways to visualize data, and what you use depends on your environment and skills. I mostly use R for generating complex graphs, but only because I use it for the analysis as well. I will be integrating Circos into our automated scripts soon, since they are all based on Perl anyway.

Data Science and Machine Learning

Machine Learning is acknowledged as a part of Data Science, but will it be able to replace a Data Scientist?
There have been several articles on that topic in the last few years and months. It’s true that there has been major progress in the field of machine learning, and there are already articles about the beginning of automated science, such as the work of Lipson and Schmidt.
During SXSWeek there will even be a panel on this topic: The Data Scientist Will Be Replaced By Tools.
The main question is: will machine results replace human expertise? There are several startups that provide data science as a service, like Prior Knowledge or Platfora. These companies help to discover knowledge hidden in a company’s data. Prior Knowledge searches the provided data for correlations and helps to build predictive models. Platfora, on the other hand, wants to make Hadoop usable for everyone.
These companies can help discover information, but only in combination with human expertise from inside the company is it possible to make the most of the uncovered information. So, in my opinion, machine learning makes the job of a data scientist easier, because it lets them concentrate on their expertise and the context the data was created in.
This may even help broaden access to data science for more people.

Data Science: What is it?

Data Science is an interdisciplinary field. It includes:

      Data Engineering
      Math
      Statistics
      Advanced Computing
      Visualization
      Domain Expertise

As the name suggests, it revolves around working with data. As our society generates ever larger sets of data, the need to analyze this data grows across all industries. This calls for people who can work with data, analyze it, and make their findings available for management decisions. This is where math, statistics, domain expertise, and visualization come into play.
One of the most important challenges is to integrate findings from unstructured data, such as log files, with structured data, such as databases. This enables a company to develop a whole new view of its customers and products.
One task of data science is to make all necessary information available and to filter out the noise that the information age provides us with. This is where statistics and data mining come into play. Finding effective ways to distribute these findings involves data engineering, math, and visualization. Presenting analysis results in graphical form makes them easier to grasp.