Apache Zeppelin: Visualization and Spark data processing

Apache Zeppelin

Apache Zeppelin is a web-based notebook for interactive data analytics. It comes will features for all the steps of data analysis:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

Besides that feature set it also supports multiple languages in the backend. Currently it supports languages like:

But there is also the possibility to add your own interpreter to Zeppelin. This makes this tool really flexible.
Another feature it has, is the built in integration of Apache Spark. It ships with the following features and more:

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from local filesystem or maven repository.
  • Canceling job and displaying its progress

It also has built in visualization, which is an improvemnt over using ipython notebooks I think. The visualization covers the most basic graphs, like:

  • Tables
  • BarCharts
  • Pies
  • Scatterplot
  • Lines

These visualizations can be used with all interpreters and are always the same. So you can show data from Postgres and Spark in the same notebook with the same functions used. There is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy, if you embedd a notebook in your own website.

Please follow and like us:

Visualization: Enhancing the Palo Suite with NVD3.js

After my previous post How to visualize data? I was unsatisfied with the visualization provided by the Palo Suite provided by Jedox. This could have several reasons, not the least, that I may not have been able to get the max out of it. But the quality of the resulting diagramms and it’s interactivity were lacking for the purposes I have to deal with, especially after working with Circos the last few weeks.
So I went hunting for something easy to integrate into my Palo Suite.
Palo provides an interface for integration “widgets” into their webreporting environment. This interface provides one Javascript function that is easy to use. This made the choice of what kind of library to use easier, but there are still a lot available. Here is a list of some I came across:

There are a lot more out there and sometime I had to decide on one. So I settled on NVD3.js since I liked the look of the graphics and because it is based on Data Driven Documents.
It supports several types of graphs already and integrating them all into the interface provided by Jedox, got me quick results. Here is a quick view on the difference between Palo built-in graphics and NVD3.js. Both graphs are based on the same data.

Palo Suite Webreporting graph
jedox_bib

NVD3.js graph
nvd3

For anyone interessted I uploaded the file here. This is just a quick hack and not very representable, but it shows how it works.

Please follow and like us:

How to visualize data?

Data visualization is something like an art. How to make results from your research in data easy to understand by management, business users or just everyone out there? A list of data, like an Excel sheet ist not what catches the eye. The art in visualization is shown perfectly on the site of Martin Wattenberg.
Now the questions is, what tools are easy to use in a company environment to visualize your data?

There are several classes of tools you can use:

  • Beginner: These are tools with a wide knowlegde throughout the company, mainly MS Excel. You can explore data easily and make diagramms without too much hazzle. It provides Barcharts, Lines, Pies and a combination of those. It is also very easy to use for adhoc analysis and making the data and graphs available to business users, if necessary.
  • Online Libraries: If you don’t want to be limited to Excel and use a Web-based reporting / analysis tool, you maybe can integrate one of the libraries available. There are several for all purpose you can imagine:
      Google Charts: For dynamic charts it has everything you need, as long as you are not bother by the Google look. They are running in every browser that supports SVG, canvas and VML. But there are JavaScript based, so there is a problem, if they should be used offline or in browsers without JS.
      Circos is a great tool, if you want to use circles to visualize your data. It is written in Perl and produces PNG output.
      panel-general
      Visual.ly focusses more on the infographics side of graph. It is mainly a marketplace, but you can make your own cartoon like graph with it.
      Kartograph is a tool for creating interactive vector maps. It is available as JavaScript or Python library. This is a great tool, especially since most people totally love maps and to use them.
  • Professional tools: The opposite of Excel in manners of manipulating and analysing data. These tools are sometimes pretty expensive, such as SAS and SPSS. But there are also open source and free to use tools, that sometimes are more flexible and easier to use, since they have a strong user base.
      R: Besides its nearly unlimited supply of libraries for all manners of analysis, R also has lots of packages concerning visualizing data and makes good use of them. It is one of the complexest tools I mentioned here.
      hpgraphic
      Gephi is a graph-based tool for data exploration. It is most useful for relations of notes of all kinds.

These are some examples and I evaluated even more tools. So there are many ways to visualize data and what you use, is depending on your environment and skills. I mostly use R for generating complex graphs, but only because I use that tool for the analysis. I will be integrating more Circos into our autmated scripts soon, since they are all based on Perl anyways.

Please follow and like us: