Big Data and Data Warehouse Architecture

Further developments and new additions to the Hadoop framework, such as Stinger from Hortonworks or Impala from Cloudera, try to bridge the gap between traditional EDWH architectures and big data architectures.
Especially the initiatives aimed at speeding up Hive and bringing the SQL:2011 standard to MapReduce-based Hadoop clusters make this technology usable for developers with a SQL background. This next iteration of Hive optimization also brings an ACID framework with transactions and writable tables, which is especially useful in data warehouse contexts, for example when you need to add metadata.
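
To make this concrete, here is a minimal sketch of the kind of DDL and DML the ACID work enables, issued from Python. The PyHive client, the connection details and the table and column names are my own assumptions for illustration, not part of any specific setup; Hive's transactional tables also require ORC storage and bucketing.

    from pyhive import hive   # assumption: PyHive client, HiveServer2 on localhost:10000

    cur = hive.Connection(host="localhost", port=10000).cursor()

    # Transactional tables must be bucketed and stored as ORC in Hive's ACID design.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_id INT,
            name        STRING,
            valid_from  TIMESTAMP
        )
        CLUSTERED BY (customer_id) INTO 4 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional' = 'true')
    """)

    # With ACID enabled the table is writable in place, e.g. to maintain
    # metadata for the warehouse layer.
    cur.execute("UPDATE dim_customer SET name = 'ACME Corp.' WHERE customer_id = 42")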

With these developments it seems plausible that Hadoop, and with it Big Data as a whole, will move from being an ETL platform feeding traditional EDWH architectures built on traditional database systems to a unified platform, where Hadoop stores everything from raw unstructured data to structured data from the company's transactional systems and the metadata created for reporting purposes. Access to all data would then be available in the same system and queryable with SQL.
Standard reporting and deeper analysis could then run on the same system, so that all analysts and traditional BI developers share one platform and a better understanding of all the data needed and used in the data warehouse system.

I already did a benchmark of query speed for MySQL, Stinger and Impala here and will update it once the next version is out.

Comparing Stinger to Impala

With Hadoop 2.0 and the new additions of Stinger and Impala, I ran a (not representative) performance test in a VirtualBox VM on my desktop computer, using the following setup:

  • 4 GB RAM
  • Intel Core i5 2500 3.3 GHz

The datasets were the following:

  1. Dataset 1: 71,386,291 rows and 5 columns
  2. Dataset 2: 132,430,086 rows and 4 columns
  3. Dataset 3: partitioned data of 2,153,924 rows and 32 columns
  4. Dataset 4: unpartitioned data of 2,153,924 rows and 32 columns

The results were the following:

Query                            | Hive (0.10.0) | Impala    | Stinger (Hive 0.12.0)
Join tables                      | 167.61 sec    | 31.46 sec | 122.58 sec
Partitioned tables (Dataset 3)   | 42.45 sec     | 0.29 sec  | 20.97 sec
Unpartitioned tables (Dataset 4) | 47.92 sec     | 1.20 sec  | 36.46 sec
Grouped select (Dataset 1)       | 533.83 sec    | 81.11 sec | 444.634 sec
Grouped select (Dataset 2)       | 323.56 sec    | 49.72 sec | 313.98 sec
Count (Dataset 1)                | 252.56 sec    | 66.48 sec | 243.91 sec
Count (Dataset 2)                | 158.93 sec    | 41.64 sec | 174.46 sec
Compare Impala vs. Stinger
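
The queries behind such a comparison are plain SQL. The exact statements I used are not reproduced here, but a rough timing harness along these lines shows how the query shapes above can be measured; the PyHive and impyla clients, the default ports and the table and column names are all assumptions for illustration, not the actual benchmark.

    import time
    from pyhive import hive
    from impala.dbapi import connect as impala_connect

    QUERIES = {
        "join tables":      "SELECT COUNT(*) FROM ds1 a JOIN ds2 b ON a.key = b.key",
        "partition filter": "SELECT COUNT(*) FROM ds3_part WHERE load_date = '2013-11-01'",
        "grouped select":   "SELECT key, SUM(value) FROM ds1 GROUP BY key",
        "count":            "SELECT COUNT(*) FROM ds2",
    }

    def run_all(cursor, engine):
        # Run every query, force the full result and print the elapsed wall time.
        for name, sql in QUERIES.items():
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()
            print(f"{engine:7s} {name:17s} {time.time() - start:8.2f} sec")

    hive_cur = hive.Connection(host="localhost", port=10000).cursor()        # HiveServer2
    impala_cur = impala_connect(host="localhost", port=21050).cursor()       # Impala daemon

    run_all(hive_cur, "Hive")
    run_all(impala_cur, "Impala")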

This shows that Stinger provides a faster SQL interface on Hive, but since it still uses MapReduce for its computations it is no match for Impala, which does not use MapReduce at all. So using Impala makes sense when you want to analyse data in Hadoop with SQL, even on a small installation. It gives you easy and fast access to all data stored in your Hadoop cluster in a way that was not possible before.
Facebook's Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really look like a big boost in making the project more accessible to many more people.

SQL on Hadoop: Facebook’s Presto

Earlier this month Facebook open-sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook's answer to Cloudera's Impala or Hortonworks' Stinger, both already presented in an earlier post on this site called SQL and Hadoop.
Unlike Hive and more like Impala, Presto does not rely on MapReduce for its queries. Facebook claims in a blog post that this makes it about ten times faster than Hive on large datasets.
This product may have a huge impact on the further development of SQL-on-Hadoop tools if it is taken up by enough companies. But since there is no commercial goal linked to it right now, it seems more likely that Facebook will develop it as its own needs grow, so it will not be hurried along.
Like Impala, it supports a large subset of ANSI SQL, in contrast to Hive's SQL-like HiveQL. So it again aims at making Hadoop more accessible to a broader audience of analysts who are already familiar with SQL.
Analysis of Big Data sets is strengthened even further by this release, and the entry-level investment required for companies to use Hadoop as a data storage system decreases with every development in this direction.

SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data over the last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an SQL-based interface for Hadoop is to make the technology available to more people. Companies that have used SQL for decades won't just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop: it is built as a batch processing system, which can be slow to answer queries. The new products emerging here are trying to speed up Hive, the existing SQL product released by Apache.
There are two approaches to bringing SQL to Hadoop:

  • SQL natively on Hadoop
  • DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

  • Stinger from Hortonworks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.
  • Impala from Cloudera, which also claims to speed up SQL queries compared to Hive. It is also designed to co-exist with MapReduce and can be cleanly integrated into the Hadoop stack.
  • Drill from Apache, which is similar to Google's Dremel.

DBMS on Hadoop

Some example products in this category are:

  • Hadapt, which includes a PostgreSQL instance on each node, takes advantage of the distributed filesystem for speed, and supports advanced SQL functions. They recently introduced a feature called “Schemaless SQL”, which integrates data such as JSON and documents into their system and lets you access it with SQL. The data is stored in its original form on HDFS, and columns of a multi-structured table are materialized as needed. They posted a detailed explanation here.
  • CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.
  • Tajo, founded in South Korea, is still in the Apache Incubator, but bears watching too.

Each of the two approaches has its own benefits, and to decide which fits you better I would test both. The main issue with all these products is that they are relatively new and there is little experience with the technology yet. Some of them are even still in development, offering only beta access.
But this is where the future of Big Data will take us: making the benefits of Hadoop available to more analysts by building an interface they can already use.

REST (Representational State Transfer) APIs and Big Data

Getting data, huge amounts of data, out of some systems tends to be quite a hassle. Often you are required to use techniques such as FTP or SSH to transfer files. But with RESTful APIs getting more attention over the last few years, there is a new way to get at your data.
The charm of REST APIs is that they are stateless and use HTTP methods explicitly. This makes getting data pretty straightforward:

  • Use POST to create a resource on the server.
  • Use GET to retrieve a resource.
  • Use PUT to change a resource.
  • Use DELETE to remove a resource.

The result can be returned in any defined format, but mostly it is XML or JSON. Security is also covered if you integrate authentication methods like OAuth or LDAP.
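
As a small illustration, here is how the four verbs look from Python with the requests library against a hypothetical reporting API; the URL, payload and token are placeholders, not a real service.

    import requests

    BASE = "https://example.com/api/v1/reports"              # placeholder endpoint
    AUTH = {"Authorization": "Bearer <token>"}                # e.g. an OAuth access token

    # POST: create a new resource on the server
    r = requests.post(BASE, json={"name": "monthly_sales"}, headers=AUTH)
    report_url = r.headers["Location"]                        # many APIs return the new URL here

    # GET: retrieve the resource (commonly JSON or XML)
    print(requests.get(report_url, headers=AUTH).json())

    # PUT: change the resource
    requests.put(report_url, json={"name": "monthly_sales_eu"}, headers=AUTH)

    # DELETE: remove the resource
    requests.delete(report_url, headers=AUTH)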

This opens up new possibilities for integrating your data into web-based reporting systems, since you only need the HTTP protocol to get your data and can work on the results as they stream in.
Since most REST APIs can store the result of a request, you can retrieve the same result again later without the source system having to process it again.

Hadoop even provides a REST API, the WebHDFS REST API developed by Hortonworks, which supports the complete filesystem interface of HDFS. This is a great help if applications using your Hadoop cluster are not written in Java, since it lets you access and manipulate your data from just about anywhere.
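
Here is a minimal sketch of what that looks like with plain HTTP from Python, assuming the namenode's web port (50070 by default in these Hadoop versions), a user called "hadoop" and a made-up file path; adjust everything to your own cluster.

    import requests

    BASE = "http://namenode:50070/webhdfs/v1"
    USER = {"user.name": "hadoop"}

    # List a directory: op=LISTSTATUS returns JSON metadata for each entry
    listing = requests.get(f"{BASE}/data/exports",
                           params={"op": "LISTSTATUS", **USER}).json()
    for entry in listing["FileStatuses"]["FileStatus"]:
        print(entry["pathSuffix"], entry["length"])

    # Open a file: op=OPEN redirects to a datanode (requests follows the redirect)
    # and lets you work on the result as it streams in, without writing it to disk first.
    resp = requests.get(f"{BASE}/data/exports/sales.csv",
                        params={"op": "OPEN", **USER}, stream=True)
    for line in resp.iter_lines():
        print(line)          # replace with your own processing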

Big Data in Learning

There are many fields in which big data can improve results, one of them being (e-)learning. Until recently the focus in analysing learning lay on analysing exam results, but with big data and analytics there are new possibilities to enhance the learning experience as a whole. For example, learning can be personalized to help students achieve better results, and Big Data makes this possible in nearly real time. Students can be helped during the learning process itself: as soon as the program detects a problem, it can offer a solution within the workflow, instead of the student having to stop learning until the problem is solved and then continue. The same applies to working environments.
Analysing data is useful not only during the learning or working process. Even after a course is finished, analysing the data produced by all students during the course can help optimize the course and the resulting exam. Identifying where users got stuck or what was too easy will improve the learning experience for everyone.
There are already efforts to integrate this into the learning experience, such as the Predictive Analytics Reporting (PAR) Framework.
PAR tries to integrate several data sources and base its studies on this combined data instead of on studies based on individual programs. This approach broadens the base and may yield different (better) insights into the educational system of the U.S.

Hadoop and MPP

With Big Data, MapReduce is always the first term that comes to mind. But it is not the only way to handle large amounts of data. There are database systems built especially to deal with huge amounts of data, called Massively Parallel Processing (MPP) databases.
MPP database systems have been around longer than MapReduce and its most popular implementation, Hadoop, and are based on a shared-nothing architecture. The data is partitioned across several hardware nodes, and queries are coordinated by a central server and processed across the nodes via the network interconnect. MPP systems often use commodity hardware that is as inexpensive as the hardware for MapReduce clusters. For working with data they have the advantage of using SQL as their interface, the language most data scientists and other analytics professionals have used so far.
MapReduce provides a Java interface for analysing the data, which takes more time to implement than just writing an SQL statement. There are Hadoop projects that provide an SQL-like query language, such as Hive with its HiveQL interface.
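
To illustrate the difference in effort, here is the same grouped count once as a single HiveQL statement and once as a MapReduce job. The job is written for Hadoop Streaming so the sketch stays in Python rather than the Java API the paragraph refers to, but the point is the same; the table, column and file names are invented.

    # In HiveQL this is one statement:
    #   SELECT key, COUNT(*) FROM events GROUP BY key;
    #
    # As a MapReduce job you already need a mapper and a reducer, e.g. run via
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #       -input /data/events -output /data/event_counts

    # --- mapper.py: emit "key<TAB>1" for every input line -------------------
    import sys

    for line in sys.stdin:
        key = line.rstrip("\n").split("\t")[0]   # assume the key is the first column
        print(f"{key}\t1")

    # --- reducer.py: sum the counts per key (input arrives sorted by key) ---
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")
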
Since both systems handle data, there is a lot to be gained from combining them. There are already projects working on that, like Aster Data nCluster or the collaboration between Teradata and Hortonworks.
There is even a product that brings both worlds together, Hadapt. With it you can access all your data, structured or unstructured, on a single platform; each node supports SQL as well as MapReduce.

Last but not least a list of some MPP databases available right now:

Depending on your business needs, you may not need a MapReduce cluster but an MPP database, or both, to benefit from their respective strengths in your implementation.

Visualization: Enhancing the Palo Suite with NVD3.js

After my previous post How to visualize data? I was unsatisfied with the visualization provided by Jedox's Palo Suite. This could have several reasons, not least that I may not have been able to get the most out of it. But the quality of the resulting diagrams and their interactivity were lacking for my purposes, especially after working with Circos over the last few weeks.
So I went hunting for something easy to integrate into my Palo Suite.
Palo provides an interface for integrating “widgets” into its web reporting environment. This interface consists of a single JavaScript function that is easy to use. That made the choice of what kind of library to use easier, but there are still a lot available. Here is a list of some I came across:

There are a lot more out there, and at some point I had to decide on one. I settled on NVD3.js since I liked the look of the graphics and because it is based on Data-Driven Documents (D3.js).
It already supports several types of graphs, and integrating them into the interface provided by Jedox got me quick results. Here is a quick look at the difference between Palo's built-in graphics and NVD3.js; both graphs are based on the same data.

Palo Suite Webreporting graph

NVD3.js graph

For anyone interested, I uploaded the file here. It is just a quick hack and not very presentable, but it shows how it works.

How to visualize data?

Data visualization is something of an art. How do you make the results of your data research easy to understand for management, business users or just everyone out there? A list of data, like an Excel sheet, is not what catches the eye. The art in visualization is shown perfectly on the site of Martin Wattenberg.
Now the question is: what tools are easy to use in a company environment to visualize your data?

There are several classes of tools you can use:

  • Beginner: These are tools that are widely known throughout the company, mainly MS Excel. You can explore data easily and make diagrams without too much hassle. Excel provides bar charts, lines, pies and combinations of those. It is also very easy to use for ad hoc analysis and for making the data and graphs available to business users, if necessary.
  • Online libraries: If you don't want to be limited to Excel and use a web-based reporting / analysis tool, you may be able to integrate one of the available libraries. There are several for every purpose you can imagine:
      Google Charts: For dynamic charts it has everything you need, as long as you are not bothered by the Google look. The charts run in every browser that supports SVG, canvas or VML. But they are JavaScript-based, so there is a problem if they need to be used offline or in browsers without JS.
      Circos is a great tool if you want to use circular layouts to visualize your data. It is written in Perl and produces PNG output.
      panel-general focuses more on the infographics side of graphs. It is mainly a marketplace, but you can make your own cartoon-like graphs with it.
      Kartograph is a tool for creating interactive vector maps, available as a JavaScript or Python library. It is a great tool, especially since most people love maps and love using them.
  • Professional tools: The opposite of Excel in terms of manipulating and analysing data. These tools are sometimes pretty expensive, such as SAS and SPSS. But there are also open-source and free-to-use tools that are sometimes more flexible and easier to use, since they have a strong user base.
      R: Besides its nearly unlimited supply of libraries for all manner of analysis, R also has lots of packages for visualizing data and makes good use of them. It is one of the most complex tools mentioned here.
      Gephi is a graph-based tool for data exploration. It is most useful for exploring relations between nodes of all kinds.

These are some examples, and I evaluated even more tools. There are many ways to visualize data, and what you use depends on your environment and skills. I mostly use R for generating complex graphs, but only because I use that tool for the analysis. I will be integrating Circos into more of our automated scripts soon, since they are all based on Perl anyway.

Data Science Tools

What tools are used for Data Science? There are a lot of them out there, and in this post I want to tell you about the ones I currently use or have used before.

  • KNIME is a graphical tool for analysing data. Its interface is used to build process flows that cover everything from data transformation and initial analysis to predictive analysis, visualization and reporting. One of its advantages is the huge community, and being an open-source tool it encourages the community to contribute.
  • RapidMiner from Rapid-I is also a graphical tool for analysing data. Processes are built from predefined steps. It provides data transformation, initial analysis, predictive analysis, visualization and reporting. Since it is based on Java it is platform-independent. There is a community here too, which helps improve the program and expands the available resources.
  • SAS has a whole suite of tools for data manipulation and analysis. They provide OLAP tools, predictive analytics, reporting and visualization. Having been in the market for a long time, they have a huge customer base and lots of experience. There is also a system of trainings with exams that provide certified qualifications in using their tools.
  • R is a free tool, originally developed by and for statisticians and scientists, that has spread through all kinds of industries due to its wide range of packages. There is no graphical interface, but the language is easy to learn. R provides data manipulation, visualization, predictive analysis, reporting and initial analysis. There are also integrations with Hadoop for better interaction with Big Data.
  • Splunk is a tool primarily for analysing unstructured data such as log files. It provides real-time statistics and outstanding visualization for reports. Its query language is related to SQL, so it is pretty easy to learn if you have used SQL before.
  • Jedox provides an OLAP server with a web interface that looks like MS Excel, and there is a plugin for MS Excel too. It caters mainly to controlling needs, but has some advantages regarding self-service BI. Based on PHP and Java, it is available in a community version and a professional version.
  • FastStats from Apteco uses an easy-to-understand graphical interface and some basic predictive methods. It enables business users to analyse their data themselves and even build small models. It also provides visualization tools, catering to self-service BI as well.

If you have other tools you use and like, please feel free to share them with me. I am always interested in learning about new tools.