SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data over the last twelve months, which is reason enough for me to take a closer look at that scene right now. One reason to build an SQL-based interface for Hadoop is to make the technology available to more people. Companies that have used SQL for decades won’t just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop: it is built as a batch processing system, which can be slow to answer queries. The new products now emerging try to improve on the speed of Hive, the existing SQL product from Apache.
There are two approaches to bringing SQL to Hadoop:

  • SQL natively on Hadoop
  • DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

  • Stinger from Hortonworks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.
  • Impala from Cloudera, which also claims to speed up SQL queries compared to Hive. It is also designed to co-exist with MapReduce and can be cleanly integrated into the Hadoop stack.
  • Drill from Apache, which is similar to Google’s Dremel.

DBMS on Hadoop

Some example products in this category are:

  • Hadapt, which runs a PostgreSQL instance on each node, takes advantage of the distributed filesystem for speed, and supports advanced SQL functions. The company recently introduced a feature called “Schemaless SQL”, which brings data such as JSON and other documents into the system and lets you query it with SQL. The data is stored in its original form on HDFS, and columns are surfaced in a multi-structured table as needed. Hadapt has posted a detailed explanation of the feature.
  • CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.
  • Tajo, which originated in South Korea. It is still an Apache Incubator project, but it bears watching too.
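
To make the “Schemaless SQL” idea mentioned for Hadapt a little more concrete, here is a minimal Python sketch of the general principle: records stay in their original JSON form, and columns are only derived when a query asks for them. This is not Hadapt’s implementation. The in-memory list stands in for files on HDFS, and the names RAW_RECORDS, discover_columns and select are hypothetical, chosen only for illustration.

```python
import json

# Raw records kept in their original JSON form.
# Assumption: this list is a stand-in for files stored on HDFS.
RAW_RECORDS = [
    '{"user": "alice", "country": "DE", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "device": {"os": "linux"}}',
    '{"user": "carol", "country": "US"}',
]


def discover_columns(raw_records):
    """Return the sorted union of top-level keys seen across all records."""
    columns = set()
    for line in raw_records:
        columns.update(json.loads(line).keys())
    return sorted(columns)


def select(raw_records, columns, where=None):
    """Project the requested columns; keys missing from a record become None."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if where is not None and not where(record):
            continue
        rows.append(tuple(record.get(col) for col in columns))
    return rows


# Roughly: SELECT user, clicks FROM records WHERE clicks > 2
columns = discover_columns(RAW_RECORDS)
result = select(
    RAW_RECORDS,
    ["user", "clicks"],
    where=lambda r: r.get("clicks", 0) > 2,
)
print(columns)
print(result)
```

The point of the sketch is that no schema is declared up front: a record without a "country" key simply yields an empty value in that column, and new keys show up as new columns the next time they are discovered.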

Each of the two approaches has its benefits, and to decide which fits you better, I would test both of them. The main issue with all of these products is that they are relatively new and there is little experience with the technology yet. Some of the products are even still in development and only offer beta access.
But this is where the future of Big Data will take us: making the benefits of Hadoop available to more analysts by building an interface they already know how to use.
