SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data these last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an interface based on SQL for Hadoop is to make the technology available for more people. Companies that have used SQL for decades won’t just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop, as it’s build as a batch processing system, which can be slow in answering queries. These new products emerging are trying to speed up the already existing SQL product Apache released named Hive.
There are two approaches to bringing SQL to Hadoop:

  • SQL natively on Hadoop
  • DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

  • Stinger from HortonWorks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.
  • Impala from Coudera, which also claims speed up SQL queries compared to Hive. It is also design to co-exist with MapReduce and can be cleanly integrated into the Hadoop stack.
  • Drill from Apache, which is similar to Googles Dremel.

DBMS on Hadoop

Some example products in this category are:

  • Hadapt, which includes a PostgreSQL instance on each node and takes advantage of the distirubted filesystem for speed and supports advanced SQL functions. They recently introduced a feature called “Schemaless SQL” for their product. This integrates data such as JSON, Documents, etc. into their system and lets you access them by SQL. This stores the data in the original form on the HDFS and emerges columns in a Multistructured table as needed. They posted a detailed explanation here.
  • CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.
  • Tajo founded in South Korea is still in incubator mode with Apache, but will bear watching too.

The two different approaches have their benefits each, and to decide which fits you better, I would test both of them. The main issue with all the products is, that this is all relatively new and there is little experience with the technology yet. Some of the products even are still in development, only offering Beta access.
But here is where the future of Big Data will take us. Making the benefits of Hadoop available for more analysts by building an interface they already can use.

Please follow and like us:

Hadoop and MPP

With Big Data Map/Reduce is always the first term that comes into mind. But it’s not the only way to handle large amounts of data. There are databasesystems especially built to deal with huge amounts of data and they are called Massively Parallel Processing (MPP) databases.
MPP database systems have been around for a longer time than Map/Reduce and its most popular integration Hadoop and are based on a shared nothing architecture. The data is partitioned across severel nodes of hardware and queries are processed via network interconnect on a central server. They often use commodity hardware that is as inexpensive as hardware for Map/Reduce. For working with data they have the advantage to make use of SQL as their interface, the language used by most Data Scientists and other analytic prefessionals so far.
Map/Reduce provides a Java interface to analyse the data, which comes with more time to implement than just write an SQL statement. Hadoop has some projects, that provide a SQL similar query language, like Hive which provides HiveQL, a SQL like query language, as interface.
Since both systems handle data, there will be a lot gained, when both are combined. There are already projects working on that, like Aster Data nCluster or Teradata and Hortonworks.
There is even a new product bringing both worlds together as one product, Hadapt. With this product you can access all your data, structured or unstructured, in a single plattform. Each node has space for SQL as well as for Map/Reduce.

Last but not least a list of some MPP databases available right now:

Depending on your business needs, you may not need a Map/Reduce cluster, but a MPP database, or both to benefit from their respective strenghts in your implementation.

Please follow and like us: