Apache HAWQ: Full SQL and MPP support on HDFS

Pivotal ported their massively parallel processing (MPP) database Greenplum to Hadoop and made it open source as an incubating project at Apache, called Apache HAWQ. This brings together full ANSI SQL and MPP capabilities with Hadoop integration.

Integration into an existing Hadoop installation is easy, as you can make all existing data available via external tables. This is done using the PXF API to query external data. The API is customizable, but already comes with ready-made support for the most commonly used formats.
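
As a minimal sketch of how such an external table looks, here is a PXF-backed table over a delimited file in HDFS; the namenode host, PXF port, file path, and column layout are placeholders for illustration, not from this post:

```sql
-- Hypothetical example: expose a CSV file in HDFS as a HAWQ external table via PXF
CREATE EXTERNAL TABLE ext_sales (
    region      text,
    sale_month  text,
    num_orders  int,
    total_sales float8
)
LOCATION ('pxf://namenode:51200/data/sales/sales.csv?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Query it like any other table
SELECT region, sum(total_sales) FROM ext_sales GROUP BY region;
```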

To access and store small amounts of data, Apache HAWQ has an interface called gpfdist. It enables you to store data outside of your HDFS and still access it within HAWQ, joining it with the data stored in HDFS. This is especially handy when you need small dimension or mapping tables in Apache HAWQ, which would otherwise each occupy a whole, mostly empty HDFS block.
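
A rough sketch of this pattern, assuming a gpfdist process serving a CSV file on a separate ETL host (host name, port, file, and table names are invented for illustration):

```sql
-- Serve the file outside HDFS first, e.g.: gpfdist -d /data/dimensions -p 8081 &
CREATE EXTERNAL TABLE dim_country (
    country_code char(2),
    country_name text
)
LOCATION ('gpfdist://etl-host:8081/dim_country.csv')
FORMAT 'CSV' (HEADER);

-- Join the small dimension table with fact data stored in HDFS
SELECT f.order_id, d.country_name
FROM   fact_orders f
JOIN   dim_country d ON d.country_code = f.country_code;
```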

Apache HAWQ even comes integrated with MADlib, another Apache incubating project driven by Pivotal. MADlib is a machine learning framework based on SQL, so moving data between different tools for analysis is no longer needed. If you have stored your data in Apache HAWQ, you can mine it in the database directly and don't have to export it, e.g. to a Spark client or to tools like KNIME or RapidMiner.

MADlib algorithms

MADlib comes with algorithms in the following categories:

  • Classification
  • Regression
  • Clustering
  • Topic Modelling
  • Association Rule Mining
  • Descriptive Statistics
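
As an illustration of the regression category, here is a hedged sketch using MADlib's linear regression interface; the houses table and its columns follow the usual MADlib documentation example and are not data from this post:

```sql
-- Train a linear regression model directly in the database
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_linregr',             -- output table holding the model
    'price',                      -- dependent variable
    'ARRAY[1, tax, bath, size]'   -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit
SELECT coef, r2 FROM houses_linregr;
```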

By using HAWQ you can even leverage tools like Tableau with live database connections, which so far has not been satisfactory with Hive.


Big Data and Data Warehouse Architecture

Further developments and new additions to the Hadoop framework, such as Stinger from Hortonworks or Impala from Cloudera, try to bridge the gap between traditional EDWH architectures and big data architectures.
Especially the Stinger.next initiative, with its goal of speeding up Hive and delivering the SQL:2011 standard on MapReduce Hadoop clusters, makes this technology usable for developers with a SQL background. This next iteration of Hive optimization also brings an ACID framework with transactions and writeable tables. This is especially useful in data warehouse contexts, for example when you need to add metadata.
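
As a rough sketch of what these ACID features look like in practice (table and column names are invented for illustration), a transactional Hive table requires ORC storage, bucketing, and the transactional property, and then supports in-place updates:

```sql
-- Transactional Hive table: ORC storage, bucketed, transactional property set
CREATE TABLE dim_customer (
    customer_id   int,
    customer_name string,
    valid_from    timestamp
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With ACID enabled, rows can be updated in place instead of rewriting files
UPDATE dim_customer
SET    customer_name = 'ACME Corp.'
WHERE  customer_id = 42;
```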

With these developments it seems plausible that Hadoop, and with it Big Data as a whole, will move from being an ETL platform feeding traditional EDWH architectures built on traditional database systems to a unified platform, where Hadoop stores all data, from raw unstructured data to structured data from the company's transactional systems and the metadata created for reporting purposes. Access to all data would then be available in the same system and queryable with SQL.
Standard reporting and deeper analysis on all data could then run on the same system, so that all analysts and traditional BI developers share one platform and a better understanding of all the data needed and used in the data warehouse system.

I already did a benchmark on query speed for MySQL, Stinger and Impala here and will update it once Stinger.next is out.
