Building a Productive Data Lake: How to keep three systems in sync

Three Systems for save Development

When you are building a productive Data Lake it is important to have at least three environments:

  • Development: for development, where “everything” is allowed.
  • Staging: for testing changes in a production like environment.
  • Production: Running your tested and productive data applications

With these different environments comes the need to keep data structures in sync. You need to find a way to change each system in a controlled way, when it is time to deploy some changes. There are differences in how to manage data structures, depending on where in data processing they occur. Theses differences we split into the parts “Data Entry to Staging” and “Core and Presentation”. I wi

Managing different data structures in the three systems of your Productive Data Lake

As all three systems can be have different data structures, using tools to make this easier is important. Different Apache Avro™ schema versions are managed by a publishing them to a tool called Apache Avro™ Schema Service of which there also also three instances. By publishing changes first on the development, then staging, then production system, we make sure, only tested data structures are deployed to the Production Data Lake. Each change to a schema belongs to a version number, so rolling back to an earlier state, in case of an error is possible.
Evolving our data structures not contained in Apache Avro™ schemas, we use tables managing all meta information.

Data Structures from Entry to Staging

Apache Avro™ is a good way to manage evolving data structures, as I explained here. Once the data is available in Apache Hive™ mapping the AVRO data types to those of your Productive Data Lake system is necessary. We use Apache HAWQ™ as base of our Data Lake. So we map the AVRO data types to HAWQ data types. To keep this as effortless as possible, we reuse attribute names as often as possible. If an attribute, like “user_id” is occuring in different schemas and has the same meaning, then we just reuse it. So the mapping table does not contain an entry for each column of each schema, but only for distinct ones. Considering more than 100 schema with on average 10 attributes, this helps in keeping maintanence down. In the table below is an example, of this mapping.

Schema FieldnameAVRO datatypeHAWQ datatype

Data Strucutes for Core and Presentation

A totally different challenge is it, to manage the data structures after data entry. There need to occur transformations, for technical and business user benefit. This can mean other attributes / column names or a splitting or joining of data from one schema into different tables. These tables here again can evolve. Evolving database tables is a pain, since you cannot, in most cases just drop the table and recreate the new one. Here the processing power of Apache HAWQ™ comes into play, as we can exactly do that now. In most cases, at least for transactional data, we just reload the whole table, during the daily batch process. So we manage all table definitions in tables. They contain all table and column attributes, as well as a version number. This makes rolling back possible.

Example of table definitions:

Schema NameTable NamePartitioneddistributed bydropableVersion
corerel_user_purchaseuser_id, purchase_idtrue4.0

Exmaple of column definitions:

Schema NameTable NameColumn NameData TypeData Type LengthVersion

Automation of data structure deployment.

After defining all the meta data concerning the data structures, we need a way to automate this. For this I created a Java programm using all the information from schema definitions and tables for data type mappings and table definitions. This programm runs on any of the three environments and also take in consideration the version it should create. This only works, if no one can create or change table information manually. So the programm first deploys changes to the data structure on the development system. If it works there and all changes are ready to be deployed on the staging system, run the programm in this environment. With the version numbers defined besides each change, it is possible to run version 5.0 on production, version 5.1 on staging and version 5.2 on development if neccessary.


This way is a possibility to make data structure evolution easier and keep three different systems in sync. It is most important to make sure, that there are no manual changes to the data structure. This approach works nicely, if you can drop and recreate all tables without hesitation. If that is not the case, you can still define your tables this way, just make sure they are not dropped automatically when the programm runs. Having all definitions in one place and being able to automatically deploy the tables and the changes, with version control, can be a great help. This way no one has to remember to create new tables manually in either system. So errors due to missing tables or columns are minimized which is good if you create a Productive Data Lake, where errors affect data products needed by other products of your company.

Apache HAWQ: Building an easily accessable Data Lake

Apache HAWQ for Data Lake ArchitectureData Lake vs Datawarehouse

The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including:

All of these tools have advantages and disadvantages when used to process data, but all of them combined make your data accessible. This is the first step in building a Data Lake. You have to have your data, even schemaless data accessible to your customers.
A classical Datawarehouse on the opposite only contains structured data, that is at least preproccessed and has a fixed schema. Data in a classical Datawarehouse is not the raw data entered into the system. You need a seperate staging area for tranformations. Usually this is not accessible for all consumers of your data, but only for the Datawarehouse developers.

Data Lake Architecture using Apache HAWQ

It is a challenge to build a Data Lake with Apache HAWQ, but this can be overcome on the design part. One solution to build such a system can be seen in then picture below.

Data Entry

To make utilization of Apache HAWQ possible the starting point is a controlled Data Entry. This is a compromise between schemaless and schematized data. Apache AVRO is a way to do this. Schema evolution is an integral part of AVRO and it provides structures to save unstrcutured data, like maps and arrays. A separate article about AVRO will be one of this next topics here, to explain schema evolution and how to make the most of it.
Data structured in schema can then be pushed message wise into a messaging queue. Chose a queue that fits your needs best. If you need secure transactions RabbitMQ may be the right choice. Another option is Apache Kafka.

Pre-aggregating Data

Processing and storing single message on HDFS is not an option, so there is need of another system to aggregate messages before storing them on HDFS. For this a software project called Apache Nifi is a good choice. This system comes with processors that make things like this pretty easy. It has a processor called MergeContent that merges single AVRO messages and removes all headers but one, before writing them to HDFS.
If those messages are still not above the HDFS blocksize, there is the possibility to read messages from HDFS and merge them into larger files still.

Making data available in the Data Lake

Use Apache Hive to make data accessible from AVRO format. HAWQ could read the AVRO files directly, but Hive handles schema evolution in a more effective way. For example, if there is the need to add a new optional field to an existing schema, add a default value for that field and Hive will fill entries from earlier messages with this value. So if HAWQ now accesses this Hive table it automatically reads the default value for field added later with default values. It could not do this by itself. Hive also has a more robust way of handling and extracting keys and values from map fields right now.

Data Lake with SQL Access

All data is available in Apache HAWQ now. This enables tranformations using SQL and making all of your data accessible by a broad audience in your company. SQL skills are more common than say Spark programming in Java, Scala or PySpark. From here it is possible to give analysts access to all of the data or building data marts for single subjects of concern using SQL transformations. Connectivity to reporting tools like Tableau is possible with a driver for Postgresql. Even advanced analytics are possible, if you install Apache MADlib on your HAWQ cluster.

Using Data outside of HAWQ

It is even possible to use all data outside of HAWQ, if there is a need for it. Since all data is available in AVRO format, accessing it by means of Apache Spark with Apache Zeppelin is also possible. Hive queries are possible too, since all data is registered there using external tables, which we used for integration into HAWQ.
Accessing results of such processing in HAWQ is possible too. Save the results in AVRO format for integration in the way described above or use “hawq register” to access parquet files directly from HDFS.


Using Apache HAWQ as base of a Data Lake is possible. Just take some contraints into consideration. But entering data with semi-structured with AVRO format also saves work later when you process the data. The main advantage is, that you can utilize SQL as an interface to all of you data. This enables many people in your company to access your data and will help you on your way to Data Driven decisions.

Apache Zeppelin: Use with remote Spark cluster and Yarn

Apache Zeppelin is pretty usefull for interactive programming using the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post.
But the real power in using Spark with Zeppelin lies in its easy way to connect it to your existing Spark cluster using YARN. The following steps are necessary:

  • Copy your Hadoop configuration files to your Zeppelin installation under $ZEPPELIN_HOME/conf
  • Restart your Zeppelin Notebook
  • Insert the value “yarn-client” into the field master in the spark interpreter, as shown in the picture below.


After these steps you can use your notebooks with spark running on a yarn cluster. So you can make use of all the resources in the queue you assigned spark on you cluster.

Apache Zeppelin: Visualization and Spark data processing

Apache Zeppelin

Apache Zeppelin is a web-based notebook for interactive data analytics. It comes will features for all the steps of data analysis:

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

Besides that feature set it also supports multiple languages in the backend. Currently it supports languages like:

But there is also the possibility to add your own interpreter to Zeppelin. This makes this tool really flexible.
Another feature it has, is the built in integration of Apache Spark. It ships with the following features and more:

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from local filesystem or maven repository.
  • Canceling job and displaying its progress

It also has built in visualization, which is an improvemnt over using ipython notebooks I think. The visualization covers the most basic graphs, like:

  • Tables
  • BarCharts
  • Pies
  • Scatterplot
  • Lines

These visualizations can be used with all interpreters and are always the same. So you can show data from Postgres and Spark in the same notebook with the same functions used. There is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy, if you embedd a notebook in your own website.

Apache Spark 2.0

Apache Spark has release version 2.0, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It can now run all 99 TPC-DS. The new SQL parser supports ANSI-SQL and HiveQL and sub queries.
Another new features is native csv data source support, based on the already existing Databricks spark csv module. I personally used this module as well as the spark avro module before and they make working with data in those formats really easy.
Also there were some new features added to MLlib:

  • PySpark includes new algorithms like LDA, Gaussian Mixture Model, Generalized Linear Regression
  • SparkR now includes generalized linear models, naive Bayes, k-means clustering, and survival regression.

Spark increased its performance with the release of 2.0. The goal was to make Spark 2.0 10x faster and Databricks shows this performance tuning in a notebook.

All of these improvements make Spark a more complete tool for data processing and analysing. The added SQL2003 support even makes it available for a larger user base and more importantly makes it easier to migrate existing applications from databases to Spark.

Python vs. R for Data Science

In Data Science there are two languages that compete for users. On one side there is R, on the other Python. Both have a huge userbase, but there is some discussion, which is better to use in a Data Science context. Lets explore both a bit:

R is a language and programming environment especially developed for statistical computing and grahics. It has been around some time and several thousand packages to tackle statistical problems. With RStudio it also provides an interactive programming environment, that makes analysing data pretty easy.

Python is a full range programming language, that makes it easy to integrate into a company wide system. With the packages Numpy, Pandas and Scikit-learn, Mathplotlib in combination with IPython, it also provides a full range suite for statistical computing and interactive programming environment.

R was developed solely for the purpose of statistical computing, so it has some advantages there, since it is specialized and has been around some years. Python is coming from a programming language and moves now into the data analysis field. In combination with all the other stuff it can do, websites and easy integrations into Hadoop Streaming or Apache Spark.
And for people who want to use the best of both sides can always use the R Python integration Rpy2.

I personally am recently working with Python for my ETL processes, including MapReduce, and anlysing data, which works awesome in combination with IPython as interactive development tool.

Apache Spark: The Next Big (Data) Thing?

Since Apache Spark became a Top Level Project at Apache almost a year ago, it has seen some wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to doing pure MapReduce.
Written in Scala, it provides the ability to write applications fast in Java, Python and Scala, and the syntax isn’t that hard to learn. There are even tools available for using SQL (Spark SQL), Machine Learning (MLib) interoperating with Pythons Numpy, graphics and streaming. This makes Spark to a real good alternative for big data processing.
Another feature of Apache Spark is, that it runs everywhere. On top of Hadoop, standalone, in the cloud and can easily access diverse data stores, such as HDFS, Amazon S3, Cassandra, HBase.

The easy integration into Amazon Web Services is what makes it attractive to me, since I am using this already. I also like the Python integration, because latelly, that became my favourite language for data manipulation and machine learning.

Besides the official parts of Spark mentioned above, there are also some really nice external packages, that for example integrate Spark with tools such as PIG, Amazon Redshift, some machine learning algorithms, and so on.

Given the promised speed gain, the ease of use and the full range of tools available, and the integration in third party programms, such as Tableau or MicroStrategy, Spark seems to look into a bright future.

The inventors of Apache Spark also founded a company called databricks, which offers professional services around Spark.

Comparing Stinger to Impala

With Hadoop 2.0 and the new additions of Stinger and Impala I did a (not representive) test of the performance on a Virtual Box running on my desktop computer. It was using the following setup:

  • 4 GB RAM
  • Intel Core i5 2500 3.3 GHz

The datasets were the following:

  1. Dataset 1: 71.386.291 rows and 5 columns
  2. Dataset 2: 132.430.086 rows and 4 columns
  3. Dataset 3: partitioned data of 2.153.924 rows and 32 columns
  4. Dataset 4: unpartitioned data of 2.153.924 rows and 32 columns

The results were the following:

QueryHive (0.10.0)ImpalaStinger (Hive 0.12.0)
Join tables167.61 sec31.46 sec122.58 sec
Partitioned tables Dataset 342.45 sec0.29 sec20.97 sec
Unpartitioned tables Dataset 447.92 sec1.20 sec36.46 sec
Grouped Select Dataset 1533.83 sec81.11 sec444.634 sec
Grouped Select Dataset 2323.56 sec49.72 sec313.98 sec
Count Dataset 1252.56 sec66.48 sec243.91 sec
Count Dataset 2158.93 sec41.64 sec174.46 sec
Compare Impala vs. Stinger
Compare Impala vs. Stinger

This shows that Stinger provides a faster SQL interface on Hive, but since it is still using Map / Reduce when calculating data it is no match for Impala that doesn’t use Map / Reduce. So using Impala makes sense when you want to analyse data in Hadoop using SQL even on a small installation. This should give you easy and fast access to all data stored in your Hadoop cluster, that was before not possible.
Facebook’s Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really seem like a big boost in making this project more accessible for many people.

SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data these last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an interface based on SQL for Hadoop is to make the technology available for more people. Companies that have used SQL for decades won’t just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop, as it’s build as a batch processing system, which can be slow in answering queries. These new products emerging are trying to speed up the already existing SQL product Apache released named Hive.
There are two approaches to bringing SQL to Hadoop:

  • SQL natively on Hadoop
  • DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

  • Stinger from HortonWorks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.
  • Impala from Coudera, which also claims speed up SQL queries compared to Hive. It is also design to co-exist with MapReduce and can be cleanly integrated into the Hadoop stack.
  • Drill from Apache, which is similar to Googles Dremel.

DBMS on Hadoop

Some example products in this category are:

  • Hadapt, which includes a PostgreSQL instance on each node and takes advantage of the distirubted filesystem for speed and supports advanced SQL functions. They recently introduced a feature called “Schemaless SQL” for their product. This integrates data such as JSON, Documents, etc. into their system and lets you access them by SQL. This stores the data in the original form on the HDFS and emerges columns in a Multistructured table as needed. They posted a detailed explanation here.
  • CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.
  • Tajo founded in South Korea is still in incubator mode with Apache, but will bear watching too.

The two different approaches have their benefits each, and to decide which fits you better, I would test both of them. The main issue with all the products is, that this is all relatively new and there is little experience with the technology yet. Some of the products even are still in development, only offering Beta access.
But here is where the future of Big Data will take us. Making the benefits of Hadoop available for more analysts by building an interface they already can use.

REST (Representational state transfer) APIs and Big Data

Getting data, huge amounts of data, out of some systems tends to be quite a hazzle sometimes. Often you are required to use techniques such as FTP or SSH for transfering files. But with RESTful APIs getting more attention in the last few years, there is a new way to get your data.
The charm of REST APIs is, that they are stateless and use HTTP methods explicitly. This makes getting data pretty straight forward:

  • Use POST to create a resource on the server.
  • Use GET to retrieve a resource.
  • Use PUT to change a resource.
  • User DELETE to remove a resource.

The result can be returned in any defined format, but mostly it is XML or JSON. Security is also provided, if you integrate authentification methods like OAUTH or LDAP.

This gives you new possibilities to integrate your data into webbased reporting systems, since you only have to use the HTTP protocol to get your data and can work on the results as they stream in.
Since most REST APIs have the possibility to store results of a request, you could get the same result again at a later time, without having to process it on the source system again.

Hadoop even provides a REST API called WebHDFS REST API developed by Hortonworks, which supports the complete filesystem interface of HDFS. This is a great help, if you are running applications using your Hadoop cluster that are not using Java. So you can mainpulate and access your data from about everywhere.