Analytics Platform: An Evolution from Data Lake

Analytics Platform

Having built a Data Lake for your company’s analytical needs, there soon will arise new use cases, that cannot be easily covered with the Data Lake architecture I covered in previous posts, like Apache HAWQ™: Building an easily accessable Data Lake. You will need to adapt or enhance your architecture to become more flexible. One way to make this flexibility happens, is to transform your Data Lake into an Analytics Platform.

Definition of an Analytics Platform

An Analytics Platform is a platform that provides all kind of services needed for building data products. This often exceeds the functionality of a pure RDBMS or even a Data Lake based on Apache HAWQ™. There are data products that have more requirements than a SQL inferface. Reporting and basic analysis are addressed by this setup, but products dealing with predictions or recommendations often have different needs. An Analytics Platform provides flexibility in the tools used. There can be, for example, a Apache HAWQ™ setup and at the same time an environment for running Tensorflow applications.

Using existing parts: Multi-colored YARN

When you are running a Hadoop Cluster™, you are already familiar with a resource manager. This manager is YARN. With YARN you can already deploy Linux Containers, and support for Docker containers has already progressed pretty far (YARN-3611). Building complex applications and managing them with YARN is called Multi-colored YARN by Hortonworks.
Following through on this idea you will have a cluster with just some central services installed directly on bare metal. You will deploy the rest in containers, as shown in the images below.

Analytics Platform
Analytics Platform based on YARN and Docker

The example makes use of Kubernetes and Docker for virtualization and provides the following services on bare meta, since they are needed by most applications:

  • Ambari
  • Kuberneted
  • YARN
  • ZooKeeper
  • HDFS

Especially the HDFS is important as a central services. This makes it possible for all applications to access the some data. The Picture above shows, that there can be several instances of a Hadoop distribution. This is possible even in different version. So the platform allows for multi tenancy, while all instances are still processing the same data.

Development changes

Having an Analytics Platform makes the development of data products easier. There always was the problem of developing a product on a sample of the data, when you used development and staging systems, as decribed by me here. In same cases these did not contain all possibly combinations of data. This could result in error after a deployemnt on the production environment. Even going through all development and staging could not change this. This new approach allows you to deploy all three systems on the same data. So there you can account for all data induced errors on the development and staging systems already.
You can even become more agile in your development process. The picture below shows an example deployment process, that uses this system.

Analytics Platform: Deployment Process

Conclussion

Moving from a pure Data Lake to an Analytics Platform give you are flexibility and helps in the development of data products. Especially since you can develop on the same data as is available on production. Of course it brings more complexity to an already complex environment. But since it is possible to keep YARN as resource manager and move to a more agile way of development and deployment, it might be worth considering. Once Multi-Colored YARN is finished, it will be easier to make this happen.

Apache HAWQ: Building an easily accessable Data Lake

Apache HAWQ for Data Lake ArchitectureData Lake vs Datawarehouse

The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including:

All of these tools have advantages and disadvantages when used to process data, but all of them combined make your data accessible. This is the first step in building a Data Lake. You have to have your data, even schemaless data accessible to your customers.
A classical Datawarehouse on the opposite only contains structured data, that is at least preproccessed and has a fixed schema. Data in a classical Datawarehouse is not the raw data entered into the system. You need a seperate staging area for tranformations. Usually this is not accessible for all consumers of your data, but only for the Datawarehouse developers.

Data Lake Architecture using Apache HAWQ

It is a challenge to build a Data Lake with Apache HAWQ, but this can be overcome on the design part. One solution to build such a system can be seen in then picture below.
data_lake_with_hawq

Data Entry

To make utilization of Apache HAWQ possible the starting point is a controlled Data Entry. This is a compromise between schemaless and schematized data. Apache AVRO is a way to do this. Schema evolution is an integral part of AVRO and it provides structures to save unstrcutured data, like maps and arrays. A separate article about AVRO will be one of this next topics here, to explain schema evolution and how to make the most of it.
Data structured in schema can then be pushed message wise into a messaging queue. Chose a queue that fits your needs best. If you need secure transactions RabbitMQ may be the right choice. Another option is Apache Kafka.

Pre-aggregating Data

Processing and storing single message on HDFS is not an option, so there is need of another system to aggregate messages before storing them on HDFS. For this a software project called Apache Nifi is a good choice. This system comes with processors that make things like this pretty easy. It has a processor called MergeContent that merges single AVRO messages and removes all headers but one, before writing them to HDFS.
nifi_merge_messages
If those messages are still not above the HDFS blocksize, there is the possibility to read messages from HDFS and merge them into larger files still.
nifi_merge_files

Making data available in the Data Lake

Use Apache Hive to make data accessible from AVRO format. HAWQ could read the AVRO files directly, but Hive handles schema evolution in a more effective way. For example, if there is the need to add a new optional field to an existing schema, add a default value for that field and Hive will fill entries from earlier messages with this value. So if HAWQ now accesses this Hive table it automatically reads the default value for field added later with default values. It could not do this by itself. Hive also has a more robust way of handling and extracting keys and values from map fields right now.

Data Lake with SQL Access

All data is available in Apache HAWQ now. This enables tranformations using SQL and making all of your data accessible by a broad audience in your company. SQL skills are more common than say Spark programming in Java, Scala or PySpark. From here it is possible to give analysts access to all of the data or building data marts for single subjects of concern using SQL transformations. Connectivity to reporting tools like Tableau is possible with a driver for Postgresql. Even advanced analytics are possible, if you install Apache MADlib on your HAWQ cluster.

Using Data outside of HAWQ

It is even possible to use all data outside of HAWQ, if there is a need for it. Since all data is available in AVRO format, accessing it by means of Apache Spark with Apache Zeppelin is also possible. Hive queries are possible too, since all data is registered there using external tables, which we used for integration into HAWQ.
Accessing results of such processing in HAWQ is possible too. Save the results in AVRO format for integration in the way described above or use “hawq register” to access parquet files directly from HDFS.

Conclusion

Using Apache HAWQ as base of a Data Lake is possible. Just take some contraints into consideration. But entering data with semi-structured with AVRO format also saves work later when you process the data. The main advantage is, that you can utilize SQL as an interface to all of you data. This enables many people in your company to access your data and will help you on your way to Data Driven decisions.