Apache HAWQ: Using HAWQ gpfdist for small tables

HAWQ gpfdistWhat about small data?

Once Apache HAWQ is installed and running on your HDFS, there is the question of “How do we effectively store small data?”. This is important, since the minimal proposed blocksize for Apache HAWQ is 128MB. Saving small dimensional or mapping data of just a few kilobytes seems like a waste here. This is especially the case, since you propably need a lot of those for your ELT process. This is where HAWQ gpfdist comes in.

HAWQ gpfdist: What is it?

HAWQ gpfdist is a file server for Apache HAWQ. It was orginally used in Pivotal Greenplum to load data parallely into the database. Hence the name Greenplum Parallel File Server, in short gpfdist. The main intention of it is, to write or read data from an external system. This results in the drawback, that data cannot be updated.

HAWQ gpfdist: How to use it?

Install gpfdist on a seperate server. It mainly is a server written in Java, that can be used inside of Apache HAWQ to access external data. Since it is running on a normal filesystem and not HDFS, blocksize does not matter here as much and it seems a good workaround to save small files for use in HAWQ.
Since the original purpose was to read or write file from or to external systems, you need to have two table for each meta- or dimensiondate file. Once to write into the file and one to read from the file. Here is an example:

Writable table

drop external table if EXISTS schema.gpfdist_dimension_write;
CREATE writable EXTERNAL TABLE schema.gpfdist_dimension_write
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
 LOCATION (
'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV' ;

Readable table

drop external table if exists schema.gpfdist_dimension_read;
CREATE readable EXTERNAL TABLE schema.gpfdist_dimension_read
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
 LOCATION (
'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV' ;

Now you can access the file gpfdist_dimension.csv through SQL in your HAWQ instance and query it like a normal table. CSV is not the only file format supported. You can also use:

  • Plain text with any delimiter: ‘TEXT’ (DELIMITER ‘~’)
  • fixed width files: FORMAT ‘CUSTOM’ (formatter=fixedwidth_in, id=10, value_name=100, value_description=200)
  • csv with headers: FORMAT ‘CSV’ (HEADER)

Conclussion

If you use gpfdist this way, you can access small data in you Apache HAWQ Data Lake. This comes with drawback, that tables are not updateable, but dimensional- or mapping data should not change that often. To add rows to a table, you can use normal insert commands on the created writable table. So there is flexibility in this solution. Dimensional data does not change and if it does, it is best to keep a history of what was before. So the need to delete rows should not arise. Just add valid dates to your rows and this drawback is not one any more.

Apache AVRO: Data format for evolution of data

Apache AVROFlexible Data Format: Apache AVRO

Apache AVRO is a data serialization format. It comes with an data definition format that is easy to understand. With the possibility to add optional fields there is a solution for evolution of the schemas for the data.

Defining a Schema

Defining a schema in Apache AVRO is quite easy, since it is a JSON object in the form of:

{ 
"type": "typename",
"name": "name of field",
"doc": "documentation of field"
}

A schema does consists of:

  • a namespace
  • name
  • type = record
  • documentation
  • fields

A typical schema would look like this:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		}
      ]
}

Changing a schema downwards compatible

If you now want to add a new field and stay compatible to your existing schema, you can just do the following:

{
	"namespace": "info.datascientists.avro",
	"type": "record",
	"name": "RandomData",
	"doc": "data sets contain random data.",
	"fields": [
		{
			"name": "long_field",
			"type": "long",
			"doc": "a field containg a number"
		},
		{
			"name": "string_field",
			"type": "string",
			"doc": "a field containing a string"
		},
                {
                       "name": "optional_new_field",
                       "type": ["null", "string"],
                       "default": "New Field",
                       "doc": "This is a new options field"
                }
      ]
}

This change is still compatible to the version above. It is marked as options by [“null”, “string”] in the type field. The attribute default will fill this field with the value documented here, if the field is not existing in the data.

Serializing data using the schema

Once the schema is defined, it can be used to serialize the data. This serialization also serves as a compression format of about 30% in comparison to normal text. Serializazion is possible in a wide range of programming languages, but it is best implemented in Java. A tutorial on how to serialize data using a schema can be found here.

Reading the data with Apache Hive

Data stored in Apache AVRO is easily accessible if read by Hive external tables. The data is autmatically deserialized and made human readable. Apache Hive even supports default values on schema changes. If new fields are added with a default value, the Hive table can read all versions of the schema in the same table.

Apache HAWQ: Building an easily accessable Data Lake

Apache HAWQ for Data Lake ArchitectureData Lake vs Datawarehouse

The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including:

All of these tools have advantages and disadvantages when used to process data, but all of them combined make your data accessible. This is the first step in building a Data Lake. You have to have your data, even schemaless data accessible to your customers.
A classical Datawarehouse on the opposite only contains structured data, that is at least preproccessed and has a fixed schema. Data in a classical Datawarehouse is not the raw data entered into the system. You need a seperate staging area for tranformations. Usually this is not accessible for all consumers of your data, but only for the Datawarehouse developers.

Data Lake Architecture using Apache HAWQ

It is a challenge to build a Data Lake with Apache HAWQ, but this can be overcome on the design part. One solution to build such a system can be seen in then picture below.
data_lake_with_hawq

Data Entry

To make utilization of Apache HAWQ possible the starting point is a controlled Data Entry. This is a compromise between schemaless and schematized data. Apache AVRO is a way to do this. Schema evolution is an integral part of AVRO and it provides structures to save unstrcutured data, like maps and arrays. A separate article about AVRO will be one of this next topics here, to explain schema evolution and how to make the most of it.
Data structured in schema can then be pushed message wise into a messaging queue. Chose a queue that fits your needs best. If you need secure transactions RabbitMQ may be the right choice. Another option is Apache Kafka.

Pre-aggregating Data

Processing and storing single message on HDFS is not an option, so there is need of another system to aggregate messages before storing them on HDFS. For this a software project called Apache Nifi is a good choice. This system comes with processors that make things like this pretty easy. It has a processor called MergeContent that merges single AVRO messages and removes all headers but one, before writing them to HDFS.
nifi_merge_messages
If those messages are still not above the HDFS blocksize, there is the possibility to read messages from HDFS and merge them into larger files still.
nifi_merge_files

Making data available in the Data Lake

Use Apache Hive to make data accessible from AVRO format. HAWQ could read the AVRO files directly, but Hive handles schema evolution in a more effective way. For example, if there is the need to add a new optional field to an existing schema, add a default value for that field and Hive will fill entries from earlier messages with this value. So if HAWQ now accesses this Hive table it automatically reads the default value for field added later with default values. It could not do this by itself. Hive also has a more robust way of handling and extracting keys and values from map fields right now.

Data Lake with SQL Access

All data is available in Apache HAWQ now. This enables tranformations using SQL and making all of your data accessible by a broad audience in your company. SQL skills are more common than say Spark programming in Java, Scala or PySpark. From here it is possible to give analysts access to all of the data or building data marts for single subjects of concern using SQL transformations. Connectivity to reporting tools like Tableau is possible with a driver for Postgresql. Even advanced analytics are possible, if you install Apache MADlib on your HAWQ cluster.

Using Data outside of HAWQ

It is even possible to use all data outside of HAWQ, if there is a need for it. Since all data is available in AVRO format, accessing it by means of Apache Spark with Apache Zeppelin is also possible. Hive queries are possible too, since all data is registered there using external tables, which we used for integration into HAWQ.
Accessing results of such processing in HAWQ is possible too. Save the results in AVRO format for integration in the way described above or use “hawq register” to access parquet files directly from HDFS.

Conclusion

Using Apache HAWQ as base of a Data Lake is possible. Just take some contraints into consideration. But entering data with semi-structured with AVRO format also saves work later when you process the data. The main advantage is, that you can utilize SQL as an interface to all of you data. This enables many people in your company to access your data and will help you on your way to Data Driven decisions.

Apache HAWQ: Full SQL and MPP support on HDFS

Apache HAWQPivotal ported their massively parallel processing (MPP) database Greenplum to Hadoop and made it open source as an incubating project at Apache, called Apache HAWQ. This bring together full ANSI SQL with MPP capabilities and Hadoop integration.

The integration in an existing Hadoop installation is easy, as you can integrate all existing data via external tables. This is done using the pxf API to query external data. This API is customizable, but already brings the most used formats ready made. These include:

To access and store small amounts of data Apache HAWQ has an interface called gpfdist. This enables you to store data outside of your HDFS and still access it within HAWQ to join with the data stored in HDFS. This is especially handy, when you need small tables for dimension or mapping data in Apache HAWQ. This data will then not use a whole block of your HDFS, that is mostly empty.

Apache HAWQ even come integrated with MADlib, also an Apache incubating product, developed by Pivotal. MADlib is a Machine Learning framework, based on SQL. So moving data between different tools for analysing it, is not need anymore. If you have stored your data in Apache HAWQ, you can mine it in the database directly and don’t have to export it, e.g. to a Spark client or tools like Knime or RapidMiner.

MADlib algorithms

MADLib comes with algorithms in the following categories:

  • Classification
  • Regression
  • Clustering
  • Topic Modelling
  • Assocition Rule Mining
  • Descriptive Statistics

By using HAWQ you even can leverage tools like Tableau with real time database connections, which was not satisfactory so far when you used Hive.

Apache Zeppelin: Use with remote Spark cluster and Yarn

Apache Zeppelin is pretty usefull for interactive programming using the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post.
But the real power in using Spark with Zeppelin lies in its easy way to connect it to your existing Spark cluster using YARN. The following steps are necessary:

  • Copy your Hadoop configuration files to your Zeppelin installation under $ZEPPELIN_HOME/conf
  • Restart your Zeppelin Notebook
  • Insert the value “yarn-client” into the field master in the spark interpreter, as shown in the picture below.

spark_interpreter_yarn

After these steps you can use your notebooks with spark running on a yarn cluster. So you can make use of all the resources in the queue you assigned spark on you cluster.