Apache HAWQ: Using HAWQ gpfdist for small tables

What about small data?

Once Apache HAWQ is installed and running on your HDFS, the question arises: “How do we store small data effectively?” This matters because the minimum recommended block size for Apache HAWQ is 128 MB. Storing small dimension or mapping data of just a few kilobytes seems wasteful here, especially since you probably need a lot of these tables for your ELT process. This is where HAWQ gpfdist comes in.

HAWQ gpfdist: What is it?

HAWQ gpfdist is a file server for Apache HAWQ. It was originally used in Pivotal Greenplum to load data into the database in parallel, hence the name Greenplum Parallel File Server, or gpfdist for short. Its main purpose is to read data from or write data to an external system. The drawback is that the data cannot be updated in place.

HAWQ gpfdist: How to use it?

Install gpfdist on a separate server. It is essentially a lightweight file server, written in C, that Apache HAWQ can use to access external data. Since it runs on a normal filesystem and not on HDFS, block size matters much less here, which makes it a good workaround for storing small files for use in HAWQ.
Since its original purpose was to read files from or write files to external systems, you need two tables for each metadata or dimension data file: one to write to the file and one to read from it. Here is an example:

Writable table

-- Writable side: rows inserted here are sent to the file served by gpfdist.
DROP EXTERNAL TABLE IF EXISTS schema.gpfdist_dimension_write;
CREATE WRITABLE EXTERNAL TABLE schema.gpfdist_dimension_write
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV';

Readable table

-- Readable side: points at the same file, so queries see the written rows.
DROP EXTERNAL TABLE IF EXISTS schema.gpfdist_dimension_read;
CREATE READABLE EXTERNAL TABLE schema.gpfdist_dimension_read
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV';
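
Round trip (sketch)

With both tables in place, a round trip could look roughly like the following. This is only a sketch: the sample rows, the fact table schema.fact_orders and its columns order_id and country_id are made up for illustration and do not come from a real system.

```sql
-- Add rows to gpfdist_dimension.csv through the writable table.
-- The values are illustrative only.
INSERT INTO schema.gpfdist_dimension_write (id, value_name, value_description)
VALUES
  (1, 'DE', 'Germany'),
  (2, 'FR', 'France');

-- Read the rows back through the readable table, here joined to a
-- hypothetical fact table (schema.fact_orders is an assumption).
SELECT f.order_id,
       d.value_name,
       d.value_description
FROM schema.fact_orders f
JOIN schema.gpfdist_dimension_read d
  ON d.id = f.country_id;
```

Writes always go through the _write table and reads through the _read table; the shared file name in LOCATION is what ties the two together.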

Now you can access the file gpfdist_dimension.csv through SQL in your HAWQ instance and query it like a normal table. CSV is not the only file format supported. You can also use:

  • Plain text with any delimiter: FORMAT 'TEXT' (DELIMITER '~')
  • Fixed-width files: FORMAT 'CUSTOM' (formatter=fixedwidth_in, id=10, value_name=100, value_description=200)
  • CSV with headers: FORMAT 'CSV' (HEADER)
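
Readable table with a custom delimiter (sketch)

To show where such a format clause goes, here is the readable table from above again, this time for a tilde-delimited text file. The table name gpfdist_dimension_text_read and the file name gpfdist_dimension.txt are assumptions chosen for this example.

```sql
-- Same columns as before, but the file is plain text with '~' as delimiter.
-- Table and file names are hypothetical.
DROP EXTERNAL TABLE IF EXISTS schema.gpfdist_dimension_text_read;
CREATE READABLE EXTERNAL TABLE schema.gpfdist_dimension_text_read
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.txt'
)
FORMAT 'TEXT' (DELIMITER '~');
```

Only the FORMAT clause and the file name differ from the CSV version; the column list and the gpfdist URL scheme stay the same.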

Conclusion

If you use gpfdist this way, you can access small data in your Apache HAWQ data lake. This comes with the drawback that the tables are not updatable, but dimension or mapping data should not change that often. To add rows, you can use normal INSERT commands on the writable table, so the solution still offers some flexibility. Dimension data rarely changes, and when it does, it is best to keep a history of the previous values, so the need to delete rows should not arise. Just add validity dates to your rows and this drawback largely disappears.
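
Validity dates (sketch)

One possible way to apply the validity-date idea is sketched below. Because existing rows cannot be changed, this sketch stores only a start date and treats the newest row per id as the current one. The table names, the file gpfdist_dimension_hist.csv and the column valid_from are assumptions for illustration.

```sql
-- Writable and readable tables over one history file (names are hypothetical).
CREATE WRITABLE EXTERNAL TABLE schema.gpfdist_dimension_hist_write
(
  id int,
  value_name varchar(100),
  value_description varchar(500),
  valid_from date
)
LOCATION ('gpfdist://gpfdist.server.com:8081/gpfdist_dimension_hist.csv')
FORMAT 'CSV';

CREATE READABLE EXTERNAL TABLE schema.gpfdist_dimension_hist_read
(
  id int,
  value_name varchar(100),
  value_description varchar(500),
  valid_from date
)
LOCATION ('gpfdist://gpfdist.server.com:8081/gpfdist_dimension_hist.csv')
FORMAT 'CSV';

-- A changed value is added as a new row instead of an update.
INSERT INTO schema.gpfdist_dimension_hist_write
VALUES (1, 'DE', 'Federal Republic of Germany', current_date);

-- Current view: per id, keep the row with the latest valid_from
-- that is not in the future.
SELECT id, value_name, value_description
FROM (
  SELECT d.*,
         row_number() OVER (PARTITION BY id ORDER BY valid_from DESC) AS rn
  FROM schema.gpfdist_dimension_hist_read d
  WHERE valid_from <= current_date
) latest
WHERE rn = 1;
```

Older rows stay in the file, so the full history remains available by querying the readable table without the rn filter.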

Author: Marc

My career so far has allowed me to explore the potential of analysis and data mining across a broad range of industries and data sources. I have experience ranging from customer relationship management in several industries to optimizing the acquisition of new customers through data mining. I can squeeze information and knowledge out of all available kinds of data to optimize processes in a company.
