Apache HAWQ: Using HAWQ gpfdist for small tables

HAWQ gpfdist: What about small data?

Once Apache HAWQ is installed and running on your HDFS, the question arises: how do we effectively store small data? This matters because the recommended minimum block size for Apache HAWQ is 128 MB. Storing small dimensional or mapping data of just a few kilobytes would waste most of a block, and you will probably need a lot of such tables for your ELT process. This is where HAWQ gpfdist comes in.

HAWQ gpfdist: What is it?

HAWQ gpfdist is a file server for Apache HAWQ. It was originally used in Pivotal Greenplum to load data into the database in parallel, hence the name Greenplum Parallel File Server, or gpfdist for short. Its main purpose is to read data from or write data to an external system, which comes with the drawback that the data cannot be updated in place.

HAWQ gpfdist: How to use it?

Install gpfdist on a separate server. It is essentially a lightweight file server that Apache HAWQ can use to access external data. Since it serves files from a normal filesystem rather than HDFS, the block size matters far less here, which makes it a good workaround for storing small files for use in HAWQ.
Since the original purpose was to read or write files from or to external systems, you need two tables for each metadata or dimension data file: one to write into the file and one to read from it. Here is an example:

Writable table

DROP EXTERNAL TABLE IF EXISTS schema.gpfdist_dimension_write;
CREATE WRITABLE EXTERNAL TABLE schema.gpfdist_dimension_write
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV';

Readable table

DROP EXTERNAL TABLE IF EXISTS schema.gpfdist_dimension_read;
CREATE READABLE EXTERNAL TABLE schema.gpfdist_dimension_read
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.csv'
)
FORMAT 'CSV';
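
With both tables defined, writes go through the writable table and reads through the readable one. A minimal sketch, reusing the definitions above (the sample values are made up for illustration):

INSERT INTO schema.gpfdist_dimension_write (id, value_name, value_description)
VALUES (1, 'country_de', 'Germany');

SELECT id, value_name, value_description
FROM schema.gpfdist_dimension_read
WHERE id = 1;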

Now you can access the file gpfdist_dimension.csv through SQL in your HAWQ instance and query it like a normal table. CSV is not the only file format supported. You can also use:

  • Plain text with any delimiter: FORMAT 'TEXT' (DELIMITER '~')
  • Fixed-width files: FORMAT 'CUSTOM' (formatter=fixedwidth_in, id=10, value_name=100, value_description=200)
  • CSV with headers: FORMAT 'CSV' (HEADER)
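
To illustrate, here is a minimal sketch of a readable table over a '~'-delimited text file; the file name gpfdist_dimension.txt is an assumed example, not one of the files used above:

CREATE READABLE EXTERNAL TABLE schema.gpfdist_dimension_txt
(
  id int,
  value_name varchar(100),
  value_description varchar(500)
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension.txt'
)
FORMAT 'TEXT' (DELIMITER '~');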

Conclusion

If you use gpfdist this way, you can access small data in your Apache HAWQ data lake. This comes with the drawback that the tables cannot be updated, but dimensional or mapping data should not change that often. To add rows, you can use normal INSERT commands on the writable table, so there is still some flexibility in this solution. If dimensional data does change, it is usually best to keep a history of the previous state anyway, so the need to delete rows should rarely arise. Just add validity dates to your rows and this drawback is no longer one.
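
A possible sketch of such a history-keeping setup, assuming additional valid_from and valid_to columns that are not part of the definitions above:

CREATE WRITABLE EXTERNAL TABLE schema.gpfdist_dimension_hist_write
(
  id int,
  value_name varchar(100),
  value_description varchar(500),
  valid_from date,
  valid_to date
)
LOCATION (
  'gpfdist://gpfdist.server.com:8081/gpfdist_dimension_hist.csv'
)
FORMAT 'CSV';

-- instead of updating a row, append a new version with its own validity range
INSERT INTO schema.gpfdist_dimension_hist_write
VALUES (1, 'country_de', 'Germany (renamed)', DATE '2016-01-01', DATE '9999-12-31');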


Apache Spark: The Next Big (Data) Thing?

Since Apache Spark became a Top Level Project at Apache almost a year ago, it has seen wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it looks like a real alternative to pure MapReduce.
Written in Scala, it lets you write applications quickly in Java, Python and Scala, and the syntax isn't that hard to learn. There are also tools for SQL (Spark SQL), machine learning (MLlib) interoperating with Python's NumPy, graph processing and streaming. This makes Spark a very good alternative for big data processing.
Another feature of Apache Spark is that it runs everywhere: on top of Hadoop, standalone, or in the cloud, and it can easily access diverse data stores such as HDFS, Amazon S3, Cassandra and HBase.

The easy integration with Amazon Web Services is what makes it attractive to me, since I am already using AWS. I also like the Python integration, because lately Python has become my favourite language for data manipulation and machine learning.

Besides the official parts of Spark mentioned above, there are also some really nice external packages that, for example, integrate Spark with tools such as Pig, Amazon Redshift, additional machine learning algorithms, and so on.

Given the promised speed gain, the ease of use, the full range of tools available, and the integration with third-party programs such as Tableau or MicroStrategy, Spark seems to have a bright future ahead.

The inventors of Apache Spark also founded a company called Databricks, which offers professional services around Spark.


Big Data and Data Warehouse Architecture

Further development and new additions to the Hadoop framework, such as Stinger from Hortonworks or Impala from Cloudera, try to bridge the gap between traditional EDWH architectures and big data architectures.
Especially the Stinger.next initiative, with its goal of speeding up Hive and delivering the SQL:2011 standard on MapReduce-based Hadoop clusters, makes this technology usable for developers with a SQL background. This next iteration of Hive optimization also brings an ACID framework with transactions and writable tables, which is especially useful in data warehouse contexts, for example when you need to add metadata.
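
As a rough sketch of what such a transactional table looks like on the Hive side (table and column names are assumed for illustration; exact requirements such as ORC storage and bucketing depend on the Hive version):

-- ACID tables in Hive require ORC storage (and bucketing in early releases)
CREATE TABLE report_meta (
  report_id int,
  report_name string,
  last_loaded timestamp
)
CLUSTERED BY (report_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- with the ACID framework enabled, rows can be updated in place
UPDATE report_meta
SET last_loaded = cast('2015-01-01 00:00:00' as timestamp)
WHERE report_id = 1;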

With these developments it seems plausible that Hadoop, and with it Big Data as a whole, will move from being an ETL platform for traditional EDWH architectures built on traditional database systems to a unified platform, where Hadoop stores everything from raw unstructured data to structured data from the company's transactional systems and the metadata created for reporting purposes. Access to all data would then be available in the same system and queryable with SQL.
Standard reporting and deeper analysis on all data could then run on the same system, so that analysts and traditional BI developers share one platform and a better understanding of all the data needed and used in the data warehouse system.

I already did a benchmark on query speed for MySQL, Stinger and Impala here and will update this, once Stinger.next is out.


SQL on Hadoop: Facebook’s Presto

Earlier this month Facebook open sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook's answer to Cloudera's Impala or Hortonworks' Stinger, both already presented in an earlier post called SQL and Hadoop on this site.
Presto is unlike Hive and more like Impala, since it doesn’t rely on MapReduce for its queries. This makes it about 10 times faster than Hive on large datasets, or so Facebook claims in a blog post.
This product may have a huge impact on the further development of SQL-on-Hadoop tools if it is taken up by enough companies. But since there is no commercial goal linked to it right now, it seems more likely that Facebook will develop it as their own needs grow, so they will not be hurried along.
Like Impala, it supports a large subset of ANSI SQL, in contrast to Hive's SQL-like HiveQL. So it again aims at making Hadoop more accessible to a broader audience of analysts who are already familiar with SQL.
Analysis of big data sets has been strengthened even further by this release, and the entry-level investment for companies to use Hadoop as a data storage system decreases with every development in this direction.


SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data over the last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an SQL interface for Hadoop is to make the technology available to more people. Companies that have used SQL for decades won't just stop and use something different for analysing and accessing their data.
Another reason lies in the nature of Hadoop: it is built as a batch processing system, which can be slow to answer queries. The new products emerging try to speed up the already existing SQL product Apache released, named Hive.
There are two approaches to bringing SQL to Hadoop:

  • SQL natively on Hadoop
  • DBMS on Hadoop

SQL natively on Hadoop

Some example products in this category are:

  • Stinger from Hortonworks, which claims to make SQL on Hadoop 100x faster than Hive. This product is based on Hadoop 2.0 and the new YARN framework.
  • Impala from Cloudera, which also claims to speed up SQL queries compared to Hive. It is also designed to co-exist with MapReduce and integrates cleanly into the Hadoop stack.
  • Drill from Apache, which is similar to Google's Dremel.

DBMS on Hadoop

Some example products in this category are:

  • Hadapt, which includes a PostgreSQL instance on each node, takes advantage of the distributed filesystem for speed, and supports advanced SQL functions. They recently introduced a feature called "Schemaless SQL" for their product. It integrates data such as JSON and documents into their system and lets you access it with SQL. The data is stored in its original form on HDFS, and columns emerge in a multistructured table as needed. They posted a detailed explanation here.
  • CitusDB, which also includes a PostgreSQL instance on each node. This means advanced SQL functions are supported here too.
  • Tajo founded in South Korea is still in incubator mode with Apache, but will bear watching too.

The two approaches each have their own benefits, and to decide which fits you better, I would test both. The main issue with all of these products is that they are relatively new and there is little experience with the technology yet. Some of them are still in development, offering only beta access.
But this is where the future of Big Data will take us: making the benefits of Hadoop available to more analysts by building an interface they can already use.
