Comparing Stinger to Impala

With Hadoop 2.0 and the new additions of Stinger and Impala I did a (not representive) test of the performance on a Virtual Box running on my desktop computer. It was using the following setup:

  • 4 GB RAM
  • Intel Core i5 2500 3.3 GHz

The datasets were the following:

  1. Dataset 1: 71.386.291 rows and 5 columns
  2. Dataset 2: 132.430.086 rows and 4 columns
  3. Dataset 3: partitioned data of 2.153.924 rows and 32 columns
  4. Dataset 4: unpartitioned data of 2.153.924 rows and 32 columns

The results were the following:

QueryHive (0.10.0)ImpalaStinger (Hive 0.12.0)
Join tables167.61 sec31.46 sec122.58 sec
Partitioned tables Dataset 342.45 sec0.29 sec20.97 sec
Unpartitioned tables Dataset 447.92 sec1.20 sec36.46 sec
Grouped Select Dataset 1533.83 sec81.11 sec444.634 sec
Grouped Select Dataset 2323.56 sec49.72 sec313.98 sec
Count Dataset 1252.56 sec66.48 sec243.91 sec
Count Dataset 2158.93 sec41.64 sec174.46 sec
Compare Impala vs. Stinger
Compare Impala vs. Stinger

This shows that Stinger provides a faster SQL interface on Hive, but since it is still using Map / Reduce when calculating data it is no match for Impala that doesn’t use Map / Reduce. So using Impala makes sense when you want to analyse data in Hadoop using SQL even on a small installation. This should give you easy and fast access to all data stored in your Hadoop cluster, that was before not possible.
Facebook’s Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really seem like a big boost in making this project more accessible for many people.

SQL on Hadoop: Facebook’s Presto

Earlier this month Facebook open sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook’s answer to Cloudera’s Impala or Hortonwork’s Stinger already presented in an earlier post called SQL and Hadoop on this site.
Presto is unlike Hive and more like Impala, since it doesn’t rely on MapReduce for its queries. This makes it about 10 times faster than Hive on large datasets, or so Facebook claims in a blog post.
This product may have a huge impact on the further development of SQL on Hadoop tools, if it’s taken up by enough companies. But since there is no commercial goal linked to it right now, it seems more like Facebook will develop it as their needs increase. So they will not be hurried along.
Like Impala it does support a huge subset of ANSI SQL contrary to Hive’s SQL like HiveQL. So it again aims on making Hadoop more accessible for a broader audience of analyst, that already are familiar with SQL.
Analysis on Big Data sets have been strengthened by this release even more and the entry level investments for more companies to use Hadoop as data storage system are decreasing with every development in this direction.

This website is using Google Analytics. Please click here if you want to opt-out. Click here to opt-out.

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close