Comparing Stinger to Impala

With Hadoop 2.0 and the new additions of Stinger and Impala, I ran a (not representative) performance test in a VirtualBox VM running on my desktop computer. It was using the following setup:

4 GB RAM
Intel Core i5 2500 3.3 GHz

The datasets were the following:

Dataset 1: 71,386,291 rows and 5 columns
Dataset 2: 132,430,086 rows and 4 columns
Dataset 3: partitioned data of 2,153,924 rows and 32 columns
Dataset 4: unpartitioned data of 2,153,924 rows and 32 columns

The results were the following:

Query	Hive (0.10.0)	Impala	Stinger (Hive 0.12.0)
Join tables	167.61 sec	31.46 sec	122.58 sec
Partitioned tables Dataset 3	42.45 sec	0.29 sec	20.97 sec
Unpartitioned tables Dataset 4	47.92 sec	1.20 sec	36.46 sec
Grouped Select Dataset 1	533.83 sec	81.11 sec	444.634 sec
Grouped Select Dataset 2	323.56 sec	49.72 sec	313.98 sec
Count Dataset 1	252.56 sec	66.48 sec	243.91 sec
Count Dataset 2	158.93 sec	41.64 sec	174.46 sec
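
To make the differences easier to compare, the timings can be turned into speedup factors relative to Hive 0.10.0. The following is only a small sketch: the numbers are copied from the table above, while the names `timings` and `speedups` are made up for this illustration and are not part of the original test.

```python
# Timings in seconds, copied from the results table:
# (Hive 0.10.0, Impala, Stinger / Hive 0.12.0)
timings = {
    "Join tables": (167.61, 31.46, 122.58),
    "Partitioned tables Dataset 3": (42.45, 0.29, 20.97),
    "Unpartitioned tables Dataset 4": (47.92, 1.20, 36.46),
    "Grouped Select Dataset 1": (533.83, 81.11, 444.634),
    "Grouped Select Dataset 2": (323.56, 49.72, 313.98),
    "Count Dataset 1": (252.56, 66.48, 243.91),
    "Count Dataset 2": (158.93, 41.64, 174.46),
}


def speedups(hive, impala, stinger):
    """Return how many times faster Impala and Stinger ran than Hive 0.10.0.

    A value below 1 means the engine was slower than Hive 0.10.0.
    """
    return hive / impala, hive / stinger


for query, (hive, impala, stinger) in timings.items():
    impala_x, stinger_x = speedups(hive, impala, stinger)
    print(f"{query:32s} Impala {impala_x:7.1f}x   Stinger {stinger_x:5.2f}x")
```

A factor below 1 means the engine was slower than Hive 0.10.0. That is the case for Stinger on Count Dataset 2, where 174.46 sec is slower than Hive's 158.93 sec.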

These results show that Stinger provides a faster SQL interface on Hive for most of the queries (Count on Dataset 2 is the exception), but since it still uses MapReduce to compute results, it is no match for Impala, which does not use MapReduce. Using Impala therefore makes sense when you want to analyse data in Hadoop with SQL, even on a small installation. It should give you easy and fast access to all the data stored in your Hadoop cluster, which was not possible before.
Facebook’s Presto should achieve nearly the same results, since the underlying technique is similar. These latest additions and changes to the Hadoop framework really look like a big boost in making the project accessible to many more people.

Author

Marc Matt

