Pivotal ported their massively parallel processing (MPP) database Greenplum to Hadoop and made it open source as an incubating project at Apache, called Apache HAWQ. This brings together full ANSI SQL, MPP capabilities, and Hadoop integration.
Integration into an existing Hadoop installation is easy, as you can access all existing data via external tables. This is done using the PXF API to query external data. The API is customizable, but it already ships with the most commonly used formats, such as plain text files on HDFS, Avro, Hive, and HBase tables.
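To make this concrete, here is a minimal sketch of how a PXF-backed external table could be defined and queried from Python; since HAWQ speaks the PostgreSQL protocol, psycopg2 works as the client library. The hostnames, PXF port, HDFS path, and table layout are assumptions for the example, not taken from a real setup.

```python
# Minimal sketch: defining a PXF-backed external table in HAWQ from Python.
# Connection details and the HDFS path below are placeholders.
import psycopg2

conn = psycopg2.connect(host="hawq-master", dbname="analytics", user="gpadmin")
cur = conn.cursor()

# External table pointing at delimited text files on HDFS via the PXF API.
cur.execute("""
    CREATE EXTERNAL TABLE ext_clicks (user_id INT, country_code CHAR(2), ts TIMESTAMP)
    LOCATION ('pxf://namenode:51200/data/clicks?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER '|');
""")
conn.commit()

# Once defined, the external data can be queried like any other table.
cur.execute("SELECT count(*) FROM ext_clicks;")
print(cur.fetchone())
```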
To access and store small amounts of data, Apache HAWQ has an interface called gpfdist. It lets you keep data outside of HDFS and still access it within HAWQ to join it with the data stored in HDFS. This is especially handy when you need small tables for dimension or mapping data in Apache HAWQ; such data then does not occupy a whole, mostly empty HDFS block.
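Continuing the sketch from above (and reusing its connection), a small dimension table served by gpfdist could look like this; the gpfdist host, port, and file name are again just placeholders:

```python
# Sketch: a gpfdist-backed external table for a small dimension table.
# Assumes a gpfdist server is already serving files on etl-host:8081 and
# reuses the psycopg2 cursor from the PXF sketch above.
cur.execute("""
    CREATE EXTERNAL TABLE dim_country (country_code CHAR(2), country_name TEXT)
    LOCATION ('gpfdist://etl-host:8081/dim_country.csv')
    FORMAT 'CSV' (HEADER);
""")
conn.commit()

# The small dimension table can now be joined against the HDFS-resident data.
cur.execute("""
    SELECT d.country_name, count(*)
    FROM ext_clicks c JOIN dim_country d USING (country_code)
    GROUP BY d.country_name;
""")
print(cur.fetchall())
```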
Apache HAWQ even comes integrated with MADlib, another Apache incubating project developed by Pivotal. MADlib is a machine learning framework based on SQL, so moving data between different tools to analyse it is no longer needed. If you have stored your data in Apache HAWQ, you can mine it directly in the database and don't have to export it, e.g. to a Spark client or tools like KNIME or RapidMiner.
MADlib comes with algorithms in the following categories:
- Topic Modelling
- Association Rule Mining
- Descriptive Statistics
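As an illustration of this in-database approach, here is a sketch of training a linear regression model with MADlib directly inside HAWQ; the connection details, table, and column names are made up for the example:

```python
# Sketch: in-database machine learning with MADlib, training a linear
# regression model without moving the data out of HAWQ.
# Connection details, table and column names are illustrative only.
import psycopg2

conn = psycopg2.connect(host="hawq-master", dbname="analytics", user="gpadmin")
cur = conn.cursor()

cur.execute("""
    SELECT madlib.linregr_train(
        'houses',                 -- source table
        'houses_model',           -- output table for the model
        'price',                  -- dependent variable
        'ARRAY[1, size, rooms]'   -- independent variables (1 = intercept term)
    );
""")
conn.commit()

cur.execute("SELECT coef, r2 FROM houses_model;")
print(cur.fetchone())
```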
By using HAWQ you can even leverage tools like Tableau with real-time database connections, which so far was not satisfactory when using Hive.
Apache Zeppelin is pretty useful for interactive programming in the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post.
But the real power of using Spark with Zeppelin lies in how easily you can connect it to your existing Spark cluster using YARN. The following steps are necessary:
- Copy your Hadoop configuration files to your Zeppelin installation under $ZEPPELIN_HOME/conf
- Restart your Zeppelin Notebook
- Insert the value “yarn-client” into the master field of the Spark interpreter, as shown in the picture below.
After these steps you can use your notebooks with Spark running on a YARN cluster, so you can make use of all the resources in the queue you assigned to Spark on your cluster.
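A quick way to check that the interpreter really runs against YARN is a small paragraph in a %pyspark notebook cell, something like this sketch (assuming the default Spark interpreter with its injected SparkContext):

```python
# Run inside a Zeppelin %pyspark paragraph: the injected SparkContext
# should now report YARN as its master.
print(sc.master)               # expected: yarn-client
print(sc.defaultParallelism)   # grows with the executors granted by the queue
```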
Apache Zeppelin is a web-based notebook for interactive data analytics. It comes with features for all the steps of data analysis:
- Data Ingestion
- Data Discovery
- Data Analytics
- Data Visualization & Collaboration
Besides that feature set it also supports multiple languages in the backend. Currently these include Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell.
But you can also add your own interpreter to Zeppelin, which makes the tool really flexible.
Another feature is the built-in integration of Apache Spark. It ships with the following features, among others:
- Automatic SparkContext and SQLContext injection (see the sketch after this list)
- Runtime jar dependency loading from local filesystem or maven repository.
- Canceling jobs and displaying their progress
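For example, in a %pyspark paragraph the injected contexts can be used straight away; the data here is made up for illustration:

```python
# Inside a Zeppelin %pyspark paragraph: sc and sqlContext are already injected,
# no manual SparkContext/SQLContext setup is needed.
rdd = sc.parallelize([("2016-08", 42), ("2016-09", 58)])
df = sqlContext.createDataFrame(rdd, ["month", "visits"])
df.registerTempTable("visits")  # can then be queried from a %sql paragraph
```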
It also has built-in visualization, which I think is an improvement over using IPython notebooks. The visualization covers the most basic chart types, like tables, bar charts, pie charts, area charts, line charts, and scatter plots.
These visualizations can be used with all interpreters and always look the same, so you can show data from Postgres and from Spark in the same notebook using the very same functions. There is no need to handle different data sources differently.
You can also use dynamic forms in your notebooks, e.g. to provide filter options to the user. This comes in handy if you embed a notebook in your own website.
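A small sketch of such a dynamic form in a %pyspark paragraph, building on the visits table from the previous example (form name and default value are arbitrary):

```python
# z is the ZeppelinContext injected into the paragraph. z.input renders a
# text box above the output; the user's entry filters the query below.
min_visits = int(z.input("minimum visits", "50"))
sqlContext.sql("SELECT * FROM visits WHERE visits >= {0}".format(min_visits)).show()
```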
Apache Spark has released version 2.0, which is a major step forward in usability for Spark users, and especially for people who refrained from using it due to the cost of learning a new programming language or tool. That is in the past now, as Spark 2.0 comes with improved SQL functionality and SQL2003 support. It can now run all 99 TPC-DS queries. The new SQL parser supports both ANSI-SQL and HiveQL, as well as subqueries.
Another new feature is native CSV data source support, based on the existing Databricks spark-csv module. I have used this module as well as the spark-avro module before, and they make working with data in those formats really easy.
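With 2.0 the CSV reader is built in, so reading a file no longer needs the external package. A sketch with placeholder paths and columns:

```python
# Sketch of the native CSV support in Spark 2.0: no spark-csv package needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# Read a CSV file directly; the path and options are illustrative.
df = spark.read.csv("hdfs:///data/clicks.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("clicks")

# The new parser also handles subqueries, e.g. a scalar subquery in WHERE.
spark.sql("""
    SELECT country_code, visits
    FROM clicks
    WHERE visits > (SELECT avg(visits) FROM clicks)
""").show()
```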
There were also some new features added to MLlib:
- PySpark includes new algorithms like LDA, Gaussian Mixture Models, and Generalized Linear Regression (sketched below).
- SparkR now includes generalized linear models, naive Bayes, k-means clustering, and survival regression.
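As a sketch of one of these additions, a Generalized Linear Regression model could be trained in PySpark like this; training_df is an assumed DataFrame with the usual 'features' and 'label' columns:

```python
# Sketch: the Generalized Linear Regression estimator added to PySpark in 2.0.
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10)
model = glr.fit(training_df)   # training_df: DataFrame with 'features' and 'label'
print(model.coefficients, model.intercept)
```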
Spark also increased its performance with the 2.0 release. The goal was to make Spark 2.0 ten times faster, and Databricks demonstrates this performance tuning in a notebook.
All of these improvements make Spark a more complete tool for data processing and analysis. The added SQL2003 support opens it up to a larger user base and, more importantly, makes it easier to migrate existing applications from databases to Spark.
In Data Science there are two languages that compete for users: on one side there is R, on the other Python. Both have a huge user base, but there is some discussion about which is better to use in a Data Science context. Let's explore both a bit:
R is a language and programming environment especially developed for statistical computing and graphics. It has been around for some time and offers several thousand packages for tackling statistical problems. With RStudio it also provides an interactive programming environment that makes analysing data pretty easy.
Python is a full-range programming language that makes it easy to integrate into a company-wide system. With the packages NumPy, Pandas, scikit-learn, and Matplotlib in combination with IPython, it also provides a full suite for statistical computing and an interactive programming environment.
R was developed solely for the purpose of statistical computing, so it has some advantages there, since it is specialized and has been around for some years. Python comes from a general-purpose programming background and is now moving into the data analysis field, bringing along everything else it can do, like building websites and easy integration into Hadoop Streaming or Apache Spark.
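To illustrate why the Hadoop Streaming integration is so easy: any Python script that reads lines from stdin and writes tab-separated key/value pairs to stdout can act as a mapper, as in this word-count sketch:

```python
#!/usr/bin/env python
# Sketch of a word-count mapper for Hadoop Streaming: reads lines from stdin
# and emits tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{0}\t1".format(word))
```

Submitted together with a matching reducer via the hadoop-streaming jar, this runs as a regular MapReduce job.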
And people who want the best of both worlds can always use Rpy2, the R-Python integration.
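A minimal sketch of what that looks like, calling an R function from Python with rpy2 (the vector values are arbitrary):

```python
# Sketch: calling R from Python via rpy2.
import rpy2.robjects as robjects

# Build an R numeric vector from Python values and run R's summary() on it.
values = robjects.FloatVector([1.5, 2.3, 3.8, 4.1])
print(robjects.r["summary"](values))
```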
I personally have recently been working with Python for my ETL processes, including MapReduce, and for analysing data, which works great in combination with IPython as an interactive development tool.