Apache Spark 2.0

Apache Spark has released version 2.0, a major step forward in usability for existing Spark users and, above all, for people who held back from using it because of the cost of learning a new programming language or tool. That barrier is now much lower, as Spark 2.0 improves its SQL functionality with SQL:2003 support and can now run all 99 TPC-DS queries. The new SQL parser supports both ANSI SQL and HiveQL, including subqueries.
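
To make the subquery support a bit more concrete, here is a minimal PySpark sketch. The `orders` view, its columns, the sample rows and the function names are all made up for illustration; the only Spark calls used are `SparkSession`, `createDataFrame`, `createOrReplaceTempView` and `sql`. PySpark is imported lazily inside `demo()`, so the file loads even on a machine without Spark.

```python
# Hypothetical sample data and query, chosen only for illustration.
SAMPLE_ROWS = [("alice", 10.0), ("bob", 25.0), ("carol", 40.0)]

# An uncorrelated scalar subquery in the WHERE clause.
SUBQUERY_SQL = """
SELECT name, amount
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders)
ORDER BY amount DESC
"""


def orders_above_average(spark):
    """Register the sample rows as a temp view and run the subquery on it."""
    df = spark.createDataFrame(SAMPLE_ROWS, ["name", "amount"])
    df.createOrReplaceTempView("orders")
    return spark.sql(SUBQUERY_SQL).collect()


def demo():
    """Start a local session and print the matching rows (needs PySpark 2.0+)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("subquery-sketch")
             .getOrCreate())
    try:
        for row in orders_above_average(spark):
            print(row["name"], row["amount"])
    finally:
        spark.stop()
```

Calling `demo()` on a machine with PySpark 2.0 or later installed runs the query against a local session.
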
Another new feature is native CSV data source support, based on the existing Databricks spark-csv module. I have personally used this module, as well as the spark-avro module, and both make working with data in those formats really easy.
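
As a rough sketch of what the native CSV source looks like from PySpark, the snippet below reads a small file once with the built-in `csv` format and once with the older `com.databricks.spark.csv` format name used by the external package. The file name, its contents and the helper names are assumptions for illustration; `header` and `inferSchema` are options that both readers understand.

```python
import os
import tempfile

# Reader options understood by both the built-in source and the spark-csv package.
CSV_OPTIONS = {"header": "true", "inferSchema": "true"}


def write_sample_csv(directory):
    """Write a tiny, made-up CSV file and return its path."""
    path = os.path.join(directory, "people.csv")
    with open(path, "w") as handle:
        handle.write("name,age\nalice,34\nbob,28\n")
    return path


def read_csv_native(spark, path):
    """Spark 2.0+: the CSV source is built in, no extra package required."""
    return spark.read.format("csv").options(**CSV_OPTIONS).load(path)


def read_csv_with_package(sql_context, path):
    """Before 2.0: the same read through the external spark-csv package."""
    return (sql_context.read
            .format("com.databricks.spark.csv")
            .options(**CSV_OPTIONS)
            .load(path))


def demo():
    """Read the sample file with the native source (needs PySpark 2.0+)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.master("local[1]").appName("csv-sketch").getOrCreate()
    try:
        path = write_sample_csv(tempfile.mkdtemp())
        read_csv_native(spark, path).show()
    finally:
        spark.stop()
```

The two reader functions differ only in the format name, which is why moving existing spark-csv code to 2.0 tends to be a small change.
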
MLlib also gained some new features:

  • PySpark now includes new algorithms such as LDA, Gaussian Mixture Model, and Generalized Linear Regression.
  • SparkR now includes generalized linear models, naive Bayes, k-means clustering, and survival regression.
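
To give an idea of how one of these PySpark additions is used, here is a small sketch that fits a Gaussian Mixture Model with `pyspark.ml.clustering.GaussianMixture`. The sample points, the choice of `k=2`, the seed and the function names are assumptions made for this example, not values from the release.

```python
# Two made-up, well-separated groups of 2-D points.
SAMPLE_POINTS = [
    (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
    (9.0, 9.1), (9.2, 8.9), (8.8, 9.0),
]


def fit_gaussian_mixture(spark, points, k=2, seed=42):
    """Fit a GMM on the given points and return (features, prediction) rows."""
    from pyspark.ml.clustering import GaussianMixture  # imported lazily on purpose
    from pyspark.ml.linalg import Vectors

    # The estimator reads a vector column named "features" by default.
    rows = [(Vectors.dense(list(point)),) for point in points]
    df = spark.createDataFrame(rows, ["features"])
    model = GaussianMixture(k=k, seed=seed).fit(df)
    return model.transform(df).select("features", "prediction").collect()


def demo():
    """Cluster the sample points in a local session (needs PySpark 2.0+)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("gmm-sketch").getOrCreate()
    try:
        for row in fit_gaussian_mixture(spark, SAMPLE_POINTS):
            print(row["features"], row["prediction"])
    finally:
        spark.stop()
```

LDA and Generalized Linear Regression follow the same estimator pattern: build the estimator, call `fit` on a DataFrame, then `transform`.
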

Spark 2.0 also improves performance. The goal was to make Spark 2.0 10x faster, and Databricks demonstrates this performance tuning in a notebook.

All of these improvements make Spark a more complete tool for data processing and analysis. The added SQL:2003 support makes it accessible to a larger user base and, more importantly, makes it easier to migrate existing applications from databases to Spark.

Author: Marc

My career so far has let me explore the potential of analysis and data mining across a broad range of industries and data sources. My experience ranges from customer relationship management in several industries to optimizing the acquisition of new customers through data mining. I can squeeze information and knowledge from all kinds of available data to optimize processes in a company.
