Apache Spark 2.0

Apache Spark has released version 2.0, a major step forward in usability for existing Spark users and especially for people who had held off using it because of the cost of learning a new programming language or tool. That is in the past now, as Spark 2.0 offers improved SQL functionality with SQL:2003 support. It can now run all 99 TPC-DS queries. The new SQL parser supports ANSI SQL and HiveQL, including subqueries.
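
To make the subquery support a bit more concrete, here is a minimal PySpark sketch, assuming a local Spark 2.x installation. The `orders` view, its columns and the sample rows are made up for illustration. Only `SparkSession`, `createDataFrame`, `createOrReplaceTempView` and `sql` are Spark APIs.

```python
# Hypothetical query: orders whose amount is above the overall average.
# The scalar subquery in the WHERE clause is the kind of construct
# the new SQL parser accepts.
ABOVE_AVERAGE_SQL = """
SELECT customer, amount
FROM orders
WHERE amount > (SELECT avg(amount) FROM orders)
ORDER BY amount
"""

# Made-up sample data, not taken from any real dataset.
SAMPLE_ORDERS = (("alice", 10.0), ("bob", 25.0), ("carol", 40.0))


def above_average_orders(rows=SAMPLE_ORDERS):
    """Run the query on a local SparkSession and return (customer, amount) tuples."""
    # Imported here so the constants above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("subquery-sketch")
        .getOrCreate()
    )
    try:
        # Register the sample rows as a temporary view that SQL can query.
        df = spark.createDataFrame(list(rows), ["customer", "amount"])
        df.createOrReplaceTempView("orders")
        result = spark.sql(ABOVE_AVERAGE_SQL).collect()
        return [(row["customer"], row["amount"]) for row in result]
    finally:
        spark.stop()
```

With the sample rows the average amount is 25.0, so calling `above_average_orders()` should return only carol's order.
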
Another new feature is native CSV data source support, based on the existing Databricks spark-csv module. I have personally used this module, as well as the spark-avro module, and they make working with data in those formats really easy.
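
As a rough illustration of the built-in CSV reader, the following sketch writes a tiny file and loads it with `spark.read.csv`. It assumes a local Spark 2.x installation. The file name, column names and helper functions are hypothetical and exist only for this example.

```python
import csv
import os
import tempfile


def write_sample_csv(directory):
    """Write a small, made-up CSV file with a header row and return its path."""
    path = os.path.join(directory, "people.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "age"])
        writer.writerows([["alice", 34], ["bob", 45]])
    return path


def read_csv_with_spark(path):
    """Load the file with Spark's built-in CSV reader and return the rows as dicts."""
    # Imported here so the helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("csv-sketch")
        .getOrCreate()
    )
    try:
        # header=True uses the first line as column names;
        # inferSchema=True asks Spark to guess the column types.
        df = spark.read.csv(path, header=True, inferSchema=True)
        return [row.asDict() for row in df.collect()]
    finally:
        spark.stop()


def demo():
    """Write the sample file to a temporary directory and read it back via Spark."""
    with tempfile.TemporaryDirectory() as tmp:
        return read_csv_with_spark(write_sample_csv(tmp))
```

No extra package has to be added to the classpath for this; the reader ships with Spark itself.
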
MLlib also gained some new features:

  • PySpark now includes new algorithms such as LDA, Gaussian Mixture Model, and Generalized Linear Regression.
  • SparkR now includes generalized linear models, naive Bayes, k-means clustering, and survival regression.

Spark 2.0 also improves performance. The goal was to make Spark 2.0 10x faster, and Databricks demonstrates the result of this performance tuning in a notebook.

All of these improvements make Spark a more complete tool for data processing and analysis. The added SQL:2003 support opens it up to a larger user base and, more importantly, makes it easier to migrate existing applications from databases to Spark.

