{"id":245,"date":"2016-09-08T09:37:47","date_gmt":"2016-09-08T07:37:47","guid":{"rendered":"http:\/\/datascientists.info\/?p=245"},"modified":"2016-09-08T09:37:47","modified_gmt":"2016-09-08T07:37:47","slug":"apache-spark-2-0","status":"publish","type":"post","link":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/","title":{"rendered":"Apache Spark 2.0"},"content":{"rendered":"<p>Apache Spark has release <a href=\"https:\/\/spark.apache.org\/releases\/spark-release-2-0-0.html\">version 2.0<\/a>, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It can now run all 99 TPC-DS. The new SQL parser supports ANSI-SQL and HiveQL and sub queries.<br \/>\nAnother new features is native csv data source support, based on the already existing <a href=\"https:\/\/databricks.com\/\" target=\"_blank\">Databricks<\/a> <a href=\"https:\/\/github.com\/databricks\/spark-csv\" target=\"_blank\">spark csv module<\/a>. I personally used this module as well as the <a href=\"https:\/\/github.com\/databricks\/spark-avro\" target=\"_blank\">spark avro module<\/a> before and they make working with data in those formats really easy.<br \/>\nAlso there were some new features added to MLlib:<\/p>\n<ul>\n<li>PySpark includes new algorithms like LDA, Gaussian Mixture Model, Generalized Linear Regression<\/li>\n<li>SparkR now includes generalized linear models, naive Bayes, k-means clustering, and survival regression.<\/li>\n<\/ul>\n<p>Spark increased its performance with the release of 2.0. The goal was to make Spark 2.0 10x faster and Databricks shows this performance tuning in a <a href=\"https:\/\/databricks-prod-cloudfront.cloud.databricks.com\/public\/4027ec902e239c93eaaa8714f173bcfc\/6122906529858466\/293651311471490\/5382278320999420\/latest.html\" target=\"_blank\">notebook<\/a>.<\/p>\n<p>All of these improvements make Spark a more complete tool for data processing and analysing. The added SQL2003 support even makes it available for a larger user base and more importantly makes it easier to migrate existing applications from databases to Spark.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark has release version 2.0, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,5,7,9],"tags":[18,65,76,78],"ppma_author":[144],"class_list":["post-245","post","type-post","status-publish","format-standard","hentry","category-big-data","category-data-science","category-machine-learning","category-tools","tag-apache-spark","tag-pyspark","tag-sparkr","tag-sql-2003","author-marc"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Spark 2.0: SQL in Spark<\/title>\n<meta name=\"description\" content=\"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark 2.0: SQL in Spark\" \/>\n<meta property=\"og:description\" content=\"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/\" \/>\n<meta property=\"og:site_name\" content=\"DATA DO - \u30c7\u30fc\u30bf \u9053\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataScientists\/\" \/>\n<meta property=\"article:published_time\" content=\"2016-09-08T07:37:47+00:00\" \/>\n<meta name=\"author\" content=\"Marc Matt\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Marc Matt\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/\"},\"author\":{\"name\":\"Marc Matt\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/person\\\/723078870bf3135121086d46ebb12f19\"},\"headline\":\"Apache Spark 2.0\",\"datePublished\":\"2016-09-08T07:37:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/\"},\"wordCount\":214,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\"},\"keywords\":[\"Apache Spark\",\"PySpark\",\"SparkR\",\"SQL 2003\"],\"articleSection\":[\"Big Data\",\"Data Science\",\"Machine Learning\",\"Tools\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/\",\"url\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/\",\"name\":\"Apache Spark 2.0: SQL in Spark\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#website\"},\"datePublished\":\"2016-09-08T07:37:47+00:00\",\"description\":\"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/index.php\\\/2016\\\/09\\\/08\\\/apache-spark-2-0\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/datascientists.info\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Apache Spark 2.0\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#website\",\"url\":\"https:\\\/\\\/datascientists.info\\\/\",\"name\":\"Data Scientists\",\"description\":\"Digging data, Big Data, Analysis, Data Mining\",\"publisher\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/datascientists.info\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#organization\",\"name\":\"DATA DO - \u30c7\u30fc\u30bf \u9053\",\"url\":\"https:\\\/\\\/datascientists.info\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2026\\\/02\\\/Bildschirmfoto-vom-2026-02-02-08-13-21.png\",\"contentUrl\":\"https:\\\/\\\/datascientists.info\\\/wp-content\\\/uploads\\\/2026\\\/02\\\/Bildschirmfoto-vom-2026-02-02-08-13-21.png\",\"width\":250,\"height\":174,\"caption\":\"DATA DO - \u30c7\u30fc\u30bf \u9053\"},\"image\":{\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/DataScientists\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/datascientists.info\\\/#\\\/schema\\\/person\\\/723078870bf3135121086d46ebb12f19\",\"name\":\"Marc Matt\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g53b84b5f47a2156ba8b047d71d6d05fc\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g\",\"caption\":\"Marc Matt\"},\"description\":\"Senior Data Architect with 15+ years of experience helping Hamburg's leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities. I help clients: Migrate &amp; Modernize: Transitioning on-premise data warehouses to Google Cloud\\\/AWS to reduce costs and increase agility. Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs. Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow. Proven track record leading engineering teams.\",\"sameAs\":[\"https:\\\/\\\/data-do.de\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Apache Spark 2.0: SQL in Spark","description":"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/","og_locale":"en_US","og_type":"article","og_title":"Apache Spark 2.0: SQL in Spark","og_description":"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.","og_url":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/","og_site_name":"DATA DO - \u30c7\u30fc\u30bf \u9053","article_publisher":"https:\/\/www.facebook.com\/DataScientists\/","article_published_time":"2016-09-08T07:37:47+00:00","author":"Marc Matt","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Marc Matt","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/#article","isPartOf":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/"},"author":{"name":"Marc Matt","@id":"https:\/\/datascientists.info\/#\/schema\/person\/723078870bf3135121086d46ebb12f19"},"headline":"Apache Spark 2.0","datePublished":"2016-09-08T07:37:47+00:00","mainEntityOfPage":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/"},"wordCount":214,"commentCount":0,"publisher":{"@id":"https:\/\/datascientists.info\/#organization"},"keywords":["Apache Spark","PySpark","SparkR","SQL 2003"],"articleSection":["Big Data","Data Science","Machine Learning","Tools"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/","url":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/","name":"Apache Spark 2.0: SQL in Spark","isPartOf":{"@id":"https:\/\/datascientists.info\/#website"},"datePublished":"2016-09-08T07:37:47+00:00","description":"Apache Spark 2.0 release brings a lot of new features, but most importantly major SQL support, which makes it easier to use.","breadcrumb":{"@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/datascientists.info\/index.php\/2016\/09\/08\/apache-spark-2-0\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/datascientists.info\/"},{"@type":"ListItem","position":2,"name":"Apache Spark 2.0"}]},{"@type":"WebSite","@id":"https:\/\/datascientists.info\/#website","url":"https:\/\/datascientists.info\/","name":"Data Scientists","description":"Digging data, Big Data, Analysis, Data Mining","publisher":{"@id":"https:\/\/datascientists.info\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/datascientists.info\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/datascientists.info\/#organization","name":"DATA DO - \u30c7\u30fc\u30bf \u9053","url":"https:\/\/datascientists.info\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datascientists.info\/#\/schema\/logo\/image\/","url":"https:\/\/datascientists.info\/wp-content\/uploads\/2026\/02\/Bildschirmfoto-vom-2026-02-02-08-13-21.png","contentUrl":"https:\/\/datascientists.info\/wp-content\/uploads\/2026\/02\/Bildschirmfoto-vom-2026-02-02-08-13-21.png","width":250,"height":174,"caption":"DATA DO - \u30c7\u30fc\u30bf \u9053"},"image":{"@id":"https:\/\/datascientists.info\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataScientists\/"]},{"@type":"Person","@id":"https:\/\/datascientists.info\/#\/schema\/person\/723078870bf3135121086d46ebb12f19","name":"Marc Matt","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g53b84b5f47a2156ba8b047d71d6d05fc","url":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","caption":"Marc Matt"},"description":"Senior Data Architect with 15+ years of experience helping Hamburg's leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities. I help clients: Migrate &amp; Modernize: Transitioning on-premise data warehouses to Google Cloud\/AWS to reduce costs and increase agility. Implement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs. Scale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow. Proven track record leading engineering teams.","sameAs":["https:\/\/data-do.de"]}]}},"authors":[{"term_id":144,"user_id":1,"is_guest":0,"slug":"marc","display_name":"Marc Matt","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/74f48ef754cf04f628f42ed117a3f2b42931feeb41a3cca2313b9714a7d4fdd2?s=96&d=mm&r=g","author_category":"1","first_name":"Marc","last_name":"Matt","user_url":"https:\/\/data-do.de","job_title":"Senior Data Architect | GenAI & RAG Expert | GCP \/ AWS","description":"Senior Data Architect with 15+ years of experience helping Hamburg's leading enterprises modernize their data infrastructure. I bridge the gap between legacy systems (SAP, Hadoop) and modern AI capabilities.\r\n\r\nI help clients:\r\n\r\n \tMigrate &amp; Modernize: Transitioning on-premise data warehouses to Google Cloud\/AWS to reduce costs and increase agility.\r\n\r\n\r\n \tImplement GenAI: Building secure RAG (Retrieval-Augmented Generation) pipelines to unlock value from internal knowledge bases using LangChain and Vector DBs.\r\n \tScale MLOps: Operationalizing machine learning models from PoC to production with Kubernetes and Airflow.\r\n\r\nProven track record leading engineering teams."}],"_links":{"self":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts\/245","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/comments?post=245"}],"version-history":[{"count":0,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/posts\/245\/revisions"}],"wp:attachment":[{"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/media?parent=245"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/categories?post=245"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/tags?post=245"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/datascientists.info\/index.php\/wp-json\/wp\/v2\/ppma_author?post=245"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}