DataScientists: a blog about everything data related.
Apache HAWQ: Building an easily accessable Data Lake
Data Lake vs Datawarehouse The Data Lake Architecture is an up and coming approach to making all data accessible through several methods, be that in real-time or batch analysis. This includes unstructured data as well as structured data. In this approach the data is stored on HDFS and made accessible by several tools, including: Apache […]
Apache HAWQ: Full SQL and MPP support on HDFS
Pivotal ported their massively parallel processing (MPP) database Greenplum to Hadoop and made it open source as an incubating project at Apache, called Apache HAWQ. This bring together full ANSI SQL with MPP capabilities and Hadoop integration. The integration in an existing Hadoop installation is easy, as you can integrate all existing data via external […]
Apache Zeppelin: Use with remote Spark cluster and Yarn
Apache Zeppelin is pretty usefull for interactive programming using the web browser. It even comes with its own installation of Apache Spark. For further information you can check my earlier post. But the real power in using Spark with Zeppelin lies in its easy way to connect it to your existing Spark cluster using YARN. […]
Apache Zeppelin: Visualization and Spark data processing
Apache Zeppelin is a web-based notebook for interactive data analytics. It comes will features for all the steps of data analysis: Data Ingestion Data Discovery Data Analytics Data Visualization & Collaboration Besides that feature set it also supports multiple languages in the backend. Currently it supports languages like: Apache Spark (SQL, PySpark, Java, Scala) R […]
Apache Spark 2.0
Apache Spark has release version 2.0, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It […]
Python vs. R for Data Science
In Data Science there are two languages that compete for users. On one side there is R, on the other Python. Both have a huge userbase, but there is some discussion, which is better to use in a Data Science context. Lets explore both a bit: R R is a language and programming environment especially […]
Apache Spark: The Next Big (Data) Thing?
Since Apache Spark became a Top Level Project at Apache almost a year ago, it has seen some wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to doing pure MapReduce. Written in […]
Big Data and Data Warehouse Architecture
Further development and new additions to the Hadoop framework, such as Stinger from HortonWorks or Impala from Cloudera try to bridge the gap between traditional EDWH architectures and big data architectures. Especially Stinger.next initiative with the goal of speeding up Hive and delivering SQL 2011 standard to use on Map / Reduce Hadoop clusters makes […]
Comparing Stinger to Impala
With Hadoop 2.0 and the new additions of Stinger and Impala I did a (not representive) test of the performance on a Virtual Box running on my desktop computer. It was using the following setup: 4 GB RAM Intel Core i5 2500 3.3 GHz The datasets were the following: Dataset 1: 71.386.291 rows and 5 […]
SQL on Hadoop: Facebook’s Presto
Earlier this month Facebook open sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook’s answer to Cloudera’s Impala or Hortonwork’s Stinger already presented in an earlier post called SQL and Hadoop on this site. Presto is unlike Hive and more like Impala, since it doesn’t rely […]
Got any book recommendations?