Category: Big Data
-
Apache Spark 2.0
Apache Spark has released version 2.0, a major step forward in usability for Spark users and especially for people who refrained from using it because of the cost of learning a new programming language or tool. That is in the past now, as Spark 2.0 offers improved SQL functionality with SQL:2003 support. It…
-
Python vs. R for Data Science
In Data Science there are two languages that compete for users: on one side there is R, on the other Python. Both have a huge user base, but there is some discussion about which is better to use in a Data Science context. Let’s explore both a bit. R: R is a language and programming environment especially…
-
Apache Spark: The Next Big (Data) Thing?
Since Apache Spark became a Top-Level Project at Apache almost a year ago, it has seen wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to doing pure MapReduce. Written in…
-
Big Data and Data Warehouse Architecture
Further development and new additions to the Hadoop framework, such as Stinger from Hortonworks or Impala from Cloudera, try to bridge the gap between traditional EDWH architectures and big data architectures. Especially the Stinger.next initiative, with the goal of speeding up Hive and delivering the SQL:2011 standard for use on MapReduce Hadoop clusters, makes…
-
Comparing Stinger to Impala
With Hadoop 2.0 and the new additions of Stinger and Impala, I ran a (non-representative) performance test on a VirtualBox machine running on my desktop computer. It used the following setup: 4 GB RAM, Intel Core i5 2500 at 3.3 GHz. The datasets were the following: Dataset 1: 71,386,291 rows and 5…
-
SQL on Hadoop: Facebook’s Presto
Earlier this month Facebook open-sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook’s answer to Cloudera’s Impala or Hortonworks’ Stinger, both already presented in an earlier post on this site called SQL and Hadoop. Presto is unlike Hive and more like Impala, since it doesn’t rely…
-
SQL and Hadoop
Bringing SQL to Hadoop has been one of the major trends in Big Data over the last twelve months, which is reason enough for me to take a closer look at that scene right now. One reason to build an SQL-based interface for Hadoop is to make the technology available to more people. Companies that have…
-
REST (Representational state transfer) APIs and Big Data
Getting data, huge amounts of data, out of some systems tends to be quite a hassle. Often you are required to use techniques such as FTP or SSH for transferring files. But with RESTful APIs getting more attention in the last few years, there is a new way to get your data. The charm…
-
Big Data in Learning
There are many fields in which big data can improve results. One of these is (e-)learning. Until recently, the analysis of learning focused on exam results, but with big data and analytics there are new possibilities to enhance the experience of learning as a whole. For example, there is the possibility to…
-
Hadoop and MPP
With Big Data, Map/Reduce is always the first term that comes to mind. But it’s not the only way to handle large amounts of data. There are database systems built especially to deal with huge amounts of data, and they are called Massively Parallel Processing (MPP) databases. MPP database systems have been around for a longer…