DATA DO – データ道

Category: Big Data

Apache Spark 2.0

Apache Spark has release version 2.0, which is a major step forward in usability for Spark users and mostly for people, who refrained from using it, due to the costs of learning a new programming language or tool. This is in the past now, as Spark 2.0 supports improved SQL functionalities with SQL2003 support. It […]

September 8, 2016
Python vs. R for Data Science

In Data Science there are two languages that compete for users. On one side there is R, on the other Python. Both have a huge userbase, but there is some discussion, which is better to use in a Data Science context. Lets explore both a bit: R R is a language and programming environment especially […]

February 15, 2015
Apache Spark: The Next Big (Data) Thing?

Since Apache Spark became a Top Level Project at Apache almost a year ago, it has seen some wide coverage and adoption in the industry. Due to its promise of being faster than Hadoop MapReduce, about 100x in memory and 10x on disk, it seems like a real alternative to doing pure MapReduce. Written in […]

January 24, 2015
Big Data and Data Warehouse Architecture

Further development and new additions to the Hadoop framework, such as Stinger from HortonWorks or Impala from Cloudera try to bridge the gap between traditional EDWH architectures and big data architectures. Especially Stinger.next initiative with the goal of speeding up Hive and delivering SQL 2011 standard to use on Map / Reduce Hadoop clusters makes […]

September 28, 2014
Comparing Stinger to Impala

With Hadoop 2.0 and the new additions of Stinger and Impala I did a (not representive) test of the performance on a Virtual Box running on my desktop computer. It was using the following setup: 4 GB RAM Intel Core i5 2500 3.3 GHz The datasets were the following: Dataset 1: 71.386.291 rows and 5 […]

January 30, 2014
SQL on Hadoop: Facebook’s Presto

Earlier this month Facebook open sourced its own product for using SQL on Hadoop. It is called Presto and is something like Facebook’s answer to Cloudera’s Impala or Hortonwork’s Stinger already presented in an earlier post called SQL and Hadoop on this site. Presto is unlike Hive and more like Impala, since it doesn’t rely […]

November 16, 2013
SQL and Hadoop

Bringing SQL to Hadoop has been one of the major trends in Big Data these last twelve months. Reason enough for me to take a closer look at that scene right now. One reason to build an interface based on SQL for Hadoop is to make the technology available for more people. Companies that have […]

October 9, 2013
REST (Representational state transfer) APIs and Big Data

Getting data, huge amounts of data, out of some systems tends to be quite a hazzle sometimes. Often you are required to use techniques such as FTP or SSH for transfering files. But with RESTful APIs getting more attention in the last few years, there is a new way to get your data. The charm […]

August 23, 2013
Big Data in Learning

There are many fields in which big data can improve results. One of these being (e-)learning. Until recently the focus on analysing learning lay on analysing results of exams but with big data and analytics there are new possibilities to enhance the experience of learning as a whole. For example there is the possibility to […]

June 12, 2013
Hadoop and MPP

With Big Data Map/Reduce is always the first term that comes into mind. But it’s not the only way to handle large amounts of data. There are databasesystems especially built to deal with huge amounts of data and they are called Massively Parallel Processing (MPP) databases. MPP database systems have been around for a longer […]

April 26, 2013

By continuing to use the site, you agree to the use of cookies. more information