Hadoop and MPP

With Big Data Map/Reduce is always the first term that comes into mind. But it’s not the only way to handle large amounts of data. There are databasesystems especially built to deal with huge amounts of data and they are called Massively Parallel Processing (MPP) databases.
MPP database systems have been around for a longer time than Map/Reduce and its most popular integration Hadoop and are based on a shared nothing architecture. The data is partitioned across severel nodes of hardware and queries are processed via network interconnect on a central server. They often use commodity hardware that is as inexpensive as hardware for Map/Reduce. For working with data they have the advantage to make use of SQL as their interface, the language used by most Data Scientists and other analytic prefessionals so far.
Map/Reduce provides a Java interface to analyse the data, which comes with more time to implement than just write an SQL statement. Hadoop has some projects, that provide a SQL similar query language, like Hive which provides HiveQL, a SQL like query language, as interface.
Since both systems handle data, there will be a lot gained, when both are combined. There are already projects working on that, like Aster Data nCluster or Teradata and Hortonworks.
There is even a new product bringing both worlds together as one product, Hadapt. With this product you can access all your data, structured or unstructured, in a single plattform. Each node has space for SQL as well as for Map/Reduce.

Last but not least a list of some MPP databases available right now:

Depending on your business needs, you may not need a Map/Reduce cluster, but a MPP database, or both to benefit from their respective strenghts in your implementation.

Please follow and like us:

Data Science Tools

What tools are used for Data Science? There are a lot of them out there and in this post I want to tell you about the ones I currently use or used before.

  • KNIME is a graphical tool to analyse data. It uses an interface to build process flows that contain everything, from data transformation, initial analysis, predictive analysis, vizualisation and reporting. One of it’s advantages is the huge community and it being an open source tool, that encourages the community to contribute.
  • Rapid Miner from Rapid-I is also a graphical tool to analyse data. Processes are built using predifined steps. It provides data transformation, initial analysis, predictive analysis, vizualisation and reporting. Since it is based on Java it is plattform independent. There is a community too, that helps to improve the programm and expands the available resources.
  • SAShas a whole suite of tools for data manipulation and analysis. They provide Olap tool, predictive analytics, reporting and vizualisation. Being in the market for a long time, they have a huge customer base and lots of experience. There is also a system of trainings with exams to provide certified qualifications in using there tools.
  • R is a free tool, developed for scientists in biology first, but it is spread through all kinds of industries now, due to its wide range of packages. There is no graphical interface but the language is easy to learn. R provides data manipulation, visualization, predictive analysis, reporting and initial analysis. Also there is an integration into Hadoop for better interaction with Big Data.
  • Splunk is a tool primarily for analysing unstructured data, like logfiles. It provides real time statistics and a outstanding visualization for reports. Its language is related to SQL, so it is pretty easy to learn, if you used SQL queries before.
  • Jedox provides an Olap server with an interface that looks like MS Excel on the web and they have a plugin into MS Excel too. It caters mainly to controlling need, but has some advantages regarding self-service BI. Based on PHP and Java it is available in a community version and a professional version.
  • FastStats from Apteco uses a easy to understand graphical interface and some basic predictive methods. It enables business users to analyse their data themselves and even build small models. It also provides visualization tools. This is also a tool catering to self-service BI.

If you have other tools you use and like, please feel free to share them with me. I am always interessted in learning about new tools.

Please follow and like us: