When it comes to Big Data, Map/Reduce is usually the first term that comes to mind. But it is not the only way to handle large amounts of data: there are database systems built specifically for huge data volumes, called Massively Parallel Processing (MPP) databases.
MPP database systems have been around longer than Map/Reduce and its most popular implementation, Hadoop, and are based on a shared-nothing architecture. The data is partitioned across several hardware nodes, and queries are coordinated by a central server and executed in parallel on the nodes, which communicate over a network interconnect. MPP systems often run on commodity hardware that is as inexpensive as the hardware used for Map/Reduce. For working with data they have the advantage of using SQL as their interface, the language most Data Scientists and other analytics professionals already know.
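The shared-nothing idea can be illustrated with a toy sketch (hypothetical code, not any vendor's actual implementation): rows are hash-partitioned across nodes, each node aggregates only its own partition, and a coordinator merges the partial results.

```python
# Toy sketch of a shared-nothing MPP aggregation (hypothetical example).
from collections import Counter

NUM_NODES = 3

def partition(rows, key):
    """Hash-partition rows across NUM_NODES by the given key column."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)
    return nodes

def node_count_by(rows, key):
    """Each node computes a partial 'SELECT key, COUNT(*) ... GROUP BY key'."""
    return Counter(row[key] for row in rows)

def mpp_count_by(rows, key):
    """Coordinator: scatter the work to the nodes, then merge the partials."""
    partials = [node_count_by(part, key) for part in partition(rows, key)]
    return sum(partials, Counter())

rows = [{"country": c} for c in ["DE", "US", "DE", "FR", "US", "DE"]]
print(mpp_count_by(rows, "country"))
```

Each node only ever touches its own slice of the data; the coordinator merges small partial aggregates rather than shipping raw rows around, which is where the scalability of this architecture comes from.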
Map/Reduce, by contrast, exposes a Java API for analysing the data, so an analysis takes more time to implement than simply writing an SQL statement. The Hadoop ecosystem does include projects that offer an SQL-like query language as an interface, such as Hive with HiveQL.
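The difference in effort can be shown with a toy sketch (Python standing in for Hadoop's Java API): an aggregation that SQL or HiveQL expresses in a single statement requires you to spell out the map and reduce phases yourself.

```python
# The same word count, written twice (toy sketch, not real Hadoop code).
# In SQL/HiveQL it is one statement:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
# In map/reduce style you implement the phases by hand:
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the values."""
    result = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        result[key] = sum(count for _, count in group)
    return result

lines = ["big data big tools", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'tools': 1}
```

Even in this tiny form, the map/reduce version is an order of magnitude more code than the declarative query, which is exactly the gap that Hive and the MPP databases' SQL interfaces close.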
Since both systems have their strengths, there is much to be gained by combining them. Several projects are already working on this, such as Aster Data nCluster and the partnership between Teradata and Hortonworks.
There is even a new product that brings both worlds together in one place: Hadapt. With it you can access all your data, structured or unstructured, on a single platform; each node supports SQL as well as Map/Reduce.
Last but not least, here is a list of some MPP databases available right now:
Depending on your business needs, you may not need a Map/Reduce cluster but an MPP database, or both, to benefit from their respective strengths in your implementation.
After my previous post, How to visualize data?, I was unsatisfied with the visualization provided by Jedox's Palo Suite. There could be several reasons for this, not least that I may not have gotten the most out of it. But the quality and interactivity of the resulting diagrams fell short of my needs, especially after working with Circos over the last few weeks.
So I went hunting for something easy to integrate into my Palo Suite.
There are many more out there, but at some point I had to decide on one. I settled on NVD3.js because I liked the look of the graphics and because it is built on D3.js (Data-Driven Documents).
It already supports several types of graphs, and integrating them into the interface provided by Jedox got me quick results. Here is a quick comparison between Palo's built-in graphics and NVD3.js; both graphs are based on the same data.
Palo Suite Webreporting graph
For anyone interested, I uploaded the file here. This is just a quick hack and not very polished, but it shows how it works.
Data visualization is something of an art. How do you make the results of your data research easy to understand for management, business users, or everyone else out there? A plain list of data, like an Excel sheet, is not what catches the eye. The art of visualization is demonstrated perfectly on Martin Wattenberg's site.
Now the question is: which tools are easy to use in a company environment to visualize your data?
There are several classes of tools you can use:
- Beginner: These are tools that are widely known throughout the company, mainly MS Excel. You can explore data easily and create diagrams without too much hassle. It provides bar charts, lines, pies, and combinations of those. It is also well suited for ad-hoc analysis and for making the data and graphs available to business users if necessary.
- Online Libraries: If you don't want to be limited to Excel and use a web-based reporting/analysis tool instead, you may be able to integrate one of the available libraries. There are several for every purpose you can imagine:
Circos is a great tool if you want to use circles to visualize your data. It is written in Perl and produces PNG output.
Visual.ly focuses more on the infographics side. It is mainly a marketplace, but you can also create your own cartoon-like graphics with it.
- Professional tools: The opposite of Excel in terms of manipulating and analysing data. These tools are sometimes quite expensive, such as SAS and SPSS. But there are also open-source, free-to-use tools that are sometimes more flexible and easier to use, since they have a strong user base.
R: Besides its nearly unlimited supply of libraries for all kinds of analysis, R also has many packages for visualizing data and makes good use of them. It is one of the most complex tools mentioned here.
Gephi is a graph-based tool for data exploration. It is most useful for exploring relations between nodes of all kinds.
These are just some examples; I evaluated even more tools. There are many ways to visualize data, and which one you use depends on your environment and skills. I mostly use R for generating complex graphs, but only because I already use it for the analysis. I will soon be integrating Circos into our automated scripts, since they are all based on Perl anyway.
What tools are used for Data Science? There are a lot of them out there, and in this post I want to tell you about the ones I currently use or have used before.
- KNIME is a graphical tool for analysing data. In its interface you build process flows that contain everything from data transformation and initial analysis to predictive analysis, visualization, and reporting. One of its advantages is the huge community, and as an open-source tool it encourages that community to contribute.
- Rapid Miner from Rapid-I is also a graphical tool for analysing data. Processes are built from predefined steps covering data transformation, initial analysis, predictive analysis, visualization, and reporting. Being based on Java, it is platform independent. It also has a community that helps improve the program and expand the available resources.
- SAS has a whole suite of tools for data manipulation and analysis, providing OLAP, predictive analytics, reporting, and visualization. Having been on the market for a long time, they have a huge customer base and lots of experience. There is also a training system with exams that provides certified qualifications in using their tools.
- R is a free tool, originally developed for scientists in biology, but it has now spread through all kinds of industries thanks to its wide range of packages. There is no graphical interface, but the language is easy to learn. R provides data manipulation, visualization, predictive analysis, reporting, and initial analysis. There is also a Hadoop integration for better interaction with Big Data.
- Splunk is a tool primarily for analysing unstructured data, like logfiles. It provides real-time statistics and outstanding visualizations for reports. Its query language is related to SQL, so it is fairly easy to learn if you have used SQL queries before.
- Jedox provides an OLAP server with a web interface that looks like MS Excel, plus a plugin for MS Excel itself. It caters mainly to controlling needs but has some advantages for self-service BI. Based on PHP and Java, it is available in a community version and a professional version.
- FastStats from Apteco has an easy-to-understand graphical interface and some basic predictive methods. It enables business users to analyse their data themselves and even build small models, and it provides visualization tools as well. This tool also caters to self-service BI.
If there are other tools you use and like, please feel free to share them with me. I am always interested in learning about new tools.
Machine Learning is acknowledged as a part of Data Science, but will it be able to replace the Data Scientist?
There have been several articles on this topic in recent years and months. It's true that there has been major progress in the field of machine learning, and there are already articles about the beginnings of automated science, such as the work of Lipson and Schmidt.
During SXSW week there will even be a panel on this topic: The Data Scientist Will Be Replaced By Tools.
The main question is: will machine results replace human expertise? There are several startups that provide data science as a service, like Prior Knowledge or Platfora. These companies help discover knowledge hidden in a company's data. Prior Knowledge searches the provided data for correlations and helps build predictive models; Platfora, on the other hand, wants to make Hadoop usable for everyone.
These companies can help discover information, but only in combination with human expertise from inside the company can the most be made of what is uncovered. So, in my opinion, machine learning makes the data scientist's job easier, freeing him to concentrate on his expertise in the context the data was created in.
This may even help broaden access to data science for more people.
Data Scientists seem to be everywhere nowadays. The title has seen a huge increase in appearances in job descriptions, as Indeed.com's data demonstrates.
There are several sites and articles that even describe the job as sexy:
The combination of handling Big Data and analytics is what makes this title so attractive. So far, handling data and doing analytics were two separate roles, sometimes combined in one person but usually not. With all the unstructured data now available, and new tools to handle it, the combination has become manageable for a single person. But technical understanding alone is not enough to become a Data Scientist: it also requires an understanding of products and customer behaviour, as well as of how to manage and analyse data.
Right now there is no degree program that graduates students with the title Data Scientist, so companies have to look for people who have learned all these skills during their careers so far, or who have a strong affinity for analysis or programming along with the corresponding skills from their studies or training.
Because of the interdisciplinary nature of this field, some companies try to get their hands on physics graduates: their studies involve a great deal of interdisciplinary work and an affinity for both algorithms and research.
Other candidates are Business Intelligence Analysts, Data Mining specialists, or even Web Analytics managers, depending on their careers so far, especially their experience with different kinds of data and their skill at presenting the actions that follow from their analyses.
All in all, this new title and the news coverage it is getting are a great opportunity both for people already working in this field and for newcomers who want to work with data and analysis.
That is what makes this new profession sexy.
Data Science is an interdisciplinary field of sciences. It includes:
It revolves, as the name suggests, around working with data. As our society produces ever larger sets of data, the need to analyze them grows across all industries. This calls for people who can work with data, analyze it, and make their findings available for management decisions. This is where math, statistics, domain expertise, and visualization come into play.
One of the most important challenges is to integrate findings from unstructured data, like logfiles, with structured data, like databases. This enables a company to develop a whole new view of its customers and products.
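As a minimal sketch of what such an integration can look like (the log format and the customer table here are entirely made up for illustration): parse the unstructured logfile into records, then join them against a structured customer table.

```python
import re

# Hypothetical access-log lines: "<customer_id> - <method> <path> <status>".
LOG_LINES = [
    "42 - GET /products/7 200",
    "42 - GET /cart 200",
    "99 - GET /products/7 404",
]

# Hypothetical structured customer table, keyed by customer id.
CUSTOMERS = {
    "42": {"name": "Alice", "segment": "premium"},
    "99": {"name": "Bob", "segment": "basic"},
}

LOG_PATTERN = re.compile(r"^(\d+) - (\w+) (\S+) (\d+)$")

def parse_log(lines):
    """Turn raw, unstructured log lines into structured records."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            cid, method, path, status = m.groups()
            yield {"customer_id": cid, "path": path, "status": int(status)}

def join_with_customers(records):
    """Join the parsed log records against the structured customer table."""
    for rec in records:
        customer = CUSTOMERS.get(rec["customer_id"])
        if customer:
            yield {**rec, **customer}

for row in join_with_customers(parse_log(LOG_LINES)):
    print(row["name"], row["path"], row["status"])
```

Once the logfile has been lifted into the same structured shape as the database, the combined records can feed the same analyses and visualizations as any other table.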
One task of data science is to make all necessary information available and to filter out the noise the information age surrounds us with. This is where statistics and data mining come into play. Finding effective ways to distribute these findings is partly data engineering, math, and visualization. Presenting analysis results in graphical form makes them much easier to grasp.