Data Science Tools

What tools are used for Data Science? There are a lot of them out there, and in this post I want to tell you about the ones I currently use or have used before.

  • KNIME is a graphical tool to analyse data. It uses an interface to build process flows that contain everything from data transformation and initial analysis to predictive analysis, visualization and reporting. One of its advantages is its huge community; because it is an open source tool, the community is encouraged to contribute.
  • RapidMiner from Rapid-I is also a graphical tool to analyse data. Processes are built using predefined steps. It provides data transformation, initial analysis, predictive analysis, visualization and reporting. Since it is based on Java, it is platform independent. There is a community too, which helps to improve the program and expands the available resources.
  • SAS has a whole suite of tools for data manipulation and analysis. They provide OLAP tools, predictive analytics, reporting and visualization. Having been in the market for a long time, they have a huge customer base and lots of experience. There is also a system of trainings with exams that provides certified qualifications in using their tools.
  • R is a free tool that was first developed for scientists in biology, but it has spread through all kinds of industries thanks to its wide range of packages. It has no graphical interface, but the language is easy to learn. R provides data manipulation, initial analysis, predictive analysis, visualization and reporting. There is also an integration with Hadoop for better interaction with Big Data.
  • Splunk is a tool primarily for analysing unstructured data, like logfiles. It provides real-time statistics and outstanding visualization for reports. Its language is related to SQL, so it is pretty easy to learn if you have used SQL queries before.
  • Jedox provides an OLAP server with a web interface that looks like MS Excel, and there is a plugin for MS Excel too. It caters mainly to controlling needs, but has some advantages regarding self-service BI. Based on PHP and Java, it is available in a community version and a professional version.
  • FastStats from Apteco uses an easy-to-understand graphical interface and some basic predictive methods. It enables business users to analyse their data themselves and even build small models. It also provides visualization tools. This is another tool catering to self-service BI.

If you use and like other tools, please feel free to share them with me. I am always interested in learning about new tools.

Data Science and Machine Learning

Machine Learning is acknowledged as a part of Data Science, but will it be able to replace a Data Scientist?
There have been several articles on that topic in the last few months and years. It’s true that there has been some major progress in the field of machine learning, and there are already articles about the beginning of automated science, such as the work of Lipson and Schmidt.
During the SXSWeek there will even be a panel on this topic: The Data Scientist Will Be Replaced By Tools.
The main question is: will machine results replace human expertise? There are several startups that provide data science as a service, like Prior Knowledge or Platfora. These companies help to discover knowledge hidden in a company’s data. Prior Knowledge looks for correlations in the provided data and helps to build predictive models. Platfora, on the other hand, wants to make Hadoop usable for everyone.
These companies can help discover information, but only in combination with human expertise from inside the company is it possible to make the most of the uncovered information. So, in my opinion, machine learning helps make the job of a data scientist easier, because he can concentrate more on his expertise in the context the data was created in.
This may even help to broaden access to data science for more people.

Data Scientist: Hype or Sexy?

Data Scientists seem to be everywhere nowadays. This title has seen a huge increase in appearances in job descriptions, as job listing data demonstrates.
Several sites and articles even describe the job as sexy.

The combination of handling Big Data and analytics is what makes this title so attractive. So far, handling data and analytics were two separate parts, sometimes combined in one person, but most times not. With all the unstructured data available and new tools to handle it, the combination has become easier for one person to manage. But technical understanding is not enough to become a Data Scientist. It requires an understanding of products and customer behaviour as well as of how to manage and analyse data.
Right now there is no degree program that awards the title Data Scientist, so companies have to look for people who have learned all the skills during their career so far, or who have a strong affinity for analysis or programming and picked up the corresponding skills during their studies or education.
Because this field is so interdisciplinary, some companies try to get their hands on physics graduates. Their studies involve a great deal of interdisciplinary themes and build an affinity for both algorithms and research.
Other candidates are Business Intelligence analysts, Data Mining specialists or even Web Analytics managers, depending on their career so far, especially their experience with different kinds of data and their skill in presenting the actions that result from their analysis.
All in all this new title and the news coverage it is getting is a great opportunity for people already working in this field and newbies wanting to work with both data and analysis.

This makes this new profession sexy.

Data Science: What is it?

Data Science is an interdisciplinary field. It includes:

      Data Engineering
      Advanced Computing
      Domain Expertise

It revolves around, as the name suggests, working with data. As our society creates ever bigger sets of data, the need to analyze this data grows across all industries. This calls for people who can work with data, analyze it and make their findings available for management decisions. This is where math, statistics, domain expertise and visualization come into play.
One of the most important challenges is to integrate findings from unstructured data, like logfiles, with findings from structured data, like databases. This enables a company to develop a whole new view of its customers and its products.
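
To make this a bit more concrete, here is a minimal sketch in Python of what combining the two kinds of data can look like. Everything in it is made up for illustration: the log line format, the customer table and the function names count_page_views and enrich_customers are assumptions, not taken from any real system. The idea is simply to pull a customer id out of free-text log lines and attach the resulting counts to structured customer records.

```python
import re
from collections import Counter

# Hypothetical web-server log lines (the unstructured side).
LOG_LINES = [
    "2013-02-01 10:15:02 customer=17 GET /products/42",
    "2013-02-01 10:15:09 customer=17 GET /cart",
    "2013-02-01 10:16:30 customer=23 GET /products/42",
    "malformed line without a customer id",
]

# Hypothetical customer table (the structured side), keyed by customer id.
CUSTOMERS = {
    17: {"name": "Alice", "segment": "premium"},
    23: {"name": "Bob", "segment": "standard"},
}

# Assumed log format: "customer=<id> GET <path>" somewhere in the line.
LINE_PATTERN = re.compile(r"customer=(\d+)\s+GET\s+(\S+)")


def count_page_views(lines):
    """Count page views per customer id, skipping lines that do not match."""
    views = Counter()
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match:
            views[int(match.group(1))] += 1
    return views


def enrich_customers(customers, views):
    """Return one combined record per customer: table fields plus view count."""
    return {
        cid: {**record, "page_views": views.get(cid, 0)}
        for cid, record in customers.items()
    }


combined = enrich_customers(CUSTOMERS, count_page_views(LOG_LINES))
for cid, record in sorted(combined.items()):
    print(cid, record)
```

In a real project the log format will be messier and the customer table will live in a database, but the pattern of extracting a shared key and joining on it stays the same.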
One task of data science is to make all necessary information available and filter out the noise that this information age provides us with. This is where statistics and data mining come into play. Finding effective ways to distribute these findings is partly data engineering, partly math and visualization. Presenting the results of an analysis in graphical form makes them easier to grasp.
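
As a small illustration of filtering out noise with basic statistics, the following Python sketch smooths a series with a rolling median. It is only one of many possible techniques, and the daily_visits numbers and the rolling_median helper are invented for this example. A single extreme value (the 400 below) barely affects the median of its neighbours, so the smoothed series shows the underlying level instead of the spike.

```python
from statistics import median


def rolling_median(values, window=3):
    """Replace each point with the median of the points around it.

    At the edges the window is simply cut off, so fewer points are used.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        smoothed.append(median(values[start:end]))
    return smoothed


# Hypothetical daily visit counts with one obvious outlier.
daily_visits = [100, 102, 98, 400, 101, 99, 103]
smoothed = rolling_median(daily_visits)
print(smoothed)
```

Whether a spike like this is noise or the most interesting point in the data is exactly the kind of question that needs domain expertise, so a filter like this is a starting point, not a decision.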