Further development and new additions to the Hadoop framework, such as Stinger from Hortonworks or Impala from Cloudera, try to bridge the gap between traditional EDWH architectures and big data architectures.
In particular, the Stinger.next initiative, which aims to speed up Hive and bring the SQL:2011 standard to Map/Reduce Hadoop clusters, makes this technology usable for developers with a SQL background. This next iteration of Hive optimization also brings an ACID framework with transactions and writable tables. This is especially useful in data warehouse contexts, for example when you need to add metadata.
With these developments it seems plausible that Hadoop, and with it Big Data as a whole, will move from being an ETL platform for traditional EDWH architectures built on traditional database systems to a unified platform, where Hadoop stores all data, from raw unstructured data to structured data from the company's transactional systems and the metadata created for reporting purposes. Access to all data would then be available in one system and queryable with SQL.
Standard reporting and deeper analysis of all data could then run on the same system, so that all analysts and traditional BI developers share one platform and a better understanding of all the data needed and used in the data warehouse system.
I have already benchmarked query speed for MySQL, Stinger and Impala and will update that benchmark once Stinger.next is out.
Getting data, huge amounts of data, out of some systems can be quite a hassle. Often you are required to use techniques such as FTP or SSH for transferring files. But with RESTful APIs getting more attention in the last few years, there is a new way to get your data.
The charm of REST APIs is that they are stateless and use HTTP methods explicitly. This makes getting data pretty straightforward:
- Use POST to create a resource on the server.
- Use GET to retrieve a resource.
- Use PUT to change a resource.
- Use DELETE to remove a resource.
The result can be returned in any defined format, but mostly it is XML or JSON. Security is also provided if you integrate authentication methods like OAuth or LDAP.
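The mapping of HTTP methods to operations listed above can be sketched without any network at all. The following is a minimal illustration, not a real server: the class `InMemoryReportApi`, the `/reports` collection and the status codes chosen are assumptions made for this example. Each call carries everything the handler needs, which is what statelessness means here: the handler keeps the resources themselves, but no per-client session.

```python
import json


class InMemoryReportApi:
    """Toy stand-in for a REST endpoint (hypothetical, for illustration only)."""

    def __init__(self):
        self._reports = {}   # report id -> stored fields
        self._next_id = 1

    def _render(self, report_id):
        # Results are returned as JSON text, one of the two common formats.
        return json.dumps({**self._reports[report_id], "id": report_id})

    def handle(self, method, path, body=None):
        """Dispatch one request and return (status_code, response_text)."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "reports" or len(parts) > 2:
            return 404, json.dumps({"error": "not found"})

        if len(parts) == 1:
            # Collection URL: only POST (create) is supported in this sketch.
            if method != "POST":
                return 405, json.dumps({"error": "method not allowed"})
            report_id = self._next_id
            self._next_id += 1
            self._reports[report_id] = dict(body or {})
            return 201, self._render(report_id)

        # Item URL: /reports/<id>
        if not parts[1].isdigit() or int(parts[1]) not in self._reports:
            return 404, json.dumps({"error": "not found"})
        report_id = int(parts[1])

        if method == "GET":       # retrieve the resource
            return 200, self._render(report_id)
        if method == "PUT":       # change (replace) the resource
            self._reports[report_id] = dict(body or {})
            return 200, self._render(report_id)
        if method == "DELETE":    # remove the resource
            del self._reports[report_id]
            return 204, ""
        return 405, json.dumps({"error": "method not allowed"})
```

A typical round trip would be `api.handle("POST", "/reports", {"name": "sales"})` followed by `api.handle("GET", "/reports/1")`. A real API would sit behind an HTTP server and add authentication, but the method-to-operation mapping stays the same.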
This gives you new possibilities to integrate your data into web-based reporting systems, since you only have to use the HTTP protocol to get your data and can work on the results as they stream in.
Since many REST APIs allow the result of a request to be stored, you can get the same result again at a later time without having to process it on the source system again.
Hadoop even provides a REST API called WebHDFS, developed by Hortonworks, which supports the complete filesystem interface of HDFS. This is a great help if you are running applications on your Hadoop cluster that are not written in Java. So you can manipulate and access your data from just about anywhere.
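As a rough sketch of what such a call looks like, the helper below only builds the HTTP method and URL for a few WebHDFS operations and does not send anything. The function name `webhdfs_request`, the host `namenode.example.com` and the port are assumptions for illustration; the `/webhdfs/v1` prefix and the `op` query parameter follow the WebHDFS documentation, but check the documentation of your Hadoop version for the port and the full list of operations.

```python
from urllib.parse import quote, urlencode

# A few WebHDFS operations and the HTTP method each one uses.
WEBHDFS_METHODS = {
    "OPEN": "GET",            # read a file
    "GETFILESTATUS": "GET",   # metadata of a file or directory
    "LISTSTATUS": "GET",      # list a directory
    "MKDIRS": "PUT",          # create a directory
    "DELETE": "DELETE",       # delete a file or directory
}


def webhdfs_request(host, port, hdfs_path, op, **params):
    """Return (http_method, url) for a WebHDFS call; nothing is sent."""
    if op not in WEBHDFS_METHODS:
        raise ValueError(f"operation not covered by this sketch: {op}")
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute, e.g. /data/file.json")
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"
    return WEBHDFS_METHODS[op], url


# Hypothetical cluster address, used only to show the shape of the URL.
method, url = webhdfs_request(
    "namenode.example.com", 50070, "/data/raw/events.json", "OPEN"
)
```

Because the result is plain HTTP, any language with an HTTP client can issue the request, which is the point made above about non-Java applications.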
There are many fields in which big data can improve results, one of them being (e-)learning. Until recently, the analysis of learning focused on exam results, but with big data and analytics there are new possibilities to enhance the learning experience as a whole. For example, learning can be personalized to help students achieve better results, and Big Data makes this possible in nearly real time. A program can help students during the learning process by offering a solution within the workflow as soon as it detects a problem, instead of the student having to interrupt their learning, get the problem solved, and then continue. This also applies to working environments.
Analysing data can come in handy not only during a learning or work process. Even after a course is finished, analysing the data produced by all students during the course can help optimize the course and the resulting exam. Identifying where users got stuck or what was too easy will improve the learning experience for everyone.
There are already efforts to integrate this into the learning experience, such as the Predictive Analytics Reporting (PAR) Framework.
PAR is trying to integrate several data sources and base its studies on this combined data, instead of relying on studies that are based on individual programs. This approach broadens the base and may make it possible to find other (better) insights into the educational system of the U.S.
With Big Data, Map/Reduce is always the first term that comes to mind. But it’s not the only way to handle large amounts of data. There are database systems built specifically to deal with huge amounts of data, and they are called Massively Parallel Processing (MPP) databases.
MPP database systems have been around longer than Map/Reduce and its most popular implementation, Hadoop, and are based on a shared-nothing architecture. The data is partitioned across several hardware nodes, and queries are coordinated by a central server that communicates with the nodes over a network interconnect. They often use commodity hardware that is as inexpensive as hardware for Map/Reduce. For working with data they have the advantage of using SQL as their interface, the language used by most Data Scientists and other analytics professionals so far.
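The shared-nothing idea can be sketched in a few lines: rows are assigned to nodes by hashing a key, each node aggregates only its own partition, and a coordinator merges the partial results. This is a simplified single-process illustration, not how any particular MPP product is implemented; the function names and the `(customer, amount)` row layout are assumptions for the example.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick the node that owns a key; crc32 is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, num_nodes):
    """Distribute (customer, amount) rows so each node holds a disjoint set of keys."""
    nodes = [[] for _ in range(num_nodes)]
    for customer, amount in rows:
        nodes[node_for(customer, num_nodes)].append((customer, amount))
    return nodes


def local_sum(node_rows):
    """Work done on one node, using only the rows stored on that node."""
    totals = defaultdict(int)
    for customer, amount in node_rows:
        totals[customer] += amount
    return dict(totals)


def coordinator_sum(rows, num_nodes):
    """Central step: collect each node's partial result and merge them."""
    merged = {}
    for node_rows in partition_rows(rows, num_nodes):
        # Keys never overlap between nodes because partitioning is by key,
        # so merging is a plain dictionary update.
        merged.update(local_sum(node_rows))
    return merged
```

Partitioning on the grouping key is what keeps the merge trivial here; if the data were partitioned on a different column, the coordinator would have to add up partial sums for the same key from several nodes.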
Map/Reduce provides a Java interface to analyse the data, which takes more time to implement than just writing an SQL statement. Hadoop has some projects that provide an SQL-like query language, such as Hive, which offers HiveQL as its interface.
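To make the difference in effort concrete, here is the same count-per-page aggregation written twice: once as explicit map, shuffle and reduce steps, and once as a single SQL statement. This is a sketch in Python rather than Java, and it uses the standard library's `sqlite3` module purely as a stand-in for an SQL interface such as HiveQL or an MPP database; the sample data and function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: one entry per page view.
VIEWS = ["home", "pricing", "home", "docs", "home", "docs"]


def map_phase(records):
    """Map: emit a (key, 1) pair for every record."""
    for page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def count_with_mapreduce(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


def count_with_sql(records):
    """The same aggregation expressed as one GROUP BY statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE views (page TEXT)")
        conn.executemany(
            "INSERT INTO views (page) VALUES (?)", [(p,) for p in records]
        )
        rows = conn.execute(
            "SELECT page, COUNT(*) FROM views GROUP BY page"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

Both functions return the same counts per page. The Map/Reduce version needs three hand-written steps where the SQL version needs one declarative statement, which is the gap that SQL-like layers on Hadoop try to close.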
Since both systems handle data, a lot can be gained when both are combined. There are already projects working on that, like Aster Data nCluster or Teradata and Hortonworks.
There is even a new product bringing both worlds together as one product, Hadapt. With this product you can access all your data, structured or unstructured, in a single platform. Each node has space for SQL as well as for Map/Reduce.
Last but not least a list of some MPP databases available right now:
Depending on your business needs, you may not need a Map/Reduce cluster but an MPP database, or both, to benefit from their respective strengths in your implementation.
Data Scientists seem to be everywhere nowadays. This title has seen a huge increase in appearances in job descriptions, as Indeed.com demonstrates in its data.
There are several sites and articles that even describe the job as sexy:
The combination of handling Big Data and analytics is what makes this title so attractive. So far, handling data and analytics were two separate parts, sometimes combined in one person, but most of the time not. With all the unstructured data available and new tools to handle it, the combination has become easier for one person to manage. But technical understanding alone is not enough to become a Data Scientist. It also requires an understanding of products and customer behaviour as well as of how to manage and analyse data.
Right now there is no degree program that graduates students with the title Data Scientist, so companies have to look for people who have learned all the skills during their career so far, or who have a strong affinity for analysis or programming with corresponding skills learned during their studies or education.
Because of the interdisciplinary nature of this field, some companies try to get their hands on physics graduates. Their studies involve a great deal of interdisciplinary topics and foster an affinity for both algorithms and research.
Other candidates are Business Intelligence analysts, data mining specialists or even Web Analytics managers, depending on their career so far, especially their experience with different kinds of data and their skill in presenting the actions that result from their analysis.
All in all, this new title and the news coverage it is getting are a great opportunity for people already working in this field and for newcomers wanting to work with both data and analysis.
This makes this new profession sexy.