Since Apache Spark became a Top-Level Project at Apache almost a year ago, it has seen wide coverage and adoption in the industry. With its promise of running up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk, it looks like a real alternative to pure MapReduce.
Written in Scala, it lets you write applications quickly in Java, Python and Scala, and the syntax isn't that hard to learn. There are also built-in libraries for SQL (Spark SQL), machine learning (MLlib, which interoperates with Python's NumPy), graph processing (GraphX) and streaming (Spark Streaming). This makes Spark a really good alternative for big data processing.
Another feature of Apache Spark is that it runs everywhere: on top of Hadoop, standalone, or in the cloud. It can also easily access diverse data stores such as HDFS, Amazon S3, Cassandra and HBase.
The easy integration with Amazon Web Services is what makes it attractive to me, since I am already using AWS. I also like the Python integration, because lately Python has become my favourite language for data manipulation and machine learning.
Besides the official parts of Spark mentioned above, there are also some really nice external packages that, for example, integrate Spark with tools such as Pig and Amazon Redshift, or add further machine learning algorithms.
Given the promised speed gain, the ease of use, the full range of tools available and the integration with third-party programs such as Tableau or MicroStrategy, Spark seems to have a bright future ahead.
The creators of Apache Spark also founded a company called Databricks, which offers professional services around Spark.