This was the last year in which Big Data was an off-the-street term. As the year ended, Hortonworks was preparing to go public, with Cloudera expected to follow. Big Data will officially become big business in the coming year.
Thus, 2014 was the year when pilot projects went into production, or when small production projects went large scale. As a result, the infrastructure of Hadoop did a lot of growing up this past year. Projects were able to take on larger problem spaces as well, thanks to the release of YARN in 2013. With a year to work with Hadoop’s now generally available job- and task-scheduling system, developers working at the Apache Software Foundation (and inside companies like Hortonworks, Cloudera and MapR) were able to quickly ramp up projects that use YARN, such as Apache Tez, Apache Hama and the management system Apache Ambari.
In the ISV world, SQL on Hadoop was the popular theme of 2014, but it wasn’t until November that Splice Machine launched the first ACID-compliant transactional SQL store on top of Hadoop. Meanwhile, MapR, Pivotal, IBM, Cloudera and others continued to build out SQL capabilities on top of HBase and Apache Cassandra. That means your database workers can finally get back to doing their jobs on Hadoop, instead of spending their time learning Apache Hive or Apache Pig just to access data.
Cloudera’s Impala in particular saw a lot of enthusiasm in 2014 as more and more businesses moved their data into Hadoop and found a need for more traditional data-access methodologies.
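To make that contrast concrete, here is a minimal sketch of what plain-SQL access to Hadoop-resident data can look like, using the open-source impyla driver to query Impala from Python. The hostname, port and table are placeholders for illustration, not details from any particular product or deployment.

```python
# Minimal sketch: querying Hadoop-resident data with ordinary SQL via Impala,
# using the open-source impyla DB-API driver. Host, port and table names are
# hypothetical placeholders.
from impala.dbapi import connect

conn = connect(host='impalad.example.com', port=21050)  # 21050 is Impala's default HiveServer2-compatible port
cursor = conn.cursor()

# A plain SQL query -- no Pig Latin scripts or hand-written MapReduce jobs.
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales
    GROUP BY region
    ORDER BY orders DESC
""")

for region, orders in cursor.fetchall():
    print(region, orders)

cursor.close()
conn.close()
```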
Perhaps just as interesting as what 2014 was is what it wasn’t. 2014 was not yet the year of machine learning, though the popular Apache Mahout project did move off Map/Reduce (its original Map/Reduce-based algorithms will remain available). Despite years of promises of more machine learning for Hadoop users, however, we’re still stuck in the platform’s more basic use cases.
2014 did see the launch of Databricks, the company that will commercially support Apache Spark. Spark is an in-memory, near-real-time processing framework that runs inside Hadoop, and it opens the door to numerous other use cases, such as stream and event processing. You can bet Spark will become more and more relevant as the years go by, but it’s still unlikely that 2015 will be either the year of Spark or the year of machine learning.
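As a rough illustration of the stream-processing door Spark opens, here is a minimal Spark Streaming sketch in Python that counts words arriving on a network socket in one-second micro-batches; that micro-batch model is why Spark is described as near-real-time rather than strictly real-time. The host and port are placeholders, and this is the standard word-count pattern rather than anything Databricks-specific.

```python
# Minimal sketch of stream processing with Spark Streaming: count words arriving
# on a TCP socket in one-second micro-batches. Host/port are placeholders; for a
# quick test, feed it with a local source such as `nc -lk 9999`.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print each batch's word counts to the driver output

ssc.start()
ssc.awaitTermination()
```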
But 2015 will most certainly be the year the stock market first gets its hands on Hadoop, in the form of publicly traded shares. That should make it an interesting year.