Buzzwords come and go in the software development industry. Some, like “agile” and “test-driven development,” have stood the test of time. Others, like “SOA” and “LISPy everything,” haven’t fared as well. But “Big Data,” as a buzzword and as a quantifiable problem, is unique in the world of buzz.
Service-oriented architecture, agile development and indeed most other development-related buzzwords are prescriptive: SOA and agile are both solutions to the classic problem of software development, that is, not getting enough work done. But Big Data as a buzzword is not a solution. It represents a problem, and if your company does not have that problem now, it assuredly will soon.
Emil Eifrém, CEO of Neo Technology (a producer of the Neo4j graph database), said that Big Data is here to stay. “First, Big Data is not a fad. We see it every day,” he said. “You’ve all seen a gazillion presentations and analyst reports on the exponential growth of data. Supposedly, all new information generated this year will be more than all the data generated by humanity in all prior years of history combined.”
So, clearly, there’s plenty of data out there to deal with. But the Big-Data problem, as it were, isn’t just about having that big stack of information. It’s about juicing it like a pear for the sweet nectar of truth that awaits inside.
Big Data is about figuring out what to do with all that information that comes pouring out of your applications, your websites and your business transactions. The logs, records and details of these various systems have to go somewhere, and sticking them into a static data warehouse for safekeeping is no longer the way to handle the problem.
Instead, vendors, developers and the open-source community have all designed their own solutions to the problem. For most Big-Data problems, the Apache Hadoop Project is the most popular solution, though it is not the only option. Since its creation in 2005, Hadoop has grown to become the busiest project in the Apache Software Foundation’s portfolio.
The reason for this popularity is that Hadoop tackles two of the most ornery Big-Data problems right off the bat, storage and processing: it combines a distributed file system known as HDFS with an implementation of the MapReduce programming model. As a cluster environment, Hadoop can take batch-processing jobs and distribute them across multiple machines, each of which holds a chunk of the larger data picture.
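To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It does not use Hadoop or its APIs; it only imitates the pattern described above, in which each machine maps over its own chunk of data and the partial results are then grouped and reduced. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (token, 1) pair for every token in one chunk of log lines."""
    return [(token.lower(), 1) for line in chunk for token in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output.

    In a real cluster each chunk would live on a different machine and the
    map calls would run in parallel; here they simply run one after another.
    """
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Two hypothetical chunks, standing in for blocks of a log file
# stored on two different nodes.
chunks = [
    ["GET /index.html", "GET /about.html"],
    ["POST /login", "GET /index.html"],
]
counts = run_job(chunks)
```

Because every map call touches only its own chunk, the work can be moved to wherever that chunk is stored, which is the property that lets a cluster spread a batch job across many machines.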
Facebook often touts its Hadoop cluster as an example of success, citing its size of over 45PB as a sign that Hadoop can handle even the largest of data sets. But there are other signs that point to the increasing power, relevance and appeal of Hadoop. First of all, there are now three major Hadoop companies, with more popping up every day. Outside of the dedicated ISVs, analytics firms and major software vendors are also building connectors to Hadoop and its sub-projects.
Why all the enthusiasm for Hadoop? Because there’s no alternative at the moment if you have to deal with data in the petabyte range. Below that threshold, a number of other solutions are available, but even their vendors have realized that no matter how robust those solutions are, Hadoop integration can only make them better.