Fifteen years ago, the Hadoop data management platform was created. This kicked off a land rush of companies looking to plant their flags in the market, and open-source projects began to spring up to extend what the platform was designed to do.
As often happens with technology, it ages, and newer things emerge that either eclipse or consume those earlier works. And both of those things have impacted Hadoop: Cloud providers offered huge data storage that overtook HDFS and the proprietary MapR file system. But industry experts point to execution missteps by the Hadoop platform providers as being equally to blame for what appears to be the decline of these platforms.
Things looked bad for the big three in the market. Cloudera and Hortonworks merged to strengthen their offering and streamline operations, but fumbled their release and sales plan. MapR, which offered a leading file system for Hadoop projects, clung to life before finally being rescued, if that’s the right word, by HPE, which has not had a great track record of reviving struggling software.
To get some perspective, it’s important to define exactly what Hadoop is. And that’s no simple task. It started out as a single open-source distributed data storage project to support the Big Data search tool Nutch, but since has grown into the stack that it is today, encompassing data streaming and processing, resource management, analytics and more.
Gartner analyst Merv Adrian said that back when he started covering the space, the question was ‘What is Hadoop?’ Today, he said, it just might be ‘What isn’t Hadoop?’ “I had a conversation with a client that just finished a project where they used TensorFlow, a Google cloud thing for AI, and they used Spark and they used S3 storage, as it happens, because they were on Amazon but they liked the TensorFlow tool,” Adrian recounted. “And they said, ‘This is one of the best Hadoop projects we’ve done so far,’ and I asked them, ‘Why is this a Hadoop project?’ And they said, ‘Well, the Hadoop team built it, and we got the Spark from our [Cloudera] Hortonworks distribution.’ It’s some of the stuff we got with Hadoop plus some other stuff.”
Factors impacting Hadoop
How did we get to this place, where something that seemed so transformational just a few years ago couldn’t sustain itself? First and foremost, the Hadoop platform vendors simply missed the cloud. They were successfully helping companies with on-premises data centers implement distributed file systems and the rest of the stack, while Google, Amazon, Microsoft and, to a lesser degree, Oracle were building this out in the cloud. Further, open-source projects that extended or augmented the Hadoop platforms became viable options in their own right. This created complexity and some confusion.
According to Monte Zweben, co-founder and CEO of data platform provider Splice Machine, the problems stemmed from the growing number of components supporting Hadoop platforms and from swelling lakes of uncurated data. “When Hadoop emerged, a mentality arose that was, to use a fancy word, specious. That mentality was that you could just dump data onto a distributed system in a fairly uncurated and sort of random way, and the users of that data will come. That has proven to not work. In the technical ranks, they call that ‘schema on read,’ meaning, ‘Hey, don’t worry about what these data elements look like, whether they’re numbers or strings. Just dump data out there in any random format and then whoever needs to build applications will make sense of it.’ And that turned out to be a disaster. And what happened with this data lake view is that people ended up with a data swamp.”
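To make the ‘schema on read’ idea concrete, here is a minimal Python sketch of what it asks of whoever consumes the data. The record layout, the field names and the read_with_schema helper are illustrative assumptions, not part of Hadoop or any vendor API.

```python
import json

# Hypothetical raw "data lake": records were dumped as-is, with no agreed schema.
RAW_RECORDS = [
    '{"user_id": 1, "amount": "19.99"}',  # amount stored as a string
    '{"user_id": "2", "amount": 5}',      # user_id stored as a string
    '{"user": 3, "amount": "n/a"}',       # different field name, unparseable amount
]

# Schema chosen by the reader at read time: field name -> type to coerce to.
READ_SCHEMA = {"user_id": int, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and coerce it to the reader's schema.

    Returns (rows, rejects): rows that fit the schema, and raw lines that did not.
    """
    rows, rejects = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            # The mismatch only surfaces now, when someone tries to use the data.
            rejects.append(line)
    return rows, rejects


rows, rejects = read_with_schema(RAW_RECORDS, READ_SCHEMA)
```

In this sketch the first two records are coerced into a consistent shape and the third is rejected. Nothing checked the data when it was written, so every application that reads it has to carry its own version of this cleanup logic, which is one way a data lake can drift toward the swamp Zweben describes.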
Zweben went on to say that complex componentry created a sales problem, due to how complicated they made the Hadoop distributions. “You need a car but what you’re being sold is a suspension system, a fuel injector, some axles, and so on and so forth. It’s just way too difficult. You don’t build your own cars, so why should you build your own distributed platform, and that’s what I think is at the heart of what’s gone sideways for the Hadoop community. Instead of making it easier for the community to implement applications, they just kept innovating with lots of new componentry.”
The emergence of the public cloud, of course, has been cited as a major factor impacting Hadoop vendor platforms. But Scott Gnau, vice president of data platforms at InterSystems and former CTO at Hortonworks, sees it from two sides.
“If you define Hadoop as HDFS, then the game is over … take your toys and go home,” Gnau said. “I don’t think that cloud has single-handedly caused the demise of or trouble for Hadoop vendors … The whole idea of having an open-source file system and a massively parallel compute paradigm — which was the original Hadoop stuff — has waned, but that doesn’t mean that there isn’t a lot of opportunity in the data management space, especially for open-source tools.”
Those open-source projects also have hurt the Hadoop platform vendors, providing less expensive and just as capable substitutes. “There are about a dozen or so things that all distributors have,” Gartner’s Adrian explained. “Bear in mind that in every layer of this stack, there’s an alternative. You might be using HBase but you might be using Accumulo. You might be using Storm, but you might be using Spark back then. Already, by 2017, you could also add, you might be using HDFS or you might be using S3, or rather data lake storage, and that’s very prevalent now.”
Vendors still delivering value
Still, there is much life left in the space. Adrian provided a glimpse of the value remaining there. “Let’s just take the dollars associated with what you could call the Hadoop players, even if they don’t call themselves that. In 2018, if you took the dollars for Cloudera and MapR and Google and AWS Elastic MapReduce, we’re talking about close to $2 billion in revenue representing over 4.2% of the DBMS revenue as Gartner counts it. That makes it bigger than the sum by far of all of the pure-play non-relational vendors who weren’t Hadoop. If you add up MarkLogic, MongoDB, Datastax and Kafka, those guys only add $600 million of revenue — that’s less than a third of the Hadoop space. In 2018.”
Going forward, a big opportunity lies in helping organizations manage their data in hybrid and multicloud environments. Arun Murthy, chief product officer at Cloudera, explained, “Hadoop started off as one open-source project, and it’s now become a movement: a distributed architecture running on commodity hardware, and cloud well fits this concept of commodity hardware. We want to make sure that we actually help customers manage that commodity hardware using open-source technologies. This is why Hadoop becomes an abstraction layer, if you will, and enterprises can use it to move data and workloads better if they choose, with consistent security and governance, and you can run multiple workloads on the same data set. That data can reside on-prem, in Amazon S3, or Microsoft [Azure Data Lake Storage], and you get a consistent one pane of glass, one set of experiences to run all the workloads.”
To that end, Cloudera last month launched the Cloudera Data Platform, a native cloud service designed to manage data and workloads on any cloud, as well as on-premises.
Murthy pointed out that enterprises are embracing the public cloud, and in many cases, more than one. They also are likely to have data they’re retaining on private servers. “IT is trying really hard to make sure they don’t run afoul of regulations, while the line of business is moving really fast, and want to use data for their productions,” he said. “This leads to inherent tension. Both sides are right. In that world, you want to make sure regardless of where you want to do this — on-prem, public cloud and the edge — today, more data is handled outside the data center than inside the data center. When you look at the use cases the line of business wants to solve — even something as prosaic as real-time billing — you want to lift your smartphone and see how much data you used. You need streaming, data transformation, reporting and machine learning.”
Multicloud is another opportunity for ISVs, according to Gartner’s Adrian, who said containers are not going to do this. “Containers will let me pick something up and move it somewhere else and have it run, but it’s not going to let me govern it, it’s not going to let me manage security and policy consistently, from one place. That is one of the opportunities,” he said.
“What Cloudera has ahead of them is a very good, relatively open field to continue to sell what we think of as Hadoop on-premises,” Adrian added, “people who already know what they’re doing, and there are lots of successful use cases that are going to grow. They’re going to sell more nodes for the people who want to be on-prem, and as for people who want to do on-prem, where else are they going to go to? They could cobble it together out of open-source pieces, which, if they haven’t done it by now, they’re not the early adopters with a strong engineering organization that’s going to do that. They’re going to want something packaged.”
As the industry moves forward, the technologies that underlie Hadoop remain, even if they won’t be known as Hadoop.
“Far be it from me to guess what the marketing folks at these companies are going to come up with,” InterSystems’ Gnau said. “With all of the execution missteps by management teams and these companies recently, maybe they want to change their name, to protect the innocent,” he added with a chuckle. “In the end, there is a demand out there for this kind of tech, and folks who are calling it over because of the execution missteps are being a bit short-sighted.
“I’m talking about the need in the marketplace,” he continued. “I’ve got diverse sets of data created by systems or processes that are potentially outside of my control, but I want to capture and map that data into real-time decision-making. What are the tools I need to go do that? Well, provenance is one of the tools I need. Certainly, the ability to have flexibility and not require a schema for capturing, onboarding this data, because data that’s created outside of my control is going to change, the schema’s going to change, so there’s an interesting space for the toolset, regardless of what it ends up being called.”
So whatever their name will be, Hadoop technologies will continue to have a place in the market, no matter who’s supplying them. “I think there is a use case and a relevance for that kind of product and that kind of company,” Gnau said, “and I do think there’s a lot of confusion based on failure to execute versus validity of technology.”