Is the Hadoop party over?

Published: November 8th, 2019

Fifteen years ago, the Hadoop data management platform was created. This kicked off a land rush of companies looking to plant their flags in the market, and open-source projects began to spring up to extend what the platform was designed to do.

As often happens, technology ages, and newer things emerge that either eclipse or consume the earlier works. Both have happened to Hadoop: cloud providers offered huge data storage that overtook HDFS and the proprietary MapR file system. But industry experts point to execution missteps by the Hadoop platform providers as being equally to blame for what appears to be the decline of these platforms.

Things looked bad for the big three in the market. Cloudera and Hortonworks merged to strengthen their offering and streamline operations, but the combined company fumbled its release and sales plan. MapR, which offered a leading file system for Hadoop projects, clung to life before finally being rescued, if that’s the right word, by HPE, which has not had a great track record of reviving struggling software.

To get some perspective, it’s important to define exactly what Hadoop is. And that’s no simple task. It started out as a single open-source distributed data storage project to support the Big Data search tool Nutch, but has since grown into the stack that it is today, encompassing data streaming and processing, resource management, analytics and more.

Gartner analyst Merv Adrian said that back when he started covering the space, the question was ‘What is Hadoop?’ Today, he said, it just might be ‘What ISN’T Hadoop?’ “I had a conversation with a client that just finished a project where they used TensorFlow, a Google cloud thing for AI, and they used Spark and they used S3 storage, as it happens, because they were on Amazon but they liked the TensorFlow tool,” Adrian recounted. “And they said, ‘This is one of the best Hadoop projects we’ve done so far,’ and I asked them, ‘Why is this a Hadoop project?’ And they said, ‘Well, the Hadoop team built it, and we got the Spark from our [Cloudera] Hortonworks distribution.’ It’s some of the stuff we got with Hadoop plus some other stuff.”

Factors impacting Hadoop
How did we get to this place, where something that seemed so transformational just a few years ago couldn’t sustain itself? First and foremost, the Hadoop platform vendors simply missed the cloud. They were successfully helping companies with on-premises data centers implement distributed file systems and the rest of the stack, while Google, Amazon, Microsoft and, to a lesser degree, Oracle were building this out in the cloud. Further, open-source projects that extended or augmented the Hadoop platforms became viable options in their own right. This created complexity and some confusion.

According to Monte Zweben, co-founder and CEO of data platform provider Splice Machine, the problems stemmed from the growing number of components supporting Hadoop platforms, and from swelling lakes of uncurated data. “When Hadoop emerged, a mentality arose that was, to use a fancy word, specious. That mentality was that you could just dump data onto a distributed system in a fairly uncurated and sort of random way, and the users of that data will come. That has proven to not work. In the technical ranks, they call that ‘schema on read,’ meaning, ‘Hey, don’t worry about what these data elements look like, whether they’re numbers or strings. Just dump data out there in any random format and then whoever needs to build applications will make sense of it.’ And that turned out to be a disaster. And what happened with this data lake view is that people ended up with a data swamp.”
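
To make the ‘schema on read’ idea concrete, here is a minimal Python sketch of the pattern Zweben describes. It is an illustration only, not code from any Hadoop distribution: the record layout, the field names (user, bytes_used) and the function read_with_schema are all hypothetical. Data is dumped in whatever shape it arrives, and the reader has to impose a schema and cope with whatever does not fit.

```python
import json

# Hypothetical raw records, written to storage with no agreed-upon schema.
RAW_LINES = [
    '{"user": "a1", "bytes_used": 1024}',    # number, as the reader expects
    '{"user": "b2", "bytes_used": "2048"}',  # same field, but stored as a string
    '{"user": "c3", "usage": 512}',          # different field name entirely
    'not json at all',                       # unparseable debris
]


def read_with_schema(lines):
    """Apply a schema at read time: keep rows that fit, collect the rest."""
    good, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            # The "schema" lives only here, in the reading application.
            good.append({
                "user": str(record["user"]),
                "bytes_used": int(record["bytes_used"]),
            })
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields or wrong types all land here.
            rejected.append(line)
    return good, rejected


if __name__ == "__main__":
    rows, rejects = read_with_schema(RAW_LINES)
    print(len(rows), "usable rows;", len(rejects), "rejected lines")
```

In this toy input, half of the lines are rejected, and every application that reads the same data has to repeat this cleanup on its own, which is one way to picture how a data lake drifts toward a data swamp.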

Zweben went on to say that complex componentry created a sales problem because of how complicated it made the Hadoop distributions. “You need a car but what you’re being sold is a suspension system, a fuel injector, some axles, and so on and so forth. It’s just way too difficult. You don’t build your own cars, so why should you build your own distributed platform, and that’s what I think is at the heart of what’s gone sideways for the Hadoop community. Instead of making it easier for the community to implement applications, they just kept innovating with lots of new componentry.”

The emergence of the public cloud, of course, has been cited as a major factor impacting Hadoop vendor platforms. But Scott Gnau, vice president of data platforms at InterSystems and former CTO at Hortonworks, sees it from two sides.

“If you define Hadoop as HDFS, then the game is over … take your toys and go home,” Gnau said. “I don’t think that cloud has single-handedly caused the demise of or trouble for Hadoop vendors … The whole idea of having an open-source file system and a massively parallel compute paradigm — which was the original Hadoop stuff — has waned, but that doesn’t mean that there isn’t a lot of opportunity in the data management space, especially for open-source tools.”

Those open-source projects also have hurt the Hadoop platform vendors, providing less expensive and just as capable substitutes. “There are about a dozen or so things that all distributors have,” Gartner’s Adrian explained. “Bear in mind that in every layer of this stack, there’s an alternative. You might be using HBase but you might be using Accumulo. You might be using Storm, but you might be using Spark back then. Already, by 2017, you could also add, you might be using HDFS or you might be using S3, or rather data lake storage, and that’s very prevalent now.”

Vendors still delivering value
Still, there is much life left in the space. Adrian provided a glimpse of the value remaining there. “Let’s just take the dollars associated with what you could call the Hadoop players, even if they don’t call themselves that. In 2018, if you took the dollars for Cloudera and MapR and Google and AWS Elastic MapReduce, we’re talking about close to $2 billion in revenue representing over 4.2% of the DBMS revenue as Gartner counts it. That makes it bigger than the sum by far of all of the pure-play non-relational vendors who weren’t Hadoop. If you add up MarkLogic, MongoDB, Datastax and Kafka, those guys only add $600 million of revenue — that’s less than a third of the Hadoop space. In 2018.”

Going forward, a big opportunity lies in helping organizations manage their data in hybrid and multicloud environments. Arun Murthy, chief product officer at Cloudera, explained, “Hadoop started off as one open-source project, and it’s now become a movement: a distributed architecture running on commodity hardware, and cloud well fits this concept of commodity hardware. We want to make sure that we actually help customers manage that commodity hardware using open-source technologies. This is why Hadoop becomes an abstraction layer, if you will, and enterprises can use it to move data and workloads better if they choose, with consistent security and governance, and you can run multiple workloads on the same data set. That data can reside on-prem, in Amazon S3, or Microsoft [Azure Data Lake Storage], and you get a consistent one pane of glass, one set of experiences to run all the workloads.”

To that end, Cloudera last month launched the Cloudera Data Platform, a native cloud service designed to manage data and workloads on any cloud, as well as on-premises.

Murthy pointed out that enterprises are embracing the public cloud, and in many cases, more than one. They also are likely to have data they’re retaining on private servers. “IT is trying really hard to make sure they don’t run afoul of regulations, while the line of business is moving really fast, and want to use data for their productions,” he said. “This leads to inherent tension. Both sides are right. In that world, you want to make sure regardless of where you want to do this — on-prem, public cloud and the edge — today, more data is handled outside the data center than inside the data center. When you look at the use cases the line of business wants to solve — even something as prosaic as real-time billing — you want to lift your smartphone and see how much data you used. You need streaming, data transformation, reporting and machine learning.”

There is another opportunity for ISVs to play the multicloud game, according to Gartner’s Adrian, who said containers are not going to do this. “Containers will let me pick something up and move it somewhere else and have it run, but it’s not going to let me govern it, it’s not going to let me manage security and policy consistently, from one place. That is one of the opportunities,” he said.

“What Cloudera has ahead of them is a very good, relatively open field to continue to sell what we think of as Hadoop on-premises,” Adrian added, “people who already know what they’re doing, and there are lots of successful use cases that are going to grow. They’re going to sell more nodes for the people who want to be on-prem, and as for people who want to do on-prem, where else are they going to go to? They could cobble it together out of open-source pieces, which, if they haven’t done it by now, they’re not the early adopters with a strong engineering organization that’s going to do that. They’re going to want something packaged.”

As the industry moves forward, the technologies that underlie Hadoop remain, even if they won’t be known as Hadoop.

“Far be it for me to guess what the marketing folks at these companies are going to come up with,” InterSystems’ Gnau said. “With all of the execution missteps by management teams and these companies recently, maybe they want to change their name, to protect the innocent,” he added with a chuckle. “In the end, there is a demand out there for this kind of tack, and folks who are calling it over because of the execution missteps are being a bit short-sighted.

“I’m talking about the need in the marketplace,” he continued. “I’ve got diverse sets of data created by systems or processes that are potentially outside of my control, but I want to capture and map that data into real-time decision-making. What are the tools I need to go do that? Well, provenance is one of the tools I need. Certainly, the ability to have flexibility and not require a schema for capturing, onboarding this data, because data that’s created outside of my control is going to change, the schema’s going to change, so there’s an interesting space for the toolset, regardless of what it ends up being called.”

So whatever their name will be, Hadoop technologies will continue to have a place in the market, no matter who’s supplying them. “I think there is a use case and a relevance for that kind of product and that kind of company,” Gnau said, “and I do think there’s a lot of confusion based on failure to execute versus validity of technology.”

Article Tags

cloud, Cloudera, compute, data, data streaming, Hadoop

About David Rubinstein

David Rubinstein is editor-in-chief of SD Times.
