Big Data just gets bigger

Published: May 25th, 2012

- Alex Handy

Buzzwords come and go in the software development industry. Some, like “agile” and “test-driven development,” have stood the test of time. Others, like “SOA” and “LISPy everything,” haven’t fared as well. But “Big Data,” as a buzzword and as a quantifiable problem, is unique in the world of buzz.

Service-oriented architecture, agile development and indeed most other development-related buzzwords are prescriptive: SOA and agile are both solutions to the classic problem of software development, that is, not getting enough work done. But Big Data as a buzzword is not a solution. It names a problem, and if your company does not have that problem now, it assuredly will soon.

Emil Eifrém, CEO of Neo Technology (a producer of the Neo4j graph database), said that Big Data is here to stay. “First, Big Data is not a fad. We see it every day,” he said. “You’ve all seen a gazillion presentations and analyst reports on the exponential growth of data. Supposedly, all new information generated this year will be more than all the data generated by humanity in all prior years of history combined.”

So, clearly, there’s plenty of data out there to deal with. But the Big-Data problem, as it were, isn’t just about having that big stack of information. It’s about juicing it like a pear for the sweet nectar of truth that awaits inside.

Big Data is about figuring out what to do with all that information that comes pouring out of your applications, your websites and your business transactions. The logs, records and details of these various systems have to go somewhere, and sticking them into a static data warehouse for safekeeping is no longer the way to handle the problem.

Instead, vendors, developers and the open-source community have all designed their own solutions. For most of these problems, the Apache Hadoop Project is the most popular answer, though it is not the only option. Since its creation in 2005, Hadoop has grown to become the busiest project in the Apache Software Foundation’s retinue.

The reason for this popularity is that Hadoop solves two of the most ornery Big-Data problems right off the bat, processing and storage: It combines an implementation of the MapReduce programming model with a distributed file system known as HDFS. As a cluster environment, Hadoop can take batch-processing jobs and distribute them across multiple machines, each of which holds a chunk of the larger data picture.
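
For readers who have not seen MapReduce up close, the idea can be sketched in a few lines of Python. This is a single-process toy, not Hadoop code: the sample chunks and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and each string merely stands in for the slice of data one machine would hold.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical input: each string stands in for the chunk of data
# held by a different machine in the cluster.
CHUNKS = [
    "error disk full",
    "login ok",
    "error network down",
]


def map_phase(chunk: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one chunk."""
    for word in chunk.split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as a framework would between phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(chunks: list[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over every chunk."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(CHUNKS))
```

In a real cluster the map calls would run on the machines that already hold each chunk, and only the grouped intermediate pairs would travel over the network.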

Facebook often touts its Hadoop cluster as an example of success, citing its size of over 45PB as a sign that Hadoop can handle even the largest of data sets. But there are other signs that point to the increasing power, relevance and appeal of Hadoop. First of all, there are now three major Hadoop companies, with more popping up every day. Outside of the dedicated ISVs, analytics firms and major software vendors are also building connectors to Hadoop and its sub-projects.
Why all the enthusiasm for Hadoop? Because there’s no alternative at the moment if you have to deal in the petabyte range. Below that threshold a number of other solutions are available, but even vendors have realized that no matter how robust their solutions are, Hadoop integrations can only make them better.

John Bantleman, CEO of RainStor, said that Hadoop really became relevant around two years ago. “Go back two years, and the data management landscape was all [online transaction processing] relational databases, like Oracle,” he said. “Once you start to hit velocity and volume, you’ll generate hundreds of terabytes of data and billions of records a day. Your general row-based relational database tops out. That’s forcing customers to look at alternative solutions.

“Part of what’s available in the market is data-warehousing technology in products like Teradata. Those do scale to petabytes, but they’re extremely costly. They cost hundreds of millions of dollars to put that infrastructure in place. Hadoop has a cloud-like infrastructure, which allows you to manage that data at that scale at a fraction of the cost.”

Although RainStor essentially competes with Hadoop, the company still offers its products in forms that work on top of, or in conjunction with, Hadoop. So while RainStor’s bread and butter is storing large amounts of data in its highly compressed database, it is also able to perform analytics across a Hadoop cluster instead of inside itself.

“My view is Hadoop is going to be a platform, kind of like Linux is a platform,” said Bantleman. “It’s going to be a management system to manage Big Data. Generally, you have to build the stack out to meet enterprise requirements.”

Horton Hears a 2.0
Hortonworks (a producer of supplemental Hadoop software) is pushing forward efforts to do just this: turn Hadoop into a job-scheduling cluster-management system, rather than just a vehicle for MapReduce. As the company charged with maintaining and pushing forward innovation in the Apache Hadoop Project, Hortonworks is focused on two things: training, and improving open-source Hadoop.

On the other side of the coin, Cloudera is the commercial services and products company for Hadoop. It offers its own distribution of Hadoop, coupled with management tools to make cluster control simpler.

But it is Hortonworks that is thinking most heavily about the next version of Hadoop. Dubbed version 2.0 by some, this next edition will include complete rewrites of many aspects of the system.

Shaun Connolly, vice president of corporate strategy at Hortonworks, said that Hadoop 2.0 will be about updating some of the lagging portions of the project to be more in line with modern needs.

“Today’s MapReduce is a job-processing architecture and a task tracker,” he said. “The Hadoop 2.0 effort really has been focused on separating the notions of the MapReduce framework paradigm from the resource-management capabilities of the platform, and generalizing the platform so MapReduce can just be one type of application framework that can run on the platform.”

That means Hadoop 2.0 is poised to be more like a data center operating system than simply a MapReduce bucket and scheduler. “Other types of applications we can foresee coming would be message-passing interface applications, graphic-processing applications, stream-processing systems, those types of things,” said Connolly. “At the end of the day, they begin to open up Hadoop. We view Hadoop as a data platform, and for the data platform to continue to be relevant, it needs to open itself up to other work cases and to effectively store the data across larger clusters.”
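
The separation Connolly describes can be pictured with a toy model in Python. Nothing here is Hadoop’s actual API: `ResourceManager`, `Framework`, `MapReduceApp` and `StreamApp` are hypothetical names, and “containers” are just integers. The point is only that the manager hands out capacity without knowing what kind of application is asking for it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class Framework(Protocol):
    """Anything that can ask for containers and run work in them."""

    name: str

    def containers_needed(self) -> int: ...

    def run(self, containers: int) -> str: ...


@dataclass
class MapReduceApp:
    """MapReduce as just one framework among several (hypothetical)."""

    name: str = "mapreduce"
    tasks: int = 4

    def containers_needed(self) -> int:
        return self.tasks

    def run(self, containers: int) -> str:
        return f"{self.name}: ran {self.tasks} tasks in {containers} containers"


@dataclass
class StreamApp:
    """A stand-in for a stream-processing framework (hypothetical)."""

    name: str = "stream"

    def containers_needed(self) -> int:
        return 2

    def run(self, containers: int) -> str:
        return f"{self.name}: processing events in {containers} containers"


@dataclass
class ResourceManager:
    """Tracks cluster capacity; knows nothing about MapReduce itself."""

    total_containers: int
    allocated: dict[str, int] = field(default_factory=dict)

    def free(self) -> int:
        return self.total_containers - sum(self.allocated.values())

    def submit(self, app: Framework) -> str:
        # Grant as many containers as requested, up to what is free.
        granted = min(app.containers_needed(), self.free())
        if granted == 0:
            return f"{app.name}: queued, no free containers"
        self.allocated[app.name] = granted
        return app.run(granted)


if __name__ == "__main__":
    manager = ResourceManager(total_containers=5)
    print(manager.submit(MapReduceApp()))
    print(manager.submit(StreamApp()))
```

Because `submit` depends only on the `Framework` interface, adding a new kind of workload in this sketch means writing a new class, not changing the manager.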

HDFS too is receiving an update in Hadoop 2.0, he said, and will become highly available in the next major release. Making the file system highly available will vastly improve the performance of HBase, the in-Hadoop database framework. HBase has been slowly evolving to become fast enough for front-line usage, with the ultimate goal being to allow Hadoop to run all the data for your sites, not just the long-term storage.
Not the only elephant in the room
While Hadoop is clearly where the excitement is for the open-source Big-Data community, it’s not the only option out there. Big-Data solutions come in many shapes and sizes, and the ability to couple them with Hadoop makes them even more relevant.

One of the popular use cases for Hadoop right now is as a scum filter: All data is poured into the unstructured Hadoop data store, and jobs are then written to winnow down that data set to a manageable size—say a few gigabytes.

Those gigabytes of filtered data are then ready to be processed in more-traditional business intelligence and analytics platforms. Many of these platforms have Hadoop connectors of their own as well.
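
The filtering pattern described above is simple enough to sketch in Python. The pipe-delimited record layout, the sample lines and the `winnow` function are all assumptions made for illustration; a real job would run a step like this in parallel across the cluster rather than over a list in memory.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# Hypothetical raw records in the form date|event|user|total.
# The layout is an assumption; real logs will differ.
RAW_LINES = [
    "2012-05-01|checkout|17|42.50",
    "2012-05-01|pageview|17|",
    "corrupted line",
    "2012-05-02|checkout|23|19.99",
]


def winnow(lines: Iterable[str]) -> Iterator[dict[str, object]]:
    """Keep only checkout events and emit them as structured records."""
    for line in lines:
        parts = line.split("|")
        if len(parts) != 4 or parts[1] != "checkout":
            continue  # drop malformed lines and events we do not care about
        yield {
            "date": parts[0],
            "user": int(parts[2]),
            "total": float(parts[3]),
        }


if __name__ == "__main__":
    for record in winnow(RAW_LINES):
        print(record)
```

The output is small and regular enough to load into a conventional analytics tool, which is the hand-off the pattern is meant to enable.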

But the analytics-platform market is in flux due to this major shift in focus away from expensive data warehouses and toward commodity-priced Hadoop hardware.

That’s echoed by George Mathew, president and COO of Alteryx, an analytics platform, which was recently integrated with Hadoop through a new Apache Hive-based connector.

“For our Hadoop driver, we have a rough ODBC connector to Hive,” he said. “We did that with the 7.0 release, and released the connector jointly with MapR. The ability to take pre- or post-MapReduce functions and bring that onboard into a SQL-like expression, and then basically join that against more structured data that might be in a more traditional data warehouse, is very powerful. That’s where we see the current movement, particularly the whole volume side of things.”

MapR, too, has been busy building on top of Hadoop. The MapR distribution of Hadoop includes many next-generation features that have not even made it off the drawing board at Apache, said Jack Norris, vice president of marketing at MapR.

“We looked at Hadoop early on, and basically determined that this would be a very strategic analytic data store for organizations both large and small,” he said. “We made some deep integrations to make it reliable, and to provide the data protection and provide full high availability: all the features and capabilities you expect in other enterprise applications. It’s leaps and bounds ahead of where Apache is, and where it plans to be in the future.”

Running jobs on top of Hadoop is no longer just about Java, the SQL-like Hive or the ETL-like Pig. Thanks to hard work from Revolution Analytics and the community around the open-source language R, that popular analytic programming language can now be used to write MapReduce jobs for a Hadoop cluster.

David Smith, vice president of marketing and communications at Revolution Analytics, said that R is great for sifting through giant piles of data.

“The basic use case that I have seen with this is to think of Hadoop as this massive unstructured data store that can read data from all sorts of formats,” he said.

“R shines at the distillation and refinement process for that data. You can use it to convert that data in Hadoop into something that’s structured that you can then apply statistical analytics to. You do that distillation process with R in Hadoop, and you generate another large, but not as large or unstructured, data set. You can then do regressions and an exploratory analysis.”
Simple on Big
MapR’s Norris said that one of the biggest appeals of Hadoop and other Big-Data systems is that algorithmic analysis of big data sets is wildly simpler than it is on smaller sets of data. While there are some who quibble with this theory, he said that “Simple algorithms on big data outperform complex models. There was an article about this from Google. They used scene completion: They take a picture, and they want to take an element out and fill the background back in. On a corpus of thousands of data examples, it didn’t work, but the same algorithm with a million samples in the corpus worked well.”

But simple is not the only option. There is a bit of a valley between the simple and the complex in Big Data, and that second slope upward begins where predictive analytics and machine learning begin.

In fact, Hadoop is ripe for machine learning. Though Apache Mahout (the Hadoop machine-learning library project) is not as mature as commercial offerings, it has been expanding and evolving apace with the rest of Hadoop.

Jacob Spoelstra, head of R&D at Opera Solutions (a producer of predictive-analytic software), said that predictive analytics and machine learning are extremely powerful for business, but the trick is figuring out how much such capabilities are worth from a development standpoint.

“There are things like the Netflix challenge, where Netflix held a contest to rewrite its recommendation algorithm,” he said. “Sure you have lots of data, and maybe a fairly standard approach to achieve an improvement over what you were doing before, but the question is how much is that further improvement worth to you? Is that change worth a few million dollars?

“To some extent, that is the promise of machine learning, and that is what Opera does. We differentiate from predictive analytics by saying we’re machine learning. The machine responds to changes over time. It’s about using feedback to help you make predictions of whether someone will purchase an item, and incorporating that back to your algorithm and adjusting the parameters of your statistical model so next time you will make a better prediction. It’s continuously improving your performance over time. That will become more important as time moves on.”
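
The feedback loop Spoelstra describes can be illustrated with a very small online learner in Python. This is not Opera Solutions’ method; it is a textbook logistic-regression update, and the class name `OnlinePurchaseModel`, the two made-up features and the learning rate are all assumptions chosen for the sketch.

```python
from __future__ import annotations

import math


class OnlinePurchaseModel:
    """Logistic regression updated one observation at a time (a sketch)."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features: list[float]) -> float:
        """Return the estimated probability of a purchase."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-score))

    def feedback(self, features: list[float], purchased: bool) -> None:
        """Nudge the parameters toward the outcome that was observed."""
        error = (1.0 if purchased else 0.0) - self.predict(features)
        self.bias += self.learning_rate * error
        self.weights = [
            w + self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]


if __name__ == "__main__":
    model = OnlinePurchaseModel(n_features=2)
    # Hypothetical features: [saw_promotion, visited_returns_page].
    for _ in range(20):
        model.feedback([1.0, 0.0], purchased=True)
        model.feedback([0.0, 1.0], purchased=False)
    print(model.predict([1.0, 0.0]), model.predict([0.0, 1.0]))
```

Each call to `feedback` is the “incorporating that back” step from the quote: the model makes a prediction, sees what actually happened and adjusts its parameters before the next customer arrives.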

As machine learning and predictive analytics become more commonplace, they may help to alleviate what is, perhaps, the most difficult problem for Big Data: a lack of talent.

Patrick Taylor, CEO of Oversight Systems (a Big-Data analysis and workflow company), said there are not enough data scientists to go around, and that there may not be for some time. “I saw a really interesting article from McKinsey where they were talking about Big Data, and one of the things they cited was that we’re headed toward a shortage of analysts, data scientists and whatnot,” he said.

“That’s the problem we solve for people. There are a lot more people who could use the valuable insights than there are people who know how to get them. What’s emerging are software platforms like ours that are basically in the business of putting those ‘ah has’ to work every day.”

Taylor likened Big-Data strategy to the “Moneyball” strategy employed by the Oakland Athletics baseball team. “In ‘Moneyball,’ there are two steps. The part everybody remembers is where they were really clever about how they assembled a team on a budget. They looked for new and innovative analysis of what makes a good first baseman. That’s the traditional strategic planning side of what we’ve done with analytics. But they didn’t start winning until they applied those same insights into how they played the game. That’s what you’re talking about here. It’s going to take both of those together, and I have to put the results from analysis to work in my day-in, day-out business,” he said.
Other ways over Big Data
Outside of all this Hadoop fervor, there are plenty of alternative solutions to the Big-Data problem. One of those solutions had its initial public offering in late April: Splunk.

While Splunk began as a tool for analyzing log files, it has matured into a clear window from which Big Data can be observed and put through its paces. And it’s not just about operations using those logs to tweak server configurations, either.

Leena Joshi, director of solutions marketing at Splunk, said that “One of the big reasons why IT ops people put Splunk in place is to get developers off of their production box. They use the role-based access controls to make sure the data from the production boxes is available to the developer. They find it easy to see what’s going on in their own code, and if you log key value pairs, you can get deep analytics out of them.”
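
Joshi’s advice about logging key-value pairs is easy to try. The Python sketch below does not use Splunk or any of its APIs; the helper names (`format_event`, `parse_event`, `mean_latency_by_endpoint`) and the field names are invented here, purely to show why lines written as `key=value` are cheap to analyze later.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Matches key=value tokens where the value contains no whitespace.
PAIR = re.compile(r"(\w+)=(\S+)")


def format_event(**fields: object) -> str:
    """Render one log line as space-separated key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in fields.items())


def parse_event(line: str) -> dict[str, str]:
    """Recover the key-value pairs from a log line."""
    return dict(PAIR.findall(line))


def mean_latency_by_endpoint(lines: list[str]) -> dict[str, float]:
    """Average the hypothetical 'ms' field for each 'endpoint' value."""
    samples: dict[str, list[float]] = defaultdict(list)
    for line in lines:
        event = parse_event(line)
        if "endpoint" in event and "ms" in event:
            samples[event["endpoint"]].append(float(event["ms"]))
    return {ep: sum(values) / len(values) for ep, values in samples.items()}


if __name__ == "__main__":
    log = [
        format_event(endpoint="/cart", ms=120, status=200),
        format_event(endpoint="/cart", ms=80, status=200),
        format_event(endpoint="/login", ms=30, status=500),
    ]
    print(mean_latency_by_endpoint(log))
```

Because every line carries its own field names, no custom parser is needed per message type, which is what makes after-the-fact analysis practical.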

Elsewhere, graph databases are appearing as an alternative to Hadoop and other data stores for Big Data. Neo Technology’s Eifrém said that his company’s flagship graph database offers significant insights and real-time data mining for enterprises, something that can’t be done on a slower Hadoop cluster.

“We see this huge trend toward real-time data,” he said. “One example is retail. There are a bunch of people using graph databases in retail because they want to know buying patterns and predict buying behaviors of customers while they’re in the store. Someone coming in and buying a bunch of beer and a bunch of diapers is a pattern, and if you can figure out the demographics of someone while they’re in the store, that’s when you can really target discounts and rebates and promotions. If you figure this out a day later, that’s kind of interesting, but it’s much more valuable to get that within minutes, while they’re in the store. There’s value in low-latency response time.”

But real time or not, Hadoop remains the big focus for Big Data at the moment. While alternatives exist, Revolution Analytics’ Smith said, “I definitely see Hadoop being the new hotness, if you want to put it that way. It’s groundbreaking technology, as it allows companies to store the data that heretofore they’ve let fall off the table. Now you have the capability of looking at the atomic level of your data to find out interesting things about it.”

Article Tags

Big Data, Hadoop

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.
