We’ve struck Big Data! Now what do we ask it?

Published: June 23rd, 2014

It’s been nine years since former Yahoo search technologist Doug Cutting and computer scientist Mike Cafarella started Apache Hadoop, based on Google’s novel scaling of the 1970s-era Map/Reduce algorithm for data searches on cheap server clusters. That’s about halfway toward the typical 20-year maturity cycle, according to Gartner Big Data analyst Merv Adrian.

Recently he told a SiliconANGLE reporter at the Hadoop Summit that “irrational Hadooperance” is subsiding, and “we are moving from a blunt instrument for processing to a nuanced stack.” But if, as he recommended, it was time to join this ecosystem, it didn’t translate to making a major investment: By his estimates, fewer than 800 customers are currently paying for Hadoop services or distributions.

As Hadoop captures the fancy of futurists, fabulators, inventors and investors alike, the scramble has begun both to capitalize on a new data ecosystem and re-inject relevance into existing enterprise business intelligence solutions.

The age of data deserves better BI
The IT industry has progressed from the iron age (mainframes and minicomputers) to the software age (operating systems, databases and applications) to the age of data, according to Mike Hoskins, CTO of Actian. Google, Apple and Amazon epitomize this content-delivery era, though the search-engine giant is the only one built primarily on data analytics.

If, as Tim O’Reilly posits, data is the new oil, the question isn’t where to find it, but what to ask it. Plenty of Hadoop-era startups like Pentaho, Splunk, TIBCO Jaspersoft and more are angling to answer.

San Francisco’s Alpine Data Labs, founded in 2011, is a perfect example of the new breed of Big Data BI. Its flagship product, Chorus, simplifies the process of building predictive models for Big Data, which has become so cheap to store.

“When you store more data, you will ask more questions, but with this power comes great responsibility,” to paraphrase the comic and movie hero Spider-Man. Bruno Aziza, Alpine Data Labs’ chief marketing officer, put it this way: “You have the risk of a lot more noise. That’s where the BI approach to Big Data breaks. You can’t just be reporting; you must be more descriptive. We’re about to get into the third stage of this industry, where algorithms are going to take control. The companies that can run algorithms at scale are the ones that are going to grow faster.”

Contrast Alpine with Actian. If Alpine is a typical startup, Actian is a classic database survivor, built on a long line of companies dating back to Relational Technology, Inc., a former Oracle competitor founded in 1980, later renamed Ingres Corp. and rebranded Actian in 2011. As a testament to its adaptability, a series of acquisitions in the last few years have reoriented the BI company toward Big Data processing, most notably with its 2010 purchase of VectorWise, founded by large-scale analytical data-management professor and researcher Peter Boncz.

According to Actian, the Hadoop ecosystem is playing catch-up to cutting-edge analytics technology so that it can reach business analysts.

“People have already proven that Hadoop is affordable and scales. They never kept that data before, and now they’re throwing everything into Hadoop and doing discovery analytics to find things that weren’t recognizable in human terms. But about 60% of those Big Data projects are without a business case. They’re still in the lab and haven’t made it into production,” said John Santaferraro, vice president of marketing for Actian.

If only it were easier to get SQL access to data that lives in Hadoop, observed Santaferraro. “Every business has a conduit to data that the business analyst uses: SQL. Now we’ve flung that door wide open with high-performance SQL access to data right in Hadoop with our Vector database product,” he said.

A number of open-source projects have aimed to support fast BI on top of Hadoop, but Actian claims to have beaten these. “Impala and Hive: Both projects refer back to research by Peter Boncz in the 2000s, who built out a 100x engine with vector processing,” said Santaferraro. “But they’re starting from scratch. We started building in the mid-2000s, released in 2010 and have up to 30x better performance than Impala has, along with ACID compliance, security, things that anybody using a database and doing operational BI need. But imitation is the sincerest form of flattery.”

Intelligent vs. smart
Though “Big Data” has possibly exceeded “cloud” as an over-hyped technology catchall in the last two years, the data tide has not lifted all BI boats. “Business intelligence” took off as a term in the late 1990s, evolving from decision support systems of the 1980s. Some experts now say BI is bifurcating again, thanks to Big Data, and the winners will be in “data discovery,” not BI.

“Visual data discovery took center stage in 2013, with specialty vendors (Tableau, QlikView, TIBCO Spotfire) growing rapidly, and mega vendors (SAP, IBM, Microsoft, Oracle, MicroStrategy, SAS) stagnating,” blogged Cindi Howson, founder of the BI analyst firm BI Scorecard.

“BI heavyweights have taken notice that more agile and visual solutions have eaten into their bread-and-butter, query-and-reporting market share. All have responded with new interfaces and solutions.” But it’s hard to balance between what she termed “agile yet trusted data—user-driven but IT-controlled.”

Despite the buzz around discovery and personalization, traditional BI tools haven’t seen an uptick in adoption, according to Howson’s survey of 513 respondents in 2013. Her firm measured adoption as a percentage of total employees at 22%, down several percentage points since the survey was first conducted in 2006. That dovetails with findings in the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms report of February 2014. The study claimed that large-scale, IT-centric systems-of-record reporting with online analytical processing and ad hoc query that traditionally comprised BI never caught on. Rather, a five-year trend shows legacy BI platforms are being complemented or replaced by “business-user-driven data discovery techniques.”

From IT monitoring to BI
There’s another market movement afoot, thanks to the Big Data hype. Increasingly, application-monitoring tools are putting their foot in the door left ajar by BI and Hadoop. Splunk, New Relic and AppDynamics are among application intelligence tools that are encroaching on BI. The claim? That software-driven businesses need data about the health of their products, not just the happiness of their customers. These tools can convert obscure IT scenarios into clear business conversations.

Take the example of a recent AppDynamics customer situation: a large movie ticket vendor using application performance management (APM). “Our APM detected an outage using real-time business collectors and began computing revenue lost per minute,” said Adam Leftik, director of product management for AppDynamics.

“IT was no longer talking the language of ‘There’s a JVM outage.’ It was, ‘Ticket sales have dropped from US$5 million per minute to zero.’ We’re changing the way that people talk about the system in general, no longer dealing with key performance indicators that are tech-related, like load in terms of calls per minute, or average response time, but in business terms. It’s all about revenue and business impact that owners understand.”
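
The arithmetic behind a "revenue lost per minute" figure can be sketched in a few lines of Python. This is only an illustration, not AppDynamics' implementation: the function name, the baseline and the per-minute samples are all hypothetical.

```python
def revenue_lost_per_minute(baseline_per_minute, observed_per_minute):
    """Return the shortfall against the baseline for each observed minute.

    Minutes that meet or beat the baseline count as zero loss.
    """
    return [max(baseline_per_minute - observed, 0.0)
            for observed in observed_per_minute]


# Hypothetical figures: a steady baseline, then sales falling to zero.
baseline = 1000.0
observed = [1000.0, 400.0, 0.0, 0.0]

lost = revenue_lost_per_minute(baseline, observed)
total_lost = sum(lost)
print(lost, total_lost)
```

The point of the sketch is the change of units: the same outage that a technical dashboard reports as failed calls is reported here as money not earned, minute by minute.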

Collecting so-called “machine data” has broader utility in the age of Hadoop. Tapan Bhatt is vice president of retail systems at Splunk, an app-monitoring company founded in 2004 whose event and machine data analysis has been expanded with a new product, Hunk, which works to dice and visualize batch data stored in Hadoop.

“Machine data, to us, is broad,” he said. “It might be coming from app servers, mobile devices, RFID chips, sensors on buildings or cars. It’s not human-generated data entered in a CRM system. The founders of Splunk came from running data centers, and the original use was IT ops.”

There are six main uses for Splunk, said Bhatt. “First, app management; second, IT operations; third, security and compliance. After that come digital intelligence, business analytics and Internet of Things. The first three are where we started. The next three are emerging.”

What will Big Blue do? Watch Watson
No other modern software company has the database history that IBM does. Starting with the Tabulating Machine Company, founded in the 1880s—one of the three separate companies that merged to become the future IBM in 1911—Big Blue has been at the forefront of data analytics. Looking at its installed base of mainframe customers and enterprise database users, it would be easy to discount IBM’s ability to turn on a dime. Enter Watson.

Though these are early days for the next-generation BI tool, who can resist the promise described on its website: “Watson is a cognitive technology that processes information more like a human than a computer—by understanding natural language, generating hypotheses based on evidence and learning as it goes.” In early 2014, IBM unveiled the Watson Business Group, comprising 2,000 employees and a $1 billion investment, to hail what Michael Rhodin, senior vice president of the IBM Watson Group, called the third era of computing: cognitive computing.

The May 2014 acquisition of Cognea, an artificial intelligence startup, brings on virtual assistants sure to complement Watson, IBM’s “Jeopardy” question-answering supercomputer. Though all this seems far afield from BI, a recent Gartner report ties Watson Analytics to the “smart data discovery” trend, noting the aging IBM Cognos BI platform.

Ironically, as Big Blue boldly goes where no BI has gone before, Hadoop sellers Cloudera and Hortonworks are busy becoming staid. Cloudera has brought on data warehousing guru Ralph Kimball to help architect something useful around overflowing, underperforming Hadoop “data lakes.” And Microsoft aims to bring Big Data to a billion users in a partnership with Hortonworks.

With Excel, Microsoft already has the world’s most popular—albeit puny—data analysis tool. But the company’s Azure-and-mobile play is innovating BI access for all, while it takes advantage of the Big Data hype to remind corporations of the speed and power of proprietary in-memory OLTP for SQL Server 2014.

Finally, for starry-eyed types, Microsoft CEO Satya Nadella did not disappoint when he announced the limited public preview of Azure Intelligent Systems Service in April. The Internet of Things is a popular aspiration for Big Data futurists, though many pragmatists note that much more basic information (such as “How many orders did we fill yesterday?”) is still beyond the grasp of many companies.

Mining the hype
To a degree, the hyping of Big Data is itself a symptom of data generated by click-baiting headlines that would be rendered powerless with sentiment-based algorithms. A case in point: The canny, possibly apocryphal, description fed to New York Times reporter Charles Duhigg by an unnamed Target employee of the store visit by a father irked by the ultimately correct direct mail his pregnant daughter received. That 2012 article, which helped Duhigg’s book “The Power of Habit” become a best-seller, is now part of the Big Data gospel. But even if the predictive analytics that buying patterns reveal are ever more useful, no algorithm comes without false positives.

“We should not buy the idea that Target employs mind-readers before considering how many misses attend each hit,” writes Tim Harford in FT Magazine. (Interestingly, in its 2013 white paper on Big Data journalists, LexisNexis reveals that the Financial Times had published 49 articles on Big Data in the previous two years, making it the leading U.K. outlet covering the topic. Too bad the analysis didn’t delve deeper into, say, sentiments expressed, tools covered or sources cited.)
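
Harford's question about misses per hit comes down to a base-rate calculation. The sketch below uses entirely hypothetical numbers (none come from Target, Duhigg or Harford) to show how a model that looks accurate can still be wrong most of the times it flags someone.

```python
def precision(base_rate, sensitivity, false_positive_rate):
    """Share of flagged customers who truly have the condition (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical assumptions: 2% of shoppers are pregnant, the model catches
# 80% of them, and it wrongly flags 5% of everyone else.
p = precision(base_rate=0.02, sensitivity=0.80, false_positive_rate=0.05)
misses_per_hit = (1.0 - p) / p
print(round(p, 3), round(misses_per_hit, 2))
```

Under these assumed rates, roughly three shoppers are flagged wrongly for every one flagged correctly, which is the kind of ratio an anecdote about a single hit leaves out.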

The 2012 U.S. presidential election was another moment where (Obama for America CTO Harper Reed’s current protestations notwithstanding) data played a starring role in the victory. Witness self-taught statistician Nate Silver’s subsequent best-seller “The Signal and the Noise,” which played up the role of Bayesian inference in making predictions such as his successful bet on Obama’s win. Suddenly, stats were exciting again, and fevered debates over Fisher vs. Bayes, among other flavors of statistical design, took off.

Damn lies, and other statistics
In the O’Reilly book Big Data Now: 2012 Edition, UC San Diego cognitive science professor Bradley Voytek describes a 2008 scientific paper that found that neuroscience papers with brain scan images were perceived as more credible than those without.

“This should cause any data scientist serious concern,” he writes. “In fact, I’ve formulated three laws of statistical analyses:
1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.”

At the O’Reilly Strata 2014 conference, author David McRaney told a compelling story about how the Department of War Math in World War II, populated by women statisticians, calculated survival rates of bombers. An insight derived from their work was survivorship bias: by studying only the winners, we blind ourselves to the factors that contributed to failure, which are found, more importantly, among the losers.
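
A toy example makes the bias concrete. The sortie records below are invented for illustration, and the idea of tallying damage by aircraft section is an assumption about how such an analysis might be framed, not a detail from McRaney's talk.

```python
from collections import Counter

# Hypothetical sorties: (section that was hit, whether the bomber returned).
sorties = [
    ("wing", True), ("wing", True), ("wing", True), ("wing", False),
    ("tail", True), ("tail", True), ("tail", False),
    ("engine", True), ("engine", False), ("engine", False), ("engine", False),
]

# What an analyst sees when only the returning planes can be inspected.
hits_on_survivors = Counter(section for section, returned in sorties if returned)

# What actually happened, counting the planes that never came back.
hits_overall = Counter(section for section, _ in sorties)
loss_rate = {
    section: 1 - hits_on_survivors[section] / total
    for section, total in hits_overall.items()
}

print(hits_on_survivors.most_common())
print(loss_rate)
```

In this made-up data the survivors show the most damage on the wings, yet the section with the highest loss rate is the engine, precisely because planes hit there rarely came back to be counted.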

Who are the new data scientists?
Contrary to popular belief, the term “data scientist” doesn’t mean “hipster”—though some who hold the title come awfully close, like UC San Diego cognitive science professor Bradley Voytek, who is both an academic and a data evangelist for the Uber ridesharing startup. The term is quite a bit older than that, in fact.

“‘Data scientist’ dates back to 1974, when computer scientist Peter Naur used it,” said Carla Gentry, founder of Analytical Solution in Louisville, Ky. She has more opinions about the topic than a BI vendor can shake a stick at, and has shared them widely on blogs and at conferences.

“Data science is more than data mining. It’s not as easy as you think it is,” she said. “I wrote a blog called How to Hire a Real Data Scientist. Big Data means data science, but data science doesn’t necessarily mean Big Data.”

Gentry added that the term took off in popularity around 2011.

“If you add visual and unstructured text, then of course there’s more. Yes, you have been tracked. But it’s really nothing new. We have a little more data, and more ways to look at it, like sentiment analysis,” she said.

“I keep saying the sexy job in the next 10 years will be statisticians,” said Google chief economist Hal Varian in 2009. “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.”

He was right. In 2014, Gartner analyst Mark Raskino blogged that Chief Data Officer “is one of those titles that has been rising in visibility over the last two or three years.” To be precise, he found more than 100 chief data officers in large companies, “more than double the number we counted in 2012.”

An interesting aspect of data science is that it is noticeably more female than computer science. Raskino reported that more than 25% of those holding the exact “chief data officer” title are women, while only 13% of CIOs are women. Women are also more prevalent as chief data scientists and BI consultants. Could a preference for statistics and math be one reason for relatively more women? A 2011 National Center for Education Statistics survey found that while women earned less than 20% of all computer science bachelor’s degrees, and just under 30% of master’s degrees, they make up more than 40% of undergraduate and master’s degree earners in mathematics and statistics.

Is questioning a human condition?
Perhaps the bigger question many have is not whether more women or men are drawn to statistics, but whether humans or machines will be the ones asking the questions. (Even Stephen Hawking recently expressed concern that allowing AI to develop unabated posed grave risks to the future of humanity.) For now, anyway, it seems unlikely that even pre-sentient supercomputers like Watson will replace humans. After all, it’s our insatiable appetite for entertainment that drives this era of content on computers, right? Or is it our attention span?

“I prefer a different definition of Big Data, and that is that Big Data is data made useful,” said data scientist Hilary Mason to the crowd at Mort Data’s first New York City data science meetup in February 2013. “In the case of deploying specialized infrastructure to count things, like having a Hadoop cluster, it means we can ask a question at the data and get the answer back before we forgot why we asked the question in the first place. It’s a human problem. Big Data is a technical solution to a human problem, and that is our inability to pay attention to things for very long. We’re really just little data monkeys.”

Dark Data joins the fray
Despite its ominous name, coined a few years ago and now widespread across the Web, dark data isn’t something most people lie awake worrying about. That’s just the problem, according to Boulder, Colo.-based Parascript. Whether they are insurance adjusters dealing with tornado relief, social workers visiting homes, farmers assessing crops, or oil and gas experts in the field, many professionals submit handwritten notes to corporations, which in turn fail to do anything further with the information.

“The category of dark data is very broad. Gartner says it’s data you collect, process and store, but fail to use for other purposes,” said Don Dew, director of marketing for Parascript, who said that call data and images can also be included in this category. His company’s longstanding handwriting-recognition solution, already employed by the US Postal Service, FiServ, IBM and others, could be part of a solution designed to convert the meanings of dark data into something businesses can act on.

A recent Web survey commissioned by Parascript of 385 members of the Association for Information and Image Management (AIIM) found that, of the 73% of survey respondents who scan forms, only half do text recognition. Twenty-five percent only scan to archive, and the remaining 25% workflow scan images but manually re-key the data. But the use of captured text for archive indexing has grown from 64% to 87% (of those who do use text recognition) since 2012.

“Dark data is starting to get a lot of mileage,” said Bob Larrivee, director of custom research at AIIM. “The idea is that all this information is sitting out there, but nobody accesses it and gains the maximum benefit from it.” Possible applications include fraud detection, large analog conversions, litigation discovery, signature detection or comparison, and sentiment analysis of handwritten annotations to documents.

“This study reinforced that we’ve used handwriting recognition on well-defined applications like check and mail processing, but it starts to highlight the potential for this,” said Dew. “As the report points out, actual adoption has been pretty low.”

Parascript’s technology, like the data it wants to discover, has been made faster and more accurate thanks to neural networks and speedy processors.

Ultimately, the promise of Big Data discovery would be to associate dark data with tags that help point it toward the people who can use it, said Larrivee. “The realm of knowledge management has been around for a long time,” he said.

“In its purest sense, it means you wouldn’t have to look for it; it would find its way to you based on profiles. When something enters the enterprise, we extract that dark data, and it becomes one of those elements that says ‘I recognized what this is, and there’s a couple of profiles that want this information.’ ”

Article Tags

BI, Big Data, business intelligence

About Alexandra Weber Morales

Alexandra Weber Morales is a freelance writer (and singer and songwriter).
