Big Data, bigger competition

Published: December 3rd, 2013

- Alex Handy

Funny thing about Big Data: Making it is the easy part. Capturing it is the difficult part. Analyzing it is the ultimate goal. With the right approach and processes in place, however, Big Data can provide a path toward attaching actual metrics to activities and events that otherwise might be immeasurable. And that’s exactly what’s being done at Wargaming.net.

Craig Fryar, head of global business intelligence at Wargaming.net, has been in the games industry for 23 years. He cut his teeth building Spectre, one of the first 3D multiplayer games. Later, he found work at EA BioWare, helping the developers there make sense of the data generated by the massively multiplayer online game Star Wars: The Old Republic.

But when he came on board at Wargaming.net in April 2013, the numbers and data involved got a lot bigger. With more than 70 million registered players in World of Tanks alone, and two other games based on warplanes and battleships also out there, the company generates huge quantities of data every minute.

Just how does Fryar turn 70 million players (with 3 million daily games) into actionable business intelligence that can be translated into subtle changes in gameplay? For him and his team, Hadoop, Oracle, R and Tableau form a large part of that solution.

“The strategy I’ve put into place is a stack that embodies Gartner’s logical data warehouse,” he said. “We’re collecting a great amount of data into the data lake of Hadoop. We’re studying and having the data science team mine that, but we’re also taking components and putting it into our operational data store in Oracle.”

Outside of the Hadoop cluster, other tools help to make sense of all those tank battles. “We’re using visual analytics with Excel and Tableau,” Fryar said. “We also use the Cloudera distribution of Hadoop, and we’re working with Oracle in that regard, with a Big Data appliance, NoSQL, Oracle, R, and the Cloudera distribution. We’re connecting that into Oracle Enterprise Data Edition, and presenting data from that out to analysts.”

But before Wargaming.net could even begin to analyze all of this data, Fryar and his team had to get the most difficult part of this work out of the way first: They had to agree on terms.

“The most challenging part is the data definition layer,” he said. “We have a number of products from tanks to planes to ships for PC, iOS and Xbox, and creating a definitional layer that provides a basis for info on telemetry and other key performance indicators is a big task, and one we started on early.

“We’re just coming to the point of finishing that definitional layer. We started with the Xbox product. We used that to create a standard set of telemetry. What does it mean by ‘active player’? What’s the calculation of ‘lifetime of player’? We’ve been normalizing our meanings. That’s a challenging thing, but we’re almost done.”
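To make the definitional problem concrete, here is a minimal sketch of what pinning down those two terms might look like. The definitions used (a 30-day trailing window for "active player," first-to-last session for "lifetime of player"), the function names and the sample data are all assumptions for illustration; the article does not say how Wargaming.net actually defines them.

```python
from datetime import date, timedelta

# Hypothetical session log: (player_id, day the player logged a battle).
SESSIONS = [
    ("p1", date(2013, 11, 1)),
    ("p1", date(2013, 11, 20)),
    ("p2", date(2013, 10, 1)),
    ("p3", date(2013, 11, 28)),
    ("p3", date(2013, 11, 30)),
]


def active_players(sessions, as_of, window_days=30):
    """Players with at least one session in the trailing window (assumed definition)."""
    cutoff = as_of - timedelta(days=window_days)
    return {pid for pid, day in sessions if cutoff < day <= as_of}


def player_lifetime_days(sessions):
    """Days between each player's first and last session (assumed definition)."""
    first, last = {}, {}
    for pid, day in sessions:
        first[pid] = min(first.get(pid, day), day)
        last[pid] = max(last.get(pid, day), day)
    return {pid: (last[pid] - first[pid]).days for pid in first}
```

Changing the window from 30 days to seven, or counting logins rather than battles, would change every downstream number, which is why a shared definition has to come before any cross-product comparison.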

Wargaming.net uses all this data for one primary purpose: improving the player experience. That means a lot of the data is used to answer balance and design questions. Building a video game is somewhat unlike building enterprise software because the fundamental goal of all games boils down to one very simple, but extremely subjective, concept: fun.

Thus, one of the first tasks Fryar took on was analyzing the game’s new Xbox tutorial. Based on data from numerous players, the tutorial map was adjusted to make the experience less confusing.

“In terms of balancing with Big Data, we use Big Data to balance a number of different things, and our first instance was for World of Tanks Xbox,” he said. “We had a Hadoop cluster for that, and ran beta data through that. We examine things like progression rates, number of players moving from skill tier to skill tier. Progression rates allow us to look at how people are making their way from tier to tier. We can observe the time it takes and the effort it takes to advance from tier to tier, and the designers can intervene.”
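A tier-to-tier progression rate of the kind Fryar describes could be computed along these lines. This is a rough sketch under stated assumptions: the data layout (each player's highest tier reached), the function name and the sample values are hypothetical, and the real analysis ran on a Hadoop cluster rather than in a single script.

```python
from collections import Counter

# Hypothetical data: the highest tier each beta player reached.
HIGHEST_TIER = {"p1": 1, "p2": 2, "p3": 2, "p4": 3, "p5": 1, "p6": 4}


def progression_rates(highest_tier):
    """For each tier t, the share of players who reached t that also reached t + 1."""
    reached = Counter()
    for top in highest_tier.values():
        # A player whose highest tier is `top` passed through every tier up to it.
        for tier in range(1, top + 1):
            reached[tier] += 1
    return {
        tier: reached[tier + 1] / reached[tier]
        for tier in sorted(reached)
        if tier + 1 in reached
    }
```

A sharp drop in the rate at one tier is the kind of signal that would prompt designers to look at how much time and effort that step demands.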

Reporting for duty
Francois Ajenstat, director of product management at Tableau, said that this is the appeal of Big Data analytics: tying numbers and measurements to things that typically aren’t directly tied to a single factor. For World of Tanks, for example, that means translating player fun into a heat map of the battlefield, showing where players died or got stuck. He said that such data can be processed in Tableau without the need for an analytics degree, or a Ph.D. in statistics.

“The user doesn’t have to know anything about SQL to work with Tableau; they literally just point to a database or Hive instance and the data—the metadata—becomes available in Tableau for the user to interact with,” said Ajenstat. “The user can go in and say, ‘I’d like to see my sales’ through a drag-and-drop interface. Every time I drag-and-drop, Tableau will convert that intent into the appropriate language for the database.

“With new technologies like Hadoop, people aren’t thinking in that same way anymore. They focus on ‘How do I keep the data—not based on size—let’s just keep as much as we can because there’s a question in the future I might want to answer.’ I don’t know what that question is yet, but now I can do that more easily because the data is available. That becomes a really important thing, being able to blend that with a data source that’s more structured or formalized. We’re seeing lots of scenarios where in the Hadoop world people were using Tableau to see the data in there. We’re so visual and interactive that now people are using that data to start to answer questions.”

Ajenstat added that having the ability to work with the data in a real-time fashion allows developers and analytics people to quickly refine questions and get feedback on whether or not their query will be fruitful.

“Getting the data is one thing—and lots of tools give you data,” he said. “It’s being able to ask different questions, test different hypotheses, test different paths. You can say, ‘Let’s go down a different path and explore from different angles.’ That’s the differentiator, and people are hungry to be able to do that.”

But Tableau is only one of dozens of analytics and reporting tools that can turn complex Big Data into actionable charts, graphs and reports.

Quinton Alsbury, cofounder of mobile business intelligence app company Roambi, is focused on not only making that Big Data make sense, but also on making it available to salespeople and workers in the field. In order to do that, the company focuses on the small screen for Big Data.

“People are going to be interacting with these things in completely different ways,” he said. “[On mobile], the interface is limited to a three-inch by two-inch-wide screen, so we thought there was a great opportunity to start from scratch.”

To that end, Roambi has spent the last five years building up its mobile business intelligence consumption platform. “Our solution isn’t a warehouse, or a database, or any of those things,” said Alsbury. “It’s purely about the visualization aspect. It connects to everything. Customers can use it against Hive or other business intelligence systems. We have customers using it against IBM DB2 or Oracle or Microsoft Excel.”

On the other side of the mobile coin is Dundas, which focuses on dashboards for everyone. Using the Big Data stores already in your company, Dundas allows specific dashboards, mobile or otherwise, to be built for sales teams, ops personnel, C-level executives, or just about anyone else who needs to measure their progress through data metrics.

Discovering data
Embedding analytics can go even further than a dashboard. Pentaho offers a data-integration platform that doubles as an analytics platform once all that data is integrated. Pentaho 5.0 added a host of new data-analysis features, such as the ability to restart and roll back analytics jobs. With this version, the company also added support for running analytics and integrations from MongoDB, the popular NoSQL store.

But to even begin to make easily digestible charts and graphs, one must first sit down and dig through the data. In your average Hadoop cluster, that’s easier said than done, as most Hadoop jobs take a significant amount of time. While that may change as Hadoop matures, and as its successor, the real-time in-memory-focused Spark gains traction, for now, search tools are extremely useful for finding those data needles in the Hadoop haystack.

Splunk has been searching Big Data since 2003, though for the Splunk crowd, the terms “server logs” and “Big Data” are basically synonymous. Clint Sharp, senior product manager for Big Data at Splunk, said the company has an unstructured approach to Big Data.

“What’s different about Splunk is that we don’t require you to do structuring and analysis of the data in advance. There are no ETL requirements,” he said. “I don’t have to give you the data in a tabular form. Give us the data however it sits, and Splunk will be able to read that data and give you the ability to do charting and analytics. We’re really changing the workflow. We’re allowing them to do the analytics on the data without having to do a whole upfront investment in order to do analytics on top of that.”

Grant Ingersoll, CTO of LucidWorks, offers a similar value proposition with his company’s search tools. LucidWorks offers commercialized versions and support for Apache Lucene and Solr, which form the core components of its search engine, given a Hadoop cluster to work from.

The key difference for LucidWorks is that its tools can be used to form the basis of larger Hadoop applications; instead of sitting outside of Hadoop like a microscope into the data, LucidWorks tools are more like a scaffolding for adding search into applications. But that doesn’t mean they can’t also be used for discovery.

“Search is when you know what you’re looking for,” said Ingersoll. “Discovery is all about what kind of content we should be recommending to you based on prior actions, and who you are, etc. This is all about machine learning and natural-language processing. You’re trying to aid the user in terms of finding what the next best action is. Then the third area is analytics, analytics that help drive understanding of the search and discovery actions. What are they not looking at? What are the data quality issues?”

Then there’s QlikView, which takes data discovery as its mantra, credo and modus operandi. The QlikView Business Discovery platform allows developers and business users to quickly build specific dashboards and reports around their ideas, rather than being locked into formal “official” reports all the time. This allows business users to basically sift through the data and get to the actual underlying business metrics that are shifting.

Having Hadoop
All of these solutions assume one major piece of infrastructure for your Big Data: Hadoop. And with that Big Data platform maturing and expanding daily (through open-source and commercial distributions from Cloudera and MapR), there are a lot of moving pieces in the Hadoop puzzle to link into your existing software development and analytics processes.

Shaun Connolly, vice president of corporate strategy at Hortonworks, said that integrating Hadoop into your software development life cycle isn’t just about adding some processes to your Java developers’ workload.

“As far as continuous integration in the Hadoop space, not all Hadoop applications are necessarily Java and Map/Reduce,” he said. “A larger percentage of the applications are written by higher-level components like Hive, or scripting languages like Pig, where you’re able to do data transformation in a higher-level language than Java.”

To that end, building a process around Hadoop will require serious consideration as to who will be using which aspects of the system. In this way, each team will have to integrate their Hadoop jobs in their own way, according to their usage needs. Fortunately, the multi-tenant additions to Hadoop 2.0 will allow for jobs to share processor time, thus ensuring these multiple teams using Hadoop won’t be fighting over the cluster resources, said Connolly.

“[Hadoop 2.0] changes the nature of the Hadoop platform, so you’ll have a higher-level data processing engine that developers might interact with,” he said. “The job processing is just one aspect of it. If you’re able to plug stream processing into YARN, that’s where the higher-level data processing engine becomes useful. The point there is the engines are able to plug in, and the question that is then asked is who’s the type of user who interacts with that? For HBase, maybe that’s a mobile application, but for Map/Reduce maybe it’s a Pig or Hive user.”

All this newfound flexibility for the cluster doesn’t mean Hadoop is definitely going to live in your data center, however. Raymie Stata, CEO of Hadoop cloud company Altiscale, is betting heavily that hosted Hadoop can add value to the Big Data jobs companies need performed. The value of a hosted solution, he said, is in allowing the developers and users of the Hadoop cluster to focus only on their processes, data intake and jobs, instead of on the nitty-gritty of Hadoop cluster management.

“I think what people tend to focus on is moving the historical data into the cluster, instead of dealing with the ongoing data collection,” said Stata. “Both are issues, but we focus on the ongoing data-collection part. How do you capture—in a reliable way—the data?

“Typically, that data is distributed. Even Amazon Web Services encourages you to put your site in multiple regions. We’ve set up data collection mechanisms. It’s a thousand points of light. There are so many ways to collect data, we try to have enough mechanisms to cover that space. Operationally, collecting data is challenging. If a machine goes down, you’re not collecting from it. You need redundancy to have very reliable data capture.

“Even though I am a big believer that the historical data is important to driving business value, the urgent need is to process the data. What happened last week is more important than what happened a year ago.”

One of the secrets of Altiscale is that it doesn’t just run in Amazon’s cloud, said Stata. “We have custom infrastructure. Core Hadoop infrastructure runs best on the right buses, drives and networking equipment,” he said.

“To get really good performance, you need to run the core Hadoop cluster in an optimized fashion. Around core Hadoop, you have to put all sorts of other things—the Hive metadata server, for example. You actually surround the cluster with tons of additional services. We do use AWS for all those additional services. We’re customers of Amazon Direct Connect, to help people with that data-movement problem.”

But when it comes to hosted cloud solutions, Big Data processing is spurring innovation in the cloud layer. While Altiscale takes its own approach to hosting Hadoop on its own infrastructure, Verizon is hoping its new cloud will compel users to consider it as an alternative to Amazon for its biggest data jobs.

Pivotal Labs, on the other hand, is hoping that developers and analysts will continue to run Hadoop on-premises. Pivotal HD is that company’s Hadoop distribution, and while it offers a number of enterprise features, the biggest and most important one is SQL processing capabilities within the Hadoop cluster.

For many analysts and data scientists, SQL has been the language of choice for getting into all of that data, but Hadoop’s nature as an unstructured data store means that SQL queries and actions like joins just don’t work. To address that, Pivotal created HAWQ, a SQL query engine for Hadoop. HAWQ allows SQL jobs to be run on top of all that data inside of a Hadoop cluster.

World of Tanks loads up on Big Data
While software companies battle for supremacy in the Hadoop cluster, Fryar and the Wargaming.net team are still cranking away at improving their players’ experience. Even when rolling out new features, he and his team use Big Data to predict where the problems will occur, based on the results they get from their beta tests.

As each country has distinct tank advantages and disadvantages, the designers at Wargaming.net need a lot of data and feedback to be sure the thicker armor of a German tank doesn’t make it more powerful than a heavily gunned Russian tank. “We look at the choice selection between nations, and our objective is not to have a perfect balance. But we need to make sure the particular benefits are balanced against other benefits, country to country,” he said.

“We’re about to introduce the new Japanese tank line. We have to make sure the advantages and disadvantages line up appropriately with the existing tanks from China, France, Germany, etc. We can statistically model that to look at any inefficiencies or outliers. Visual analytics show us the high nails almost instantly.”
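One simple way to flag such "high nails" statistically is to compare each nation's win rate with the average across nations. The sketch below uses a z-score cutoff, which is a swapped-in illustration and not a technique the article attributes to Wargaming.net; the win rates are made-up numbers, and the function name and 1.5 threshold are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical beta win rates per nation's tank line; not real game data.
WIN_RATES = {
    "China": 0.49,
    "France": 0.50,
    "Germany": 0.51,
    "Russia": 0.50,
    "Japan": 0.58,
}


def balance_outliers(win_rates, threshold=1.5):
    """Return nations whose win rate sits more than `threshold` standard deviations from the mean."""
    mu = mean(win_rates.values())
    sigma = pstdev(win_rates.values())
    if sigma == 0:
        return {}  # every nation performs identically, so nothing stands out
    return {
        nation: (rate - mu) / sigma
        for nation, rate in win_rates.items()
        if abs(rate - mu) / sigma > threshold
    }
```

With only a handful of nations, a single extreme value inflates the standard deviation, so the cutoff has to be fairly low to catch anything; a production analysis would likely work from per-tank or per-battle data instead.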

And with the right developers looking at that data, the team can quickly hammer those nails down before players become disenchanted and move on to another game.

Article Tags

Altiscale, Big Data, Pivotal Labs, Roambi, Tableau, Wargaming.net

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.


