Funny thing about Big Data: Making it is the easy part. Capturing it is the difficult part. Analyzing it is the ultimate goal. With the right approach and processes in place, however, Big Data can provide a path toward attaching actual metrics to activities and events that otherwise might be immeasurable. And that’s exactly what’s being done at Wargaming.net.
Craig Fryar, head of global business intelligence at Wargaming.net, has been in the games industry for 23 years. He cut his teeth building Spectre, one of the first 3D multiplayer games. Later, he found work at EA Bioware, helping the developers there make sense of the data generated by the massively multiplayer online game, Star Wars: The Old Republic.
But when he came on board at Wargaming.net in April 2013, the numbers and data involved got a lot bigger. With more than 70 million registered players in World of Tanks alone, and two other games based on warplanes and battleships also out there, the company generates huge quantities of data every minute.
Just how does Fryar turn 70 million players (with 3 million daily games) into actionable business intelligence that can be translated into subtle changes in gameplay? For him and his team, Hadoop, Oracle, R and Tableau form a large part of that solution.
“The strategy I’ve put into place is a stack that embodies Gartner’s logical data warehouse,” he said. “We’re collecting a great amount of data into the data lake of Hadoop. We’re studying and having the data science team mine that, but we’re also taking components and putting it into our operational data store in Oracle.”
Outside of the Hadoop cluster, other tools help to make sense of all those tank battles. “We’re using visual analytics with Excel and Tableau,” Fryar said. “We also use the Cloudera distribution of Hadoop, and we’re working with Oracle in that regard, with a Big Data appliance, NoSQL, Oracle, R, and the Cloudera distribution. We’re connecting that into Oracle Enterprise Data Edition, and presenting data from that out to analysts.”
But before Wargaming.net could even begin to analyze all of this data, Fryar and his team had to get the most difficult part of this work out of the way first: They had to agree on terms.
“The most challenging part is the data definition layer,” he said. “We have a number of products from tanks to planes to ships for PC, iOS and Xbox, and creating a definitional layer that provides a basis for info on telemetry and other key performance indicators is a big task, and one we started on early.
“We’re just coming to the point of finishing that definitional layer. We started with the Xbox product. We used that to create a standard set of telemetry. What does it mean by ‘active player’? What’s the calculation of ‘lifetime of player’? We’ve been normalizing our meanings. That’s a challenging thing, but we’re almost done.”
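As a rough illustration of what such a definitional layer has to pin down, the Python sketch below gives one possible meaning to "active player" and "lifetime of player." The 30-day window, the event format and the function names are assumptions made for this example, not Wargaming.net's actual definitions.

```python
from datetime import date, timedelta

# Hypothetical telemetry: (player_id, day on which the player fought a battle).
events = [
    ("p1", date(2013, 11, 1)),
    ("p1", date(2013, 11, 20)),
    ("p2", date(2013, 9, 5)),
    ("p3", date(2013, 11, 28)),
]

def active_players(events, as_of, window_days=30):
    """Players with at least one battle in the window_days ending at as_of.

    The window length is an assumed definition, chosen only for illustration.
    """
    cutoff = as_of - timedelta(days=window_days)
    return {pid for pid, day in events if cutoff < day <= as_of}

def player_lifetime_days(events, player_id):
    """Days between a player's first and last recorded battle (assumed definition)."""
    days = [day for pid, day in events if pid == player_id]
    return (max(days) - min(days)).days if days else 0
```

Changing the window from 30 days to 7, or counting logins instead of battles, would yield different numbers from the same raw data, which is why the terms have to be agreed on before any analysis begins.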
Wargaming.net uses all this data for one primary purpose: improving the player experience. That means a lot of the data is used to answer balance and design questions. Building a video game is somewhat unlike building enterprise software because the fundamental goal of all games boils down to one very simple, but extremely subjective, concept: fun.
Thus, one of the first tasks Fryar took on was analyzing the game’s new Xbox tutorial. After analyzing the data from numerous players, the tutorial map was adjusted and modified to make the experience less confusing.
“In terms of balancing with Big Data, we use Big Data to balance a number of different things, and our first instance was for World of Tanks Xbox,” he said. “We had a Hadoop cluster for that, and ran beta data through that. We examine things like progression rates, number of players moving from skill tier to skill tier. Progression rates allow us to look at how people are making their way from tier to tier. We can observe the time it takes and the effort it takes to advance from tier to tier, and the designers can intervene.”
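A progression-rate report of the kind Fryar describes might be sketched as follows. This is a minimal Python example under stated assumptions: the record layout, the use of battles played as the measure of effort, and the function name are all hypothetical, and a real pipeline would run over the Hadoop cluster rather than an in-memory list.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (player_id, tier reached, battles played when it was reached).
tier_unlocks = [
    ("p1", 1, 0), ("p1", 2, 12), ("p1", 3, 40),
    ("p2", 1, 0), ("p2", 2, 20),
    ("p3", 1, 0), ("p3", 2, 16), ("p3", 3, 70),
]

def progression_report(tier_unlocks):
    """For each tier step, return (players who advanced, median battles the step took)."""
    # Group each player's unlocks as {tier: battles played at unlock}.
    by_player = defaultdict(dict)
    for pid, tier, battles in tier_unlocks:
        by_player[pid][tier] = battles

    # Collect the effort (in battles) of every completed tier-to-tier step.
    effort = defaultdict(list)
    for tiers in by_player.values():
        for tier, battles in tiers.items():
            if tier + 1 in tiers:
                effort[(tier, tier + 1)].append(tiers[tier + 1] - battles)

    return {step: (len(costs), median(costs)) for step, costs in sorted(effort.items())}
```

With the sample data, three players move from tier 1 to tier 2 but only two reach tier 3, and the second step takes noticeably more battles. That is the sort of signal a designer could use to decide whether to intervene.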
Reporting for duty
Francois Ajenstat, director of product management at Tableau, said that this is the appeal of Big Data analytics: tying numbers and measurements to things that typically aren’t directly tied to a single factor. For World of Tanks, for example, that means translating player fun into a heat map of the battlefield, showing where players died or got stuck. He said that such data can be processed in Tableau without the need for an analytics degree, or a Ph.D. in statistics.
“The user doesn’t have to know anything about SQL to work with Tableau; they literally just point to a database or Hive instance and the data—the metadata—becomes available in Tableau for the user to interact with,” said Ajenstat. “The user can go in and say, ‘I’d like to see my sales’ through a drag-and-drop interface. Every time I drag-and-drop, Tableau will convert that intent into the appropriate language for the database.
“With new technologies like Hadoop, people aren’t thinking in that same way anymore. They focus on ‘How do I keep the data—not based on size—let’s just keep as much as we can because there’s a question in the future I might want to answer.’ I don’t know what that question is yet, but now I can do that more easily because the data is available. That becomes a really important thing, being able to blend that with a data source that’s more structured or formalized. We’re seeing lots of scenarios where in the Hadoop world people were using Tableau to see the data in there. We’re so visual and interactive that now people are using that data to start to answer questions.”
Ajenstat added that having the ability to work with the data in a real-time fashion allows developers and analytics people to quickly refine questions and get feedback on whether or not their query will be fruitful.
“Getting the data is one thing—and lots of tools give you data,” he said. “It’s being able to ask different questions, test different hypotheses, test different paths. You can say, ‘Let’s go down a different path and explore from different angles.’ That’s the differentiator, and people are hungry to be able to do that.”
But Tableau is only one of dozens of analytics and reporting tools that can turn complex Big Data into actionable charts, graphs and reports.
Quinton Alsbury, cofounder of mobile business intelligence app company Roambi, is focused not only on making that Big Data make sense, but also on making it available to salespeople and workers in the field. To do that, the company focuses on the small screen for Big Data.
“People are going to be interacting with these things in completely different ways,” he said. “[On mobile], the interface is limited to a three-inch by two-inch-wide screen, so we thought there was a great opportunity to start from scratch.”
To that end, Roambi has spent the last five years building up its mobile business intelligence consumption platform. “Our solution isn’t a warehouse, or a database, or any of those things,” said Alsbury. “It’s purely about the visualization aspect. It connects to everything. Customers can use it against Hive or other business intelligence systems. We have customers using it against IBM DB2 or Oracle or Microsoft Excel.”
On the other side of the mobile coin is Dundas, which focuses on dashboards for everyone. Using the Big Data stores already in your company, Dundas allows specific dashboards, mobile or otherwise, to be built for sales teams, ops personnel, C-level executives, or just about anyone else who needs to measure their progress through data metrics.
Embedding analytics can go even further than a dashboard. Pentaho offers a data-integration platform that doubles as an analytics platform once all that data is integrated. Pentaho 5.0 added a host of new data-analysis features, such as the ability to restart and roll back analytics jobs. With this version, the company also added support for running analytics and integrations against MongoDB, the popular NoSQL store.
But to even begin to make easily digestible charts and graphs, one must first sit down and dig through the data. In your average Hadoop cluster, that’s easier said than done, as most Hadoop jobs take a significant amount of time. While that may change as Hadoop matures, and as its successor, the real-time, in-memory-focused Spark, gains traction, for now, search tools are extremely useful for finding those data needles in the Hadoop haystack.
Splunk has been searching Big Data since 2003, though for the Splunk crowd, the terms “server logs” and “Big Data” are basically synonymous. Clint Sharp, senior product manager for Big Data at Splunk, said the company has an unstructured approach to Big Data.
“What’s different about Splunk is that we don’t require you to do structuring and analysis of the data in advance. There are no ETL requirements,” he said. “I don’t have to give you the data in a tabular form. Give us the data however it sits, and Splunk will be able to read that data and give you the ability to do charting and analytics. We’re really changing the workflow. We’re allowing them to do the analytics on the data without having to do a whole upfront investment in order to do analytics on top of that.”
Grant Ingersoll, CTO of LucidWorks, offers a similar value proposition with his company’s search tools. LucidWorks offers commercialized versions of, and support for, Apache Lucene and Solr, which form the core of its search engine and can work from a Hadoop cluster.
The key difference for LucidWorks is that its tools can be used to form the basis of larger Hadoop applications; instead of sitting outside of Hadoop like a microscope into the data, LucidWorks tools are more like a scaffolding for adding search into applications. But that doesn’t mean they can’t also be used for discovery.
“Search is when you know what you’re looking for,” said Ingersoll. “Discovery is all about what kind of content we should be recommending to you based on prior actions, and who you are, etc. This is all about machine learning and natural-language processing. You’re trying to aid the user in terms of finding what the next best action is. Then the third area is analytics, analytics that help drive understanding of the search and discovery actions. What are they not looking at? What are the data quality issues?”
Then there’s QlikView, which takes data discovery as its mantra, credo and modus operandi. The QlikView Business Discovery platform allows developers and business users to quickly build specific dashboards and reports around their ideas, rather than being locked into formal “official” reports all the time. This allows business users to basically sift through the data and get to the actual underlying business metrics that are shifting.
All of these solutions imply one major piece of infrastructure for your Big Data: Hadoop. And with that Big Data platform maturing and expanding daily (through open-source and commercial distributions from Cloudera and MapR), there are a lot of moving pieces in the Hadoop puzzle to link into your existing software development and analytics processes.
Shaun Connolly, vice president of corporate strategy at Hortonworks, said that integrating Hadoop into your software development life cycle isn’t just about adding some processes to your Java developers’ workload.
“As far as continuous integration in the Hadoop space, not all Hadoop applications are necessarily Java and Map/Reduce,” he said. “A larger percentage of the applications are written by higher-level components like Hive, or scripting languages like Pig, where you’re able to do data transformation in a higher-level language than Java.”
To that end, building a process around Hadoop will require serious consideration as to who will be using which aspects of the system. In this way, each team will have to integrate their Hadoop jobs in their own way, according to their usage needs. Fortunately, the multi-tenant additions to Hadoop 2.0 will allow for jobs to share processor time, thus ensuring these multiple teams using Hadoop won’t be fighting over the cluster resources, said Connolly.
“[Hadoop 2.0] changes the nature of the Hadoop platform, so you’ll have a higher-level data processing engine that developers might interact with,” he said. “The job processing is just one aspect of it. If you’re able to plug stream processing into YARN, that’s where the higher-level data processing engine becomes useful. The point there is the engines are able to plug in, and the question that is then asked is who’s the type of user who interacts with that? For HBase, maybe that’s a mobile application, but for Map/Reduce maybe it’s a Pig or Hive user.”
All this newfound flexibility for the cluster doesn’t mean Hadoop is definitely going to live in your data center, however. Raymie Stata, CEO of Hadoop cloud company Altiscale, is betting heavily that hosted Hadoop can add value to the Big Data jobs companies need performed. The value of a hosted solution, he said, is in allowing the developers and users of the Hadoop cluster to focus only on their processes, data intake and jobs, instead of on the nitty-gritty of Hadoop cluster management.
“I think what people tend to focus on is moving the historical data into the cluster, instead of dealing with the ongoing data collection,” said Stata. “Both are issues, but we focus on the ongoing data-collection part. How do you capture—in a reliable way—the data?
“Typically, that data is distributed. Even Amazon Web Services encourages you to put your site in multiple regions. We’ve set up data collection mechanisms. It’s a thousand points of light. There are so many ways to collect data, we try to have enough mechanisms to cover that space. Operationally, collecting data is challenging. If a machine goes down, you’re not collecting from it. You need redundancy to have very reliable data capture.
“Even though I am a big believer that the historical data is important to driving business value, the urgent need is to process the data. What happened last week is more important than what happened a year ago.”
One of the secrets of Altiscale is that it doesn’t just run in Amazon’s cloud, said Stata. “We have custom infrastructure. Core Hadoop infrastructure runs best on the right buses, drives and networking equipment,” he said.
“To get really good performance, you need to run the core Hadoop cluster in an optimized fashion. Around core Hadoop, you have to put all sorts of other things—the Hive metadata server, for example. You actually surround the cluster with tons of additional services. We do use AWS for all those additional services. We’re customers of Amazon Direct Connect, to help people with that data-movement problem.”
But when it comes to hosted cloud solutions, Big Data processing is spurring innovation in the cloud layer. While Altiscale takes its own approach to hosting Hadoop on its own infrastructure, Verizon is hoping its new cloud will compel users to consider it as an alternative to Amazon for its biggest data jobs.
Pivotal Labs, on the other hand, is hoping that developers and analysts will continue to run Hadoop on-premises. Pivotal HD is that company’s Hadoop distribution, and while it offers a number of enterprise features, the biggest and most important one is SQL processing capabilities within the Hadoop cluster.
For many analysts and data scientists, SQL has been the language of choice for getting into all of that data, but Hadoop’s nature as a non-structured data store means that SQL queries and actions like joins just don’t work out of the box. To fill that gap, Pivotal created HAWQ, a SQL query engine for Hadoop. HAWQ allows SQL jobs to be run on top of all that data inside of a Hadoop cluster.
World of Tanks loads up on Big Data
While software companies battle for supremacy in the Hadoop cluster, Fryar and the Wargaming.net team are still cranking away at improving their players’ experience. Even when rolling out new features, he and his team use Big Data to predict where the problems will occur, based on the results they get from their beta tests.
As each country has distinct tank advantages and disadvantages, the designers at Wargaming.net need a lot of data and feedback to be sure the thicker armor of a German tank doesn’t make it more powerful than a heavily gunned Russian tank. “We look at the choice selection between nations, and our objective is not to have a perfect balance. But we need to make sure the particular benefits are balanced against other benefits, country to country,” he said.
“We’re about to introduce the new Japanese tank line. We have to make sure the advantages and disadvantages line up appropriately with the existing tanks from China, France, Germany, etc. We can statistically model that to look at any inefficiencies or outliers. Visual analytics show us the high nails almost instantly.”
And with the right developers to look at that data, they can quickly hammer them down before players become disenchanted and move on to another game.
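One simple way to surface such "high nails" statistically is a z-score check, sketched below in Python. This is only an illustration: the win rates are made-up numbers, the nation list and the threshold of two standard deviations are assumptions, and nothing here is claimed to be the model Wargaming.net actually uses.

```python
from statistics import mean, pstdev

# Hypothetical win rates per nation's tank line (invented numbers for illustration).
win_rates = {
    "China": 0.49, "France": 0.50, "Germany": 0.51,
    "Russia": 0.50, "USA": 0.49, "Japan": 0.58,
}

def high_nails(rates, threshold=2.0):
    """Return {name: z-score} for entries more than `threshold` standard
    deviations from the mean (an assumed rule of thumb)."""
    values = list(rates.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # every line performs identically; nothing stands out
    return {
        name: round((rate - mu) / sigma, 2)
        for name, rate in rates.items()
        if abs(rate - mu) / sigma > threshold
    }
```

On the sample data, only the hypothetical Japanese line is flagged, which is the cue for designers to take a closer look before release.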