The explosion of data in the world is stressing organizations that need to store, organize, retrieve and learn from all this information. Larger companies have traditionally dealt with this problem by confining data into silos that only deal with data from a single source or of a single type, but that approach often misses the bigger picture and the most important insights.
Big Data provides the potential to do better, and Microsoft is no longer standing back while the technology takes off without it. The company has for some time been assembling components, such as improved business intelligence reporting products, intended to make harnessing Big Data easy and inexpensive enough for anyone, without the pain that has traditionally accompanied wiring a Big Data solution together.
The small-company dilemma
Companies of all sizes wrestle with technical challenges all the time, and each time a disruptive technology enters the scene, it follows predictable stages (provided it survives to maturity). Big Data is well on its way through those stages, to the point where nearly everyone is convinced that failing to leverage it will put their business at a severe disadvantage.
The first critical insight is that having a large database does not qualify as having Big Data. Classic data warehousing can already solve problems related to pulling actionable insights from data of this type. To rise to the level of being a Big Data problem, the data must be large, varied and fast-flowing (the three Vs often cited in the industry: volume, variety and velocity). Most startups want to define their data usage as a Big Data problem so that they can use the buzzword as they search for investors. The problem with this is that it misses the point of what Big Data actually solves. While some data will become Big Data over time as it scales and subsumes more varied data sources, the crossover point varies greatly depending on the business and what data it has.
The good news is that, from Microsoft’s solution perspective, Excel is still the lens for seeing data. Even some of the most recent announcements from Microsoft around tooling for business intelligence and Big Data are Excel-centric. These include GeoFlow and Data Explorer, which will be covered in greater detail later. If Microsoft succeeds in becoming a major source of the tools used for Big Data, then companies of any size will be able to leverage Big Data without diverting a great deal of technical resources from their base business model and without breaking the bank. That, at least, is the idea.
Big Data meets the cloud
Leveraging Big Data is not trivial and is beyond the ability of most organizations that lack good tools. Yet even with great tooling, it takes skill to get the right answers from the seas of data involved. Microsoft typically joins a market late, as a technology matures, and then builds the tools that democratize the disruptive technology by making it easy to use. Enterprise-class databases are a classic example: Microsoft SQL Server was the first enterprise-scale DBMS that could be managed in someone’s spare time, because it was designed to be self-configuring and was truly easy to set up and use.
HDInsight, now coming together as part of Windows Azure, is poised to serve the same role for Big Data. In fact, HDInsight is likely the most important piece of Microsoft’s effort not to be left out of the Big Data market, and it appears to have arrived just in time. Soon, anyone with a Windows Azure subscription will be able to set up a Hadoop cluster that reads data from relatively inexpensive Blob storage. With HDInsight, Blob storage is integrated into the Hadoop Distributed File System (HDFS) that underlies the entire system.
Microsoft explained the relationship between the primary components this way on the HDInsight page: “HDInsight Service clusters are deployed in Azure on compute nodes to execute Map/Reduce tasks, and can be dropped by users once these tasks have been completed. Keeping the data in the HDFS clusters after computations have been completed would be an expensive way to store this data. Windows Azure Blob Storage is a robust, general-purpose Azure storage solution, so storing data in Blob Storage enables the clusters used for computation to be safely deleted without losing user data.”
This statement highlights Microsoft’s view of the main advantages of hosting Hadoop in the cloud on Azure: cheaper storage, and the ability to avoid paying for processing nodes when Map/Reduce tasks are not actually running.
Microsoft has posted a number of detailed tutorials on how to get up and running with the Azure-based HDInsight. These include Getting Started, Using MapReduce, Using Hive, Using Pig and Running Samples. The inclusion of Hive and Pig shows that even advanced components have made it into the Azure offering.
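To give a feel for what the MapReduce tutorial covers, here is a minimal sketch of the classic word-count job written in the Hadoop Streaming style, where the mapper and reducer are plain functions that process one record at a time. The function names and the local simulation of the shuffle/sort step are illustrative, not taken from Microsoft’s tutorial.

```python
import itertools


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word, the way a
    streaming mapper would write them to stdout, one per line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)


def reduce_counts(pairs):
    """Reducer: Hadoop delivers pairs sorted by key, so consecutive
    identical words can be summed with itertools.groupby."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map and reduce locally, with sorted() standing in for the
    shuffle/sort phase Hadoop performs between the two."""
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    print(word_count(["big data big tools", "data everywhere"]))
```

On an actual HDInsight cluster, the same mapper and reducer logic would be submitted through the Hadoop Streaming jar, with input read from and output written to the Blob-backed storage described above.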
Microsoft Big Data on premises
The HDInsight feature of Windows Azure is still in preview as of this writing, and access to the feature is not instant, which means that pricing information is not yet available. For those who want to jump right in, the implementation that Microsoft is leveraging behind the scenes is also available to install on your own servers, thanks to Hortonworks. Hortonworks provides open-source Hadoop tooling in the form of its Hortonworks Data Platform (HDP). This solution is compatible with Windows Server 2008 and 2012, and it integrates with SQL Server 2012 and the new version of the Parallel Data Warehouse.
Democratization and self-service
Two recurring themes in Microsoft’s efforts in the Big Data space are democratizing the ability to use Big Data, and enabling users to self-serve through tools like Data Explorer. These are two sides of the same coin: making the tooling less expensive is a big part of democratization, but the other big part is usability.
Andrew Brust, founder and CEO of Blue Badge Insights, Visual Studio Magazine columnist and co-author of the book “Programming Microsoft SQL Server 2012,” has focused on the democratization aspect of Microsoft’s strategy. He said that democratization of Big Data is one of the key pillars of Microsoft’s Big Data strategy. This is borne out by Microsoft’s own messaging. If you visit Microsoft’s Big Data page, one of the main headers you’ll see is “Democratize Big Data.” The other pillars, according to Brust, are the cloud (embodied by the new HDInsight offering of Azure) and in-memory, which means taking advantage of systems with much more RAM for faster processing.
Data Explorer is an Excel add-in that lets you pull data from a wide range of data sources, including HDFS, relational databases (such as SQL Server, MySQL, DB2, Oracle and Access), Web pages, text files and even Facebook. Big Data means dealing with data from a variety of sources and formats, which Data Explorer certainly enables. Data Explorer is a big step on the road toward enabling self-service, where the user can pick and choose.
A common demonstration of the abilities of this tool is to point it at a stat-laden Wikipedia.org page (try “UEFA European Football Championship”) and let it scrape the table-formatted data that it finds in the page. Figure 1 shows this page pulled into Data Explorer, but rather than selecting the Results section of statistics (see the list of tables to draw data from on the left side of the screen), which is how this feature is normally demonstrated, I selected the “Teams reaching the final” section.
Figure 2 shows the source on the Wikipedia.org page. You will notice that while Data Explorer makes it very easy to pull this data in, it does not correctly deal with superscripts on some of the years. For example, 1972, with a superscript denoting that the country was West Germany before reunification, is rendered as 19721. Small issues such as this can skew results and therefore lead to the wrong conclusions. No matter how easy these data tools are to use, there still needs to be a human brain in the mix that can discern good data from bad. That need for human judgment is what makes self-service a key ingredient in getting the most out of the Big Data revolution.
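The superscript problem above is the kind of thing a small cleanup step can catch before the data is analyzed. Here is a hedged sketch in Python: the function name and the cleanup rule (years are exactly four digits, so any trailing digits on a numeric cell are treated as flattened footnote markers) are assumptions made for illustration, not part of Data Explorer itself.

```python
import re


def strip_footnote_markers(cell):
    """When a scraped superscript footnote is flattened into the text,
    a year like 1972 with a footnote arrives as '19721'. Assumption:
    years are exactly four digits, so trailing digits are residue."""
    match = re.fullmatch(r"(\d{4})\d*", cell.strip())
    return match.group(1) if match else cell.strip()


if __name__ == "__main__":
    print(strip_footnote_markers("19721"))        # flattened footnote case
    print(strip_footnote_markers("1980"))         # clean year passes through
    print(strip_footnote_markers("West Germany")) # non-year cells untouched
```

A rule like this is deliberately conservative: it only rewrites cells that already look like years, leaving everything else for a human to review.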
GeoFlow is another Excel add-in that will help ordinary users find insights in cases where location matters. The GeoFlow tool allows for up to a million data points to be plotted on Bing Maps at a time to create 3D charts or even heat maps. Figure 3 shows a demonstration illustrating ticket sales in the Seattle area.
While Data Explorer can be run with Excel 2013 or Excel 2010 with Service Pack 1, GeoFlow requires Office Professional Plus 2013 or Office 365 ProPlus along with the .NET Framework 4.0. The download page for the GeoFlow Preview also provides a number of sample datasets if you do not have any geospatial data handy and want to try out the system for yourself.
One of the really powerful analysis aspects of GeoFlow, beyond how striking the 3D images are all by themselves, is the ability to move the charting through time to see how things change. We have seen this before in Microsoft business intelligence tooling, but it never gets old, because this is often how you find the insights that staring at spreadsheets of numbers will never convey. This is the key to Big Data usability and the way Microsoft is embracing and extending it. The battle will be waged on the client rather than on the server, as is evident from the fact that Microsoft has chosen not to rewrite its own version of Hadoop, but has instead focused on the client-analysis side of the system.
Big Data is coming to more and more organizations whether they are ready for it or not. The pace of data growth is staggering and accelerating year over year. Information that has historically been thrown away will be gathered, and organizations without a Big Data culture will look for the lowest-cost alternative that allows mere mortals to find the answers that will bring about the returns.
I expect Microsoft to continue to innovate in the SQL Server business intelligence tools at a steady pace with much more focus on leveraging the many, many gigabytes of RAM available (half a terabyte of RAM on a SQL Server is not uncommon in some circles these days). The real place to watch is the analysis-tool side. GeoFlow and Data Explorer may be just the beginning as Microsoft seeks to ensure that there is no excuse not to use HDInsight on Azure.