The explosion of data in the world is stressing organizations that need to store, organize, retrieve and learn from all this information. Larger companies have traditionally dealt with this problem by confining data into silos that only deal with data from a single source or of a single type, but that approach often misses the bigger picture and the most important insights.
Big Data provides the potential to do better, and Microsoft is no longer standing back while the technology takes off without it. The company has for some time been assembling components, such as improved business intelligence reporting products, intended to make harnessing Big Data easy and inexpensive enough for anyone, without the pain that has traditionally accompanied wiring a Big Data solution together.
The small-company dilemma
Companies of all sizes wrestle with technical challenges all the time, and each time a disruptive technology enters the scene, it follows predictable stages (provided it survives to maturity). Big Data is well on its way through those stages, to the point where nearly everyone is convinced that failing to leverage it will put their business at a severe disadvantage.
The first critical insight is that having a large database does not qualify as having Big Data. Classic data warehousing can already solve problems related to pulling actionable insights from data of this type. To rise to the level of being a Big Data problem, the data must be large, varied and fast-flowing (the three Vs often cited in the industry: volume, variety and velocity). Most startups want to define their data usage as a Big Data problem so that they can use the buzzword as they search for investors. The problem with this is that it misses the point of what Big Data actually solves. While some data will become Big Data over time as it scales and subsumes more varied data sources, the crossover point varies greatly depending on the business and what data it has.
The good news is that, from Microsoft’s solution perspective, Excel is still the lens for seeing data. Even some of the most recent announcements from Microsoft around tooling for business intelligence and Big Data are Excel-centric. These include GeoFlow and Data Explorer, which will be covered in greater detail later. If Microsoft succeeds in becoming a major source of the tools used for Big Data, then companies of any size will be able to leverage Big Data without diverting a great deal of technical resources from their base business model and without breaking the bank. That, at least, is the idea.
Big Data meets the cloud
Leveraging Big Data is not trivial and is beyond the ability of most organizations that lack good tools. Yet even with great tooling, it takes skill to get the right answers from the seas of data involved. Microsoft typically joins a market late, as a technology matures, and then builds the tools that democratize the disruptive technology by making it easy to use. Enterprise-class databases are a classic example: Microsoft SQL Server was the first enterprise-scale DBMS that could be managed in someone’s spare time, because it was designed to be self-configuring and was truly easy to set up and use.
HDInsight, now coming together as part of Windows Azure, is poised to serve the same role for Big Data. In fact, HDInsight is likely the most important piece of Microsoft’s effort not to be left out of the Big Data market, and it appears to have arrived just in time. Soon, anyone with a Windows Azure subscription will be able to set up a Hadoop cluster that reads data from relatively inexpensive Blob storage. With HDInsight, Blob storage is integrated into the Hadoop Distributed File System (HDFS) that underlies the entire system.
Microsoft explained the relationship between the primary components this way on the HDInsight page: “HDInsight Service clusters are deployed in Azure on compute nodes to execute Map/Reduce tasks, and can be dropped by users once these tasks have been completed. Keeping the data in the HDFS clusters after computations have been completed would be an expensive way to store this data. Windows Azure Blob Storage is a robust, general-purpose Azure storage solution, so storing data in Blob Storage enables the clusters used for computation to be safely deleted without losing user data.”
This statement highlights Microsoft’s view of the main advantages of hosting Hadoop in the cloud on Azure: cheaper storage, and the ability to avoid paying for processing nodes when Map/Reduce tasks are not actually running.
Microsoft has posted a number of detailed tutorials on how to get up and running with the Azure-based HDInsight. These include Getting Started, Using MapReduce, Using Hive, Using Pig and Running Samples. The inclusion of Hive and Pig shows that even advanced components have made it into the Azure offering.
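To give a feel for what the MapReduce tutorial covers, here is a minimal sketch of the classic word-count job written in the Hadoop Streaming style, where the mapper and reducer are plain functions that process one record at a time. The function names and the local simulation of the shuffle/sort step are illustrative, not taken from Microsoft’s tutorial.

```python
import itertools


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word, the way a
    streaming mapper would write them to stdout, one per line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)


def reduce_counts(pairs):
    """Reducer: Hadoop delivers pairs sorted by key, so consecutive
    identical words can be summed with itertools.groupby."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))


def word_count(lines):
    """Run map and reduce locally, with sorted() standing in for the
    shuffle/sort phase Hadoop performs between the two."""
    return dict(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    print(word_count(["big data big tools", "data everywhere"]))
```

On an actual HDInsight cluster, the same mapper and reducer logic would be submitted through the Hadoop Streaming jar, with input read from and output written to the Blob-backed storage described above.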
Microsoft Big Data on premises
The HDInsight feature of Windows Azure is still in preview as of this writing, and access to the feature is not instant, which means that pricing information is not yet available. For those who want to jump right in, the implementation that Microsoft is leveraging behind the scenes is also available to install on your own servers, thanks to Hortonworks. Hortonworks provides open-source Hadoop tooling in the form of its Hortonworks Data Platform (HDP). This solution is compatible with Windows Server 2008 and 2012, and it integrates with SQL Server 2012 and the new version of the Parallel Data Warehouse.
Democratization and self-service
Two recurring themes in Microsoft’s efforts in the Big Data space are democratizing the ability to use Big Data, and enabling users to self-serve through tools like Data Explorer. These are two sides of the same coin: making the tooling less expensive is a big part of democratization, but the other big part is usability.
Andrew Brust, founder and CEO of Blue Badge Insights, Visual Studio Magazine columnist and co-author of the book “Programming Microsoft SQL Server 2012,” has focused on the democratization aspect of Microsoft’s strategy. He said that democratization of Big Data is one of the key pillars of Microsoft’s Big Data strategy. This is borne out by Microsoft’s own messaging. If you visit Microsoft’s Big Data page, one of the main headers you’ll see is “Democratize Big Data.” The other pillars, according to Brust, are the cloud (embodied by the new HDInsight offering of Azure) and in-memory, which means taking advantage of systems with much more RAM for faster processing.
Data Explorer is an Excel add-in that lets you pull data from a wide range of data sources, including HDFS, relational databases (such as SQL Server, MySQL, DB2, Oracle and Access), Web pages, text files and even Facebook. Big Data means dealing with data from a variety of sources and formats, which Data Explorer certainly enables. Data Explorer is a big step on the road toward enabling self-service, where the user can pick and choose.
A common demonstration of the abilities of this tool is to point it at a stat-laden Wikipedia.org page (try “UEFA European Football Championship”) and let it scrape the table-formatted data that it finds in the page. Figure 1 shows this page pulled into Data Explorer, but rather than selecting the Results section of statistics (see the list of tables to draw data from on the left side of the screen), which is how this feature is normally demonstrated, I selected the “Teams reaching the final” section.
Figure 2 shows the source on the Wikipedia.org page. You will notice that while Data Explorer makes it very easy to pull this data in, it does not correctly deal with superscripts on some of the years. For example, 1972, with a superscript denoting that the country was West Germany before reunification, is rendered as 19721. Small issues such as this can skew results and therefore lead to the wrong conclusions. No matter how easy these data tools are to use, there still needs to be a human brain in the mix that can discern good data from bad. That need for human judgment is what makes self-service a key ingredient in getting the most out of the Big Data revolution.
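The superscript problem above is the kind of thing a small cleanup step can catch before the data is analyzed. Here is a hedged sketch in Python: the function name and the cleanup rule (years are exactly four digits, so any trailing digits on a numeric cell are treated as flattened footnote markers) are assumptions made for illustration, not part of Data Explorer itself.

```python
import re


def strip_footnote_markers(cell):
    """When a scraped superscript footnote is flattened into the text,
    a year like 1972 with a footnote arrives as '19721'. Assumption:
    years are exactly four digits, so trailing digits are residue."""
    match = re.fullmatch(r"(\d{4})\d*", cell.strip())
    return match.group(1) if match else cell.strip()


if __name__ == "__main__":
    print(strip_footnote_markers("19721"))        # flattened footnote case
    print(strip_footnote_markers("1980"))         # clean year passes through
    print(strip_footnote_markers("West Germany")) # non-year cells untouched
```

A rule like this is deliberately conservative: it only rewrites cells that already look like years, leaving everything else for a human to review.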
GeoFlow is another Excel add-in that will help ordinary users find insights in cases where location matters. The GeoFlow tool allows for up to a million data points to be plotted on Bing Maps at a time to create 3D charts or even heat maps. Figure 3 shows a demonstration illustrating ticket sales in the Seattle area.
While Data Explorer can be run with Excel 2013 or Excel 2010 with Service Pack 1, GeoFlow requires Office Professional Plus 2013 or Office 365 ProPlus along with the .NET Framework 4.0. The download page for the GeoFlow Preview also provides a number of sample datasets if you do not have any geospatial data handy and want to try out the system for yourself.
One of the really powerful analysis aspects of GeoFlow, beyond how striking the 3D images are all by themselves, is the ability to move the charting through time to see how things change. We have seen this before in Microsoft business intelligence tooling, but it never gets old, because this is often how you find the insights that staring at spreadsheets of numbers will never convey. This is the key to Big Data usability and the way Microsoft is embracing and extending it. The battle will be waged on the client rather than on the server, as is evident from the fact that Microsoft has chosen not to rewrite its own version of Hadoop, but has instead focused on the client-analysis side of the system.
Big Data is coming to more and more organizations whether they are ready for it or not. The pace of data growth is staggering and accelerating year over year. Information that has historically been thrown away will be gathered, and organizations without a Big Data culture will look for the lowest-cost alternative that allows mere mortals to find the answers that will bring about the returns.
I expect Microsoft to continue to innovate in the SQL Server business intelligence tools at a steady pace with much more focus on leveraging the many, many gigabytes of RAM available (half a terabyte of RAM on a SQL Server is not uncommon in some circles these days). The real place to watch is the analysis-tool side. GeoFlow and Data Explorer may be just the beginning as Microsoft seeks to ensure that there is no excuse not to use HDInsight on Azure.