The Internet disrupts every industry, and business intelligence (BI) is no exception. As modern Web applications voraciously consume data, they blast past the bounds of traditional relational databases. Scaling out mushrooming data sets with replication, caching, tuning, hardware and sharding solutions is one way to go, but more than ever, flexible NoSQL data stores like MongoDB are a preferred alternative for developers who don’t want to be locked into rigid schemas for amorphous data.
What happens when downstream data analysts try to normalize and interpret what’s been vacuumed into MongoDB, however? That’s where Sumit Sarkar, Chief Data Evangelist at Progress DataDirect in Raleigh, N.C., takes a pragmatic approach to BI. Step one, he says, is to establish a logical normalized schema to bridge BI tools that expect SQL with NoSQL databases. Step two? Make analytics easy with standards-based connectivity so that you can make actionable decisions based on Big Data. We spoke with him recently to get his perspective on MongoDB and SQL.
SD Times: What are some of the challenges people have with MongoDB?
Sarkar: A lot of folks we work with are data-centric developers doing data modeling and analytics. MongoDB is a popular and disruptive technology for people used to working with SQL technologies, but when MongoDB gets dropped in, they’re not completely sure what to do with it. It’s a scalable database, great for large data sets that don’t have rigid data models, but downstream developers often have a hard time consuming that data.
The challenge of connecting BI tools to multi-dimensional documents with a dynamic schema is that the data is not structured the way those tools expect. How do you infer what the data model should be? One approach, which is just plain ugly, is flattening the JSON documents. We knew there had to be a better way.
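To make the ugliness concrete, here is a minimal Python sketch; the customer document and its field names are invented for illustration. Naive flattening turns every array element into its own dotted column, so each document can produce a different column set and the resulting "table" sprawls sideways.

```python
# A minimal sketch of naive JSON flattening; the sample document is hypothetical.

def flatten(doc, prefix=""):
    """Recursively flatten nested dicts and lists into dotted column names."""
    cols = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            cols.update(flatten(value, path + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    cols.update(flatten(item, f"{path}.{i}."))
                else:
                    cols[f"{path}.{i}"] = item
        else:
            cols[path] = value
    return cols

customer = {
    "name": "Acme Corp",
    "address": {"city": "Raleigh", "state": "NC"},
    "orders": [{"sku": "A100", "qty": 2}, {"sku": "B205", "qty": 1}],
}

print(flatten(customer))
# {'name': 'Acme Corp', 'address.city': 'Raleigh', 'address.state': 'NC',
#  'orders.0.sku': 'A100', 'orders.0.qty': 2,
#  'orders.1.sku': 'B205', 'orders.1.qty': 1}
# A document with three orders would add two more columns, so the schema
# never stabilizes. A normalized relational view avoids this.
```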
A few years ago, Gartner predicted that the NoSQL label would die out by 2017. Do you agree?
It’s a pretty hot topic. NoSQL in the beginning meant “non-SQL.” Over time, it has come to stand for “not only SQL.” Once developers reach the point of asking, “I’m building analytics or a data-driven app, so what do I do with this MongoDB document database?” you need to give them some easy options for extracting that knowledge.
What we did was build the first normalized SQL interface for MongoDB, our ODBC Connectors, which we presented at MongoDB World. Data visualization tools using interfaces such as ours are a very popular way to connect to MongoDB for analytics. Since then, MongoDB has built a BI connector of its own.
Do these ODBC Connectors go both ways?
Yes. You can normalize MongoDB data into relational views, building a structured schema on read, so the data looks natural to applications expecting tabular data. We’re the first in the industry to do that. But you can also adopt NoSQL technology alongside your existing technology without major changes. It’s a logical view we provide on top of your existing data, and it’s a major productivity enhancement because you don’t have to go in and manipulate the data yourself.
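As a rough illustration of schema on read, and not DataDirect’s actual naming, here is what querying such a view might look like from Python over ODBC. The DSN, view and column names are hypothetical; a real connector derives its own names when it builds the normalized views, typically exposing an embedded array as a child table.

```python
# A hedged sketch of schema-on-read access through an ODBC connector.
# The DSN, view and column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=MongoDB_Sales")  # hypothetical data source name
cursor = conn.cursor()

# "customers_orders" stands in for a child view the connector might derive
# from the embedded "orders" array inside each customer document.
cursor.execute("""
    SELECT name, sku, qty
    FROM customers_orders
    WHERE state = 'NC'
""")
for row in cursor.fetchall():
    print(row.name, row.sku, row.qty)
```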
What is the main use case you see for MongoDB and how is it different from Hadoop?
MongoDB is a document store, and we often see it as more of an operational type of database. It’s popular to build Node.js applications on top of MongoDB for real-time, high-performance workloads, whereas the Hadoop ecosystem, especially when looking at Hive, is more of a batch system for data lakes and analyzing Big Data sets.
What’s a common MongoDB Big Data scenario you see?
In financial services, MongoDB enables various use cases such as tick data capture, risk analytics, network security monitoring, or even content-management systems where data volumes can grow really large. A lot of folks we work with in that market are accustomed to relational models.
So are the database analysts out of date, or are the relational models out of date?
Big Data is really about analyzing new workloads and new types of data: log files on cloud systems spanning millions or billions of users. But the use cases for relational databases are not going away. Both types of databases are going to be around for a while.
Some newer digital companies are starting out with Hadoop before they have relational databases. Folks like Facebook and LinkedIn started with Big Data and didn’t even have relational systems at first. But then they realized they needed them for core operations such as financials.
Does the SQL/NoSQL divide mirror the Dev/Ops divide?
No, it’s about the downstream analysts, business analysts and reporting guys getting Big Data projects dumped in their laps. It’s not a divide but a byproduct of organizations and lines of business doing more with new data sets. Disruptive database technologies are growing across the landscape, and business analysts may not know how to derive intelligence from each of them.
Enter the data scientists, right?
Yes and no. The top data scientists are unicorns in terms of their skills: They have to have subject-matter expertise and be very statistically minded. It’s not an easy mix to find.
Instead, we’re providing standards-based connectors that shift you away from having to hire an army of people who may not exist, toward using your existing analysts who know your data and your business, so you can instantly be productive.
Companies like MicroStrategy and TIBCO Jaspersoft are creating tooling to bridge that divide, leading to a big BI world transition, right?
There is definitely a shift happening with these massive data sets. Disruptive data sources such as NoSQL come in all different shapes and sizes, while a lot of BI tools were built around relational sources. These standards-based ODBC/JDBC BI connectors are today’s secret sauce. New and existing vendors are also trying to do more closer to the data: processing, blending and wrangling data within the Big Data platform itself. There’s disruption in the industry, but by using these industry standards for accessing the data, we can help you disrupt the disruption.
How do you deal with changing data models?
When DataDirect connectors first sample a configurable portion of the database, they infer a logical, normalized data model. If that model changes significantly, you should refresh it across both BI servers and client metadata-modeling tools. In these cases, the responsibility for data integrity has shifted somewhat from the database to the developer.
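A conceptual Python sketch of that sampling step, using invented documents, shows the idea; it illustrates schema inference in general, not DataDirect’s actual algorithm.

```python
# A conceptual sketch of inferring a schema from a sample of documents.
from collections import defaultdict

def infer_schema(sample_docs):
    """Collect the union of field paths and observed types across a sample."""
    schema = defaultdict(set)

    def walk(doc, prefix=""):
        for key, value in doc.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                walk(value, path + ".")
            elif isinstance(value, list):
                schema[path].add("array")  # candidate for a child table
            else:
                schema[path].add(type(value).__name__)

    for doc in sample_docs:
        walk(doc)
    return dict(schema)

docs = [
    {"name": "Acme", "address": {"city": "Raleigh"}, "orders": [1, 2]},
    {"name": "Initech", "employees": 70},  # fields vary per document
]
print(infer_schema(docs))
# {'name': {'str'}, 'address.city': {'str'},
#  'orders': {'array'}, 'employees': {'int'}}
# If later documents drift far from the sampled model, the views need to
# be re-sampled and refreshed, as described above.
```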
Do you eat your own dog food at Progress Software?
We are able to access really large data sets with predictive analytics modules, making recommendations within applications. A lot of these things are happening right now, whether people are aware of it or not.
Within our own company, we have cloud services that produce large amounts of data. We visualize that data and display it around the company. Humans are really good at detecting patterns visually. If you’re on the way to the break room, you can look at these big screens and instantly see when something is off.
Also, some data scientists in our organization are able to use our technology to build statistical models on top of operational data in real time by leveraging our BI connectors for SaaS data sources.
How has MongoDB, now at 3.0, evolved?
As the database technology has matured, it has picked up capabilities that are part of the SQL language: the ability to aggregate, group and order by. When we query large data sets in MongoDB, that is a significant improvement.
The WiredTiger storage engine in version 3.0 improves performance over the default MMAPv1 engine by offering B-tree and log-structured merge algorithms.
Earlier, MongoDB 2.2 introduced the aggregation framework to simplify operations such as counting and averaging. Before that was available, we had to do the crunching ourselves, and crunching Big Data is not necessarily something you want to do client-side.
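For a sense of what that server-side crunching looks like, here is a small PyMongo sketch of the aggregation framework counting and averaging inside the database. The connection string, collection and field names are hypothetical.

```python
# A minimal sketch of counting and averaging server-side with the
# aggregation framework; database, collection and fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
trades = client["markets"]["trades"]

pipeline = [
    {"$match": {"symbol": "ACME"}},        # filter first
    {"$group": {                           # then group inside the database
        "_id": "$symbol",
        "trade_count": {"$sum": 1},
        "avg_price": {"$avg": "$price"},
    }},
    {"$sort": {"avg_price": -1}},          # order the results
]
for doc in trades.aggregate(pipeline):
    print(doc)
```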
Is there anything you can do that MongoDB can’t?
From the perspective of querying data for BI, the DataDirect MongoDB connectors can handle multiple types of cross-collection joins, which is not something MongoDB supports natively. How to do that is probably the most common question we get.
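As a hedged sketch of what such a join looks like through a SQL interface: the DSN, tables and columns below are hypothetical, and the join itself is performed by the connector rather than by MongoDB.

```python
# A sketch of a cross-collection join expressed in SQL over ODBC; all
# names are hypothetical, and the connector evaluates the join.
import pyodbc

conn = pyodbc.connect("DSN=MongoDB_Sales")  # hypothetical data source name
cursor = conn.cursor()

# "customers" and "orders" stand in for views over two separate MongoDB
# collections; MongoDB itself has no native join for the connector to use.
cursor.execute("""
    SELECT c.name, SUM(o.total) AS lifetime_value
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY lifetime_value DESC
""")
for name, lifetime_value in cursor.fetchall():
    print(name, lifetime_value)
```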
Why is data connectivity your passion?
The bulk of my career has been around these industry standards for connectivity, which keep everything open and vendor-neutral. I understand the pain points of developers and what they need, but also the enterprise side. My employer, Progress DataDirect, is a member of both the OData technical committee and the JDBC expert group. John Goodson founded the DataDirect business line and is the co-author of the ODBC specification.
One of the unique things about this job is that, being in the connectivity space, you sit in between everything. I really enjoyed my NoSQL evangelism tour last year because I got to meet all kinds of cool developers. So my passion is to connect data, people and ISVs, and keep the intelligence flowing.