SQL is the prime building block of the modern enterprise. All those exciting applications, nifty mobile apps and massive back-end projects are, essentially, useless without the data behind them. That data may not be so important at runtime if the application is just saving logs or form information, but at the end of the day, that data has to live somewhere.
Today, wherever that data ends up, it’s highly likely it will be accessed with SQL (or at the very least a SQL offshoot, be it Oracle’s PL/SQL or Microsoft’s Transact-SQL). At their cores, even the modified versions of SQL all aim for the same goal: making data stores accessible to analysts and business people.
In the beginning, Edgar F. Codd, Donald Chamberlin and Raymond F. Boyce laid out the basics for relational databases in their work at IBM between 1970 and 1974. That work would form the foundation for databases for decades to come, and included the invention of not only SQL, but also the schema model for storing and organizing data into tables.
The SQL we use today bears little resemblance to the original language, created at IBM’s San Jose Research Laboratory. Originally dubbed SEQUEL, which stood for the Structured English Query Language, it was designed as a tool to help access the data in these newfangled things IBM was playing with: relational databases.
By 1979, the original SEQUEL ideas (by then shortened to SQL due to trademark concerns) had percolated within Relational Software, the company that would become Oracle. By the end of 1979, Relational offered the first commercial implementation of SQL with its Oracle V2 database for VAX.
It would take another seven years before ANSI standardized SQL. The SQL-86 standard (adopted by ISO the following year) laid the foundation for modern software development and data management by ensuring that different database vendors could run the same queries. This made knowledge workers vastly more valuable, as they could move from company to company without retraining on a different database.
Two revisions later, SQL-92 brought the first sweeping changes to the language. The specification itself grew dramatically in this release, though new features accounted for only about a doubling in size; the rest of the growth came from SQL-92's primary goal of spelling out much more precisely how things should be done, thus reducing divergence between the various relational database platforms on the market.
SQL has continued to grow over the years, gaining recursive queries in 1999, XML support in 2003, and XQuery support in 2006. That brings us to today, when the SQL:2011 standard rules the roost.
SQL:2011, as it is called, was primarily about temporal support. This version of the standard introduced new constructs for working with time-varying data inside databases, so most SQL databases (such as PostgreSQL, Oracle and DB2) can now treat time as a first-class concept, with new temporal predicates such as overlaps, equals and precedes. As a result, time-series work should be easier to keep consistent across different vendors.
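To make that concrete, here is a minimal sketch of the kind of temporal query SQL:2011 enables, against a hypothetical employee table; the table, its columns and its period definition are illustrative, and the exact syntax varies somewhat by vendor:

    -- Ask what the table looked like at a point in the past
    -- (system-versioned, or "time travel," query)
    SELECT emp_id, dept, salary
    FROM employee
    FOR SYSTEM_TIME AS OF TIMESTAMP '2015-06-01 00:00:00';

    -- Use a temporal predicate against an application-time period
    SELECT emp_id, dept
    FROM employee
    WHERE emp_period OVERLAPS PERIOD (DATE '2015-01-01', DATE '2015-12-31');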
Oracle, for example, supports SQL:2011 in 12c, while versions 10g and 11g rely on Oracle’s Flashback queries to ask time-based questions of the database. IBM, on the other hand, calls its temporal features Time Travel Queries.
This, perhaps, all points to the future for the SQL standard. As SQL has evolved over the past 40 years, it’s consistently taken on the common data challenges of the day with an approach that comes close to making everyone happy.
The Big Data connection
One place where the future of SQL is evident is in the world of Big Data. When Apache Hadoop burst onto the scene in 2010, there were no SQL tools in sight. But as of 2014, SQL on Hadoop has become essential.
A major reason for the continuing popularity of SQL, said Vicky Harp, corporate strategist at Idera, is that open source has democratized the language, opening it to more than simply enterprise users.
“For a long time, [SQL] was something people saw as [for the] big enterprise, but now we have other open-source alternatives, so people don’t have to make the big investment that they did before. You have a lot more developers who know SQL now,” she said.
“I think the analytics platforms are coming along with access to data. We’ve had a large data accumulation phase, and people are seeing that you can get things out of that. When you’re asking, ‘What do we do with all these marketing visits to our website?’ it winds up being more data than you could point Crystal Reports at.”
Because all of this data is being saved, the natural business instinct is to do something with it. The trick is to actually get information out of the data, a task that requires highly skilled workers and, more often than not, SQL.
Unfortunately, said Harp, the market has realized this as well, and has essentially flooded customers with choices. That means there’s a lot of turbulence and no clear market leader when it comes to SQL on Hadoop, or even analytics on Hadoop.
“You need to have more data science and actual analytics capabilities,” said Harp. “The space is a place where there’s a need. We’re seeing there’s a lot of jostling in the Gartner Magic Quadrant on that in 2015 and 2016. We saw a lot of people drop in terms of their ability to execute, which I thought was interesting. It’s a space to continue to watch. We’re also seeing vendors move in and out of that magic quadrant. It’s not like they’re having trouble finding vendors. In the 2016 version, even Oracle fell off.
“There is demand in the market for people to do what they’re comfortable with, and at the same time, it’s the relational database providers who are seeing what their users want. It depends on what you’re looking at. Is this relational on top of Hadoop versus…Hadoop working with SQL Server or some other platform where you are mixing the two types of data?”
Indeed, Hadoop has muddied the waters around big enterprise data analytics, thanks to hundreds of vendors now offering compatible products to analyze the mountains of data that come from a modern enterprise.
Monte Zweben, cofounder and CEO of Splice Machine, has built a company to deliver ACID transactions on top of Hadoop. That means SQL users can use their Hadoop cluster as they would typically use a relational data store.
“I don’t think it’s the language [SQL] that I would argue is the new innovation; it’s the workload using the language that’s going to be unique,” he said. “I see the world bifurcating. What I mean by that is, there was this heavy push to do rapid ingestion of data. The NoSQL guys glommed onto that. Then there’s this other world of people doing big batch analytics. This is where the Hadoop world has gone. All SQL on Hadoop is focused on that: big batch analytics.
“The one piece of the pie nobody addressed was powering concurrent applications. That’s where you need ACID semantics. That’s what relational databases had done for years and years. If you have all three of those, you have what’s typically remarked as a dual workload. The magic of this next generation of architectures is supporting the dual workload, where those workloads are isolated from each other, and don’t interfere with each other.
“Think of a database that’s trying to do both analytics and transactions. What typically happens is you run analytics on a single-lane highway, blocking all these little cars behind them. Those cars are the transactions. If someone kicks off a report to summarize the last six months of sales, and all of a sudden your resources are shot, that’s what traditional databases struggle with: resource isolation.
“In the new architectures, you can use different Big Data compute engines for different purposes. We have one lane for transactions powered by HBase, and one lane for analysis powered by Spark.”
And that is, perhaps, the biggest draw to Big Data for SQL users: the potential to unlock massive troves of data without the risk of locking up the entire dataset with a single miswritten query.
The Calcite layer: Key to SQL’s future
SQL’s big contribution to humanity is providing a singular way to access data, regardless of the underlying storage medium or vendor. The various compromises currently required by cloud infrastructure, however, are beginning to cause divergence once again, as numerous data stores compete in the cloud. Many have their own little SQL quirks or oversights.
That’s why the Apache Calcite project is so important to the future of SQL and to the future of Big Data. The project was created three years ago by Julian Hyde, a data architect at Hortonworks. The goal of the project was to clean up the mess around how SQL is run across Big Data. Essentially, Calcite is a generic query optimizer that is compatible with anything for which developers desire to write a plug-in.
“I’m a database guy. I’ve been building databases for most of my career: SQL databases, open-source and otherwise,” said Hyde. “I wrote Mondrian, the leading open-source OLAP engine. I’d done query optimizers before. What I saw (and the Hadoop revolution was one big part of it) was that the database was no longer a monolithic entity anymore. People were choosing their own storage formats and algorithms.
“Federating the data across a cluster (or several clusters) and having a query optimizer were going to be key to keeping it all together and keeping your sanity. I wanted to liberate the query optimizer from the inside of the database so people could integrate disparate components.
“There is a diverse community of users, but not everyone wants to write Scala, not everyone wants to write SQL, not everyone wants to write R. But all those communities exist, and they need to be served. It was fairly clear to a lot of us that a SQL interface to Hadoop was going to come along, and two years ago about 10 came along at once. There’s not a single paradigm that will win, but the SQL community is very strong and doesn’t show signs of going away. Tableau is still the way the majority of users get to their data.”
Calcite brings some coherence to this multiple-language world. Rather than implementing its own database, Calcite is, essentially, a set of building blocks for a database. It includes the framework for managing data, but leaves out traditional database capabilities such as managing storage, hosting a metadata repository, or providing algorithms for processing data.
“What I think is interesting about SQL is the declarative approach to data, where you have a query planner,” said Hyde. “You say ‘Here’s what I want to get,’ and the system goes and gets it. That isn’t limited to SQL: Pig has an optimizer in it, Storm has an optimizer in it. The general approach extends beyond SQL.
“Another part of our mission is data federation. That’s why an open-source project is a good way of solving it: We have various people who are solving these individual problems that find that Calcite is the way they can pool their resources. Just last week someone contributed an Apache Cassandra adapter. They also recognize that there is some basic stuff that query optimizers do that applies to Cassandra, just as it applies to MySQL or Apache Drill.”
Calcite, said Hyde, allows database engineers “to start 80% up the mountain and climb the interesting 20%.” That means all the mundane things databases must do to handle queries can be handled by Calcite, while the more important differentiation features, such as storage medium, built-in algorithms and a metadata store, are handled by the engineers.
“Another thing this particular contributor wanted was Calcite’s support for materialized views,” said Hyde. “That’s a table whose contents are defined by a query. This table always contains the highest salary of each department, so if someone writes a query asking for that, they can go to this table instead. That avoids actually scanning all the data. Calcite has the features for defining these materialized views.”
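To make Hyde’s example concrete, here is a hedged sketch of such a materialized view; the table and column names are illustrative, and the exact CREATE syntax differs between databases:

    -- A table whose contents are defined by a query:
    -- the highest salary in each department
    CREATE MATERIALIZED VIEW dept_max_salary AS
    SELECT dept_id, MAX(salary) AS max_salary
    FROM employee
    GROUP BY dept_id;

    -- An optimizer that knows about the view can answer this query
    -- from dept_max_salary instead of scanning all of employee
    SELECT dept_id, MAX(salary)
    FROM employee
    GROUP BY dept_id;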
Enterprises are addicted to those highly important data queries, and Calcite can help to eliminate some of the headaches associated with them. “On the mundane level, we are using Calcite to build really high-quality cost-based optimizers for some really high-performance systems,” said Hyde. “Hortonworks is investing in Apache Hive very strongly, and we’re building a world-class cost-based optimizer in Hive. It’s a massive ongoing engineering effort. Oracle, Microsoft and IBM have spent a lot of effort building their cost-based optimizers for their systems.
“My prediction is that people will want a SQL interface on top of streaming data for the same reason they wanted SQL on top of Hadoop. Not because SQL is the ideal language, but because of its interoperability. People with existing skill sets can use it, and the system can self-optimize.”
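Calcite already hints at what such an interface could look like with its streaming extensions to SQL. The example below is illustrative only: the orders stream, its columns and the hourly window are assumptions, written in the style of Calcite’s streaming SQL:

    -- Continuously emit hourly totals as new rows arrive on the stream
    SELECT STREAM
      TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS hour_end,
      product_id,
      SUM(quantity) AS total_quantity
    FROM orders
    GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), product_id;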
Jim Scott, director of enterprise strategy and architecture at MapR, said that SQL still drives the needs of many enterprises. “When it comes down to it, most people need the rudimentary basics of ANSI SQL, and the tiny subset of that in Hive is usually less than adequate,” he said.
“Calcite is just sitting out there waiting to be used. Drill helped open that one up. When it comes down to it, look at the history of SQL on Hadoop technologies. Apache Hive was a great entry into expressing SQL at scale. Apache Impala came along and took a step forward and said, ‘We need to make this faster.’ They didn’t necessarily fix the problems. They just made something run faster, so it has a complete dependency on Hive.”
Scott predicted change will come to the SQL-on-Hadoop market, mainly because existing solutions are not optimal. “I think what it comes down to is that the logical model these platforms have been built on is not the easiest to adapt to the complexity that SQL will support,” he said.
“Idealistically, people are going to put their hands on a tool like Apache Drill [and] say, ‘I can start with this on my laptop and can query every data store in my enterprise.’ Drill supports utilizing the Hive metastore, but does not require Hive to use it. There has been a competitive landscape of SQL on Hadoop.”
Apache Calcite
Perhaps the best way to describe Apache Calcite is to let the project describe itself. According to the Apache site:
Apache Calcite is a dynamic data-management framework.
It contains many of the pieces that comprise a typical database-management system, but omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.
Calcite intentionally stays out of the business of storing and processing data. As we shall see, this makes it an excellent choice for mediating between applications and one or more data-storage locations and data-processing engines. It is also a perfect foundation for building a database: Just add data.