“The one piece of the pie nobody addressed was powering concurrent applications. That’s where you need ACID semantics. That’s what relational databases had done for years and years. If you have all three of those, you have what’s typically remarked as a dual workload. The magic of this next generation of architectures is supporting the dual workload, where those workloads are isolated from each other, and don’t interfere with each other.
“Think of a database that’s trying to do both analytics and transactions. What typically happens is you run analytics on a single-lane highway, blocking all these little cars behind them. Those cars are the transactions. If someone kicks off a report to summarize the last six months of sales, and all of a sudden your resources are shot, that’s what traditional databases struggle with: resource isolation.
“In the new architectures, you can use different Big Data compute engines for different purposes. We have one lane for transactions powered by HBase, and one lane for analysis powered by Spark.”
And that is, perhaps, the biggest draw of Big Data for SQL users: the potential to unlock massive troves of data without the risk of locking up the entire dataset with a single miswritten query.
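The two-lane idea described above can be illustrated with a small sketch. It is a toy, not how HBase or Spark are actually wired together: the two thread pools stand in for separate compute engines, and every name in it (`transaction_lane`, `analytics_lane`, `six_month_sales_report`, `record_sale`, `run_demo`) is hypothetical. The point is only that a slow report running in its own lane does not hold up short transactional writes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: in a real deployment these lanes would be separate
# engines (for example HBase and Spark), not in-process thread pools.
transaction_lane = ThreadPoolExecutor(max_workers=2, thread_name_prefix="txn")
analytics_lane = ThreadPoolExecutor(max_workers=1, thread_name_prefix="olap")


def six_month_sales_report(sales, may_finish):
    """Simulate a long analytic query that only finishes when allowed to."""
    may_finish.wait(timeout=5)
    return sum(sales)


def record_sale(ledger, amount):
    """Simulate a short transactional write."""
    ledger.append(amount)
    return len(ledger)


def run_demo():
    history = [100, 250, 75]
    ledger = []
    may_finish = threading.Event()

    # Kick off the slow report first, in the analytics lane.
    report = analytics_lane.submit(six_month_sales_report, history, may_finish)

    # Transactions go to their own lane and complete without waiting for it.
    writes = [transaction_lane.submit(record_sale, ledger, amount)
              for amount in (10, 20, 30)]
    for write in writes:
        write.result(timeout=5)

    report_still_running = not report.done()
    may_finish.set()  # now let the report complete
    return sorted(ledger), report_still_running, report.result(timeout=5)
```

Calling `run_demo()` returns the recorded sales, a flag showing the report was still in progress when the writes finished, and the report total. Had both kinds of work shared a single one-worker pool, the writes would have queued behind the report, which is the single-lane highway from the quote.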
The Calcite layer: Key to SQL’s future
SQL’s big contribution to humanity is providing a single way to access data, regardless of the underlying storage medium or vendor. The various compromises currently required by cloud infrastructure, however, are beginning to cause divergence once again, as numerous data stores compete in the cloud. Many have their own little SQL quirks or oversights.
That’s why the Apache Calcite project is so important to the future of SQL and to the future of Big Data. The project was created three years ago by Julian Hyde, a data architect at Hortonworks, to clean up the mess around how SQL is run across Big Data. Essentially, Calcite is a generic query optimizer that can work with any data store for which developers are willing to write a plug-in.
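To make the plug-in idea concrete, here is a minimal sketch of an optimizer that sits outside any one data store. It does not use Calcite's real API (Calcite is a Java library with its own adapter and planner-rule interfaces); the `Adapter` class, the `can_push_filter` flag, and the `plan` function are all hypothetical names invented for illustration. The planner asks each plug-in what it can do and either pushes a filter down to the store or applies the filter itself.

```python
class Adapter:
    """Hypothetical plug-in interface for a data store (not Calcite's API)."""
    can_push_filter = False

    def scan(self):
        raise NotImplementedError

    def scan_filtered(self, predicate):
        raise NotImplementedError


class FlatFileAdapter(Adapter):
    """A store that can only hand back every row."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        return iter(self.rows)


class IndexedStoreAdapter(Adapter):
    """A store that can evaluate filters itself."""
    can_push_filter = True

    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        return iter(self.rows)

    def scan_filtered(self, predicate):
        return (row for row in self.rows if predicate(row))


def plan(adapter, predicate):
    """Pick a strategy per adapter; return (strategy name, runnable plan)."""
    if adapter.can_push_filter:
        # Let the store do the filtering so fewer rows leave it.
        return "pushdown", lambda: list(adapter.scan_filtered(predicate))
    # Otherwise scan everything and filter inside the query engine.
    return "filter-in-engine", lambda: [r for r in adapter.scan() if predicate(r)]
```

The same query produces the same rows against either store; only the chosen strategy differs. That separation, one optimizer and many interchangeable back ends, is the design the article attributes to Calcite, shown here in deliberately simplified form.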
“I’m a database guy. I’ve been building databases for most of my career: SQL databases, open-source and otherwise,” said Hyde. “I wrote the Mondrian engine, the leading open-source OLAP engine. I’d done query optimizers before. What I saw, and the Hadoop revolution was one big part of it, was the fact that the database was no longer a monolithic entity. People were choosing their own storage formats and algorithms.
“Federating the data across a cluster (or several clusters) and a query optimizer were going to be key to keeping those all together and keeping your sanity. I thought to liberate the query optimizer from the inside of the database so people could integrate disparate components.