“There is a diverse community of users, but not everyone wants to write Scala, not everyone wants to write SQL, not everyone wants to write R. But all those communities exist, and they need to be served. It was fairly clear to a lot of us that a SQL interface to Hadoop was going to come along, and two years ago about 10 came along at once. There’s not a single paradigm that will win, but the SQL community is very strong and doesn’t show signs of going away. Tableau is still the way the majority of users get to their data.”
Calcite brings some coherence to this multiple-language world. Rather than implementing a database of its own, Calcite is essentially a set of building blocks for one. It includes the framework for managing data, but leaves out traditional database capabilities such as managing storage locations, hosting a metadata repository, and supplying algorithms for processing data.
“What I think is interesting about SQL is the declarative approach to data, where you have a query planner,” said Hyde. “You say ‘Here’s what I want to get,’ and the system goes and gets it. That isn’t limited to SQL: Pig has an optimizer in it, Storm has an optimizer in it. The general approach extends beyond SQL.
“Another part of our mission is integrating together data federation. That’s why an open-source project is a good way of solving it: We have various people who are solving these individual problems that find that Calcite is the way they can pool their resources. Just last week someone contributed an Apache Cassandra adapter. They also recognize that there is some basic stuff that query optimizers do that applies to Cassandra, just as it applies to MySQL or Apache Drill.”
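The declarative approach Hyde describes, in which the user states what they want and a planner decides how to get it, can be illustrated with a toy example. The sketch below is not Calcite's API. The `Scan`, `Project` and `Filter` classes, the `TABLES` dictionary and the single rewrite rule are all hypothetical names invented for illustration. It shows the kind of "basic stuff" an optimizer does regardless of the backend: rewriting a plan into an equivalent one that filters rows earlier.

```python
from dataclasses import dataclass

# Hypothetical in-memory "backend"; a real system would read from MySQL, Cassandra, etc.
TABLES = {
    "emp": [
        {"name": "Ann", "dept": "Sales", "salary": 90},
        {"name": "Bob", "dept": "Eng", "salary": 120},
        {"name": "Cy", "dept": "Eng", "salary": 150},
    ]
}

@dataclass
class Scan:
    table: str

@dataclass
class Project:
    columns: list
    child: object

@dataclass
class Filter:
    column: str
    value: object
    child: object

def push_filter_below_project(plan):
    """Rewrite Filter(Project(x)) as Project(Filter(x)) when the filtered
    column survives the projection, so rows are discarded earlier."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        if plan.column in project.columns:
            pushed = Filter(plan.column, plan.value, project.child)
            return Project(project.columns, pushed)
    return plan  # rule does not apply; leave the plan unchanged

def execute(plan):
    """Evaluate a plan tree bottom-up against TABLES."""
    if isinstance(plan, Scan):
        return list(TABLES[plan.table])
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if r[plan.column] == plan.value]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

# The user only declares *what* they want: names and departments of Eng staff.
naive = Filter("dept", "Eng", Project(["name", "dept"], Scan("emp")))
optimized = push_filter_below_project(naive)

print(optimized)
print(execute(optimized))
```

Both plans return the same rows, and only the order of the operations differs. A production optimizer applies many such rules and, in the cost-based case, compares estimated costs before choosing a plan.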
Calcite, said Hyde, allows database engineers “to start 80% up the mountain and climb the interesting 20%.” That means Calcite can take care of the mundane things every database must do to handle queries, while the engineers concentrate on the differentiating features, such as the storage medium, built-in algorithms and a metadata store.
“Another thing this particular contributor wanted was Calcite’s support for materialized views,” said Hyde. “That’s a table whose contents are defined by a query. This table always contains the highest salary of each department, so if someone writes a query, they can go to this table instead. That avoids actually scanning all the data. Calcite has the features for defining these materialized views.”
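Hyde's salary example can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module rather than Calcite, and because SQLite has no materialized views, an ordinary summary table created with `CREATE TABLE ... AS SELECT` stands in for one. The table names, column names and sample rows are assumptions made up for illustration.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a base table and a summary table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
        INSERT INTO emp VALUES
            ('Ann', 'Sales', 90),
            ('Bob', 'Sales', 120),
            ('Cy',  'Eng',   150),
            ('Di',  'Eng',   110);

        -- Stand-in for a materialized view: a table whose contents
        -- are defined by a query over the base table.
        CREATE TABLE max_salary_by_dept AS
            SELECT dept, MAX(salary) AS max_salary
            FROM emp
            GROUP BY dept;
        """
    )
    return conn

def max_salary_from_base(conn, dept):
    """Answer the question by scanning and aggregating the base table."""
    row = conn.execute(
        "SELECT MAX(salary) FROM emp WHERE dept = ?", (dept,)
    ).fetchone()
    return row[0]

def max_salary_from_summary(conn, dept):
    """Answer the same question from the precomputed summary table."""
    row = conn.execute(
        "SELECT max_salary FROM max_salary_by_dept WHERE dept = ?", (dept,)
    ).fetchone()
    return row[0]

conn = build_demo_db()
print(max_salary_from_base(conn, "Eng"))
print(max_salary_from_summary(conn, "Eng"))
```

In this sketch the caller chooses the summary table by hand, and the table is not refreshed when `emp` changes. The point of materialized-view support in a planner is that the substitution happens automatically: the user queries the base table, and the planner recognizes that the precomputed table can answer the query.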
Enterprises are addicted to those highly important data queries, and Calcite can help to eliminate some of the headaches associated with them. “On the mundane level, we are using Calcite to build really high-quality cost-based optimizers for some really high-performance systems,” said Hyde. “Hortonworks is investing in Apache Hive very strongly, and we’re building a world-class cost-based optimizer in Hive. It’s a massive ongoing engineering effort. Oracle, Microsoft and IBM have spent a lot of effort building their cost-based optimizers for their systems.
“My prediction is that people will want a SQL interface on top of streaming data for the same reason they wanted SQL on top of Hadoop. Not because SQL is the ideal language, but because of its interoperability. Existing skill sets can use them, and the system can self-optimize.”