The building blocks of SQL

Published: March 29th, 2016

- Alex Handy

SQL is the prime building block of the modern enterprise. All those exciting applications, nifty mobile apps and massive back-end projects are, essentially, useless without the data behind them. That data may not be so important at runtime if the application is just saving logs or form information, but at the end of the day, that data has to live somewhere.

Today, wherever that data ends up, it’s highly likely it will be accessed with SQL (or at the very least a SQL offshoot, be it Oracle’s PL/SQL or Microsoft’s Transact-SQL). At their cores, even the modified versions of SQL all aim for the same goal: making data stores accessible to analysts and business people.

In the beginning, Edgar F. Codd, Donald Chamberlin and Raymond F. Boyce laid out the basics for relational databases in their work at IBM between 1970 and 1974. That work would form the foundation for databases for decades to come, and included the invention of not only SQL, but also the schema model for storing and organizing data into tables.

The SQL we use today bears little resemblance to the original language, created at IBM’s San Jose Research Laboratory. Originally dubbed SEQUEL, which stood for the Structured English Query Language, it was designed as a tool to help access the data in these newfangled things IBM was playing with: relational databases.

By 1979, the original SEQUEL ideas (by then shortened to SQL due to trademark concerns) had percolated within Relational Software, the company that would become Oracle. By the end of 1979, Relational offered the first commercial implementation of SQL with its Oracle V2 database for VAX.

It would take another seven years before ANSI would standardize SQL. The SQL-87 standard would lay the foundations for modern software development and data management by ensuring that different database vendors would be able to run the same queries. This made knowledge workers vastly more valuable, as they could move from company to company and not require retraining to use a different database.

Two revisions later, SQL-92 saw the first sweeping changes to the language. The actual spec itself grew exponentially in this release, though new features only accounted for double the size of the standard. The primary goal for SQL-92 was to be much more specific about how things should be done, thus lowering the amount of divergence between the various relational database platforms in the market.

SQL has continued to grow over the years, gaining recursive queries in 1999, adding XML support in 2003, and taking in XQuery support in 2006. That brings us to today, when the SQL:2011 standard rules the roost.

SQL:2011, as it is called, was primarily about temporal support. This version of the standard brought in more handlers for doing work related to time series inside databases. This means most SQL databases (such as PostgreSQL, Oracle and DB2) can now treat time as a top-level function across SQL, and there are new temporal predicates, such as overlaps, equals and precedes. This means time-series database work should be easier to sync up across different vendors.

Oracle, for example, supports SQL:2011 in 12c, but versions 10g and 11g use Oracle’s Flashback queries to ask time-based questions to their databases. IBM, on the other hand, calls its temporal features Time Travel Queries.
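To make the temporal predicates concrete, here is a minimal Python sketch of overlaps, equals and precedes. It assumes half-open periods (the start is included, the end is not); the `Period` class and the sample dates are illustrative only and are not taken from any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Period:
    """A half-open period [start, end): start is included, end is not."""
    start: date
    end: date

    def __post_init__(self):
        if self.start >= self.end:
            raise ValueError("period start must come before its end")

    def overlaps(self, other: "Period") -> bool:
        # The two periods share at least one point in time.
        return self.start < other.end and other.start < self.end

    def equals(self, other: "Period") -> bool:
        # Both boundaries match exactly.
        return self.start == other.start and self.end == other.end

    def precedes(self, other: "Period") -> bool:
        # This period ends before, or exactly when, the other begins.
        return self.end <= other.start


# Hypothetical sample data: two adjacent quarters and a promotion
# that straddles the boundary between them.
q1 = Period(date(2016, 1, 1), date(2016, 4, 1))
q2 = Period(date(2016, 4, 1), date(2016, 7, 1))
promo = Period(date(2016, 3, 15), date(2016, 4, 15))

print(q1.precedes(q2), q1.overlaps(q2), promo.overlaps(q1), promo.overlaps(q2))
```

Because the periods are half-open, two adjacent quarters do not overlap even though one ends on the date the other begins, while the promotion overlaps both.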

This, perhaps, all points to the future for the SQL standard. As SQL has evolved over the past 40 years, it’s consistently taken on the common data challenges of the day with an approach that comes close to making everyone happy.

The Big Data connection
One place where the future of SQL is evident is in the world of Big Data. When Apache Hadoop burst onto the scene in 2010, there were no SQL tools in sight. But as of 2014, SQL on Hadoop has become essential.

A major reason for the continuing popularity of SQL, said Vicky Harp, corporate strategist at Idera, is that open source has democratized the language, opening it to more than simply enterprise users.

“For a long time, [SQL] was something people saw as [for the] big enterprise, but now we have other open-source alternatives, so people don’t have to make the big investment that they did before. You have a lot more developers who know SQL now,” she said.

“I think the analytics platforms are coming along with access to data. We’ve had a large data accumulation phase, and people are seeing that you can get things out of that. When you’re asking, ‘What do we do with all these marketing visits to our website?’ it winds up being more data than you could point Crystal Reports at.”

Because all of this data is being saved, the natural business instinct is to do something with it. The trick is to actually get information out of the data, a task that requires highly skilled workers and, more often than not, SQL.

Unfortunately, said Harp, the market has realized this as well, and has essentially flooded customers with choices. That means there’s a lot of turbulence and no clear market leader when it comes to SQL on Hadoop, or even analytics on Hadoop.

“You need to have more data science and actual analytics capabilities,” said Harp. “The space is a place where there’s a need. We’re seeing there’s a lot of jostling in the Gartner Magic Quadrant on that in 2015 and 2016. We saw a lot of people drop in terms of their ability to execute, which I thought was interesting. It’s a space to continue to watch. We’re also seeing vendors move in and out of that magic quadrant. It’s not like they’re having trouble finding vendors. In the 2016 version, even Oracle fell off.

“There is demand in the market for people to do what they’re comfortable with, and at the same time, it’s the relational database providers who are seeing what their users want. It depends on what you’re looking at. Is this relational on top of Hadoop versus…Hadoop working with SQL Server or some other platform where you are mixing the two types of data?”

Indeed, Hadoop has muddied the waters around big enterprise data analytics, thanks to hundreds of vendors now offering compatible products to analyze the mountains of data that come from a modern enterprise.

Monte Zweben, cofounder and CEO of Splice Machine, has built a company to deliver ACID transactions on top of Hadoop. That means SQL users can use their Hadoop cluster as they would typically use a relational data store.

“I don’t think it’s the language [SQL] that I would argue is the new innovation; it’s the workload using the language that’s going to be unique,” he said. “I see the world bifurcating. What I mean by that is, there was this heavy push to do rapid ingestion of data. The NoSQL guys glommed onto that. Then there’s this other world of people doing big batch analytics. This is where the Hadoop world has gone. All SQL on Hadoop is focused on that: big batch analytics.

“The one piece of the pie nobody addressed was powering concurrent applications. That’s where you need ACID semantics. That’s what relational databases had done for years and years. If you have all three of those, you have what’s typically remarked as a dual workload. The magic of this next generation of architectures is supporting the dual workload, where those workloads are isolated from each other, and don’t interfere with each other.

“Think of a database that’s trying to do both analytics and transactions. What typically happens is you run analytics on a single-lane highway, blocking all these little cars behind them. Those cars are the transactions. If someone kicks off a report to summarize the last six months of sales, and all of a sudden your resources are shot, that’s what traditional databases struggle with: resource isolation.

“In the new architectures, you can use different Big Data compute engines for different purposes. We have one lane for transactions powered by HBase, and one lane for analysis powered by Spark.”

And that is, perhaps, the biggest draw to Big Data for SQL users: the potential to unlock massive troves of data without the potential to lock up the entire dataset with a single miswritten query.

The Calcite layer: Key to SQL’s future
SQL’s big contribution to humanity is providing a singular way to access data, regardless of the underlying storage medium or vendor. The various compromises currently required by cloud infrastructure, however, are beginning to cause divergence once again, as numerous data stores compete in the cloud. Many have their own little SQL quirks or oversights.

That’s why the Apache Calcite project is so important to the future of SQL and to the future of Big Data. The project was created three years ago by Julian Hyde, a data architect at Hortonworks. The goal of the project was to clean up the mess around how SQL is run across Big Data. Essentially, Calcite is a generic query optimizer that is compatible with anything for which developers desire to write a plug-in.

“I’m a database guy. I’ve been building databases for most of my career: SQL databases, open-source and otherwise,” said Hyde. “I wrote the Mondrian engine, the leading open-source OLAP engine. I’d done query optimizers before. What I saw (and the Hadoop revolution was one big part of it) was the fact that the database was no longer a monolithic entity. People were choosing their own storage formats and algorithms.

“Federating the data across a cluster (or several clusters) and a query optimizer were going to be key to keeping those all together and keeping your sanity. I thought to liberate the query optimizer from the inside of the database so people could integrate disparate components.

“There is a diverse community of users, but not everyone wants to write Scala, not everyone wants to write SQL, not everyone wants to write R. But all those communities exist, and they need to be served. It was fairly clear to a lot of us that a SQL interface to Hadoop was going to come along, and two years ago about 10 came along at once. There’s not a single paradigm that will win, but the SQL community is very strong and doesn’t show signs of going away. Tableau is still the way the majority of users get to their data.”

Calcite brings some coherence to this multiple-language world. Instead of implementing its own database, Calcite is, essentially, the building blocks for a database. Calcite includes the framework for managing data, but does not include traditional database capabilities, such as managing storage locations, hosting a repository for metadata, or including algorithms for processing data.

“What I think is interesting about SQL is the declarative approach to data, where you have a query planner,” said Hyde. “You say ‘Here’s what I want to get,’ and the system goes and gets it. That isn’t limited to SQL: Pig has an optimizer in it, Storm has an optimizer in it. The general approach extends beyond SQL.

“Another part of our mission is integrating together data federation. That’s why an open-source project is a good way of solving it: We have various people who are solving these individual problems that find that Calcite is the way they can pool their resources. Just last week someone contributed an Apache Cassandra adapter. They also recognize that there is some basic stuff that query optimizers do that applies to Cassandra, just as it applies to MySQL or Apache Drill.”
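The declarative approach Hyde describes can be seen in miniature with a short Python sketch. It uses SQLite only because it ships with Python; it is a stand-in, not Calcite or any Hadoop engine, and the `sales` table, the index name and the `plan` helper are assumptions made for illustration. The query text never changes, but the planner is free to pick a different strategy once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("west", 120.0), ("east", 80.0), ("west", 45.5)],
)

# The query states what is wanted, not how to fetch it.
query = "SELECT SUM(amount) FROM sales WHERE region = ?"


def plan(connection, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return [row[-1] for row in rows]


plan_before = plan(conn, query, ("west",))

# Adding an index changes the available strategies, not the query.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = plan(conn, query, ("west",))

total = conn.execute(query, ("west",)).fetchone()[0]

print(plan_before)
print(plan_after)
print(total)
```

The exact wording of the plan text varies between SQLite versions, so it is printed rather than stated here; the point is only that the same declarative query can be executed in more than one way.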

Calcite, said Hyde, allows database engineers “to start 80% up the mountain and climb the interesting 20%.” That means all the mundane things databases must do to handle queries can be handled by Calcite, while the more important differentiation features, such as storage medium, built-in algorithms and a metadata store, are handled by the engineers.

“Another thing this particular contributor wanted was Calcite’s support for materialized views,” said Hyde. “That’s a table whose contents are defined by a query. This table always contains the highest salary of each department, so if someone writes a query, they can go to this table instead. That avoids actually scanning all the data. Calcite has the features for defining these materialized views.”
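Hyde's salary example can be sketched in a few lines of Python. This is an emulation under stated assumptions, not Calcite's mechanism: SQLite has no materialized views, so the sketch stores the query result in an ordinary table and refreshes it by hand, and the table names, sample rows and helper functions are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "eng", 150), ("Bob", "eng", 120),
     ("Cy", "ops", 90), ("Di", "ops", 110)],
)

# The query that defines the contents of the precomputed table.
DEFINING_QUERY = (
    "SELECT department, MAX(salary) AS top_salary "
    "FROM employees GROUP BY department"
)


def refresh_top_salaries(connection):
    """Rebuild the precomputed table from its defining query."""
    connection.execute("DROP TABLE IF EXISTS top_salaries")
    connection.execute("CREATE TABLE top_salaries AS " + DEFINING_QUERY)


def top_salary(connection, department):
    """Answer from the small precomputed table, not the base table."""
    row = connection.execute(
        "SELECT top_salary FROM top_salaries WHERE department = ?",
        (department,),
    ).fetchone()
    return row[0] if row else None


refresh_top_salaries(conn)
print(top_salary(conn, "eng"), top_salary(conn, "ops"))
```

The trade-off is staleness: the stored result only reflects new rows in `employees` after the next refresh. In this sketch the caller also has to know to read `top_salaries`; an optimizer that understands materialized views can make that substitution itself.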

Enterprises are addicted to those highly important data queries, and Calcite can help to eliminate some of the headaches associated with them. “On the mundane level, we are using Calcite to build really high-quality cost-based optimizers for some really high-performance systems,” said Hyde. “Hortonworks is investing in Apache Hive very strongly, and we’re building a world-class cost-based optimizer in Hive. It’s a massive ongoing engineering effort. Oracle, Microsoft and IBM have spent a lot of effort building their cost-based optimizers for their systems.

“My prediction is that people will want a SQL interface on top of streaming data for the same reason they wanted SQL on top of Hadoop. Not because SQL is the ideal language, but because of its interoperability. Existing skill sets can use them, and the system can self optimize.”

Jim Scott, director of enterprise strategy and architecture at MapR, said that SQL still drives the needs of many enterprises. “When it comes down to it, most people need the rudimentary basics of ANSI SQL, and the tiny subset of that in Hive is usually less than adequate,” he said.

“Calcite is just sitting out there waiting to be used. Drill helped open that one up. When it comes down to it, look at the history of SQL on Hadoop technologies. Apache Hive was a great entry into expressing SQL at scale. Apache Impala came along and took a step forward and said, ‘We need to make this faster.’ They didn’t necessarily fix the problems. They just made something run faster, so it has a complete dependency on Hive.”

Scott predicted change will come to the SQL-on-Hadoop market, mainly because existing solutions are not optimal. “I think what it comes down to is the logical model these platforms have been built on is not the easiest to adapt to the complexity SQL will support,” he said.

“Idealistically, people are going to put their hands on a tool like Apache Drill [and] say, ‘I can start with this on my laptop and can query every data store in my enterprise.’ Drill supports utilizing the Hive metastore, but does not require Hive to use it. There has been a competitive landscape of SQL on Hadoop.”

Apache Calcite
Perhaps the best way to describe Apache Calcite is to let the project describe itself. According to the Apache site:

“Apache Calcite is a dynamic data-management framework.

“It contains many of the pieces that comprise a typical database-management system, but omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.

“Calcite intentionally stays out of the business of storing and processing data. As we shall see, this makes it an excellent choice for mediating between applications and one or more data-storage locations and data-processing engines. It is also a perfect foundation for building a database: Just add data.”

Article Tags

Big Data, Calcite, data, data analytics, databases, Hadoop, Spark, SQL

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.
