PostgreSQL is a popular open-source OLTP database for systems of record. It’s the fourth most widely used database in the world, and its popularity has grown more than that of any other database in three of the last four years, according to the recent DB-Engines database rankings.
Despite the widespread use of PostgreSQL by application developers to manage transactional data, its use in analytic applications such as data warehousing has been quite limited. PostgreSQL is extremely versatile, but query performance often struggles as data volumes grow into the terabytes. As a result, analytic data processing (OLAP) is still dominated by established commercial databases like Oracle, SQL Server, and DB2; specialized platforms like IBM Netezza and Greenplum; and relative newcomers like Amazon Redshift and Snowflake. Ironically, Redshift, Netezza, and Greenplum are all forks of PostgreSQL.
But a perfect storm is brewing that bodes well for PostgreSQL and for people who want to save money with open source. Three trends are driving the growing use of PostgreSQL for data warehousing and analytic applications.
Trend one: People like saving money
Gartner Research has predicted that by next year, 70% of database applications will be running on open-source databases. Demand for new systems of insight continues to intensify, but data warehousing is an area that, historically, has not had many good, free, open-source SQL database options. That’s unfortunate, because data warehousing platforms are very expensive: the annual maintenance alone on a legacy data warehouse database platform can cost many hundreds of thousands of dollars. The economic advantages of replacing legacy data warehouses with open-source PostgreSQL are immense. As a result, we’re seeing many Fortune 1000 companies exploring open-source options like PostgreSQL.
Trend two: PostgreSQL has pumped up its performance
PostgreSQL has always been a solid transactional performance engine, but historically it has lacked some of the query performance features that we see in commercial data warehouse databases.
Those features include parallel query processing, columnar storage (which greatly reduces the time required to scan and filter query data), and more mature query planning, optimization, and execution, which give you faster performance out of the box with less tuning.
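PostgreSQL does not store data this way byte for byte, but the I/O advantage of columnar storage can be sketched in plain Python. The `orders` table below is hypothetical; the point is that an analytic query touching one column reads a fraction of the data when that column is stored contiguously:

```python
# Illustrative sketch (not PostgreSQL internals): why columnar storage
# reduces the data scanned for analytic queries.
# A hypothetical "orders" table with three columns, stored two ways.

rows = [
    {"id": i, "region": "EU" if i % 2 else "US", "amount": i * 10}
    for i in range(1000)
]

# Row-oriented layout: to sum one column, the scan still passes over
# every field of every row.
row_fields_touched = sum(len(r) for r in rows)  # 3 fields x 1000 rows

# Column-oriented layout: the same data, one list per column.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# Summing "amount" now reads only that column: 1000 values, not 3000.
col_fields_touched = len(columns["amount"])

total = sum(columns["amount"])
```

Real columnar engines add compression and vectorized execution on top of this layout, which is why the speedups on wide tables are often far larger than the 3x this toy example suggests.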
Within the last 18 months, these technical shortcomings have largely been addressed. The most recent release of PostgreSQL, version 13, included improvements in parallel processing and query planning. More importantly, thanks to its wonderful extensibility and enthusiastic community, PostgreSQL now benefits from a number of extensions that add greater parallel processing and columnar storage. So technically, we now see PostgreSQL performing on par with most of the commercial data warehouse systems, but at 50% to 90% lower annual cost.
Trend three: Goodbye OLAP, hello HTAP
The definition of a data warehouse is changing. Traditionally, an analytic system would copy data from one or more sources, via extract-transform-load (ETL) programs, into a separate data warehouse database. The problems with this approach are a) the cost of maintaining a separate database, b) reporting latency, and c) the massive effort of maintaining ETL scripts. Minutes, hours, or even days can pass before the data is ETL’d into the data warehouse, which means people are often analyzing data that is already outdated.
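The latency problem above can be sketched in a few lines of Python. The `source` and `warehouse` stores and the `run_etl` function are hypothetical stand-ins for an OLTP database, a warehouse, and a batch ETL job:

```python
# Hypothetical minimal ETL sketch: copy orders from a "source" OLTP
# store into a separate "warehouse" store, illustrating why reports lag.

source = [
    {"id": 1, "amount": 120},
    {"id": 2, "amount": 80},
]
warehouse = []

def run_etl(src, dst):
    """Extract new rows, transform (add a derived field), load into dst."""
    loaded_ids = {r["id"] for r in dst}
    for row in src:                                   # extract
        if row["id"] in loaded_ids:
            continue
        enriched = {**row,                            # transform
                    "amount_with_tax": round(row["amount"] * 1.2, 2)}
        dst.append(enriched)                          # load

run_etl(source, warehouse)

# A new transaction lands in the source system...
source.append({"id": 3, "amount": 200})

# ...but reports against the warehouse miss it until the next batch run.
rows_before_next_run = len(warehouse)   # still 2 rows
run_etl(source, warehouse)
rows_after_next_run = len(warehouse)    # now 3 rows
```

In production the gap between those two runs is the reporting latency, and the `run_etl` logic, however trivial here, is the script that must be maintained forever as source schemas evolve.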
The solution to these problems is hybrid transactional/analytical processing (HTAP). Instead of having separate databases for transactions and analytics, you have a single database that serves both needs. In the database management system market, we see two main paths to HTAP: either some “T” is added to an OLAP database or, more commonly, an OLTP database is enhanced with some “A.”
In PostgreSQL, for instance, HTAP is primarily enabled by columnar indexes. Some of the extensions mentioned earlier add columnar indexes to PostgreSQL’s base tables, which accelerate query performance by greatly reducing I/O. This is much easier for developers, and it gives businesses the ability to analyze fresher data, even in real time.
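The freshness benefit can be sketched with a toy HTAP store in Python. The `orders` store and both functions are hypothetical; what matters is that a transactional write and an analytic aggregate hit the same data, with no ETL step in between:

```python
# Hypothetical HTAP sketch: one store serves both transactional writes
# and analytic aggregates, so new data is queryable immediately.

orders = {}

def insert_order(order_id, region, amount):
    """OLTP-style write: record a single transaction."""
    orders[order_id] = {"region": region, "amount": amount}

def revenue_by_region(region):
    """OLAP-style read: aggregate over the same live data."""
    return sum(o["amount"] for o in orders.values() if o["region"] == region)

insert_order(1, "EU", 100)
insert_order(2, "US", 250)
eu_before = revenue_by_region("EU")   # 100

insert_order(3, "EU", 50)             # a brand-new transaction...
eu_after = revenue_by_region("EU")    # ...is visible at once: 150
```

In a real HTAP setup, the columnar index keeps that analytic read fast without slowing the transactional write path, which is the balance the extensions above aim for.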