When you’re building a new application, you’re going to make lots of hard decisions. Every engineering project involves trade-offs. Most of the time it is smart to ship a feature quickly, assess market fit, then go back to build it “right” in a subsequent release. We commonly call this “technical debt,” and it’s an obligation every engineering team works to pay off as they build new features.
Some classic examples of technical debt include eBay starting as a monolithic Perl app written over a long weekend, Twitter beginning as a Ruby on Rails app, and Amazon operating as a monolithic C++ app. And while Facebook has stuck with PHP for almost 15 years now, they developed an evolved variant called Hack and built a virtual machine called HHVM to run their massive codebase. These companies evolved over time through multiple iterations, each making a different set of trade-offs and working to pay down their technical debt along the way.
What is Big Data Debt?
There’s a specific kind of technical debt that most application teams tend to overlook: Big Data Debt. Today developers have lots of storage options for building new apps, including JSON databases like MongoDB and Elasticsearch, eventually consistent systems like Cassandra, cloud services like DynamoDB, as well as distributed file systems like HDFS and Amazon S3. Each of these options has advantages and can be the “right tool for the job.”
One of the big drivers of adoption for these new technologies is that they allow development teams to more easily and efficiently iterate on features. With Agile methodologies, teams work in sprints that last just a few weeks. At the end of each sprint engineers check in with users to see if they’re headed in the right direction, and then course-correct as necessary. Compared to relational databases, NoSQL, S3, and Hadoop tend to be less demanding in terms of modeling and structuring data, and this is a big advantage in terms of development speed.
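To make that schema flexibility concrete, here is a minimal sketch, assuming a hypothetical orders collection in MongoDB via pymongo; the connection string, collection, and field names are illustrative placeholders, not from any real application:

```python
from pymongo import MongoClient

# Hypothetical connection and collection names, for illustration only.
client = MongoClient("mongodb://localhost:27017")
orders = client["appdb"]["orders"]

# Sprint 1: store an order as a single nested document.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "tier": "gold"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
})

# Sprint 2: a new feature adds a coupon field. No ALTER TABLE, no migration;
# old and new documents simply coexist in the same collection.
orders.insert_one({
    "order_id": 1002,
    "customer": {"name": "Lin", "tier": "silver"},
    "items": [{"sku": "C-3", "qty": 4, "price": 3.25}],
    "coupon": {"code": "SPRING10", "discount": 0.10},
})
```

In a relational schema the same change would typically mean altering several tables and coordinating a migration, which is exactly the overhead these systems let teams defer from sprint to sprint.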
But eventually all the data being created in these apps needs to be analyzed. And this is where companies start to feel the downside of the trade-offs they made to gain application development speed and agility. For application data in relational databases, companies are well equipped to move data from these systems into their analytical environments. In contrast, the data in SaaS applications, NoSQL databases, and distributed file systems is fundamentally incompatible with their existing approach to data pipelines.
Analytics infrastructure is dominated by the relational model, including ETL, data warehouses, data marts, and BI tools. Because the data from these newer systems is non-relational, your data engineers must do additional work to reshape it before it fits into that infrastructure. This is a kind of technical debt that application teams often overlook, and frequently it isn’t well understood or planned for.
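What that reshaping work looks like varies, but a minimal sketch, continuing the hypothetical order documents above and using pandas, shows the basic flattening step a pipeline has to perform before a warehouse can load the data:

```python
import pandas as pd

# The same hypothetical order documents shown earlier.
docs = [
    {
        "order_id": 1001,
        "customer": {"name": "Ada", "tier": "gold"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-7", "qty": 1, "price": 24.50},
        ],
    },
    {
        "order_id": 1002,
        "customer": {"name": "Lin", "tier": "silver"},
        "items": [{"sku": "C-3", "qty": 4, "price": 3.25}],
        "coupon": {"code": "SPRING10", "discount": 0.10},
    },
]

# Explode the nested "items" array into one row per line item, carrying
# along the parent order_id and customer fields the warehouse schema needs.
line_items = pd.json_normalize(
    docs,
    record_path="items",
    meta=["order_id", ["customer", "name"], ["customer", "tier"]],
)

# Header-level table: flatten the remaining fields and drop the nested array.
# Optional fields added in later sprints (like "coupon") must be handled
# explicitly, or downstream queries will break on missing columns.
orders = pd.json_normalize(docs).drop(columns=["items"])
```

Every nested array or optional attribute an application sprint adds becomes another branch in this flattening logic, and that incremental data-engineering effort is where the debt accrues.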
The “Last Mile” Problem in Analytics
With data across the organization accumulating in many different technology stacks, companies have turned to Hadoop or S3 to consolidate their data into a “data lake,” a central system for analytics. However, most companies that have gone down this road find that the data lake alone doesn’t meet the performance needs of their analysts and data scientists. Why?
Ultimately, analysts and data scientists want to make sense of the data in order to tell a story that is meaningful to the business. The “last mile” in analytics consists of the tools millions of analysts and data scientists use from their devices: BI products (Tableau, Power BI, MicroStrategy), data science tools (Python, R, and SAS), and, most popular of all, Excel. One thing these tools all have in common is that they work best when all the data is stored in a high-performance relational database.
But companies don’t have all their data in a single relational database. Instead, their application data is spread across data lakes, data warehouses, third-party apps, S3, and more. So IT extracts summarized data from all these systems and loads it into a relational database to support the “last mile” tools, or it creates BI extracts for each of the different tools.
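A minimal sketch of that summarize-and-load step, continuing the illustrative example above and using SQLite purely as a stand-in for whatever relational database sits behind the BI tools:

```python
import sqlite3
import pandas as pd

# Continuing from the flattened line_items frame above:
# summarize to the grain the BI tools will query (revenue per order and tier).
summary = (
    line_items
    .assign(revenue=lambda df: df["qty"] * df["price"])
    .groupby(["order_id", "customer.tier"], as_index=False)["revenue"]
    .sum()
)

# Load the summary into a relational table that Tableau, Power BI, or Excel
# can query directly. SQLite stands in for the warehouse or data mart here.
with sqlite3.connect("analytics.db") as conn:
    summary.to_sql("order_revenue", conn, if_exists="replace", index=False)
```

Multiply this pattern by every non-relational source and every summary table the business asks for, and the pipeline sprawl described below follows naturally.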
If we step back to look at the end-to-end solution — from sources, to staging areas, loads and loads of ETL, data warehouses, data marts, data lakes, BI extracts, and so on — it is incredibly complex, expensive, fragile, and slow to adapt to new application data. This is your Big Data Debt, all the incremental time and money you spend to make data from your non-relational applications fit into your analytics infrastructure.
We’ve put together a free, anonymous calculator to help you estimate the costs of paying down your Big Data Debt. It makes conservative assumptions about the costs of software, infrastructure, data engineers, analysts, and data scientists, and helps you to quickly get a sense for just how much debt you’re accumulating each year.
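The arithmetic behind such an estimate is simple. Here is a back-of-the-envelope sketch of the kind of calculation involved; every figure below is a hypothetical placeholder, not a number from the calculator itself:

```python
# Back-of-the-envelope estimate of annual Big Data Debt.
# All figures are hypothetical placeholders; substitute your own rates.
data_engineers = 3            # engineers maintaining custom pipelines
engineer_cost = 150_000       # fully loaded annual cost per engineer (USD)
pipeline_share = 0.5          # fraction of their time spent on non-relational plumbing

analysts = 10
analyst_cost = 120_000
waiting_share = 0.1           # fraction of analyst time lost waiting on extracts

infra_and_software = 200_000  # ETL tools, staging storage, extract refresh compute

annual_debt = (
    data_engineers * engineer_cost * pipeline_share
    + analysts * analyst_cost * waiting_share
    + infra_and_software
)
print(f"Estimated annual Big Data Debt: ${annual_debt:,.0f}")
# -> Estimated annual Big Data Debt: $545,000
```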
In Summary
When used carefully, technical debt can make a lot of sense and be very advantageous to your business. The same is true of Big Data Debt: for example, time to market may be a more important consideration for a new application than making its data available for analysis. If the application is successful and its data turns out to be valuable, you can build a data pipeline that makes this data compatible with the tools used by your analysts and data scientists. The total cost might be higher, but getting the application to market was the priority. The problem is when you don’t have a plan to pay down this debt.
The costs associated with Big Data Debt can be surprisingly high. As with any area of technology, where there are high costs there is opportunity for innovation. We believe that the next major advances in the data technology space will come from products that help companies to more effectively align the data generated from their diverse application portfolio with the tools used by their analysts and data scientists.