When you’re building a new application, you’re going to make lots of hard decisions. Every engineering project involves trade-offs. Most of the time it is smart to ship a feature quickly, assess market fit, then go back to build it “right” in a subsequent release. We commonly call this “technical debt,” and it’s an obligation every engineering team works to pay off as they build new features.

Some classic examples of technical debt include eBay starting as a monolithic Perl app written over a long weekend, Twitter beginning as a Ruby on Rails app, and Amazon operating as a monolithic C++ app. And while Facebook has stuck with PHP for almost 15 years now, they developed an evolved variant called Hack and built a virtual machine called HHVM to run their massive codebase. These companies evolved over time through multiple iterations, each making a different set of trade-offs and paying down technical debt along the way.

What is Big Data Debt?
There’s a specific kind of technical debt that most application teams tend to overlook: Big Data Debt. Today developers have lots of storage options for building new apps, including JSON databases like MongoDB and Elasticsearch, eventually consistent systems like Cassandra, cloud services like DynamoDB, as well as distributed file systems like HDFS and Amazon S3. Each of these options has advantages and can be the “right tool for the job.”

One of the big drivers of adoption for these new technologies is that they let development teams iterate on features more easily and efficiently. With Agile methodologies, teams work in sprints that last just a few weeks. At the end of each sprint, engineers check in with users to see if they're headed in the right direction, then course-correct as necessary. Compared to relational databases, NoSQL stores, S3, and Hadoop demand far less up-front data modeling and structure, which is a big advantage for development speed.
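To make that contrast concrete, here is a minimal sketch of the flexibility a document store offers, assuming a local MongoDB instance and the pymongo driver; the "orders" collection and its fields are hypothetical, purely for illustration:

```python
# Minimal sketch: a document store accepts whatever shape the current
# sprint needs, with no schema migration. Assumes a local MongoDB and
# pymongo; collection and field names are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["app_db"]["orders"]

# Sprint 1: the app stores simple orders.
orders.insert_one({"order_id": 1, "user": "alice", "total": 19.99})

# Sprint 3: the feature grows nested line items and a coupon field.
# No ALTER TABLE, no migration: the new shape is simply written as-is.
orders.insert_one({
    "order_id": 2,
    "user": "bob",
    "total": 42.50,
    "coupon": "SPRING10",
    "items": [
        {"sku": "A-100", "qty": 1},
        {"sku": "B-200", "qty": 3},
    ],
})
```

The flip side of writing whatever shape the current sprint needs is that those documents no longer share a single, predictable structure, and that is exactly where the analytics problems described below begin.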

But eventually all the data created in these apps needs to be analyzed. This is where companies start to feel the downside of the trade-offs they made to gain application development speed and agility. For application data in relational databases, companies are well equipped to move data from these systems into their analytical environments. In contrast, the data in SaaS applications, NoSQL databases, and distributed file systems is fundamentally incompatible with their existing approach to data pipelines.

Analytics infrastructure is dominated by the relational model, including ETL, data warehouses, data marts, and BI tools. Because the data from these newer systems is non-relational, there is additional work that falls to your data engineers before it can flow through that infrastructure. This is a kind of technical debt that application teams often overlook, and frequently it isn't well understood or planned for.
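As a rough illustration of that extra work, the sketch below flattens a nested document (like the hypothetical orders above) into tabular rows and loads them into a SQL table. It assumes pandas and SQLite; a real pipeline would also have to handle schema drift, missing fields, and type coercion.

```python
# Rough illustration of the pipeline work non-relational data creates:
# flattening nested documents into rows a warehouse can hold.
# Assumes pandas and SQLite; table and field names are illustrative only.
import sqlite3
import pandas as pd

orders = [
    {"order_id": 2, "user": "bob", "total": 42.50,
     "items": [{"sku": "A-100", "qty": 1}, {"sku": "B-200", "qty": 3}]},
]

# Explode the nested line items into one row per item, carrying order fields.
line_items = pd.json_normalize(orders, record_path="items",
                               meta=["order_id", "user", "total"])

# Load the now-relational rows into a SQL table a BI tool could query.
conn = sqlite3.connect("analytics.db")
line_items.to_sql("order_line_items", conn, if_exists="replace", index=False)
```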

The “Last Mile” Problem in Analytics
With data across the organization accumulating in many different technology stacks, companies have turned to Hadoop or S3 to consolidate their data into a "data lake" or central system for analytics. However, most companies that have gone down this road find that it doesn't meet the performance needs of their analysts and data scientists. Why?

Ultimately, analysts and data scientists want to make sense of the data in order to tell a story that is meaningful to the business. The "last mile" in analytics consists of the tools millions of analysts and data scientists use from their devices: BI products (Tableau, Power BI, MicroStrategy), data science tools (Python, R, and SAS), and, most popular of all, Excel. One thing these tools all have in common is that they work best when all the data is stored in a high-performance relational database.

But companies don't have all their data in a single relational database. Instead, their application data is spread across data lakes, data warehouses, third-party apps, S3, and more. So IT teams extract summarized data from all these systems and load it into a relational database to support their "last mile" tools, or they create BI extracts for each of the different tools.
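The snippet below sketches that "summarize and load" step against the hypothetical order_line_items table from the earlier example, rolling the detail up into a small summary table that a BI tool or Excel could query; the table and column names are illustrative only:

```python
# Sketch of the "summarize and load" step: roll detail rows up into a
# small summary table for last-mile tools. Assumes the hypothetical
# order_line_items table created in the previous example (SQLite).
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.executescript("""
    DROP TABLE IF EXISTS user_order_summary;
    CREATE TABLE user_order_summary AS
    SELECT user,
           COUNT(DISTINCT order_id) AS orders,
           SUM(qty)                 AS units
    FROM order_line_items
    GROUP BY user;
""")
conn.commit()
```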

If we step back and look at the end-to-end solution (sources, staging areas, loads and loads of ETL, data warehouses, data marts, data lakes, BI extracts, and so on), it is incredibly complex, expensive, fragile, and slow to adapt to new application data. This is your Big Data Debt: all the incremental time and money you spend to make data from your non-relational applications fit into your analytics infrastructure.

We’ve put together a free, anonymous calculator to help you estimate the costs of paying down your Big Data Debt. It makes conservative assumptions about the costs of software, infrastructure, data engineers, analysts, and data scientists, and helps you to quickly get a sense for just how much debt you’re accumulating each year.
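For a sense of the arithmetic involved, here is a purely illustrative back-of-envelope version; every figure below is a placeholder to replace with your own numbers, and none comes from the calculator itself:

```python
# Purely illustrative back-of-envelope estimate of annual Big Data Debt.
# All inputs are placeholders; substitute your own figures.
data_engineer_cost = 150_000   # fully loaded annual cost per data engineer
engineers_on_pipelines = 2     # headcount spent keeping pipelines running
analyst_cost = 120_000         # fully loaded annual cost per analyst
analysts_blocked = 5           # analysts regularly waiting on data requests
pct_time_waiting = 0.20        # share of their time spent waiting
infra_and_software = 100_000   # ETL tools, staging storage, extra warehouse capacity

annual_debt = (
    engineers_on_pipelines * data_engineer_cost
    + analysts_blocked * analyst_cost * pct_time_waiting
    + infra_and_software
)
print(f"Estimated annual Big Data Debt: ${annual_debt:,.0f}")
```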

In Summary
When used carefully, technical debt can make a lot of sense and be very advantageous to your business. The same is true of Big Data Debt. For example, time to market may be a more important consideration for a new application than making its data available for analysis. If the application is successful and its data turns out to be very valuable, you can build a data pipeline that makes this data compatible with the tools used by your analysts and data scientists. The total cost might be higher, but getting the application to market may have been the priority. The problem comes when you don't have a plan to pay down this debt.

The costs associated with Big Data Debt can be surprisingly high. As with any area of technology, where there are high costs there is opportunity for innovation. We believe that the next major advances in the data technology space will come from products that help companies to more effectively align the data generated from their diverse application portfolio with the tools used by their analysts and data scientists.