The future of databases: A chat about managing and scaling ‘agile Big Data’ in the cloud

Published: May 14th, 2014

- Rob Marvin

As Big Data has gotten bigger and bigger, and businesses demand more and more out of their data, traditional database structures just don’t cut it anymore. The traditional single static repository simply isn’t equipped to handle the industry’s rapidly evolving needs.

Cory Isaacson, database technology veteran and the CEO of agile Big Data technology provider CodeFutures, believes we need to rethink the role of databases in a cloud- and mobile-dominated landscape. He has worked with database technologies for 25 years, from the early days of Sybase to MySQL and SQL, in-memory databases, and more recently open-source database projects such as MapDB. An early startup of Isaacson’s built some of the first big client-server applications for the entertainment industry in the early 1980s, and in the decades since he has started and sold several consulting companies, and spent several years heading up Rogue Wave before starting CodeFutures in 2007.

SD Times spoke with Isaacson ahead of his upcoming talk, “Scaling and Managing Big Data: Have We Been Looking at Databases Wrong This Whole Time?” about how databases have changed, scaling in the cloud, and why “agile Big Data” is the future.

SD Times: How would you describe the traditional view of databases?
Cory Isaacson: People look at databases as a static repository. You develop a schema you think will fit your needs as best you can, you start developing against it, and invariably you write, read and start manipulating the data. You don’t really think of it as dynamic. Then what happens is that, very quickly, application requirements change and evolve. You have to start scaling the database and altering the schemas as best you can, usually sticking as closely as possible to what you have, but that’s almost always very impractical.

So what happens is you run into an incredible number of performance problems and what I’d call application integration difficulty. The requirements fit less and less to that traditional model and need to expand more and more into completely different and new capabilities. Over time, it just gets messier and messier, it makes the application developer’s job harder and harder, and it makes performance more and more challenging as the application grows.

How has your view of databases and Big Data evolved over time? What do organizations need out of data now that they didn’t necessarily need in the past?
There’s quite a bit that has changed. I’ll start with scaling, which is of big interest to everyone. The way you scale a database is to partition it across a number of servers. It’s the only practical way to do it. While there are many ways to do that, they all come down to sharding in one capacity or another.

Sharding comes from broken glass, a metaphor popularized by Google with its BigTable architecture. The simple idea of sharding is you’re going to use a key in the data to divvy it up. With a NoSQL database, it’s a no-brainer. The database itself doesn’t know anything about your content, it just knows about the key itself, so it’s very easy to do. But when you have related data—which is true almost anywhere—as soon as you shard one way, it works well for one use case but not for another.

Let’s say you have a multi-user game with players competing against each other. You want to show players a list of all the games they played and what their scores were. Every game will want that. Let’s say you grow to millions and millions of players and shard by player. Then what happens is now the players say they would like to see a list of who else played a given game they’ve clicked on. The data is partitioned completely wrong for that, so the only way you can get that answer is to search all the partitions, which is the worst-performing thing you can do.
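The trade-off in the game example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything from Isaacson's own systems: the shard count, the modulo-based `shard_for` function, and the in-memory dictionaries standing in for database partitions are all hypothetical. Each query also returns how many shards it had to touch.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count for this sketch


def shard_for(player_id: int) -> int:
    """Pick a shard from the shard key (here, the player id)."""
    return player_id % NUM_SHARDS


# Each in-memory "shard" maps player_id -> list of (game_id, score) rows.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def record_result(player_id: int, game_id: str, score: int) -> None:
    """Write a result to the one shard that owns this player."""
    shards[shard_for(player_id)][player_id].append((game_id, score))


def games_for_player(player_id: int):
    """The query the key was chosen for: exactly one shard is read."""
    shard = shards[shard_for(player_id)]
    return list(shard.get(player_id, [])), 1


def players_for_game(game_id: str):
    """No shard owns a game, so every shard has to be scanned."""
    found = []
    for shard in shards:
        for player_id, rows in shard.items():
            if any(g == game_id for g, _ in rows):
                found.append(player_id)
    return sorted(found), len(shards)


record_result(1, "g1", 10)
record_result(2, "g1", 20)
record_result(5, "g2", 7)
record_result(1, "g2", 3)
```

With this layout, `games_for_player(1)` reads one shard, while `players_for_game("g1")` reads all four. The second number grows with every shard added, which is the scatter-gather cost described above.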

As Big Data needs evolve, people need to scale, but typically they only pick one scaling mechanism, and as soon as they do that, everything else starts to break down. The data is also getting much bigger. The rate of data is growing much faster when you think about the Internet of Things and the number of mobile devices and data sources out there. We’re now talking about tens of thousands or millions of transactions a second in these systems. So the data is getting much bigger and faster, but people also need the ability to see on a real-time basis what’s happening with their businesses and customers. It’s no longer good enough to take all this data, put it in a data warehouse and get an answer in a week or overnight.

To pose the question of your upcoming presentation back to you, have we been looking at databases wrong this whole time?
As embarrassing as it is for the whole community, including myself, I think the answer is yes. If you look at databases as a static repository, where you can only make a limited number of structural changes and you can only partition one way, it’s far too static to be able to handle today’s fast-changing needs and application requirements. It’s far too much work as well; that’s the real kicker. It’s not like things can’t be done, it just takes much, much longer to do than it should with current databases as a sort of graveyard for static data.

Describe what you see as the agile approach to Big Data, and explain how it works in respect to upending that static view of databases.
The best way to look at your data, as opposed to a static repository, is as a real-time flow. Once you start to look at and process your data as a flow into different structures, scaling and partitioning schemas as you need, you gain an unbelievable amount of freedom and simplicity.

Again, take that example about the game application. Now what you could do is use stream-based processing to take all the transactions from your game, put those into the list of games by player, but at the same time automate that same list into all the players who played a single game. You can take that work away from the application developer so they can concentrate on the game features, while a data architect looks at the data in-flow and organizes it by sequence.
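A minimal sketch of that idea, assuming a plain Python iterable stands in for the transaction stream and two in-memory dictionaries stand in for the two differently organized stores. The `GameResult` and `GameViews` names are hypothetical and chosen only for illustration. Each event is applied once and lands in both views, so neither question needs a search across partitions later.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class GameResult(NamedTuple):
    """One transaction from the game (hypothetical event shape)."""
    player_id: str
    game_id: str
    score: int


class GameViews:
    """Two views of the same stream, each organized for one question."""

    def __init__(self) -> None:
        self.games_by_player = defaultdict(list)  # player -> [(game, score)]
        self.players_by_game = defaultdict(set)   # game -> {players}

    def apply(self, event: GameResult) -> None:
        # The same event updates both views as it flows through.
        self.games_by_player[event.player_id].append(
            (event.game_id, event.score)
        )
        self.players_by_game[event.game_id].add(event.player_id)


def consume(stream: Iterable[GameResult], views: GameViews) -> GameViews:
    """Drain the stream, applying every event to the views in order."""
    for event in stream:
        views.apply(event)
    return views


views = consume(
    [
        GameResult("ann", "g1", 10),
        GameResult("bob", "g1", 20),
        GameResult("ann", "g2", 30),
    ],
    GameViews(),
)
```

In a real deployment each view would likely live in its own partitioned store, keyed by player in one case and by game in the other. The application code that emits `GameResult` events would not need to know about either.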

I should’ve seen this sooner, to be honest. The idea of software pipelines was that software is very much like fluid mechanics, essentially water going through pipes. It makes the data more fluid, dynamic and much easier to think with.

What are some of the most prevalent challenges in scaling databases for the cloud?
There are a few challenges happening in the cloud that you don’t see elsewhere. Performance in the cloud is generally worse than it is in regular servers. You have to scale much sooner in the cloud than you do in other places. The second thing is that cloud environments, particularly public cloud environments, are shared, so they’re not as reliable and the performance is not as consistent. You will see failures in a cloud environment because as you scale you’re adding failure points, and you have to be able to respond to those failure points without downtime, which is a very difficult challenge when it comes to database technology.

How have your experiences over the past two decades colored your beliefs and expertise about how best to utilize Big Data?
Certainly the whole industry has learned a tremendous amount. There has been more evolution and generation of new database technologies in the last five or six years than in the 20 years prior. Most databases prior to the past half-decade or so were pretty much built along the same infrastructure. One thing I’ve learned is that you end up needing not just a more flexible infrastructure, but you need to understand and probably use more than one database. Relying on, say, Oracle to do everything you need is just not going to work in today’s environment.

Especially in the open-source world, you end up using lots and lots of different capabilities. That’s good, but it also makes things much harder on the application developer. But this agile combination of streams and databases solves the problems because now you can stream to more than one database if you need a characteristic of a given database for a certain part of your application.

Going back to the game example, let’s say you want to add leaderboards to a game. You’re not going to do a leaderboard in the same database you use to keep track of the gameplay itself. On top of that, game developers want to know absolutely everything that goes on in a game. All the clicks a user has, what the most popular features or paths through the application are, how long it takes them to get from one level to the next, all those questions have to be tracked, and it generates an enormous amount of data. Imagine millions of daily active users in a highly successful game, and in each game a user might be clicking 100-200 times. To track all of those and trend that is a phenomenal challenge.
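To get a feel for the scale, here is a back-of-the-envelope estimate. The interview gives only "millions" of daily active users and 100-200 clicks per game, so the figures below are assumptions: one million users (the low end of "millions") and one game per user per day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def click_events_per_day(daily_active_users, games_per_user, clicks_per_game):
    """Total click events generated in one day."""
    return daily_active_users * games_per_user * clicks_per_game


def average_events_per_second(events_per_day):
    """Daily volume spread evenly over the day (real peaks run higher)."""
    return events_per_day / SECONDS_PER_DAY


DAILY_ACTIVE_USERS = 1_000_000  # assumption: low end of "millions"
GAMES_PER_USER = 1              # assumption: one game per user per day

low = click_events_per_day(DAILY_ACTIVE_USERS, GAMES_PER_USER, 100)
high = click_events_per_day(DAILY_ACTIVE_USERS, GAMES_PER_USER, 200)

print(f"{low:,} to {high:,} click events per day")
print(
    f"about {average_events_per_second(low):,.0f} to "
    f"{average_events_per_second(high):,.0f} events per second on average"
)
```

Under these assumptions the game produces 100 million to 200 million click events a day, or roughly 1,157 to 2,315 per second on average. More users or more games per user scale the figure linearly.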

One size will definitely not fit all. But how do you do that in a more seamless fashion than it’s done today? Right now there’s a lot of tedious, manual work on the part of developers that makes for quite a brittle infrastructure.
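One way to picture streaming to more than one database is a fan-out step that hands every event to several sinks, each keeping only what it is suited for. The sketch below is illustrative only. The three sink classes are in-memory stand-ins for separate databases, and the event dictionaries with `type`, `player`, `score` and `feature` fields are a hypothetical format, not one described in the interview.

```python
import heapq
from collections import defaultdict


class GameplayStore:
    """Stand-in for the primary database that records gameplay results."""

    def __init__(self):
        self.results = []

    def handle(self, event):
        if event["type"] == "result":
            self.results.append(event)


class Leaderboard:
    """Stand-in for a store tuned for ranked reads: best score per player."""

    def __init__(self):
        self.best = {}

    def handle(self, event):
        if event["type"] != "result":
            return
        previous = self.best.get(event["player"])
        if previous is None or event["score"] > previous:
            self.best[event["player"]] = event["score"]

    def top(self, n):
        """Return the n highest (player, score) pairs, best first."""
        return heapq.nlargest(n, self.best.items(), key=lambda kv: kv[1])


class ClickLog:
    """Stand-in for an analytics store: click counts per feature."""

    def __init__(self):
        self.counts = defaultdict(int)

    def handle(self, event):
        if event["type"] == "click":
            self.counts[event["feature"]] += 1


def fan_out(stream, sinks):
    """Deliver each event to every sink; each sink keeps what it needs."""
    for event in stream:
        for sink in sinks:
            sink.handle(event)


events = [
    {"type": "result", "player": "ann", "score": 50},
    {"type": "click", "player": "ann", "feature": "shop"},
    {"type": "result", "player": "bob", "score": 70},
    {"type": "result", "player": "ann", "score": 90},
    {"type": "click", "player": "bob", "feature": "shop"},
    {"type": "click", "player": "bob", "feature": "chat"},
]
store, board, clicks = GameplayStore(), Leaderboard(), ClickLog()
fan_out(events, [store, board, clicks])
```

The application only emits events. Adding a fourth store later means adding one more sink to the list rather than changing the code that produces the data.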

What is your vision for the future of databases, this new paradigm for agile Big Data, cloud infrastructure and management?
The ideal situation is where the application developer doesn’t have to worry about it, yet the data infrastructure can be tremendously intricate and involved and support all the things the application really needs. Making that easier to do will have a huge impact on the way these applications are developed.

This is so critical because, as big as data is today, we’ve only seen the tip of the iceberg. There’s going to be a total data explosion, starting now but growing somewhere between 10 and 50 times over the next 10 years. Everything will have a CPU in it sending out something, connecting the world over the Internet. We have to be ready for that. We can’t spend all our time hand-massaging code to try to fit arcane data structures. That’s never going to cut it.

Article Tags

agile, Big Data, cloud, Cory Isaacson

About Rob Marvin

Rob Marvin has covered the software development and technology industry as Online & Social Media Editor at SD Times since July 2013. He is a 2013 graduate of the S.I. Newhouse School of Public Communications at Syracuse University with dual degrees in Magazine Journalism and Psychology. Rob enjoys writing about everything from features, entertainment, news and culture to his current work covering the software development industry. Reach him on Twitter at @rjmarvin1.


