As Big Data gets bigger and businesses demand more and more out of their data, traditional database structures just don’t cut it anymore. The single static repository simply isn’t equipped to handle the industry’s rapidly evolving needs.
Cory Isaacson, database technology veteran and the CEO of agile Big Data technology provider CodeFutures, believes we need to rethink the role of databases in a cloud and mobile-dominated landscape. He has worked with database technologies for 25 years, from the early days of Sybase to MySQL and NoSQL, in-memory databases, and more recently open-source database projects such as MapDB. An early startup of Isaacson’s built some of the first big client-server applications for the entertainment industry in the early 1980s, and in the decades since he has started and sold several consulting companies, and spent several years heading up Rogue Wave before starting CodeFutures in 2007.
SD Times spoke with Isaacson ahead of his upcoming talk, “Scaling and Managing Big Data: Have We Been Looking at Databases Wrong This Whole Time?” about how databases have changed, scaling in the cloud, and why “agile Big Data” is the future.
SD Times: How would you describe the traditional view of databases?
Cory Isaacson: People look at databases as a static repository. You develop a schema you think will fit your needs as best you can, you start developing against it, and invariably you write, read and start manipulating the data. You don’t really think of it as dynamic. Then what happens is that, very quickly, application requirements change and evolve. You have to start scaling the database and altering the schemas as best you can, usually sticking as close as possible to what you have, but that’s almost always very impractical.
So what happens is you run into an incredible number of performance problems and what I’d call application integration difficulty. The requirements fit less and less to that traditional model and need to expand more and more into completely different and new capabilities. Over time, it just gets messier and messier, it makes the application developer’s job harder and harder, and it makes performance more and more challenging as the application grows.
How has your view of databases and Big Data evolved over time? What do organizations need out of data now that they didn’t necessarily need in the past?
There’s quite a bit that has changed. I’ll start with scaling, which is of big interest to everyone. The way you scale a database is to partition it across a number of servers. It’s the only practical way to do it. While there are many ways to do that, they all come down to sharding in one capacity or another.
The term “sharding” comes from shards of broken glass, a metaphor popularized by Google with its BigTable architecture. The simple idea of sharding is that you use a key in the data to divvy it up. With a NoSQL database, it’s a no-brainer: the database itself doesn’t know anything about your content, it just knows about the key, so it’s very easy to do. But when you have related data, which is true almost anywhere, as soon as you shard one way, it works well for one use case but not for another.
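To make the idea concrete, here is a minimal sketch in Python of key-based sharding. The shard count, key format and routing function are hypothetical, not anything specific to CodeFutures or BigTable; the point is only that the shard is derived from the key alone, so the routing layer never needs to understand the record’s contents.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical: four database servers


def shard_for(key: str) -> int:
    """Map a key to a shard by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Every record keyed by a given player hashes to the same shard,
# so lookups for that player touch exactly one server.
print(shard_for("player:alice"))  # some value in 0..3
print(shard_for("player:bob"))
```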
Let’s say you have a multi-user game with players competing against each other. You want to show players a list of all the games they played and what their scores were. Every game will want that. Let’s say you grow to millions and millions of players and shard by player. Then what happens is the players say they’d like to see a list of who else played a given game they’ve clicked on. The data is partitioned completely wrong for that, so the only way you can get that answer is to search all the partitions, which is the worst-performing thing you can do.
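The toy sketch below, again with hypothetical names and an in-memory stand-in for the actual shards, shows why the second query hurts: data partitioned by player answers “games for a player” from one shard, but “players for a game” has to scatter to every shard and gather the results.

```python
import hashlib

NUM_SHARDS = 4


def shard_for(key: str) -> int:
    # Same key-based routing as the earlier sketch.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS


# Toy per-shard storage: shard index -> {player_id: [(game_id, score), ...]}
shards = [dict() for _ in range(NUM_SHARDS)]


def record_result(player_id: str, game_id: str, score: int) -> None:
    shard = shards[shard_for(player_id)]
    shard.setdefault(player_id, []).append((game_id, score))


def games_for_player(player_id: str):
    # The use case the sharding was designed for: one shard is consulted.
    return shards[shard_for(player_id)].get(player_id, [])


def players_for_game(game_id: str):
    # The query the partitioning was not designed for: every shard is scanned.
    found = []
    for shard in shards:
        for player_id, results in shard.items():
            if any(g == game_id for g, _ in results):
                found.append(player_id)
    return found


record_result("alice", "game-42", 120)
record_result("bob", "game-42", 95)
print(games_for_player("alice"))    # [('game-42', 120)], a single-shard read
print(players_for_game("game-42"))  # both players, gathered from every shard
```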
As Big Data needs evolve, people need to scale, but typically they only pick one scaling mechanism, and as soon as they do that, everything else starts to break down. The data itself is also getting much bigger, and it’s arriving much faster when you think about the Internet of Things and the number of mobile devices and data sources out there. We’re now talking about tens of thousands or millions of transactions a second in these systems. So the data is getting much bigger and faster, but people also need the ability to see on a real-time basis what’s happening with their businesses and customers. It’s no longer good enough to take all this data, put it in a data warehouse and get an answer in a week or overnight.
To pose the question of your upcoming presentation back to you, have we been looking at databases wrong this whole time?
As embarrassing as it is for the whole community—including myself—I think the answer is yes. If you look at databases as a static repository, where you can only make a limited amount of structure changes and you can only partition one way, it’s far too static to be able to handle today’s fast-changing needs and application requirements. It’s far too much work as well; that’s the real kicker. It’s not that things can’t be done, it’s that they take much, much longer than they should, with current databases serving as a sort of graveyard for static data.
Describe what you see as the agile approach to Big Data, and explain how it works in respect to upending that static view of databases.
The best way to look at your data, as opposed to a static repository, is as a real-time flow. If you start to look at and process your data as a flow into different structures, scaling and partitioning as you need, it’s amazing how much freedom and simplicity you gain.
Again, take that example of the game application. Now what you could do is use stream-based processing to take all the transactions from your game and put them into the list of games by player, while at the same time automatically maintaining the list of all the players who played a single game. You can take that work away from the application developer so they can concentrate on game features, while a data architect looks at the data as it flows in and organizes it by sequence.
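A hedged sketch of what that stream-based step might look like, with a hypothetical event shape and in-memory dictionaries standing in for the two differently partitioned views: each game-result event is processed once, and both read paths (games by player, players by game) are kept up to date without any cross-partition scan.

```python
from collections import defaultdict

games_by_player = defaultdict(list)  # view serving "show me my games"
players_by_game = defaultdict(set)   # view serving "who else played this game"


def handle_event(event: dict) -> None:
    """Process one game-result event from the stream, updating both views."""
    player, game, score = event["player"], event["game"], event["score"]
    games_by_player[player].append((game, score))
    players_by_game[game].add(player)


# Simulated stream of game-result events
for event in [
    {"player": "alice", "game": "g1", "score": 120},
    {"player": "bob",   "game": "g1", "score": 95},
    {"player": "alice", "game": "g2", "score": 80},
]:
    handle_event(event)

print(games_by_player["alice"])  # [('g1', 120), ('g2', 80)]
print(players_by_game["g1"])     # {'alice', 'bob'} (set, order may vary)
```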
I should’ve seen this sooner, to be honest. The idea behind software pipelines was that software is very much like fluid mechanics, essentially water going through pipes. Looking at it that way makes the data more fluid, more dynamic, and much easier to think about.
What are some of the most prevalent challenges in scaling databases for the cloud?
There are a few challenges happening in the cloud that you don’t see elsewhere. Performance in the cloud is generally worse than it is on regular servers, so you have to scale much sooner in the cloud than you do in other places. The second thing is that cloud environments, particularly public cloud environments, are shared, so they’re not as reliable and the performance is not as consistent. You will see failures in a cloud environment because as you scale you’re adding failure points, and you have to be able to respond to those failure points without downtime, which is a very difficult challenge when it comes to database technology.
How have your experiences over the past two decades colored your beliefs and expertise about how best to utilize Big Data?
Certainly the whole industry has learned a tremendous amount. There has been more evolution and generation of new database technologies in the last five or six years than in the 20 years prior. Most databases prior to the past half-decade or so were built on pretty much the same architecture. One thing I’ve learned is that you end up needing not just a more flexible infrastructure, but you need to understand and probably use more than one database. Relying on, say, Oracle to do everything you need is just not going to work in today’s environment.
Especially in the open-source world, you end up using lots and lots of different capabilities. That’s good, but it also makes things much harder on the application developer. But this agile combination of streams and databases solves the problem, because now you can stream to more than one database if you need a characteristic of a given database for a certain part of your application.
Going back to the game example, let’s say you want to add leaderboards to a game. You’re not going to build a leaderboard in the same database you use to keep track of the gameplay itself. On top of that, game developers want to know absolutely everything that goes on in a game: every click a user makes, the most popular features or paths through the application, how long it takes players to get from one level to the next. All of that has to be tracked, and it generates an enormous amount of data. Imagine millions of daily active users in a highly successful game, with each user clicking 100 to 200 times per game. Tracking and trending all of that is a phenomenal challenge.
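A minimal, hypothetical illustration of streaming one gameplay event to two stores with different characteristics: an append-only event log standing in for the “track everything” analytics store, and an in-memory best-score table standing in for a leaderboard store optimized for fast top-N reads. In production these might be, say, a distributed log and a sorted-set store; nothing here is specific to any particular product.

```python
import heapq
from collections import defaultdict

event_log = []                  # stand-in for an analytics/event store
best_scores = defaultdict(int)  # stand-in for a leaderboard store


def ingest(event: dict) -> None:
    """Fan one gameplay event out to both stores."""
    event_log.append(event)                                # raw history, kept forever
    player, score = event["player"], event["score"]
    best_scores[player] = max(best_scores[player], score)  # leaderboard update


def top_n(n: int):
    """Return the n highest-scoring players."""
    return heapq.nlargest(n, best_scores.items(), key=lambda kv: kv[1])


for e in [
    {"player": "alice", "score": 120},
    {"player": "bob", "score": 150},
    {"player": "alice", "score": 90},
]:
    ingest(e)

print(top_n(2))       # [('bob', 150), ('alice', 120)]
print(len(event_log))  # 3 -- every click/score is still available for analytics
```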
One size will definitely not fit all. But how do you do that in a more seamless fashion than it’s done today? Right now there’s a lot of tedious, manual work on the part of developers that makes for quite a brittle infrastructure.
What is your vision for the future of databases, this new paradigm for agile Big Data, cloud infrastructure and management?
The ideal situation is where the application developer doesn’t have to worry about it, yet the data infrastructure can be tremendously intricate and involved and support all the things the application really needs. Making that easier to do will have a huge impact on the way these applications are developed.
This is so critical because, as big as data is today, we’ve only seen the tip of the iceberg. There’s going to be a total data explosion, starting now and growing somewhere between 10 and 50 times over the next 10 years. Everything will have a CPU in it, sending out something and connecting to the world over the Internet. We have to be ready for that. We can’t spend all our time hand-massaging code to try to fit arcane data structures. That’s never going to cut it.