What worked in the past may not work in the present. That’s certainly true for how I, and many of my colleagues in the industry, have looked at databases.
As a software engineer and database architect by trade, I see clear analogies in how the software and database communities came to realize that traditional approaches – most established many years ago – were not only ineffective but extremely wasteful. The software community became so frustrated with waterfall approaches that a group of industry visionaries created the “Agile Manifesto,” which led to one of the most profound and disruptive trends in IT. I say kudos to the development community for tearing down established methodologies.
Now it’s time for the database community, many of whom are also developers, to embrace many of the tenets of agile as a means of solving the increasingly huge mess we now refer to as “Big Data.”
Our conceptual view of databases as static data repositories is narrow and limited. It has hindered our ability to keep pace with changing business requirements and to adapt to today’s more “agile” approach to IT. Traditional approaches have limited our ability to match data structures to changing application requirements, slowed development processes, added waste and complexity, and failed to meet the data needs of the emerging real-time enterprise. Strides have been made with sharding and distributed database technology, but that’s not nearly enough. What organizations need is real-time access to data, allowing immediate and actionable interpretation of events as they occur. What is needed is an agile approach to Big Data, one that affords architects and developers the freedom to rapidly support a wide variety of changing business requirements – without a full “reset” of the Big Data infrastructure and toolset. It’s time for a fresh look at the Big Data infrastructure – we call it “Agile Big Data.” An agile approach to databases can be extremely revealing, offering a glimpse of a new paradigm for advanced data infrastructures and management.
Before we get into what Agile Big Data is, we need to consider the elements that have driven us to this point. When a database application starts out, the initial schema is designed to meet the early requirements – it is simple, the queries are straightforward, and it works well. However, application requirements evolve, new features are needed, and the data structure invariably becomes more complex. Adding to that complexity is the literal “bigness” of Big Data. As the sheer amount of data increases and the application requirements change, a database manager is faced with two forces that seem to be at odds with each other: requirements that keep changing and a data volume that keeps growing. As application complexity grows, the database begins to perform poorly, becoming mired in high-latency queries and forcing the addition of extra components just to keep the database functioning. This issue is very common and very “waterfall-ly” in nature.
To combat database performance degradation, developers have turned to a few different methods. Some have resorted to using multiple database engines for a single application. This can improve performance, but as with anything, adding more moving parts inevitably results in even more complexity. A developer must now interact with multiple data structures, in multiple database engines, ultimately ending up with a schema that no longer mirrors the original needs of the application. Another common method used to address poor database performance is partitioning, or database sharding. In principle, this is an effective technique: it spreads read and write operations across many servers, diminishing the load any single server needs to bear and decreasing latency. Sharded engines segment the data across the different servers by a key. The danger of partitioning data by a key, however, is that many operations can no longer be performed on a single server and become distributed operations. These distributed operations must read and write data from multiple servers, increasing complexity and network traffic and pushing the latency back up through the roof. Sharding can work, but it is, essentially, a temporary Band-Aid rather than a permanent fix.
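To make the shard-key trade-off concrete, here is a minimal sketch in Python. The names and the in-memory dictionaries standing in for database servers are illustrative assumptions, not any particular engine’s API; the point is simply that a lookup aligned with the shard key touches one server, while an aggregate that cuts across keys must fan out to every shard and merge the results.

```python
import hashlib
from collections import Counter

# Hypothetical in-memory "shards" standing in for separate database servers.
SHARDS = [dict() for _ in range(4)]  # each maps user_id -> list of product_ids

def shard_for(key: str) -> dict:
    """Route a key to one shard by hashing it -- classic key-based partitioning."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

def record_order(user_id: str, product_id: str) -> None:
    shard_for(user_id).setdefault(user_id, []).append(product_id)

def orders_for_user(user_id: str) -> list:
    # Aligned with the shard key: one server, low latency.
    return shard_for(user_id).get(user_id, [])

def top_products(n: int = 3) -> list:
    # NOT aligned with the shard key: every shard must be queried and the
    # partial results merged -- the distributed operation that pushes
    # latency and network traffic back up.
    totals = Counter()
    for shard in SHARDS:
        for orders in shard.values():
            totals.update(orders)
    return totals.most_common(n)

record_order("alice", "widget")
record_order("bob", "widget")
record_order("alice", "gadget")
print(orders_for_user("alice"))  # single-shard lookup
print(top_products())            # cross-shard fan-out
```

The single-key path stays fast, but every query that doesn’t follow the key pays the fan-out cost – which is exactly why sharding ends up feeling like a Band-Aid.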
The key concept of Agile Big Data is viewing databases as real-time streams, utilizing dynamic views. When data is treated as an agile, real-time stream, virtually any data source can be turned into a dynamic view. Working with data as a stream makes it possible to solve many problems easily and quickly, in a way that eases complexity for developers while increasing the capabilities of the Big Data infrastructure. Instead of bringing in multiple data engines or partitioning the data across separate servers, we end up with an infrastructure that not only meets the original application requirements, but also leaves us with views that are built and maintained by the agile infrastructure itself, in real time, giving us a picture of the data as it happens.
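As an illustration of what a dynamic view can look like, here is a small Python sketch, assuming a simple event stream and an incrementally maintained aggregate (the class and field names are hypothetical, not part of any specific product). Each incoming event updates the view in place, so queries read a pre-aggregated picture of the data the moment the events land, with no re-scan of the underlying store.

```python
from collections import defaultdict

class RunningTotalsView:
    """A dynamic view: per-account totals maintained one event at a time."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, event: dict) -> None:
        # Incremental maintenance: constant work per event, no full recompute.
        self.totals[event["account"]] += event["amount"]

    def query(self, account: str) -> float:
        # Reads are served straight from the continuously maintained view.
        return self.totals[account]

def consume(events, views):
    """Fan each incoming event out to every registered view."""
    for event in events:
        for view in views:
            view.apply(event)

totals = RunningTotalsView()
consume(
    [
        {"account": "acct-1", "amount": 25.0},
        {"account": "acct-2", "amount": 10.0},
        {"account": "acct-1", "amount": 5.0},
    ],
    [totals],
)
print(totals.query("acct-1"))  # 30.0, available as soon as the events arrive
```

The same pattern scales to many views over the same stream: adding a new business question means registering a new view, not re-architecting the database.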
There are a number of other benefits to developing an Agile Big Data infrastructure, but at its core it comes back to one immutable concept: your data is only as good as your infrastructure. Having the ability to view, query and model your data in real time, as it aggregates, increases the value of that data exponentially. And really, that’s the whole point, isn’t it?