“CouchDB has built a lot of clever management tools around how they load balance and around the database capabilities,” said Poulley. “Cloudant did a really nice job. At one stage, they looked like they were going to fork CouchDB, but they decided to contribute that back to Apache CouchDB. We’re seeing tremendous uptake; it’s one of those teams with all the ingredients for success. I actually spent a period of my career inside Lotus. Not everybody recognizes what CouchDB is. The first time I saw CouchDB and Cloudant I thought, ‘That’s Lotus!’ ”
IBM’s view of the database-management problem boils down to the data zones found within the enterprise, said Poulley. “We talk about data zones, and we came up with these different zones based on thousands of customer engagements. What we observed is that customers had four or five data zones. One would be a relational, operational store. Then we would find, classically, the data warehouse zone, and the data-mart zone, typically with the conditional relational model. Then we’re seeing this emergence, particularly with Hadoop, of an exploration zone, where data is being brought in raw and dumped into Hadoop.”
That’s a change, said Poulley, both in where data has traditionally lived and in how it has been managed. “Before, people would do preprocessing before they moved it into a relational data warehouse,” he said.
“Similarly, we’ve seen with the explosion of data analytics, and the growth of data, we’re seeing a tremendous growth in our relational data warehouse as well. Understanding the true value is not in the physical storage itself, it’s in being able to process the info you need at the speed you need it, for the period you need it. This idea, which is not new, is becoming increasingly interesting as a logical data warehouse where data flows to and from a high-performance environment like a Netezza data warehouse, into something like BigInsights, and vice versa.
“If you move info from the relational world into the Hadoop world, you can still introduce the queries. That’s what we describe as an actionable archive: You can run the same queries on Hadoop, keep the hot data in the performant environment, and move the longer-term data into a colder environment.”
Thus, Poulley advocates the reuse of queries instead of the replication of data. Rather than moving data around from data warehouses to relational data stores and into NoSQL stores or test environments, he advocates keeping the data where it is and writing queries that can run on any type of data store.
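To make the idea of query reuse concrete, here is a minimal sketch, not IBM’s implementation. It assumes two in-memory SQLite databases standing in for a “hot” warehouse and a “cold” archive; the `orders` table, the `make_store` and `run_everywhere` helpers, and the sample rows are all hypothetical. The point is only that one query text is written once and sent to each store, rather than the data being copied into a single place.

```python
import sqlite3

# One query, written once and reused against every store.
QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"


def make_store(rows):
    """Create a hypothetical in-memory store holding the given order rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL, year INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


def run_everywhere(query, stores):
    """Run the same query on each store and merge the per-region totals."""
    totals = {}
    for conn in stores:
        for region, amount in conn.execute(query):
            totals[region] = totals.get(region, 0.0) + amount
    return totals


# Stand-ins for the "hot" (recent) and "cold" (archived) environments.
hot = make_store([("east", 100.0, 2014), ("west", 50.0, 2014)])
cold = make_store([("east", 30.0, 2009), ("north", 20.0, 2008)])

totals = run_everywhere(QUERY, [hot, cold])
```

In this toy version the merge step simply adds partial sums together; a real logical data warehouse would have to handle query dialect differences and aggregates that cannot be combined so easily.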
So, perhaps data management and automation aren’t just about moving the data around and masking it. Perhaps they are more about moving as much code out of the database as possible so that it can be managed in the same way as software. Then, instead of bringing the data to the software, it’s perhaps easier and more efficient to bring the software to the data. That approach does, however, require some intermediate steps to ensure basic capabilities around the data, such as masking.
“Once you know you can mask the data, the idea of capturing the workflows is almost like a VCR form,” said Poulley. “To really capture the workloads that are happening on a day-to-day basis is something most companies don’t do, which they should. They put it in a staging environment, but it never really experiences true workflows until it experiences the workflows of live data. Being able to capture sample data and being able to mask the data is something all companies should be doing.”
Thus, Poulley advocates testing applications with real-world data traffic as much as possible. Doing so, he said, will give you a better idea of what the application will do in production. And doing all of this will allow developers to “bring the analytics to the data, rather than the other way around,” he said.
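As an illustration of the masking step that makes such testing possible, the following is a minimal sketch under stated assumptions rather than any vendor’s method. It replaces sensitive fields with a keyed hash so that the same input always maps to the same token, which keeps joins and repeat-customer patterns intact in captured sample data. The key, the field names, and the `mask_value` and `mask_record` helpers are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key and field list; a real deployment would manage these securely.
SECRET_KEY = b"example-key-not-for-production"
SENSITIVE_FIELDS = {"name", "email"}


def mask_value(value):
    """Replace a value with a deterministic, keyed token."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return "masked_" + digest.hexdigest()[:12]


def mask_record(record):
    """Mask only the sensitive fields; leave everything else untouched."""
    return {
        key: mask_value(val) if key in SENSITIVE_FIELDS else val
        for key, val in record.items()
    }


sample = {"name": "Ada Example", "email": "ada@example.com", "amount": 42}
masked = mask_record(sample)
```

Because the token is deterministic, two captured records for the same customer still match after masking, while the original name and email never reach the test environment.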