A blended approach to managing data

Published: January 2nd, 2015

- Alex Handy

Making the case for version control, testing environments and continuous integration in software development is a no-brainer these days. But when it comes to the data behind your important applications, life-cycle management and data flow automation are still new ideas struggling to find their place in the market.

That doesn’t mean, however, that managing your data is an impossible task, devoid of vendors and best practices. But it does mean that most enterprises may not yet fully comprehend what exactly data automation entails.

Data automation can refer to a number of different things, or to all of them together in a single workflow. It can mean data scrubbing to remove sensitive data before use in testing environments. It can also refer to the flow of data from production systems back to data warehouses, then forward again into analytics data stores. It can even refer to the actual change management of data as it comes into the system and evolves over time.

With such a broad space to cover, data automation can be a hard term to advocate for inside your organization. But it doesn’t have to be an impossible struggle, thanks to numerous companies and tools that make managing the data life cycle much easier.

Seb Taveau, senior business leader and technical evangelist at MasterCard for its Open APIs, said that the management of data in the software development life cycle is extremely important for a heavily regulated organization like his.


MasterCard has a particularly interesting set of problems when it comes to the data life cycle. As a credit card company, it faces financial regulations that weigh heavily on what types of data it can collect, let alone share. And yet, the company has still been able to build up a developer network of APIs and services based on its data. This data includes information like purchases made at specific locations, as well as the time stamps associated with them. As a result, using MasterCard APIs, developers can determine which restaurants in their town are the most popular among locals, or when certain businesses are open.

“MasterCard’s been working on taking some of its private APIs and making them public for the past four years,” said Taveau. “The developer program is not new inside MasterCard. This became a key project for MasterCard about a year ago. They wanted to make sure MasterCard was a tech companion, not just a payment network.”

But having all that data publicly available means there’s a lot of work to be done before an API can go live. Depending on the data source, the project to open the data behind an API can sometimes be very easy, and sometimes require a lot of hands-on work.

“Sometimes it works great; sometimes it requires manual work to make sure it works,” said Taveau. “It depends on the complexity of the API and the data you’re trying to reach. Location is very easy to automate: It doesn’t require the security. The data is not critical—just a pinpoint to a merchant location or restaurant location or ATM location. That’s not critical.”

When it comes to more delicate information, said Taveau, the effort requires more coordination and care. “When you talk about a merchant identifier API, that’s a completely different story. As you expect, the review process for the location API and the tokenization API or risk management are very different.”

Shifting tools
Yaniv Yehuda, cofounder and CTO of DBmaestro, said that managing software is almost second nature to developers, but managing the data is another story entirely. “Tools are important, as well as human processes. But what we found out is that people are really challenged when they get to the database. Databases don’t follow the same processes the code does. They require special attention,” he said.

Code and databases are managed differently because they are created in different ways and follow different workflows, said Yehuda. “When you deal with traditional code, you have your development people working with version control. Then it gets built by a build server and pushed to the next environment,” he said.

“When you deal with a database, a database is not compiled in one area and then pushed to another. The database actually holds its internal structure of the data: the schema in each environment. If you have a development environment, it holds the structure code content as the application, but it still hosts the same data in the Q/A environment. In order to deal with that, you create transition code that changes the database from one state to the next.

“The database, in order to be promoted from one version to the next, it requires additional steps in order to deal with that transition code. Because people are really doing stuff with traditional tools, developers have to do this stuff manually. This code is really static, so if you have some changes in your development environment and you want them to push to another environment, then you write the code to deal with that transition. You have code override, and people are not using the database itself to manage version control because they have to extract the objects from the database and put them in a different version control from the code.

“What DBmaestro does first is to create a bulletproof version-control system. We created an enforced version control to the database. You cannot change the database unless you check out the object. You go to the database and say ‘add column,’ but only when you check it out. When you check it back in, it gets version-controlled. The second thing DBmaestro does is it safely automates deployments from one environment to another.”
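
The “transition code” Yehuda describes is often kept as an ordered list of migration scripts that live in version control next to the application. What follows is a minimal sketch of that general idea, not DBmaestro’s implementation. It uses Python’s built-in sqlite3 module, and the employee table, the MIGRATIONS list and the migrate function are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical, ordered transition scripts. Entry N moves the schema from
# version N-1 to version N; the list itself would live in version control.
MIGRATIONS = [
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE employee ADD COLUMN email TEXT",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply the migrations this database has not seen yet; return its version."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(statement)
            # Record the step so the same script is never applied twice.
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
    conn.commit()
    return max(current, len(MIGRATIONS))


if __name__ == "__main__":
    qa_db = sqlite3.connect(":memory:")  # stands in for a QA environment
    print(migrate(qa_db))  # first run applies every pending script
    print(migrate(qa_db))  # second run finds nothing left to apply
```

Because each environment records which scripts it has already received, the same list can be replayed against development, QA and production without anyone tracking by hand which changes each database is missing.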

Red Gate offers similar tools for data automation and management. Ben Rees, general manager at Red Gate, said that the company has been building tools around this problem for 15 years now. “People were using them to do database life-cycle management, but we didn’t call it that back then. Back then we called it agile database delivery and agile database development,” he said.

“The essence of it is that there are developers and DBAs who want to make changes to their databases, and more importantly, they want to make those changes live. They’ve been doing that using our tools, and have been following this process. We’ve only realized recently what they’re doing is application life-cycle management, except for the database.”


Rees said that Red Gate’s tools can help a development organization get control of data management within its application life cycle. “It’s about source controlling the database. It’s about automating the updates. It’s about properly managing the release process, and having proper release management to get changes into production. Then it’s about monitoring what happens after the event.”

Thus, as developers have awakened to database management, Red Gate has awakened to how its customers are using its tools. “What we’re realizing now, over the last year or so, is that we have this complete story that our customers have been telling us over and over for years,” said Rees. “We used to sell these 1,500-point tools to the end user. Now we’re selling something that’s about changes in process, changes in how you work.”

That means Red Gate can no longer rely on single developers using a credit card to buy a tool. Instead, data management and the database life cycle have moved up the stack to become CIO-level concerns. Getting buy-in from that far up the chain can ensure that a new process and life cycle can be pushed out across the organization, not just into small pockets.

Ron Huizenga, data architect and product manager at Embarcadero Technologies, said that change management in databases is a great way to ensure changes don’t destroy essential information.

“We have database change-management tools, from the metadata approach, from the modeling approach, and the metadata artifacts,” he said. “It’s extremely important to be able to map out all of what that is.

“There’s also a difference in terms of the data content in those stores. And that’s where we really get into an enterprise data lineage. You may have a company that has employee data scattered across a number of different systems. How do we map that together, and how do we know if info is changing in its journey through the system?”

Embarcadero, he said, has tools that can tie different data store terms together. If one database refers to employees with the field “EMP” while another uses the field title “PEOP,” Embarcadero’s tools can be used to define these two different data columns as meaning the same thing, thus allowing for quick integration of data from disparate data sources.
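
To make the “EMP” and “PEOP” example concrete, here is a rough sketch of what such a mapping could look like in plain Python. It does not show Embarcadero’s tools or their API. The system names hr_db and payroll_db, the GLOSSARY table and the to_business_terms function are all assumptions made for illustration.

```python
# Hypothetical glossary: (source system, column name) -> shared business term.
GLOSSARY = {
    ("hr_db", "EMP"): "employee",
    ("payroll_db", "PEOP"): "employee",
}


def to_business_terms(system: str, row: dict) -> dict:
    """Rename a row's columns to shared terms; unmapped columns keep their name."""
    return {
        GLOSSARY.get((system, column), column): value
        for column, value in row.items()
    }


if __name__ == "__main__":
    print(to_business_terms("hr_db", {"EMP": "Ada Lovelace", "dept": "R&D"}))
    print(to_business_terms("payroll_db", {"PEOP": "Ada Lovelace", "band": "C"}))
```

Once both rows carry the same “employee” key, records from the two systems can be compared or merged without either source database being changed.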

New stores, new processes
When it comes to the actual database, NoSQL data stores are offering another wrinkle. Couchbase, for example, is a NoSQL database that behaves similarly to Lotus Notes. For mobile applications, this means that Couchbase can be used to embed a datastore onto a mobile device, then sync that datastore with the remotely hosted version of Couchbase.

Wayne Carter, chief architect of mobile at Couchbase, said that database management has changed drastically in recent years.

“I come from Oracle, where I spent 14 years,” he said. “I grew up in CRM from Siebel, and the models in CRM are extremely complicated. We went through several variances, and even a simple thing like a contact… the model associated with that becomes monstrous over time.

“At the application logic level, some things you’d do in the database, like validation into your application code, and things like joins and queries just to get simple objects. That just doesn’t work in today’s mobile space. Applications move faster and need to evolve a lot faster. These are completely dependent on databases, from the change tracking perspective.”

He went on to say that, “For our database, because it’s moved to the application for management, it’s managed at the code level. The same code can be used for archiving [and] versioning, and could be stored on GitHub… rather than within the database. The actual migration of the database is being managed on the change management [system].” Thus, NoSQL data stores allow the application to define the layout of the data. Because of this, simply doing version control within the application code itself can help to manage the data.
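
One common way to manage document layout at the code level is to stamp each document with a schema version and upgrade older documents when the application reads them. The sketch below illustrates that pattern with plain Python dictionaries. It is not Couchbase’s API, and the schema_version field, the name fields and the function names are hypothetical.

```python
from typing import Callable


def _v1_to_v2(doc: dict) -> dict:
    # Hypothetical change: version 2 splits "name" into first and last name.
    first, _, last = doc.pop("name", "").partition(" ")
    return {**doc, "first_name": first, "last_name": last, "schema_version": 2}


# Upgrade steps keyed by the version they upgrade FROM.
UPGRADES: dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}
CURRENT_VERSION = 2


def upgrade(doc: dict) -> dict:
    """Bring a stored document up to the layout the current code expects."""
    doc = dict(doc)  # work on a copy so the caller's document is untouched
    # Documents written before versioning existed are treated as version 1.
    while doc.get("schema_version", 1) < CURRENT_VERSION:
        doc = UPGRADES[doc.get("schema_version", 1)](doc)
    return doc


if __name__ == "__main__":
    print(upgrade({"name": "Ada Lovelace"}))
```

Since the upgrade steps are ordinary code, they can be reviewed, versioned and rolled back in the same repository as the rest of the application.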

MongoDB extends this concept even further with its MongoDB Management Service (MMS). This enterprise tool for managing MongoDB instances can quickly spin up test environments and replicate data from one area to another, ensuring testers and developers can access the data during development.

That doesn’t mean it’s the entire solution to the data management problem yet, however. Kelly Stirman, director of products at MongoDB, said, “What it doesn’t do is things like data masking and sampling. It doesn’t generate different distributions of fields and attributes or create randomized samples. There’s a lot of interesting stuff you could do to load test your applications based on the distribution of certain types of values on a known data set. We’re looking at that and thinking about [adding these capabilities to future versions].”
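
As an illustration of the kind of sampling Stirman mentions, a team could draw a randomized sample that keeps the same mix of values as the full data set. The sketch below is one possible approach in plain Python, not a MongoDB feature. The stratified_sample function and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each group, so that the
    distribution of `key` values in the sample mirrors the full data set."""
    rng = random.Random(seed)  # a fixed seed keeps test runs repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so rare values are still represented.
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample


if __name__ == "__main__":
    data = [{"id": i, "country": "US" if i < 80 else "FR"} for i in range(100)]
    subset = stratified_sample(data, key="country", fraction=0.1)
    print(len(subset))
```

A load test that runs against such a sample sees the same proportions of values as production would, at a fraction of the volume.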

The business of data
Sean Poulley, vice president of databases and data warehousing at IBM, said that understanding the data life cycle requires an understanding of how a database works in a services environment. “One of the things people often misunderstand is that having a database is one thing, but running a data service is another. There’s a world of difference between having a database-management product and actually delivering it as a service,” he said.

To this end, IBM has been focusing on CouchDB as its platform for the future. IBM acquired Cloudant in February 2014, and has since been building enterprise products around CouchDB, the database originally created by ex-Lotus developer Damien Katz. CouchDB is, essentially, an attempt to recreate the database system inside of Lotus Notes.

“CouchDB has built a lot of clever management tools around how they load balance and around the database capabilities,” said Poulley. “Cloudant did a really nice job. At one stage, they looked like they were going to fork CouchDB, but they decided to contribute that back to Apache CouchDB. We’re seeing tremendous uptake; it’s one of those teams with all the ingredients for success. I actually spent a period of my career inside Lotus. Not everybody recognizes what CouchDB is. The first time I saw CouchDB and Cloudant I thought, ‘That’s Lotus!’ ”

IBM’s view of the database-management problem boils down to the enterprise network that includes data zones, said Poulley. “We talk about data zones, and we came up with these different zones based on thousands of customer engagements. What we observed is that customers had four or five data zones. One would be a relational, operational store. Then we would find, classically, the data warehouse zone, and the data-mart zone, typically with the conditional relational model. Then we’re seeing this emergence, particularly with Hadoop, of an exploration zone, where data is being brought in raw and dumped into Hadoop.”

That’s a change, said Poulley, from where data has traditionally lived and for how it’s being managed. “Before, people would do preprocessing before they moved it into a relational data warehouse,” he said.

“Similarly, we’ve seen with the explosion of data analytics, and the growth of data, we’re seeing a tremendous growth in our relational data warehouse as well. Understanding the true value is not in the physical storage itself, it’s in being able to process the info you need at the speed you need it, for the period you need it. This idea, which is not new, is becoming increasingly interesting as a logical data warehouse where data flows to and from a high-performance environment like a Netezza data warehouse, into something like BigInsights, and vice versa.

“If you move info from the relational world into the Hadoop world, you can still introduce the queries. That’s what we describe as an actionable archive: You can run the same queries on Hadoop, keep the hot data in the performant environment, and move the longer-term data into a colder environment.”

Thus, Poulley advocates the reuse of queries instead of the replication of data. Rather than moving data around from data warehouses to relational data stores and into NoSQLs or test environments, he advocates keeping the data where it is and writing queries that can run on any type of data store.
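
The idea of moving the query rather than the data can be sketched in a few lines. In the example below, two in-memory SQLite databases stand in for a “hot” warehouse and a “cold” archive. That is an assumption made so the code runs anywhere; a real deployment would involve a relational warehouse and a SQL layer over Hadoop. The purchases table and the function names are hypothetical.

```python
import sqlite3

# The one query that is sent, unchanged, to every store.
QUERY = "SELECT merchant, SUM(amount) FROM purchases GROUP BY merchant"


def make_store(rows):
    """Build an in-memory database standing in for one data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (merchant TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    return conn


def federated_totals(stores):
    """Run the same query on each store and merge the partial sums."""
    totals = {}
    for conn in stores:
        for merchant, subtotal in conn.execute(QUERY):
            totals[merchant] = totals.get(merchant, 0) + subtotal
    return totals


if __name__ == "__main__":
    hot = make_store([("cafe", 5.0), ("deli", 7.0)])  # recent data
    cold = make_store([("cafe", 10.0)])               # archived data
    print(federated_totals([hot, cold]))
```

Merging partial results this way works for sums and counts. Other aggregates, such as averages or medians, need more care when they are combined across stores.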

So, perhaps data management and automation isn’t just about moving around the data and masking it. Perhaps it’s more about moving as much code out of the database as possible so that it can be managed in the same way as software. Then, instead of bringing the data to the software, it’s perhaps easier and more efficient to bring the software to the data. That, however, requires some intermediate steps to ensure basic capabilities around the data, such as masking.

“Once you know you can mask the data, the idea of capturing the workflows is almost like a VCR form,” said Poulley. “To really capture the workloads that are happening on a day-to-day basis is something most companies don’t do, which they should. They put it in a staging environment, but it never really experiences true workflows until it experiences the workflows of live data. Being able to capture sample data and being able to mask the data is something all companies should be doing.”
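
Masking itself can be fairly simple. One approach, shown below as a sketch rather than as any vendor’s method, replaces sensitive values with a keyed hash using Python’s standard hmac and hashlib modules. The secret key, the field names and both functions are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-me"  # hypothetical; keep the real key out of source control
SENSITIVE_FIELDS = {"name", "card_number"}  # hypothetical field names


def mask_value(value: str) -> str:
    """Replace a value with a keyed hash. The same input always yields the
    same token, so joins between masked tables still line up."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_row(row: dict) -> dict:
    """Mask only the sensitive fields and pass everything else through."""
    return {
        field: mask_value(str(value)) if field in SENSITIVE_FIELDS else value
        for field, value in row.items()
    }


if __name__ == "__main__":
    print(mask_row({"name": "Ada Lovelace", "card_number": "4111", "city": "Paris"}))
```

Deterministic tokens keep captured workloads replayable, since a masked customer still looks like the same customer across every table. The trade-off is that anyone holding the key can recompute the tokens, so the key has to be protected as carefully as the original data.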

Thus, Poulley advocates testing applications with real-world data traffic as much as possible. Doing so, he said, will give you a better idea of what the application will do in production. And doing all of this will allow developers to “bring the analytics to the data, rather than the other way around,” he said.

“The innovation in the next few years won’t be about data movement; it’ll be about query movement. It’s much easier to move the queries and federate the queries.”

Back to the master
No matter how you manage your data, one thing is certain: dealing with the data itself requires discussions with the controller of that data. At MasterCard, every API in the company is controlled by one group or another. That means when MasterCard’s Taveau and his team begin the work of taking an API public, their first stop is the offices of the team that manages the API and data they’re working on.

“When you’re looking at an API, we won’t be exposing single data points through an API; we create an aggregate of the data, and that’s what the API will look at,” he said. “This requires a lot of discussion internally on the whys and the why nots, where we discuss the value that can be brought out of these type of packages. How do you do it? There’s this filter, then it’s reviewed, then we check the data we’re putting in the package. We have the information security team to make sure the code is made properly, and we have friends in sales to see if there’s value in it.

“At the end of the day, the value is in the data and the quality of the data.”
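
Exposing an aggregate instead of single data points might look something like the sketch below. It is not MasterCard’s process. The merchant_id field, the function name and the minimum group size are assumptions made for illustration.

```python
from collections import Counter

MIN_GROUP_SIZE = 5  # hypothetical threshold, not a MasterCard rule


def popularity_by_merchant(purchases, min_group_size=MIN_GROUP_SIZE):
    """Count purchases per merchant, dropping merchants with so few
    purchases that the count could point to individual transactions."""
    counts = Counter(purchase["merchant_id"] for purchase in purchases)
    return {
        merchant: total
        for merchant, total in counts.items()
        if total >= min_group_size
    }


if __name__ == "__main__":
    sample = [{"merchant_id": "m1"}] * 6 + [{"merchant_id": "m2"}] * 2
    print(popularity_by_merchant(sample))
```

Only the counts leave the function, so a consumer of the result can rank merchants by popularity without ever seeing a single purchase record.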

Article Tags

data automation, databases, life-cycle management, testing, version control

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.


