Development, testing, security and operations have all been transformed to keep up with the pace of software today — but one piece is still missing. Data is now becoming a roadblock to Agile and DevOps initiatives.

“People are getting stuck with data saying ‘I have my infrastructure layer automated and self-serviced so a developer can push a button and an environment can be configured automatically. I have made my entire CI/CD pipeline, my entire software delivery life cycle automated. I can promote code. I can test code. I can automate testing. But the last layer is data. I need data everywhere,’” said Sanjeev Sharma, vice president and global practice director for data modernization and strategy at Delphix.

As a result, development teams are starting to turn to DataOps to help speed up that data layer. SD Times recently caught up with Sharma, who spoke about what DataOps means, how to be successful, and what’s next for data.

SD Times: I’ve heard people refer to DataOps as just another term for DevOps, so how would you define DataOps?
Sharma: If you look at the history of the word DataOps, it started off mainly from the data science people — people wanting to do artificial intelligence and machine learning who had no real way to manage and get at their data.

I was talking to a client of ours who was saying most data scientists don’t come from a computer science background, so their method of versioning data is “save as” and put a number at the end of the file name. It is that primitive. Of course he was exaggerating, but his point was that there is no disciplined way to manage data.
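
For contrast with the “save as” approach, here is a toy, standard-library-only sketch of content-addressed data versioning, in the spirit of tools like DVC. The store layout and manifest format are invented for illustration and are not anything Sharma describes:

```python
import hashlib
import json
import shutil
from pathlib import Path

STORE = Path("data_store")          # immutable, hash-addressed copies
MANIFEST = STORE / "manifest.json"  # human-readable history

def commit_dataset(path: str, note: str = "") -> str:
    """Store a snapshot of a data file keyed by its content hash and
    record it in a manifest -- instead of 'report_v2_final.csv'."""
    STORE.mkdir(exist_ok=True)
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    snapshot = STORE / digest
    if not snapshot.exists():
        shutil.copy2(path, snapshot)
    history = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []
    history.append({"file": path, "sha256": digest, "note": note})
    MANIFEST.write_text(json.dumps(history, indent=2))
    return digest

# commit_dataset("training_data.csv", note="after de-duplication pass")
```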

Our perspective of DataOps is very simple. In your enterprise, you have data owners: people who create the data, either because they own an application that customers use and that creates data, or because the data is coming from logs [such as] telemetry data from a mobile application or log data from something running in production. And then there are data managers. These are the database administrators and security people whose job is to manage the data, store it and secure it. Then there are the data consumers. These are your data scientists, your AI and ML experts, your developers and your testers who need the data to be able to do their job. How do you make these three sets of stakeholders work together and collaborate in a lean and efficient manner? That is DataOps.

It involves process improvement, and it involves technology.

So do you follow or recommend people look at the DataOps Manifesto?
The manifesto was written by the DataKitchen team, which is a data science company, so it has a data science-centric view of data, but the manifesto is a great thing. It puts some of these things I am talking about out in the open to say it is not just technology. It is not just building a data pipeline. If you don’t change the organizational ownership and divide responsibility among the data consumers, data owners and data managers, you are not going to succeed. It explains that very well. I think it is a great opening move. I wouldn’t say it is the final word, though.

What makes a successful DataOps initiative?
DataOps has two perspectives. If you are looking at it from a data science lens, you are asking whether your data science activities have reached a stage where the biggest source of friction is the inability to get the right data to the right people at the right time.

From a DevOps lens, you are asking yourself if you have reached a stage where you are struggling with getting the right data to the right people at the right time… and you might not experience that unless you are Agile. If you still have a six-month waterfall life cycle, six months is enough time to make a copy of a database. But if you are doing daily builds, true CI/CD and daily deployments to test environments — you need that data to be refreshed daily, sometimes multiple times a day. You need developers to be able to create local data sets for themselves, and to be able to branch data to do A/B testing. You are more likely to hit that friction point when you have already done some level of automation around environments and code. Data won’t be what you address first.
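
To make the “branch data like code” idea concrete, here is a deliberately small sketch using SQLite’s standard-library backup API. Real DataOps platforms virtualize data with copy-on-write rather than full copies, and the table and file names here are invented for illustration:

```python
import sqlite3

# Stand-in for the shared, production-like test database.
shared = sqlite3.connect("orders_test.db")
shared.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, total REAL)")
shared.execute("INSERT INTO orders (total) VALUES (42.0)")
shared.commit()

def branch(source: sqlite3.Connection, name: str) -> sqlite3.Connection:
    """Give a developer or an A/B experiment a private, writable,
    point-in-time copy of the data -- analogous to branching code."""
    dest = sqlite3.connect(f"{name}.db")
    source.backup(dest)  # consistent snapshot, even while in use
    return dest

variant_b = branch(shared, "orders_test_variant_b")
variant_b.execute("UPDATE orders SET total = total * 1.1")  # mutate freely
variant_b.commit()  # the shared database is untouched
```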

What are the benefits database owners and database admins get from DataOps?
Data managers are hired and paid to manage data, store it, make it available to the people who need it, and secure it to make sure they don’t get hacked. They are there to manage data in a lean and efficient manner. Making copies of data for data consumers is not their job. It is something a developer opens a ticket for and tells a DBA to do. That ticket ends up last on the list because the database admin has other tickets that say this database needs to be fine-tuned because it is not performing properly; this database needs to be reindexed; I need to add a new database for this new production environment; or I am running out of storage. All of those will have a higher priority over a developer asking for a copy.

Why not automate that and provide self-service to the data consumer? It makes the data managers’ job more efficient because they can focus on high-priority tasks like managing data schemas or database performance, rather than making low-level copies.
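
A minimal sketch of that self-service step, assuming a PostgreSQL server that maintains a refreshed snapshot database named prod_snapshot (the names, and the use of psycopg2, are assumptions — this is not Delphix’s mechanism). PostgreSQL can create a database from a template in one statement, so a developer can run this instead of filing a ticket:

```python
import psycopg2  # assumption: the standard PostgreSQL driver is installed

def self_service_copy(dev_name: str) -> str:
    """Create a private developer database from a maintained snapshot,
    replacing the open-a-ticket-and-wait-for-the-DBA workflow."""
    conn = psycopg2.connect(dbname="postgres", user="provisioner")
    conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
    dbname = f"dev_{dev_name}"
    with conn.cursor() as cur:
        cur.execute(f'CREATE DATABASE "{dbname}" TEMPLATE prod_snapshot')
    conn.close()
    return dbname

# self_service_copy("alice")  ->  a ready-to-use database named dev_alice
```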

From a data owner’s perspective, if the data is not being used, what use is it? It is just being stored. It is just sitting there. They may have 20 years of data, but the data consumer only has access to the last three years. A business owner is looking at what information, insights and inferences they are not able to access because of a policy that says, “I can’t give that to anyone.” They want data they can use as an asset: something that can be mined and used to draw inferences to better understand their customers, make better predictions, and make better investment decisions. Getting business value out of data is what DataOps brings to them.

How can you keep DataOps initiatives on track?
DataOps by itself has no value; it has to be in the context of either a data science initiative that needs data to be Agile, or a DevOps initiative that needs data to be Agile. A DataOps initiative has to be attached to a DevOps or data science initiative because it serves that purpose of making data lean, Agile and available to the right people.

That train needs to be moving and DataOps is just making the track straighter and faster. 

How do data regulations and data privacy concerns come into play in a DataOps movement? 
One of the tenets of DevOps is to make production-like environments available, which means the data should be production data. It shouldn’t be synthetic data. Synthetic data doesn’t have the noise and the texture of production data. You will need synthetic data if you are building a new feature whose data doesn’t exist in production yet, but everywhere else you want to put production data in your lower environments — and that raises security and compliance restrictions.

We at Delphix do masking of the data, and we do it at two layers. First, we mask the data: we replace all the sensitive information with dummy information while maintaining the relational integrity.
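
To illustrate why relational integrity matters when masking — this is a generic sketch of deterministic pseudonymization, not Delphix’s algorithm — the same real value must map to the same dummy value in every table, so foreign-key joins still work afterward. The key and table layouts below are invented:

```python
import hmac
import hashlib

MASKING_KEY = b"rotate-me-outside-source-control"  # assumption: a managed secret

def mask(value: str) -> str:
    """Deterministically replace a sensitive value: the same input always
    yields the same token, so joins across tables keep working."""
    return hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

customers = [{"email": "ann@example.com", "name": "Ann Smith"}]
orders    = [{"customer_email": "ann@example.com", "total": 42.0}]

masked_customers = [{**c, "email": mask(c["email"]), "name": mask(c["name"])}
                    for c in customers]
masked_orders    = [{**o, "customer_email": mask(o["customer_email"])}
                    for o in orders]

# The masked order still joins to the masked customer:
assert masked_orders[0]["customer_email"] == masked_customers[0]["email"]
```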

The second thing we do is put in a lot of identity and access management controls. For instance, we can put in policies that say if the data is not masked and is classified at this level, you cannot provision it to an overseas environment.
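
That second layer can be thought of as a policy check evaluated at provisioning time. The sketch below encodes just the rule described here; the classification scale, region names and function are illustrative, not Delphix’s API:

```python
RESTRICTED = 3  # assumed scale: 0 = public ... 3 = highly sensitive

def may_provision(masked: bool, classification: int,
                  target_region: str, home_region: str = "US") -> bool:
    """Refuse to provision unmasked data classified at the restricted
    level to an overseas environment -- the policy described above."""
    if target_region != home_region:
        return masked or classification < RESTRICTED
    return True

assert may_provision(masked=False, classification=3, target_region="EU") is False
assert may_provision(masked=True,  classification=3, target_region="EU") is True
```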

What is the state of DataOps today?
It is where DevOps was maybe eight years ago, when we were spending time explaining to people what DevOps was. Today, we don’t do that. We don’t need to explain to anyone what DevOps is. It is very well established; even though there are multiple definitions floating around, they are all at least on the same playing field.

With DataOps, I think we are still at that “what is DataOps and does it apply to me” stage. I would say there are still a couple of years before you have a DataOps Day or a conference dedicated to it.

What is still to come from the DataOps movement? 
Most of the world’s data is still living on a mainframe, so that spectrum needs to be addressed. Our goal is to say: no matter what kind of data it is or where it lives, we will allow you to manage it like code.