Some companies are building massive data lakes and throwing all possible data into them, hoping to produce valuable insights in the future. The problem with that approach can be poor data quality, which can be difficult and costly to rectify later. On the other end of the spectrum are companies so focused on data quality that they're impeding their own progress.
“If you have access to data today, storing it in a data lake is cheap enough that you’d be foolish not to do it,” said Johnson. “Obviously there may be some data you don’t need to keep, but for the most part, it’s so cost-efficient to store it, you store it and label it so when people do analytics it’s fine.”
On the other hand, just because the data is available doesn't necessarily mean the organization will use it. For example, one of Ernst & Young's healthcare clients built a model describing how long it retained customers. With Ernst & Young's help, the company came up with 150 variables, but it had excluded certain external data it didn't consider relevant to the problem. Since the data-mining techniques could process 282 variables as fast as 150, Ernst & Young convinced the client to use both the internal and external datasets. Of the 282 variables, 20 were highly correlated with customer retention, and of those 20, 12 came from the external dataset.
“The only reason they didn’t know the data was relevant was because they weren’t using the appropriate analytical techniques,” said Johnson. “Now they have 12 variables that are external to their own system they can apply to [customer acquisition] rather than just looking at existing customers.”
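The kind of screen described above can be sketched in a few lines: rank every candidate variable by its correlation with retention and keep the strongest ones. This is a minimal illustration, not Ernst & Young's actual method; the column names, the synthetic data, and the use of simple Pearson correlation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def rank_retention_correlates(df: pd.DataFrame, target: str = "retention_months",
                              top_n: int = 20) -> pd.Series:
    """Rank numeric candidate variables by absolute correlation with the target."""
    features = df.drop(columns=[target]).select_dtypes(include=[np.number])
    corr = features.corrwith(df[target]).abs()
    return corr.sort_values(ascending=False).head(top_n)

# Illustrative data: one internal variable, one external variable, one noise column.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "tenure_score": rng.normal(size=n),       # internal (hypothetical) variable
    "ext_credit_index": rng.normal(size=n),   # external (hypothetical) variable
    "noise_var": rng.normal(size=n),
})
# Retention depends mostly on the external variable in this synthetic setup.
df["retention_months"] = (3 * df["ext_credit_index"] + df["tenure_score"]
                          + rng.normal(scale=0.5, size=n))

print(rank_retention_correlates(df, top_n=3))
```

Because the ranking is data-driven rather than judgment-driven, variables the business assumed were irrelevant get the same hearing as the ones it already trusted.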
Progress DCI has seen some of its customers take legacy or on-premises enterprise systems and build a data lake containing a whole history of information, including lower-value data coming in from the core systems.
“You can build your data lake quickly and let data scientists and data modelers define some kind of analytics, or they can do some data science-type programming and build some statistical models,” said Sumit Sarkar, chief data evangelist at Progress DCI. “You can put all the data you think has business value in a data lake and let your data science team decide what to do with it.”
Don’t overlook data quality and data governance, though. Otherwise the data lake may become a data swamp that gets increasingly difficult to manage over time.
“If an organization is bringing in all their data, then they’re going to have the information available to ask lots of questions,” said KPMG’s Gusher. “The problem is if you haven’t taken the right data management approaches—metadata tagging, lineage, policies, data dictionaries, etc.—then what you have is a data swamp and it won’t provide value, so putting data under governance is imperative. If you don’t do that, you might as well not bring the data together.”
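The practices Gusher lists can be made concrete with a tiny catalog sketch: each dataset registered in the lake carries its metadata tags, lineage, and governing policy. This is a hedged illustration only; the field names, the `catalog` dictionary, and the example values are assumptions, not any vendor's actual data-dictionary schema.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """One data-dictionary record: tags, lineage, and policy for a lake dataset."""
    name: str
    owner: str
    source_system: str                           # lineage: where the data originated
    tags: list = field(default_factory=list)     # metadata tags for discovery
    retention_policy: str = "default"            # governance policy reference

catalog: dict[str, DatasetEntry] = {}

def register(entry: DatasetEntry) -> None:
    """Add a dataset to the dictionary so analysts can find and trust it."""
    catalog[entry.name] = entry

# Hypothetical entry for a dataset landed from a core system.
register(DatasetEntry(
    name="claims_history",
    owner="analytics-team",
    source_system="core_claims_db",
    tags=["healthcare", "customer"],
    retention_policy="7y",
))
print(catalog["claims_history"].source_system)
```

Even a thin layer like this answers the questions that separate a lake from a swamp: where the data came from, who owns it, and what rules apply to it.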
Verifying you’re on the right track
Companies realize they need to actively invest in reporting and analytics capabilities, but it isn't always clear whether their data strategy, reporting, analytics, or even business processes are what they need to be. Technology, business environments, and end-user expectations are changing rapidly. To keep pace, organizations need to be more agile than they have been historically, and they need the fortitude to improve what's working and deemphasize what isn't providing business value.