While the amount of data in the world is effectively infinite, our attention span is not. That’s why AI is becoming a valuable tool for data integration: it distills data into concise analysis and makes that analysis accessible to everyone throughout an organization.
According to SnapLogic’s Ultimate Guide to Data Integration, AI and ML capabilities are increasingly being built into data integration platforms to significantly improve integrator productivity and time to value.
Companies are also making sure that no data slips through the cracks. They realize they have to be more sensitive and careful with user data in the wake of large data breaches and the regulations that followed.
They can rely on AI and ML capabilities to identify which data should be masked or anonymized, and to discern what is useful and what isn’t. AI can do this automatically, helping ensure compliance with HIPAA, GDPR, and other regulations.
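To make that concrete, here is a minimal sketch in Python of what an automated flag-and-mask pass might look like. The patterns and field names are illustrative stand-ins; a real platform would train ML classifiers on column names, value distributions, and usage context rather than rely on regexes alone.

```python
import re

# Simplified illustration of a flag-then-mask pass. A production system would
# use trained classifiers, not just regexes, to decide what counts as PII.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def classify_and_mask(record: dict) -> dict:
    """Flag values matching a known PII pattern and replace them with a mask."""
    masked = {}
    for field, value in record.items():
        text = str(value)
        if any(pattern.search(text) for pattern in PII_PATTERNS.values()):
            masked[field] = "***REDACTED***"
        else:
            masked[field] = value
    return masked

print(classify_and_mask({"name": "Ann", "contact": "ann@example.com"}))
# {'name': 'Ann', 'contact': '***REDACTED***'}
```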
The process of adding AI to analyze and transform massive data sets into intelligent data insight is often referred to as data intelligence, according to an article by data analytics platform provider OmniSci.
Five components of intelligence
There are five major components of data-driven intelligence: descriptive, prescriptive, diagnostic, decisive, and predictive data. Applying AI to these areas helps with understanding data, developing alternative knowledge, resolving issues, and analyzing historical data to predict future trends.
“AI is being used across multiple functions in data integration, but I would say it is being used most effectively in providing intelligence about data, automating the collection and curation of metadata, so that organizations can gain control over highly distributed, diverse, and dynamic modern data environments,” said Stewart Bond, the research director of IDC’s Data Integration and Intelligence service.
Data intelligence is effective at gathering data from various sources, a frequent requirement of a company’s data integration initiatives, and then creating a uniform identity model across those sources.
This intelligence can leverage business, technical, relational, and behavioral metadata to provide transparency of data profiles, classification, quality, location, lineage, and context.
“To take an example from our world at LinearB: to effectively integrate data from disparate dev systems such as Git or Jira, one needs to be able to map the identities such as developer usernames between these systems. That’s a great task for some ML models. As more systems are involved, the problem gets tougher but you have more data to assist your AI/ML to solve it,” said Yishai Beeri, the CTO at LinearB.
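As a rough illustration of the identity-mapping problem Beeri describes, the Python sketch below pairs usernames across two systems by string similarity. The usernames are invented, and production systems would use trained models over many more signals (email addresses, commit metadata, activity patterns) rather than a single similarity score.

```python
from difflib import SequenceMatcher

# Hypothetical usernames from two dev systems that refer to the same people.
git_users = ["jsmith", "adoe-dev", "mchen42"]
jira_users = ["john.smith", "alice.doe", "maria.chen"]

def best_match(name: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate username most similar to `name`, with its score."""
    scored = [(c, SequenceMatcher(None, name, c).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

for user in git_users:
    match, score = best_match(user, jira_users)
    print(f"{user} -> {match} (similarity {score:.2f})")
```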
Organizations that are looking to infuse AI into their data integration are primarily looking at three things: minimizing human effort, reducing complexity, and optimizing cost, according to Robert Thanaraj, a senior principal analyst on the data management team at Gartner.
“Number one, I’m looking at improved productivity of users, the technical experts, citizen developers, or business users. Secondly, if complexities are solved, it opens up for business users to carry out integration tasks almost without any support from a central IT team or your integration specialist, such as a data engineer,” Thanaraj said. “Lastly, ask yourself, can we get rid of any duplicated copies of data? Can we recommend an alternative source for good quality trusted data? Those are the kinds of typical benefits that enterprises are looking at as they prototype and experiment with integrating AI into data integration.”
AI is being used to improve data quality
AI is not only proving pivotal in business use cases; it can also quickly solve problems that have to do with data quality.
Specifically, AI makes it possible to achieve greater consistency of data and enables better master data management, according to Chandra Ambadipudi, senior vice president at EXL, a provider of data services.
Dan Willoughby, a principal engineer at Crowdstorage, described how his company used AI/ML to tackle data quality problems in a proactive rather than reactive fashion.
The company continually wrote 15 petabytes of data per month to over 250,000 devices in people’s homes, and AI was used both to predict when a device would go offline and to detect malicious devices.
“Since a device could go offline at any time for any reason, our system had to detect which data was becoming endangered,” Willoughby explained. “If it was in trouble, that data would be queued up to be repaired and placed elsewhere. The idea was that if we could predict a device would go offline soon by observing patterns of other devices we’d stop sending data to it, so we could save on repair costs.”
Also, since the company had no control over what people could do to their devices, it needed protections in place beyond encryption to detect anomalies in a device’s behavior.
“ML is perfect for this because it can average out the ‘normal’ behavior and easily determine a bad actor,” Willoughby said.
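A generic version of that idea can be sketched with scikit-learn’s IsolationForest, which learns a profile of “normal” behavior and flags devices that deviate from it. The features and numbers below are invented for illustration; this is not Crowdstorage’s actual pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-device features: requests/hour, bytes written, error rate.
rng = np.random.default_rng(0)
normal = rng.normal(loc=[100, 5_000, 0.01], scale=[10, 500, 0.005], size=(500, 3))
bad_actor = np.array([[900, 50_000, 0.4]])  # wildly off-profile device

# Learn what "normal" device behavior looks like from historical observations
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# predict() returns 1 for devices matching the learned profile, -1 otherwise
print(model.predict(bad_actor))  # [-1] -> flag this device for review
```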
LinearB’s Beeri said another common example of AI weeding out bad data is in detecting and ignoring Git work done by scripts and bots.
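As a first approximation of that bot-filtering task, the heuristic sketch below drops commits whose authors match common bot markers. The markers and commit data are illustrative; an ML approach would instead learn from signals like commit cadence, message patterns, and diff shape.

```python
# Heuristic first pass at separating human work from bot-generated commits.
BOT_MARKERS = ("[bot]", "dependabot", "renovate", "github-actions")

def is_probably_bot(author: str, message: str) -> bool:
    author = author.lower()
    return (any(marker in author for marker in BOT_MARKERS)
            or message.startswith("chore(deps):"))

commits = [
    {"author": "dependabot[bot]", "message": "chore(deps): bump lodash"},
    {"author": "yishai", "message": "Fix identity mapping edge case"},
]
human_work = [c for c in commits if not is_probably_bot(c["author"], c["message"])]
print([c["author"] for c in human_work])  # ['yishai']
```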
AI can address many of the common data integration challenges
The introduction of AI and ML to data integration is still a relatively new phenomenon, but companies are realizing that handling data integration tasks manually is proving especially difficult.
One of the challenges is the absence of intelligence about the data when it is handled manually.
According to the Data Culture Survey that IDC ran in December 2020, 50% of respondents said they felt there was too much data available and they couldn’t find the signal in the noise, while the other 50% said there wasn’t enough data to help them make data-driven decisions, which is the intended outcome of data integration and analytics.
“If you don’t know where the best data is related to the problem you are trying to solve, what that data means, where it came from, how clean or dirty it is – it can be difficult to integrate and use in analytical pipelines,” IDC’s Bond said. “Manual methods of harvesting and maintaining intelligence about data are no longer effective. Many still use spreadsheets and Wikis and other forms of documentation that cannot be kept up to date with the speed at which data is moved, consumed, and changed.”
As for getting started with AI and ML in data integration, companies should see if the solutions fit the requirements of their work, Bond added. The industries with the greatest need for data intelligence include cybersecurity, finance, health, insurance, and law enforcement.
Companies should look at how data intelligence factors into the solution, whether it is part of the vendor’s platform, or whether the technology supports integration with data intelligence solutions.
“As organizations try to understand how data integration and intelligence tasks are automated, they should understand what is truly AI-driven and what is rules-driven,” Bond said. “Rules require maintenance, AI requires training. If you have too many rules, maintenance is difficult.”
Gartner’s Thanaraj recommends embarking on a data fabric design, which utilizes continuous analytics over existing, discoverable, and inferenced metadata assets. This model can support the design, deployment, and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms.
This method leverages both human and machine capabilities and continuously identifies and connects data from disparate applications to discover unique, business-relevant relationships between the available data points.
It uses knowledge graph technologies that are built on top of a solid data integration backbone. It also uses recommendation engines and orchestration of AI and data capabilities, primarily driven by metadata.
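To give a feel for the knowledge-graph layer, here is a minimal Python sketch using the networkx library. The datasets, classifications, and relation names are hypothetical; the point is that once metadata is modeled as a graph, lineage and governance questions become simple traversals.

```python
import networkx as nx

# Hypothetical metadata graph: datasets, classifications, and a steward.
g = nx.DiGraph()
g.add_edge("crm.customers", "warehouse.dim_customer", relation="feeds")
g.add_edge("warehouse.dim_customer", "report.churn", relation="feeds")
g.add_edge("crm.customers", "PII", relation="classified_as")
g.add_edge("warehouse.dim_customer", "jane@corp", relation="stewarded_by")

# A lineage question a fabric can answer: what depends on crm.customers?
feeds = nx.subgraph_view(g, filter_edge=lambda u, v: g[u][v]["relation"] == "feeds")
print(sorted(nx.descendants(feeds, "crm.customers")))
# ['report.churn', 'warehouse.dim_customer']
```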
“Metadata will be a game-changer of the future, and AI will take advantage of the metadata,” Thanaraj said.
How the introduction of AI/ML affects the data engineering role
AI and ML will vastly improve the speed at which data integration is handled, but data engineers remain in constant demand, even more so as they learn to work with AI in an augmented way.
AI can help in making recommendations about the best way to join multiple data sets together, the best sequence of operations on the data, or the best ways to parse data within fields and standardize output, according to IDC’s Bond.
“If we consider data quality work, people will shift from writing rules for identifying and cleansing data to training machines on whether or not anomalies that are detected are really data quality issues, or whether they represent valid data,” Bond said. “If we consider data classification efforts for governance and business context, again the person becomes the supervisor of the machine – training the machine about what are the correct associations or classifications, and what are not correct assumptions made by the machine.”
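The supervision loop Bond describes can be sketched in a few lines of Python with scikit-learn. The features and labels below are invented placeholders: the machine flags anomalies, a person labels which flags were genuine issues, and those labels train a model that triages future flags.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each flagged anomaly as invented features:
# (z-score of value, fraction of nulls in column, field-length deviation)
flagged = np.array([[4.2, 0.01, 3.1],
                    [3.8, 0.00, 0.2],
                    [5.1, 0.30, 2.9],
                    [4.0, 0.02, 0.1]])
# Human verdict per flagged row: 1 = real data quality issue, 0 = valid data
human_labels = np.array([1, 0, 1, 0])

# The person's judgments become training data instead of hand-written rules
model = LogisticRegression().fit(flagged, human_labels)

# Future anomalies are triaged with the learned judgment
new_anomaly = np.array([[4.6, 0.25, 2.7]])
print(model.predict(new_anomaly))  # likely [1]: treat as a genuine issue
```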
AI capabilities will help people working on data integration with the mundane tasks, which both frees them up to do more important work and helps them avoid burnout, a common problem today.
“It takes easily between 18 to 24 months before data engineers are fully productive and then in another year or so, they are burnt out because of lack of automation,” Thanaraj said. “So one of the key things I recommend to data and analytics leaders is you should create a social structure where you’re celebrating automation.”
Data engineers can’t do everything by themselves, and this has resulted in various roles that specialize in various aspects of handling data.
In a blog post, IDC listed these roles: data integration specialists who blend data for analytics and reporting, and data architects who bridge business and technology with contextual, logical, and physical data models and dictionaries. On top of that, there are data stewards, DataOps managers, business analysts, and data scientists.
“Data engineers are a critical role for any enterprise to succeed today. And it is in the hands of data engineers that you’re going to build these automation capabilities at the end of the day,” Thanaraj said. “The AI bots or AI engines are going to do the core repetitive scanning, filing, classifying, and standardizing tasks with data.”
On top of that, business experts and domain experts need to validate whether the data is being used the right way, and they have the final say. AI and ML then learn from these human decisions.
“This is why humans become the number one custodians; the ones who monitor and avoid any deviation of models done by AI,” Thanaraj said.