Industries today are struggling with increasing data fragmentation. No longer is all data sitting in an RDBMS, tucked in behind a corporate firewall with access and permissions easily defined and managed. Today, data lives in on-premises systems, in cloud storage and in partner systems, creating new complexities for accessing it.
A well-understood way to wrangle all this data is data virtualization: an abstraction layer that provides access to the underlying data sources it integrates with, removing the need to move or replicate data or to perform ETL.
Embedded data virtualization uses the same technology but embeds it into applications, data integration tools, data warehouses or DBMSs, as well as into things like iPaaS applications. The components that make up a data virtualization “stack” include a real-time data connectivity layer; a query federation engine for joining data across multiple data sources; caching and query optimization (because if data retrieval is slow, the application will not perform well); data transformation and custom views; metadata discovery; and a data consumption layer.
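To make those layers concrete, here is a minimal, purely illustrative Python sketch of how a request might flow through such a stack; every class, method and source name below is hypothetical rather than taken from any particular product.

```python
# Purely illustrative sketch of a data virtualization "stack".
# All names and data here are hypothetical, not from any real product.

class Connector:
    """Real-time connectivity layer: one adapter per underlying source."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows      # mock data stands in for a live source

    def fetch(self, predicate):
        return [r for r in self.rows if predicate(r)]


class VirtualLayer:
    """Federation, caching, transformation and consumption in one place."""
    def __init__(self, connectors):
        self.connectors = connectors
        self.cache = {}                        # caching / query optimization

    def query(self, source, key, predicate, view=lambda r: r):
        if key not in self.cache:              # avoid re-hitting a slow source
            self.cache[key] = self.connectors[source].fetch(predicate)
        return [view(r) for r in self.cache[key]]   # transformation / custom views


# Consumption layer: the application sees one interface for every source.
layer = VirtualLayer({
    "crm": Connector("crm", [{"id": 1, "name": "Acme"}]),
    "erp": Connector("erp", [{"id": 1, "total": 99.0}]),
})
print(layer.query("crm", "open-accounts", lambda r: r["id"] == 1))
```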
This provides applications with an abstract data layer in which every connected data source appears to be part of the same database, without needing to move the data. Eric Madariaga, co-founder and chief marketing officer at CData, gave this example from the world of business analytics: “Most BI tools can connect to multiple data sources; however, in order to join or aggregate data across sources they need to retrieve the entire data set from each source before operating client-side.”
This can have a significant performance impact in some scenarios, as different data source APIs can be slow or limited, or may even charge fees based on consumption.
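As a rough illustration of that client-side pattern, the sketch below pulls two complete data sets from hypothetical REST endpoints and joins them in memory; the URLs and field names are invented for the example.

```python
# A minimal sketch of the client-side join pattern described above.
# The endpoints and column names are illustrative only.
import pandas as pd
import requests

# Each call pulls the ENTIRE table from its source, regardless of how
# little of it the final report actually needs.
orders = pd.DataFrame(requests.get("https://crm.example.com/api/orders").json())
invoices = pd.DataFrame(requests.get("https://erp.example.com/api/invoices").json())

# Only now, with both full data sets in local memory, can the tool join
# and filter them; the expensive work all happens client-side.
report = (
    orders.merge(invoices, on="order_id", how="inner")
          .query("region == 'EMEA'")
)
```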
Talking about the CData solution, Madariaga said, “With our federated query engine, we can dissect a join query that spans multiple data sources, retrieve a subset of the data from each source that matches query parameters, and then join only those subsets of data.”
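A minimal sketch of that idea follows, with hypothetical source and field names and an in-memory stand-in for the connectivity layer: the filter is pushed down to each source, and only the matching subsets are joined.

```python
# Illustrative sketch of a federated join with predicate pushdown.
# Source names, columns and data are hypothetical.

MOCK_SOURCES = {
    "salesforce": [{"order_id": 1, "account": "Acme", "region": "EMEA"}],
    "netsuite":   [{"order_id": 1, "amount": 4200.0, "region": "EMEA"}],
}

def fetch(source, columns, region):
    # Stand-in for the connectivity layer. The filter is "pushed down":
    # only rows matching the predicate leave the source, not the whole table.
    return [
        {c: row[c] for c in columns}
        for row in MOCK_SOURCES[source]
        if row["region"] == region
    ]

def federated_join(region):
    # Retrieve only the matching subset from each source...
    orders = fetch("salesforce", ["order_id", "account"], region)
    invoices = fetch("netsuite", ["order_id", "amount"], region)
    # ...then join just those small subsets locally.
    amounts = {row["order_id"]: row["amount"] for row in invoices}
    return [
        {**o, "amount": amounts[o["order_id"]]}
        for o in orders
        if o["order_id"] in amounts
    ]

print(federated_join("EMEA"))
```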
And that elevates data access from a technology solution into part of the enterprise value chain, because the waste and the lag time involved in getting the data you need can be eliminated.
To embed data virtualization, Madariaga said the first step is to have a SQL abstraction layer for the underlying data sources. CData does this through its driver technologies, he said. “If you want to connect to a data source like Salesforce from a BI tool, you can use our ODBC driver and embed that in your BI tool and have full real-time bi-directional access to Salesforce and the underlying metadata and capabilities that are available within that platform,” he explained as an example.
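In practice that pattern looks something like the snippet below, which assumes a Salesforce ODBC driver has been installed and registered under a DSN; the DSN name, table and columns here are illustrative.

```python
# Sketch of an application talking SQL to Salesforce through an ODBC driver.
# The DSN name below is an assumption; use whatever DSN you configured.
import pyodbc

conn = pyodbc.connect("DSN=CData Salesforce Source")
cursor = conn.cursor()

# Plain SQL against a Salesforce object; the driver translates this into
# the underlying Salesforce API calls and returns live, not copied, data.
cursor.execute(
    "SELECT Id, Name, AnnualRevenue FROM Account WHERE AnnualRevenue > ?",
    1000000,
)
for row in cursor.fetchall():
    print(row.Id, row.Name, row.AnnualRevenue)

cursor.close()
conn.close()
```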
From a developer standpoint, this provides one point of integration and one common, universally understood interface (SQL) for working with data from any of the sources it is connected to.
“Think about the types of products and solutions working with data from disparate data sources,” Madariaga said. “BI and analytics is an obvious choice for embedded data virtualization technology. Data warehouse, data integration, data governance, data preparation… all of those data-centric, data-heavy platforms need the ability to interact with their customers’ data, wherever that data is.”
Content provided by SD Times and CData