Data Connectivity: An unrecognized, multi-issue problem

Published: April 27th, 2018

“Facts are stubborn, but statistics are more pliable,” wrote Mark Twain. Never has this been more true than when it comes to data connectivity. If you don’t have good connectivity to your data wherever it may reside, then it’s hard to build applications like artificial intelligence (AI) or analytics, which are very data-dependent technologies. Connectivity is a huge, largely unrecognized problem, according to Amit Sharma, CEO of CData. “If you don’t have good data, you won’t be able to have good AI solutions or analytics, or big data solutions,” he said. “I think it’s a problem that’s going to keep being more challenging and relevant in the market in the future.”

A seismic shift
There’s a dramatic shift from data for the sake of data to data for the sake of business, and this manifests itself in many different ways. Services described by terms like “self-service analytics” provide enough utility that business users can get the data they need and manipulate it in a very non-technical fashion. Tony Fisher, general manager of Magnitude Software, said, “Data for the sake of data is a very technical thing, and data for the business is a very business-oriented thing. Business users are more concerned about orders and customers or things that are more business-oriented than they are about table or column names. They want to access their data and manipulate it in terms of business needs.”

Fisher believes that it’s very important for technical staff to grasp the concept that data is really just an artifact that’s providing the business analyst with a business-oriented view into their data. “I think that’s one of the big shifts that’s going on now, and we’ll continue to see that for some time to come.”

Adaptability
Roi Avinoam, CTO and co-founder of Panoply Software, said he believes that adaptability plays an important role in how companies master their data connectivity. “I think what some people might miss is that really the way to master data and get insights comes from being adaptable,” he said. “All the time, I notice that when people talk or think about data, the solutions proposed are always the ones where you have to review all the data you have, and then review all the business questions, and ideas for insight requirements, and then after you map these out you come up with solutions. It’s great for about two to three months.”

He added that the “problem is, after you’ve done all of that, the business changes, the industry changes, the market changes, the APIs and the data that you have change, and then you have to do it all over again. And that’s the kind of state of mind that I think we need as an industry to evolve.”

Avinoam is an engineer, so he compares it to software engineering’s Waterfall development process. “Basically we have to design everything up front, and we have to figure out how it needs to work, and then we go and develop it, and it was so rigid. You can’t make changes, and it’s not adaptable. Now development teams are incredibly agile, right?” He pointed out that now an idea can be brainstormed in the morning and it’s shipped to production that night. He’s trying to impose this agility on his team and wants the industry to follow. “One day is too much, in my opinion. We need to be able to think up ideas and try them out and make drastic changes overnight, without having a big price to pay for it. It should be encouraged, it should be a positive experience that we’ll rotate our entire state of mind to think of completely different types of data or different connections that we might do. And execute on it in a day or two.”

He emphasized that the issue really isn’t how you solve your current problem; that’s easy. What’s important is thinking about how you are going to solve an endless stream of problems, challenges and opportunities that may hit you on a daily basis, and ensuring that every system keeps up.

Maintaining data consistency
Ensuring data consistency when new technologies like microservices and cloud-based distributed applications come into play is no easy task. There’s been a pretty significant shift in the overall approach over the past couple of years, according to Dion Picco, vice president of product management and product marketing for Progress. He uses data warehousing as an example. “The data warehouse approach of more of a record-oriented, relational or star schema approach, has really served its purpose well. It’s certainly going to continue as a pretty prominent standard, since it’s the whole process behind Extract, Transform, Load (ETL). What I’m seeing that’s largely driven in some ways by the whole data science, big data movement, is a move away from ETL and a move more towards ELT style.”

He explained, “Instead of extracting data from systems, transforming it into the format you want for long term storage, and then putting it into your data warehouse, the ultimate move from just a pure warehousing data perspective has been just dump everything into a data lake. There is no qualified set of records necessarily. You basically have a data lake of stuff that you load and transform on demand. This has really given rise to things like data prep tools, as an example of something that previously was part of the ETL process and driven by IT. Now it’s driven in terms of data scientists, and various folks on the business side who need access to the data when they want, leveraging more citizen-oriented tools to do the data transformation, data access piece.” Using this new approach preserves the fidelity of the original data set in a way that your typical ETL process doesn’t. This represents a fundamental change.
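The contrast Picco draws between ETL and ELT can be sketched in a few lines. This is only an illustration under stated assumptions: the sample records, the table names and the function names below are hypothetical, and an in-memory SQLite database stands in for both the warehouse and the data lake.

```python
import json
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"id": 1, "customer": " Alice ", "amount": "19.99", "note": "gift"},
    {"id": 2, "customer": "BOB", "amount": "5.00", "note": None},
]


def etl(conn, records):
    """ETL: transform first, then load only the cleaned columns."""
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    cleaned = [
        (r["id"], r["customer"].strip().lower(), float(r["amount"]))
        for r in records
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
    conn.commit()


def elt_load(conn, records):
    """ELT, step 1: load every record untouched, as raw JSON text."""
    conn.execute("CREATE TABLE raw_orders (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?)",
        [(json.dumps(r),) for r in records],
    )
    conn.commit()


def elt_transform(conn):
    """ELT, step 2: transform on demand, at read time."""
    rows = conn.execute("SELECT payload FROM raw_orders").fetchall()
    result = []
    for (payload,) in rows:
        r = json.loads(payload)
        result.append((r["id"], r["customer"].strip().lower(), float(r["amount"])))
    return result
```

In the ETL path the "note" field is discarded before storage, so it cannot be recovered later. In the ELT path the original payload is kept whole, which is the fidelity point made above: a different transformation can be applied to the same raw data at any time.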

He described the microservices landscape in general and hybrid architectures. “What’s typically happening here is every service often has its own database. So a customer service might have a customer database. A product service might have a product database, and as you scale these services up, the horizontal scalability of that service needs to make sure that they have a consistent view of that data.” It’s not as simple as it sounds because you may hit a level of scalability that you can’t grow beyond.

On the other hand, he said he believes the simplest pattern is still the most dominant one, which is to not have one database per microservice. The end game, according to him, is: “You end up with a shared database architecture behind all of the microservices much as in the old style architecture, but it still removes a lot of that need for dealing with the nuances of this because you ultimately defer to the database system. So if you have a clustered database system, you’re gonna hit a certain level of scalability, and because everybody’s using a shared database, you don’t have those issues of consistency to worry about in the same way.”
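A minimal sketch of what “deferring to the database system” can look like: two logical services, products and orders, write to one shared database, and a single transaction keeps their tables consistent. The schema and the function names are hypothetical, and in-memory SQLite stands in for a clustered database.

```python
import sqlite3


def make_shared_db():
    """Create one database that both logical services use."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER)")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, product_id INTEGER, quantity INTEGER)"
    )
    conn.execute("INSERT INTO products VALUES (1, 3)")
    conn.commit()
    return conn


def place_order(conn, customer_id, product_id, quantity):
    """Reserve stock and record the order as one atomic unit."""
    # Used as a context manager, the connection commits on success and
    # rolls back if an exception escapes the block.
    with conn:
        conn.execute(
            "UPDATE products SET stock = stock - ? WHERE id = ?",
            (quantity, product_id),
        )
        (stock,) = conn.execute(
            "SELECT stock FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if stock < 0:
            raise ValueError("insufficient stock")
        conn.execute(
            "INSERT INTO orders (customer_id, product_id, quantity) VALUES (?, ?, ?)",
            (customer_id, product_id, quantity),
        )
```

If the stock check fails, the stock update is rolled back together with the order, so no reader ever sees one change without the other. With one database per service, the same guarantee would need extra coordination between the services.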

Picco pointed out a third approach: “Obviously you can relax certain constraints, and so if you’re not dealing with fully transactional environments and you can deal with eventually consistent sort of scenarios, there’s a wealth of new databases to choose from like Apache Cassandra and Spark. There’s a lot of infrastructure built on the Hadoop ecosystem today. I mean, we just had an explosion of different databases that are really fit for purpose, and so if your purpose isn’t high-scale online transaction processing (OLTP), then likely there’s a database to fit your need.”

He listed several NewSQL vendors that also achieve a different level of performance but still enable a fully ACID database. “Google Spanner is a great example. You’ve got ClustrixDB and VoltDB that are out there. They’re combining the best of in-memory along with new architectural patterns from an OLTP perspective that I think a lot of transactional-style applications need.”

APIs vs drivers in the new world
The difference between an API and a data driver is that an API is a specification that describes what to do, while a driver is an implementation that describes how to do it. They’re both still relevant in the modern services world. JDBC and ODBC are standards that have been around for more than a decade. According to CData’s Sharma, they’re technologies that are going to stay around. He said, “The choice of connectivity or any of these driver technologies is dependent on the platform choice that people make. While ODBC is still very popular, I see a little bit of a trend, not much, of people moving away from the native driver technology. ODBC is a C/C++-based technology, and I see people moving to JDBC and ADO.NET instead of that. In some enterprises, I also see resistance to JDBC, just because of how Oracle is handling Java. Separate issue, but still Java is very popular. I see a little bit of a trend with people moving away from native. People had that impression that native technologies were required for performance reasons, but I don’t think that’s true anymore.”

The Java and .NET runtimes have matured so much that they are now comparable to native technologies in performance, and they offer other advantages on top of that.
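The split between a specification and its implementation can be illustrated with a Python analogue, which is an assumption of this sketch rather than something the article discusses: the DB-API 2.0 specification (PEP 249) plays the role of the API, and the standard-library sqlite3 module is one driver that implements it. The helper name count_rows and the customers table are hypothetical.

```python
import sqlite3


def count_rows(connection, table):
    """Count rows using only calls defined by the DB-API 2.0 specification.

    `connection` may come from any conforming driver. `table` is assumed
    to be a trusted identifier, because table names cannot be bound as
    query parameters.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        return cursor.fetchone()[0]
    finally:
        cursor.close()


# sqlite3 is one driver that implements the specification.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (name TEXT)")
cur.executemany("INSERT INTO customers VALUES (?)", [("Ada",), ("Grace",)])
conn.commit()
print(count_rows(conn, "customers"))  # prints 2
```

The function depends only on what the specification promises (cursor, execute, fetchone, close), so a connection from another conforming driver could be passed in unchanged. Details the specification leaves open, such as the parameter placeholder style, still vary by driver.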

Mobile platforms are taking off in popularity. Sharma noted, “Our driver technologies offer direct connectivity from the driver to the data source. What’s more popular with the mobile platforms, is to go through an intermediary. If you’re building a mobile application, what you would do is, that application would talk to some server somewhere, which might be, again, JDBC or ADO.NET, or something, and that’s where the connectivity would happen, instead of building the connectivity right into the application on the device itself.”
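The intermediary pattern Sharma describes might look roughly like the following. It is a sketch only: the function names and the JSON message shape are hypothetical, and a direct function call stands in for the HTTP request a real mobile application would make.

```python
import json
import sqlite3

# Server side: the only place where a database driver is loaded.
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
_db.execute("INSERT INTO customers VALUES (1, 'Ada')")
_db.commit()


def handle_request(body: str) -> str:
    """Hypothetical server endpoint: JSON text in, JSON text out."""
    request = json.loads(body)
    row = _db.execute(
        "SELECT id, name FROM customers WHERE id = ?", (request["customer_id"],)
    ).fetchone()
    if row is None:
        return json.dumps({"error": "not found"})
    return json.dumps({"id": row[0], "name": row[1]})


def mobile_client_lookup(customer_id: int, send=handle_request) -> dict:
    """Device side: no driver, only a serialized request to the server.

    `send` stands in for an HTTP call to the intermediary.
    """
    return json.loads(send(json.dumps({"customer_id": customer_id})))
```

The device code never touches a connection string or a driver. The data source behind the server could change without shipping a new version of the application.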

The challenges that data connectivity presents are multi-faceted and require that key players on both the business and technical sides of the issue collaborate and come up with innovative solutions. Sharma predicted, “People think connectivity’s easy, but it’s going to take a lot of effort to actually get it right.”

Article Tags

analytics, APIs, data, data connectivity, software architectures

About Alyson Behr



