Industry Watch: Microservices and scaling out Big Data

Column

Published: November 30th, 2016

This month’s column is a transcript of a fascinating conversation I had with two executives at Hadoop solution provider MapR: Jim Scott, director of enterprise strategy and architecture, and Jack Norris, senior VP of data and applications. Our subject was microservices and scaling Big Data.

SD Times: So, we know that scaling data can be a hassle. What is the impact of microservices on this issue?

Jack Norris: There are some complementary technologies that really are game-changing in terms of how to take advantage of [microservices]. The underlying data layer is an incredible enabler of microservices. If you’re doing microservices that are ephemeral and don’t require a lot of stateful data, then I think it’s pretty well understood and people can be quite successful with it. But the data issues drive a lot of complexity for the developers and for the administrators, and that’s an area that Jim has championed for quite a while, and his experience as an architect and a developer allowed him to grasp this and see it early on.

Jim Scott: There are two different ways to look at it when you look at the more ephemeral services. If you were to take just kind of a general front-end service that’s handling the primary load of a consumer-facing application, it’s probably not going to be doing a lot of work. It’s probably going to be handing off the workload to other services that are sitting behind it. Those services sitting behind it are the ones that are more likely to fall into this model. So, if you were to imagine companies building websites like Amazon, where it consists of 100-plus different service calls to a bunch of different back-end services, there’s the need to compile all the different information to bring back and build a user experience.

When you start looking at those services, being able to have a linearly scalable back-end data flow is pretty important. As you scale out your services, which are going to be doing some of the work, they need to figure out who the user is and what information is relevant to them, and then they need to give that information back to the front end to render a front end for the user. The compilation of those different data sets is pretty important. Being able to scale out that tier that is intelligent, where it’s clearly doing some level of computational work, is one thing. But in the same vein, without the data that it depends on, it can’t really do anything.

So, as you scale that service up, you will see how much work each instance of that microservice can perform. You know your scaling factors, and then you know based off of how many different services you have what your workloads are on your back-end data platform, and so when you exceed the total capabilities, you just add another server to that cluster. The same goes for whether it’s a streaming capability, a database capability or a file system capability.
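As a rough illustration of the scaling-factor reasoning Scott describes, the following Python sketch estimates how many back-end servers a given number of microservice instances would need. The function name and every number in it are hypothetical, chosen only to show the arithmetic; real figures would come from measuring your own services and cluster.

```python
import math

def servers_needed(instances: int, events_per_instance: int, server_capacity: int) -> int:
    """Estimate data-platform servers needed for the combined workload.

    instances            number of running microservice instances
    events_per_instance  measured events per second each instance generates
    server_capacity      events per second one back-end server can absorb
    """
    total_load = instances * events_per_instance
    # Round up: a partially used server still has to exist.
    return max(1, math.ceil(total_load / server_capacity))

# Hypothetical measurements, not figures from the interview.
today = servers_needed(instances=40, events_per_instance=5_000, server_capacity=100_000)
after_scale_out = servers_needed(instances=70, events_per_instance=5_000, server_capacity=100_000)

print(today, after_scale_out)  # 2 4
```

Once the per-instance workload is known, the point at which another server has to join the cluster falls out of a single division.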

When you start deploying microservices, imagine for just a moment that you are the software engineer: you need to have visibility into your services. That is to say, how fast are they performing? Are there bottlenecks? Are there certain types of requests that are coming in that are causing errors? So, when you look at performance and application monitoring, you must be able to emit data from these instances of microservices so that you can troubleshoot. In the old troubleshooting model, we typically did that by doing complete isolation for different servers, and then each server had its own logs, and you could just trace it that way. Trouble is, that doesn’t scale very well from a cost perspective.
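One way to picture "emitting data from these instances" is for every instance to publish small structured events to a shared stream, rather than writing to its own server-local log. The Python sketch below is only an assumption about what that could look like: the event fields, the function names and the in-memory list standing in for a stream are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for a shared metrics stream. A real deployment would
# publish to a messaging system instead of appending to a list.
metrics_stream = []

def emit_metric(service, instance, request_type, latency_ms, ok):
    """Publish one structured metric event for a handled request."""
    metrics_stream.append(json.dumps({
        "service": service,
        "instance": instance,
        "request_type": request_type,
        "latency_ms": latency_ms,
        "ok": ok,
    }))

def summarize(stream):
    """Aggregate request count, error count and average latency per service."""
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "latency_ms": 0})
    for raw in stream:
        event = json.loads(raw)
        entry = totals[event["service"]]
        entry["count"] += 1
        entry["errors"] += 0 if event["ok"] else 1
        entry["latency_ms"] += event["latency_ms"]
    return {
        service: {
            "count": t["count"],
            "errors": t["errors"],
            "avg_latency_ms": t["latency_ms"] / t["count"],
        }
        for service, t in totals.items()
    }

# Two instances of one service and one instance of another report in.
emit_metric("profile", "profile-1", "lookup", 10, True)
emit_metric("profile", "profile-2", "lookup", 30, False)
emit_metric("recommend", "recommend-1", "rank", 50, True)

summary = summarize(metrics_stream)
print(summary["profile"])
```

Because every instance writes to the same stream, the questions about speed, bottlenecks and error-prone request types can be answered per service without tracing logs server by server.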

The great thing is, if you imagine how it was done last year, or five years ago, or 10 years ago, however far back you want to go, they were the equivalent of multipurpose applications, monolithic if you will, and those applications had a long life cycle to be able to get updates into them. And the scaling factors for them were all or nothing. You typically put one instance on a server, and as soon as you realized that one instance couldn’t consume all the CPU, you then went back and figured out how to run multiple instances by doing some nice DevOps types of procedures that are much easier nowadays, making sure that things were listening on the proper ports so things could be load-balanced.

And then from a microservice perspective, if you think about it in a much more granular approach, I can now say, “OK, I know exactly what I have in each of these services, and I can monitor and measure their performance because I’ve decoupled the communication between the components.” If you don’t decouple the communication, you basically will not have a microservice model because everything will be tightly coupled and just fall over.

Having a decoupled communication model suddenly makes it where this front-end service can now say, “Hey, I’ve got a user that just showed up.” Here’s the message, you drop it on a stream, and then the application—or a cluster of an application—all working as a group, come in and say, “Give me the next message sitting here waiting for me to operate on.” It picks up that message, it does its work, and it puts the return message on another stream for the front end to listen to.
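The request/response flow Scott outlines can be sketched in a few lines of Python. This is a minimal, single-process illustration and not MapR's or Kafka's API: two `queue.Queue` objects stand in for the inbound and return streams, and the worker names and message fields are hypothetical.

```python
import queue
import threading

requests = queue.Queue()   # stands in for the stream the front end writes to
responses = queue.Queue()  # stands in for the stream the front end listens on
STOP = object()            # sentinel telling a worker to shut down

def worker(name):
    """One instance of the back-end service: take the next message, do the work, reply."""
    while True:
        message = requests.get()
        if message is STOP:
            break
        responses.put({
            "user": message["user"],
            "handled_by": name,
            "greeting": f"Welcome back, {message['user']}",
        })

# A small group of identical workers sharing the same inbound stream.
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()

# The front end only drops messages on the stream; it never calls a worker directly.
users = ["ana", "bo", "cy", "di"]
for user in users:
    requests.put({"user": user})
for _ in workers:
    requests.put(STOP)
for w in workers:
    w.join()

results = [responses.get() for _ in users]
handled = sorted(r["user"] for r in results)
print(handled)  # ['ana', 'bo', 'cy', 'di']
```

The front end and the workers share nothing except the two streams, which is why a worker can be added, removed or redeployed without the front end noticing.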

That decoupling is absolutely critical to be able to get the real scaling. As soon as you have the application decoupled, now, because of that, I can come in and say, “Oh man, I had a bug in this service.” Well, you don’t have to redeploy every application in the entire stack. You just go redeploy that one microservice. As long as you’re not changing your API, as long as you’re just fixing bugs, it’s not a big deal. It’s very easy, it’s very fast and it’s very fluid.

SD Times: What’s different from SOA?

Scott: I think the first thing that pops out is when you look at traditional message queues versus the Kafka model, log-based messaging that Kafka and MapR implement. When you look at that model, it is a high-speed, high-throughput persistent model, whereas [with] the message queues of yesteryear, people would be happy to get 50,000 or 60,000 messages per second to go through it. They’d be cheering. Now, it’s kind of an embarrassment if you’re talking about that because you have access to tools like Kafka and MapR Streams. The general expectation is that if you’re getting less than probably half a million a second, you probably don’t want to tell your friends about it. You’re probably messing something up somewhere.

So fundamentally, the scaling factor changed. In the old technology, if I needed one server to be able to handle 50,000 to 60,000 events per second, and I need to be able to handle 100,000 events per second today, 150,000 the day after that, suddenly, that’s a pretty expensive scaling factor. That’s pretty big. And so, going to microservices, where every component receives and sends its own events, every component you break out into a single-purpose microservice effectively now has an input and an output. So you multiply that by every different service you break out, and it grows very rapidly.

If your monolithic application had 20 or 30 different individual pieces that you pull out into microservices, now take your total events in and multiply it by 30, and then multiply it by 2 because you have an input and an output on every one of those, because it’s all decoupled, so you’re now up to times 60. Now think about the fact that you want to do monitoring, and you want to be able to get metrics out of all of these, and each one of those is going to have an event stream. So you are effectively up to 100x of what you originally started with. That’s not even creating anything new. It’s just shifting from one architecture to another.
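The multiplication in that paragraph can be written out as a short Python sketch. The input and output streams per service come from Scott's description; the assumption that each service adds exactly one metrics stream is mine, made only to show how the total approaches his "100x" figure.

```python
def event_multiplier(services: int, metrics_streams_per_service: int = 1) -> int:
    """Return how many times the original event volume flows through the platform.

    Each decoupled service has one input stream and one output stream.
    metrics_streams_per_service is an assumed value, not a figure from the interview.
    """
    streams_per_service = 2 + metrics_streams_per_service
    return services * streams_per_service

io_only = event_multiplier(30, metrics_streams_per_service=0)
with_metrics = event_multiplier(30)

print(io_only)       # 60
print(with_metrics)  # 90
```

With 30 services, input and output alone give the 60x Scott mentions, and one metrics stream per service brings the total to 90x, the same order of magnitude as his rounded 100x.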

That microservices model, really fundamentally, when you look at it as a factor of the scaling cost, it’s ridiculous on the old technology. It really doesn’t work well. It’s kind of illogical to even conceive of a microservices model on that old SOA/ESB type of architecture that was out there. They were just much more heavy-handed architectures then.

The concepts are still valid and true, it’s just the technology implementation models that were there couldn’t keep up with the speed, which made it too expensive to implement this type of a model. But the fundamentals of a message-driven architecture or an event-driven architecture, the value is there. People have seen the value, they understand the value, but now finally the costs of the technologies are there to be able to support these models.

SD Times: What does this mean for developers?

Scott: For software developers, number one, it’ll mean a pretty significant reduction in the workload needed to get changes into production, because a single service can be monitored, and a single service can be edited and put back into production with easy bug fixes. When you have API changes, that’s going to be about the same as it was under any old architecture, because you have to deploy and release multiple components for different versions. But fundamentally it is taking the burden of how to make things scale off software developers and architects.

And I will add, I like to try to point out to people that not everything is rainbows and unicorns. The fundamental point is that this is a new technology stack for most people, and they still have to get comfortable with the technologies. Using a microservices model, an event-driven architecture, requires a little bit of discipline and requires getting used to some technologies that people aren’t accustomed to using.

But I don’t see that as a big hurdle or even a steep learning curve. I just see it as people need constant reminding that this is a new technology; don’t just expect to be running on day one. Give yourself some time, set yourself up for success, give yourself the ability to prove that it works the way you need it to, and learn how to use the tools and technologies to support your use cases.

SD Times: What is MapR doing in this space?

Scott: The first is MapR Streams. MapR Streams is extremely fast. On a properly defined hardware stack (and by that I mean extremely fast networking), we’ve had benchmark tests done that show MapR, with 200-byte messages, can push through 3.5GB per second of throughput. That’s 18 million events per second sustained, or one and a half trillion events per day on a five-node cluster. Most people would never even come close to that. The reason why I think it’s important to point it out is because it helps put the proof out there that MapR’s not going to be the bottleneck. The technology will not bottleneck your capability, so you can focus on solving the problems. You know exactly what your scaling factor is once you start getting your payloads moving through.
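The benchmark figures Scott quotes are consistent with one another, as this short Python check shows. It assumes decimal gigabytes (1 GB = 10^9 bytes), which the interview does not specify.

```python
THROUGHPUT_BYTES_PER_SEC = 3_500_000_000  # 3.5 GB/s, assuming 1 GB = 10**9 bytes
MESSAGE_BYTES = 200                       # message size quoted in the interview
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

events_per_sec = THROUGHPUT_BYTES_PER_SEC // MESSAGE_BYTES
events_per_day = events_per_sec * SECONDS_PER_DAY

print(f"{events_per_sec / 1e6:.1f} million events per second")  # 17.5 million events per second
print(f"{events_per_day / 1e12:.2f} trillion events per day")   # 1.51 trillion events per day
```

So 3.5 GB per second of 200-byte messages works out to 17.5 million events per second, which rounds to the quoted 18 million, and to roughly one and a half trillion events per day.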

In the data platform itself, we have certain capability sets that greatly enable the users to get much more creative and think outside of the traditional boxes that they’ve been put into, with predominantly relational databases and such. When you look at MapR DB, it has the same underlying capabilities that MapR Streams and the file system have, and that is the ability to do things like snapshots and organize your data based on volumes.

So, when you think about a volume of data and a microservice, you have a stream, and you potentially have MapR DB for doing data persistence. Think about some full-fledged application stack where you say, “OK, I have users coming in through my front end and I want to capture every event.” You could have all of those events come in and put them on a stream, and that stream you could locate in volume A.

That stream of data coming into volume A, over time you consume it with your purpose-filled application stack, and you write all of your data into volume B. Maybe you’re creating some user profiles, you’re bringing in third-party data sets, and you’re creating software to make recommendations back to your user. So you get this built, you’re happy, everything’s great. Tomorrow, someone says, “Hey, we have to try an alternate approach because we think we can squeak more out on the front end or increase our revenue stream here because we have a better user profile we can build.” So you could create your alternate implementation of this codebase, and you could use volume C, and you could replay the entire history of the volume A stream. So from the time you started up until now, you could rerun it all through, and you could regenerate all of those user profiles that you had.

You could do profile-by-profile comparisons, and then you could actually use that to create something like an A/B test for your application to see what the performance is for your user profiles in your new implementation versus your old, and see if you want to switch over to it or trash it. And if you want to switch over to it, well, you already have it over there; you don’t have to migrate any data. You just switch off the old. And then instead of doing an A/B test, you just switch it over and you send all of your traffic over to that instance of that data set. So it opens up the door to just a plethora of opportunities for how you can pick and choose to use the data platform to support all the aspirations that you have from an innovative perspective for your business.
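The replay idea can be sketched in Python with an ordinary list standing in for the persisted stream in volume A. This is a conceptual illustration only and uses no MapR API: the event fields, the two profile builders and the dictionaries named after volumes B and C are hypothetical.

```python
# Hypothetical persisted event log standing in for the stream in volume A.
volume_a_stream = [
    {"user": "ana", "item": "book", "price_cents": 1200},
    {"user": "bo", "item": "lamp", "price_cents": 4000},
    {"user": "ana", "item": "pen", "price_cents": 2300},
]

def build_profiles_v1(stream):
    """Original implementation: a profile is just a purchase count."""
    profiles = {}
    for event in stream:
        profiles[event["user"]] = profiles.get(event["user"], 0) + 1
    return profiles

def build_profiles_v2(stream):
    """Alternate implementation: a profile also tracks total spend."""
    profiles = {}
    for event in stream:
        profile = profiles.setdefault(event["user"], {"purchases": 0, "spend_cents": 0})
        profile["purchases"] += 1
        profile["spend_cents"] += event["price_cents"]
    return profiles

# The stream is read, never consumed destructively, so both implementations
# can process the full history from the first event onward.
volume_b = build_profiles_v1(volume_a_stream)  # existing profiles
volume_c = build_profiles_v2(volume_a_stream)  # regenerated by replaying history

# Profile-by-profile comparison of old versus new.
for user in sorted(volume_b):
    print(user, volume_b[user], volume_c[user])
```

Because the new profiles already sit in their own volume, switching over is a matter of routing traffic to them rather than migrating data.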

The last thing I would say that is beneficial here is we have Project Spyglass that’s part of our product offering. It’s a single-pane monitoring application stack. We ingest metrics from the services running on top of MapR, and we also ingest all of the logs coming from the processes running on MapR. Over time, Project Spyglass will become more and more mature, and in the future it will be completely capable of supporting the entire microservices architectures that people will want, so that they don’t have to figure out how to monitor their microservices. It will just become part of the same single pane of glass that they use to monitor the rest of the data platform.

Norris: To take it back up to the 60,000-foot level, this is all about how you drive agility within an organization. It’s about the types of applications that can take data and analytics and bring them to bear on the operation, so that the analytics move from a reporting function to actually impacting business as it happens. That’s a huge area to exploit. Microservices and a converged data platform make that easier for organizations to do.

That insight…in the past was, how does an analyst do a query and somehow get better informed? The insight that we’re seeing now pertains more to automated actions, and how do you kind of bake that into the process. That’s a whole new frontier.

Article Tags

Big Data, data, Hadoop, MapR, microservices

About David Rubinstein

David Rubinstein is editor-in-chief of SD Times.


