Once you hit your stride with microservices and are able to iterate more quickly, find and fix bugs faster, and introduce new features rapidly, it is crucial not to go overboard. You may be tempted to move every piece of your infrastructure to a microservice architecture, but as one company found out, not all monoliths are worth changing.
The infrastructure at Segment, a customer data infrastructure company, is currently made up of hundreds of different microservices, but there is one piece where the company took the approach too far. According to Calvin French-Owen, CTO and co-founder of Segment, the company decided to move to microservices because the architecture tends to allow more people to work on different parts of the codebase independently.
With microservices, more isn’t always better
However, Segment decided to try to split up a piece of its infrastructure based on where data was being sent and not based upon the individual teams using or making changes to that service. For instance, Google Analytics had its own service, Salesforce had its own service, Optimizely had its own service, and so on with each data destination Segment provided. The problem here, however, is that there are more than 100 types of destinations like this. So, instead of being able to scale the product, the company was creating more friction, more operational overhead and more operational load every time a new integration was added on, explained French-Owen.
Alexandra Noonan, software engineer for Segment, explained in a blog post: “In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded.”
Noonan explained that in addition to the overhead that came with adding new integrations, there was a great deal of overhead in maintaining the existing ones. Among the problems the team encountered was that while the services shared libraries, the team didn’t have a good way of testing changes to a shared library across all the services. “What happened is we would release an update to this library, and then one service would use the new [version] and now all of a sudden all these other services were using an older version and we had to try to keep track of which service was using what version of the library,” Noonan told SD Times.
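To make the version-tracking problem concrete, here is a minimal Python sketch of the kind of check a team in this position might run. It is not Segment’s tooling: the service names, version numbers, and the `lagging_services` helper are all hypothetical, and it assumes each service pins the shared library with a simple dotted numeric version.

```python
def parse(version):
    """Turn a dotted version such as "1.9.3" into a comparable tuple (1, 9, 3)."""
    return tuple(int(part) for part in version.split("."))


def lagging_services(pinned):
    """Return the services whose pinned library version is older than the newest one in use."""
    newest = max(pinned.values(), key=parse)
    return sorted(name for name, version in pinned.items() if version != newest)


# Hypothetical data: which version of the shared library each destination service pins.
pinned = {
    "google-analytics": "2.1.0",
    "salesforce": "1.9.3",
    "optimizely": "1.9.3",
}

print(lagging_services(pinned))
```

With more than 100 destination services, a report like this grows with every library release, which is the bookkeeping burden Noonan describes.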
Other problems included autoscaling the services, each of which had a radically different load pattern depending on how much data it processed, and reducing the number of differences among the services.
“It got to a point where we were no longer making progress. Our backlog was building up with things we needed to fix. We couldn’t add any new destinations anymore because these teams were just struggling to keep the system alive,” said Noonan. “We knew something had to change if we wanted to be able to scale without just throwing more engineers on the team.”
That change came in the form of going back to a monolith, but it wasn’t easy to bring all these microservices into one service again. Because each service had its own individual queue, the team had to devise a new way to bring everything back together. The company developed an entirely new product called Centrifuge to replace the individual queues and be solely responsible for sending events to the monolith.
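The general shape of that consolidation, one queue feeding one service that holds the code for every destination, can be sketched in a few lines of Python. This is only an illustration of the idea and says nothing about how Centrifuge or Segment’s monolith is actually built; the `destination` decorator, the handler functions, and the event format are all hypothetical.

```python
from collections import deque

# Hypothetical registry: every destination's code lives in one process.
HANDLERS = {}


def destination(name):
    """Register a handler function under a destination name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@destination("google-analytics")
def send_to_google_analytics(event):
    return f"ga:{event['type']}"


@destination("salesforce")
def send_to_salesforce(event):
    return f"sf:{event['type']}"


def drain(queue):
    """Deliver every queued (destination, event) pair; unknown destinations yield None."""
    delivered = []
    while queue:
        name, event = queue.popleft()
        handler = HANDLERS.get(name)
        delivered.append((name, handler(event) if handler else None))
    return delivered


# A single shared queue stands in for the per-service queues.
queue = deque([
    ("google-analytics", {"type": "page"}),
    ("salesforce", {"type": "track"}),
])
print(drain(queue))
```

In a layout like this, adding a destination means adding a handler rather than deploying and scaling another service.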
Overall, the company was operating these different data integrations for about three years until it reached an inflection point where it realized things were becoming too difficult to manage. Centrifuge was built over the course of six to nine months and it wasn’t until that piece of infrastructure was built that the company was able to put the codebase back together again in a way that was easier to work with.
“With microservices, you have a tradeoff where the nice reason to adopt them is it allows more people to work on different parts of the codebase independently, but in our case it requires more work operationally and more operational overhead to monitor multiple services instead of just one,” said French-Owen. “In the beginning, that tradeoff made sense and we were willing to accept it, but as we evolved we realized there were only a couple of teams contributing to the service and there were just way too many copies of it running around.”
“Now, we rarely get paged to scale up that service. It scales on its own. We have everything in one monolith that is able to absorb these spikes and load, and handle them much easier,” Noonan added.