At QCon San Francisco today, Matt Ranney, chief architect at Uber, detailed the internal workings of the company’s software stack. The talk included references to three of Uber’s open-source projects, but also highlighted the chaos inside the organization.
When Uber was created, said Ranney, it was little more than an outsourced PHP and MySQL application. By 2010, the system was being reworked into what would eventually become the two primary services within the organization: dispatch, and the API, which was used to build applications for phones.
Once these two systems were built, however, Ranney said the team quickly found it was difficult building out two large services and maintaining them at scale. The journey into microservices began as a way to break up those large systems.
Ranney said that the Uber philosophy on microservices was to encourage them without reservation. As a result, new developers could join the company and work on a completely fresh project that would then become another microservice.
Since 2011, when the company had only two services, Uber has added more than 698 services to its retinue, and Ranney admitted that this was a tad much to deal with. To help make sense of the chaos of more than 700 moving parts, he advocated stopping as many as possible from moving at all.
“This is a crazy, very chaotic situation,” said Ranney. “I don’t know that microservices are the answer to all problems, but they’re the answer to a lot of problems if you put them in the right context. The idea I want to propose is that I think a microservice architecture works best if you have immutable infrastructure.
“Netflix showed us that once your infrastructure is immutable, you can count on it more. I think the same is true for microservices: After some point, every time you change them is when you risk breaking them.”
Thus, Ranney said that microservices should reach a point where they are—for lack of a better word—done. “After some amount of time and baking, I think maybe we should just not change something and, instead, build something new,” he said.
Tools for the job
While the rest of the world is still excited about HTTP and REST as a transport, Ranney said that the company’s internal systems had outgrown HTTP. They have also outgrown JSON, he added, despite the company’s dedication to Node.js.
This all began when the company created Ringpop, its first open-source project. Ringpop solved the problem of running large numbers of instances of the same Node and Python applications. Node and Python are single-threaded environments, and thus the solution to scaling is to spin up hundreds of instances of the application, rather than maximizing a single instance, as would be the case in Java.
With all those servers running at once, and all those systems requiring coordination across two, sometimes four different languages, Ringpop allows servers to keep track of work and which servers are active. It does this through efficient server gossiping, where systems randomly check each other and bring in other systems for double checks in case there is no response.
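The random check and double check described above can be sketched in a few lines of Python. This is only an illustration of the general gossip idea under simplifying assumptions: the Member class, the probe and pick_target functions, and the in-memory "network" are all hypothetical and are not Ringpop's actual API or protocol.

```python
import random


class Member:
    """One process in the ring. A hypothetical model, not Ringpop's API."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def ping(self, target, reachable):
        """Return True if target answers; reachable(src, dst) models the network."""
        return self.alive and target.alive and reachable(self, target)


def pick_target(prober, members, rng=random):
    """Choose a random peer to check."""
    return rng.choice([m for m in members if m is not prober])


def probe(prober, target, members, reachable, helpers=2, rng=random):
    """Ping target directly; on silence, ask up to `helpers` peers to double-check."""
    if prober.ping(target, reachable):
        return "alive"
    others = [m for m in members if m is not prober and m is not target]
    for helper in rng.sample(others, min(helpers, len(others))):
        # The helper must be reachable from the prober and must reach the target.
        if prober.ping(helper, reachable) and helper.ping(target, reachable):
            return "alive"
    return "suspect"


# Usage: the a -> b link is broken, but b itself is still up.
ring = [Member(n) for n in "abcd"]
a, b = ring[0], ring[1]
broken_links = {("a", "b")}
link_ok = lambda src, dst: (src.name, dst.name) not in broken_links
print(probe(a, b, ring, link_ok))  # "alive": c or d vouches for b
b.alive = False
print(probe(a, b, ring, link_ok))  # "suspect": nobody can reach b
```

The double check is what keeps a single bad link from marking a healthy server as down: a server is only suspected when neither the prober nor its helpers can get a response.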
Unfortunately, this created a lot of traffic on the network, said Ranney. HTTP may be a standard, but across four languages, the edge cases where HTTP was implemented differently were becoming a problem for Uber.
“We found HTTP was incredibly slow. When we were moving to this regular gossip thing and forwarding stuff around, HTTP started to be the biggest bottleneck. We decided to make our own RPC protocol that would go fast and worked well in Node and Python,” he said.
That protocol is called TChannel, and it is also open source. TChannel packets are specifically designed for speed. One aspect of this is putting the destination information right at the front of the packet, allowing a forwarding group of Ringpop servers to read the first few bytes of the packet, then simply pass it on down the line.
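To see why putting the destination first helps a forwarding server, consider the minimal Python sketch below. The frame layout here (a two-byte length, the destination name, then the payload) is an assumption made up for illustration and is not the real TChannel wire format; encode_frame, peek_destination, forward and the routes table are likewise hypothetical names.

```python
import struct

# Hypothetical layout, NOT the real TChannel wire format:
#   2 bytes  big-endian length of the destination name
#   N bytes  destination service name (UTF-8)
#   rest     opaque payload, never parsed by the forwarder


def encode_frame(destination: str, payload: bytes) -> bytes:
    """Build a frame with the destination at the very front."""
    name = destination.encode("utf-8")
    return struct.pack(">H", len(name)) + name + payload


def peek_destination(frame: bytes) -> str:
    """Read only the leading bytes needed to learn where the frame is going."""
    (name_len,) = struct.unpack_from(">H", frame, 0)
    return frame[2:2 + name_len].decode("utf-8")


def forward(frame: bytes, routes: dict) -> tuple:
    """Pick the next hop from the header alone and pass the frame on unchanged."""
    next_hop = routes[peek_destination(frame)]
    return next_hop, frame


# Usage: the forwarder never decodes the payload.
frame = encode_frame("dispatch", b'{"ride": 1}')
print(forward(frame, {"dispatch": "10.0.0.5:4040"})[0])  # 10.0.0.5:4040
```

Because the routing decision depends only on a fixed-position prefix, the cost of forwarding does not grow with the size or format of the payload.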
To bolster the effectiveness of TChannel, the Uber team also built tcurl to allow developers to use curl-like functionality during debugging, and they created libraries for parsing pcap files from networks containing TChannel traffic.
HTTP was not the only Web darling to be thrown overboard by Uber’s massive scale, either. “We’re getting out of the HTTP and JSON businesses,” said Ranney. “While JSON is wonderful in Node, once you get into the world of languages other than Node, JSON becomes problematic and slow. It’s also very strange to validate and to make sure you don’t break something when you change the response to your service.”
The Uber team has moved to Thrift, instead.
But that’s not the end of the Uber growth pains. Even with all these major engineering projects relatively complete and running, the company found that its service registry was being pushed to the brink.
Ranney said that the existing method of service registration came through the construction of a large virtual network with HAProxy. Every service got a port number, and listing out all the port numbers in use yielded a listing of all the services available.
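A port-per-service registry of this kind can be modeled roughly as follows. This is a toy sketch, not Uber's HAProxy configuration: the PortRegistry class, its method names and the deliberately tiny port range are assumptions chosen so that the finite size of the range is easy to see.

```python
class PortRegistry:
    """Toy model of a registry where each service owns one port (hypothetical)."""

    def __init__(self, first_port, last_port):
        self._free = list(range(first_port, last_port + 1))
        self._by_service = {}

    def register(self, service):
        """Give the service a port, reusing its existing one if it has one."""
        if service in self._by_service:
            return self._by_service[service]
        if not self._free:
            raise RuntimeError("no free ports left for " + service)
        port = self._free.pop(0)
        self._by_service[service] = port
        return port

    def list_services(self):
        """Listing the ports in use doubles as the service directory."""
        return sorted(self._by_service.items(), key=lambda item: item[1])


# Usage with an assumed two-port range.
registry = PortRegistry(9000, 9001)
registry.register("dispatch")
registry.register("api")
print(registry.list_services())  # [('dispatch', 9000), ('api', 9001)]
```

In this model a third registration would raise an error, since the range holds only two ports; a real port range is far larger but still finite.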
Unfortunately, Ranney said that the Uber environment is running out of port numbers. The solution, he said, was the creation of Hyperbahn, which links into all the previous open-source projects to allow for a single view of the available services inside the company.
“We ran into a lot of limitations with existing solutions that were caused by the fact we were using Node and Python. Thus we were spawning thousands of these instances. All the really good stuff for this, like Twitter’s Finagle, work really well if you’re on the JVM. Our call graph is totally unknowable in our current world,” said Ranney.
“We wanted to solve this problem. We don’t actually know what calls what, and it’s truly hilarious. As a result, we have repeatedly caused accidental self-inflicted DDoS attacks. If only we could know what things needed and could throttle things back so when something went wrong, we could contain the failure.”
Ranney said that Hyperbahn was the “logical extension of TChannel and Ringpop. We have a similar system, except all those router processes are in the same Ringpop ring, so they’re all gossiping with each other. You can scale that by adding more of these router processes,” he said.
With Hyperbahn now up and running, Uber has a better view of its many microservices. But that doesn’t mean the system is no longer a chaotic mess. Ranney urged attendees to read Netflix’s “Principles of Chaos Engineering.”
“We’re not taming the chaos; we’re embracing the chaos, and we’ve done our best to build systems to help us function as best we can,” he said.