Anyone who has run a full marathon knows about “the wall,” the point where runners feel like they can’t go any further. It typically hits around the 20-mile mark, which for many is the most challenging part of the race: their bodies send signals like muscle cramps and fatigue as a cry for them to stop. However, determined runners continue to put one foot in front of the other. Similarly, a marathon project is one that seems to have been running forever with no end in sight. Marathon projects typically leave teams frustrated and convinced that it will be impossible to deliver before the deadline (which has already been pushed back three times).

One of xMatters’ high-profile projects ran into this situation last year. The project, a new feature that lets end users get reminders of upcoming on-call shifts, seemed straightforward. However, as our team began to scratch the surface, they realized how complex it was: the volume of shift data was enormous, and the data could be combined in a practically unlimited number of ways. The project was pushed back so many times that we reached a point where we couldn’t push it anymore.

Test in production, a.k.a. ‘train on race day’
With our other options exhausted, we made the important decision to test the project in production. Here’s how the scene unfolded:

Q: “Wait a minute, did you say test in production? Or do you actually mean test with production-like data?”

A: “The best place to find production-like data is production, so we test in production.”

Q: “Doesn’t it sound too risky to run the project in production without proper performance testing? It’s like running a marathon without training.”

A: “The risks are calculated. It’s like training for a marathon on race day, with preparation.”

Before I explain how we made this work, let’s take a look at my experience training for my first marathon. Back in 2016, not long after I finished my first 10K race, I decided to give a marathon a shot. As an engineer trained to approach problems in a logical manner, I conducted a technical spike, otherwise known as a Google search. The search taught me that the most important part of training for a full marathon would be building my endurance. This meant I needed to get to a point where I could complete anywhere from 32K to 35K (roughly 20 to 22 miles) in a single run. However, I had a constraint: I really didn’t have the time. As a busy parent of two, I couldn’t afford to go on three-hour-long runs, not even occasionally. I came up with a plan:

  1. While I could not afford to participate in 30K runs, I would train up to 10K, but with a longer pace.
  2. I needed to figure out the areas I needed to work on most. 
  3. The best way to know a race is by doing it as often as possible. And I did just that. 

In 2017, one year after my venture into distance running, I ran my first marathon. I managed to walk/limp the last 10K across the finish line, with a finish time of more than four hours. This result was not ideal. However, I was able to get data that I hadn’t been able to see in any of my training runs, including the following:

  1. I did the first half in 1:42, so it built my confidence that I could do a half marathon in 100 minutes.
  2. The second half is much more difficult than the first. I needed to ease into the first half and hold back on my pace.
  3. Most importantly, preventing cramping needed to become my main focus. I needed to ensure I had proper hydration techniques, fuel and increased strength training.

Similar to my marathon experience, the marathon project we at xMatters were working on had multiple constraints. There was a hard deadline we couldn’t push anymore without risking the project being cancelled, and the features weren’t finished. Performance testing also hadn’t been done yet, as it required ample time that we just didn’t have. To meet the deadline, we decided to test in production. This carried some risks, including a bad user experience with a half-baked feature, flooded downstream services leading to an outage, and other services on the same cluster being impacted if CPU or RAM usage exceeded capacity.

To minimize the risk, we put the following risk control measures in place:

  1. Extensive monitoring: We set up extensive monitoring to ensure all metrics stayed in a safe range. At the same time, we gathered valuable data on traffic and usage patterns.
  2. Disabled live notification: The live notification portion of the feature was disabled. It was just micro brains doing their calculations in the back end without actually bothering end users.
  3. Feature turned on in canary fashion: The feature was turned on gradually, for example, for 5% of users in one cluster on day one, 10% on day two, etc.
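
To make measures 2 and 3 more concrete, here is a minimal sketch of how a percentage-based canary combined with disabled notifications might look. It is not the xMatters implementation: the function names (`bucket_for`, `in_canary`, `handle_upcoming_shift`), the hash-based bucketing and the return values are all assumptions made for illustration.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99 (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return bucket_for(user_id) < rollout_percent


def handle_upcoming_shift(user_id: str, rollout_percent: int,
                          live_notifications: bool = False) -> str:
    """Run the reminder logic for canary users, optionally without notifying anyone."""
    if not in_canary(user_id, rollout_percent):
        return "skipped"
    # Placeholder: the real reminder calculation would run here, so that
    # monitoring can observe its load on production data.
    if live_notifications:
        return "notified"
    # Notifications disabled: the work was done, but the end user sees nothing.
    return "computed_only"
```

Because each user's bucket is derived from a hash of the ID, raising the percentage from 5 to 10 keeps every day-one user in the canary and only adds new ones, which keeps the observed traffic comparable from day to day.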

The result was quite eye-opening. We discovered that the usage pattern was quite different from what we expected. Some of the traffic peaks came from an epic data sync that we hadn’t planned for. It was also clear that some of the expensive API calls caused performance problems.

Just like my marathon training, testing in production did the trick! Identifying the pain points was integral to making good progress, because it showed us which areas we needed to focus on the most. We then resolved these problems with actions like implementing caching to improve performance. We wouldn’t have been able to find these areas from performance testing alone, no matter how well it might have been defined. Testing in production gave us a good understanding of usage patterns and directed us to the areas that needed more work. Additionally, it uncovered a lot of unknowns and built the team’s confidence in delivering the feature.
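
As an illustration of the caching idea, here is a minimal sketch of a time-based cache placed in front of an expensive call. The post does not say what was cached or how, so the `TTLCache` class, its `get_or_compute` method and the time-to-live approach are assumptions chosen for the example, not a description of the actual fix.

```python
import time


class TTLCache:
    """A tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic is easy to test
        self._entries = {}   # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()  # the expensive API call happens only here
        self._entries[key] = (now + self._ttl, value)
        return value
```

A caller would wrap the expensive request, for example `cache.get_or_compute(user_id, lambda: fetch_shifts(user_id))`, where `fetch_shifts` stands in for whichever call the monitoring flagged. The time-to-live has to be weighed against how stale a cached result is allowed to be.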

The last push
Back to my marathon running. I didn’t make another attempt to run a marathon for quite a while. Even after almost two years, if someone asked me if I was ready for another 26.2-mile marathon, I would say, “no.” I’m NEVER ready! When I did decide to register for another race, my brain decided for me that a half would be plenty. Two weeks before race day, I asked myself, “How many more times will you have the opportunity to run a race like this?” I guess the answer is not many. I realized it’s important to start now; getting TO the start line is half the battle. That final push convinced me to go for another full marathon. Ready or not, I just ran it. And the result was not as bad as I had anticipated. Compared to the first attempt, my pace improved from 9:40 min/mile to 7:55 min/mile.

Similarly, in the final stage of any engineering project delivery, the team may need a final push. The last few miles are always the hardest, especially after all the preparation it took to get the team to where it is. Project managers need to make it clear that it’s time to get the project finalized, no matter what it takes. I have witnessed my team struggle against that daunting “wall” of a project, but I’ve also witnessed their determination to cross the finish line. When the team first started, it didn’t seem feasible to get the project out on time, especially given so many constraints. With determination, time management and a willingness to go above and beyond, we accomplished our goals and completed the project.

One of the greatest marathon runners, Eliud Kipchoge, is known for saying, “No human is limited.” A team, if empowered properly and with proper risk management, can make the impossible possible.