Almost every successful startup company reaches a stage where the “quick and dirty” way of coding becomes a quality liability and makes it difficult for the company to continue to grow and develop. At MyHeritage we encountered this issue, but because we were already running a successful website and supporting a huge user base of millions, we couldn’t just stop what we were doing and “refactor” the required code. Doing a quality change in parallel with our company’s growth in product and complexity was an extreme challenge that required the right state of mind and the right set of people.
This, then, is the story of how MyHeritage created a real quality change in an extreme environment.
A couple of days after I arrived at the company, I got some courage and decided to code something small by myself. My hands were a bit shaky, but I was thinking to myself, “What could possibly go wrong? If I really make a mess, the unit and integration tests will detect it.”
After I wrote a small piece of code, I planned to add a test and asked for a colleague’s help. “Unit what?” he answered. “Just open the website, browse to the page your code affects and see what it looks like there. Worst case, QA will let you know if it doesn’t work.”
The team understood very quickly that this is not the way to go, and that we needed to do some rethinking about some of our engineering practices.
The first step: Coding standards
The first thing we agreed to start with was adopting coding standards to be used by all developers. So with the help of our CTO and feedback from all developers we decided on the coding standards for MyHeritage that included in addition guidelines for error handling and the way to deal with legacy code.
Like any beginning, this one was not easy. In one code review after another, developers commented to each other about coding standards and asked people why they have warnings in their code. But after a while people have forgotten they used to code in a different manner.
Right after presenting the coding standards to R&D, we started working on having unit testing guidelines and an education plan. Beyond the obvious benefits of unit testing, these tests contribute a lot to code quality and keep classes coherent with well-defined roles and responsibilities.
We spent some time on initial research. As time was short and lots of tasks were in the pipeline, we needed to find a creative way to integrate our solution within the existing resources. So we decided to start with a pilot with only one team member. We presented the initial direction to the rest of the team and started with the pilot team member. We built on our knowledge and made ourselves familiar with advanced topics in how to implement unit testing.
After a few weeks, we presented our pilot experience and what we learned, and we extended the pilot. After a few more weeks everyone joined, and as time went by, writing unit tests as part of the coding tasks became the de facto habit of every team member.
The next phase: System testing
Some time after adopting the concept of unit testing, we started to dream about having a framework for automatic testing for QA, which was a top need for the company. So we did some research and decided to go on a pilot with Cucumber. (Cucumber is a testing tool notable for being created with test-driven development techniques.) We started developing the infrastructure and built a pilot for one of the most important flows in our website. When this was done, we presented it to the QA manager, who was willing to give it a try and allocated one of her team members for the pilot.
There was a lot of passion about this automated testing system, and our belief in the benefits that automated testing would bring to the company really helped get the project off the ground. After the successful pilot phase, the QA team started building more tests scenarios. Now, any skepticism that might have existed about the efficacy of automated testing has vanished, and all QA engineers have gone through training sessions. Now, part of MyHeritage’s definition of “done” for each feature is beginning to include automated tests.
Creating a risk-free staging environment
Before, our staging environment worked with our production database, which was a huge risk to production.
Myself, our CTO and one of my talented team members worked on a plan to build a true staging environment that will not be able to access, even by mistake, the production environment. Our IT team also lent a hand, and we have now a real staging environment that allows in one click building new environments—copying complete data from the production environment—to apply DB schema changes and test them before they reach production, among other benefits.
Technical design phase
During some code reviews, we noticed that often some basic questions are left unanswered. How is QA going to test it? How is this solution going to be scaled up? We realized that major changes are required in the code to allow testing before it goes live, and that beyond this, significant changes are required to allow the solution to become as scalable as it needs to be.
Another thing we realized is that we don’t have a proper design phase where developers refer to the non-functional requirements in addition to the functional ones, get feedback from others, and add design-related documentation to the feature that is about to be developed.
So in one of the next features that we developed, we created a short document that presented the feature’s design. We discussed the design document with the team members and got feedback from them not only about the actual design, but also about instituting a process of designing features and getting feedback as a procedure for the team. Most team members were in favor of the idea of having such template, and we decided to give it a shot.
Since we implemented and went through some rounds of this new design phase in our development cycle, we have managed to communicate clearly all aspects of the design to all stakeholders, reduce the overall development time of features, and deliver features with higher quality.
An important part of any quality process is collecting quality attributes data and making them visible. So first the team suggested finding a way to collect all PHP errors from all Web servers, daemon servers, etc. to a single DB table, where all application errors were already reported. After we had these results, we exposed them as part of R&D metrics so everyone would be aware of the situation.
Then we decided to add a mechanism for logging all long DB queries in the system so we have all information regarding them. Exposing this information not only helped us improve the response time of the website, but also enabled us to track serious contention issues on the DB that were spotted easily.
Later on we decided to expose all important R&D metrics such as average home page load time and others—issues that we act on regularly—to monitor and help us improve our performance by looking at “small measurable” changes.
Automatic monitoring tools
We were satisfied with the progress we had made so far, and we aimed to achieve more. Thus, the next issue we decided to focus on was the manual quality operations we have performed. Because they were manual and took precious time away from other activities, we performed them only once a week. But MyHeritage could not afford to wait that long to find important quality issues.
So next we gave priority to monitoring our application and PHP errors. We decided to automate the error-monitoring process, run analyses twice per day, and send an e-mail that summarizes all new errors and all existing errors that grew significantly. Errors were spotted the same day they reached production.
Another quality measure that is under QA’s responsibility is going over all our statistics charts and looking for anomalies in the number. This might not sound like a big deal, but imagine going over 2,500 charts all by yourself. This is not fun, and definitely not efficient.
So once again, we came up with the idea of building an infrastructure (coming soon as open source) to analyze all our charts’ data automatically. Our mathematically inclined developer built our system to be highly flexible and to allow the user to add charts to white lists, define thresholds, replace the analysis algorithm and much more.
But what is a true quality change?
So everything we’ve talked about so far definitely contributes to quality, but is this a quality change? Not to me. A quality change to me is what the organization is experiencing right now as I write this article.
A quality change to me is when my manager keeps asking me why the unit/integration tests are not running automatically and continuously.
A quality change to me is when one of my team members goes in his free time to a continuous integration conference, then comes back and launches a great system that runs all unit testing after each commit to the source control.
A quality change to me is when team members complain that we are not doing enough code reviews, and when people come with their own initiatives regarding tests.
A quality change to me is when all teams in the company talk about unit testing, when people argue about mocks vs. stubs, when the client team has continuous build and automatic sanity tests that come right after the build, when all teams have a flavor of the design template used by the back end team, when refactoring happens all the time, and when my manager expects and demands FUC (Feature flags, Unit testing, Code reviews) from everyone.
A quality change to me is when management agrees to stop all R&D activities, despite business needs, for three weeks to reduce the number of errors and develop a quality supporting mechanism. When all of R&D is recruited to complete their tasks, and when management asks when the next time you will be doing a quality sprint, that’s a mindset change around quality.
Putting it all together
So what have we achieved so far?
• Coding standards so everyone codes using the same style
• Unit testing that runs on every commit so we know we are not breaking anything
• System testing so we know end-to-end functionality still works
• Increasing visibility so we always what’s going on
• Automatic monitoring tools so we can easily monitor the production.
Ran Levy leads the back-end team at MyHeritage, where he has worked with the management team to create a culture of “quality as a way life” within the back-end team and R&D. Ran has 15 years’ experience in the technology industry both as a developer and architect in complex large-scale systems.