Blame the cloud, DevOps, consumer demand or continuous delivery. No matter the reason, a wide variety of applications are now aiming for high availability (HA) — and increasingly, that overlaps with planning for disaster recovery. Too many software organizations not only lack tools that can help; they also fail to test their disaster recovery plans until it’s too late.
Brian Bowman, senior principal product manager for Progress® OpenEdge®, has plenty of experience in improving availability. A 20-year veteran of Progress, he’s performed database tuning and disaster planning for customers of all sizes around the world. According to Bowman, in addition to some process-oriented changes, new aspects of the Progress OpenEdge 11.7 application development platform, including OpenEdge Replication Target Synchronization, can help you succeed — at failover.
What change are you seeing among your customers in terms of the need for high availability applications?
What we’re finding is that always-on, 24x7x365 operation, depending on the vertical or business, is no longer seen as just potentially possible — it’s expected.
But that’s an ops function, isn’t it?
Historically, the people in an IT organization who maintain and restore a system aren’t the same people who develop the application. Today, those two groups are working together. High availability has always been about making sure the app is up and running. If the database is up, but users can’t use it, that’s not good enough.
Anecdotally, we have customers where one team develops the application and then turns it over to operations people to deploy — but then the app doesn’t work. They go back to the developers and say it’s not working. Dev says, “It worked on our end.” As the saying goes, “If it compiles, ship it.”
Continuous delivery is in part driving this. Some of our customers are making changes daily to their applications. It becomes important for them from a continuous integration/delivery (CI/CD) standpoint to be able to tie the changes they make to what’s being deployed.
Is there a new way to deal with this challenge?
From the Progress OpenEdge standpoint, we’re attacking it from two angles, maintainability and disaster recovery, with two corresponding features: online index activation and Replication Target Synchronization.
From the maintainability perspective, continuous delivery requires that the application be continuously running. Starting in OpenEdge 11.7, we’re helping our customers maintain that system in a near-24x7x365 environment.
Part of it is updating the application on the fly. You also want those changes to be accessible and usable immediately, not put off until the weekend when you can take a two-hour outage to reboot.
Online index activation lets you add a new index to the schema on an OpenEdge database without shutting down. If you have dynamic queries, they can make immediate use of new indexes.
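To make the mechanism concrete, here is a minimal Python sketch of the general pattern — not OpenEdge ABL, and the class and function names are illustrative. Because a dynamic query is planned at execution time against the live schema, an index activated online is visible to the very next query, with no restart or recompile.

```python
# Illustrative sketch (not OpenEdge ABL) of why dynamic queries can use a
# newly activated index immediately: the planner consults the live index
# catalog on every execution, so no restart is needed.

class IndexCatalog:
    """In-memory view of the schema's currently active indexes."""
    def __init__(self):
        self.active = {"customer": {"cust-num"}}   # pre-existing index

    def activate(self, table, index):
        # Online activation: the catalog changes while the DB stays up.
        self.active.setdefault(table, set()).add(index)

def plan_query(catalog, table, where_column):
    # A dynamic query re-plans on each execution, so it sees catalog
    # changes made since the server started.
    if where_column in catalog.active.get(table, set()):
        return f"INDEX SCAN on {table}.{where_column}"
    return f"FULL TABLE SCAN on {table}"

catalog = IndexCatalog()
print(plan_query(catalog, "customer", "zip"))  # FULL TABLE SCAN on customer
catalog.activate("customer", "zip")            # added online, no shutdown
print(plan_query(catalog, "customer", "zip"))  # INDEX SCAN on customer.zip
```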
So are most HA concerns database-oriented?
This is an industry-wide shift in disaster recovery. Everyone is focused on data and databases, but we look at the whole application as a holistic solution. It is much bigger than the database. Online index activation is about making everything available to users.
Is there a competing standalone solution that does something like online index activation?
Really, no. The competing approach is to accept that you need to shut down, or simply to take an outage, to do the maintenance.
And what about disaster recovery?
The last thing you want is to have two disasters back-to-back.
In disaster recovery/HA, one of the rules is that you eliminate your single points of failure.
Where we really see this applying is in cloud-based and SaaS apps. If I’m a consumer using an app and it goes down and doesn’t come back within, say, 2-8 seconds, I’m going someplace else.
OpenEdge Replication Target Synchronization means that in the event of a disaster, there is no single point of failure. It’s a three-pronged failover approach using two backup databases, either of which can automatically take over as the source when needed to keep the system up.
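The shape of that failover can be sketched in a few lines of Python; the names here are hypothetical, not the actual OpenEdge Replication API. A source ships changes to two targets, and when the source fails, one target is promoted while the survivor re-attaches to it, so redundancy survives the transition.

```python
# Illustrative sketch of a three-database failover: one source, two
# replication targets, and a promotion step that restores redundancy.

from dataclasses import dataclass, field

@dataclass
class Database:
    name: str
    role: str                                # "source", "target" or "offline"
    targets: list = field(default_factory=list)

def fail_over(failed_source, candidates):
    failed_source.role = "offline"           # the old source is out of service
    new_source = candidates[0]               # promote the first healthy target
    new_source.role = "source"
    new_source.targets = candidates[1:]      # re-point the surviving target(s),
    for db in new_source.targets:            # so the promoted database is not
        db.role = "target"                   # itself a single point of failure
    return new_source

prod = Database("prod", "source")
dr1, dr2 = Database("dr1", "target"), Database("dr2", "target")
prod.targets = [dr1, dr2]

new_source = fail_over(prod, [dr1, dr2])
print(new_source.name, [t.name for t in new_source.targets])  # dr1 ['dr2']
```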
One of our customers is a wholesale distribution ERP vendor running a new SaaS business in the cloud. This is the technology they need to deliver to their customers, and it needs to be available all the time.
Not only that, the solution helps with maintaining the app as well. You can configure automatic or manual transitions for unplanned or planned downtime. It serves both purposes by allowing you to move from a production machine to a maintenance machine.
Software teams often struggle with basic failover and recovery. You see systems in place, but they end up not working at the critical moment. What kinds of mistakes do you see in disaster planning?
The single biggest mistake people make is not testing their plan.
I was visiting a company that had prepared disaster recovery plans. A storm was coming through. They had a power outage, and they were very proud of their computing center, so we went into their computer room to see if failover was happening correctly. As we were looking at the computers through the window, they realized their entry security systems — the card readers to get into the computer room — were not on backup power. Their backup generators were designed to run for only 60 minutes. So we watched their systems go down one by one as the power ran out.
It was tough to just stand there and watch them shut down. In disaster recovery or business continuity, you’ve got to test all the parts.
But the biggest challenge organizations have is simply declaring the disaster. If they’re confident in their ability to fail over, they should act quickly to meet business requirements such as service level agreements.
How do you make that decision, then?
There are two key metrics in disaster recovery: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is how quickly I have to have the application available again; RPO is how much data I can afford to lose.
Take a manufacturing line with an RTO of 30 minutes: if the system is down any longer than that, manufacturing stops. A financial system, by contrast, has an RPO of essentially zero, because it can’t afford to lose any data. Those metrics drive the business.
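The two objectives reduce to simple arithmetic. The Python sketch below uses made-up numbers and assumes worst-case data loss is bounded by replication lag, and worst-case downtime by detection plus transition plus application restart.

```python
# Back-of-the-envelope check of a DR plan against its RTO and RPO.
# All figures are in minutes and purely illustrative.

def meets_objectives(rpo, rto, max_replication_lag,
                     detect, transition, app_restart):
    worst_data_loss = max_replication_lag        # committed work at risk
    worst_downtime = detect + transition + app_restart
    return worst_data_loss <= rpo and worst_downtime <= rto

# Manufacturing line: 30-minute RTO, some data loss tolerable.
print(meets_objectives(rpo=15, rto=30, max_replication_lag=1,
                       detect=5, transition=10, app_restart=10))   # True

# Financial system: zero RPO, so even one minute of async lag fails.
print(meets_objectives(rpo=0, rto=30, max_replication_lag=1,
                       detect=5, transition=10, app_restart=10))   # False
```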
I was on-site at a mortgage company where we were implementing a DR solution. We were in the final stages when they had a system outage — a disk drive failed. They called a meeting and brought in all the executives. The solution was very simple: they just had to fail over to the backup system. The executives argued for an hour.
It was very interesting for me to stand there and watch that debate, knowing full well it would work.
It’s amazing how much a disaster helps a disaster recovery program.
CI/CD emphasizes automation because software is being delivered so quickly, but should failover itself be automatic?
Often we hear people asking, “Can I automate this whole thing so that it ‘automagically’ fails over?”
I’m in the European office today meeting with Tech Support and was talking with them about Replication Target Synchronization. They were very skeptical. They said, “You mean I can automate this process so that in a failover it brings the other systems up and running and transitions to those other systems?” But then they realized there’s also an option for human intervention. That’s an advantage we bring to the table: You can configure it both ways, automatically or manually.
You can configure your system and failover environment so that if the network is down for 10 minutes, you automatically fail over. But if you know the network is only going to be down for 15 minutes, should you fail over? You can decide “Yes, we should” or “No, we shouldn’t.”
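That kind of policy is easy to picture in code. The Python sketch below is purely illustrative; the threshold and names are assumptions, not OpenEdge configuration. Automatic failover kicks in once an outage exceeds a configured duration, unless an operator chooses to hold and wait out a known short outage.

```python
# Illustrative failover policy: automatic past a threshold, with room
# for human intervention when the outage is expected to be short.

AUTO_FAILOVER_AFTER_MIN = 10   # hypothetical site-specific threshold

def failover_decision(outage_minutes, expected_total_minutes=None,
                      operator_hold=False):
    if operator_hold:
        return "hold"          # a human has taken manual control
    if expected_total_minutes is not None and expected_total_minutes <= 15:
        return "wait"          # known short outage: ride it out
    if outage_minutes >= AUTO_FAILOVER_AFTER_MIN:
        return "fail over"     # unattended automatic transition
    return "wait"

print(failover_decision(outage_minutes=12))                             # fail over
print(failover_decision(outage_minutes=12, expected_total_minutes=15))  # wait
print(failover_decision(outage_minutes=12, operator_hold=True))         # hold
```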
How do you get developer buy-in to HA and DR?
They need to buy into the complete process. What does business continuity mean for their application? They need to think about it when they start building the app. We have one customer whose team works through the whole disaster recovery process as they’re building the product and its features. That’s when it’s most successful.