What Happens When Backup Data Systems Don’t Have Your Back
Recent high-profile computer crashes have highlighted what can happen when IT systems aren’t resilient to outages. Whatever the specific cause, companies need to remember that outages are always more costly than strategic planning.
For the second time this year, we have a dramatic example of what can happen when airline IT systems fail. In July, a Southwest Airlines glitch caused delays and cancellations for three days; on August 8th, massive Delta computer problems interrupted the travel plans of tens of thousands of travellers.
Such delays not only tarnish public confidence in these airlines’ services, but heavily dent their checkbooks as well: Southwest’s glitch likely cost the company between $54 and $82 million, reports WFAA, while The Street estimates that Delta could see losses of $120 million or more. Although Southwest’s CEO initially called their situation a “once-in-thousand-year flood,” it’s becoming harder to make that case convincingly.
While the outages won’t shipwreck these industry giants, they serve as a timely reminder to every organization that improper infrastructure management can be devastating to a company’s service quality. IT plans are bound to fail at some point, and when they do, companies must be certain that critical systems can quickly transfer to backup data centers (on-premises or in the cloud), and that those backups can handle the unexpected load.
According to Bloomberg, Delta’s crash was set in motion by a small fire at the company’s Atlanta command center at 2:30am, prompting 7,000 servers to revert to backup power (curiously, a routine switch to a backup generator apparently caused the fire). Through some strange chance, 300 of those servers, evidently critical, weren’t wired to backup power, causing an outage that lasted until the following morning when systems were able to finally reset.
That was enough to cause the cancellation of more than 2,100 flights. Although Delta was soon able to get its “original interface” for flight scheduling up and running, the airline domino effect was already in motion, reports the Washington Post. The effect is similar to what happens after snowstorms, when grounded crews can’t make their second and third scheduled flights.
It seems clear that Delta was forced to run an outdated system interface because its backup systems simply couldn’t handle the capacity of its newer applications (even then Delta experienced “slowness” and “instability”). As commentators at MyNorthwest observe, today’s airline IT infrastructures are sprawling, supporting mobile apps and online check-ins, while their backups, decades old, suffer from “software fatigue.” Though officially only about 4% of servers (300 of 7,000) were offline due to power trouble, all of Delta’s systems were completely overwhelmed.
For many companies, it’s hard to get a true sense of how many systems and services rely on their critical IT infrastructure until core systems fail. And while the bulk of organizations don’t have a busy schedule of flights to maintain, outages frustrate customers and drive them to competitors just the same.
Server fires, outages, and the like can seem like rare twists of fate, but they’re entirely commonplace. Companies must not only have backup systems in place, but also ensure that those systems can handle current capacity; if they can’t, systems will go down. When all is said and done, it costs considerably less to run your systems through performance management tools, or even to upgrade your entire infrastructure, than it does to bleed capital during a crisis.
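As a rough illustration of what checking backup capacity can look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Service` record, the `failover_gaps` helper, the load figures, and the 20% headroom are hypothetical and do not describe any airline’s actual setup. The idea is simply to compare each service’s peak load against what its backup can sustain, and to flag any shortfall before an outage does.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical record for one critical service (illustrative only)."""
    name: str
    peak_load: float        # peak demand seen on the primary, e.g. requests/sec
    backup_capacity: float  # most the backup site can sustain, same unit


def failover_gaps(services, headroom=0.2):
    """Return (name, shortfall) for each service whose backup cannot
    absorb peak load plus a safety margin (headroom is an assumed 20%)."""
    gaps = []
    for service in services:
        required = service.peak_load * (1 + headroom)
        if service.backup_capacity < required:
            gaps.append((service.name, required - service.backup_capacity))
    return gaps


if __name__ == "__main__":
    # Made-up numbers for demonstration.
    inventory = [
        Service("check-in", peak_load=100, backup_capacity=150),
        Service("scheduling", peak_load=100, backup_capacity=110),
    ]
    for name, shortfall in failover_gaps(inventory):
        print(f"{name}: backup falls short by {shortfall:.1f}")
```

A real audit would draw these figures from monitoring data and load tests rather than hand-entered numbers, but even a simple comparison like this one can surface a backup that was sized for yesterday’s traffic.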
There’s no ultimate failsafe against outages, of course: Delta says it invested “hundreds of millions” in IT upgrades over the past few years. But functionally speaking, every airplane landing is really just a controlled crash, which is to say that if you expect to fail, you’re much more likely to have a smooth landing.