Overcome Unplanned Outages

    The game clock can’t tick fast enough. The underdog team clings to a one-point lead. Your stomach churns. You change positions every few seconds, hands wringing. A burst of heat flows from your shoulders to your head. The referee calls timeout.

    That anxious moment has come. Your company’s infrastructure is about to see an exponential rise in demand. The first of several television and digital ads is about to hit, and the one thing you can’t afford is an unplanned outage. Are you sweating?

    Unplanned outages erode market share, revenue, customer experience and the trust between IT and business organizations. Unfortunately, the modus operandi for most organizations tends to be reactive rather than proactive, which can lead to outages. IT departments spend nearly half their time responding to problems like performance or service outages, according to a global IT survey from Kelton Research.

    In fact, on average it takes seven IT staff nearly 3.5 hours to resolve each unplanned outage, according to the survey. The average number of ‘IT fires’ per week? Eight. That comes to about 190 hours per week, the equivalent of 4.75 full-time employees, spent doing nothing but fighting fires.
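
    The arithmetic behind those figures can be checked in a few lines of Python. This is only a sketch: the staff count, the weekly outage count and the 190-hour total are the survey numbers quoted above, while the 40-hour work week and the function names are assumptions made for illustration. Exactly 3.5 hours per outage would give 196 hours, so the quoted 190 hours implies roughly 3.4 hours per outage, which fits "nearly 3.5."

```python
# Figures quoted from the survey.
STAFF_PER_OUTAGE = 7
OUTAGES_PER_WEEK = 8
HOURS_PER_WEEK_QUOTED = 190

# Assumption (not from the survey): a full-time week is 40 hours.
FULL_TIME_HOURS = 40


def firefighting_hours(staff, hours_per_outage, outages_per_week):
    """Total staff hours per week spent resolving outages."""
    return staff * hours_per_outage * outages_per_week


def full_time_equivalents(hours, full_time_hours=FULL_TIME_HOURS):
    """Convert weekly hours into full-time employees."""
    return hours / full_time_hours


# Hours per outage implied by the quoted weekly total.
implied_hours_per_outage = HOURS_PER_WEEK_QUOTED / (
    STAFF_PER_OUTAGE * OUTAGES_PER_WEEK
)

if __name__ == "__main__":
    print(firefighting_hours(STAFF_PER_OUTAGE, 3.5, OUTAGES_PER_WEEK))
    print(round(implied_hours_per_outage, 2))
    print(full_time_equivalents(HOURS_PER_WEEK_QUOTED))
```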

    Let’s take a look at some examples of performance-related outages and how you can mitigate or eliminate them.

    • Sometimes there’s insufficient capacity, meaning too few system resources, to handle peak processing demands. For example, in the finance world, Social Security processing can occur on a Friday that also falls at the end of the month and the end of the quarter.

    • Other times, there’s insufficient capacity to handle anomalous peak processing. This could happen in a couple of different scenarios. Imagine the peak described above for the financial world. The CEO tells employees, “We’ve matched your 401(k) contributions plus added some extra money for your retirement plans! Go take a look.” Or perhaps a customer points a synthetic transaction processor at your website and performs massive robotic data entry.

    • Another example would be a runaway or looping process. The process gets into a loop and consumes all available resources. This is usually an edge case that the programmer or QA person didn’t sufficiently account for and test.

    • The final example is an application that is poorly written to handle peak volumes, such as a single-threaded application. There are plenty of system resources, but the application is unable to take advantage of them because it is single-threaded.
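
    For the runaway-process case in the list above, one common defensive pattern is to give a loop an explicit budget so that an untested edge case fails fast instead of consuming every available resource. The following is a minimal sketch, not a prescribed fix: `run_with_budget`, `BudgetExceeded` and the default limits are hypothetical names and values chosen for illustration.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a loop runs past its iteration or time budget."""


def run_with_budget(step, max_iterations=10_000, max_seconds=5.0):
    """Call step() until it returns True; abort if a budget runs out.

    Returns the number of calls it took for step() to report completion.
    The limits are illustrative defaults, not recommendations.
    """
    deadline = time.monotonic() + max_seconds
    for iteration in range(1, max_iterations + 1):
        if step():
            return iteration
        if time.monotonic() > deadline:
            raise BudgetExceeded(f"time budget of {max_seconds}s exceeded")
    raise BudgetExceeded(f"no result after {max_iterations} iterations")


if __name__ == "__main__":
    try:
        # A step that never completes stands in for the looping edge case.
        run_with_budget(lambda: False, max_iterations=100)
    except BudgetExceeded as err:
        print(f"aborted: {err}")
```

    The guard does not repair the underlying bug. It turns a resource-exhausting loop into an error that monitoring can catch and report.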

    Here are some ways to stay ahead of the game.

    Run multiple scenarios to show business units the impact of various business decisions. This will help IT and the business determine the optimal cost/performance balance needed to meet business goals. You’ll also ensure the metrics can be tracked and reported on an ongoing basis.
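
    A scenario run can start as a simple what-if calculation. The sketch below compares projected peak demand with available capacity for a few scenarios and flags the ones that exceed a utilization threshold. Every number in it (the baseline of 400 requests per second, the capacity of 1,000, the demand multipliers and the 80 percent threshold) is a made-up assumption for illustration, as are the names `Scenario` and `evaluate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    demand_multiplier: float  # peak demand relative to today's baseline


def utilization(baseline_rps, capacity_rps, scenario):
    """Fraction of capacity consumed at the scenario's peak demand."""
    return baseline_rps * scenario.demand_multiplier / capacity_rps


def evaluate(baseline_rps, capacity_rps, scenarios, threshold=0.8):
    """Return (name, utilization, at_risk) for each scenario."""
    results = []
    for scenario in scenarios:
        used = utilization(baseline_rps, capacity_rps, scenario)
        results.append((scenario.name, round(used, 2), used > threshold))
    return results


# Hypothetical scenarios; the multipliers are illustrative only.
scenarios = [
    Scenario("normal day", 1.0),
    Scenario("month-end peak", 1.5),
    Scenario("TV ad launch", 3.0),
]
report = evaluate(baseline_rps=400, capacity_rps=1000, scenarios=scenarios)

if __name__ == "__main__":
    for name, used, at_risk in report:
        flag = "AT RISK" if at_risk else "ok"
        print(f"{name:15} {used:4.0%} {flag}")
```

    Under these assumed numbers only the ad launch exceeds capacity, which is the kind of result that lets IT and the business weigh the cost of added capacity against the risk of an outage.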

    You can’t prevent every unplanned outage, but you can prevent the same problem from recurring. Use your monitoring software to gather all the disparate data from your systems. Get detailed diagnostics to quickly identify the cause of an outage and, more importantly, determine how to keep the problem from recurring. When possible, capture historical data for trend reporting, which allows IT to predict issues and address them before they occur.
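
    Trend reporting of this kind can be as simple as fitting a straight line to historical utilization and estimating when it will reach a limit. The sketch below uses an ordinary least-squares fit over evenly spaced samples. The sample values, the 90 percent limit and the function names are assumptions for illustration, and real workloads are often seasonal rather than linear, so treat the output as a rough early warning.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until(values, limit):
    """Periods after the last sample until the trend line reaches limit.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


if __name__ == "__main__":
    # Hypothetical weekly peak CPU utilization, in percent.
    weekly_peak_cpu = [50, 55, 60, 65, 70]
    print(periods_until(weekly_peak_cpu, limit=90))
```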

    Customize reports for different audiences by organizing and annotating them. Choose appropriate charts and graphs for your customers with all the necessary explanations and labeling. Translate technical data into business data for business types, and leave the technical data to the techies. Boost your team’s credibility by providing a user-friendly, self-service report that provides the analysis needed by the business to understand how their decisions could affect the customer experience.