How to Calculate Health and Risk Within Your Infrastructure

    March 21, 2016

    By Dino Balafas

    All the metrics and considerations that go into evaluating IT health and risk form a complex system of processes that are constantly underway. Automating these processes, however, can be relatively simple.  

    To the casual observer, capacity planning and performance management might seem like rocket science. In fact, that impression isn’t far off base: many companies hire highly trained data scientists just to get an approximation of their IT systems’ health and risk.

    Such expertise comes with sky-high rates, unfortunately, because there’s complicated, time-intensive work that goes into evaluating performance and predicting future trends. Even if you can afford those rates, truly world-class data experts come along once in a blue moon.

    That paradigm is changing, however. Automated analytics have proven effective at transforming complicated systems data into simple, easy-to-understand measures of health and risk. When advanced, reliable software is placed behind the scenes, IT managers save time and money, and gain the peace of mind that comes with simplicity.

    The Building Blocks of Health and Risk

    Health and risk scores are assessments of an IT system’s performance. They flag current and future issues that will cause outages or push response times past desired thresholds. Health is an extension of performance management, identifying and resolving moment-by-moment problems as they arise. Risk, on the other hand, is forward-looking: it estimates the time remaining until poor performance or outages will occur if capacity or hardware is not upgraded.

    Calculating health scores is tricky, and requires that you take a broad and deep look across hardware and applications. To determine whether a system’s I/O and CPU performance matches its out-of-the-box capabilities, analytic queueing network models must be employed. For issues of memory and disk usage, professionals should look for deviations between current, historical, and estimated levels of utilization and capacity.
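
    To make the two health checks above concrete, here is a minimal sketch in Python. It assumes a single M/M/1 queue rather than a full queueing network, and a simple standard-deviation comparison for memory or disk utilization. The function names and sample figures are hypothetical and are not taken from any TeamQuest product.

```python
from statistics import mean, pstdev


def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Expected response time of a single M/M/1 queue: R = S / (1 - U).

    Assumption: one queue stands in for a whole queueing network.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


def slowdown_vs_model(measured_s: float, service_time_s: float,
                      utilization: float) -> float:
    """Ratio of measured response time to the model's expectation.

    Values well above 1.0 suggest the resource is underperforming
    relative to what its rated service time predicts.
    """
    return measured_s / mm1_response_time(service_time_s, utilization)


def utilization_deviation(current: float, history: list[float]) -> float:
    """Number of standard deviations `current` sits from the historical mean."""
    spread = pstdev(history)
    if spread == 0:
        raise ValueError("history has no variation to compare against")
    return (current - mean(history)) / spread


if __name__ == "__main__":
    # Hypothetical figures: 10 ms service time, 50% busy, 30 ms measured.
    print(slowdown_vs_model(0.030, 0.010, 0.5))
    # Hypothetical memory utilization history versus the current reading.
    print(utilization_deviation(0.70, [0.40, 0.50, 0.60]))
```

    A slowdown ratio near 1.0 means the resource behaves as the model expects, while a large utilization deviation points to a change worth investigating.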

    Risk scores use capacity planning algorithms to estimate how growth in demand will affect services over time; that impact is normally non-linear. From there, professionals can again use analytic queueing network models to predict future CPU and disk I/O performance, charting the estimated remaining days-to-risk. Significantly, these measures incorporate unique future events (like Black Friday) along with expected growth trends.
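
    The days-to-risk idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm any particular product uses: demand grows at a fixed compound daily rate, response time follows a single M/M/1 queue, and one-off events are modeled as hypothetical per-day load multipliers.

```python
from typing import Optional


def days_to_risk(arrival_rate: float, service_time_s: float,
                 daily_growth: float, threshold_s: float,
                 horizon_days: int = 365,
                 event_multipliers: Optional[dict[int, float]] = None
                 ) -> Optional[int]:
    """Return the first day on which response time exceeds the threshold.

    Assumptions (all hypothetical): compound daily growth in arrivals,
    an M/M/1 response time R = S / (1 - U), and `event_multipliers`
    mapping a day number to a one-off load multiplier.
    Returns None if no risk appears within the horizon.
    """
    events = event_multipliers or {}
    for day in range(horizon_days + 1):
        rate = arrival_rate * (1.0 + daily_growth) ** day * events.get(day, 1.0)
        utilization = rate * service_time_s
        if utilization >= 1.0:
            return day  # saturated: the queue grows without bound
        if service_time_s / (1.0 - utilization) > threshold_s:
            return day
    return None


if __name__ == "__main__":
    # Hypothetical workload: 50 requests/s, 10 ms service time,
    # 1% daily growth, 50 ms response-time threshold.
    print(days_to_risk(50.0, 0.010, 0.01, 0.050))
    # Same workload with a doubling of load on day 10.
    print(days_to_risk(50.0, 0.010, 0.01, 0.050, event_multipliers={10: 2.0}))
```

    Because response time rises sharply as utilization approaches 100%, steady growth in demand produces non-linear growth in response time, which is why a one-off spike can bring the risk date forward by weeks.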

    With automated analytics, however, the solutions are much simpler.

    Health and Risk Figures, at a Glance

    Because such calculations are extremely cumbersome for IT professionals to make at the rate and frequency they’re required, TeamQuest has developed a way to automate them. Indeed, the only metrics that IT managers need encounter when using our Vityl suite (Advisor, Monitor, and Dashboard) are simple scores of health and risk, from 0 to 100. A score of 55-100 indicates good health (low risk); 45-55 indicates a warning; and 0-45 indicates poor health and high future risk.
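
    As a small illustration, the score bands above map naturally onto a lookup function. The sketch below is hypothetical and is not Vityl's API. Since the stated ranges share the boundary values 45 and 55, it assumes a boundary score falls into the better band.

```python
def classify_score(score: float) -> str:
    """Map a 0-100 health or risk score to a status band.

    Assumption: the boundary values 45 and 55 belong to the better
    band, because the ranges as stated overlap at those points.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 55:
        return "healthy"
    if score >= 45:
        return "warning"
    return "poor"


if __name__ == "__main__":
    for sample in (82, 50, 12):
        print(sample, classify_score(sample))
```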

    Far from general approximations, our calculations of health and risk are the most accurate in the industry, with an impressive success rate of 95%. And these calculations don’t need to be general, either; IT managers who need to find the root cause of an issue (or who are simply curious) can seamlessly drill down to highly granular details and pinpoint bottlenecks at their source. At a glance, Vityl provides both big-picture clarity and refined process details.

    In many ways, capacity planning and performance management are rocket science; many companies could go to the moon for what they spend on expertise. But if you’ve got tools that can do the heavy lifting for you, why be afraid of a little rocket science? When the math is contained within automated analytics, assessing health and risk is as simple as checking the weather.
