Comparing Different Methods for Calculating Health and Risk

    How do you calculate IT health and risk?

    There are different methods you can use, depending on your needs.

    The most common methods for determining IT infrastructure health are:

    1. Threshold comparison
    2. Enhanced threshold comparison
    3. Event detection
    4. Variation from normal
    5. Allocation comparison
    6. Queuing theory for health

    On the other hand, the most common methods for calculating IT infrastructure risk are:

    1. Linear trending
    2. Enhanced trending
    3. Event projection
    4. Allocation projection
    5. Queuing theory for risk

    There are pros and cons to each of these methods, and it’s hard to know which one to rely on. So consider this your guide to making health and risk calculations easy.

    How to Calculate IT Infrastructure Health

    1. Threshold Comparison

    Threshold comparison is all about using measurement statistics to set thresholds.

    These measurement statistics typically include things like:

    • CPU utilization
    • Memory utilization
    • Queue length

    Once you determine your measurement statistics, your next step is setting up thresholds for these measurements. Your IT infrastructure health is determined by how many and which of these thresholds are exceeded.

    For instance, a common threshold for CPU utilization is 70 percent. If utilization exceeds that threshold, your IT infrastructure health is considered compromised.
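
    To make this concrete, here’s a minimal sketch of threshold comparison in Python. The 70 percent CPU figure comes from the example above; the memory and queue-length limits are illustrative assumptions, not recommendations.

```python
# Minimal threshold comparison: flag any measurement that exceeds its limit.
# The 70% CPU threshold comes from the example above; the memory and
# queue-length limits are illustrative assumptions.
THRESHOLDS = {"cpu_util": 70.0, "mem_util": 85.0, "queue_length": 10}

def exceeded_thresholds(measurements):
    """Return the names of all measurements that exceed their thresholds."""
    return [name for name, value in measurements.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]]

print(exceeded_thresholds({"cpu_util": 82.5, "mem_util": 40.0, "queue_length": 3}))
# ['cpu_util']
```

    Health then becomes a function of how many and which names come back in that list.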

    On the pro side, threshold comparisons are:

    • Relatively easy to set up
    • Customizable

    The cons of threshold comparison are:

    • Assumptions that a threshold is a boundary between radically different states
    • Requirements of significant knowledge of the environment to set the right threshold
    • Conditions that frequently change within the environment will require thresholds to change
    • Inaccurate health representations due to weak or inaccurate thresholds

    If you choose to monitor IT infrastructure health with thresholds, be wary. If you don’t set and monitor your thresholds carefully, you could waste time on false positives of poor health—or fail to recognize truly unhealthy situations.

    2. Enhanced Threshold Comparison

    Enhanced threshold comparison is the next step up from threshold comparison.

    These thresholds are usually more complicated. Multiple measurement statistics and formulas are used within a given threshold. And threshold severity is often introduced.

    You’ll determine IT infrastructure health the same way you do for threshold comparison: by looking at the number and type of thresholds that are exceeded. These thresholds just happen to be more complicated.
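
    As a rough sketch of what “more complicated” means in practice, the rules below combine multiple statistics in one condition and attach a severity. The specific rules and cutoffs are illustrative assumptions.

```python
# Enhanced threshold comparison sketch: each rule combines several statistics
# and carries a severity level. All cutoffs here are illustrative assumptions.
def check_enhanced(m):
    alerts = []
    if m["cpu_util"] > 70 and m["queue_length"] > 5:
        alerts.append(("cpu_saturation", "critical"))  # busy AND backed up
    elif m["cpu_util"] > 70:
        alerts.append(("cpu_high", "warning"))         # busy but keeping up
    if m["mem_util"] + m["swap_util"] > 100:
        alerts.append(("memory_pressure", "critical"))
    return alerts

print(check_enhanced({"cpu_util": 80, "queue_length": 8,
                      "mem_util": 60, "swap_util": 20}))
# [('cpu_saturation', 'critical')]
```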

    In the pro column, enhanced threshold comparisons are:

    • Moderately easy to set up
    • Customizable

    The cons of enhanced threshold comparisons are:

    • Assumptions that a threshold is a boundary between radically different states
    • Requirements for even greater environmental knowledge
    • Conditions that frequently change within the environment will require thresholds to change
    • Inaccurate health representations (though fewer than standard threshold comparison)

    Using enhanced thresholds can minimize some of the inaccuracy of standard thresholds. But you’re still taking a risk with your IT infrastructure health.

    3. Event Detection

    Event detection utilizes alarms, alerts, and other techniques to recognize that something noteworthy has occurred (i.e., an event). Thresholds, logs, and other types of data are typically used to recognize events.

    You’ll determine IT infrastructure health based on which events have occurred—and how many.
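
    A bare-bones sketch of log-based event detection might look like the following. The log lines and patterns are invented for illustration, not a real ruleset.

```python
# Event detection sketch: scan log lines for patterns that mark noteworthy
# events. The patterns and sample log are illustrative, not a real ruleset.
import re

EVENT_PATTERNS = {
    "disk_error": re.compile(r"I/O error"),
    "oom": re.compile(r"Out of memory"),
}

def detect_events(log_lines):
    """Return the name of every event recognized in the log, in order seen."""
    return [name for line in log_lines
            for name, pattern in EVENT_PATTERNS.items()
            if pattern.search(line)]

log = [
    "kernel: Out of memory: killed process 1234",
    "sda: I/O error, dev sda, sector 42",
    "systemd: Started nginx",
]
print(detect_events(log))  # ['oom', 'disk_error']
```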

    On the pro side, event detection is:

    • Relatively easy to set up
    • Customizable

    The cons of event detection are:

    • Inaccurate health representations resulting from inaccurate or missed event detection
    • Requirements that events occur in order to detect health issues
    • Requirements for an understanding of the environment in order to set up alerts and alarms
    • Conditions changing in the environment will require you to change which events are detected
    • Significant resources might be required if big data is involved

    Event detection is a great reactive measure for IT infrastructure health. But being proactive will require a better method.

    4. Variation from Normal

    Variation from normal is what it sounds like.

    You define normal situations for IT infrastructure health. This is typically based on historical data and identification of “normal” behavior and events. So situations that vary from the set “normal” are considered unhealthy.

    It’s a simple concept, but you will need some complex algorithms and the right resources to make these determinations.
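
    One of the simplest such algorithms is a z-score check: flag any reading more than a few standard deviations from the historical mean. This is a deliberately minimal sketch; the history and the 3-sigma cutoff are illustrative, and real implementations also model seasonality and trends.

```python
# Variation-from-normal sketch: flag readings more than z_limit standard
# deviations from the historical mean. The data and cutoff are illustrative;
# production tools use far richer models (seasonality, trend, etc.).
from statistics import mean, stdev

def is_abnormal(history, reading, z_limit=3.0):
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > z_limit

cpu_history = [41, 38, 44, 40, 42, 39, 43, 41]  # mean 41, stdev 2
print(is_abnormal(cpu_history, 42))  # False: within normal variation
print(is_abnormal(cpu_history, 95))  # True: far outside normal
```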

    On the pro side, variations from normal:

    • Work best for environments that don’t experience significant workload variations or configuration changes
    • Have the ability to adjust what is considered “normal” over longer periods of time

    But the cons of the variation from normal method are that it:

    • Takes at least 90 days of data to get a very basic understanding of what is “normal.” This means the ROI takes longer than for most of the other methodologies
    • Results in inaccurate health representations in the beginning (or if the environment changes)
    • Requires significant resources for analysis

    Variation from normal can take longer to hit ROI than other methods. But it can also be more precise.

    5. Allocation Comparison

    Allocation comparison determines health by looking at the available capacity versus the capacity that has been allocated. If your allocated capacity of resources gets close enough to the available capacity of resources, then your infrastructure’s unhealthy.
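
    The arithmetic is straightforward. In this sketch, the 90 percent cutoff for “close enough” and the capacity figures are illustrative assumptions.

```python
# Allocation comparison sketch: compare allocated to available capacity.
# The 90% "close enough" cutoff is an illustrative assumption.
def allocation_health(allocated, available, limit=0.90):
    ratio = allocated / available
    status = "unhealthy" if ratio >= limit else "healthy"
    return status, round(ratio, 2)

print(allocation_health(allocated=300, available=512))  # ('healthy', 0.59)
print(allocation_health(allocated=470, available=512))  # ('unhealthy', 0.92)
```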

    On the pro side, allocation comparison:

    • Applies to virtualized environments and disk space resources
    • Adapts well when capacity is added or removed
    • Allows for customization

    But the cons of allocation comparison are:

    • Inaccurate health representations because you’re counting what’s allocated—not what’s used
    • Difficulty determining available capacity
    • Knowledge requirements for your capacity needs

    It might be a good idea to use allocation comparisons if you’re an expert in your capacity already. But this might not be the right method for you if you lack that expertise.

    6. Queuing Theory for Health

    Queuing theory involves analysis of system utilization, throughput, queue length, and response time. These are typically based on the amount of work running on systems and the system configuration.

    You can use queuing theory to determine IT infrastructure health by measuring these components against your optimal scenario.
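
    The textbook M/M/1 model shows why queuing metrics catch what raw utilization misses: response time grows nonlinearly as utilization approaches 100 percent. This is the standard formula, not the specific model any particular product uses.

```python
# M/M/1 queuing model: response time R = S / (1 - U), where S is the service
# time and U is utilization. A standard textbook formula, for illustration.
def mm1_response_time(service_time, utilization):
    if utilization >= 1.0:
        return float("inf")  # the queue grows without bound
    return service_time / (1.0 - utilization)

for u in (0.50, 0.70, 0.90, 0.95):
    print(f"U={u:.0%}  R={mm1_response_time(10.0, u):.0f} ms")
# U=50%  R=20 ms
# U=70%  R=33 ms
# U=90%  R=100 ms
# U=95%  R=200 ms
```

    Going from 50 to 90 percent utilization quintuples response time, even though a threshold-style view only sees utilization rising.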

    On the pro side, queuing theory for health:

    • Applies to CPU and IO activity in both standalone and virtualized environments
    • Adapts well to workload and capacity changes
    • Allows for customization
    • Provides very accurate health results for system, CPU, and IO resources

    But there are cons to queuing theory, like it:

    • Isn’t applicable to memory or disk space health
    • Requires more resources for analytics than threshold methodologies (but less than variation from normal and event detection)

    Queuing theory for health might be a smart choice if you have the right resources to do analysis. (And if you’re more focused on CPU, IO, and system health.)

    How to Calculate IT Infrastructure Risk

    1. Linear Trending

    Linear trending involves using historical data to create a trend line.

    In terms of IT infrastructure risk, this means using that line to project future values from your historical data, showing when your established thresholds will be surpassed.
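
    In code, that’s an ordinary least-squares line solved for the threshold crossing. The utilization history below is invented for illustration.

```python
# Linear trending sketch: fit a least-squares line to historical utilization,
# then solve for when the fitted line crosses a threshold. Data is illustrative.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def days_until(threshold, xs, ys):
    slope, intercept = fit_line(xs, ys)
    return None if slope <= 0 else (threshold - intercept) / slope

days = list(range(8))                    # one week of daily samples
cpu = [40, 41, 43, 44, 46, 47, 49, 50]  # steady growth, ~1.5 points/day
print(round(days_until(70, days, cpu)))  # 20: days until the 70% threshold
```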

    On the pro side, linear trending:

    • Is relatively easy to set up—once appropriate thresholds are established
    • Provides fairly accurate projections for CPU and IO activity—at least for resources with moderate-to-low utilization and consistent growth
    • Allows for limited customization

    But there are cons to linear trending, like:

    • Behavior projections are linear in nature, so they’re not useful for busier resources
    • Overprovisioning since you have to set lower thresholds to avoid inaccurate linear trends
    • Inaccurate risk representations with higher utilization
    • Inability to take configuration or workload changes into account

    So using linear trending to predict IT infrastructure risk might be enough if your utilization is moderate-to-low.

    2. Enhanced Trending

    Enhanced trending uses basic algebraic quadratic functions. These usually take multiple statistics into account in one equation.

    You can use enhanced trending to predict IT infrastructure risk. Here’s how: use a quadratic function to project future values from historical data and determine when your thresholds will be exceeded.
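
    Once a quadratic model is fitted, the projection step is just the quadratic formula. The coefficients below are invented for illustration, not taken from a real fit.

```python
# Enhanced trending sketch: given a fitted quadratic model
# u(t) = a*t**2 + b*t + c, solve for the first time it reaches a threshold.
# The coefficients are illustrative assumptions, not a real fit.
import math

def first_crossing(a, b, c, threshold):
    """Smallest t >= 0 where a*t**2 + b*t + c equals threshold, else None."""
    disc = b * b - 4 * a * (c - threshold)
    if disc < 0:
        return None
    roots = [(-b + s * math.sqrt(disc)) / (2 * a) for s in (1.0, -1.0)]
    future = [t for t in roots if t >= 0]
    return min(future) if future else None

# Utilization modeled as 0.02*t**2 + 0.5*t + 40; when does it reach 70%?
print(round(first_crossing(0.02, 0.5, 40, 70), 1))  # 28.2 time units out
```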

    In the pro column, enhanced trending:

    • Provides fairly accurate projections for CPU and IO activity—at least for resources with moderate-to-low utilization and consistent growth
    • Delivers the best risk prediction for memory and disk space—when correct functions are provided
    • Allows for customization

    The cons of enhanced trending are that:

    • You need a very strong understanding of the environment to establish the quadratic functions (and that takes a strong math background)
    • The results will be more accurate than linear trending for resources with low-to-moderate utilization, but still aren’t accurate for higher utilizations
    • Overprovisioning happens, since you have to set lower thresholds to avoid inaccurate trends
    • You get inaccurate risk representations
    • Configuration or workload changes aren’t taken into account

    Enhanced trending can be a good choice—if you have the strong math resources in your IT department. Otherwise, it can be difficult to do well.

    3. Event Projection

    Event projection uses historical data about events to project when future events will occur.

    You can use this to calculate IT infrastructure risk based on which events will occur and when. In this sense, event projection is similar to variation from normal.
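
    A minimal sketch: if past events arrive at roughly regular intervals, project the next one from the mean gap. The incident dates are invented, and real tools use much richer models.

```python
# Event projection sketch: project the next event from the mean interval
# between past events. The incident dates are illustrative.
from datetime import date, timedelta
from statistics import mean

def project_next_event(event_dates):
    gaps = [(b - a).days for a, b in zip(event_dates, event_dates[1:])]
    return event_dates[-1] + timedelta(days=mean(gaps))

incidents = [date(2024, 1, 5), date(2024, 1, 19),
             date(2024, 2, 2), date(2024, 2, 16)]
print(project_next_event(incidents))  # 2024-03-01
```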

    On the pro side, event projection:

    • Works best if you don’t have significant variations in workload or configuration changes
    • Can adjust what’s “normal” over longer periods of time
    • Allows for customization

    The cons of event projection are:

    • Inaccurate risk representations because it takes time to determine “normal”
    • Events need to happen in order to calculate risk
    • Requirements for understanding the environment to determine events
    • Conditions changing in an environment will cause events to change
    • Significant resources are needed when big data is involved
    • 90 days of data is required to understand “normal”
    • Analytics depend on the availability of your resources

    Event projection can work—if you’re okay with being reactive, rather than proactive, with IT risk.

    4. Allocation Projection

    Allocation projection determines risk by looking at the total amount of capacity available versus the allocated capacity.

    So when the allocated capacity gets close enough to the available capacity of resources, risk goes up.
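
    The projection itself can be as simple as dividing the remaining headroom by the growth rate. The capacity figures are illustrative assumptions.

```python
# Allocation projection sketch: with steady allocation growth, headroom
# divided by growth rate gives the time to exhaustion. Figures are illustrative.
def months_until_full(allocated_tb, available_tb, growth_tb_per_month):
    headroom = available_tb - allocated_tb
    return headroom / growth_tb_per_month

# 80 TB allocated of 100 TB available, growing 2.5 TB per month:
print(months_until_full(80, 100, 2.5))  # 8.0 months of headroom left
```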

    On the pro side, allocation projection:

    • Applies to virtualized environment placement and disk space resources
    • Adapts when capacity and work growth is well understood
    • Allows for customization

    The cons to allocation projection are:

    • Inaccurate risk representations because it uses what’s allocated instead of what’s used
    • Overprovisioning when the work being performed doesn’t equal the allocation

    If you’re primarily concerned about risk in virtualized environments or disk space, allocation projections might make sense for you.

    5. Queuing Theory for Risk

    The same metrics used in queuing theory for health—system utilization, throughput, queue length, and response time—are used for risk.

    You can determine IT infrastructure risk by comparing the predicted values to what you need to maintain service levels.
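
    As a sketch, projected utilization can be fed through the same M/M/1 response-time formula used for health, R = S / (1 - U), and compared against a service-level target. The 50 ms target and the utilization figures are illustrative assumptions.

```python
# Queuing-theory-for-risk sketch: feed projected utilization into the M/M/1
# response-time formula R = S / (1 - U) and compare against a service-level
# target. The 50 ms target and utilization figures are illustrative.
def at_risk(service_time_ms, projected_utilization, sla_ms):
    if projected_utilization >= 1.0:
        return True  # saturated: response time is unbounded
    return service_time_ms / (1.0 - projected_utilization) > sla_ms

print(at_risk(10.0, 0.70, 50.0))  # False: projected R ~ 33 ms, within SLA
print(at_risk(10.0, 0.90, 50.0))  # True: projected R ~ 100 ms, SLA breached
```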

    On the pro side, queuing theory:

    • Adapts well to workload, configuration, and capacity changes
    • Allows for customization
    • Provides very accurate risk results for CPU and IO activity

    The cons of queuing theory for risk are:

    • It doesn’t apply to memory or disk space risk determination
    • Analytics require more resources than trending methods (but usually less than event or allocation projection)

    Queuing theory for risk might make sense if you have the right resources to do the calculations.

    IT Health and Risk—Made Easy

    There are many routes you can take to IT health and risk management. The right one will depend on your organization and your environment.

    But it doesn’t need to be difficult to decide the right method.

    Capacity management makes it easy to manage IT health and risk. Learn how. Download our white paper, Health and Risk: A New Paradigm for Capacity Management.

    Get the white paper