
# Comparing Different Methods for Calculating Health and Risk

## Introduction

For determining IT infrastructure health, the following methods are generally implemented:

• Threshold comparison
• Enhanced threshold comparison
• Event detection
• Variation from normal
• Allocation comparison
• Queuing theory for health

For determining IT infrastructure risk, the following methods are generally implemented:

• Linear trending
• Enhanced trending
• Event projection
• Allocation projection
• Queuing theory for risk

We will take a brief look at each of the above and discuss their strengths and weaknesses.

## Threshold comparison

For threshold comparison, one must first determine which measurement statistics need thresholds set for them. These usually include statistics like CPU utilization, memory utilization, queue lengths, etc. Once the measurement statistics are determined, the next step is setting their thresholds. It is not uncommon to see a threshold like CPU utilization greater than 70%. IT infrastructure health is then determined by how many and which of the thresholds are being exceeded.
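The mechanics can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the metric names and threshold values are assumptions chosen for the example.

```python
# Minimal sketch of threshold comparison.
# Metric names and threshold values below are illustrative assumptions.
THRESHOLDS = {
    "cpu_util_pct": 70.0,   # flag when CPU utilization exceeds 70%
    "mem_util_pct": 85.0,
    "run_queue_len": 5.0,
}

def health_violations(measurements):
    """Return the metrics whose current value exceeds its threshold."""
    return {m: v for m, v in measurements.items()
            if m in THRESHOLDS and v > THRESHOLDS[m]}

current = {"cpu_util_pct": 82.5, "mem_util_pct": 60.0, "run_queue_len": 2.0}
print(health_violations(current))  # {'cpu_util_pct': 82.5}
```

Health is then some rollup of how many thresholds are violated and which ones; the simplicity of this rollup is exactly where the weaknesses below come from.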

The strengths of this methodology are:

• Relatively easy to set up.
• Allows for customization

The weaknesses of this methodology are:

• Makes the assumption that a threshold is a boundary beyond which a radically different state of affairs exists. The real world is not so black and white.
• Requires that the appropriate thresholds be established to get an accurate health determination, and this is difficult, requiring significant knowledge of the environment.
• Changing conditions within the environment will require thresholds to change. This is a significant manual maintenance issue.
• Generally results in a significant number of inaccurate health representations due to weak or inaccurate thresholds. This manifests itself in a high number of false positive determinations of poor health thus wasting valuable time. Even worse it means that truly unhealthy situations are often times not recognized.

## Enhanced threshold comparison

Enhanced threshold comparison is usually built on top of threshold comparison. Its thresholds are more complicated, using multiple measurement statistics and formulas within a given threshold. Threshold severity is often also introduced. IT infrastructure health is determined in much the same way as for threshold comparison, by looking at the number and type of thresholds that are exceeded.
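A compound rule with severity might look like the sketch below. The rule shapes and cutoffs are illustrative assumptions, not a prescribed rule set.

```python
# Sketch of enhanced thresholds: compound rules over several statistics,
# each tagged with a severity. Rules and cutoffs are illustrative assumptions.
RULES = [
    ("critical", lambda m: m["cpu_util_pct"] > 90 and m["run_queue_len"] > 10),
    ("warning",  lambda m: m["cpu_util_pct"] > 70 and m["run_queue_len"] > 5),
    ("warning",  lambda m: m["mem_util_pct"] > 85),
]

def evaluate(measurements):
    """Return the severities of every rule the current measurements trip."""
    return [sev for sev, rule in RULES if rule(measurements)]

print(evaluate({"cpu_util_pct": 95, "run_queue_len": 12, "mem_util_pct": 50}))
# ['critical', 'warning']
```

Combining statistics cuts down on single-metric false alarms, but each rule is one more formula someone must derive and maintain as the environment changes.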

The strengths of this methodology are:

• Moderately easy to set up.
• Allows for customization

The weaknesses of this methodology are:

• Makes the assumption that a threshold is a boundary beyond which a radically different state of affairs exists. The real world is not so black and white.
• Results in a significant number of inaccurate health representations, though generally marginally better than threshold comparison.
• Requires that the appropriate thresholds and formulas be established, which demands even more knowledge of the environment than threshold comparison.
• Changing conditions within the environment will require threshold and formula changes. This is a significant manual maintenance issue.

## Event detection

Event detection utilizes alarms/alerts and other techniques to recognize that something noteworthy has occurred: an event. It usually uses thresholds, logs, and other sorts of data to recognize events. IT infrastructure health is determined by how many and which events have occurred.
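A log-scanning detector is one common ingredient. The sketch below is a toy illustration; the event names and regular expressions are assumptions for the example.

```python
import re

# Sketch of log-based event detection. Event names and patterns
# are illustrative assumptions, not a real product's rule set.
EVENT_PATTERNS = {
    "disk_error": re.compile(r"I/O error", re.IGNORECASE),
    "oom_kill":   re.compile(r"out of memory", re.IGNORECASE),
}

def detect_events(log_lines):
    """Count occurrences of each known event pattern in the log."""
    counts = {name: 0 for name in EVENT_PATTERNS}
    for line in log_lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

log = ["kernel: I/O error on sda", "app started", "killed: Out of memory"]
print(detect_events(log))  # {'disk_error': 1, 'oom_kill': 1}
```

Note that health here is inherently reactive: the detector only sees problems that have already produced a log line or alarm.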

The strengths of this methodology are:

• Relatively easy to moderate set up.
• Allows for customization

The weaknesses of this methodology are:

• Results in a significant number of inaccurate health representations resulting from inaccurate event detection and missed event detection.
• Requires that events actually occur in order to identify health issues. This makes it difficult to identify health degradation.
• Requires that the appropriate alarms/alerts are established and this requires a strong understanding of the environment.
• Changing conditions within the environment will require alerts/alarms to change. This is a significant manual maintenance issue.
• Can require significant resources especially when big data techniques are incorporated.

## Variation from normal

With variation from normal, unhealthy situations are those that vary by a set amount from what is considered normal. Normal is usually determined by evaluating a certain amount of historical data and identifying 'normal' behavior or events. This usually requires complex algorithms and appropriate resources to run the analytics.
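In its simplest form this is a statistical deviation test against a historical baseline. Real implementations use far more sophisticated models; the sketch below assumes a plain mean/standard-deviation baseline and a three-sigma cutoff for illustration.

```python
from statistics import mean, stdev

def is_abnormal(history, current, k=3.0):
    """Flag a value more than k standard deviations from the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > k * sigma

# 'Normal' CPU utilization hovers around 40%; 43% is unremarkable, 90% is not.
history = [38, 41, 40, 42, 39, 40, 41, 43, 39, 40]
print(is_abnormal(history, 43))  # False
print(is_abnormal(history, 90))  # True
```

The weaknesses below follow directly from this shape: until `history` is long enough, or when behavior is non-cyclical, the baseline itself is unreliable.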

The strengths of this methodology are:

• Works best for situations that do not experience significant variations in workload and configuration changes
• Generally has the ability to adjust what is considered ‘normal’ over longer periods of time.

The weaknesses of this methodology are:

• It usually takes at least 90 days' worth of data to gain even a very basic understanding of what is 'normal'. This means the ROI takes longer than for most of the other methodologies.
• Results in a significant number of inaccurate health representations in the beginning or for environments with non-cyclical behavior patterns because normalcy is not determined yet or it keeps changing.
• Analytics require significant resources and can rarely be used for real-time analysis.

## Allocation comparison

Allocation comparison determines health by looking at the total amount of resources available (capacity) versus the capacity that has been allocated. When the allocated capacity of resources gets close enough to the available capacity of resources, then that is considered to be unhealthy.
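The calculation itself is a simple ratio. The 90% warning ratio below is an assumed cutoff for illustration, not a recommended value.

```python
def allocation_health(capacity, allocated, warn_ratio=0.9):
    """Unhealthy when allocated capacity approaches total capacity.
    warn_ratio is an assumed illustrative cutoff."""
    ratio = allocated / capacity
    state = "unhealthy" if ratio >= warn_ratio else "healthy"
    return state, ratio

# e.g. 1.8 TB of a 2 TB datastore already allocated
state, ratio = allocation_health(capacity=2000, allocated=1800)
print(state, ratio)  # unhealthy 0.9
```

Note what the ratio does not capture: allocated capacity may sit idle, which is the source of the inaccuracy described below.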

The strengths of this methodology are:

• Is most applicable to some virtualized environments and for disk space resources
• Allows for customization

The weaknesses of this methodology are:

• Results in a significant number of inaccurate health representations because it takes into account what is allocated not what is actually being used to complete work.
• Determining the total capacity available is difficult to do in many environments.
• Requires that one knows the capacity that is needed in order to allocate it to resources before the work characteristics are understood.

## Queuing theory for health

Queuing theory performs analytics on system utilization, throughput, queue length, and response time based on the amount of work running on systems and their configuration. Health is determined by evaluating the current utilization, throughput, queue lengths, and response times for all components and the system as a whole against what is optimal for getting work done the fastest and most efficiently.
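To make the idea concrete, the classic single-server M/M/1 model shows how utilization, queue length, and response time relate. This is a textbook illustration of the approach, not the specific models any product uses.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Classic M/M/1 results: utilization, mean queue length, mean response time."""
    rho = arrival_rate / service_rate          # utilization
    if rho >= 1.0:
        raise ValueError("system is saturated (utilization >= 100%)")
    lq = rho * rho / (1.0 - rho)               # mean number waiting in queue
    r = 1.0 / (service_rate - arrival_rate)    # mean response time
    return rho, lq, r

# 80 requests/s against a server that completes 100/s: the server is 80% busy,
# yet response time is already 5x the bare 10 ms service time.
rho, lq, r = mm1_metrics(80.0, 100.0)
print(round(rho, 2), round(lq, 2), round(r * 1000, 1))  # 0.8 3.2 50.0
```

This nonlinearity (response time exploding as utilization approaches 100%) is what queuing models capture and simple thresholds or linear trends miss.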

The strengths of this methodology are:

• Is most applicable to CPU and I/O activity for both standalone and virtualized environments
• Allows for customization
• Provides very accurate health results for system, CPU and I/O resources.

The weaknesses of this methodology are:

• Not applicable for memory or disk space health determination
• Analytics require more resources than threshold methodologies but usually less than variation from normal and event detection.

## Linear trending

Linear trending imposes a line of best fit on time series historical data. For risk prediction, this means a line is used to project future values from the historical data. It can only model behavior that is linear in nature. This method is used to project when established thresholds will be exceeded.
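A minimal version fits an ordinary least-squares line and solves for the day the line crosses a threshold. The data and the 70% threshold are assumed for illustration.

```python
def linear_trend(times, values):
    """Ordinary least-squares fit: returns slope and intercept."""
    n = len(times)
    tbar = sum(times) / n
    vbar = sum(values) / n
    slope = (sum((t - tbar) * (v - vbar) for t, v in zip(times, values))
             / sum((t - tbar) ** 2 for t in times))
    return slope, vbar - slope * tbar

def days_until_threshold(times, values, threshold):
    """Project when the fitted line first crosses the threshold."""
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return None  # never crosses under a flat or falling trend
    return (threshold - intercept) / slope

# CPU utilization growing ~1.5 points/day from 40%: hits 70% at day 20.
days = [0, 1, 2, 3, 4]
util = [40.0, 41.5, 43.0, 44.5, 46.0]
print(days_until_threshold(days, util, 70.0))  # 20.0
```

The projection is only as good as the linearity assumption: as the M/M/1 discussion later shows, response-time behavior bends sharply upward at high utilization, so a straight line systematically underestimates risk exactly when it matters most.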

The strengths of this methodology are:

• It is relatively easy to set up once appropriate thresholds have been established
• Provides reasonably accurate projections for CPU and I/O activity for resources with moderate to low utilizations and consistent growth rates
• Allows for limited customization

The weaknesses of this methodology are:

• It can only project behavior that is linear in nature. IT resources such as CPU, I/O, memory, networks, etc. do not behave in a linear manner when utilizations are at moderate to high levels. This means that the busier the resource, the more inaccurate linear trending results will be.
• Results in over provisioning because the thresholds are set lower to protect from having behaviors that cannot be accurately projected with linear trending.
• Results in a significant number of inaccurate risk representations when risk becomes most important which is when utilizations are higher.
• Does not take into account configuration or workload changes.

## Enhanced trending

Enhanced trending imposes simple algebraic quadratic functions to best fit time series data. Oftentimes these functions take into account multiple statistics in one equation. For risk prediction, this means that a quadratic function is used to project future values from the historical data. This method is used to project when established thresholds will be exceeded.
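The fit itself is a least-squares quadratic, solvable directly from the normal equations. The disk-usage series below is synthetic, chosen so the fit recovers a known curve; a real series would fit only approximately.

```python
def quad_fit(xs, ys):
    """Least-squares quadratic fit y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # power sums S0..S4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal-equation matrix for the unknowns [a, b, c].
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):                                            # Gaussian elimination
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))          # partial pivoting
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            m[r] = [mr - f * mi for mr, mi in zip(m[r], m[i])]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                                           # back-substitution
        coef[i] = (m[i][3] - sum(m[i][j] * coef[j]
                                 for j in range(i + 1, 3))) / m[i][i]
    return coef  # [a, b, c]

# Synthetic disk usage accelerating as y = 0.5*x^2 + 2*x + 100 (GB vs. days).
xs = [0, 1, 2, 3, 4]
ys = [100.0, 102.5, 106.0, 110.5, 116.0]
a, b, c = quad_fit(xs, ys)
print(round(a, 2), round(b, 2), round(c, 2))  # 0.5 2.0 100.0
```

Choosing the right function family for a given resource is the hard part the weaknesses below describe; the arithmetic of fitting it is the easy part.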

The strengths of this methodology are:

• Provides reasonably accurate risk projections for CPU and I/O activity for resources with moderate to low utilizations and consistent growth rates.
• Provides the best risk prediction for memory and disk space when correct functions are provided.
• Allows for customization

The weaknesses of this methodology are:

• It takes a very strong understanding of the environment to establish the quadratic functions that will accurately represent the environment's behavior. It also takes a person with a reasonably strong math background to come up with the functions.
• Provides more accurate results for resources with low to moderate utilizations than linear trending but still is not accurate for higher utilizations. Again the behavior of IT resources does not conform to quadratic functions consistently.
• Results in over provisioning because the thresholds are set lower to protect from having behaviors that cannot be accurately projected with enhanced trending.
• Results in a significant number of inaccurate risk representations when risk becomes most important which is when utilizations are higher.
• Does not take into account configuration or workload changes.

## Event projection

Event projection uses historical data about events to project when future events will occur. Risk is then calculated based upon which events will occur and when they will occur. Event projection can also incorporate some of the capabilities of variation from normal when making risk determinations.
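One naive form of this projection assumes events recur at their mean historical interval. This is an illustrative simplification; real implementations model event patterns far more carefully.

```python
def project_next_event(event_times):
    """Project the next occurrence as the last event plus the mean inter-event gap.
    A deliberately naive illustration of event projection."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return event_times[-1] + sum(gaps) / len(gaps)

# A batch job has failed on days 3, 10, 17, and 24 (a ~7-day rhythm),
# so the next failure is projected for day 31.
print(project_next_event([3, 10, 17, 24]))  # 31.0
```

Even this toy version exposes the core limitation listed below: it can only project events that have already happened, repeatedly, in the history.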

The strengths of this methodology are:

• Works best for situations that do not experience significant variations in workload and configuration changes
• Generally has the ability to adjust what is considered ‘normal’ over longer periods of time.
• Allows for customization of event types.

The weaknesses of this methodology are:

• Results in a significant number of inaccurate risk representations because of the time required for it to adapt to normalcy changes in the environment.
• Requires that events have actually occurred in the past in order to predict risk. The goal of risk analysis is to prevent such events, not assume they will continue in the future.
• Requires that the appropriate events are established and this requires a very strong understanding of the environment.
• Changing conditions within the environment will require events to change. This is a significant manual maintenance issue.
• Can require significant resources especially when big data techniques are incorporated
• It usually takes at least 90 days' worth of data to gain even a very basic understanding of what is 'normal', resulting in inaccurate risk representations. This means the ROI takes longer than for most of the other methodologies.
• Analytics require significant resources and can rarely be used for real-time analysis.

## Allocation projection

Allocation projection determines risk by looking at the total amount of resources available (capacity) versus the amount of capacity that will be allocated. When the projected allocated capacity gets close enough to the available capacity, risk is considered to increase.
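At its simplest, this projects a constant allocation growth rate forward to exhaustion. The growth rate and capacities below are assumed for illustration.

```python
def days_until_full(capacity, allocated, growth_per_day):
    """Project when allocated capacity will reach total capacity,
    assuming a constant growth rate (an illustrative simplification)."""
    if growth_per_day <= 0:
        return None  # allocation is flat or shrinking
    return (capacity - allocated) / growth_per_day

# 1.2 TB of a 2 TB pool allocated, growing 50 GB/day: full in 16 days.
print(days_until_full(capacity=2000, allocated=1200, growth_per_day=50))  # 16.0
```

As with allocation comparison, the projection tracks what is reserved, not what is used, which drives the over-provisioning weakness below.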

The strengths of this methodology are:

• Is most applicable to virtualized environment placement and for disk space resources
• Adapts when capacity and work growth is well understood
• Allows for customization

The weaknesses of this methodology are:

• Results in a significant number of inaccurate risk representations because it takes into account what is allocated not what is actually being used to complete work.
• Results in significant over provisioning when the work being performed does not equal the allocation that is given. It is very difficult to know what appropriate allocations are when the workload is not well understood.

## Queuing theory for risk

Queuing theory predicts future system utilization, throughput, queue length, and response time based on the impact of the work to be performed and the configuration of the system. These metrics are computed for a single resource, a system, or multiple systems. Risk is determined by automatically comparing the predicted values with what is required to provide acceptable service levels. Queuing theory handles changes to workload and configuration very well. It incorporates various model types to best fit the given implementation.
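Combining the earlier M/M/1 result with a workload growth assumption shows the approach: step the arrival rate forward and find the first day the predicted response time breaks the service-level target. All numbers are illustrative assumptions.

```python
def days_until_sla_breach(arrival_rate, service_rate, growth_per_day,
                          sla_response_time, horizon_days=365):
    """Step the workload forward and return the first day the projected
    M/M/1 mean response time exceeds the service-level target."""
    lam = arrival_rate
    for day in range(horizon_days + 1):
        if lam >= service_rate:
            return day  # saturation: response time is unbounded
        if 1.0 / (service_rate - lam) > sla_response_time:
            return day
        lam += growth_per_day
    return None  # no breach within the horizon

# 60 req/s today against a 100 req/s server, workload growing 2 req/s per day,
# 50 ms SLA: the target is breached once arrivals exceed 80 req/s, on day 11.
print(days_until_sla_breach(60.0, 100.0, 2.0, 0.050))  # 11
```

Notice the contrast with linear trending: utilization grows linearly here, but the predicted response time does not, so the model flags the breach at the knee of the curve rather than extrapolating a straight line past it.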

The strengths of this methodology are:

• Allows for customization
• Provides very accurate risk results for CPU and I/O activity.

The weaknesses of this methodology are:

• Not applicable for memory or disk space risk determination
• Analytics requires more resources than threshold methodologies but usually less than event or allocation projection.

## Conclusion

One of the keys to using any of the above methodologies is automation. These methodologies have to evaluate a significant amount of data and apply some not-so-simple math. Vityl Adviser provides the automation, algorithms, and efficiency so you do not have to deal with those things. This results in significant manpower savings.

In order to make health and risk truly valuable to an organization, the results must be extremely easy to understand and presented in an actionable manner. Vityl Adviser provides device-independent access using tailored personas. The personas ensure that the correct and most actionable information is presented to each stakeholder in an easy-to-understand manner. This results in issues being addressed in the most efficient ways.

TeamQuest provides the most accurate determinations for health and risk. For CPU and I/O activity, TeamQuest uses queuing theory. It is by far the most accurate method for determining health and risk for CPU and I/O activity. For disk space and memory health, TeamQuest uses enhanced threshold comparison that incorporates more than 25 years of experience in this field. These health determinations for disk space and memory are the most accurate in the industry. TeamQuest uses enhanced trending for disk space and memory risk determination. Again, these risk calculations incorporate our vast experience and are the most accurate in the industry. The end result of all of this is that TeamQuest provides the most accurate health and risk determination for key components. All the methods used by TeamQuest adapt well to workload, configuration, and other environment changes.

TeamQuest does not stop there. We take the industry-leading component health and risk determinations and apply them to derive single values for system and service health and risk while maintaining accuracy and efficiency. No one but TeamQuest provides this single, easy-to-understand, accurate value for health and risk for services and systems. This allows various stakeholders to see and understand health and risk at the level that makes sense to them: services. If they desire, they can drill down to the system and component levels.

TeamQuest brings together automation, industry leading algorithms, flexibility, device independence, usability, functionality, accuracy, efficiency, tailored personas and value in its Vityl Adviser product.