White paper:
TeamQuest Model: Your System's Health Practitioner
Just as people go to the doctor for periodic checkups, organizations should regularly revisit TeamQuest Model. It provides proactive capacity management: it enables organizations to examine the fitness of their systems and determine how they are performing today compared to previous visits. This makes it easy to see at a glance whether IT is staying on plan, whether service levels are being met, and whether a select few systems need to be monitored more closely. Without the discipline of scheduled visits, months and even years can pass between system models, and that's a sure recipe for ill health. Let's take a look at three patient systems. Each scenario presents different symptoms, diagnoses and prescriptions.
First Patient: Suffering from "visitosis"
A look at the patient's charts revealed that CPU utilization had increased from 33% to 54%, with Oracle being the major culprit. Checking further, Oracle response times had risen from 64 seconds to 71 seconds. Java response times had also increased, from 22 seconds to 27 seconds. From there, the systems management doctor drilled down to analyze the components of response. A chart in TeamQuest Model shows where you are spending your time and resources to complete a unit of work. It breaks down all of the components that make up the overall response time. For this patient, most components remained the same when comparing the previous model to the current model. However, CPU service rose from 5.9 seconds to 8.2 seconds and CPU queuing went up from 3.8 seconds to 8.2 seconds. Clearly something was going on with the patient's CPU.
The doctor then looked at the measured population and throughput, and both were the same as before. The population chart tells you how much work is going on in the system, while the throughput tells you how much work has been completed. Yet CPU utilization had risen from 30% to 42%. At this point the systems management doctor had enough data for a diagnosis. System utilization was up and the relative response time had increased without a surge in workload volume or a change in throughput. Clearly, this patient was suffering from "visitosis" of an active resource! Visitosis is a deadly, yet common disease that is caused by changes in a workload's executable code. Such changes can increase activity at active resources, and they happen all the time in development without systems personnel being informed.
But before jumping to final conclusions and charging into a solution, it is important to validate the diagnosis. One way to do this is to observe visit counts for the CPU. The visit count shows how many times a workload visits a resource to complete a unit of work. During this same period, the visit count went up by 30% to get the same work done. Clearly, it was time to speak to the Oracle application development group to find out what recent code changes could have caused an increase in visits to the CPU active resource. Whatever had been done, the Capacity Management group was unaware of the change. This highlighted a possible gap in Change Management as part of an ITIL implementation. Once the bug had been resolved, the patient was scheduled for a follow-up appointment in three months.
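The reasoning behind this diagnosis can be sketched with the Utilization Law from operational analysis (utilization = throughput x service demand per unit of work). The sketch below is illustrative only and is not how TeamQuest Model computes anything: the function names, the 5% tolerance and the throughput value are assumptions, while the 30% and 42% utilization figures come from the charts described above.

```python
# Sketch of the visitosis check using the Utilization Law (U = X * D).
# All names, the tolerance and the throughput value are hypothetical.

def service_demand(utilization: float, throughput: float) -> float:
    """Resource time consumed per completed unit of work (D = U / X)."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return utilization / throughput


def demand_growth(util_before: float, util_after: float,
                  tput_before: float, tput_after: float) -> float:
    """Ratio of service demand in the current model to the previous model."""
    before = service_demand(util_before, tput_before)
    after = service_demand(util_after, tput_after)
    return after / before


def looks_like_visitosis(util_before: float, util_after: float,
                         tput_before: float, tput_after: float,
                         tolerance: float = 0.05) -> bool:
    """True when throughput is flat (within tolerance) but the demand
    per unit of work has grown by more than the tolerance."""
    tput_change = abs(tput_after - tput_before) / tput_before
    growth = demand_growth(util_before, util_after, tput_before, tput_after)
    return tput_change <= tolerance and growth > 1 + tolerance


# Hypothetical throughput (units of work per second), unchanged between models.
THROUGHPUT = 10.0

growth = demand_growth(0.30, 0.42, THROUGHPUT, THROUGHPUT)
print(f"Service demand per unit of work grew by {(growth - 1) * 100:.0f}%")
print("Visitosis suspected:",
      looks_like_visitosis(0.30, 0.42, THROUGHPUT, THROUGHPUT))
```

Because service demand is the visit count multiplied by the service time per visit, a rise in demand at constant throughput points to either more visits or longer visits, which is why the visit count is the next thing to check.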
Modeling of visitosis using TeamQuest can be accomplished by:
a) Selecting What if > Choose Step Growth Type.
b) Selecting the workload of interest.
c) Selecting the type of growth and completing the model to verify that the remedy has actually addressed the visit count problem.
Second Patient: System utilization up
The TeamQuest drill-down revealed that while there were general increases in most sectors as predicted, the "Other" category had gone up significantly on most metrics. This was a classic case of "processes undefineness." The cure for this malady was to conduct further analysis of the processes within the "Other" workload category using TeamQuest IT Service Analyzer or TeamQuest View, and then to assign those processes to current or new workloads and workload sets. Once that is accomplished, verify that all workloads have been defined using TeamQuest IT Service Analyzer or TeamQuest View. Definitions should include a workload and a workload set. One of the keys to success with TeamQuest Model is to know your environment. This includes knowing your work and the associated hardware. But as environments are always changing, this patient was scheduled for a follow-up appointment in 6 months.
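The idea of shrinking the "Other" category can be illustrated with a small sketch. It does not reproduce how TeamQuest IT Service Analyzer or TeamQuest View define workloads; the patterns, process names and CPU figures are all made up for illustration. Any process that matches no workload definition falls into "Other", and listing those processes by CPU consumption shows which definitions are missing.

```python
import re
from collections import Counter

# Hypothetical workload definitions: workload name -> pattern on process name.
WORKLOAD_PATTERNS = {
    "Oracle": re.compile(r"^ora"),
    "Java": re.compile(r"^java"),
}


def classify(process_name: str) -> str:
    """Return the first workload whose pattern matches, else 'Other'."""
    for workload, pattern in WORKLOAD_PATTERNS.items():
        if pattern.search(process_name):
            return workload
    return "Other"


def cpu_share_by_workload(samples):
    """samples: iterable of (process_name, cpu_seconds).
    Returns a dict mapping workload -> fraction of total CPU time."""
    totals = Counter()
    for name, cpu in samples:
        totals[classify(name)] += cpu
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {workload: cpu / grand_total for workload, cpu in totals.items()}


def undefined_processes(samples):
    """Names of processes that land in 'Other', largest CPU consumers first."""
    other = Counter()
    for name, cpu in samples:
        if classify(name) == "Other":
            other[name] += cpu
    return [name for name, _ in other.most_common()]


# Made-up measurements: (process name, CPU seconds).
SAMPLES = [
    ("oracle_db", 40.0),
    ("java", 20.0),
    ("backup_agent", 30.0),
    ("report_gen", 10.0),
]

shares = cpu_share_by_workload(SAMPLES)
print(f"'Other' share of CPU: {shares['Other']:.0%}")
print("Processes still undefined:", undefined_processes(SAMPLES))
```

In this made-up data, adding definitions for the two unmatched processes would empty the "Other" category, which is the state the follow-up verification step looks for.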
Last Patient: Disk constraints
A quick look at the charts was the best way to see what had transpired. As predicted, transaction volume had doubled, and now the one controller and its few associated disks were slowing the entire system down. Modeling was done using the Stretch Factor metric. Stretch Factor is a key indicator of queuing delays within a system. It is calculated as (service time + queue time) divided by service time. An ideal Stretch Factor is 1, which indicates that there is no queuing time. At a Stretch Factor of 2, queue time equals service time, and beyond that point queuing delays grow steeply as utilization rises. The systems management doctor modeled the addition of another controller and a group of 10 more disks. This took the Stretch Factor down from an alarming 25 to only 1.4, which translated into an acceptable response time of 2.4 ms. The patient could now move forward confidently with this upgrade, knowing that the system could deal with the forecasted increase in transaction volume without experiencing further delays in disk response. The patient was told to implement the prescription and was scheduled to return for an appointment within two weeks to model the next wave of business growth.
This is just a sampling of how you, as a systems management doctor, can use TeamQuest Model as your systems' health practitioner. Follow TeamQuest and other systems management doctors online at LinkedIn.
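The Stretch Factor arithmetic from the last patient can be made concrete with a short sketch. The first two functions follow the definition given in the text and assume that response time is simply service time plus queue time. The third function is an added assumption, not something TeamQuest Model is stated to use: it applies the textbook single-server M/M/1 result, where the Stretch Factor equals 1 / (1 - utilization), to show how steeply queuing grows past a Stretch Factor of 2.

```python
def stretch_factor(service_time: float, queue_time: float) -> float:
    """(service time + queue time) / service time, as defined in the text."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    return (service_time + queue_time) / service_time


def split_response(response_time: float, stretch: float):
    """Recover (service_time, queue_time) from a response time and a
    Stretch Factor, assuming response = service + queue."""
    if stretch < 1:
        raise ValueError("stretch factor cannot be below 1")
    service = response_time / stretch
    return service, response_time - service


def mm1_stretch(utilization: float) -> float:
    """ASSUMPTION: single-server M/M/1 queue, where stretch = 1 / (1 - U).
    Real disk subsystems will deviate from this idealized model."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


# Figures from the modeled upgrade: Stretch Factor 1.4, response time 2.4 ms.
service_ms, queue_ms = split_response(2.4, 1.4)
print(f"service ~{service_ms:.2f} ms, queue ~{queue_ms:.2f} ms")

# Illustrative utilizations under the M/M/1 assumption.
for u in (0.5, 0.8, 0.9, 0.96):
    print(f"utilization {u:.0%} -> stretch factor {mm1_stretch(u):.1f}")
```

Under these assumptions, the 2.4 ms response time splits into roughly 1.7 ms of service and 0.7 ms of queuing, and a Stretch Factor of 2 corresponds to 50% utilization of the idealized single server.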