Using Surveyor to Make Capacity Management More Proactive
A major retailer uses TeamQuest Surveyor each week to analyze growth trends across thousands of systems containing tens of thousands of file systems. A few of these file systems will require intervention. However, finding out which ones by hand would take IT staff many hours of laborious work, accessing each file system and individually determining its state. Surveyor automatically provides IT with a report on each file system so that the staff assigned to those file systems can check the report and zero in on the specific ones that demand attention. That way, they spend their time inspecting likely candidates for action rather than looking in places that are in good shape. The value of this approach is considerable. Every avoided outage saves thousands of dollars, and some might save as much as tens of thousands. By focusing attention where it provides the most benefit, the Surveyor file system report frees IT to act more proactively in other areas.
This retailer has outlined a natural progression from reactive incident response to fully proactive capacity management.
In the initial stages, a company is largely reactive, taking action only when calls, tickets or complaints come in. By then, the impact of an incident has already been felt. IT rarely gets to look at more than 5 to 10% of the enterprise, so it is often surprised by “unforeseen” events. Incidents therefore have to be handled immediately, interrupting whatever projects are under way. The work to fix such problems has to be completed fast, so in many cases the solution adopted is far from optimal. To make matters worse, incidents that involve running out of capacity (in this case storage) are no small matter. The retailer therefore looked for a better way.
The retailer first attempted to become proactive without TeamQuest Surveyor, seeking more predictability across its 5000 systems, which encompass mainframe and open systems as well as a large Storage Area Network (SAN). The company moved from simply coping with whatever calls came to its attention to focusing its actions on critical systems. This was achieved by talking more with the business to find out what its priorities were. However, the new approach of periodically reviewing critical systems still meant that only 5 to 10% of systems were adequately covered. The difference was that the work was scheduled rather than driven by constant interruptions. The quality of the remedial work also improved and the volume of emergencies dropped. But IT management realized that to be in full control, it had to take responsibility for all systems.
To take things to the next level, the organization implemented Surveyor. That gave it a massive gain in scope, from around 10% to almost 100% of systems covered. The result: instead of working gradually through all critical systems on a roughly quarterly basis to ensure there were no capacity issues, IT can now check everything weekly. Surveyor points out the areas to address. It still takes a person to deal with the exceptions, but Surveyor finds almost all of the hot spots.
Take, for example, file-system-full problems. In the past, the capacity manager had to go to the server, run a script, view the results and analyze them. IT could see an immediate problem such as a full file system and fix it. But because there was no historical context, it was sometimes hard to determine whether action should be taken. A file system that is 90% full might seem a concern, but if it has stayed that way for 6 months, it might not be a priority. Another file system might be only 75% full, but if it is growing at 20% per month, it is about to have serious problems. TeamQuest Surveyor reports provide the raw data on file systems above a certain threshold of utilization and/or growth. IT can then drill down in the report to see how growth is trending over time.
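The comparison between the two file systems can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from Surveyor. It assumes that "growing at 20% per month" means 20 percentage points of capacity per month and that growth is linear; the function name months_until_full is hypothetical.

```python
import math


def months_until_full(used_pct, growth_pts_per_month):
    """Months until 100% utilization, assuming linear growth.

    growth_pts_per_month is in percentage points of capacity per month
    (an assumption; the article does not say how growth is measured).
    """
    if growth_pts_per_month <= 0:
        return math.inf  # flat or shrinking: never fills under this model
    return max(0.0, (100.0 - used_pct) / growth_pts_per_month)


# 90% full but unchanged for months: no projected fill date.
stable = months_until_full(90.0, 0.0)

# 75% full and gaining 20 points a month: full in 1.25 months.
growing = months_until_full(75.0, 20.0)

print(stable, growing)
```

Under these assumptions the emptier file system is the urgent one, which is the point the historical context makes visible.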
It is up to the capacity manager to configure the report to fit operational needs. Parameters can be adjusted, exceptions can be added, and regular reports can be produced on matters such as file-system-full forecasts. At this retailer, for instance, thresholds were set as follows:
- Is the file system utilization above 90% AND growing by more than 0.2% for the interval?
- Is the file system utilization above 75% AND growing by more than 2% for the interval?
- Is the file system utilization above 15% AND growing by more than 15% for the interval?
- Is /appl/patrol above 90% AND growing for the interval?
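As a rough illustration, the four rules above can be expressed as a small filter. This is a sketch in Python, not Surveyor's configuration syntax; the FsSample record, the field names and the is_candidate function are all hypothetical, and growth is assumed to be measured in percentage points over the reporting interval.

```python
from dataclasses import dataclass


@dataclass
class FsSample:
    host: str
    mount: str
    used_pct: float    # utilization at the end of the interval
    growth_pct: float  # change over the interval (assumed percentage points)


# (utilization must exceed, growth must exceed), taken from the list above.
GENERAL_RULES = [(90.0, 0.2), (75.0, 2.0), (15.0, 15.0)]

# Mount-specific rule: /appl/patrol above 90% with any growth at all.
MOUNT_RULES = {"/appl/patrol": (90.0, 0.0)}


def is_candidate(sample):
    """Return True if the sample trips any applicable rule."""
    rules = list(GENERAL_RULES)
    if sample.mount in MOUNT_RULES:
        rules.append(MOUNT_RULES[sample.mount])
    return any(
        sample.used_pct > min_used and sample.growth_pct > min_growth
        for min_used, min_growth in rules
    )


samples = [
    FsSample("srv01", "/data", 92.0, 0.1),         # nearly full but almost flat
    FsSample("srv01", "/appl/patrol", 92.0, 0.1),  # same numbers, special rule
    FsSample("srv02", "/logs", 20.0, 16.0),        # fairly empty, growing fast
]
candidates = [s for s in samples if is_candidate(s)]
print([(s.host, s.mount) for s in candidates])
```

Keeping the rules as data rather than code makes it easier to adjust parameters and add exceptions as the report evolves.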
Physical and virtual servers can be included in the same report, but can also be treated separately. Candidate file systems are sorted by the date and time they are most likely to fill up, and all candidates for a single server are shown together (sorted highest to lowest), minimizing the time operations needs to respond. Additionally, the utilization trend can be forecast into the future using multiple statistical options.
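The article does not say which statistical options Surveyor offers. As one simple possibility, the sketch below fits a least-squares straight line to equally spaced utilization samples, projects when the line reaches 100%, and sorts file systems by that estimate. The function names and the sample history are invented for illustration.

```python
import math


def fit_line(ys):
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def intervals_until_full(ys, full=100.0):
    """Intervals after the latest sample until the fitted line reaches `full`."""
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return math.inf  # no upward trend, so no projected fill date
    x_full = (full - intercept) / slope
    return max(0.0, x_full - (len(ys) - 1))


# Hypothetical weekly utilization history, oldest sample first.
history = {
    ("srv01", "/var"): [70, 75, 80, 85],
    ("srv01", "/home"): [90, 90, 90, 90],
    ("srv02", "/data"): [50, 60, 70, 80],
}

# Soonest-to-fill first.
ranked = sorted(history, key=lambda key: intervals_until_full(history[key]))
for key in ranked:
    print(key, intervals_until_full(history[key]))
```

A straight line is only one choice of model; data with seasonality or step changes would call for a different fit. This sketch also sorts purely by projected fill time and does not group candidates by server.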
Using this method, IT’s task is simplified. Instead of having to access almost 5000 systems hosting almost 50,000 file systems, IT is served fewer than 500 candidates that may fill up in the next 2 weeks. When IT staff have an available hour or two, they pull up the report and quickly sort through the candidates, isolating the ones that actually need attention.
If the report returns too many candidates, it is time to change the parameters being used. At this retailer, new rules were set to filter out temporary file systems and servers that have been shut down. While this methodology is not perfect, it catches most upcoming issues.
This file system forecast is just one example of how this retailer is using TeamQuest Surveyor. It removes the time spent looking for potential problems and redirects it to deciding how best to respond (clean up, add capacity, change settings, etc.). Instead of spending 50 minutes of a spare hour looking for potential problems and only 10 minutes fixing them, IT staff now spend no more than 5 minutes looking and 55 minutes fixing. In the past, it took 10 minutes of computer time to review 4 servers with 385 total file systems and identify 5 candidates on 2 servers. With Surveyor, IT can look at 45,000 file systems a week automatically and be served a list of a couple of hundred candidates.
Runaway CPU Candidates
Another example of the value of Surveyor is spotting runaway CPU consumption. This retailer crunches through thousands of servers to isolate those with a total system CPU utilization above 25% where utilization stays at that level (within 2%) for 5 or more consecutive hourly intervals. Running this report gave IT a list of seven candidate runaway processes out of 6870. The operations staff can see the exact process ID of each runaway as well as the server it is on.
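The runaway test described above can be sketched as a sliding-window check over hourly samples. This is one interpretation, not Surveyor's implementation: "within 2%" is assumed to mean that the highest and lowest readings in the run differ by no more than 2 percentage points, and the function name looks_runaway is hypothetical.

```python
def looks_runaway(hourly_cpu, floor=25.0, band=2.0, min_run=5):
    """Return True if some run of `min_run` consecutive hourly samples
    stays above `floor` and varies by no more than `band` points."""
    for start in range(len(hourly_cpu) - min_run + 1):
        window = hourly_cpu[start:start + min_run]
        if min(window) > floor and max(window) - min(window) <= band:
            return True
    return False


# Flat at about 50% for five hours: a candidate.
flat_and_busy = [10.0, 50.0, 50.5, 51.0, 50.2, 50.9, 12.0]

# Busy but steadily climbing: looks like real work, not a stuck process.
ramping = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0]

print(looks_runaway(flat_and_busy), looks_runaway(ramping))
```

The flatness condition is what separates a stuck process from ordinary load, which tends to fluctuate from hour to hour.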
Based on its experience with Surveyor in recent months, the retailer shared the following lessons learned:
- Launch slowly, in phases, so as not to overwhelm the groups who receive reports
- Design reports as flexibly as possible
- Watch out for special cases and exclusions
- Accept the fact that some people like to say they want to be proactive until given the tools to do so, then they “don’t have the time”
- Sell Surveyor to its users by presenting it as a means of saving them time
- Its scope will creep as you gain experience with the tool and realize how much more it can do
- Don’t spend days fighting through a problem; ask TeamQuest support for help. It can usually be solved in under an hour