The flow of information in a hospital labor and delivery room provides an interesting parallel to corporate IT. The nurses constantly monitor the mother’s blood pressure, heart rate, temperature, contractions, etc., as well as the baby’s heart rate, stress level and position. This enables them to make minor adjustments to fluids, IVs and more. The doctor, receiving periodic reports with unnecessary information filtered out, is able to easily monitor the “big picture” for risk, make strategic decisions and guide the process. While decisions are being made at many different levels, they are all derived from the same underlying data being gathered from mom and baby.
This is not unlike an IT organization. Support staff, administrators and engineers, and at a higher level, directors and CIOs, all have to make decisions related to capacity. And just like with the doctors and nurses, all levels of IT capacity questions can be answered through the appropriate use of the available data. The good news is that we have lots of data at our fingertips. And that is part of the problem. What data should be mined and analyzed to achieve the right answers?
Technicians and admins, for example, are swamped with so much information that they don’t know where to focus their energy. They face too many alerts and incidents to be proactive, and they struggle to demonstrate the extent of capacity problems to the boss. CIOs, vice presidents and directors, too, are challenged on where to best direct investment in IT capacity, and on how much progress is being made in addressing issues. In reality, though, many of the questions asked by admins and technicians are the same as those being asked by CIOs and VPs.
Specific to our organization, we suffered from too many questions and not enough answers. We needed a way to analyze our data that would provide answers to questions posed by all levels of the organization. TeamQuest Surveyor provides that analytical capability. With this tool, we are able to use the same data through a single solution that answers questions for both ends of the IT spectrum.
Our environment consists of about 5,000 servers, of which around 1,700 are physical. About 200 significant applications run on a range of systems such as Unix, Linux, Windows, Teradata, Tandem and mainframe. Across multiple data centers, we serve thousands of internal IT customers and millions of shoppers. And this is only the back-end side. It doesn’t include our massive store infrastructure of more than 1,000 retail locations.
A majority of this infrastructure is now virtualized, which poses its own problems for capacity reporting — for the enterprise as a whole, by application, by OS, by server and by resource (CPU, memory, etc.). Therefore, it made sense for us to construct a reporting hierarchy consisting of different levels of aggregation.
At the enterprise level, we need periodic analysis and reporting, not day-to-day, in order to drive strategic IT decision making. At the next layer down — the application/OS level — more frequent reporting is required, focused on the overall health of application infrastructures, to assist portfolio managers in marketing, supply chain and retail in their decision making.
Moving further down the hierarchy, we are concerned with server application environments and specific resource risks for groups of servers. And then at the bottom of the heap, specific resources are monitored. If a server is identified as having a capacity risk, for instance, you have to be able to rapidly discover what is driving that problem. This is the province of system admins and support staff.
To make things easy for each level, we set up different views of the data specific to their needs. Case in point: executive capacity dashboards are useful when the CIO wants to track trends quarterly and annually to see where we stand on capacity risk within the IT infrastructure. Note, though, that this dashboard is the end result of an in-depth analysis by TeamQuest Surveyor that encompasses and consolidates data from thousands of physical and virtual servers. Each server has anywhere from six to 17 metrics checked against platform-specific thresholds. In all, 16,000 capacity risk indicators from 40,000 metric checks on 4,000 systems are analyzed within one day and summarized on the dashboard. This would have taken weeks to accomplish manually using spreadsheets.
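To illustrate how thousands of per-metric checks can roll up into a single dashboard view, here is a minimal sketch. The color scheme and the roll-up rule (a server takes the worst color among its indicators) are assumptions for illustration, not Surveyor’s actual logic:

```python
# Hypothetical roll-up of per-server risk indicators into a
# dashboard summary; colors and roll-up rule are illustrative.

from collections import Counter

def server_status(indicator_colors):
    """A server's status is the worst color among its metric indicators."""
    for color in ("red", "yellow", "green"):
        if color in indicator_colors:
            return color
    return "green"

def dashboard_summary(servers):
    """servers: {name: [indicator colors]} -> count of servers per status."""
    return Counter(server_status(colors) for colors in servers.values())

summary = dashboard_summary({
    "web01": ["green", "yellow"],
    "db01":  ["green", "green"],
    "app01": ["red", "green", "yellow"],
})
# summary counts one server at each status color
```

The same roll-up can then be repeated at each layer of the hierarchy — metrics into servers, servers into applications, applications into the enterprise view.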
This kind of high-level analysis and summarization recently showed that half of the enterprise was appropriately utilized. However, there were some areas of concern where high utilization might be impacting quality of service. It also showed that the organization has a significant opportunity to make better use of more of its resources.
Such dashboards have also been supplied at each level of the IT hierarchy. Consistent color coding helps the various stakeholders to identify the stress level of their servers, applications and resources.
The real power of TeamQuest Surveyor is that we don’t just have a canned product with a few extensions from which we can build some standard charts. Instead, we are able to think about what we want to present and to whom, and then give them what they need specifically to improve the company or their area of it.
In the past, IT has tended to set overly general metrics for items such as CPU utilization. That doesn’t work for us. We use many metrics for CPU, IO, network and memory, and set stress thresholds appropriate to each level of the hierarchy. Instead of a blanket “70% CPU utilization is bad,” we use a rule-based approach grounded in multiple metrics. For example, on Unix hosts we consider CPU utilization in conjunction with run queue lengths. On VMware virtual machines we consider ready time and overall host utilization in addition to virtual CPU usage. Memory utilization in virtual environments requires analysis of active, consumed, allocated, balloon and other memory metrics.
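The rule-based idea can be sketched as follows. The thresholds, metric names and function signatures below are hypothetical illustrations of combining metrics, not TeamQuest Surveyor’s actual rules:

```python
# Hypothetical rule-based capacity-risk checks; thresholds and
# metric names are illustrative only.

def unix_cpu_risk(cpu_util_pct, run_queue_len, n_cpus):
    """Flag CPU risk only when high utilization coincides with queuing."""
    if cpu_util_pct > 70 and run_queue_len > 2 * n_cpus:
        return "red"     # sustained saturation: work is waiting in the queue
    if cpu_util_pct > 70:
        return "yellow"  # busy, but no queue buildup yet
    return "green"

def vmware_cpu_risk(vcpu_util_pct, ready_time_pct, host_util_pct):
    """On VMware, high ready time means vCPUs are waiting on the host."""
    if ready_time_pct > 10 or (host_util_pct > 85 and vcpu_util_pct > 70):
        return "red"
    if ready_time_pct > 5 or host_util_pct > 75:
        return "yellow"
    return "green"
```

Note how a Unix host at 80% CPU with an empty run queue is merely “busy,” not at risk — the blanket 70% rule would have flagged it.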
When we have a methodology defined to determine capacity risk for our systems, we can group the data in a variety of ways to answer different questions. When it comes to servers, we can categorize by location, age, usage and also see how older servers are doing versus newer models. These types of condensed views of the enterprise make mountains of data far easier to digest.
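A grouping like this can be sketched in a few lines of Python; the server attributes and values here are made up for illustration:

```python
# Illustrative grouping of per-server risk data by a server attribute;
# field names and sample data are hypothetical.

from collections import defaultdict

servers = [
    {"name": "web01", "location": "DC1", "age_years": 2, "risk": "green"},
    {"name": "db01",  "location": "DC1", "age_years": 6, "risk": "red"},
    {"name": "app01", "location": "DC2", "age_years": 5, "risk": "yellow"},
]

def risk_by(servers, key):
    """Count risk colors within each group, e.g. per data center."""
    groups = defaultdict(lambda: defaultdict(int))
    for s in servers:
        groups[s[key]][s["risk"]] += 1
    return {k: dict(v) for k, v in groups.items()}

print(risk_by(servers, "location"))
# {'DC1': {'green': 1, 'red': 1}, 'DC2': {'yellow': 1}}
```

Swapping the grouping key — location, age bracket, usage class — answers a different question from the same underlying data, which is the point of the condensed views.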
We also have a VMware executive dashboard which presents a very high-level view of our VMware capacity at an enterprise level. An aggregate indicator is the highest level and shows the capacity risk for every VMware host based on an aggregate of resource risks. Thresholds are customized based on best practices and company-specific needs. At the same time, TeamQuest Surveyor enables us to provide views of our VMware infrastructure for every level of the IT hierarchy. With Surveyor, we can use the same data to provide reports at the proper level of aggregation for all levels of IT staff. All of these reports are fully automated, published and distributed to appropriate staff on a regular basis. We can also do detailed VMware rightsizing using the analytical capabilities of Surveyor which allows complex algorithms to be applied.
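As a sketch of what percentile-based rightsizing can look like in general — this is a generic technique, not Surveyor’s actual algorithm, and the percentile and headroom figures are assumed:

```python
# Hypothetical VM rightsizing sketch: size vCPUs to a high percentile
# of observed CPU demand plus headroom. Percentile (95th) and
# headroom (20%) are assumed values, not Surveyor's algorithm.

import math

def recommend_vcpus(cpu_demand_pct_samples, current_vcpus, headroom=0.2):
    """Recommend a vCPU count from observed demand, minimum 1 vCPU."""
    samples = sorted(cpu_demand_pct_samples)
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    needed = (p95 / 100.0) * current_vcpus * (1 + headroom)
    return max(1, math.ceil(needed))
```

A VM holding eight vCPUs but peaking near 10% demand would be recommended down to one vCPU under these assumptions, freeing host capacity for its neighbors.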
Here are the lessons we have learned in implementing TeamQuest Surveyor:
By bearing these lessons in mind, we have used TeamQuest Surveyor to achieve far better utilization of our existing infrastructure. We have also seen a reduction in problem tickets and in capacity-related problems overall. The tool moved us from a reactive to a proactive approach to capacity and performance management.
In addition, it helped us achieve better alignment and communication between the business and IT. The business can now understand pertinent capacity risks at a level that is appropriate to their interests. As a result, IT staff and management can focus on the capacity level appropriate to themselves while being able to communicate with each other effectively.
Steve Wenzbauer, Enterprise Systems Management Manager.