Providing a Global Perspective in the Data Center
Issues and Concerns: Managing the Distributed Environment
It isn’t easy monitoring a distributed environment. Unchecked server growth fills data center space rapidly. Heterogeneous platforms and operating systems make provisioning and inventorying a long and laborious process. Power supply limitations can prevent cooling upgrades or curtail the addition of servers. It is difficult to know what IT resources are available, making it extremely difficult to plan effectively for future needs.
Capacity planners, therefore, have two key questions to answer: How do we use the existing environment? And what is our total installed capacity? Most of the time there isn’t an easy response.
The picture is muddled by the fact that the capacity planner is often forced to compare apples to oranges. The sheer variety of platforms and configurations makes it difficult to draw comparisons across multi-tiered and multi-OS applications, across multiple environments that are each unique (Production, Development, QA), and within a server population that is constantly changing.
Clustering and failover add to the confusion, as do myriad details such as whether hyperthreading is turned on or off and how much cache each server has. Even if the capacity planner begins to grasp the types of hardware present, it is another matter entirely to come to grips with how that hardware is used, by what applications, and by which business unit.
Despite this spiraling confusion, management takes a simple view. They want answers - now. Top management doesn’t want 15 spreadsheets full of complex equations and calculations that "show" the existing state of IT. Typically, management wants the information boiled down to one number and delivered within a few hours.
Normalization can help here. Expressing the capacity of every server on a common scale, based on an industry-standard benchmark, makes dissimilar systems comparable. SpecIntRates, for example, are very useful in this regard.
This effort to normalize is also materially assisted by leveraging TeamQuest Performance Software. TeamQuest Model, for example, provides a wealth of inventory information (such as model, number of CPUs and hyperthreading) as well as relative CPU processing power.
Take the case of normalizing x86 systems running Linux and Windows. In a large enterprise, this can equate to hundreds of different configurations. The picture can be further complicated if the applications are multi-OS.
Experience here has demonstrated the value of SpecInt2000Rate used in combination with TeamQuest Model. Most of the servers can be easily matched with the TeamQuest database, which automatically assigns a value to them. If a server has no match in the database, try to obtain a match at www.spec.org. If only the older SpecInt95 value is available, TeamQuest has devised conversion factors to translate it into SpecInt2000Rate (and vice versa):
INT2000 to INT95 conversion factor is 9.4
INT95rate to INT95 conversion factor is 9.02
INT2000Rate to INT2000 conversion factor is 86.7
INT2000Rate to INT2000 conversion factor is 86.2
Capacity Reconciliation
Capacity reconciliation, then, is all about constructing a monthly capacity profile for each server based on peak CPU demand, normalizing that CPU utilization, and creating zones of aggregation such as by OS, environment, line of business, application, and location.
Valuable profiles for the capacity manager to capture are monthly peak CPU utilization and CPU capacity usage for each measured server. Ideally, these would be available for each 10-minute interval on a 24-hour timeline and with approximately 30 days in each sample. By calculating the 95th percentile for all values, it is possible to derive peak demand figures for the month.
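The percentile step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the nearest-rank definition of a percentile is one of several in common use, and the 95th percentile is taken separately for each 10-minute slot across the days of the month, which is one reading of the procedure described above.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def monthly_peak_profile(readings_by_slot, pct=95):
    """readings_by_slot maps a 10-minute slot index (0..143) to the list of
    CPU utilization readings (percent) seen in that slot on each day.
    Returns slot -> percentile value, i.e. a 24-hour peak-demand profile."""
    return {slot: percentile(readings, pct)
            for slot, readings in readings_by_slot.items()}

# Hypothetical data: two slots, 30 daily readings each.
sample = {0: list(range(1, 31)), 1: [40] * 29 + [99]}
profile = monthly_peak_profile(sample)
```

Using a percentile rather than the raw maximum keeps a single outlier reading (the 99 in slot 1 above) from defining the monthly peak.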
Normalization is achieved by expressing CPU utilization in CPU Capacity Units (CCU) as a function of utilization and the SPECint_rate_base2000 rating of the physical server. This normalized utilization metric is called CPU Capacity Usage and is abstracted from the hardware platform.
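The text does not spell out the exact function that combines utilization and rating. A minimal sketch, assuming the simplest form (usage in CCU equals the utilization fraction multiplied by the server's SPECint_rate_base2000 rating, with the rating itself serving as installed capacity), might look like this. The function names and the sample ratings are hypothetical.

```python
def cpu_installed_capacity(spec_rating):
    """Installed capacity in CCU; assumed here to equal the benchmark rating."""
    return float(spec_rating)

def cpu_capacity_usage(utilization_pct, spec_rating):
    """CPU Capacity Usage in CCU, assuming usage = utilization x rating."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100 percent")
    return utilization_pct / 100 * spec_rating

# Two hypothetical servers at the same 50 percent utilization:
small = cpu_capacity_usage(50, 20.0)   # lower-rated machine
large = cpu_capacity_usage(50, 80.0)   # higher-rated machine
```

Under this assumption, the same 50 percent reading translates into four times as many capacity units on the higher-rated machine, which is what makes usage comparable across platforms.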
Capacity aggregation is realized by determining the Total CPU Capacity Usage vs. Total CPU Installed Capacity in the desired aggregation levels or server groupings. This can be done in many different ways depending on the requirements: among all Windows servers, all Linux servers or for all servers utilized by a line of business, for example.
To calculate these numbers, utilize a 24-hour timeline in 10-minute increments to sum up the following metrics in CCU units of measure:
All servers CPU Installed Capacity (A)
All servers CPU Capacity Usage (B)
For each 10-minute interval, calculate percentage of total utilization (UI) as the ratio of usage to installed capacity:
UI = (B / A) × 100
Be aware, though, that total utilization is calculated as the maximum value of the 24-hour profile for the aggregated platform.
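The aggregation can be sketched as follows, taking UI for each interval as B divided by A, expressed as a percentage. The data layout (one installed-capacity figure plus one usage profile per server) and the function names are assumptions made for illustration, and each server is assumed to be present for the whole day.

```python
def utilization_profile(servers):
    """servers: list of (installed_ccu, usage_profile) pairs, where
    usage_profile holds one CCU value per 10-minute interval.
    Returns UI (percent) for each interval of the aggregated group."""
    if not servers:
        raise ValueError("no servers in group")
    total_installed = sum(installed for installed, _ in servers)  # A
    intervals = len(servers[0][1])
    profile = []
    for i in range(intervals):
        total_usage = sum(usage[i] for _, usage in servers)       # B
        profile.append(100 * total_usage / total_installed)
    return profile

def total_utilization(servers):
    """Total utilization of the group: the maximum of its 24-hour profile."""
    return max(utilization_profile(servers))

# Hypothetical group of two servers, shortened to three intervals.
group = [
    (100.0, [10.0, 50.0, 20.0]),
    (100.0, [30.0, 50.0, 20.0]),
]
ui = utilization_profile(group)
peak = total_utilization(group)
```

Summing usage before dividing matters: the group peak is the peak of the combined load, not the sum of each server's individual peak, since servers rarely peak in the same interval.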
Note that variance accounting is needed for the above exercise to hold up in practice. Alla Piltser of Merrill Lynch said, "Month to month, we noticed that our numbers were not consistent. Huge variances were showing up such as 2000 Linux servers one month and 1500 the next."
This led to the realization that there was no stable base from which to compare numbers month to month. Thus Piltser and her team had to pay attention to how many servers were modeled the previous month, calculate the number of dropped and added servers, and work out the utilization rates on any added servers. Such variances must be taken into account and included in any modeling to ensure accuracy.
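One way to picture this bookkeeping is as a comparison of two monthly inventories. The sketch below is an assumption about how such a reconciliation could be structured, not a description of the team's actual process; the function name, field names, and server names are hypothetical.

```python
def reconcile_inventory(previous, current):
    """previous, current: dicts mapping server name -> peak CPU Capacity
    Usage (CCU) for the month. Splits the month-to-month change into
    dropped servers, added servers, and servers present in both months."""
    prev_names, curr_names = set(previous), set(current)
    dropped = sorted(prev_names - curr_names)
    added = sorted(curr_names - prev_names)
    carried = sorted(prev_names & curr_names)
    return {
        "dropped": dropped,
        "added": added,
        "carried_over": carried,
        # Usage attributable to servers that appeared this month.
        "added_usage": sum(current[name] for name in added),
        # Like-for-like change, measured only on the stable base.
        "carried_over_change": sum(current[n] - previous[n] for n in carried),
    }

# Hypothetical inventories for two consecutive months.
last_month = {"web01": 10.0, "db01": 5.0}
this_month = {"db01": 6.0, "app01": 4.0}
report = reconcile_inventory(last_month, this_month)
```

Separating the like-for-like change from the effect of added and dropped servers gives the stable base for comparison that the raw monthly totals lack.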
End Result
The end result of this action is to highlight who really needs and who doesn’t need more capacity. Using TeamQuest, Merrill Lynch can now search by server, application and month to view the CPU Capacity profile. This demonstrates conclusively, for example, that some users are asking for more servers even though underutilized servers may be in their immediate vicinity.
Another factor well worth knowing is how many CCUs are available for production. While the organization as a whole might have 63,351 CCUs, 19 percent could be earmarked for QA and 14 percent for development.
Thus capacity managers have to understand that only 67 percent of the total can be used for production purposes. TeamQuest can be fed that data so capacity plans are based upon effective capacity as opposed to total capacity.
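The arithmetic behind the 67 percent figure is simple enough to check directly. The sketch below uses the numbers from the example above; the function name and the dictionary layout are hypothetical.

```python
def effective_capacity(total_ccu, reserved_pct):
    """Capacity left for production after subtracting the percentages
    earmarked for other environments."""
    production_pct = 100 - sum(reserved_pct.values())
    if production_pct < 0:
        raise ValueError("reserved percentages exceed 100")
    return production_pct, total_ccu * production_pct / 100

production_pct, production_ccu = effective_capacity(
    63351, {"QA": 19, "Development": 14}
)
# production_pct is 67; production_ccu is roughly 42,445 CCUs
```

Planning against roughly 42,445 CCUs rather than 63,351 is the difference between effective and total capacity in this example.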
But the overall picture doesn’t tell the whole story. While total effective capacity may show plenty of room to breathe, it takes drilling down into each OS, line of business and application to determine that one or more of them is exceeding monthly peak CPU capacities at a particular time of day.
For example, Piltser’s team discovered that they exceeded capacity between 5 and 7 each day when they ran this on Linux. By using TeamQuest to take a more granular approach to capacity management, it is possible to spot such extremes and take them into account in the planning process.
Power of Information
The capacity manager has a wealth of information at his or her fingertips to understand installed and used capacity. This can be harnessed to optimize data center operations, control growth and become far more proactive in dealing with capacity and performance issues. Underutilized servers and environments, for example, can be identified as candidates for virtualization and/or consolidation.
The capacity manager is also a great friend to any area of the organization that requires more hardware. By substantiating requests to purchase new servers, the capacity manager makes it far more likely that deserving lines of business will prevail and wasteful units will be forced to become more efficient.
Realize, however, that capacity planning is never more than a work in progress. There is always something to improve and there are always changes being made - either in IT or to the various business units.