August 20, 2012

    If you're at this level, you probably...

      1. Focus on workload analysis rather than on the performance of individual technical components; and

      1. Rely on trending to give you early warnings regarding future incidents.

    However, you also probably...

      1. Have a hard time predicting the outcome of complex scenarios with enough confidence.

    Organizations at lower maturity levels react to events as they happen, without putting much effort into understanding the cyclicality or seasonality of their applications. At this level you take a more forward-looking view, trying to predict and avoid performance problems before they occur. Tools are in place for quicker diagnosis of performance problems, and trending is used to predict bottlenecks or incidents in the near future. When a risk is detected, work is commissioned to mitigate it. In many cases this lets you anticipate and avoid downtime.
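
    As a rough illustration of trending, the sketch below fits a straight line to daily peak utilization and projects how many days remain before the line crosses a threshold. It is a minimal sketch, assuming a simple linear trend is adequate; the function names, the 80% threshold and the sample figures are hypothetical and not taken from any TeamQuest product.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(days: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values ~ slope * day + intercept."""
    n = len(days)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (day, value) pairs of equal length")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("days must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_threshold(days: Sequence[float], values: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Days after the last observation until the fitted trend reaches threshold.

    Returns 0.0 if the trend line is already at or above the threshold,
    and None if the trend is flat or falling (no crossing is projected).
    """
    slope, intercept = fit_linear_trend(days, values)
    last_day = days[-1]
    if slope * last_day + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_day


if __name__ == "__main__":
    # Hypothetical daily peak CPU utilization (%) for one server over a week.
    days = [0, 1, 2, 3, 4, 5, 6]
    cpu_peak = [50, 52, 54, 56, 58, 60, 62]
    print(days_until_threshold(days, cpu_peak, threshold=80.0))  # 9.0
```

    With utilization climbing two points a day from 50%, the fitted line reaches 80% nine days after the last observation. A straight line will miss cyclical or seasonal patterns, so treat the result as an early warning rather than a firm date.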

    In order to do trending you need historical information, not just a view of current performance or of the last few days. The tools you use must be able to collect and aggregate data continuously and store it in a historical database. Another key attribute of this level is the focus on workloads instead of just the combined activity of a technical component. Rather than treating infrastructure components as single entities, you break down a component's activity by application or service. These streams of activity then become the focus of analysis and reporting.
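
    One way to picture the workload breakdown: tag every raw sample with the application or service it belongs to, roll the samples up per day, and then ask how a component's activity splits across workloads. The sketch below assumes samples arrive already tagged (attributing activity to workloads is the hard part and is not shown), and a plain dictionary stands in for the historical database; the field names, server and workload names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, Iterable, Tuple

# Key of the hypothetical historical store: (day, component, workload).
Key = Tuple[date, str, str]


@dataclass(frozen=True)
class Sample:
    """One raw measurement, already tagged with the workload it belongs to."""
    timestamp: datetime
    component: str      # e.g. a server name
    workload: str       # the application or service that generated the activity
    cpu_seconds: float


def daily_workload_totals(samples: Iterable[Sample]) -> Dict[Key, float]:
    """Aggregate raw samples into CPU seconds per day, component and workload."""
    totals: Dict[Key, float] = defaultdict(float)
    for s in samples:
        totals[(s.timestamp.date(), s.component, s.workload)] += s.cpu_seconds
    return dict(totals)


def workload_shares(totals: Dict[Key, float], day: date, component: str) -> Dict[str, float]:
    """Break one component's activity on one day down by workload (fractions sum to 1)."""
    per_workload = {w: v for (d, c, w), v in totals.items() if d == day and c == component}
    grand_total = sum(per_workload.values())
    if grand_total == 0:
        return {}
    return {w: v / grand_total for w, v in per_workload.items()}


if __name__ == "__main__":
    samples = [
        Sample(datetime(2012, 8, 20, 9), "srv1", "billing", 30.0),
        Sample(datetime(2012, 8, 20, 10), "srv1", "billing", 10.0),
        Sample(datetime(2012, 8, 20, 11), "srv1", "web", 60.0),
        Sample(datetime(2012, 8, 21, 9), "srv1", "web", 5.0),
    ]
    history = daily_workload_totals(samples)
    print(workload_shares(history, date(2012, 8, 20), "srv1"))
    # {'billing': 0.4, 'web': 0.6}
```

    Keeping the daily totals rather than only the latest readings is what makes trending per workload possible later on.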

    Combining trending capabilities with the notion of distinct workloads brings a new level of proactivity. This is the first stage at which an organization truly automates the prevention and resolution of problems that are now known to recur. It is also where the IT department starts caring about the availability of applications, not just of equipment.

    To move to this stage, you need to see the full spectrum of an application, not just one layer or one tier of a multi-tiered application. This requires a standardized toolset that covers the complete technology stack of a server, as well as the diversity of platforms you encounter when looking across numerous servers.

    To keep it all together, this stage requires a stricter approach to monitoring. If you are stuck with a continuous flood of unrelated alerts and events from a multitude of sources, it's hard to prioritize and optimize your effort. You need to identify the key factors beforehand, then monitor, store and analyze that subset of information. Of course, you may still need to put out fires as they occur. But if you already know where fires are likely to break out and which ones will hurt the most, you can prioritize actions to minimize damage.
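
    To make "identify the key factors beforehand" concrete, here is a minimal sketch that drops alerts not tied to a predefined key factor and orders the rest by impact. The workload names, metrics and weights are hypothetical; in practice they would come from your own analysis of which applications hurt the most when they fail.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Alert:
    source: str      # which tool or host raised the alert
    workload: str    # application or service affected
    metric: str      # what was measured, e.g. "response_time"


# Hypothetical key factors, identified beforehand, each with a relative impact weight.
KEY_FACTORS: Dict[Tuple[str, str], int] = {
    ("billing", "response_time"): 10,
    ("billing", "cpu_util"): 6,
    ("web", "response_time"): 4,
}


def prioritize(alerts: Sequence[Alert],
               key_factors: Dict[Tuple[str, str], int] = KEY_FACTORS) -> List[Alert]:
    """Keep only alerts on pre-identified key factors, highest impact first.

    Alerts outside the key factors are dropped rather than escalated; the sort
    is stable, so alerts with equal weight keep their arrival order.
    """
    relevant = [a for a in alerts if (a.workload, a.metric) in key_factors]
    return sorted(relevant, key=lambda a: key_factors[(a.workload, a.metric)], reverse=True)


if __name__ == "__main__":
    incoming = [
        Alert("agent-7", "web", "response_time"),
        Alert("syslog", "batch", "disk_io"),          # not a key factor: ignored
        Alert("agent-2", "billing", "response_time"),
    ]
    for alert in prioritize(incoming):
        print(alert.workload, alert.metric)
    # billing response_time
    # web response_time
```

    Dropping unlisted alerts outright is a deliberate simplification; a real setup would more likely log them for later analysis than discard them.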

    Find out where you fall in the TeamQuest Capacity Management Maturity Model!
