Bernd Harzog of the Virtualization Practice shared his insight on service optimization and how it affects capacity and performance in virtualized and cloud-based environments. Take a minute to assess his views on how you can optimize your always-on and highly dynamic environment.
Service optimization is all about business agility, i.e., being in a position to respond to the markets, to the users and to the customers. IT has to be extremely agile in the services it delivers to support the business. If IT loses sight of that, it isn’t really doing service optimization. It’s all about being able to deliver the services the business needs in order to be competitive. If IT does that, suddenly the business begins to care about you. Instead of being at odds with line of business managers, IT earns itself greater appreciation, and that translates directly into more people and a bigger budget.
There are many possible metrics available. In fact, IT can get a little lost in the sheer volume of performance and capacity data available to it. But in the world of the software-defined data center and the cloud, there are two metrics that matter the most: response time and latency. Any other metrics are subordinate to those two. If they are watched like a hawk and maintained at an acceptable level, IT is doing its job. Other metrics should be relied upon only as a means of drilling down into a response time or latency issue.
And if I were to highlight one other metric that will help IT align better with the business, it would be revenue per second. If you can set that up and run on it, you immediately give IT the right orientation from a business perspective. IT can then dig into possible bottlenecks that might be inhibiting the bottom line. Do that and line of business managers will realize just how useful you can be.
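As a rough illustration of the revenue-per-second idea, here is a minimal Python sketch. It assumes you can export completed orders with a timestamp and an amount; the `Order` type, the `revenue_per_second` function and the sample figures are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Order:
    """Hypothetical record of one completed order."""
    timestamp: float  # seconds since some reference point
    amount: float     # order value in your currency


def revenue_per_second(orders: Iterable[Order],
                       window_start: float,
                       window_seconds: float) -> float:
    """Average revenue per second over [window_start, window_start + window_seconds)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_end = window_start + window_seconds
    total = sum(o.amount for o in orders
                if window_start <= o.timestamp < window_end)
    return total / window_seconds


if __name__ == "__main__":
    # Made-up sample data: two orders in the first minute, one in the second.
    sample = [Order(0, 30.0), Order(10, 60.0), Order(70, 30.0)]
    print(revenue_per_second(sample, 0, 60))   # first minute
    print(revenue_per_second(sample, 60, 60))  # second minute
```

A drop in this number from one window to the next is the kind of signal that would prompt a drill-down into response time and latency for the apps that take the orders.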
One simple way to create time for yourself is to figure out a way to monitor metrics in order to get out of the trap of constantly being hauled into meetings to address the latest flap. I call them blame-storming meetings. Something is slow, so an executive calls together some IT personnel and yells at them.
Most of the time, the response time and latency metrics will do the trick, boiling down to app response or availability. Yet many performance managers spend their time monitoring resources such as CPU or memory utilization. Those are fine in their own way, but only if you relate them to the bigger picture as shown in app response time and latency. If you talk about CPU utilization in meetings, you are probably speaking a different language from the line of business managers. They care about slowdowns to their apps. They want them sped up. Now! So do your monitoring in such a way that you can have a relevant conversation with the business.
People who use Amazon regularly have come to realize that the right approach is to assume it is going to break at some point and therefore you are going to have related performance problems. A good example is Netflix. Its reliance on Amazon was costing it business. So Netflix had to rewrite its app to withstand the vagaries of Amazon. That led them to develop what they call the Chaos Monkey to randomly shoot things to see that the service level stayed in a desirable range.
Things were well and good for a while until an entire Amazon zone went down. The timing of that event was during the busy Christmas period and the consequences were dire in terms of revenue and customer satisfaction. The response? Netflix then developed a Zone Monkey so they could stay up despite that occurrence.
If everything is a service, the old ways of managing tend to become obsolete. Traditionally, IT would assign project managers to look after things. But the mentality of the project manager is wrong for IT services, which really can’t be managed as projects for the simple reason that they have no definitive start and end. I would go so far as to recommend you fire all your project managers as they have a 180-degree wrong view of managing software. The project manager comes along and he or she is oriented to want the project done. This doesn’t work well.
A better way to manage is the way a software company operates. A software “project” is never done. It is an unending series of software releases and bug fixes. It is a completely different mindset. So if you are looking after apps such as online revenue generation or a mission-critical business app, you are dealing with a continually changing item.
There is no escaping the fact that you are interacting with finance and that can be one of your biggest challenges. As you start to run more workloads that matter in the cloud, you are going to be faced with decisions about what to move to Amazon and what to keep in-house or on other services. Obviously, the security side is probably the first argument you will have to address. But if you get beyond that aspect, the correct way to make the decision on cloud services is on a price/performance basis. You have to break it down to a certain number of transactions at a certain response time for a certain cost internally and then figure out the same thing for using Amazon. That may not be easy to do, so you have to put the time in to work it out accurately. If you don’t make the decision in that way, get ready for some unpleasant surprises sooner or later.
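The price/performance comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HostingOption` type, the use of a 95th-percentile response time as the "certain response time", and every number in the example are hypothetical placeholders for figures you would have to measure yourself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostingOption:
    """Hypothetical summary of one place a workload could run."""
    name: str
    monthly_cost: float        # fully loaded cost for the workload
    monthly_transactions: int  # transactions served in that month
    p95_response_ms: float     # measured 95th-percentile response time


def cost_per_transaction(option: HostingOption) -> float:
    """Cost of a single transaction for this option."""
    return option.monthly_cost / option.monthly_transactions


def cheapest_meeting_target(options: List[HostingOption],
                            max_response_ms: float) -> Optional[HostingOption]:
    """Pick the lowest cost per transaction among options that meet the response-time target."""
    eligible = [o for o in options if o.p95_response_ms <= max_response_ms]
    if not eligible:
        return None
    return min(eligible, key=cost_per_transaction)


if __name__ == "__main__":
    # Made-up figures purely for illustration.
    in_house = HostingOption("in-house", 20000.0, 10_000_000, 180.0)
    cloud = HostingOption("cloud", 15000.0, 10_000_000, 320.0)
    choice = cheapest_meeting_target([in_house, cloud], max_response_ms=250.0)
    print(choice.name if choice else "no option meets the target")
```

The point of filtering on response time before comparing cost is that a cheaper option that misses the response-time target is not really cheaper for the business.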
Capacity planning and performance management have more relevance than ever. Realize, for example, that VMware vCenter Operations and vCloud Automation Center (vCAC) are not well integrated yet. That means that workloads can get ordered from a service catalog in vCAC and there is no visibility into the impacts upon capacity and performance until those workloads are automatically provisioned by vCAC.
Virtualization and Cloud Computing therefore both create the need for better capacity planning. Virtualization makes it easier to create servers, which accelerates the rate at which capacity is consumed. Cloud Computing and cloud management solutions like vCAC cause workloads to get provisioned in a fully automated manner, which accelerates the use of capacity to an even greater degree.
It is also critical not to overlook the relationship between capacity and performance. You can think you are fine when you look at overall CPU and memory utilization metrics, but you could in fact be causing issues with response time by creating periodic bottlenecks. This all goes back to using response time and latency as the key indicators of capacity utilization.
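To make the point about averages hiding periodic bottlenecks concrete, here is a small Python sketch. The sample data is invented, and the nearest-rank percentile is just one common way to summarize response time; neither is taken from a specific tool.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(cpu_pct: Sequence[float], response_ms: Sequence[float]) -> Dict[str, float]:
    """Put average CPU utilization next to tail response time for the same period."""
    return {
        "avg_cpu_pct": mean(cpu_pct),
        "p95_response_ms": percentile(response_ms, 95),
        "max_response_ms": max(response_ms),
    }


if __name__ == "__main__":
    # Invented samples: 18 quiet intervals and 2 short saturation bursts.
    cpu = [30] * 18 + [95] * 2
    response = [100] * 18 + [2000] * 2
    print(summarize(cpu, response))
```

With this made-up data the average CPU figure looks comfortable, while the 95th-percentile response time exposes the bursts that users actually feel.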
I saw a report from a company that receives data from a billion mobile devices and tablets. They take this data and analyze it to find patterns. One of the amusing, and perhaps disturbing, trends is that usage spikes during rush-hour drive times. In other words, people are accessing their devices the most as they drive and as they sit at traffic lights.
Bernd Harzog is CEO of APM Experts, an analyst firm and consultancy focused upon: Infrastructure Performance and Capacity Management of Virtualized Systems, Application Performance Management, Transaction Performance Management and End User Experience Management. He can be contacted at firstname.lastname@example.org.