IT metrics: Why the five 9s must go

If your team measures IT uptime using the five 9s approach, it may be time to rethink how you deliver business value to the organization. Here's how we updated our metrics.

Does your organization measure IT uptime using "nines" (99.9%, 99.99%, 99.999%)?

Primarily applied to a specific application or ITIL-defined service (microservices, serverless, APIs, etc.), this common approach counts the number of minutes that the system is fully available and divides this number by minutes within a given period.
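As a rough illustration of that calculation, here is a minimal Python sketch. The function names and the 30-day month are assumptions made for the example, not part of any standard tool.

```python
# Assumption for illustration: a 30-day month.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability_percent(available_minutes, total_minutes):
    """Minutes fully available divided by minutes in the period, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * available_minutes / total_minutes


def downtime_budget_minutes(target_percent, total_minutes):
    """Minutes of downtime a period can absorb while still meeting the target."""
    return total_minutes * (100.0 - target_percent) / 100.0


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target, MINUTES_PER_30_DAY_MONTH)
    print(f"{target}% allows about {budget:.2f} minutes of downtime in a 30-day month")
```

Under these assumptions, three nines leaves roughly 43 minutes of downtime a month, four nines roughly 4.3 minutes, and five nines roughly 26 seconds.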

However, it creates a distorted view of service delivery to your clients, both internal and external. Consider these questions:

  • If you have hundreds or thousands of services that are critical to the business, is it possible to mathematically achieve five nines (99.999 percent) within each service while still impacting business operations on a regular basis?
  • If most of the features of the service are not functional, is it down? Where do you draw the line?
  • If the service is running slow, impacting the customer experience and business operations, is it down?
  • What if some users can access the service but others cannot?
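The arithmetic behind the first question can be sketched in Python. This is a worst-case illustration under stated assumptions: every service exactly uses up its five-nines downtime allowance, no two outages overlap, and the month has 30 days. The function name is hypothetical.

```python
def worst_case_impacted_minutes(service_count, target_percent, total_minutes):
    """Upper bound on minutes in which at least one service is down.

    Assumes each service uses its full downtime allowance and that
    outages never overlap, so the per-service allowances simply add up.
    """
    per_service = total_minutes * (100.0 - target_percent) / 100.0
    # The total can never exceed the length of the period itself.
    return min(total_minutes, service_count * per_service)


minutes_in_month = 30 * 24 * 60  # assumed 30-day month
for count in (100, 1000):
    impacted = worst_case_impacted_minutes(count, 99.999, minutes_in_month)
    print(f"{count} services at 99.999%: up to {impacted:.0f} impacted minutes")
```

With 1,000 services, each one can report five nines while the business still sees up to about 432 minutes (more than seven hours) a month in which something critical is down.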

The risks of shooting for 99.999 percent uptime

Even the process of calculating the nines presents its own questions:

  • Do you exclude scheduled or emergency maintenance?
  • If you have 10 sites on a wide area network (WAN), do you count a single site’s downtime?
  • Do you count time only within the business hours of a particular area?
  • What about daily maintenance, or downtime caused by third-party solutions?

This is just a glimpse of the complexities of uptime performance. The bottom line? None of the above is relevant to delivering value to the business. In fact, if you rely on this technique, you face the following risks:

  • Increasing operational costs by having chronically slow systems
  • Damaging the company’s reputation with customers by taking downtime at the worst possible time
  • Becoming blind to the underperformance of your associates
  • Creating stress in your relationship with operational executives and the C-Suite as you try to explain why your numbers are strong even as the business suffers service outages

But most importantly, you’ll miss out on the greatest discussion you can ever have with your operational peers: Talking to business leaders about what uptime really means to them.

All this leads to the key question: How should you measure?

IT metrics: How we updated ours

Here’s the approach that we evolved with our site reliability, L1/L2 support, and operations teams:

1. Count all the minutes that affect business performance. These are “impacted” minutes.

2. Not all incidents are the same, so it’s important to agree on definitions. Here’s what we decided:

  • If a system is 100 percent unavailable, this is counted as a Global Service Outage. This complies with ITIL and has the added benefit of complying with clients’ contractual SLAs (this is the “nines” approach).
  • If the service is available but unable to support the designed load (users/call/volume), it is counted as partial.
  • If the system has any reduction in features or performance, we consider it degraded.

3. Count all impacted minutes (global, partial, and degraded) against the total number of minutes in a month. This can be an arduous task, but the share of minutes that were not impacted represents the percentage of time that the business is receiving the full benefit of your IT services.

4. Meet with business leadership regularly (we do this weekly) to discuss the numbers and the impact of service interruptions on the business. Gain alignment with your partners and discuss ways to improve the numbers together.

5. Track instances in which your monitoring leads to action that avoids impacted time. (We refer to these as “mitigated events.”)

6. Count the minutes that high-availability services were not fully redundant. These are not considered impacted minutes, but they will highlight how often you used this valuable feature and whether you have an unusual amount of non-redundant time.
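The counting in steps 1 through 3, 5, and 6 can be sketched in Python. This is a minimal illustration rather than a description of any particular tooling: the Incident record, the monthly_report function, the category labels, and the sample numbers are all hypothetical, and the sketch assumes that incidents do not overlap in time and that the month has 30 days.

```python
from dataclasses import dataclass

# Category labels follow the three definitions in step 2.
CATEGORIES = ("global", "partial", "degraded")


@dataclass(frozen=True)
class Incident:
    category: str  # one of CATEGORIES
    minutes: int   # impacted minutes attributed to this incident


def monthly_report(incidents, total_minutes, mitigated_events=0,
                   non_redundant_minutes=0):
    """Summarize impacted minutes for one month (assumes no overlapping incidents)."""
    by_category = {name: 0 for name in CATEGORIES}
    for incident in incidents:
        if incident.category not in by_category:
            raise ValueError(f"unknown category: {incident.category}")
        by_category[incident.category] += incident.minutes

    impacted = sum(by_category.values())
    return {
        "by_category": by_category,
        "impacted_minutes": impacted,
        "impacted_percent": 100 * impacted / total_minutes,
        "full_benefit_percent": 100 * (total_minutes - impacted) / total_minutes,
        # The traditional "nines" figure counts only global outages.
        "nines_percent": 100 * (total_minutes - by_category["global"]) / total_minutes,
        # Tracked alongside impacted minutes, never added to them.
        "mitigated_events": mitigated_events,
        "non_redundant_minutes": non_redundant_minutes,
    }


# Made-up sample data for a 30-day month.
sample = [Incident("global", 12), Incident("partial", 45), Incident("degraded", 159)]
report = monthly_report(sample, total_minutes=30 * 24 * 60,
                        mitigated_events=3, non_redundant_minutes=240)
for key, value in report.items():
    print(key, value)
```

In this made-up month the traditional figure stays above 99.97 percent, while the business received the full benefit of the service only 99.5 percent of the time. That gap is what the weekly conversation in step 4 is about.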

The approach doesn’t eliminate the need for effective architectural design to track and resolve problems, nor does it reduce your monitoring requirements. It does allow you to baseline your performance and understand how it actually impacts the business. It also creates a deep bond with the senior leadership team (SLT) and operational leadership and makes you a business leader as well as an IT executive.

When you realign the metrics, each company’s numbers will be different. The number itself is less important than establishing a starting point. The goal is to improve on that number consistently over time. If you have the same amount of total impacted time but fewer global minutes, that is a win. If your partial minutes decrease but your degraded minutes increase, you are improving your service to the company and learning the “puts and takes” of service delivery. Driving down the total impacted minutes is the primary goal, but progress is progress. Celebrate mitigated events – they are the best news of all.

Improving partnerships with business leaders

In your discussions with leadership, you will likely disagree on some events. For example, we had an interesting discussion about how a partial outage involving 10 agents on a Sunday at 2 a.m. can be more damaging than one impacting 100 agents at 11 a.m. on a Monday. Because our network supports 4,000 agents, having 10 disabled agents does not seem like a high-impact event. But if only 250 agents are active at 2 a.m., the business result is much different.
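The arithmetic behind that discussion can be sketched in Python. The function name is hypothetical, and the sketch assumes that all 4,000 agents are active at 11 a.m. on a Monday, which the example above does not state.

```python
def share_of_active_agents(affected_agents, active_agents):
    """Percentage of the agents working at that moment who are affected."""
    if active_agents <= 0:
        raise ValueError("active_agents must be positive")
    return 100 * affected_agents / active_agents


# Sunday, 2 a.m.: 10 agents affected out of 250 active.
sunday = share_of_active_agents(10, 250)
# Monday, 11 a.m.: 100 agents affected out of an assumed 4,000 active.
monday = share_of_active_agents(100, 4000)
# The same 10 agents measured against total network capacity instead.
against_capacity = share_of_active_agents(10, 4000)

print(f"Sunday 2 a.m.: {sunday}% of active agents")
print(f"Monday 11 a.m.: {monday}% of active agents")
print(f"10 agents against full capacity: {against_capacity}%")
```

Measured against the agents actually on shift, the smaller Sunday outage touches a larger share of the operation (4 percent) than the Monday one (2.5 percent), even though it looks negligible (0.25 percent) against total capacity.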

Always set high standards for yourself and your team – that’s the best way to learn about the business and drive the most impactful results.

This change in approach and process allows you to build on the operational relationship and foster a more effective partnership. Our business leaders have become a critical factor in completing our “blameless postmortems”: They serve as our business impact experts and help us discuss the pros and cons of any investments or process modifications.

Use this as an opportunity to reset your relationship with business leaders. Partnering with them to realign how you mature system reliability will lead to more effective communications in all areas.

Robert (Bob) Sullivan leads Agero’s Technology Shared Services organization. This includes the Cloud Operations, Data Networking, Site Reliability, Infrastructure Security, End User Services, IT Procurement, and Telecommunications teams.