About a year and a half ago, I assumed responsibility for Adobe’s cloud operations for our Creative Cloud business and Cloud Technology platform. Working to achieve operational excellence has been at the forefront of my expanded role.
As more and more services move to the cloud to benefit from agility and scale, operational excellence becomes a key differentiator for cloud-based businesses. Customer-centric attributes including continuous uptime, reliability, and ease of use are all table stakes in the cloud service market.
To be competitive, enterprises must raise their reliability and scalability to reach 99.99 percent service levels, a goal that is not easily attained.
Contributing to the challenge are the growing complexity of, and critical interdependencies between, microservices and platforms. Each solution and service is part of a larger ecosystem. If a microservice breaks or suffers an outage, the failure ripples outward and affects service availability. This applies not only to the enterprise’s own business, but also to its customers’ businesses.
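To make the 99.99 percent target and the ripple effect concrete, here is a minimal sketch of the arithmetic. The function names are illustrative, and the calculation assumes that dependencies fail independently, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities: list[float]) -> float:
    """Availability of a request path that needs every dependency to be up.

    Assumption: failures are independent of one another.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    print(f"99.9%  -> {downtime_budget_minutes(0.999):.1f} min/year")
    print(f"99.99% -> {downtime_budget_minutes(0.9999):.1f} min/year")
    chain = serial_availability([0.9999] * 5)
    print(f"five 99.99% services in series -> {chain:.4%}")
```

Four nines leaves roughly 52.6 minutes of downtime a year. Under the independence assumption, a request that passes through five services, each at four nines, sees only about 99.95 percent availability, which is why interdependencies make the target harder to hit.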
To improve the experience our customers receive, we must evolve our strategies for operational excellence across the organization: from engineering teams in how they develop code and deploy products, to product management teams thinking through enhanced features, to IT teams running infrastructure and applications. We believe that everyone should understand deeply what it means to be operationally excellent and how their work enables stand-out customer experiences.
We have identified four areas that are key to achieving operational excellence: operating at scale, anticipating failure, unleashing automation, and embracing the culture of DevOps.
1. Operating at scale
Operating at scale isn’t just about coding and launching a product; it’s about applying engineering principles so that the product can scale. You need to think about scale up front as you design and release, and build observability, recovery, and learning into the code from the start.
We need to strive for four-nines availability at scale, but that doesn’t mean a service needs to be four-nines from the beginning. For instance, you may be trying something new that you want to market-test to see how people respond.
When you start to engineer, though, you need to think about what the service will look like at scale and put the necessary design principles in place. As your services and products run, statistics will help you observe and learn how people are using the product. When you reach that point, you can scale with traffic and customer interest, all the way up to that 99.99 percent.
I like to remind my group that you can’t reboot your way out of a problem. If you reboot one service, you don’t know what ripple effect it has, and you are not getting to the root cause of the issue. That’s why it’s important to understand the key performance characteristics, then look for deviations and be able to plan, monitor, and adjust programmatically. That’s also why you need to adopt an engineering mindset up front, because performance characteristics should be built into the product from the start.
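As one illustration of looking for deviations programmatically, the sketch below flags recent measurements that fall far from a baseline. The function name, the sample latencies, and the three-standard-deviation threshold are all assumptions made for the example, not a description of Adobe’s tooling.

```python
from statistics import mean, stdev


def find_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that lie more than
    `threshold` standard deviations away from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(recent)
        if abs(v - mu) / sigma > threshold
    ]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
    recent_ms = [101, 99, 140, 102]
    for index, value in find_deviations(baseline_ms, recent_ms):
        print(f"sample {index} deviates from baseline: {value} ms")
```

A check like this only works if the service already emits the metric, which is the practical reason to build performance characteristics in from the start.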
2. Anticipating failure
Given the interdependencies between services and platforms, I often tell my teams that things will fail, and we will have issues. As teams code, I want that notion in the back of their minds, so they can develop a workaround up front that enables a consistent user experience. Anticipating failure means looking at the landscape where your services will run, understanding your ecosystem, and considering what could possibly go wrong, then engineering around it so that services remain resilient.
The other element of anticipating failure is chaos engineering. You think that certain services or technologies will operate in a certain way, but when you start testing how they operate, it’s a real eye-opener. Chaos engineering tests an assumption, then asks, “If I do this, what happens, and why?” Then, you build those learnings into your code.
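A toy version of that loop might look like the sketch below: state an assumption (“the page still renders when recommendations are down”), inject the failure, and check whether the assumption holds. Every name here is hypothetical, and the dependency is a stand-in function rather than a real service.

```python
import random


class FlakyDependency:
    """Wraps a callable and makes a configurable fraction of calls fail."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected failure")
        return self._func(*args, **kwargs)


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service."""
    return [f"item-for-{user_id}"]


def render_home_page(user_id, recommender):
    """Code under test: should degrade gracefully when the dependency fails."""
    try:
        items = recommender(user_id)
    except ConnectionError:
        items = []  # fall back to an empty list instead of failing the page
    return {"user": user_id, "recommendations": items}


def run_experiment(trials=1000, failure_rate=0.3):
    """Inject failures and count how many pages rendered in degraded mode."""
    recommender = FlakyDependency(fetch_recommendations, failure_rate)
    degraded = 0
    for user_id in range(trials):
        page = render_home_page(user_id, recommender)
        assert page["user"] == user_id  # the page must always render
        if not page["recommendations"]:
            degraded += 1
    return degraded


if __name__ == "__main__":
    print(f"{run_experiment()} of 1000 pages rendered in degraded mode")
```

If the assertion inside `run_experiment` ever fails, the assumption was wrong, and that finding is what gets built back into the code.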
Anticipating failure is also about end-to-end observability. In many cases, you won’t have control over the whole end-to-end experience, so you need to think about instrumentation and observability, and log the right metrics so that you get the right responses.
3. Unleashing automation
Automation, driven by advancements in AI and machine learning, has emerged as a key strategy for operating at scale and anticipating failure.
When problems inevitably arise, automation can help IT teams identify and fix issues with speed and precision, allowing humans to focus on more creative problem-solving.
At Adobe, we are enabling event-driven automation by applying AI/ML to remediate issues and outages across our systems without human intervention. Our self-healing platform identifies patterns and learns how to solve problems based on past experiences. Failure remediations that took someone 30 minutes to perform manually now take under 3 minutes with automated self-healing.
To operate at scale, you can’t simply throw more people and budget at every challenge. Automation strategies play a key role in helping you scale and maintain uptime and reliability as the business grows. Similarly, automation plays a key role in anticipating issues or failures and addressing them before they impact a service or experience.
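The event-driven part of that idea can be sketched in a few lines. This is not Adobe’s platform: it swaps the AI/ML pattern learning for a fixed lookup table of known fixes, and the event types, fields, and handler names are all invented for illustration.

```python
# Registry mapping an event type to the function that remediates it.
REMEDIATIONS = {}


def remediation(event_type):
    """Decorator that registers a handler for one event type."""
    def register(func):
        REMEDIATIONS[event_type] = func
        return func
    return register


@remediation("disk_full")
def rotate_logs(event):
    # A real handler would call out to infrastructure tooling here.
    return f"rotated logs on {event['host']}"


@remediation("certificate_expiring")
def renew_certificate(event):
    return f"renewed certificate for {event['host']}"


def handle_event(event):
    """Remediate automatically when a fix is known; otherwise escalate."""
    handler = REMEDIATIONS.get(event["type"])
    if handler is None:
        return ("escalate", f"no known remediation for {event['type']}")
    return ("auto", handler(event))


if __name__ == "__main__":
    print(handle_event({"type": "disk_full", "host": "web-1"}))
    print(handle_event({"type": "unknown_latency_spike", "host": "api-2"}))
```

Known problems are fixed without a person in the loop, while anything unrecognized is escalated, which leaves humans free for the problems that still need creative thinking.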
4. Embracing DevOps culture
Some people think DevOps is an organizational structure, but it’s really a way of working and thinking in which you have freedom from your organizational silo. You’re not part of a DevOps team; you’re working within a culture of DevOps and bringing various disciplines to the table.
In the past, operations would come in and put out fires. That’s hard work, of course, but that’s not the culture you want for the long term. With DevOps, you take perspectives from all corners – from the developer, operations, and security – and bring them together. Long-term, you want people who are working on fire-suppression systems. DevOps is about working across all different disciplines to bring a successful service to market.
Embracing DevOps also means thinking about standards and best practices, which lend themselves to a common language spoken across your culture. You don’t have an ‘operations’ standard or an ‘engineering’ standard: You have one standard for the organization that brings that collaboration and mindset together.
Achieving operational excellence is about evolving your strategies and the way you think about working. It’s a collaborative and proactive approach that involves multiple disciplines including engineering, product management, operations, and IT. And it’s achieved by thinking about how you operate at scale, how you anticipate failure, how you unleash automation, and how you embrace the culture of DevOps.