DevOps lessons: 4 aspects of healthy experiments

You need to encourage experiments – without people living in fear of the blame game. Four factors will prove crucial: scope, approach, workflow, and incentives.

Fast iteration is all the rage. And it’s not just DevOps and software.

It’s even made its way into distilling bourbon. When Bryan Davis of Lost Spirits distillery talked about accelerated bourbon aging in a recent Gastropod podcast, I expected the win would be about reducing costs; it’s expensive to keep bourbon aging for a decade or more! But no. It’s more about tweaking variables over the course of days, rather than years. “I would never have been able to build up the business to the point where I could take all the failed batches and throw them away. And so what the technology really did for me was make it possible for me to compete,” Davis says.

Experimentation has a flip side, though: Failure.

Now that’s a word with a negative vibe. Among engineering and construction projects, it conjures up the Titanic sinking, the Tacoma Narrows bridge twisting in the wind, or the space shuttle Challenger exploding. These were all failures of engineering design or management.

Most failures in the pure software realm don’t lead to the same visceral imagery as the above, but they can have widespread financial and human costs all the same. Think of the failed Healthcare.gov launch, data breaches too numerous to list, or really any number of multimillion-dollar projects that basically didn’t work in the end. A single big software project failure can rack up costs in excess of $1 billion.

In cases like these, playing the blame game is customary. Even when most of those involved don’t literally go down with the ship – as in the case of the Titanic – people get fired, careers get curtailed, and the Internet has a field day with both the individuals and the organizations.

But how do we square that with the frequent admonition to embrace failure in DevOps? If we should embrace failure, how can we punish it?

Not all failure is created equal 

Not all failures are the same. Success depends on understanding the different types of failure and structuring the environment and processes to minimize the bad kinds. The goal is to “fail well,” as Megan McArdle writes in her book, “The Up Side of Down.”

In that book, McArdle describes the Marshmallow Challenge, an experiment originally concocted by Peter Skillman, the former VP of design at Palm. In this challenge, groups receive 20 sticks of spaghetti, one yard of tape, one yard of string, and one marshmallow. Their objective is to build a structure that gets the marshmallow off the ground, as high as possible.

Skillman conducted his experiment with all sorts of participants, from business school students to engineers to kindergartners. The business school students did worst. I’m a former business school student, and this does not surprise me. According to Skillman, they spent too much time arguing about who was going to be the CEO of Spaghetti, Inc. The engineers did well, but also did not come out on top. As someone who also has an engineering degree and has participated in similar exercises, I suspect that they spent too much time arguing over the optimal structural design approach using a front-loaded waterfall software development methodology writ small.

By contrast, the kindergartners didn’t sit around talking about the problem. They just started building to determine what works and what doesn’t. And they did the best.

Change the nature of accountability

Setting up a system and environment that allows and encourages such experiments enables successful failure in agile software development. It doesn’t mean that no one is accountable for failures. In fact, it makes accountability easier because “being accountable” needn’t equate to “having caused some disaster.” In this respect, it changes the nature of accountability.

We should consider four principles when we think about such a system: scope, approach, workflow, and incentives.

1. The right scope: This is about constraining the impact of failure and stopping the cascading of additional failures. This is central to encouraging experimentation because it minimizes the effect of a failure. (And if you don’t have failures, you’re not experimenting.) In general, you want to decouple activities and decisions from each other. From a DevOps perspective, this means making deployments incremental, frequent, and routine events – in part by deploying small, autonomous, bounded-context services (such as microservices or similar patterns).

2. The right approach: Here, you're continuously experimenting, iterating, and improving. This gets back to the Toyota Production System’s kaizen (continuous improvement) and other manufacturing antecedents. The most effective processes have continuous communication – think scrums and kanban – and allow for collaboration that can identify failures before they happen. At the same time, when failures do occur, the process allows for feedback to continuously improve and cultivate ongoing learning.

3. The right workflow: Automate repeated tasks for consistency, and thereby reduce the number of failures attributable to inevitable casual mistakes such as a mistyped command. This allows for a greater focus on design errors and other systematic causes of failure. In DevOps, much of this takes the form of a CI/CD workflow that uses monitoring, feedback loops, and automated test suites to catch failures as early in the process as possible.

4. The right incentives: Align rewards and behavior with desirable outcomes. Incentives (such as advancement, money, recognition) need to reward trust, cooperation, and innovation. The key is that individuals have control over their own success. This is probably a good place to point out that failure is not always a positive outcome. Actions still have consequences, especially when failure results from repeatedly ignoring established processes and design rules.
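
To make the scope and workflow principles a little more concrete, here is a minimal Python sketch of a staged rollout: a change is exposed to a growing share of traffic, and an automated check halts it at the first unhealthy stage. It is an illustration under assumptions, not a description of any particular deployment tool. The names staged_rollout, RolloutResult, error_rate_at, and fake_error_rate are hypothetical, and the error rates are simulated; in practice the check would be a query against your own monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RolloutResult:
    completed: bool          # True if every stage passed its health check
    last_good_percent: int   # largest traffic share that passed (0 if none)
    log: List[str]


def staged_rollout(
    stages: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> RolloutResult:
    """Expose a change to growing shares of traffic; stop at the first bad stage.

    stages: traffic percentages to try in order, e.g. [1, 10, 50, 100].
    error_rate_at: hypothetical hook that returns the observed error rate
        at a given traffic percentage (assumed to come from monitoring).
    """
    log: List[str] = []
    last_good = 0
    for percent in stages:
        rate = error_rate_at(percent)
        if rate > max_error_rate:
            # Small scope: only this stage's share of traffic saw the failure.
            log.append(
                f"{percent}%: error rate {rate:.3f} too high, "
                f"rolling back to {last_good}%"
            )
            return RolloutResult(False, last_good, log)
        log.append(f"{percent}%: error rate {rate:.3f} ok")
        last_good = percent
    return RolloutResult(True, last_good, log)


# Simulated monitoring data (an assumption for this example):
# the change misbehaves once it receives half of the traffic.
def fake_error_rate(percent: int) -> float:
    return 0.002 if percent < 50 else 0.08


result = staged_rollout([1, 10, 50, 100], fake_error_rate)
for line in result.log:
    print(line)
```

With the simulated data above, the rollout passes at 1% and 10%, stops at 50%, and reports 10% as the last good stage. The experiment failed, but it failed small, and the log records what happened without anyone having to reconstruct it after the fact.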

The culture challenge 

I said there were four principles. But actually, there are five. A healthy culture is a prerequisite for both successful DevOps projects and successful open source projects and communities. In addition to being a source of innovative tooling, open source often serves as a great model for the iterative development, open collaboration, and transparent communities that DevOps requires to succeed.

The right culture is, at least in part, about building organizations and systems that allow for failing well – and thereby make accountability within that framework a positive attribute rather than part of a blame game. This requires transparency. It also requires an understanding that even good decisions can have bad outcomes. A technology doesn’t develop as expected. The market shifts. An architectural approach turns out not to scale. Stuff happens. Innovation is inherently risky. Cut your losses and move on, avoiding the sunk cost fallacy.

One of the key transformational elements is developing trust among developers, operations, IT management, and business owners through openness and accountability.

Ultimately, DevOps becomes most effective when its principles pervade an organization rather than being limited to developer and IT operations roles. This includes putting the incentives in place to encourage experimentation and (fast) failure, transparency in decision-making, and reward systems that encourage trust and cooperation. The rich communication flows that characterize many distributed open source projects are likewise important to both DevOps initiatives and modern organizations more broadly.

Shifting culture is always challenging and often needs to be an evolution. For example, Target CIO Mike McNamara noted in a 2017 interview that “What you come up against is: ‘My area can’t be agile because…’ It’s a natural resistance to change – and in some mission-critical areas, the concerns are warranted. So in those areas, we started developing releases in an agile manner but still released in a controlled environment. As teams got more comfortable with the process and the tools that support continuous integration and continuous deployment, they just naturally started becoming more and more agile.”

It’s tempting to say that getting the cultural aspects right is the main thing you have to nail in both open source projects and in DevOps. But that’s too narrow, really. Culture is a broader story in IT and elsewhere. For all we talk about technology, that is in some respects the easy part. It’s the people who are hard.

Writing in The Open Organization Guide to IT Culture Change, Red Hat CIO Mike Kelly observes how “This shift to open principles and practices creates an unprecedented challenge for IT leaders. As their teams become more inclusive and collaborative, leaders must shift their strategies and tactics to harness the energy this new style of work generates. They need to perfect their methods for drawing multiple parties into dialog and ensuring everyone feels heard. And they need to hone their abilities to connect the work their teams are doing to their organization’s values, aims, and goals — to make sure everyone in the department understands that they’re part of something bigger than themselves (and their individual egos).”

This article was adapted in part from How Open Source Ate Software by Gordon Haff (Apress 2018).

Gordon Haff is Technology Evangelist at Red Hat where he works on product strategy, writes about trends and technologies, and is a frequent speaker at customer and industry events on topics including DevOps, IoT, cloud computing, containers, and next-generation application architectures.