Summary of all State of DevOps Reports since 2013

It’s not that easy to find all the annual state of DevOps reports, partly because they forked in 2017 between Puppet and Google/DORA. Below I’ve listed each report by year, and I’m in the process of listing all the key findings from each report. Some reports provide greater insights than others.

The first report was in 2013, and showed quite clearly that adopting DevOps practices resulted in technological and business improvements. Along the way, Puppet and Google / DORA joined forces, parted ways, and now (as of writing in 2021) there are two State of DevOps Reports, and the focus has broadened to SRE, Organisational Culture, Security, and even Documentation.

2013 – Puppet:

  1. Respondents from organisations that implemented DevOps reported improved software deployment quality and more frequent software releases.
  2. DevOps enables high performance by increasing agility and reliability. High performing organisations ship code 30x faster and complete those deployments 8,000 times faster than their peers. They also have 50% fewer failures and restore service 12 times faster than their peers.
  3. Organisations that have implemented DevOps practices are up to five times more likely to be high-performing than those that have not. In fact, the longer organisations have been using DevOps practices, the better their performance: The best are getting better.

2014 – Puppet and DORA:

  1. Strong IT performance is a competitive advantage. Firms with high-performing IT organisations were twice as likely to exceed their profitability, market share and productivity goals.
  2. DevOps practices improve IT performance. IT performance strongly correlates with well-known DevOps practices such as use of version control and continuous delivery.
  3. Organisational culture matters. Organisational culture is one of the strongest predictors of both IT performance and overall performance of the organisation. High-trust organisations encourage good information flow, cross-functional collaboration, shared responsibilities, learning from failures and new ideas; they are also the most likely to perform at a high level.
  4. Job satisfaction is the No. 1 predictor of organisational performance. Job satisfaction includes doing work that’s challenging and meaningful, and being empowered to exercise skills and judgment. Where there is job satisfaction, employees bring the best of themselves to work: their engagement, their creativity and their strongest thinking.

2015 – Puppet and DORA:

  1. High-performing IT organisations deploy 30x more frequently with 200x shorter lead times; they have 60x fewer failures and recover 168x faster. Failures are unavoidable, but how quickly you detect and recover from failure can mean the difference between leading the market and struggling to catch up with the competition.
  2. Lean management and continuous delivery practices create the conditions for delivering value faster, sustainably.  This results in higher quality, shorter cycle times with quicker feedback loops, and lower costs. These practices also contribute to creating a culture of learning and continuous improvement.
  3. High performance is achievable whether your apps are greenfield, brownfield or legacy. As long as systems are architected with testability and deployability in mind, high performance is achievable.
  4. IT managers play a critical role in any DevOps transformation. Managers can do a lot to improve their team’s performance by ensuring work is not wasted and by investing in developing the capabilities of their people.
  5. Diversity matters. Research shows that teams with more women members have higher collective intelligence and achieve better business outcomes.
  6. Deployment pain can tell you a lot about your IT performance. Where code deployments are most painful, you’ll find the poorest IT performance, organisational performance and culture.
  7. Burnout can be prevented, and DevOps can help. Burnout is associated with pathological cultures and unproductive, wasteful work.

2016 – Puppet and DORA:

  1. High-performing organisations are decisively outperforming their lower-performing peers in terms of throughput. High performers deploy 200 times more frequently than low performers, with 2,555 times faster lead times. They also continue to significantly outperform low performers, with 24 times faster recovery times and three times lower change failure rates.
  2. High performers have better employee loyalty, as measured by employee Net Promoter Score (eNPS). Employees in high-performing organisations were 2.2 times more likely to recommend their organisation to a friend as a great place to work, and 1.8 times more likely to recommend their team to a friend as a great working environment. Other studies have shown that this is correlated with better business outcomes.
  3. Improving quality is everyone’s job. High-performing organisations spend 22 percent less time on unplanned work and rework. As a result, they are able to spend 29 percent more time on new work, such as new features or code. They are able to do this because they build quality into each stage of the development process through the use of continuous delivery practices, instead of retrofitting quality at the end of a development cycle.
  4. High performers spend 50 percent less time remediating security issues than low performers. By better integrating information security objectives into daily work, teams achieve higher levels of IT performance and build more secure systems.
  5. Taking an experimental approach to product development can improve your IT and organisational performance. The product development cycle starts long before a developer starts coding. Your product team’s ability to decompose products and features into small batches; provide visibility into the flow of work from idea to production; and gather customer feedback to iterate and improve will predict both IT performance and deployment pain.
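
The eNPS figure in finding 2 is normally derived from a 0 to 10 "how likely are you to recommend" survey question. As a minimal sketch, assuming the conventional Net Promoter banding (9 to 10 are promoters, 0 to 6 are detractors), which the findings above do not spell out, the score could be computed like this. The survey responses are invented for illustration.

```python
def enps(scores):
    """Return the employee Net Promoter Score for a list of 0-10 survey answers.

    Assumes the conventional NPS banding: 9-10 are promoters, 0-6 are
    detractors, and 7-8 are passives. The result ranges from -100 to +100.
    """
    if not scores:
        raise ValueError("at least one survey response is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey responses, for illustration only.
team_scores = [10, 9, 9, 8, 7, 6, 3, 10]
team_enps = enps(team_scores)
```

Passives count towards the total but towards neither group, so a team full of 7s and 8s scores zero rather than something positive.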

2017 – Puppet and DORA:

  1. Transformational leaders share five common characteristics that significantly shape an organisation’s culture and practices, leading to high performance. The characteristics of transformational leadership — vision, inspirational communication, intellectual stimulation, supportive leadership, and personal recognition — are highly correlated with IT performance.
  2. High-performing teams continue to achieve both faster throughput and better stability. The gap between high and low performers narrowed for throughput measures, as low performers reported improved deployment frequency and lead time for changes, compared to last year. However, the low performers reported slower recovery times and higher failure rates. It’s possible that pressure to deploy faster and more often causes lower performers to pay insufficient attention to building in quality.
  3. Automation is a huge boon to organisations. High performers automate significantly more of their configuration management, testing, deployments and change approval processes than other teams. The result is more time for innovation and a faster feedback cycle.
  4. Loosely coupled architectures and teams are the strongest predictor of continuous delivery. If you want to achieve higher IT performance, start shifting to loosely coupled services — services that can be developed and released independently of each other — and loosely coupled teams, which are empowered to make changes.
  5. Lean product management drives higher organisational performance. Lean product management practices help teams ship features that customers actually want, more frequently. This faster delivery cycle lets teams experiment, creating a feedback loop with customers.

2018 – Puppet:

  1. DevOps drives business growth – maintaining a robust software delivery and operability function increases productivity, profitability, and market share.
  2. Cloud technology correlates with business performance – this is enabled by reliable and sustainable cloud infrastructure, utilised via cloud native patterns.
  3. Open source software improves performance – high-performing IT teams are 1.75 times more likely to use open-source applications.
  4. Functional outsourcing can be detrimental to software performance, and Elite Performers are rarely using it.
  5. Technical practices such as monitoring and observability, continuous testing, database change management, and the early integration of security in software development all enable organisational performance.
  6. DORA identified high-performing organisations in a range of profit, not-for-profit, regulated, and non-regulated industries. The industry you’re in doesn’t affect your ability to perform.
  7. Diversity in tech is poor, but improving, and teams with improved diversity demonstrate higher performance than those without.

2018 – DORA  (Accelerate):

  1. SDO (Software Delivery and Operational) performance unlocks competitive advantages. Those include increased profitability, productivity, market share, customer satisfaction, and the ability to achieve organisation and mission goals.
  2. How you implement cloud infrastructure matters. Proper (effective) usage of the public cloud improves software delivery performance and teams that leverage all of cloud computing’s essential characteristics are 23 times more likely to be high performers.
  3. Open source software improves performance. Open source software is 1.75 times more likely to be extensively used by the highest performers.
  4. Outsourcing by function is rarely adopted by elite performers and hurts performance. While outsourcing can save money, low-performing teams are almost 4 times as likely to outsource whole functions such as testing or operations as their highest-performing counterparts.
  5. Key technical practices drive high performance. These include monitoring and observability, continuous testing, database change management, and integrating security earlier in the SDLC.
  6. Industry doesn’t matter when it comes to achieving high performance for software delivery. High performers exist in both non-regulated and highly regulated industries alike.

2019 – Puppet:

  1. Doing DevOps well enables you to do security well.
  2. Integrating security deeply into the software delivery lifecycle makes teams more than twice as confident of their security posture.
  3. Integrating security throughout the software delivery lifecycle leads to positive outcomes.
  4. Security integration is messy, especially in the middle stages of evolution.

2019 – Google:

  1. The industry continues to improve, particularly among the elite performers.
  2. The best strategies for scaling DevOps in organisations focus on structural solutions that build community, including Communities of Practice.
  3. Cloud continues to be a differentiator for elite performers and drives high performance.
  4. To support productivity, organisations can foster a culture of psychological safety and make smart investments in tooling, information search, and reducing technical debt through flexible, extensible, and viewable systems.
  5. Heavyweight change approval processes, such as change approval boards, negatively impact speed and stability. In contrast, having a clearly understood process for changes drives speed and stability, as well as reductions in burnout.

2020 – Puppet:

  1. The industry still has a long way to go and there remain significant areas for improvement across all sectors.
  2. Internal platforms and platform teams are a key enabler of performance, and more organisations are adopting this approach.
  3. Adopting a product approach over a project-oriented one improves performance and facilitates improved adoption of DevOps cultures and practices.
  4. Lean, automated, and people-oriented change management processes improve velocity and performance.

2021 – Puppet:

  1. Organisational dynamics must be considered crucial to transformation.
  2. Cloud-native approaches are critical. It is no good to simply move traditional workloads to the cloud.
  3. Shift security, compliance and change governance left, and include security stakeholders in all stages of value delivery.
  4. Culture change is key, and must be promoted from the very “top” as well as delivered from the “bottom”. Psychological safety is at the core of digital and cultural transformations.

2021 – Accelerate:

  1. The “highest performers” continue to improve the velocity of delivery.
  2. Adoption of SRE practices improves wider organisational performance.
  3. Adoption of cloud technology accelerates software delivery and organisational performance. Multi-cloud adoption is also on the increase.
  4. Secure Software Supply Chains enable teams to deliver secure software quickly, safely and reliably.
  5. Documentation is important to being able to implement technical practices, make changes, and recover from incidents. 
  6. Inclusive and generative team cultures improve resilience and performance.

2022 – Google / DORA:

  1. Generative Cultures are indicators of higher performance.
  2. Less experienced teams who implemented trunk-based development actually show less positive results than teams who do not use trunk-based development.
  3. Healthy, high-performing teams also tend to have good security practices broadly established.
  4. Software delivery performance alone does not predict organisational success. Excellent software delivery combined with high reliability (high DORA Metrics in this case) correlates with organisational success.
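
The delivery measures tracked across these reports (deployment frequency, lead time for changes, change failure rate, and time to restore service) can also be derived from simple deployment records. The reports themselves gather this data through surveys, so the following is only a rough sketch under assumptions: the record format, the function name, and the sample data are all hypothetical, and no performance thresholds are implied.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records:
# (commit time, deploy time, caused a failure?, time service was restored)
deployments = [
    (datetime(2022, 1, 3, 9), datetime(2022, 1, 3, 15), False, None),
    (datetime(2022, 1, 4, 10), datetime(2022, 1, 5, 10), True, datetime(2022, 1, 5, 12)),
    (datetime(2022, 1, 6, 8), datetime(2022, 1, 6, 10), False, None),
    (datetime(2022, 1, 7, 9), datetime(2022, 1, 7, 13), False, None),
]


def four_key_metrics(records, period_days):
    """Compute the four software delivery measures from deployment records."""
    # Lead time for changes: from commit to running in production.
    lead_times = [deployed - committed for committed, deployed, _, _ in records]
    # Deployments that degraded service and needed remediation.
    failures = [(deployed, restored) for _, deployed, failed, restored in records if failed]
    # Time to restore: from the failed deployment to service recovery.
    restore_times = [restored - deployed for deployed, restored in failures if restored]
    return {
        "deploys_per_day": len(records) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(records),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


metrics = four_key_metrics(deployments, period_days=7)
```

Medians are used here rather than means so that a single unusually slow change does not dominate the picture; that is a design choice for this sketch, not something the reports prescribe.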

The Puppet State of DevOps Report 2021 – A Summary

I get a bit confused every year about who is writing the State of DevOps Report, and how that gets decided, and in the past it’s been Puppet, Google, DORA and others, but this year, 2021, it was definitely Puppet.

[Edit: apparently there are two State of DevOps reports now… I’m staying out of that particular argument though!]

The State of DevOps Report each year attempts to synthesise and aggregate the current state of the technology industry across the world with respect to our collective transformation towards delivering value faster and more reliably. Or as Jonathan Smart puts it, “Sooner Safer Happier”. The DevOps shift has been in progress for over a decade now, and whilst DevOps was always really about culture, the most recent reports are now emphasising the importance of culture, progressive leadership, inclusion, and diversity more than ever before.

Last year, in 2020, the core findings of the State of DevOps Report focussed on:

  1. The technology industry in general still had a long way to go and there remained significant areas for improvement across all sectors.
  2. Internal platforms and platform teams are a key enabler of performance, and more organisations were starting to adopt this approach.
  3. Adopting a long-term product approach over a short-term project-oriented one improves performance and facilitates improved adoption of DevOps cultures and practices.
  4. Lean, automated, and people-oriented change management processes improve velocity and performance over traditional gated approaches.

This year (2021), there are a number of key findings building on previous DevOps reports:

1. Well defined and architected Team Topologies improve flow.

Clear organisational dynamics, including well-defined boundaries, responsibilities, and interactions, are critical to achieving a fast flow of value. Whilst last year highlighted the importance of internal platforms, this report emphasises the importance of Conway’s Law, and shows that well defined team structures and interactions, such as platform teams (which scale out the benefits of DevOps transformations across multiple teams), cross-functional value-stream aligned teams, and enabling teams, strongly influence the architecture and performance of the technology they build. Team “Interaction Modes” as seen in the diagram below are also critical to define, in the same way that we would define API specifications.

[Figure: DevOps and Team Topologies]

The book Team Topologies expands upon this concept in great detail, and Matthew and Manuel, the authors, also provide excellent training in order to apply these concepts to your contexts.

Clear team responsibilities

What is also clear from the State of DevOps Report this year, and has been for some time, is that siloing DevOps practices into separate “DevOps teams” is an antipattern in most cases. And there should still be no such thing as a “DevOps Engineer”.

2. Use of cloud technology remains immature in many organisations.

Whilst the majority of organisations are now using cloud technology such as IaaS (infrastructure-as-a-service), most organisations are still using it in ways that are analogous to the ways we used to manage on-premises or datacentre technology. High performers are adopting “cloud-native” technologies and ways of working, including the NIST (National Institute of Standards and Technology) essential characteristics of cloud computing: “on-demand self-service, broad network access, resource pooling, rapid elasticity or expansion, and measured service.” How these are implemented is very context-specific, but includes the principles of platform(s) as a product or service, and high competencies in monitoring and alerting and SRE (Site Reliability Engineering) capabilities, whether as SRE teams, or SRE roles in cross-functional teams.

Cloud native capabilities and devops

3. Security is shifting left.

High performers in the technology space integrate security requirements early in the value chain, bringing security stakeholders into the design and build phases rather than just at the deploy or, even worse, run phases. Traditional “inspection” approaches to security, governance and compliance significantly impact flow and quality, resulting in higher risk and lower reliability. Applying DevOps principles and practices to include change management, security and compliance improves flow, reliability, and performance, and keeps the auditors off your back.

DevSecOps transformation

Whilst some call this DevSecOps, many would simply call it DevOps the way it was always intended to be.

4. DevOps and Digital Transformation must be delivered from the bottom-up, and empowered from the top-down.

Culture is the reflection of what we do, the behaviours we manifest, the practices we perform, the way we interact and what we believe. Culture change is never successfully implemented only from the top-down, and must be driven and engaged with by those expected to actually change their behaviours and practices.

DevOps transformation promotion

Cultural barriers to change include unclear responsibilities (enter Team Topologies), insufficient feedback loops, fear of change and a low prioritisation for fast flow, and most importantly, a lack of psychological safety.

Psychological safety and risk

A lot of these findings, unsurprisingly, echo the findings from Google’s 2013 Project Aristotle, which showed that psychological safety, clarity, dependability, meaning and impact were crucial for high performance in teams.

Extra note on “Legacy” workloads.

The report highlighted the “dragging” effect that legacy workloads can have on flow and change rate, as an effect of their architecture, codebase, or infrastructure, or the fact that nobody in the organisation understands them any longer. Rather than leaving your legacy workloads alone, invest in them “so that they’re no longer an inhibitor of progress”. This could be as simple as virtualisation of physical hardware, or decomposing part of the system and moving certain components to cloud-native platforms such as Kubernetes or OpenShift. Even if you have to do something a bit “ugly” such as creating 18GB containers, it’s still a step forward.

TL;DR

  1. Organisational dynamics must be considered crucial to transformation.
  2. Cloud-native approaches are critical. It is no good to simply move traditional workloads to the cloud.
  3. Shift security, compliance and change governance left, and include security stakeholders in all stages of value delivery.
  4. Culture change is key, and must be promoted from the very “top” as well as delivered from the “bottom”. Psychological safety is at the core of digital and cultural transformations.

If you’re interested in finding out more about DevOps and Digital Transformations, Psychological Safety, or Cloud Native approaches, please get in touch.

Thanks to Nigel Kersten, Kate McCarthy, Michael Stahnke and Caitlyn O’Connell for working on the 2021 State of DevOps Report and providing us with these insights.

View the 2021 Accelerate State of DevOps Report summary here.

Resilience Engineering and DevOps – A Deeper Dive

[Figure: robustness vs resilience]

[This is a work in progress. If you spot an error, or would like to contribute, please get in touch]

The term “Resilience Engineering” is appearing more frequently in the DevOps domain, field of physical safety, and other industries, but there exists some argument about what it really means. That disagreement doesn’t seem to occur in those domains where Resilience Engineering has been more prevalent and applied for some time now, such as healthcare and aviation. Resilience Engineering is an academic field of study and practice in its own right. There is even a Resilience Engineering Association.

Resilience Engineering is a multidisciplinary field associated with safety science, complexity, human factors and associated domains that focuses on understanding how complex adaptive systems cope with, and learn from, surprise.

It addresses human factors, ergonomics, complexity, non-linearity, inter-dependencies, emergence, formal and informal social structures, threats and opportunities. A common refrain in the field of resilience engineering is “there is no root cause”, and blaming incidents on “human error” is also known to be counterproductive, as Sidney Dekker explains so eloquently in “The Field Guide To Understanding Human Error”.

Resilience engineering is “The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.” – Prof Erik Hollnagel

It is the “sustained adaptive capacity” of a system, organisation, or community.

Resilience engineering has the word “engineering” in it, which typically makes us think of machines, structures, or code, and this is maybe a little misleading. Instead, try to think of engineering as the process of response, creation and change.

Systems

Resilience Engineering also refers to “systems”, which might also lead you down a certain mental path of mechanical or digital systems. Widen your concept of systems from software and machines, to organisations, societies, ecosystems, even solar systems. They’re all systems in the broader sense.

Resilience engineering refers in particular to complex systems, and typically, complex systems involve people. Human beings like you and me (I don’t wish to be presumptive but I’m assuming that you’re a human reading this).

Consider Dave Snowden’s Cynefin framework:

[Figure: the Cynefin framework]

Systems in an Obvious state are fairly easy to deal with. There are no unknowns – they’re fixed and repeatable in nature, and the same process achieves the same result each time, so that we humans can use things like Standard Operating Procedures to work with them.

Systems in a Complicated state are large, usually too large for us humans to hold in our heads in their entirety, but are finite and have fixed rules. They possess known unknowns – by which we mean that you can find the answer if you know where to look. A modern motorcar, or a game of chess, are complicated – but possess fixed rules that do not change. With expertise and good practice, such as employed by surgeons or engineers or chess players, we can work with systems in complicated states.

Systems in a Complex state possess unknown-unknowns, and include realms such as battlefields, ecosystems, organisations and teams, or humans themselves. The practice in complex systems is probe, sense, and respond. Complexity resists reductionist attempts at determining cause and effect because the rules are not fixed, therefore the effects of changes can themselves change over time, and even the attempt of measuring or sensing in a complex system can affect the system. When working with complex states, feedback loops that facilitate continuous learning about the changing system are crucial.

Systems in a Chaotic state are impossible to predict. Examples include emergency departments or crisis situations. There are no real rules to speak of, even ones that change. In these cases, acting first is necessary. Communication is rapid, and top-down or broadcast, because there is no time, or indeed any use, for debate.

Resilience

As Erik Hollnagel has said repeatedly since Resilience Engineering began (Hollnagel & Woods, 2006), resilience is about what a system can do — including its capacity: 

  • to anticipate — seeing developing signs of trouble ahead to begin to adapt early and reduce the risk of decompensation 
  • to synchronize —  adjusting how different roles at different levels coordinate their activities to keep pace with tempo of events and reduce the risk of working at cross purposes 
  • to be ready to respond — developing deployable and mobilizable response capabilities in advance of surprises and reduce the risk of brittleness 
  • for proactive learning — learning about brittleness and sources of resilient performance before major collapses or accidents occur by studying how surprises are caught and resolved 

(From Resilience is a Verb by David D. Woods)

 

Capacity | Description
Anticipation | Create foresight about future operating conditions, revise models of risk
Readiness to respond | Maintain deployable reserve resources available to keep pace with demand
Synchronization | Coordinate information flows and actions across the networked system
Proactive learning | Search for brittleness, gaps in understanding, trade-offs, re-prioritisations

Provan et al (2020) build upon Hollnagel’s four aspects of resilience to show that resilient people and organisations must possess a “Readiness to respond”, and state: “This requires employees to have the psychological safety to apply their judgement without fear of repercussion.”

Resilience is therefore something that a system “does”, not “has”.

Systems comprise structures, technology, rules, inputs and outputs, and most importantly, people.

“Resilience is about the creation and sustaining of various conditions that enable systems to adapt to unforeseen events. *People* are the adaptable element of those systems” – John Allspaw (@allspaw) of Adaptive Capacity Labs.

Resilience therefore is about “systems” adapting to unforeseen events, and the adaptability of people is fundamental to resilience engineering.

And if resilience is the potential to anticipate, respond, learn, and change, and people are part of the systems we’re talking about:

We need to talk about people: What makes people resilient?

Psychological safety

Psychological safety is the key fundamental aspect of groups of people (whether that group is a team, organisation, community, or nation) that facilitates performance. It is the belief, within a group, “that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.” – Edmondson, 1999.

Amy Edmondson also talks about the concept of a “Learning organisation” – essentially a complex system operating in a vastly more complex, even chaotic wider environment. In a learning organisation, employees continually create, acquire, and transfer knowledge, helping their company adapt to the unpredictable faster than rivals can. (Garvin et al, 2008)

“A resilient organisation adapts effectively to surprise.” (Lorin Hochstein, Netflix)

In this sense, we can see that a “learning organisation” and a “resilient organisation” are fundamentally the same.

Learning, resilient organisations must possess psychological safety in order to respond to changes and threats. They must also have clear goals, vision, processes and structures. According to Conway’s Law:

“Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.”

In order for both the organisation to respond quickly to change, and for the systems that organisation has built to respond to change, the organisation must be structured in such a way that response to change is as rapid as possible. In context, this will depend significantly on the organisation itself, but fundamentally, smaller, less-tightly coupled, autonomous and expert teams will be able to respond to change faster than large, tightly-bound teams with low autonomy will. Pais and Skelton’s Team Topologies explores this in much more depth.

Engineer the conditions for resilience engineering

“Before you can engineer resilience, you must engineer the conditions in which it is possible to engineer resilience.” – Rein Henrichs (@reinH)

As we’ve seen, an essential component of learning organisations is psychological safety. Psychological safety is a necessary condition (though not sufficient) for the  conditions of resilience to be created and sustained. 

Therefore we must create psychological safety in our teams, our organisations, our human “systems”. Without this, we cannot engineer resilience. 

We create, build, and maintain psychological safety via three core behaviours:

  1. Framing work as a learning problem, not an execution problem. The primary outcome should be knowing how to do it even better next time.
  2. Acknowledging your own fallibility. You might be an expert, but you don’t know everything, and you get things wrong – if you admit it when you do, you allow others to do the same.
  3. Modelling curiosity – ask a lot of questions. This creates a need for voice. When you ask questions, people HAVE to speak up.

Resilience engineering and psychological safety

Psychological safety enables these fundamental aspects of resilience – the sustained adaptive capacity of a team or organisation:

  • Taking risks and making changes that you don’t, or can’t, fully understand the outcomes of. 
  • Admitting when you made a mistake. 
  • Asking for help
  • Contributing new ideas
  • Detailed systemic cause* analysis (The ability to get detailed information about the “messy details” of work)

(*There is never a single root cause)

Let’s go back to that phrase at the start:

Sustained adaptive capacity.

What we’re trying to create is an organisation, a complex system, and sub systems (maybe including all that software we’re building) that possesses a capacity for sustained adaptation.

With DevOps we build systems that respond to demand and scale up and down; we implement redundancy and low dependency to allow for graceful failure; and we identify and react to security threats.

Pretty much all of these only contribute to robustness.

[Figure: robustness vs resilience]

(David Woods, Professor, Integrated Systems Engineering Faculty, Ohio State University)

You may want to think back to the Cynefin model, and think of robustness as being able to deal well with known unknowns (complicated systems), and resilience as being able to deal well with unknown unknowns (complex, even chaotic systems). Technological or DevOps practices that primarily focus on systems, such as microservices, containerisation, autoscaling, or distribution of components, build robustness, not resilience.
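
To make that distinction a little more concrete, here is a minimal sketch of a robustness technique: a fallback for a failure mode that somebody anticipated in advance. The function names, the cache, and the failing service are all hypothetical. The code copes with the known unknown (a dependency being unreachable), but it does nothing for a surprise that nobody wrote a handler for; adapting to that remains with people.

```python
def fetch_with_fallback(fetch, cache, key):
    """Return fresh data if the dependency responds, otherwise the last known value."""
    try:
        value = fetch(key)
    except ConnectionError:
        # Anticipated failure mode: degrade gracefully to cached data.
        return cache.get(key), "stale"
    # Remember the latest good value for the next outage.
    cache[key] = value
    return value, "fresh"


def flaky_service(key):
    # Hypothetical dependency that is currently down.
    raise ConnectionError("pricing service unavailable")


cache = {"price": 42}
result = fetch_with_fallback(flaky_service, cache, "price")
```

Any exception other than `ConnectionError` still propagates, which is the point: the handler only covers what was foreseen.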

However, if we are to build resilience, the sustained adaptive capacity for change, we can utilise DevOps practices for our benefit. Like psychological safety, none of them is sufficient on its own, but they are necessary. Using automation to reduce the cognitive load of people is important: by reducing the extraneous cognitive load, we maximise the germane, problem-solving capability of people. The provision of other tools, internal platforms, and automated testing pipelines, together with increased observability of systems, increases the ability of people and teams to respond to change, and increases their sustained adaptive capacity.

If brittleness is the opposite of resilience, what does “good” resilience look like? The word “anti-fragility” appears to crop up fairly often, due to the book “Antifragile: Things that Gain from Disorder” by Nassim Taleb. What Taleb describes as antifragile, ultimately, is resilience itself.

I have my own views on this, but fundamentally I think this is the danger of academia (as in the field of resilience engineering) restricting access to knowledge. A lot of resilience engineering literature is held behind academic paywalls and journals, which most practitioners do not have access to.  It should be of no huge surprise that people may reject a body of knowledge if they have no access to it.

Observability

It is absolutely crucial to be able to observe what is happening inside the systems. This refers to anything from analysing system logs to identify errors or future problems, to managing Work In Progress (WIP) to highlight bottlenecks in a process.

Too often, engineering and technology organisations look only inward, whilst many of the threats to systems are external to the system and the organisation. Observability must also concern external metrics and qualitative data: what is happening in the marketspace, the economy, and what are our competitors doing?

Resilience Engineering and DevOps

What must we do?

Create psychological safety – this means that people can ask for help, raise issues, highlight potential risks and “apply their judgement without fear of repercussion.” There’s a great piece here on psychological safety and resilience engineering.

Manage cognitive load – so people can focus on the real problems of value – such as responding to unanticipated events.

Apply DevOps practices to technology – use automation, internal platforms and observability, amongst other DevOps practices. 

Increase observability and monitoring – this applies to systems (internal) and the world (external). People and systems cannot respond to a threat if they don’t see it coming.

Embed practices and expertise in component causal analysis – whether you call it a post-mortem, retrospective or debrief, build the habits and expertise to routinely examine the systemic component causes of failure. Try using Rothman’s Causal Pies in your next incident review.

Run “fire drills” and disaster exercises. Make it easier for humans to deal with emergencies and unexpected events by making it a habit. Increase the cognitive capacity available for problem solving in emergencies.

Structure the organisation in a way that facilitates adaptation and change. Consider appropriate team topologies to facilitate adaptability.

In summary

Through facilitating learning, responding, monitoring, and anticipating threats, we can create resilient organisations. DevOps and psychological safety are two important components of resilience engineering, and resilience engineering (in my opinion) is soon going to be seen as a core aspect of organisational (and digital) transformation.

 

References:

Conway, M. E. (1968) How Do Committees Invent? Datamation magazine. F. D. Thompson Publications, Inc. Available at: https://www.melconway.com/Home/Committees_Paper.html

Dekker, S. 2006. The Field Guide to Understanding Human Error. Ashgate Publishing Company, USA.

Edmondson, A., 1999. Psychological safety and learning behavior in work teams. Administrative science quarterly, 44(2), pp.350-383.

Garvin, D., Edmondson, A. and Gino, F., 2008. Is Yours a Learning Organization? Harvard Business Review, 86, pp.109-116, 134.

Hochstein, L. (2019) Resilience engineering: Where do I start? Available at: https://github.com/lorin/resilience-engineering/blob/master/intro.md (Accessed: 17 November 2020).

Hollnagel, E., Woods, D. D. & Leveson, N. C. (2006). Resilience engineering: Concepts and precepts. Aldershot, UK: Ashgate.

Hollnagel, E. Resilience Engineering (2020). Available at: https://erikhollnagel.com/ideas/resilience-engineering.html (Accessed: 17 November 2020).

Provan, D.J., Woods, D.D., Dekker, S.W. and Rae, A.J., 2020. Safety II professionals: how resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, p.106740. Available at https://www.sciencedirect.com/science/article/pii/S0951832018309864

Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.

John Allspaw has collated an excellent book list for essential reading on resilience engineering here.

Remote Working – What Have We Learned From 2020?

Remote working improves productivity.

Even way back in 2014, evidence showed that remote working enables employees to be more productive and take fewer sick days, and saves money for the organisation.  The rabbit is out of the hat: remote working works, and it has obvious benefits.

Source: Forbes Global Workplace Analytics 2020

More and more organisations are adopting remote-first or fully remote practices, such as Zapier:

“It’s a better way to work. It allows us to hire smart people no matter where in the world, and it gives those people hours back in their day to spend with friends and family. We save money on office space and all the hassles that comes with that. A lot of people are more productive in remote setting, though it does require some more discipline too.”

We know, through empirical studies and longitudinal evidence such as Google’s Project Aristotle, that colocation of teams is not a factor in driving performance. Remote teams perform as well as, if not better than, colocated teams, if provided with appropriate tools and leadership.

Teams that are already used to more flexible, lightweight or agile approaches adapt to a high-performing and fully remote model even more easily than traditional teams.

The opportunity to work remotely, more flexibly, and save on time spent commuting helps to improve the lives of people with caring, parenting or other commitments too. Whilst some parents are undoubtedly keen to get into the office and away from the distractions of home schooling, the ability to choose remote and more flexible work patterns is a game changer for some, and many are actually considering refusing to go back to the old ways.

What works for some, doesn’t work for others, and it will change for all of us over time, as our circumstances change. But having that choice is critical.

However, remote working is still (even now in 2020 with the effects of Covid and lockdowns) something that is “allowed” by an organisation and provided to the people that work there as a benefit.

Remote working is now an expectation.

What we are seeing now is that, for employees at least, particularly in technology, design, and other knowledge-economy roles, remote working is no longer a treat or a benefit – just like holiday pay and lunch breaks, it’s an expectation.

Organisations that adopt and encourage remote working are able to recruit across a wider catchment area, unimpeded by geography, though still somewhat limited by timezones – because we also know that synchronous communication is important.

Remote work is also good for the economy, and for equality across geographies. Remote work is closing the wage gap between areas of the US and will likely have the same effect on the North-South divide in the UK. This means London firms can recruit top talent outside the South-East, and people in typically less affluent areas can find well paying work without moving away.

But that view isn’t shared by many organisations.

However, whilst employees increasingly see remote working as an expectation rather than a benefit, many organisations want to bring employees back into the office, where they can see them. That desire may come from pressure from command-and-control managers, difficulties in onboarding, process-oriented HR teams, or simply the most dangerous phrase in the English language: “we’ve always done it this way”.

Indeed, remote working is often seen by the managers of those organisations as an exclusive benefit and an opportunity to slack off. The Taylorist approach to management is still going strong, it appears.

People are adopting remote faster than organisations.

In 1962, Everett Rogers came up with the principle he called “Diffusion of innovation”.

It describes the adoption of new ideas and products over time as a bell curve, and categorises groups of people along its length as innovators, early adopters, early majority, late majority, and laggards. Spawned in the days of rapidly advancing agricultural technology, it was easy (and interesting) to study the adoption of new technologies such as hybrid seeds, equipment and methods.

Some organisations are even suggesting that remote workers could be paid less, since they no longer pay for their commute (in terms of costs and in time), but I believe the converse may become true – that firms who request regular attendance at the office will need to pay more to make up for it. As an employee, how much do you value your free time?

It seems that many people are further along Rogers’ adoption curve than the organisations they work for.

There are benefits of being in the office.

Of course, it’s important to recognise that there are benefits of being colocated in an office environment. Some types of work simply don’t suit remote working. Some people don’t have a suitable home environment to work from. Sometimes people need to work on a physical product or collaborate and use tools and equipment in person. Much of the time, people just want to be in the same room as their colleagues – what Tom Cheesewright calls “The unbeatable bandwidth of being there.”

But is that benefit worth the cost? An average commute is 59 minutes, which totals nearly 40 hours per month, per employee. For a team of twenty people, is 800 hours per month worth the benefit of being colocated? What would you pay to obtain an extra 800 hours of time for your team in a single month?
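As a rough check of those figures, here is a minimal sketch of the arithmetic. It only reaches "nearly 40 hours" on the assumption that the 59 minutes is each way and that there are around 20 working days in a month; neither assumption is stated above, and the names in the code are purely illustrative.

```python
# Rough commute arithmetic. The 59-minute figure and the team of twenty come
# from the text; "each way" and 20 working days per month are assumptions.
COMMUTE_MINUTES_ONE_WAY = 59
TRIPS_PER_DAY = 2             # assumption: there and back
WORKING_DAYS_PER_MONTH = 20   # assumption
TEAM_SIZE = 20


def monthly_commute_hours(minutes_one_way, trips_per_day, working_days):
    """Hours one person spends commuting in a month."""
    return minutes_one_way * trips_per_day * working_days / 60


per_person = monthly_commute_hours(
    COMMUTE_MINUTES_ONE_WAY, TRIPS_PER_DAY, WORKING_DAYS_PER_MONTH
)
team_total = per_person * TEAM_SIZE

print(f"Per person: {per_person:.1f} hours per month")
print(f"Team of {TEAM_SIZE}: {team_total:.0f} hours per month")
```

On these assumptions the figures come out at roughly 39 hours per person and a little under 800 hours for the team; halve them if the 59 minutes covers the whole round trip.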

The question is one of motivation: are we empowering our team members to choose where they want to work and how they best provide value, or are we to revert to the Taylorist principles where “the manager knows best”? In Taylor’s words: “All we want of them is to obey the orders we give them, do what we say, and do it quick.”

We must use this as a learning opportunity.

Whilst 2020 has been a massive challenge for all of us, it’s also taught us a great deal, about change, about people and about the future of work. The worst thing that companies can do is ignore what they have learned about their workforce and how they like to operate. We must not mindlessly drift back to the old ways.

We know that remote working is more productive, but there are many shades of remoteness, and it takes strong leadership, management effort, good tools, and effective, high-cadence communication to really do it well.

There is no need for a binary choice: there is no one-size-fits-all for office-based or remote work. There are infinite operating models available to us, and the best we can do to prepare for the future of work is simply to be endlessly adaptable.

“Root” Cause Analysis using Rothman’s Causal Pies

rothmans causal pies

Context: It sometimes seems to me that in the tech industry we’re guilty of not looking outside our sphere for better practices and new (or even old) ideas. Maybe that’s because we’re often playing with new technologies and innovating in our organisation, or even our field (when we’re not trying to pay down tech debt and keep legacy systems running).

Rothman’s Causal Pies

Whilst studying for my Master’s degree in Global Health, I discovered the concept of “Rothman’s Causal Pies”.

The Epidemiological Triad

Epidemiology is the study of why and how diseases (including non-communicable diseases) occur. As a field, it encompasses the entire realm of human existence, from environmental and biological aspects to heuristics and even economics. It’s a real exercise in Systems Thinking, which is kinda why I love it.

In epidemiology, there is a concept known as the “Epidemiological Triad”, which describes the necessary relationship between agent, host, and environment. When all three are present, the disease can occur. Without one or more of those three factors, the disease cannot occur. It’s a very simplistic but useful model. As we know, all models are wrong, but some are useful.

This concept is useful because through understanding this triad, it’s possible to identify an intervention to reduce the incidence of, or even eradicate, a disease, such as by changing something in the environment (say, by providing clean drinking water) or a vaccination programme (changing something about the host).

What the triad doesn’t provide, however, is a description of the various factors necessary for the disease to occur, and this is especially relevant to non-communicable diseases (NCDs), such as back pain, coronary heart disease, or a mental health problem. In these cases, there may be many different components, or causal factors. Some of these may be “necessary”, whilst some may contribute. There may be many different combinations of causes that result in the disease.

To use heart disease as an example, the component causes, or “risk factors” could include poor diet, little or no exercise, genetic predisposition, smoking, alcohol, and many more. No single component is sufficient to cause the disease, and one (genetic predisposition, for example) may be necessary in all cases.

Rothman, in 1976, came up with a model that demonstrates the multifactorial nature of causation.

Rothman’s Causal Pies

An individual factor that contributes to cause disease is shown as a piece of a pie, like the triangles in the game Trivial Pursuit. After all the pieces of a pie fall into place, the pie is complete, and disease occurs.

The individual factors are called component causes. The complete pie, which is termed a causal pathway, is called a sufficient cause. A disease may have more than one sufficient cause, with each sufficient cause being composed of several component causes that may or may not overlap. A component that appears in every single pie or pathway is called a necessary cause, because without it, disease does not occur. An example of this is the role that genetic factors play in haemophilia in humans – haemophilia will not occur without a specific gene defect, but the gene defect is not believed to be sufficient in isolation to cause the disease.

An example: Note in the image below that component cause A is a necessary cause because it appears in every pie. But this should not mean that it is the “root cause”, because it is not sufficient on its own.
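For readers who think in code, the model can be sketched with sets. This is only an illustration under stated assumptions: the three pies and the letters other than A are hypothetical, chosen so that A appears in every pie as described above, and the function names are my own.

```python
# Each sufficient cause ("pie") is a set of component causes ("slices").
# The pies below are illustrative only; they are not taken from a real incident.
SUFFICIENT_CAUSES = {
    "pie_1": {"A", "B", "C", "D", "E"},
    "pie_2": {"A", "B", "F", "G", "H"},
    "pie_3": {"A", "C", "F", "I", "J"},
}


def necessary_causes(pies):
    """Components that appear in every pie (empty set if there are no pies)."""
    if not pies:
        return set()
    return set.intersection(*pies.values())


def completed_pies(pies, present):
    """Names of the pies whose components are all present."""
    return sorted(name for name, components in pies.items() if components <= present)


print(necessary_causes(SUFFICIENT_CAUSES))                          # {'A'}
print(completed_pies(SUFFICIENT_CAUSES, {"A"}))                     # []
print(completed_pies(SUFFICIENT_CAUSES, {"A", "B", "C", "D", "E"}))  # ['pie_1']
```

A is necessary because it is in every pie, yet with A alone no pie is complete, which is why a necessary cause should not be mistaken for a "root cause".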

Root Cause Analysis

I’m a huge proponent of holding regular retrospectives (for incidents, failures, successes, and simply at regular intervals), but it seems that in technology, particularly when we’re carrying out a Root Cause Analysis due to an incident, there’s a tendency to assume one single “root cause” – the smoking gun that caused the problem.

We may tend towards assuming that once we’ve found this necessary cause, we’re finished. And whilst that’s certainly a useful exercise, it’s important to recognise that there are other component causes and there may be more than one sufficient cause.

The Five Whys model is a great example of this – it fails to probe into other component factors, and only looks for a single root cause. As any resilience engineer will tell you: There is no Single Root Cause.

The 5 whys takes the team down a single linear path, and will certainly find a root cause, but leaves the team blind to other potential component or sufficient causes – and even worse: it leads the team to believe that they’ve identified the problem. In the worst case scenario, a team may identify “human error” as a root cause, which could re-affirm a faulty, overly simplistic world view and not only result in the wrong cause being identified, but also harm the team’s ability to carry out RCAs in the future.

Read more about the flaws in the “five whys” model in John Allspaw’s “Infinite Hows”. Allspaw has recently published another great piece about “root causes” in this blog article.

In reality, we’re dealing with complex, maybe even chaotic states, alongside human interactions. There exist multiple causal factors, some necessary for the “incident” to have occurred, and some simply component causes that together become sufficient – the completed pie!

Take Away: There is usually more than one causal pie.

An improved approach could be to use Ishikawa diagrams, but in my experience, particularly when dealing with complex systems, these diagrams very quickly become visually cluttered and complex, which makes them hard to use. Additionally, because each “fish bone” is treated as a separate pathway, interrelationships between causes may not be identified.

Instead of a complex fishbone diagram, try identifying all the component causes, and visually complete (on a whiteboard for example) all the pies that could (or did) result in the outcome. You almost certainly won’t identify all of them, but that doesn’t matter very much.

If we adopt Rothman’s causal pie model instead of approaches such as the 5 whys or Ishikawa, it provides us with an easy-to-use and easy-to-visualise tool that can model not only “what caused this incident”, but “what factors, if present, could cause this incident to occur again?”.

In order to prevent the incident (the disease, in epidemiological terms), the key factor we’re looking for is the “necessary cause” – component A in the pies diagram. But we’re also looking for the other component causes.

Application: The prevention of future incidents.

Suppose we can’t easily solve component A – maybe it’s a third party system that’s outside our control – but we can control causal components B and C, which between them appear in every causal pie. If we control for those instead, no pie can be completed, so it’s clear that we don’t need to worry about component A anyway!
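That reasoning can be sketched as a small search: given the pies and the components we are actually able to control, find the smallest groups of components that take at least one slice out of every pie. The pies and the function name below are hypothetical, and the brute-force search is only a sketch that suits the handful of components you would put on a whiteboard.

```python
from itertools import combinations

# Illustrative pies only: A is in every pie, but assume we cannot control it.
PIES = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "F", "G", "H"},
    {"A", "C", "F", "I", "J"},
]


def smallest_blocking_sets(pies, controllable):
    """Smallest sets of controllable components that break every pie.

    A pie is broken when at least one of its components is removed.
    Returns an empty list if the controllable components cannot break all pies.
    """
    candidates = sorted(controllable)
    for size in range(1, len(candidates) + 1):
        found = [
            set(combo)
            for combo in combinations(candidates, size)
            if all(set(combo) & pie for pie in pies)
        ]
        if found:
            return found
    return []


# A is outside our control, but B and C are not.
print(smallest_blocking_sets(PIES, {"B", "C"}))  # [{'B', 'C'}] (element order may vary)
```

Neither B nor C is enough alone in this example, but together they break every pie, so the incident cannot recur by any of these pathways even though A is untouched.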

Next time you’re carrying out a Root Cause Analysis or retrospective, try using Rothman’s Causal Pies.

Addendum: “Post-Mortem” exercises.

Even though the term “post-mortem” is ubiquitously used in the technology industry as a descriptor for analysis into root causes, I don’t like it.

Firstly, in the vast majority of tech incidents, nobody died – post-mortem literally means “after death”. It implies that a Very Bad Thing happened, but if we’re trying to hold constructive, open exercises where everyone present possesses enough psychological safety in order to contribute honestly and without fear, we should phrase the exercise in less morbid terms. The incident has already happened – we should treat it as a learning opportunity, not a punitive sounding exercise.

Secondly, we should run these root cause analysis exercises for successes, not just for failures. You don’t learn the secrets of a great marriage by studying divorce. The term “post-mortem” isn’t particularly appropriate for studying the root causes of successes.

 

I should probably highlight something about Safety I vs Safety II approaches here. I’ll add that when I have time!