Adoption vs Adaption in Resilience Engineering and DevOps


DevOps was, and still is to a degree, a “ground-up” phenomenon. It came to be adopted, adapted and evolved by engineering teams before “management” even really understood what it was.

The openness and flexibility espoused by DevOps meant that it could be interpreted in different ways by different teams in different contexts. This was a key strength: unlike rigid frameworks such as ITIL, the people responsible for doing the work were able to modify and apply DevOps to their own work, in the ways that best suited them.

But this loose definition also proved to be a weakness. Because there were no limits to how DevOps could be interpreted and applied, it was often (and still is) interpreted as a technology solution rather than cultural change. This resulted in “DevOps engineers”, or “DevOps teams” whose remit is focussed on cloud technology, CI/CD pipelines, or automation. 

Due to this, we’re still far behind where we could have been as an industry. Despite everyone in technology knowing the term “DevOps” and almost every firm adopting some degree of DevOps practices, these transformations have often stuttered or even failed, in part because it’s unclear to many what DevOps really is and how to “do” DevOps.

Resilience Engineering is a field of applied research that considers organisational-scale capability to anticipate, detect, respond to, and adapt to change. The principle of socio-technicality is core to RE: the premise that you can’t separate people from technology. If you change the technology, it will affect people; and if you change the way people work or communicate, or the way teams are structured, it will impact the technology created or consumed by those people.

RE as a field has been around for almost two decades, but only now (for various reasons, including the Covid pandemic) is it beginning to touch mainstream discussions and to be discussed in the same conversations as digital and organisational transformation.

Researchers and practitioners of RE are quick to clarify what RE is, and is not, during these discussions. Whilst it may seem dogmatic to be so strict about what is within the remit of the field, I think this reflects a valuable lesson learned from one of the weaknesses of DevOps. In order for organisations to successfully adopt and adapt to a new operating model and set of principles such as RE, it’s essential to understand very clearly what it is.

Resilience Engineering, despite being nearly twenty years old as a field, is somewhat embryonic in its adoption outside of a narrow field of specialist researchers and practitioners, and as such, it’s crucial that we define accurately what it is, what it is not, and resist attempts (intentional or unintentional) to co-opt the term to mean something more akin to chaos engineering, automation, or system hardening efforts. 

A balance must be struck between defining accurately what RE is, and tolerating (indeed, encouraging) flexibility of interpretation and adoption in different contexts. DevOps was maybe too loose in this respect; other paradigms such as ITIL or SAFe were maybe too strict and dogmatic. Maybe with Resilience Engineering, the sweet spot will be found.

Digital Transformation and DevOps: Enterprise Resilience


Digital Transformation is having a real moment in industry, in part due to the huge changes as a result of the pandemic of 2020.  But as usual, there’s little agreement about what it means. In contrast to previous “transformations” such as ITIL, Lean, Agile, or DevOps, digital transformation doesn’t simply mean automating processes, becoming more efficient, offering your existing products and services online, creating an app, or shifting your infrastructure to the cloud. Even the annual State of DevOps Reports are beginning to focus more on digital and organisational transformation rather than a specific focus solely on DevOps.

What is digital transformation?

True digital transformation means transforming everything about your organisation in respect to people and technology towards an engaged, agile, happy and high performing organisation. DevOps was (and still is) one key aspect of this approach. The only way to truly achieve organisational resilience or enterprise agility is to fundamentally transform the foundations of the organisation. The list below describes just some of the aspects of digital transformation and the areas to address:

  • Culture, values and behaviours
  • Practices and ways of working
  • Communication structures
  • Hierarchies
  • How financial budgets and funding models are managed
  • How teams and people are measured and incentivised
  • How and what metrics are used
  • Cloud native architectures and practices
  • Moving from projects to products
  • Team structures, topologies and interactions
  • Recruitment and onboarding/offboarding practices
  • Value stream alignment
  • Breaking down silos
  • Embedding the ability to change and adapt
  • Reducing cognitive load
  • Psychological safety in delivery teams, senior leadership teams and functional teams
  • IT services and operational technologies
  • Facilities, colocation, office layouts (especially options for open-plan or not)
  • And many many more – in fact, here is an (incomplete) list of organisational factors relevant to transformation.

Why digital transformation?

What’s your organisational goal? Maybe it’s increasing your speed to market for new products and features, maybe it’s reducing the risk of failure in production and improving reliability, or maybe it’s to keep doing what you’re doing but with less stress and improved flow. If you’re only looking to reduce costs, however, digital transformation is not for you: one of the core requirements for a transformation to succeed is for everyone in the organisation to be psychologically safe, engaged, and behind it, and reducing costs by potentially cutting workforce numbers is not going to create that movement.

What is Enterprise Resilience?

Resilience Engineering is a decades-old field of applied research that focusses on the capacity to anticipate, detect, respond to, and adapt to change. Organisational “robustness” might mean being able to withstand massive disrupting events such as pandemics or competition, but enterprise agility represents the resilience engineering concept of true resilience: not just “coping” with change, but improving from it, ready for future challenges. I believe that Resilience Engineering is the direction that DevOps is evolving into.

Why is digital transformation so complex?

Despite many attempts to simplify the concept of digital transformation, it remains one of the most challenging endeavours we could embark upon.

Galbraith Star model

I’m not a huge fan of over-simplifying organisational complexity into components, especially models such as Galbraith’s Star that place “people” as one of the components (and certainly not models that consider anything other than people to be the primary element). Whilst models such as this may help people compartmentalise the transformation challenge, in almost every case, the fractures between the various components don’t actually exist in the way they’re presented.

Organisations are not simply jigsaw pieces of technology, tools, and people that react and function in predictable ways. As the Cynefin model shows us, systems exist in multiple different states. Complex states, such as the state in which most sociotechnical systems (the organisations we work in) reside, require a probe-sense-respond approach that applies built-in feedback loops to determine what effect the intervention you’re working on is having. Everything in digital transformation is an experiment.


It’s also important to avoid localised optimisation – applying digital transformation approaches to one part of an organisation whilst ignoring other parts will only result in tension, bottlenecks, high friction, and failures elsewhere. We must observe and examine the entire system, even if we cannot change it. Ian Miell discusses in this excellent piece why we must address the money flows in an organisation.

Likewise, changing one small part of a system, especially a complex system, will have unintended and unanticipated effects elsewhere, so a complete, holistic view of the entire organisation is critical.

Digital transformation is a series of experiments

This is why, if anyone suggests that there is a detailed “roadmap”, or even worse, a Gantt chart, for a digital transformation project, at best it’s naive and at worst, it’s fiction. Any digital transformation process must be made not of a fixed plan, but of a series of experiments that allow for iterative improvements to be made.

Digital transformation - everything is an experiment

When you think about digital transformation in this way, it also becomes clear why it will never be “finished”. Organisations, like the people they consist of, constantly change and evolve, just like the world we operate in, so whilst digital transformation is undoubtedly of huge value and an effective approach to organisational change, you will never, ever, be “done”.

In my role as Transformation Lead at Red Hat Open Innovation Labs, we use the Mobius Loop approach to provide structure to this experimental, feedback-based, continuous improvement and transformation journey.  If you’re interested in digital transformation, DevOps, Psychological Safety and how you can begin to set transformation in motion in your own organisation, get in touch.

 

Resilience Engineering, DevOps, and Psychological Safety – resources

With thanks to Liam Gulliver and the folks at DevOps Notts, I gave a talk recently on Resilience Engineering, DevOps, and Psychological Safety.

It’s pretty content-rich, and here are all the resources I referenced in the talk, along with the talk itself, and the slide deck. Please get in touch if you would like to discuss anything mentioned, or you have a meetup or conference that you’d like me to contribute to!

Here’s a psychological safety practice playbook for teams and people.

Open Practice Library

https://openpracticelibrary.com/

Resilience Engineering and DevOps slide deck  

https://docs.google.com/presentation/d/1VrGl8WkmLn_gZzHGKowQRonT_V2nqTsAZbVbBP_5bmU/edit?usp=sharing

Resilience engineering: Where do I start?

Turn the Ship Around! by L. David Marquet

Lorin Hochstein and Resilience Engineering fundamentals 

https://github.com/lorin/resilience-engineering/blob/master/intro.md

 

Scott Sagan, The Limits of Safety:
“The Limits of Safety: Organizations, Accidents, and Nuclear Weapons”, Scott D. Sagan, Princeton University Press, 1993.

 

Sidney Dekker: “The Field Guide To Understanding Human Error: Sidney Dekker, 2014

 

John Allspaw: “Resilience Engineering: The What and How”, DevOpsDays 2019.

https://devopsdays.org/events/2019-washington-dc/program/john-allspaw/

 

Erik Hollnagel: Resilience Engineering 

https://erikhollnagel.com/ideas/resilience-engineering.html

 

Cynefin


 

Jabe Bloom, The Three Economies

The Three Economies an Introduction

 

Resilience vs Efficiency

Efficiency vs. Resiliency: Who Won The Bout?

 

Tarcisio Abreu Saurin – Resilience requires Slack

Slack: a key enabler of resilient performance

 

Resilience Engineering and DevOps – A Deeper Dive

 

Symposium with John Willis, Gene Kim, Dr Sidney Dekker, Dr Steven Spear, and Dr Richard Cook: Safety Culture, Lean, and DevOps

 

Approaches for resilience and antifragility in collaborative business ecosystems: Javaneh Ramezani and Luis M. Camarinha-Matos:

https://www.sciencedirect.com/science/article/pii/S0040162519304494

 

Learning organisations:
Garvin, D.A., Edmondson, A.C. and Gino, F., 2008. Is yours a learning organization?. Harvard business review, 86(3), p.109.
https://teamtopologies.com/book
https://www.psychsafety.co.uk/cognitive-load-and-psychological-safety/

 

Psychological safety: Edmondson, A., 1999. Psychological safety and learning behavior in work teams. Administrative science quarterly, 44(2), pp.350-383.

The Four Stages of Psychological Safety, Timothy R. Clark (2020)

Measuring psychological safety:

 

And of course the youtube video of the talk:

Please get in touch if you’d like to find out more.

A Critique of SAFe – The Scaled Agile Framework

What’s wrong with SAFe?

This is a critique of the Scaled Agile Framework (SAFe).

It’s a critique, so it’s pretty negative! There are some benefits to using SAFe in large organisations, and some very good use cases for full or partial adoption – as long as it’s considered part of a journey, your eyes are open to the problems with SAFe, and your reasons for adopting it are sound.

However, here I’m describing ten key points emphasising why it’s not an appropriate approach for most organisations looking to scale software delivery.

I’m really interested in your opinion, so please do get in touch if you wish to make a comment or suggestion. Ultimately, we must remember to scale down the problem before scaling up the solution.

In Summary: Problems with SAFe approaches:

  1. SAFe encourages normalisation of batch sizing across teams, incentivises increasing task sizes, and fundamentally misappropriates what story points are for.
  2. SAFe can cause increased localised technical debt.
  3. SAFe creates conflicts with support, operational and SRE functions.
  4. SAFe decreases inter-team (particularly value stream) collaboration.
  5. SAFe uses fallacies in estimation.
  6. SAFe decreases the agile focus on value in favour of “what management wants”.
  7. SAFe decreases the utility of, and the focus on, retrospectives.
  8. SAFe is not Agile – it encourages top-down, large-batch planning rather than small, iterative, feedback loops.
  9. SAFe is framed as a solution, rather than a stage of a journey.
  10. SAFe scales up the solution rather than scaling down the problem.

11 – (bonus point, thanks to Matthew Skelton) – SAFe encourages temporal coupling of teams.

In Detail: a critique of the Scaled Agile Framework:

1 – Nothing in agile suggests that we need to, or even *should*, measure work units (i.e. story points) in a uniform manner across teams. Story points exist to help the people *doing* the work break things down into an optimum batch size, which makes deliverables achievable, less complex, and facilitates flow. Indeed, SAFe actually encourages larger batch sizes through front-loaded planning, not smaller sizes planned through more iterative methods.

SAFe tries to normalise story points across teams for various reasons, but there is often a strong desire to measure and compare the delivery of teams and people. This is not what story points are for. Story points do not exist to measure how “productive” developers are.

2 – Technical debt tends to increase in SAFe organisations because the prioritisation of dealing with it is raised to a management level rather than team level. This is counter-productive for technical debt that originates at the team level (which most of it does). Management will tend to prioritise features and functions, delaying the pay-back of localised technical debt, and resulting in slower, higher risk, more brittle systems.

3 – If SAFe is applied to more operational functions, such as technology support, operations, or SRE, conflicts between delivery and support functions arise, because supporting teams typically need to work either responsively, dealing with issues as they arise, or on very short cycles – not the Programme Increment cycle time imposed by SAFe.

4 – Due to the focus on deliverables and accountability through project or product managers, teams may be discouraged from assisting each other, as they are measured by their own delivery and productivity: how much they assist other teams is rarely valued.

5 – The concept of “ideal dev days” is often used for estimating in SAFe, but ideal dev days are a fallacy. Instead, look at past similar deliverables and see how long they took. This is a much more predictive metric, and is less susceptible to optimism bias or the desire to please the boss.
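That reference-class approach – estimating from how long similar past work actually took – can be sketched in a few lines. The historical figures and the quartile-based range below are illustrative assumptions, not a prescribed method:

```python
from statistics import quantiles

# Hypothetical cycle times (in days) of past deliverables judged similar
# to the one being estimated - the "reference class".
past_cycle_times = [4, 6, 5, 9, 5, 7, 12, 6]

def reference_class_estimate(samples):
    """Estimate from how long similar work actually took, rather than
    from "ideal dev days". The median is the central estimate; the
    quartiles give a range that resists optimism bias."""
    q1, q2, q3 = quantiles(samples, n=4)
    return {"optimistic": q1, "likely": q2, "pessimistic": q3}

estimate = reference_class_estimate(past_cycle_times)
print(estimate)  # a range in days, grounded in real history
```

Reporting a range rather than a single number also makes the uncertainty visible to whoever is planning around the estimate.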

6 – The concept of “value” often breaks down in SAFe, through a focus on volume of delivery and meeting the (often arbitrary) deadlines imposed by management in PI planning. As a result, what end-users actually want is often ignored in favour of what management wants.

7 – PI planning includes a small element of retrospective activity, but it’s too little, too late. The retrospective feedback loops need to be short and light, not tagged on to PI planning as an afterthought. Here’s a comprehensive guide to retrospectives that also covers some really useful suggestions for running them with remote and distributed teams.

8 – Agile was created as a response to frustrations felt across the industry from heavyweight, top-down project management methodology that was killing the sector. Trying to scale Agile up by applying heavyweight, top-down methodologies is antithetical. 

9 – Some SAFe practitioners describe it as a transition stage, a process through which organisations can achieve increased capability at scale. I would agree: if an organisation feels the need to adopt SAFe, it should be as training wheels, a structure through which great capabilities can be built, before throwing off the shackles of a rigid, top-down framework. But if it were really true that SAFe is a transitional framework, why does the SAFe model not include anything about the transition away from it?

10 – In reality, most organisations don’t need SAFe. They’re not so big that they need such a big solution. SAFe is a comfort blanket for organisations used to traditional, slow, heavyweight, command-control structures. Your projects and products actually aren’t that big – and if they are, then that’s the problem, not the management process.

Fundamentally, SAFe tends to ignore, or encourages management to ignore, the possibility that those closest to the work might be the best equipped to make decisions about it. Scale the work down, not the process up. SAFe fits the delivery model to the organisational structure, rather than forcing the organisation to adopt new ways of working.

Here’s a bonus point 11, thanks to Matt Skelton of Team Topologies: SAFe, via the enforced Program Increment approach, encourages (or very possibly forces) at least a temporal coupling of teams that isn’t warranted. In fact, any sort of forced coupling is an antipattern for a fast flow of change, and via Conway’s Law, probably introduces architectural coupling too (which is bad). Given that SAFe adopts the PI as the core foundation of the approach, it’s unlikely that any SAFe practitioner would suggest dropping PI when the teams are mature enough to do so… or would they?

2023 update: Here’s a along with case studies and expert commentary from practitioners and researchers alike.

 

Resilience Engineering and DevOps – A Deeper Dive

robustness vs resilience

[This is a work in progress. If you spot an error, or would like to contribute, please get in touch]

The term “Resilience Engineering” is appearing more frequently in the DevOps domain, the field of physical safety, and other industries, but there exists some argument about what it really means. That disagreement doesn’t seem to occur in the domains where Resilience Engineering has been prevalent and applied for some time now, such as healthcare and aviation. Resilience Engineering is an academic field of study and practice in its own right. There is even a Resilience Engineering Association.

Resilience Engineering is a multidisciplinary field associated with safety science, complexity, human factors and associated domains that focuses on understanding how complex adaptive systems cope with, and learn from, surprise.

It addresses human factors, ergonomics, complexity, non-linearity, inter-dependencies, emergence, formal and informal social structures, threats and opportunities. A common refrain in the field of resilience engineering is “there is no root cause“, and blaming incidents on “human error” is also known to be counterproductive, as Sidney Dekker explains so eloquently in “The Field Guide To Understanding Human Error”.

Resilience engineering is “the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.” – Prof Erik Hollnagel

It is the “sustained adaptive capacity” of a system, organisation, or community.

Resilience engineering has the word “engineering” in it, which typically makes us think of machines, structures, or code, and this is maybe a little misleading. Instead, try to think of engineering as the process of response, creation and change.

Systems

Resilience Engineering also refers to “systems”, which might also lead you down a certain mental path of mechanical or digital systems. Widen your concept of systems from software and machines, to organisations, societies, ecosystems, even solar systems. They’re all systems in the broader sense.

Resilience engineering refers in particular to complex systems, and typically, complex systems involve people. Human beings like you and me (I don’t wish to be presumptuous, but I’m assuming that you’re a human reading this).

Consider Dave Snowden’s Cynefin framework:

cynefin

Systems in an Obvious state are fairly easy to deal with. There are no unknowns – they’re fixed and repeatable in nature, and the same process achieves the same result each time, so that we humans can use things like Standard Operating Procedures to work with them.

Systems in a Complicated state are large – usually too large for us humans to hold in our heads in their entirety – but are finite and have fixed rules. They possess known unknowns, by which we mean that you can find the answer if you know where to look. A modern motorcar or a game of chess is complicated, but possesses fixed rules that do not change. With expertise and good practice, such as that employed by surgeons, engineers, or chess players, we can work with systems in complicated states.

Systems in a Complex state possess unknown-unknowns, and include realms such as battlefields, ecosystems, organisations and teams, or humans themselves. The practice in complex systems is probe, sense, and respond. Complexity resists reductionist attempts at determining cause and effect because the rules are not fixed, therefore the effects of changes can themselves change over time, and even the attempt of measuring or sensing in a complex system can affect the system. When working with complex states, feedback loops that facilitate continuous learning about the changing system are crucial.

Systems in a Chaotic state are impossible to predict. Examples include emergency departments or crisis situations. There are no real rules to speak of, even ones that change. In these cases, acting first is necessary. Communication is rapid, and top-down or broadcast, because there is no time, or indeed any use, for debate.
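The probe-sense-respond practice for complex states can be pictured as a toy feedback loop. Everything in this sketch – the “system”, its single “load” metric, the target – is an illustrative stand-in; in a real organisation the probe is a safe-to-fail experiment and the sensing is a feedback loop of metrics, observations and conversations:

```python
import random

def probe(system, intervention):
    """Apply a small, reversible intervention to the system."""
    system["load"] += intervention

def sense(system):
    """Measure the effect, imperfectly - complex systems are noisy."""
    return system["load"] + random.uniform(-0.5, 0.5)

def respond(signal, target):
    """Amplify what moves us toward the target, dampen what doesn't."""
    return 1 if signal < target else -1

system = {"load": 0.0}
target = 5.0
intervention = 1
for _ in range(20):
    probe(system, intervention)
    intervention = respond(sense(system), target)

# The loop converges on, and then hovers around, the target through
# feedback - rather than being driven there by a fixed up-front plan.
```

Note that even the sensing step is noisy, which is exactly why a fixed plan fails in a complex state: the next intervention can only be chosen after observing the effect of the last one.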

Resilience

As Erik Hollnagel has said repeatedly since Resilience Engineering began (Hollnagel & Woods, 2006), resilience is about what a system can do — including its capacity: 

  • to anticipate — seeing developing signs of trouble ahead to begin to adapt early and reduce the risk of decompensation 
  • to synchronize —  adjusting how different roles at different levels coordinate their activities to keep pace with tempo of events and reduce the risk of working at cross purposes 
  • to be ready to respond — developing deployable and mobilizable response capabilities in advance of surprises and reduce the risk of brittleness 
  • for proactive learning — learning about brittleness and sources of resilient performance before major collapses or accidents occur by studying how surprises are caught and resolved 

(From Resilience is a Verb by David D. Woods)

 

  • Anticipation: create foresight about future operating conditions, revise models of risk
  • Readiness to respond: maintain deployable reserve resources available to keep pace with demand
  • Synchronization: coordinate information flows and actions across the networked system
  • Proactive learning: search for brittleness, gaps in understanding, trade-offs, re-prioritisations

Provan et al (2020) build upon Hollnagel’s four aspects of resilience to show that resilient people and organisations must possess a “readiness to respond”, stating: “This requires employees to have the psychological safety to apply their judgement without fear of repercussion.”

Resilience is therefore something that a system “does”, not “has”.

Systems comprise structures, technology, rules, inputs and outputs, and, most importantly, people.

“Resilience is about the creation and sustaining of various conditions that enable systems to adapt to unforeseen events. *People* are the adaptable element of those systems.” – John Allspaw (@allspaw) of Adaptive Capacity Labs.

Resilience therefore is about “systems” adapting to unforeseen events, and the adaptability of people is fundamental to resilience engineering.

And if resilience is the potential to anticipate, respond, learn, and change, and people are part of the systems we’re talking about:

We need to talk about people: What makes people resilient?

Psychological safety

Psychological safety is the key fundamental aspect of groups of people (whether that group is a team, organisation, community, or nation) that facilitates performance. It is the belief, within a group, “that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.” – Edmondson, 1999.

Amy Edmondson also talks about the concept of a “learning organisation” – essentially a complex system operating in a vastly more complex, even chaotic, wider environment. In a learning organisation, employees continually create, acquire, and transfer knowledge – helping their company adapt to the unpredictable faster than rivals can (Garvin et al, 2008).

“A resilient organisation adapts effectively to surprise.” (Lorin Hochstein, Netflix)

In this sense, we can see that a “learning organisation” and a “resilient organisation” are fundamentally the same.

Learning, resilient organisations must possess psychological safety in order to respond to changes and threats. They must also have clear goals, a clear vision, and supporting processes and structures. According to Conway’s Law:

“Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.”

In order both for the organisation to respond quickly to change, and for the systems that organisation has built to respond to change, the organisation must be structured in such a way that response to change is as rapid as possible. What this looks like will depend significantly on the organisation itself, but fundamentally, smaller, loosely coupled, autonomous, expert teams will be able to respond to change faster than large, tightly bound teams with low autonomy. Skelton and Pais’s Team Topologies explores this in much more depth.

Engineer the conditions for resilience engineering

“Before you can engineer resilience, you must engineer the conditions in which it is possible to engineer resilience.” – Rein Henrichs (@reinH)

As we’ve seen, an essential component of learning organisations is psychological safety. Psychological safety is a necessary (though not sufficient) condition for resilience to be created and sustained.

Therefore we must create psychological safety in our teams, our organisations, our human “systems”. Without this, we cannot engineer resilience. 

We create, build, and maintain psychological safety via three core behaviours:

  1. Framing work as a learning problem, not an execution problem. The primary outcome should be knowing how to do it even better next time.
  2. Acknowledging your own fallibility. You might be an expert, but you don’t know everything, and you get things wrong – if you admit it when you do, you allow others to do the same.
  3. Model curiosity – ask a lot of questions. This creates a need for voice: when you ask questions, people have to speak up.

Resilience engineering and psychological safety

Psychological safety enables these fundamental aspects of resilience – the sustained adaptive capacity of a team or organisation:

  • Taking risks and making changes that you don’t, or can’t, fully understand the outcomes of. 
  • Admitting when you made a mistake. 
  • Asking for help
  • Contributing new ideas
  • Detailed systemic cause* analysis (The ability to get detailed information about the “messy details” of work)

(*There is never a single root cause)

Let’s go back to that phrase at the start:

Sustained adaptive capacity.

What we’re trying to create is an organisation, a complex system, and sub systems (maybe including all that software we’re building) that possesses a capacity for sustained adaptation.

With DevOps we build systems that respond to demand and scale up and down; we implement redundancy and low dependency to allow for graceful failure, and identify and react to security threats.

Pretty much all of these only contribute to robustness.

robustness vs resilience

(David Woods, Professor, Integrated Systems Engineering Faculty, Ohio State University)

You may want to think back to the Cynefin model, and think of robustness as being able to deal well with known unknowns (complicated systems), and resilience as being able to deal well with unknown unknowns (complex, even chaotic systems). Technological or DevOps practices that primarily focus on systems, such as microservices, containerisation, autoscaling, or distribution of components, build robustness, not resilience.

However, if we are to build resilience, the sustained adaptive capacity for change, we can utilise DevOps practices for our benefit. None of them, like psychological safety, are sufficient on their own, but they are necessary. Using automation to reduce the cognitive load of people is important: by reducing the extraneous cognitive load, we maximise the germane, problem solving capability of people. The provision of other tools, internal platforms, automated testing pipelines, and increasing the observability of systems increases the ability of people and teams to respond to change, and increases their sustained adaptive capacity.

If brittleness is the opposite of resilience, what does “good” resilience look like? The word “anti-fragility” appears to crop up fairly often, due to the book “Antifragile: Things that Gain from Disorder” by Nassim Taleb. What Taleb describes as antifragile, ultimately, is resilience itself.

I have my own views on this, but fundamentally I think this relabelling illustrates the danger of academia (as in the field of resilience engineering) restricting access to knowledge. A lot of resilience engineering literature is held behind academic paywalls and journals, which most practitioners do not have access to. It should be no huge surprise that people may reject a body of knowledge if they have no access to it.

Observability

It is absolutely crucial to be able to observe what is happening inside the systems. This refers to anything from analysing system logs to identify errors or future problems, to managing Work In Progress (WIP) to highlight bottlenecks in a process.
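Both of those internal signals – error rates from logs, and WIP against limits – are simple to compute once the data is in hand. This is a toy sketch; the log format, stage names, and WIP limits are hypothetical:

```python
# Signal 1: error rate from (hypothetical) system logs.
log_lines = [
    "2021-03-01 10:00:01 INFO  request handled",
    "2021-03-01 10:00:02 ERROR db timeout",
    "2021-03-01 10:00:03 ERROR db timeout",
    "2021-03-01 10:00:04 INFO  request handled",
]

error_rate = sum("ERROR" in line for line in log_lines) / len(log_lines)
if error_rate > 0.25:  # illustrative threshold
    print(f"error rate {error_rate:.0%} - investigate before it escalates")

# Signal 2: Work In Progress per process stage against its limit; a
# stage over its limit is a likely bottleneck in the flow of work.
wip = {"analysis": 2, "build": 7, "review": 1}
wip_limits = {"analysis": 4, "build": 4, "review": 3}

bottlenecks = [stage for stage, count in wip.items()
               if count > wip_limits[stage]]
print("bottlenecks:", bottlenecks)
```

The point is not the arithmetic but the habit: make both signals visible continuously, so the system tells you about trouble while there is still time to adapt.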

Too often, engineering and technology organisations look only inward, whilst many of the threats to systems are external to the system and the organisation. Observability must also concern external metrics and qualitative data: what is happening in the marketspace, the economy, and what are our competitors doing?

Resilience Engineering and DevOps

What must we do?

Create psychological safety – this means that people can ask for help, raise issues, highlight potential risks and “apply their judgement without fear of repercussion.” There’s a great piece here on psychological safety and resilience engineering.

Manage cognitive load – so people can focus on the real problems of value – such as responding to unanticipated events.

Apply DevOps practices to technology – use automation, internal platforms and observability, amongst other DevOps practices. 

Increase observability and monitoring – this applies to systems (internal) and the world (external). People and systems cannot respond to a threat if they don’t see it coming.

Embed practices and expertise in component causal analysis – whether you call it a post-mortem, retrospective or debrief, build the habits and expertise to routinely examine the systemic component causes of failure. Try using Rothman’s Causal Pies in your next incident review.
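Rothman’s causal-pie idea can be modelled very simply: each “pie” is a set of component causes that together are sufficient to produce the failure. This is a minimal sketch, and the incident and cause names are entirely hypothetical:

```python
# Each "pie" is a set of component causes that, together, are
# sufficient for the failure - no single slice is "the" root cause.
sufficient_cause_sets = [
    {"stale config", "failed health check", "no rollback path"},
    {"traffic spike", "undersized pool", "retry storm"},
]

# Component causes surfaced during the incident review.
observed_causes = {"stale config", "failed health check",
                   "no rollback path", "alert fatigue"}

# Which complete pies were present? Any one of them explains the
# outage; removing any single slice from a pie prevents that pathway.
complete_pies = [pie for pie in sufficient_cause_sets
                 if pie <= observed_causes]
print(len(complete_pies), "sufficient combination(s) of causes present")
```

Framing the review this way keeps the discussion on combinations of systemic conditions rather than on hunting for one culprit.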

Run “fire drills” and disaster exercises. Make it easier for humans to deal with emergencies and unexpected events by making it habit. Increase the cognitive load available for problem solving in emergencies.

Structure the organisation in a way that facilitates adaptation and change. Consider appropriate team topologies to facilitate adaptability.

In summary

Through facilitating learning, responding, monitoring, and anticipating threats, we can create resilient organisations. DevOps and psychological safety are two important components of resilience engineering, and resilience engineering (in my opinion) is soon going to be seen as a core aspect of organisational (and digital) transformation.

 

References:

Conway, M. E. (1968) How Do Committees Invent? Datamation magazine. F. D. Thompson Publications, Inc. Available at: https://www.melconway.com/Home/Committees_Paper.html

Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing Company, USA.

Edmondson, A. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), pp.350-383.

Garvin, D.A., Edmondson, A.C. and Gino, F. (2008). Is yours a learning organization? Harvard Business Review, 86(3), pp.109-116, 134.

Hochstein, L. (2019). Resilience engineering: Where do I start? Available at: https://github.com/lorin/resilience-engineering/blob/master/intro.md (Accessed: 17 November 2020).

Hollnagel, E., Woods, D. D. & Leveson, N. C. (2006). Resilience engineering: Concepts and precepts. Aldershot, UK: Ashgate.

Hollnagel, E. (2020). Resilience Engineering. Available at: https://erikhollnagel.com/ideas/resilience-engineering.html (Accessed: 17 November 2020).

Provan, D.J., Woods, D.D., Dekker, S.W. and Rae, A.J. (2020). Safety II professionals: How resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, 106740. Available at: https://www.sciencedirect.com/science/article/pii/S0951832018309864

Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I.
(Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.

John Allspaw has collated an excellent book list for essential reading on resilience engineering here.