A Short Critique of SAFe – The Scaled Agile Framework


This is a critique of the Scaled Agile Framework (SAFe). It’s a critique, so it’s pretty negative! There actually are benefits through using SAFe, and some very good use cases for full or partial adoption, as long as your eyes are open to the problems with SAFe, and your reasons for adopting it are sound.

However, here I’m describing ten key reasons why it might not be the magic bullet for an organisation looking to scale technology delivery.

Problems and issues of SAFe:

  1. Encourages normalisation of batch sizing across teams. And increases batch size.
  2. Causes increased localised technical debt.
  3. Creates a conflict with support, operational and SRE functions.
  4. Decreases inter-team collaboration.
  5. Uses fallacies in estimation.
  6. Decreases the agile focus on value.
  7. Decreases the utility and focus on retrospectives.
  8. Encourages top-down planning rather than bottom-up.
  9. Is not actually required to scale.
  10. Is not agile.

1 – Nothing in agile suggests that we need to, or even *should*, measure work units (i.e. story points) in a uniform manner across teams. Story points exist to help the people *doing* the work break things down into an optimum “batch size”, which makes deliverables achievable and less complex, and facilitates flow. Indeed, SAFe actually encourages larger batch sizes through front-loaded planning, not smaller sizes planned through more iterative methods.

SAFe tries to normalise story points across teams for various reasons, often including a strong desire to measure and compare the delivery of teams and people. This is not what story points are for. Story points do not exist to measure how “productive” developers are.

2 – Technical debt tends to increase in SAFe organisations because the prioritisation of dealing with it is raised to a management level rather than team level. This is counter-productive for technical debt that originates at the team level (which most of it does). Management will tend to prioritise features and functions, delaying the pay-back of localised technical debt, and resulting in slower, higher risk, more brittle systems.

3 – If SAFe is applied to more operational functions, such as technology support, operations, or SRE, conflicts between delivery and support functions arise, because supporting teams typically need to work either responsively, dealing with issues as they arise, or on very short cycles – not the Programme Increment cycle time imposed by SAFe.

4 – Due to the focus on deliverables and accountability through project or product managers, teams may be discouraged from assisting each other, as they are measured by their own delivery: how much they assist other teams is rarely valued.

5 – The concept of “ideal dev days” is often used for estimating in SAFe. Everyone else knows that ideal dev days are a fallacy. Instead, look at past similar deliverables, and see how long they took. This is a much more predictive metric, and is less susceptible to optimism bias or wanting to please the boss.
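As a rough illustration of estimating from past similar deliverables, here is a minimal Python sketch. The function name, the choice of the median as the "typical" figure and the 80th percentile as the "cautious" figure are my own assumptions for the example, not anything prescribed by SAFe or by agile practice.

```python
from statistics import median, quantiles


def forecast_from_history(past_durations_days):
    """Summarise how long similar past deliverables actually took.

    `past_durations_days` is a hypothetical list of elapsed days for
    comparable pieces of work that have already been delivered.
    """
    if len(past_durations_days) < 2:
        raise ValueError("need at least two past deliverables to forecast")

    ordered = sorted(past_durations_days)
    # quantiles(n=10) returns the nine cut points between deciles,
    # so index 7 is the 80th percentile of the observed durations.
    deciles = quantiles(ordered, n=10, method="inclusive")
    return {
        "typical_days": median(ordered),   # half of past work finished within this
        "cautious_days": deciles[7],       # roughly four in five finished within this
    }
```

Calling `forecast_from_history([4, 6, 7, 9, 15])` gives a range based on what actually happened rather than on what anyone hopes will happen, which is the point: the numbers come from delivery history, not from an idealised day.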

6 – The concept of “value” often breaks down in SAFe, through a focus on volume of delivery and meeting the (often arbitrary) deadlines imposed by management in PI planning. As a result, what end-users actually want is often ignored in favour of what management wants.

7 – PI planning includes a small element of retrospective activity, but it’s too little, too late. The retrospective feedback loops need to be short and light, not tagged on to PI planning as an afterthought.

8 – Agile was created as a response to frustrations felt across the industry from heavyweight, top-down project management methodology that was killing the sector. Trying to scale Agile up by applying heavyweight, top-down methodologies is antithetical. 

9 – Some SAFe practitioners describe it as a transition stage, a process through which organisations can achieve increased capability at scale. I would agree: if an organisation feels the need to adopt SAFe, it should be as training wheels, a structure through which great capabilities can be built, before throwing off the shackles of a rigid, top-down framework. But if it were really true that SAFe is a transitionary framework, why does the SAFe model not include anything about the transition away from it?

10 – In reality, most organisations don’t need SAFe. They’re not so big that they need such a big solution. SAFe is a comfort blanket for organisations used to traditional, slow, heavyweight, command-control structures. Your projects and products actually aren’t that big – and if they are, then that’s the problem, not the management process.

Fundamentally, SAFe tends to ignore, or encourages management to ignore, the possibility that those closest to the work might be the best equipped to make decisions about it. Scale the work down, not the process up. SAFe fits the delivery model to the organisational structure, rather than forcing the organisation to adopt new ways.

Resilience Engineering and DevOps – A Deeper Dive


[This is a work in progress. If you spot an error, or would like to contribute, please get in touch]

The term “Resilience Engineering” is appearing more frequently in the DevOps and technology world, and there exists some argument about what it really means. Resilience Engineering is a field in its own right. There is even a Resilience Engineering Association.

It addresses complexity, non-linearity, inter-dependencies, emergence, formal and informal social structures, threats and opportunities. A common refrain in the field of resilience engineering is “there is no root cause”, and blaming incidents on “human error” is also highly frowned upon, as Sidney Dekker explains so eloquently in “The Field Guide To Understanding Human Error”.

Resilience engineering is “The intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.” – Prof Erik Hollnagel

It is the “sustained adaptive capacity” of a system, organisation, or community.

Resilience engineering has the word “engineering” in it, which typically makes us think of machines, structures, or code, and this is maybe a little misleading. Instead, try to think of engineering as the process of response, creation and change.

Systems

Resilience Engineering also refers to “systems”, which might also lead you down a certain mental path of mechanical or digital systems. Widen your concept of systems from software and machines, to organisations, societies, ecosystems, even solar systems. They’re all systems in the broader sense.

Resilience engineering refers in particular to complex systems, and typically, complex systems involve people. Human beings like you and me (I don’t wish to be presumptive but I’m assuming that you’re a human reading this).

Consider Dave Snowden’s Cynefin framework:

[Figure: the Cynefin framework]

Obvious systems are fairly easy to deal with. There are no unknowns – they’re fixed and repeatable in nature, and the same process achieves the same result each time, so that we humans can use things like Standard Operating Procedures to work with them.

Complicated systems are large, usually too large for us humans to hold in our heads in their entirety, but are finite and have fixed rules. They possess known unknowns – by which we mean that you can find the answer if you know where to look. A modern motorcar, or a game of chess, are complicated – but possess fixed rules that do not change. With expertise and good practice, such as employed by surgeons or engineers or chess players, we can work with these systems. 

Complex systems possess unknown-unknowns, and include realms such as battlefields, ecosystems, organisations and teams, or humans themselves. The practice in complex systems is probe, sense, and respond. Complex systems resist reductionist attempts at determining cause and effect because the rules are not fixed, so the effects of changes can themselves change over time, and even the attempt to measure or sense in a complex system can affect the system. When working with complex systems, feedback loops that facilitate continuous learning about the changing system are crucial.

Chaotic systems are impossible to predict. Examples include emergency departments or crisis situations. There are no real rules to speak of, even ones that change. In these cases, acting first is necessary. Communication is rapid, and top-down or broadcast, because there is no time, or indeed any use, for debate.

Resilience

As Erik Hollnagel has said repeatedly since Resilience Engineering began (Hollnagel & Woods, 2006), resilience is about what a system can do — including its capacity

  • to anticipate — seeing developing signs of trouble ahead to begin to adapt early and reduce the risk of decompensation 
  • to synchronize —  adjusting how different roles at different levels coordinate their activities to keep pace with tempo of events and reduce the risk of working at cross purposes 
  • to be ready to respond — developing deployable and mobilizable response capabilities in advance of surprises and reduce the risk of brittleness 
  • for proactive learning — learning about brittleness and sources of resilient performance before major collapses or accidents occur by studying how surprises are caught and resolved 

(From Resilience is a Verb by David D. Woods)

 

Capacity              Description
Anticipation          Create foresight about future operating conditions, revise models of risk
Readiness to respond  Maintain deployable reserve resources available to keep pace with demand
Synchronization       Coordinate information flows and actions across the networked system
Proactive learning    Search for brittleness, gaps in understanding, trade-offs, re-prioritisations

Provan et al (2020) build upon Hollnagel’s four aspects of resilience to show that resilient people and organisations must possess a “Readiness to respond”, and state: “This requires employees to have the psychological safety to apply their judgement without fear of repercussion.”

Resilience is therefore something that a system “does”, not “has”.

Systems comprise structures, technology, rules, inputs and outputs, and most importantly, people.

“Resilience is about the creation and sustaining of various conditions that enable systems to adapt to unforeseen events. *People* are the adaptable element of those systems” – John Allspaw (@allspaw) of Adaptive Capacity Labs.

Resilience therefore is about “systems” adapting to unforeseen events, and the adaptability of people is fundamental to resilience engineering.

And if resilience is the potential to anticipate, respond, learn, and change, and people are part of the systems we’re talking about:

We need to talk about people: What makes people resilient?

Psychological safety

Psychological safety is the key fundamental aspect of groups of people (whether that group is a team, organisation, community, or nation) that facilitates performance. It is the belief, within a group, “that one will not be punished or humiliated for speaking up with ideas, questions, concerns, or mistakes.” – Edmondson, 1999.

Amy Edmondson also talks about the concept of a “Learning organisation” – essentially a complex system operating in a vastly more complex, even chaotic wider environment. In a learning organisation, employees continually create, acquire, and transfer knowledge, helping their company adapt to the unpredictable faster than rivals can. (Garvin et al, 2008)

“A resilient organisation adapts effectively to surprise.” (Lorin Hochstein, Netflix)

In this sense, we can see that a “learning organisation” and a “resilient organisation” are fundamentally the same.

Learning, resilient organisations must possess psychological safety in order to respond to changes and threats. They must also have clear goals, vision, and processes and structures. According to Conway’s Law:

“Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.”

In order both for the organisation to respond quickly to change, and for the systems that organisation has built to respond to change, the organisation must be structured in such a way that its response to change is as rapid as possible. In context, this will depend significantly on the organisation itself, but fundamentally, smaller, less tightly coupled, autonomous and expert teams will be able to respond to change faster than large, tightly bound teams with low autonomy will. Pais and Skelton’s Team Topologies explores this in much more depth.

Engineer the conditions for resilience engineering

“Before you can engineer resilience, you must engineer the conditions in which it is possible to engineer resilience.” – Rein Henrichs (@reinH)

As we’ve seen, an essential component of learning organisations is psychological safety. Psychological safety is a necessary condition (though not sufficient) for the  conditions of resilience to be created and sustained. 

Therefore we must create psychological safety in our teams, our organisations, our human “systems”. Without this, we cannot engineer resilience. 

We create, build, and maintain psychological safety via three core behaviours:

  1. Framing work as a learning problem, not an execution problem. The primary outcome should be knowing how to do it even better next time.
  2. Acknowledging your own fallibility. You might be an expert, but you don’t know everything, and you get things wrong – if you admit it when you do, you allow others to do the same.
  3. Modelling curiosity – ask a lot of questions. This creates a need for voice. When you ask questions, people HAVE to speak up.

Resilience engineering and psychological safety

Psychological safety enables these fundamental aspects of resilience – the sustained adaptive capacity of a team or organisation:

  • Taking risks and making changes that you don’t, or can’t, fully understand the outcomes of. 
  • Admitting when you made a mistake. 
  • Asking for help
  • Contributing new ideas
  • Detailed systemic cause* analysis (The ability to get detailed information about the “messy details” of work)

(*There is never a single root cause)

Let’s go back to that phrase at the start:

Sustained adaptive capacity.

What we’re trying to create is an organisation, a complex system, and sub systems (maybe including all that software we’re building) that possesses a capacity for sustained adaptation.

With DevOps we build systems that respond to demand and scale up and down; we implement redundancy and low dependency to allow for graceful failure; and we identify and react to security threats.

Pretty much all of these only contribute to robustness.

[Figure: robustness vs resilience (David Woods, Professor, Integrated Systems Engineering Faculty, Ohio State University)]

You may want to think back to the Cynefin model, and think of robustness as being able to deal well with known unknowns (complicated systems), and resilience as being able to deal well with unknown unknowns (complex, even chaotic systems). Technological or DevOps practices that primarily focus on systems, such as microservices, containerisation, autoscaling, or distribution of components, build robustness, not resilience.

However, if we are to build resilience, the sustained adaptive capacity for change, we can utilise DevOps practices for our benefit. Like psychological safety, none of them is sufficient on its own, but they are necessary. Using automation to reduce the cognitive load of people is important: by reducing the extraneous cognitive load, we maximise the germane, problem-solving capability of people. The provision of other tools, internal platforms and automated testing pipelines, together with increased observability of systems, increases the ability of people and teams to respond to change, and increases their sustained adaptive capacity.

Observability

It is absolutely crucial to be able to observe what is happening inside the systems. This refers to anything from analysing system logs to identify errors or future problems, to managing Work In Progress (WIP) to highlight bottlenecks in a process.
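To make the Work In Progress example a little more concrete, here is a minimal Python sketch that counts items per stage and flags any stage holding more work than its agreed limit. The item structure, stage names and limits are all hypothetical; in practice they would come from whatever board or ticketing tool a team already uses.

```python
from collections import Counter


def find_bottlenecks(work_items, wip_limits):
    """Return the stages whose work in progress exceeds the agreed limit.

    `work_items` is a hypothetical list of dicts with a "stage" key, and
    `wip_limits` maps a stage name to the maximum items allowed in it.
    Stages without a configured limit are never flagged.
    """
    wip = Counter(item["stage"] for item in work_items)
    return {
        stage: count
        for stage, count in wip.items()
        if count > wip_limits.get(stage, float("inf"))
    }
```

A stage that keeps appearing in the result is a candidate bottleneck worth a conversation. The sketch only makes the queue visible; deciding what to do about it is still a job for the people doing the work.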

Too often, engineering and technology organisations look only inward, whilst many of the threats to systems are external to the system and the organisation. Observability must also concern external metrics and qualitative data: what is happening in the marketspace, the economy, and what are our competitors doing?

Resilience Engineering and DevOps

What must we do?

Create psychological safety – this means that people can ask for help and “apply their judgement without fear of repercussion.”

Manage cognitive load – so people can focus on the real problems of value – such as responding to unanticipated events.

Apply DevOps practices to technology – use automation, internal platforms and observability, amongst other DevOps practices. 

Increase observability and monitoring – this applies to systems (internal) and the world (external). People and systems cannot respond to a threat if they don’t see it coming.

Structure the organisation in a way that facilitates adaptation and change. Consider appropriate team topologies to facilitate adaptability.

In summary

Through facilitating learning, responding, monitoring, and anticipating threats, we can create resilient organisations. DevOps and psychological safety are two important components of resilience engineering.

 

References:

Conway, M. E. (1968) How Do Committees Invent? Datamation magazine. F. D. Thompson Publications, Inc. Available at: https://www.melconway.com/Home/Committees_Paper.html

Edmondson, A., 1999. Psychological safety and learning behavior in work teams. Administrative science quarterly, 44(2), pp.350-383.

Garvin, D., Edmondson, A. and Gino, F., 2008. Is yours a learning organization? Harvard Business Review, 86, pp.109-116, 134.

Hochstein, L. (2019)  Resilience engineering: Where do I start? Available at: https://github.com/lorin/resilience-engineering/blob/master/intro.md (Accessed: 17 November 2020).

Hollnagel, E., Woods, D. D. & Leveson, N. C. (2006). Resilience engineering: Concepts and precepts. Aldershot, UK: Ashgate.

Hollnagel, E. (2020) Resilience Engineering. Available at: https://erikhollnagel.com/ideas/resilience-engineering.html (Accessed: 17 November 2020).

Provan, D.J., Woods, D.D., Dekker, S.W. and Rae, A.J., 2020. Safety II professionals: how resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, p.106740. Available at https://www.sciencedirect.com/science/article/pii/S0951832018309864

Woods, D. D. (2018). Resilience is a verb. In Trump, B. D., Florin, M.-V., & Linkov, I. (Eds.). IRGC resource guide on resilience (vol. 2): Domains of resilience for complex interconnected systems. Lausanne, CH: EPFL International Risk Governance Center. Available on irgc.epfl.ch and irgc.org.

The State of DevOps Report 2020 – A Summary

Every year for the past decade, Puppet have produced their “State of DevOps” report, apart from 2019, when it was carried out and released by DORA through Google.

This year, Puppet took the reins again and despite 2020 being the year from Hell, they managed to survey 2,400 technology professionals and released their report on 12th November.

The State of DevOps report attempts to gather, aggregate and analyse progress across the technology industry, backed by data and statistical analysis.

Here are the key takeaways from the 2020 State of DevOps Report:

DevOps continues to evolve.

One of the things I like about the Puppet approach is that they see DevOps as a continual evolution towards improved delivery, quality and security, and steer away from a more traditional “maturity model” that implies a possibly fictional end state where DevOps is “done”.

From personal experience and what we’ve seen over the past ten years of data, we need to recognise that technical practices are important, but practices that are isolated to a few teams simply aren’t enough to help organisations achieve widespread DevOps success. DevOps is not a CI/CD pipeline, it’s not technology, public cloud, or automation. DevOps is people, culture, mindset, technology, constraints, experience and expertise.*

As the 2019 report by DORA showed, a culture of psychological safety is crucial to both team & organisational performance, and productivity.


Internal platform teams

One major and evident transition is the shift to internal platform teams. Unlike product teams, which are responsible for the end-to-end delivery of a product, internal platform teams are responsible for a platform that provides the infrastructure, environments, deployment pipelines and other internal services that enable internal customers (such as those product teams) to build, deploy and run their applications.

The platform model can make product teams far more efficient by allowing them to focus on their primary goals and their core competencies: building and delivering products. A platform team can improve governance, compliance and cost efficiency through providing a standardised toolset that can be easily understood and consumed by value stream-oriented teams.

The 2020 State of DevOps report shows that high performing organisations are six times more likely to report the use of internal platforms as compared to low performing organisations.


Shared internal platforms provide a balance between standardisation and team autonomy. Finding where to place this balance and draw the line can be challenging, but the important thing is to start.

A really useful resource is Manuel Pais and Matthew Skelton’s book “Team Topologies”, which will help you understand what team structures will contribute to building high performing products and services, and how internal platform teams could work in your organisation.

Product over project

More organisations are transitioning away from a traditional project mindset, towards value-stream-aligned, product approaches. Organisations that still possess a traditional “project mindset” may suffer from the proliferation of temporary teams that form and disperse as projects begin and end, impacting team cohesion and performance.

A project mindset encourages teams to focus on the next shiny thing, and throw things over the wall for ops to support, rather than own a product or service longer term and ensure that it’s not only fit for purpose, but constantly improving.

Adopting a product-oriented approach and tying work to value streams improves the delivery of features, reduces defects, increases security, and lowers technical debt. Mik Kersten’s Project To Product is an excellent book to learn more about how to adopt a product approach.

The 2020 State of DevOps report shows that a product mindset is a key enabler of performance in the technology space, and accelerates DevOps adoption and evolution.


Change management

Ever since Gene Kim wrote The Phoenix Project, we’ve known that fast and lean change management is a precursor for technology performance. Nicole Forsgren describes in her book Accelerate how lead time for changes is an essential trailing metric for high performing teams.
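For anyone wanting to track lead time for changes themselves, here is a minimal Python sketch. It assumes lead time is measured from the moment a change is committed to the moment it is running in production, and it reports the median rather than the mean so that one unusually slow change does not dominate. The function name and input shape are hypothetical.

```python
from datetime import datetime
from statistics import median


def median_lead_time_hours(changes):
    """Median hours between a change being committed and deployed.

    `changes` is a hypothetical list of (committed_at, deployed_at)
    datetime pairs, one pair per change that reached production.
    """
    if not changes:
        raise ValueError("no deployed changes to measure")

    lead_times = [
        (deployed_at - committed_at).total_seconds() / 3600
        for committed_at, deployed_at in changes
    ]
    return median(lead_times)
```

Watching this number over time is more useful than any single reading. Treat it as a signal about how well the change process flows, not as a target for individual teams to be compared against.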

The 2020 State of DevOps report revealed four different approaches to change management based on approval processes (orthodox “gatekeeping” approaches versus adaptive and collaborative), automated testing and deployment, and advanced risk mitigation techniques.

The four approaches described by Puppet are:

  • Operationally mature: High levels of both process and automation.
  • Engineering driven: High emphasis on automation.
  • Governance focused: High emphasis on manual approvals and low emphasis on automation.
  • Ad hoc: Low emphasis on both process and automation.
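One way to read the four approaches above is as a simple two-by-two of process emphasis against automation emphasis. The Python sketch below encodes that reading. It is my own simplification, not how Puppet derived the groups: it assumes “Engineering driven” implies a low emphasis on process, and it treats the manual approvals of “Governance focused” as a high emphasis on process.

```python
def change_management_approach(high_process_emphasis, high_automation_emphasis):
    """Map two yes/no judgements onto the four approaches in the list above.

    Both arguments are hypothetical booleans a team might assign itself;
    the two-by-two mapping is an assumption made for illustration.
    """
    if high_process_emphasis and high_automation_emphasis:
        return "Operationally mature"
    if high_automation_emphasis:
        # Assumed: automation is emphasised while formal process is not.
        return "Engineering driven"
    if high_process_emphasis:
        # Assumed: process here means manual approvals with little automation.
        return "Governance focused"
    return "Ad hoc"
```

For example, a team with heavy change-approval boards and few automated tests would pass `True, False` and land in “Governance focused”.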

Puppet also showed that organisations that trust in their change management processes are more likely to adopt automation, which further improves performance.  Additionally, organisations that encourage high engagement with employees in the change management process are five times more likely to have effective change management processes.


It is interesting to note that ITIL, originally intended to improve the quality and performance of technology, has been adopted by many organisations (90% of Fortune 500 firms have adopted ITIL) but has produced cumbersome bureaucratic processes that actually result in slower change and higher risk. Fortunately, the latest version of ITIL, v4, departs from this heavyweight approach and instead encourages change enablement and collaboration.

To put it simply:

  • Orthodox approvals damage performance
  • Automation gives teams confidence in change management
  • Giving people agency over the process results in higher performance

Challenges to improving change management practices include incomplete test coverage, organisational mindsets of fear and compliance instead of trust and value, and tightly coupled and/or monolithic architectures.

As with any DevOps transformation, improve change management processes but primarily focus on people and culture. Break down silos and build empathy across people and teams: enable and encourage engineers to understand and empathise with the concerns of compliance and risk teams, whilst working with governance to create a culture of shifting security and compliance left.

TL;DR:

  1. The industry still has a long way to go and there remain significant areas for improvement across all sectors.
  2. Internal platforms and platform teams are a key enabler of performance, and more organisations are adopting this approach.
  3. Adopting a product approach over a project-oriented one improves performance and facilitates improved adoption of DevOps cultures and practices.
  4. Lean, automated, and people-oriented change management processes improve velocity and performance.

 

Thanks to the team at Puppet and DORA for carrying out the State Of DevOps reports every year, including the team for this year’s report, Alanna Brown (@alannapb), Michael Stahnke (@stahnma), and Nigel Kersten (@nigelkersten).

 

 

*Thanks to Tom Hoyland for the articulate description of DevOps.