DevOps, Psychological Safety and Resilience Engineering

The Links Between Psychological Safety, Resilience Engineering and DevOps

Note: since writing this, I’ve learned a lot more about resilience engineering and its relation to DevOps and psychological safety. 


Psychological safety is cited as the key factor in team performance by numerous studies, including Google's Project Aristotle and the DORA/Google State of DevOps reports. The evidence shows that teams that operate in psychologically safe environments, where people can present their true selves at work, take risks, admit mistakes, and ask for support from their teammates, significantly outperform those that do not.

However, establishing psychological safety is rarely prioritised by delivery-focussed leaders who use output-oriented metrics. Instead, these leaders tend to focus on objectives, metrics, and modern practices such as value-stream alignment and cross-functional teams. While these have great value, and will go some way, or indeed a long way, towards driving performance and delivery, they are not the full picture.

It can be very challenging, particularly for less experienced leaders, or capable leaders in difficult circumstances, to build and facilitate psychologically safe environments. This is particularly true in technologically oriented organisations, where the domain is complex and failure is explicit, obvious, and can generate a large blast radius.

Mistakes happen. They must happen.

In a psychologically unsafe team, a software engineer who makes a mistake in a complex system and releases a small flaw into production that later causes an outage may be blamed for the mistake. The flaw will be easily attributable, and the impact of the outage can be significant. In many organisations, the resultant fear of error can dramatically slow down the rate of change and speed to market.

Of course, the converse is not appealing either; it’s not acceptable to tolerate errors, outages, and mistakes. Speed to market with a faulty product or service may be equally as bad as a significant delay to reach the market. Customers do not tolerate poor quality services, so we need to build high quality services and do it at velocity. This delivery at pace is one of the key tenets of DevOps, and an effective DevOps culture requires psychological safety.

Resilience Engineering and Psychological Safety

In my work on psychological safety in high performance teams, I’m often asked about how to achieve it, and whilst there are many general approaches that overlap significantly with principles of excellence in servant (or empathetic) leadership, there are also specific actions and approaches that are suitable specifically for technology teams. Here, we’re going to drill down into one of the key aspects of a DevOps approach: Resilience Engineering, and how psychological safety is a fundamental component of resilience.

Resilience Engineering is a field of study that emerged from cognitive systems engineering in the early 2000s, largely in response to NASA events in 1999 and 2000, including the failure of the Mars Climate Orbiter. Erik Hollnagel defines it as "the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions."

Resilience Engineering is the intentional engineering of a system (a sociotechnical system, such as a community, organisation, or nation) so that it can anticipate, detect, and respond to changes, both external and internal, planned or unplanned, and continue to operate whilst those changes occur.

Very little theory within this domain has been generated that doesn't emerge from studies of real work; Resilience Engineering is practised in high-stakes domains such as aviation, construction, surgery, military agencies, and law enforcement, and is becoming more visible in DevOps and Digital Transformation.

There is a difference between robustness and resilience, as described by David Woods, Professor of Integrated Systems Engineering at Ohio State University. Technological measures such as autoscaling, failovers, retries, and queues only really contribute to robustness, not resilience. Such measures include:

  • Decoupling and reducing dependencies between components
  • Utilising microservices and containerisation
  • Autoscaling applications based on demand
  • Creating self-healing applications and systems
  • Using monitoring and visibility tools to facilitate responses to out-of-bounds events
  • Adopting error budgeting instead of (or in addition to) uptime measures (see the worked example below)
  • Using automated code testing, continuous integration and advanced deployment practices

(Figure: robustness vs resilience)
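As a rough illustration of error budgeting (a sketch only, assuming a 99.9% availability target measured over a 30-day window), the error budget is simply the unavailability you are permitted to "spend":

# Minutes in a 30-day window:
echo "30 * 24 * 60" | bc            # 43200
# Error budget at an assumed 99.9% availability target:
echo "30 * 24 * 60 * 0.001" | bc    # 43.200 minutes of permitted downtime

That ~43 minutes is what the team can spend on risky releases and experiments before the target is breached, which reframes reliability as a budget to invest rather than a number to fear.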

These approaches extend to concepts such as chaos engineering, in which flaws and interruptions are intentionally introduced in order to examine how the system behaves and to help engineers identify how they can improve it.
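As a minimal sketch of the principle (assuming a Kubernetes cluster and a hypothetical "staging" namespace, rather than any particular chaos tooling), you could terminate one pod at random and observe how the system responds:

# Pick a random pod in the (hypothetical) staging namespace and delete it,
# then watch whether the system self-heals and how quickly alerts fire.
# (shuf is GNU coreutils; on macOS install coreutils and use gshuf.)
kubectl get pods -n staging -o name | shuf -n 1 | xargs kubectl delete -n staging

Real chaos experiments add a hypothesis, a defined blast radius, and an abort condition, but the principle is the same.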

DevOps practices such as these help to build and improve psychological safety by facilitating safe risk-taking. In turn, Resilience Engineering requires psychological safety to be present, because it is only psychological safety that enables people (the adaptable component of a system) to anticipate, respond, and adapt to changes and challenges.

The formation of DevOps teams

As a new DevOps-oriented team moves through Tuckman’s Forming-Storming-Norming-Performing cycle, it relies more and more on cultures and practices that facilitate risk taking and admitting mistakes. If these practices are not embedded, the team will never be able to progress to the “performing” stage, because high performance explicitly requires innovation, and therefore, risk taking. Without psychological safety, teams will cycle around the Storming and Norming phases as elements change in or around the team, such as people leaving or joining.

It is only once an engineering team reaches the high performing stage that they can truly deliver high quality services at velocity. By utilising resilience engineering principles and DevOps practices, engineers are supported to take risks, experiment, deploy changes and recover quickly. They can feel comfortable in the knowledge that if something did go wrong, they’d find out straight away, before customers start calling. These practices, far from being so-called “soft” skills, are measurable by solid metrics that describe velocity whilst maintaining reliability, such as Change Rate, Mean Time Between Failures (MTBF) and Mean Time To Restore (MTTR).

Engineering Team Topologies

Resilience Engineering shares many capabilities with Site Reliability Engineering (SRE), introduced by Ben Treynor's team at Google in 2004. SRE practices and capabilities may be implemented by an expert, dedicated, shared SRE team, or it may suit your organisation to embed an SRE function into each stream-aligned (SA) team if the products and systems are large enough to justify it. Alternatively, it may be feasible to empower software engineers themselves to carry out SRE responsibilities if your systems are small enough.

In addition to expert leadership, well-organised teams, shared values, retrospectives that diagnose systemic root causes, and an embrace of continuous improvement, we must adopt capabilities that empower team members to carry out their roles without fear of failure. In a technology team, those capabilities are the very same ones that enable high-velocity change, security, and reliability.

For more information regarding technology team organisation, Matthew Skelton and Manuel Pais explore in great depth how software teams can be organised to deliver the most value, safety, and performance in their book "Team Topologies", where they examine how the concepts of Conway's Law, Cognitive Load, and Organisational Evolution converge.

Resilience engineering is about entire organisations, not just technology.

Whatever team you're on, or whatever team you lead, considering resilience engineering principles will improve delivery, safety, happiness, and performance. It enables people to work without fear, psychologically safe in the knowledge that errors do not flow downstream. This is where true high performance, speed to market, quality, and innovation happen.

Psychological safety is also a core component of Agile delivery teams, as it fundamentally enables truthful communication, response to change, and the ability to make mistakes and innovate.

Build your own high performing teams with psychological safety

For more information about high performance teams, psychological safety, DevOps, or any of the other concepts covered in this article, get in touch. I'm always up for collaboration, speaking at events, podcasts, or other ways to get involved and help teams become more productive, safer, and most importantly, happier.

Download a complete Psychological Safety Action Pack full of workshops, tools, resources, and posters to help you measure, build, and maintain Psychological Safety in your teams.

@tom_geraghty

tom@tomgeraghty.co.uk 

Psychological Safety in High Performing Teams – Digital Lincoln Meetup April 2020

This is a recording of a webinar I did for the meetup group Digital Lincoln on the 28th April 2020.

Psychological Safety in High Performing and Distributed Teams

Safety isn't just necessary to prevent disasters; it's also crucial to building and maintaining high performance teams and organisations.

Building high performing software requires high performing teams, in which team members need to feel able to express their creativity, talents and skills without self-censoring, self-silencing, or fear of failure. In this talk, Tom introduces the latest research in high performance technology teams, and provides actionable concepts to help you build and elevate your team, whether co-located or distributed and remote.

Download a complete Psychological Safety Action Pack full of workshops, tools, resources, and posters to help you measure, build, and maintain Psychological Safety in your teams.

Find out about Psychological Safety and Information Security, and more here about high performing design teams who utilise psychological safety.

Tom is an expert in transforming team performance. A "veteran technology team builder" according to Computing Magazine, over his 15 years as a technology leader he has come to believe that culture trumps strategy, and happiness precedes success.

Tom is currently the Head of Technology at MoreNiche in Nottingham, CTO of Ydentity, an identity protection startup, and a management consultant for Q5. Outside of work, Tom is a yoga teacher and mountain biker.

https://www.crowdcast.io/e/missile-destroyers

The State of DevOps Report 2019 – A Summary

Every year or so since 2013, Puppet have carried out their “State of DevOps” report that attempts to gather, aggregate and analyse progress across the technology industry, backed by data and statistical analysis. In 2019, both Puppet and Google researched and released their own State Of DevOps Reports.

Here is a summary of the findings from both 2019 State of DevOps Reports.

(More detail to come shortly!)

  1. 2019 – Puppet:
    1. Doing DevOps well enables you to do security well.
    2. Integrating security deeply into the software delivery lifecycle makes teams more than twice as confident of their security posture.
    3. Integrating security throughout the software delivery lifecycle leads to positive outcomes.
    4. Security integration is messy, especially in the middle stages of evolution.

     

  2. 2019 – Google:
    1. The industry continues to improve, particularly among the elite performers.
    2. The best strategies for scaling DevOps in organisations focus on structural solutions that build community, including Communities of Practice.
    3. Cloud continues to be a differentiator for elite performers and drives high performance.
    4. To support productivity, organisations can foster a culture of psychological safety and make smart investments in tooling, information search, and reducing technical debt through flexible, extensible, and viewable systems.
    5. Heavyweight change approval processes, such as change approval boards, negatively impact speed and stability. In contrast, having a clearly understood process for changes drives speed and stability, as well as reductions in burnout.

 

 

Using Hugo and AWS to build a fast, static, easily managed and deployed website.

Most of my websites are built using WordPress on Linux in AWS, using EC2 for compute, S3 for storage and Aurora for the data layer. Take a look at sono.life as an example.

For this site, I wanted to build something that aligned with, and demonstrated some of the key tenets of cloud technology – that is: scalability, resiliency, availability, and security, and was designed for the cloud, not simply in the cloud.

I chose technologies that were cloud native, were as fast as possible, easily managed, version controlled, quickly deployed, and presented over TLS. I opted for Hugo, a super-fast static website generator that is managed from the command line. It’s used by organisations such as Let’s Encrypt to build super fast, secure, reliable and scalable websites. The rest of my choices are listed below. Wherever possible, I’ve used the native AWS solution.

The whole site loads in less than half a second, and there are still improvements to be made. It may not be pretty, but it's fast. Below is a walkthrough with notes that should help you build your own Hugo site in AWS. The notes assume that you know your way around the command line, that you have an AWS account, and that you have a basic understanding of the services involved in the build. I think I've covered all the steps, but if you try to follow this and spot a missing step, let me know.

Notes on Build – Test – Deploy:

Hugo was installed via Homebrew to build the site. If you haven't installed Homebrew yet, just do it. Fetch it by running:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then install Hugo:
brew install hugo

One of the things I love about Hugo is the ability to make rapid, on-the-fly changes to the site and see the result instantly, running the Hugo server locally.

hugo server -w -D

The option -D includes drafts in the output, whilst -w watches the filesystem for changes, so you don’t need to rebuild with every small change, or even refresh in the browser.

To create content, simply run

hugo new $postname.md

Then create and edit your content, QA with the local Hugo server, and build the site when you’re happy:

hugo -v

V for verbose, obvs.

You’ll need to install the AWS CLI, if you haven’t already.

brew install awscli

Check it worked:

aws --version

Then set it up with your AWS IAM credentials:

aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: <your region name>
Default output format [None]: ENTER

You don’t need to use R53 for DNS, but it doesn’t cost much and it will make your life a lot easier. Plus you can use funky features like routing policies and target health evaluation (though not when using Cloudfront distributions as a target).

Create your record set in R53. You’ll change the target to a Cloudfront distribution later on. Create the below json file with your config.

{
  "Comment": "CREATE/DELETE/UPSERT a record",
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "a.example.com",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{ "Value": "4.4.4.4" }]
    }
  }]
}
And run:
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXX --change-batch file://sample.json

Create a bucket. Your bucket name needs to match the hostname of your site, unless you want to get really hacky.

aws s3 mb s3://my.website.com --region eu-west-1

If you’re using Cloudfront, you’ll need to specify permissions to allow the Cloudfront service to pull from S3. Or, if you’re straight up hosting from S3, ensure you allow the correct permissions. There are many variations on how to do this – the AWS recommended way would be to set up an Origin Access Identity, but that won’t work if you’re using Hugo and need to use a custom origin for Cloudfront (see below). If you don’t particularly mind if visitors can access S3 assets if they try to, your S3 policy can be as below:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::example-bucket/*"]
  }]
}

Request your SSL certificate at this time too:

aws acm request-certificate --domain-name $YOUR_DOMAIN --subject-alternative-names "www.$YOUR_DOMAIN" 

ACM will automatically renew your cert for you when it expires, so you can sleep easy at night without worrying about SSL certs expiring. That stuff you did last summer at bandcamp will still keep you awake though.

Note: regarding custom SSL client support, make sure to select SNI only. Supporting old steam-driven browsers on WinXP requires a dedicated IP and will cost you $600 a month, and I don't think you want that.

The only way to use https with S3 is to stick a Cloudfront distribution in front of it, and by doing this you get the added bonus of a super fast CDN with over 150 edge locations worldwide.

Create your Cloudfront distribution with a json config file, or straight through the cli.

aws cloudfront create-distribution --distribution-config file://distconfig.json

Check out the AWS documentation for details on how to create your config file.

Apply your certificate to the CF distribution too, in order to serve traffic over https. You can choose to allow port 80 or redirect all requests to 443. Choose “custom” certificate to select your cert, otherwise Cloudfront will use the Amazon default one, and visitors will see a certificate mismatch when browsing to the site.
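For reference, the relevant fragment of a distconfig.json looks roughly like this (a sketch only: the certificate ARN, origin ID, and protocol version are placeholders, and a full DefaultCacheBehavior needs further required fields):

"ViewerCertificate": {
  "ACMCertificateArn": "arn:aws:acm:us-east-1:111122223333:certificate/your-cert-id",
  "SSLSupportMethod": "sni-only",
  "MinimumProtocolVersion": "TLSv1.2_2018"
},
"DefaultCacheBehavior": {
  "TargetOriginId": "hugo-s3-origin",
  "ViewerProtocolPolicy": "redirect-to-https"
}

Note that certificates used with Cloudfront must be requested in us-east-1.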

When configuring my Cloudfront distribution, I hit a few issues. First of all, it's not possible to use the standard AWS S3 origin. You must use a custom origin, specifying the S3 website endpoint (which includes the bucket's region) as below, in order for pretty URLs and CSS references in Hugo to work properly, i.e.

cv.tomgeraghty.co.uk.s3-website-eu-west-1.amazonaws.com 

instead of

cv.tomgeraghty.co.uk.s3.amazonaws.com

Also, make sure to specify the default root object in the CF distribution as index.html.
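Again as a sketch (the bucket name and origin ID are illustrative), the origin section of distconfig.json for this setup looks roughly like:

"Origins": {
  "Quantity": 1,
  "Items": [{
    "Id": "hugo-s3-origin",
    "DomainName": "cv.tomgeraghty.co.uk.s3-website-eu-west-1.amazonaws.com",
    "CustomOriginConfig": {
      "HTTPPort": 80,
      "HTTPSPort": 443,
      "OriginProtocolPolicy": "http-only"
    }
  }]
},
"DefaultRootObject": "index.html"

The http-only origin policy is what the note further down about port 80 refers to: S3 website endpoints only speak plain HTTP.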

Now that your CF distribution is ready, anything in your S3 bucket will be cached to the CF CDN. Once the status of your distribution is “deployed”, it’s ready to go. It might take a little while at first setup, but don’t worry. Go and make a cup of tea.

Now, point your R53 record at either your S3 bucket or your Cloudfront disti. You can do this via the cli, but doing it via the console means you can check to see if your target appears in the list of alias targets. Simply select “A – IPv4 address” as the target type, and choose your alias target (CF or S3) in the drop down menu.
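If you'd rather stay in the CLI, the change batch for an alias record looks roughly like this (a sketch: the record name and Cloudfront domain are placeholders, while Z2FDTNDATAQYW2 is the fixed hosted zone ID used for all Cloudfront alias targets):

{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "my.website.com",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z2FDTNDATAQYW2",
        "DNSName": "dXXXXXXXXXXXXX.cloudfront.net",
        "EvaluateTargetHealth": false
      }
    }
  }]
}

Apply it with the same change-resource-record-sets command used earlier.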

Stick an index.html file in the root of your bucket, and carry out an end-to-end test by browsing to your site.

Build – Test – Deploy

Now that you have a functioning Hugo site running locally, with S3, R53, TLS, and Cloudfront configured, you're ready to stick it all up on the internet.

Git push if you’re using Git, and deploy the public content via whichever method you choose. In my case, to the S3 bucket created earlier:

aws s3 cp public s3://$bucketname --recursive

The recursive switch ensures the subfolders and content will be copied too.

Crucially, because I’m hosting via Cloudfront, a new deploy means the old Cloudfront content will be out of date until it expires, so alongside every deploy, an invalidation is required to trigger a new fetch from the S3 origin:

aws cloudfront create-invalidation --distribution-id $cloudfrontID  --paths /\*

It’s not the cleanest way of doing it, but it’s surprisingly quick to refresh the CDN cache so it’s ok for now.

Time to choose a theme and modify the hugo config file. This is how you define how your Hugo site works.

I used the “Hermit” theme:

git clone https://github.com/Track3/hermit.git themes/hermit

But you could choose any theme you like from https://themes.gohugo.io/

Modify the important elements of the config.toml file:

baseURL = "https://$your-website-url"
languageCode = "en-us"
defaultContentLanguage = "en"
title = "$your-site-title"
theme = "$your-theme"
googleAnalytics = "$your-GA-UA-code"
disqusShortname = "$your-disqus-shortname"

Get used to running a deploy:

hugo -v
aws s3 cp public s3://your-site-name --recursive
aws cloudfront create-invalidation --distribution-id XXXXXXXXXX  --paths /\*

Or, to save time, set up npm to handle your build and deploy. Install Node and npm if you haven't already (I'm assuming you're going to use Homebrew again):

brew install node

Then check node and npm are installed by checking the version:

npm -v

and

node -v

All good? Carry on then:

npm init

Create some handy scripts:

{
  "name": "hugobuild",
  "config": {
    "LASTVERSION": "0.1"
  },
  "version": "1.0.0",
  "description": "hugo build and deploy",
  "dependencies": {
    "dotenv": "^6.2.0"
  },
  "devDependencies": {},
  "scripts": {
    "testvariable": "echo $npm_package_config_LASTVERSION",
    "test": "echo 'I like you Clarence. Always have. Always will.'",
    "server": "hugo server -w -D -v",
    "build": "hugo -v",
    "deploy": "aws s3 cp public s3:// --recursive && aws cloudfront create-invalidation --distribution-id  --paths '/*'"
  },
  "author": "Tom Geraghty",
  "license": "ISC"
}

Then, running:

npm run server

will launch a local server running at http://localhost:1313

Then:

npm run build

will build your site ready for deployment.

And:

npm run deploy

will upload content to S3 and tell Cloudfront to invalidate old content and fetch the new stuff.

Now you can start adding content, and making stuff. Or, if you’re like me and prefer to fiddle, you can begin to implement Circle CI and other tools.

Notes: some things you might not find in other Hugo documentation:

When configuring the SSL cert – just wait, be patient for it to load. Reload the page a few times even. This gets me every time. The AWS Certificate manager service can be very slow to update.

Take a look at custom behaviours in your CF distribution for error pages so they're cached for less time. You don't want 404s being displayed for content that's actually present.

Finally, some things I’m still working on:

Cloudfront fetches content from S3 over port 80, not 443, so this wouldn’t be suitable for secure applications because it’s not end-to-end encrypted. I’m trying to think of a way around this.

I’m implementing Circle CI, just for kicks really.
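A minimal CircleCI sketch of that pipeline could live in .circleci/config.yml (assumptions: an image that provides hugo and the AWS CLI, plus hypothetical BUCKET_NAME and CF_DIST_ID environment variables set in the CircleCI project):

version: 2.1
jobs:
  build-and-deploy:
    docker:
      # Any image with hugo and the AWS CLI will do; cibuilds/hugo is one option
      # (add the AWS CLI in a step if the image lacks it).
      - image: cibuilds/hugo:latest
    steps:
      - checkout
      - run: hugo -v
      - run: aws s3 cp public s3://$BUCKET_NAME --recursive
      - run: aws cloudfront create-invalidation --distribution-id $CF_DIST_ID --paths '/*'
workflows:
  build-and-deploy:
    jobs:
      - build-and-deploy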

Finally, invalidations. As above, if you don’t invalidate your CF disti after deployment, old content will be served until the cache expires. But invalidations are inefficient and ultimately cost (slightly) more. The solution is to implement versioned object names, though I’m yet to find a solution for this that doesn’t destroy other Hugo functionality. If you know of a clean way of doing it, please tell me 🙂

 

Re:Develop.io History of DevOps talk

I recently gave a talk at Re:Develop.io about the history of DevOps, and this page is where you can find the slide deck and other relevant resources.

Video of the talk

Re:develop.io DevOps Evolution Slide deck

The Theory of constraints

Agile 2008 presentation by Patrick Debois

10+ Deploys per day, Dev and Ops at Flickr – YouTube

10+deploys a day at Flickr – slide deck

DORA 

Nicole Forsgren research paper

2018 State of DevOps Report

My “Three Ways” slide deck

Safety culture

Amazon DevOps Book list

Deming’s 14 points of management

Lean management

The Toyota Way

The Agile Manifesto

Beyond the phoenix project (audio)

Netflix – simian army

My Continuous Lifecycle London talk about compliance in DevOps and the cloud