Resilience Engineering, DevOps, and Psychological Safety – resources

With thanks to Liam Gulliver and the folks at DevOps Notts, I gave a talk recently on Resilience Engineering, DevOps, and Psychological Safety.

It’s pretty content-rich, so here are all the resources I referenced in the talk, along with the talk itself and the slide deck. Please get in touch if you would like to discuss anything mentioned, or if you have a meetup or conference that you’d like me to contribute to!

Here’s a psychological safety practice playbook for teams and people.

Open Practice Library

https://openpracticelibrary.com/

Resilience Engineering and DevOps slide deck  

https://docs.google.com/presentation/d/1VrGl8WkmLn_gZzHGKowQRonT_V2nqTsAZbVbBP_5bmU/edit?usp=sharing

Resilience engineering: Where do I start?

Turn the Ship Around! by David Marquet

Lorin Hochstein and Resilience Engineering fundamentals 

https://github.com/lorin/resilience-engineering/blob/master/intro.md

 

Scott Sagan, The Limits of Safety:
“The Limits of Safety: Organizations, Accidents, and Nuclear Weapons”, Scott D. Sagan, Princeton University Press, 1993.

 

Sidney Dekker: “The Field Guide to Understanding Human Error”, 2014

 

John Allspaw: “Resilience Engineering: The What and How”, DevOpsDays 2019.

https://devopsdays.org/events/2019-washington-dc/program/john-allspaw/

 

Erik Hollnagel: Resilience Engineering 

https://erikhollnagel.com/ideas/resilience-engineering.html

 

Cynefin


 

Jabe Bloom, The Three Economies

The Three Economies an Introduction

 

Resilience vs Efficiency

Efficiency vs. Resiliency: Who Won The Bout?

 

Tarcisio Abreu Saurin – Resilience requires Slack

Slack: a key enabler of resilient performance

 

Resilience Engineering and DevOps – A Deeper Dive

 

Symposium with John Willis, Gene Kim, Dr Sidney Dekker, Dr Steven Spear, and Dr Richard Cook: Safety Culture, Lean, and DevOps

 

Approaches for resilience and antifragility in collaborative business ecosystems: Javaneh Ramezani and Luis M. Camarinha-Matos:

https://www.sciencedirect.com/science/article/pii/S0040162519304494

 

Learning organisations:
Garvin, D.A., Edmondson, A.C. and Gino, F., 2008. Is yours a learning organization? Harvard Business Review, 86(3), p.109.
https://teamtopologies.com/book
https://www.psychsafety.co.uk/cognitive-load-and-psychological-safety/

 

Psychological safety: Edmondson, A., 1999. Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44(2), pp.350-383.

The Four Stages of Psychological Safety, Timothy R. Clark (2020)

Measuring psychological safety:

 

And of course, the YouTube video of the talk:

Please get in touch if you’d like to find out more.

Using Hugo and AWS to build a fast, static, easily managed and deployed website.

Most of my websites are built using WordPress on Linux in AWS, using EC2 for compute, S3 for storage and Aurora for the data layer. Take a look at sono.life as an example.

For this site, I wanted to build something that aligned with, and demonstrated, some of the key tenets of cloud technology: scalability, resilience, availability, and security. Something designed for the cloud, not simply in the cloud.

I chose technologies that were cloud native, as fast as possible, easily managed, version controlled, quickly deployed, and served over TLS. I opted for Hugo, a super-fast static site generator that is managed from the command line, and is used by organisations such as Let’s Encrypt to build fast, secure, reliable and scalable websites. The rest of my choices are listed below. Wherever possible, I’ve used the native AWS solution.

The whole site loads in less than half a second, and there are still improvements to be made. It may not be pretty, but it’s fast. Below is a walk-through, with notes, that should help you build your own Hugo site in AWS. The notes assume that you know your way around the command line, that you have an AWS account, and that you have a basic understanding of the services involved in the build. I think I’ve covered all the steps, but if you try to follow this and spot a missing step, let me know.

Notes on Build – Test – Deploy:

I installed Hugo via Homebrew to build the site. If you haven’t installed Homebrew yet, just do it. Fetch it by running:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Then install Hugo:
brew install hugo

One of the things I love about Hugo is the ability to make rapid, on-the-fly changes to the site and see the result instantly, running the Hugo server locally.

hugo server -w -D

The option -D includes drafts in the output, whilst -w watches the filesystem for changes, so you don’t need to rebuild with every small change, or even refresh in the browser.

To create content, simply run

hugo new $postname.md
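
A quick sketch of what that looks like in practice, assuming your posts live under content/posts/ and you’re using the default archetype (the post name here is just an example):

# create a new post under content/posts/
hugo new posts/my-first-post.md

# the default archetype generates front matter along these lines:
#   ---
#   title: "My First Post"
#   date: 2019-06-01T10:00:00+01:00
#   draft: true
#   ---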

Then create and edit your content, QA with the local Hugo server, and build the site when you’re happy:

hugo -v

V for verbose, obvs.

You’ll need to install the AWS CLI, if you haven’t already.

brew install awscli

Check it worked:

aws --version

Then set it up with your AWS IAM credentials:

aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: <your region name>
Default output format [None]: ENTER

You don’t need to use R53 for DNS, but it doesn’t cost much and it will make your life a lot easier. Plus you can use funky features like routing policies and target health evaluation (though not when using Cloudfront distributions as a target).

Create your record set in R53. You’ll change the target to a Cloudfront distribution later on. Create the below json file with your config.

{
  "Comment": "CREATE/DELETE/UPSERT a record",
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "a.example.com",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{ "Value": "4.4.4.4" }]
    }
  }]
}
And run:
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXX --change-batch file://sample.json

Create a bucket. Your bucket name needs to match the hostname of your site, unless you want to get really hacky.

aws s3 mb s3://my.website.com --region eu-west-1
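
Since we’ll be pointing Cloudfront at the S3 website endpoint later (see below), you’ll also want to enable static website hosting on the bucket. A minimal sketch, assuming index.html and 404.html as your index and error documents:

aws s3 website s3://my.website.com/ --index-document index.html --error-document 404.html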

If you’re using Cloudfront, you’ll need to specify permissions to allow the Cloudfront service to pull from S3. Or, if you’re hosting straight from S3, ensure you allow the correct permissions. There are many variations on how to do this – the AWS recommended way would be to set up an Origin Access Identity, but that won’t work if you’re using Hugo and need to use a custom origin for Cloudfront (see below). If you don’t particularly mind visitors being able to access S3 assets directly if they try to, your S3 policy can be as below:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::example-bucket/*"]
  }]
}
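
To apply that policy from the CLI rather than the console, something like this should do it (assuming you’ve saved the policy as policy.json):

aws s3api put-bucket-policy --bucket example-bucket --policy file://policy.json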

Request your SSL certificate at this time too:

aws acm request-certificate --domain-name $YOUR_DOMAIN --subject-alternative-names "www.$YOUR_DOMAIN" 
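
One gotcha: certificates used with Cloudfront must live in us-east-1 (N. Virginia), regardless of which region your bucket is in. And since you’re already using R53, DNS validation is less hassle than email validation – something like this should do it:

aws acm request-certificate --domain-name $YOUR_DOMAIN --subject-alternative-names "www.$YOUR_DOMAIN" --validation-method DNS --region us-east-1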

ACM will automatically renew your cert for you before it expires, so you can sleep easy at night without worrying about SSL certs expiring. That stuff you did last summer at band camp will still keep you awake though.

Note: regarding Custom SSL Client Support, make sure to select SNI only. Supporting old steam-driven browsers on WinXP means dedicated IPs, which will cost you $600 a month, and I don’t think you want that.

With a custom domain, the only way to use https with S3 is to stick a Cloudfront distribution in front of it, and by doing this you get the added bonus of a super fast CDN with over 150 edge locations worldwide.

Create your Cloudfront distribution with a json config file, or straight through the cli.

aws cloudfront create-distribution --distribution-config file://distconfig.json

Check out the AWS documentation for details on how to create your config file.
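
For reference, here’s a rough sketch of what distconfig.json might look like for this setup – the custom S3 website origin, the default root object, and the SNI-only ACM certificate mentioned elsewhere in this post. The CallerReference, alias, and certificate ARN are placeholders, and required fields vary between API versions, so treat this as a starting point and check it against the current documentation:

{
  "CallerReference": "hugo-site-build-1",
  "Aliases": { "Quantity": 1, "Items": ["cv.tomgeraghty.co.uk"] },
  "DefaultRootObject": "index.html",
  "Origins": {
    "Quantity": 1,
    "Items": [{
      "Id": "s3-website-origin",
      "DomainName": "cv.tomgeraghty.co.uk.s3-website-eu-west-1.amazonaws.com",
      "CustomOriginConfig": {
        "HTTPPort": 80,
        "HTTPSPort": 443,
        "OriginProtocolPolicy": "http-only"
      }
    }]
  },
  "DefaultCacheBehavior": {
    "TargetOriginId": "s3-website-origin",
    "ViewerProtocolPolicy": "redirect-to-https",
    "ForwardedValues": { "QueryString": false, "Cookies": { "Forward": "none" } },
    "TrustedSigners": { "Enabled": false, "Quantity": 0 },
    "MinTTL": 0
  },
  "ViewerCertificate": {
    "ACMCertificateArn": "arn:aws:acm:us-east-1:111111111111:certificate/your-cert-id",
    "SSLSupportMethod": "sni-only",
    "MinimumProtocolVersion": "TLSv1.1_2016"
  },
  "Comment": "Hugo static site",
  "Enabled": true
}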

Apply your certificate to the CF distribution too, in order to serve traffic over https. You can choose to allow port 80 or redirect all requests to 443. Choose “custom” certificate to select your cert, otherwise Cloudfront will use the Amazon default one, and visitors will see a certificate mismatch when browsing to the site.

When configuring my Cloudfront distribution, I hit a few issues. First of all, it’s not possible to use the standard AWS S3 origin. You must use a custom origin, specifying the S3 website endpoint (which includes the region of the bucket, as below) in order for pretty URLs and CSS references in Hugo to work properly. I.e.

cv.tomgeraghty.co.uk.s3-website-eu-west-1.amazonaws.com 

instead of

cv.tomgeraghty.co.uk.s3.amazonaws.com

Also, make sure to specify the default root object in the CF distribution as index.html.

Now that your CF distribution is ready, anything in your S3 bucket will be cached to the CF CDN. Once the status of your distribution is “deployed”, it’s ready to go. It might take a little while at first setup, but don’t worry. Go and make a cup of tea.

Now, point your R53 record at either your S3 bucket or your Cloudfront disti. You can do this via the cli, but doing it via the console means you can check to see if your target appears in the list of alias targets. Simply select “A – IPv4 address” as the target type, and choose your alias target (CF or S3) in the drop down menu.
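
If you do want to do it from the cli, the change-batch looks something like the sketch below. Z2FDTNDATAQYW2 is the fixed hosted zone ID that all Cloudfront alias records use, and dxxxxxxxxxxxxx.cloudfront.net is a placeholder for your distribution’s domain name:

{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cv.tomgeraghty.co.uk",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z2FDTNDATAQYW2",
        "DNSName": "dxxxxxxxxxxxxx.cloudfront.net",
        "EvaluateTargetHealth": false
      }
    }
  }]
}

aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXX --change-batch file://alias.json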

Stick an index.html file in the root of your bucket, and carry out an end-to-end test by browsing to your site.

Build – Test – Deploy

Now that you have a functioning Hugo site running locally, plus S3, R53, TLS, and Cloudfront all set up, you’re ready to stick it all up on the internet.

Git push if you’re using Git, and deploy the public content via whichever method you choose. In my case, to the S3 bucket created earlier:

aws s3 cp public s3://$bucketname --recursive

The recursive switch ensures the subfolders and content will be copied too.

Crucially, because I’m hosting via Cloudfront, a new deploy means the old Cloudfront content will be out of date until it expires, so alongside every deploy, an invalidation is required to trigger a new fetch from the S3 origin:

aws cloudfront create-invalidation --distribution-id $cloudfrontID  --paths /\*

It’s not the cleanest way of doing it, but it’s surprisingly quick to refresh the CDN cache so it’s ok for now.

Time to choose a theme and modify the Hugo config file. This is where you define how your Hugo site works.

I used the “Hermit” theme:

git clone https://github.com/Track3/hermit.git themes/hermit

But you could choose any theme you like from https://themes.gohugo.io/

Modify the important elements of the config.toml file:

baseURL = "https://$your-website-url"
languageCode = "en-us"
defaultContentLanguage = "en"
title = "$your-site-title"
theme = "$your-theme"
googleAnalytics = "$your-GA-UA-code"
disqusShortname = "$your-disqus-shortname"

Get used to running a deploy:

hugo -v
aws s3 cp public s3://your-site-name --recursive
aws cloudfront create-invalidation --distribution-id XXXXXXXXXX  --paths /\*

Or, to save time, set up npm to handle your build and deploy. Install Node and npm if you haven’t already (I’m assuming you’re going to use Homebrew again):

brew install node

Then check node and npm are installed by checking the version:

npm -v

and

node -v

All good? Carry on then:

npm init

Create some handy scripts:

{
    "name": "hugobuild",
    "config": {
        "LASTVERSION": "0.1"
    },
    "version": "1.0.0",
    "description": "hugo build and deploy",
    "dependencies": {
        "dotenv": "^6.2.0"
    },
    "devDependencies": {},
    "scripts": {
        "testvariable": "echo $npm_package_config_LASTVERSION",
        "test": "echo 'I like you Clarence. Always have. Always will.'",
        "server": "hugo server -w -D -v",
        "build": "hugo -v",
        "deploy": "aws s3 cp public s3://$your-site-name --recursive && aws cloudfront create-invalidation --distribution-id XXXXXXXXXX --paths '/*'"
    },
    "author": "Tom Geraghty",
    "license": "ISC"
}

Then, running:

npm run server

will launch a local server running at http://localhost:1313

Then:

npm run build

will build your site ready for deployment.

And:

npm run deploy

will upload content to S3 and tell Cloudfront to invalidate old content and fetch the new stuff.

Now you can start adding content, and making stuff. Or, if you’re like me and prefer to fiddle, you can begin to implement Circle CI and other tools.

Notes: some things you might not find in other Hugo documentation:

When configuring the SSL cert, just wait and be patient for it to load. Reload the page a few times, even. This gets me every time. The AWS Certificate Manager console can be very slow to update.

Take a look at custom error responses in your CF distribution so that error pages are cached for less time. You don’t want 404s being served for content that’s actually present.
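
As a sketch, a CustomErrorResponses block along these lines in your distribution config keeps 404s from being cached for the default five minutes (adjust the TTL and error page path to suit your theme):

"CustomErrorResponses": {
  "Quantity": 1,
  "Items": [{
    "ErrorCode": 404,
    "ErrorCachingMinTTL": 10,
    "ResponseCode": "404",
    "ResponsePagePath": "/404.html"
  }]
}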

Finally, some things I’m still working on:

Cloudfront fetches content from the S3 website endpoint over port 80, not 443 (the website endpoint only supports HTTP), so this wouldn’t be suitable for secure applications because it’s not end-to-end encrypted. I’m trying to think of a way around this.

I’m implementing Circle CI, just for kicks really.

Finally, invalidations. As above, if you don’t invalidate your CF disti after deployment, old content will be served until the cache expires. But invalidations are inefficient and ultimately cost (slightly) more. The solution is to implement versioned object names, though I’m yet to find a solution for this that doesn’t destroy other Hugo functionality. If you know of a clean way of doing it, please tell me 🙂

 

The OSI model for the cloud

[Note: this is now what I call “archival content”. It’s out of date (originally posted in 2017) and I wouldn’t necessarily agree with the content now; particularly in respect to the advent of containerisation. Hopefully it’s still useful though.]

While I was putting together a talk for an introduction to AWS, I was considering how to structure it and thought about the “layers” of cloud technology. I realised that the more time I spend talking about “cloud” technology and how to best exploit it, manage it, develop with it and build business operations using it, the more some of our traditional terminologies and models don’t apply in the same way.

Take the OSI model, for example, with its seven layers: physical, data link, network, transport, session, presentation, and application.

When we’re managing our own datacentres, servers, SANs, switches and firewalls, we need to understand this. We need to know what we’re doing at each layer, who’s responsible for physical connectivity, who manages layer 3 routing and control, and who has access to the upper layers. We use the term “layer 3” to describe IP-based routing, or “layer 7” to describe functions interacting at a software level, and crucially, we all know what each other means when we use these terms.

With virtualisation, we began to abstract layers 3 and above (layer 2? Maybe?) into software defined networks, but we were still in control of the underlying layers, just a little less “concerned” about them.

Now, with cloud tech such as AWS and Azure, even this doesn’t apply any longer. We have different layers of control and access, and it’s not always helpful to try to use the OSI model terms.

We pay AWS or Azure, or someone else, to manage the dull stuff – the cables, the internet connections, power, cooling, disks, redundancy, and physical security. Everything we see is abstract, virtual, and exists only as code. However, we still possess layers of control and management. We may create multiple AWS accounts to separate environments from each other, we’ll create different VPCs for different applications, multiple subnets for different functions, and instances, services, storage units and more. Then we might hand off access to these to developers and testers, to deploy and test applications.

The point is that it seems we don’t yet have a common language, similar to the OSI model, for cloud architecture. Below is a first stab at what this might be. It’s almost certainly wrong, and certainly can be improved.

Let’s start with layer 1 – the physical infrastructure. This is entirely in the hands of the cloud provider such as AWS. Much of the time, we don’t even know where this is, let alone have any visibility of what it looks like or how it works. This is analogous to layer 1 of the OSI model too, but more complex. It’s the physical machines, cabling, cooling, power and utilities present in the various datacentres used by the cloud providers.

Layer 2 is the hypervisor: the software that allows the underlying hardware to be utilised – the abstraction between the true hardware and the virtualised “hardware” that we see. AWS uses Xen, Azure uses a modified Hyper-V, and others use KVM. Again, we don’t have access to this layer, only a GUI or CLI layered on top of it. Those of us who started our IT careers managing physical machines and then adopted virtualisation will be familiar with how layer 2 allowed us to create and modify servers far more quickly and easily than ever before.

Layer 3 is where we get our hands dirty: the Software Defined Data Centre (SDDC). From here, we create our cloud accounts and start building stuff. This is accessed via a web GUI, command line tools, APIs or other platforms and integrations. This is essentially a management layer, not a workload layer, in that it allows us to govern our resources, control access, manage costs, security, scale, redundancy and availability. It is here that “infrastructure as code” becomes a reality.

Layer 4. The Native Service (such as S3, Lambda, or RDS) or machine instance (such as EC2) layer. This is where we create actual workloads, store data, process information and make things happen. At this level, we could create an instance inside a VPC, set up the security groups and NACLs, and provide access to a developer or administrator via RDP, SSH, or other protocol. At this layer, humans that require access don’t need Layer 3 (SDDC) access in order to do their job. In many ways, this is the actual IaaS (Infrastructure as a Service) layer.

Layer 5. I’m not convinced this is all that different to layer 4, but it’s useful to distinguish it for the purpose of defining *who* has access. This layer is analogous to layer 7 of the OSI model; that is, it’s end-user-facing, such as the front end of a web application, the interactions taking place on a mobile app, or the connectivity to IoT devices. Potentially, this is also analogous to SaaS (Software as a Service), if you consider it from the user’s perspective. Layer 5 applications exist as a function of the full stack underneath them – the physical resources in datacentres, the hypervisor, the management layer, virtual machines and services, and the code that runs on or interacts with those services.

Whether something like an OSI model for cloud becomes adopted or not, we’re beginning to transition into a new realm of terminology, and the old definitions no longer apply.

I hope you found this useful, and I’d love to hear your feedback and improvements on this model. Take a look at ISO/IEC 17788 if you’d like to read more about cloud computing terms and definitions.

Finally, if you’d like me to speak and present at your event or your business, or provide consultation and advice, please get in touch. 

Tom@tomgeraghty.co.uk

@tomgeraghty

https://www.linkedin.com/in/geraghtytom/

Fixing “the trust relationship between this workstation and the primary domain failed” without leaving the domain or restarting.

Sometimes you’ll find that for any one of a multitude of reasons, a workstation’s computer account becomes locked or somehow otherwise disconnected from the actual workstation (say, if a machine with the same name joins the network, or if it’s been offline for a very long time). When you try to log on to the domain you’ll get a message that states:

 

“the trust relationship between this workstation and the primary domain failed”

 

Now, what I would normally do in this situation is un-join and re-join the workstation to the domain, which works, but creates a new SID (Security Identifier) and can therefore break existing trusts in the domain with that machine, and of course it requires a reboot. So if you don’t want to reboot, and you don’t want to break existing trusts, do this:

 

Use netdom.exe in a command prompt to reset the password for the machine account, from the machine with the trust problem.

 

netdom.exe resetpwd /s:<server> /ud:<user> /pd:*

<server> = a domain controller in the joined domain

<user> = DOMAIN\User format with rights to change the computer password

 

* = prompts you to enter the password for that user account, rather than typing it on the command line

 

That should do it, in *most* cases.



Find mailboxes that are set to automatically forward email in Exchange 2010

Every time someone leaves your organisation, you’ll probably need to forward their mail to another mailbox, but over time this can get disorganised and messy. Use the below command to extract a .csv formatted table of mailboxes that have a forwarding address:

Get-Mailbox -resultsize 6000 | Where {$_.ForwardingAddress -ne $null} | Select Name, ForwardingAddress, organizationalunit, whencreated, whenchanged, DeliverToMailboxAndForward | export-csv E:\forwardedusers.csv

I set a ResultSize of 6000 because we have almost that many mailboxes; the limit applies to the number of mailboxes queried, rather than the number of actual results (you could also just use -ResultSize Unlimited). I’m sure there’s a more efficient way of running this query, but it’s not like you’re doing this every day, so it doesn’t really matter.

Once you’ve got this information, you might want to match this up with further details about the users that own these mailboxes. Use the Active Directory powershell tools with Server 2008 to extract this information.

Fire up a PowerShell session on a domain controller (or remotely), and run “Import-Module ActiveDirectory”.

Then execute:

Get-ADUser -SearchBase "DC=yourdomain,DC=local" -Properties SamAccountName,Description | Export-Csv c:\allusers.csv

Because -Filter is a mandatory parameter, you’ll be prompted for it. At the “Filter:” prompt, type:

name -like "*"

Then get this data into Excel in two different worksheets.

Use the VLOOKUP tool to compare the two worksheets (in a third one), and collate the fields for the user’s name, forwarding address, and description:

In your “working worksheet”, make the first column pull the display name from the mail worksheet, then name the second column “description” (this is what I’m looking for, anyway); the third column onwards can hold any other data you’d like to show, such as OU, modified dates, or suchlike.

In the description column, enter:

=VLOOKUP(mail!A2,allusers!$D:$E,2,FALSE)

“mail” refers to the worksheet containing data extracted from Exchange, and A2 should be the first user’s Name field (copy this downwards so that you’re looking up A3, A4, A5, etc.).

“allusers” refers to the Active Directory information worksheet – so in this case it will attempt to match the mail A2 field with anything in the D column of allusers (that being the first column in the $D:$E array), and will then return the corresponding value from the E column (because I’ve specified “2”, which in my case is the description field). The FALSE at the end ensures that you’re searching for an exact match.

Copy this formula down along with the list of users that have email forwarding enabled, and you’ll have a list of forwarded users along with their names, descriptions, modified dates, OUs, and any other data you like.