The History of DevOps


[Updated June 2023]

DevOps may be one of the most hyped concepts in the tech industry in recent times. Yet what it actually consists of is the subject of much debate: some describe DevOps as a culture of process improvement, whilst others describe it in purely technological terms of automation and cloud technologies.

The Origins of DevOps

What few disagree on, though, is its origins. In the tech industry, it has long been accepted that technologists are either “devs”, those who create, or “ops”, those who build and maintain. Developers write code, while operations engineers build the systems and keep them running. Conflict frequently emerges between these two camps and their seemingly incongruent goals: development teams are motivated and measured by the frequency and scale of change (deploying features, fixes, and improvements), whereas operations teams are judged on reliability and consistency, qualities often seen as an outcome of low change frequency and scale (though we shall see later that this isn’t necessarily true). The result is often an antagonistic relationship between the two teams.

DevOps is, or at least originated as, the effort to reconcile this fracture and improve business performance.

“…all ideas are second-hand, consciously and unconsciously drawn from a million outside sources.” Mark Twain

At a high level, the practice of DevOps focuses on culture, process, velocity, feedback loops, repeatability via automation, responsiveness to change, and continuous improvement (often condensed to CALMS: Culture, Automation, Lean, Measurement, and Sharing). These practices have accelerated the web-scale revolution behind high-performance tech giants such as Google, Netflix, Amazon, and Facebook.

However, these concepts are not new. They have been used by industrialists, researchers, and technologists to improve the quality and efficiency of production since the dawn of the industrial revolution.

Industry and Scientific Management

In 1620 Francis Bacon codified what was to become the fundamental basis for empirical knowledge: the origin of the scientific method. Bacon’s method described the conception of a theory based upon observation, and the use of experiments to test the theory. 400 years on, we still use Bacon’s approach to create and test theories, monitor systems and check technological functionality.

“In the past, the man has been first; in the future, the system must be first.” Frederick Taylor

In the 1880s, Frederick Taylor applied the scientific method to management and workflows to improve labour productivity. He was one of the first people to deem work itself worthy of systematic study, using the principles that Bacon had set out more than 250 years before. Whilst Taylor’s views on what makes a “good” worker were somewhat disturbing – he defined the “best” worker as “so stupid and so phlegmatic that he more nearly resembles in his mental make-up the ox than any other type.” – Taylorism had a huge impact on productivity across the industrialised world.

Taylor summed up his efficiency strategies in his 1911 book “The Principles of Scientific Management”, voted the most influential management book of the twentieth century by Fellows of the Academy of Management in 2001. Without Taylor, it’s unlikely that Apple or Google would exist as they do now.

20th Century Production

At the beginning of the 20th century, most manufacturing used inefficient techniques – cars, for instance, were built the way you or I would go about the task, by assembling all the parts in one place: craft production. However, when demand for cars increased, it became clear that a form of linear, or mass, production was needed. One of the best-known examples of the production line is the one adopted by Henry Ford in 1913 for the Ford Model T, which was based on Taylor’s principles. Through the use of time and motion studies, Ford refined his production line until he had reduced the production time for a car from over twelve hours to just 93 minutes. He also introduced the concepts of repeatability and standardisation to mainstream manufacturing. In contrast to Taylor, however, Ford always maintained his belief in the importance of the skill and craftsmanship of the worker.

“Without data, you’re just another person with an opinion.” William Edwards Deming

In the 1950s, William Edwards Deming, a statistician, physicist, and management consultant, began to apply statistical analysis to manufacturing. Deming found that prioritising quality over throughput would actually decrease costs and improve productivity. Whilst Taylorism and scientific management had boosted productivity, quality had suffered. Defects were sent down the line and built into finished products because workers were incentivised to ignore flaws in order to meet quotas.

He defined what is now known as the Deming Cycle: Plan – Do – Check – Act. This is similar to the software development lifecycle most of the technology industry uses today. Deming championed continual analysis and improvement of processes – one of the key tenets of DevOps.

He saw effective quality assurance as an essential function of high-performing organisations – the key message of the third of his “Fourteen Points”, his key principles of management for transforming business effectiveness:

  1. Constancy of purpose, with the aim to become competitive and stay in business, and to provide jobs.
  2. Adopt the new philosophy. Embrace change.
  3. Cease dependence on inspection to achieve quality. Build quality checks and feedback loops into the process.
  4. End the practice of awarding business on the basis of lowest bid. Build long term relationships with suppliers, and value loyalty and trust.
  5. Continuously improve processes, aim to improve quality and productivity, which in turn leads to cost reductions through less wastage and higher efficiencies.
  6. Institute training on the job and integrate development into employees’ roles.
  7. Institute leadership. Leadership should help people and machines do a better job, remove barriers to working effectively, identify improvements, and develop teams.
  8. Drive out fear. Fear paralyses people and teams. Transparent communication, motivation, respect and care for each other and each other’s work will contribute to this aim.
  9. Break down barriers between departments. Cross-functional teams can solve problems more easily and effectively than single-function teams or siloes.
  10. Eliminate slogans and exhortations asking the workforce for zero defects. Defects (and quality) are a result of the system, not the individual.
  11. Eliminate targets and quotas. Focus on quality rather than quantity, and quantity will follow.
  12. Permit pride of workmanship. Eliminate management by objective or by numbers. Employees feel more satisfaction when they get a chance to execute their work well and professionally, rather than trying to meet a quota.
  13. Institute training and self-improvement. Encourage employees to study for themselves and to see their studies and training as a self-evident part of their jobs.
  14. The transformation is everyone’s job. Transformation happens only when everyone in the organisation works to accomplish it.

Deming’s System of Profound Knowledge is the culmination of his work and ties together his seminal theories on quality, management and leadership into four interrelated areas:

  1. Appreciation for a system
  2. Knowledge of variation
  3. Theory of knowledge
  4. Psychology

Each area corresponds to one or more of his fourteen points, and we can reflect on how these four areas correspond to fundamental DevOps tenets too.

Appreciation for a system means that as a leader, engineer, developer or tester, you ought to understand the system that you are looking to work within – and that thoroughly understanding that system endows you with far greater capacity to improve it. This is systems thinking, a concept which will be revisited throughout this book.

Knowledge of variation refers to two types of “cause” defined by Deming: “common” and “special”. Common causes are those anticipated by, or inherent to, a system. An example of this would be scaling: you might know that a particular system generates logs at a rate of 500GB per day, and as a result you build functions into your system to deal with this growing demand for storage. This growth (the “cause”) is understandable and predictable, and thus you are able to implement measures to manage the variation. Deming’s second type of cause is “special”, and refers to those aspects that are unknown or unpredictable, such as a change with unintended consequences, a datacentre outage, or action by a malicious third party. Deming estimated that over 94% of quality issues (in his case in manufacturing, but the same principle applies to modern software delivery) are catalysed by “common” causes, yet human nature looks for the “special” cause: the one-off event, the human error, or the bad actor at play. If someone accidentally shuts down a production server, Deming’s solution is not to fire the human (thereby removing the unpredictable, unknown element), but to build improvements into the system that prevent a human from making that mistake again, or that stop the mistake from affecting the system.

Deming’s theory of knowledge concentrates on the importance of understanding our own knowledge. How do we discern what is true from what is false? How do we identify our own innate biases, and how can we make ourselves less susceptible to confirmation bias? Deming goes back to Bacon’s scientific method with the Plan-Do-Check-Act cycle, reflecting the concept of creating a hypothesis and then testing those assumptions. People appear to learn more effectively when they make predictions. Making a prediction forces us to think ahead about the potential outcomes and also causes us to examine more deeply the system that we’re working in or on.

At around the same time, after studying consumer behaviour in supermarkets, the Toyota Motor Corporation began using Kanban (which means “signboard” in Japanese) to control and record work. Kanban boards have vertical columns representing the stages in a process, with work packages in the form of cards. Each process is a “customer” of the preceding process to its left – that is, work is “pulled” from left to right rather than “pushed”. This reduces inventory pile-up, enabling a delivery system called “just in time” and minimising waste. It also aids the identification of bottlenecks in the process by highlighting Work In Progress (WIP). Kanban makes work visible, and making work visible is crucial to further improvement, because “you can’t manage what you don’t measure”.

“Any improvements made anywhere besides the bottleneck are an illusion.” Eliyahu M. Goldratt

The above constitutes Goldratt’s Theory of Constraints. In his 1984 management novel “The Goal”, Eli Goldratt built on Deming’s ideas and codified Lean Production, a precursor of DevOps methodology. He described a failing manufacturing plant where Alex, the main character, is brought in to turn things around within three months. Through a series of telephone calls and meetings with an acquaintance called Jonah (another physicist, like Deming), Alex solves the organisation’s problems by utilising pull rather than push processes, reducing WIP, and employing the Theory of Constraints. “The Goal” itself, Goldratt demonstrates, is simply to make money for the business. Anything else, if it cannot be demonstrated to help make money, is likely to be vanity.


People and Process

By the 1980s, the modern manufacturing revolution was in full swing; however, its often reductionist approach to workers wasn’t helpful, and staff turnover was high. Among those to recognise this was Burrhus Frederic Skinner, a psychologist, author, inventor and the Edgar Pierce Professor of Psychology at Harvard University. Describing the nature of quality work and happiness, he said:

It’s the difference between a craftsman who makes a complete chair and a person on an assembly line who makes only the legs. The craftsman’s work is constantly reinforced by the process of seeing the chair take form, and finally of producing the finished chair. But the assembly-line worker sees only chair leg after chair leg — never the completed product.

This is a near-definitive example of “systems thinking” – another key tenet of DevOps.

Being able to see the end result of the process is key to improving quality in the individual stages – how can someone build the perfect component if they don’t understand its place in the final product? Systems thinking is a cultural practice rather than a process or tool: it relies on trusting team members to make small but important decisions about their part in the process, which in turn makes them more invested in the outcome.


Further advances in understanding how to build an aspirational working culture came once again from Toyota when, in 2001, it defined its philosophy, values and manufacturing ideals under four key headings, “The Toyota Way”. These were:

  1. Long-Term Philosophy – Base your management decisions on a long-term philosophy, even at the expense of short-term goals.
  2. The Right Process Will Produce the Right Results – Focus on pull processes, managing WIP, and making work visible.
  3. Add Value to the Organization by Developing Your People – Provide effective training, highlight team success over individual success, and challenge your partners and suppliers.
  4. Continuously Solving Root Problems Drives Organizational Learning – Continuously improve (in Japanese, kaizen), use the “5 whys” to get to the root cause of problems, standardise, decide slowly and act quickly, and encourage a knowledge sharing culture.

Everything in The Toyota Way and Lean production aligns with, and indeed forms part of, the principles of DevOps.

The Agile Manifesto

Also in 2001, at the Snowbird resort in Utah, seventeen developers, frustrated with traditional heavyweight project management methodologies, came up with the Agile Manifesto. At the time, industry experts estimated that the time between a validated business need and an actual application in production was around three years, and there was a real desire to find more lightweight ways to deliver value from technology, faster. The Agile Manifesto values are as follows:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

The Agile Manifesto gives a clear guide to what to prioritise. For example, whilst documentation is valuable, it is more important to the business that the software works. The most well-known element of Agile is possibly the fourth line: responding to change over following a plan. Given how quickly customer requirements, finances, and technology can change, it is often unrealistic to believe that specifications created at the start of a project will remain 100% accurate and true throughout its lifetime. Thus, responding to change is one of the ways that software teams can gain a competitive edge over teams that do not.

Whilst Agile methodology is not fundamentally part of DevOps, the two usually go hand-in-hand. In technology teams, one is certainly easier to achieve in the presence of the other.

The First DevOps “Role”

Shortly after the Agile Manifesto was written, Google was undergoing rapid expansion. As one of the few web-scale tech businesses at the time, it faced the unprecedented challenge of rapidly introducing new features whilst maintaining a highly complex, always-on, massive-scale platform. The Site Reliability Engineering (SRE) team, led by Ben Treynor, was its solution.

A Site Reliability Engineer (SRE) would typically spend up to half their time performing operations-related work such as troubleshooting system issues and performing maintenance. They would spend the other half of their time on development tasks such as new features, scaling challenges, or automation. An SRE is an example of one of the first true DevOps roles in technology.

DevOps Detractors

“…the opportunities for gaining IT-based advantages are already dwindling… And as for IT-spurred industry transformations, most of the ones that are going to happen have likely already happened or are in the process of happening.” Nicholas Carr

It’s worth noting that not everyone in business recognised the potential of IT as a competitive advantage. In May 2003, Nicholas Carr published an article in the Harvard Business Review titled “IT Doesn’t Matter”. In this now infamous piece, Carr describes IT as a commodity, in the same category as electricity or water. He suggests that being the first to utilise a particular technology provides only a small competitive advantage, since your competitors can purchase the same system or replicate the same technology, while you incur the lion’s share of the cost by doing it first. He stated:

The key to success, for the vast majority of companies, is no longer to seek advantage aggressively but to manage costs and risks meticulously. If, like many executives, you’ve begun to take a more defensive posture toward IT in the last two years, spending more frugally and thinking more pragmatically, you’re already on the right course. The challenge will be to maintain that discipline when the business cycle strengthens and the chorus of hype about IT’s strategic value rises anew.

Carr’s piece was taken very seriously at the time, and still is by many business leaders. Perhaps it is fortunate for organisations such as Salesforce and Google that they pursued technology as a competitive advantage, and disregarded Carr’s advice.

Improving IT

It is not surprising, however, that technology had such a poor reputation at the time: research suggests that at least 80% of outages were (and potentially still are) self-inflicted. In The Visible Ops Handbook (2004), Kevin Behr, Gene Kim and George Spafford described a methodology for improving operational IT. “Visible Ops” consists of four stages:

  1. Stabilize Patient, Modify First Response – This first step controls risky changes and reduces MTTR (Mean Time To Resolution).
  2. Catch and Release, Find Fragile Artifacts – Here assets, configurations and services are inventoried in order to identify those with the lowest change success rates, highest MTTR and highest downtime costs.
  3. Establish Repeatable Build Library – This creates repeatable builds for critical services, making it “cheaper to rebuild than to repair”.
  4. Enable Continuous Improvement – This implements metrics to enable continuous improvement of processes.

To some degree, these four stages are evolutions of elements of The Toyota Way. They formed an embryonic codification of what was to become the principles of DevOps.

Over the next few years, the technology industry underwent a paradigm shift, where methods of working were analysed, and technology became far more fundamental to the success of organisations (possibly to the chagrin of Nicholas Carr).

#DevOps

In 2008, the term DevOps was used in the industry for the first time. There’s some confusion and misinformation regarding how this came about, but I spoke to Andrew Clay Shafer and Patrick Debois, both widely credited with creating the term “DevOps”, to get the full story…


In August 2008, at the Agile Conference in Toronto, software developer Andrew Clay Shafer posted notice of a discussion session entitled “Agile Infrastructure”. Just one person, system administrator Patrick Debois, attended. Debois had become frustrated by the now ubiquitous conflict between developers and operations while working on a data centre migration for the Belgian government, and was looking for solutions. Shafer actually skipped his own session because he didn’t think anyone was interested, but Debois later tracked him down for a chat in the hallway. Inspired by that hallway discussion, they formed an “Agile Systems Administration” Google Group.


The following year, Patrick organised the first DevOpsDays conference, held in Ghent, Belgium, in October 2009 – though it was Shafer who (it’s believed) coined the term DevOps, tweeting with the #DevOps hashtag at the Velocity conference in June 2009 whilst watching the now famous “10 deploys a day” talk by John Allspaw and Paul Hammond of Flickr.

The Role of Cloud Technology


It wasn’t long after the #DevOps hashtag was first used that adoption of cloud technology accelerated rapidly. The AWS EC2 service (virtual servers on demand) had only come out of beta in late 2008, and it was (and still is) a fast-evolving technology. Cloud technology tends to align well with DevOps practices because its features lend themselves to elasticity and scaling, automation, measurement and repeatability – key fundamentals of DevOps.

The tide had turned. Increasingly organisations began looking at ways of improving software deployments, moving away from large, disruptive (and frankly, stressful) deployments, towards a model of more frequent, smaller, low-risk deployments.

In 2010, Jez Humble and Dave Farley wrote what is still one of the definitive texts on this approach: “Continuous Delivery”. It describes in detail how to automate your build, deployment, and testing pipeline so that you can release changes in hours or even minutes. That might not seem impressive today, but at the time a release cycle of months or even years was very common.

Continuous delivery, according to Farley and Humble, requires:

  • Comprehensive configuration management
  • Continuous integration and short-lived branches (in reference to Trunk-Based Development)
  • Continuous testing

The automation of the build, deployment, and testing process, coupled with better collaboration between development, test, and ops teams, means that changes can be released rapidly. These smaller, low risk changes are more easily rolled back should something go wrong. “Continuous Delivery” showed how to increase velocity of change, whilst reducing risk and improving quality.

With cloud technology becoming mainstream and a growing desire to release software more rapidly, automation technology and tools took off. Software firms such as Puppet and Chef grew fast as developers and engineers strove to streamline their build processes and manage ever-increasing scales of infrastructure in the cloud. These tools also provided a new ability to spin up duplicate environments, such as staging, QA, test and validation, within minutes rather than weeks or months. Organisations exploiting these automation tools and native cloud technologies felt they were gaining a significant competitive advantage by doing so, and what evidence there was supported them. Even Gartner, in a 2011 report, stated that:

“By 2015, DevOps will evolve from a niche strategy employed by large cloud providers into a mainstream strategy employed by 20% of Global 2000 organisations.” Gartner, March 18, 2011.

In the same report, Gartner recognised that ITIL and other “top-down” best-practice frameworks had not delivered on their goals, and that IT organisations were looking for something new. Because DevOps was primarily a cultural shift, driven from the ground up, Gartner understood that it could prove far easier for technology departments to adopt than ITIL or similar frameworks.

The Codification of DevOps

Two years later, Gene Kim, Kevin Behr and George Spafford wrote The Phoenix Project, a novel about a failing organisation struggling to meet the demands of modern technological complexity and competition. This novel inspired technology leaders and engineers alike, because it described with eerie familiarity what it was like to work in a technology organisation with poor change control, problematic “Ops vs Devs” cultures and inadequate visibility and monitoring of work or performance.

The Phoenix Project was inspired by The Goal by Eli Goldratt. It demonstrates a number of actionable ways to improve the performance of an IT organisation, such as effective (but lean) change control, effective (and again, lean) testing, reducing WIP and unplanned work, and not letting any one person become the bottleneck for processes. The “bottleneck person” in the book is Brent, a character who knows everything but hasn’t documented anything. A key message of the book? Don’t be Brent.

In The Phoenix Project, Gene also introduces one of the first efforts to codify DevOps, “The Three Ways”:

  • Flow (or Systems Thinking)
  • Feedback Loops
  • Continuous Improvement

These “Three Ways” are concepts that echo the Toyota Way, Deming’s “Plan-Do-Check-Act” cycle, and other best practices, made specific to the DevOps context.  Gene’s subsequent book, written with Jez Humble (of “Continuous Delivery”), Patrick Debois and John Willis in 2016, “The DevOps Handbook”, goes deeper into the technical application of The Three Ways. It explores how to measure what matters to the business, and how to implement technical processes such as Continuous Integration and Continuous Delivery.


It didn’t take long to realise that there was another functional silo with a somewhat different set of interests from Dev or Ops: Security. Security, the IT profession realised, should be built into code as it is developed, rather than added later on by a different team. Predictably, the idea became known as DevSecOps.

The DevOps Handbook covers this too, introducing “DevSecOps” – the integration of information security into DevOps practices. If The Phoenix Project was the “why” of DevOps, The DevOps Handbook provides the “how”.

Measuring DevOps

Given that DevOps is at least partly about effective measurement and continuous improvement, it’s self-evident that we, as an industry, should measure the success of DevOps itself. In 2012, Puppet began surveying people working in technology to understand the adoption and development of DevOps practices. The resulting “State of DevOps” reports focussed on twenty key capabilities, which fall into familiar categories:

  • Technical (version control, test automation, deployment automation, trunk-based development)
  • Process (WIP limits, visual management, visualisation of the value stream)
  • Cultural (team culture, learning cultures, and job satisfaction).

Now run by DORA (DevOps Research and Assessment), an organisation created by Nicole Forsgren, Gene Kim, Jez Humble and Soo Choi, the State of DevOps report is improved every year. According to Alanna Brown at Puppet, they “have built the deepest and most widely referenced body of DevOps research available, drawing on the experience of more than 30,000 technical professionals around the world.” The data from these reports demonstrates that Carr’s view of IT as a cost centre was misguided. It is clear that IT is a powerful driver of value to an organisation, where velocity, security and stability are essential for success.

“…software delivery is an exercise in continuous improvement, and our research shows that year over year the best keep getting better, and those who fail to improve fall further and further behind.” Nicole Forsgren

On the back of the last four years of State of DevOps reports, Nicole Forsgren, together with Jez Humble and Gene Kim, wrote the illuminating book “Accelerate”. It explains which metrics correlate with organisational performance, and what you should measure in order to find out where and how to improve.


Forsgren states that the key metrics separating high performers from low performers in tech organisations are:

  • Deployment frequency (and pain!)
  • Lead time for change (from code commit to code deploy)
  • Mean Time To Restore (MTTR)
  • Change failure rate

Interestingly, the first two of these metrics are throughput (traditionally development-oriented) measures; the last two are stability (traditionally operations-oriented) measures.

The State of DevOps in 2019

As of the 2018 State of Devops report, the findings consistently show that:

  • Software delivery and availability unlock competitive advantages.
  • How you implement cloud infrastructure matters.
  • Use of open source software improves performance.
  • Outsourcing by function is rarely adopted by elite performers and hurts performance.
  • Key technical practices drive high performance (e.g. monitoring, automated testing, and security integration).
  • Industry doesn’t matter when it comes to achieving high performance for software delivery.

The statistics show that the high performers exhibit 46 times more frequent code deployments than low performers. They have a 7 times lower change failure rate, over 2,500 times faster lead time from code commit to deployment, and are over 2,600 times faster to recover from incidents.

When an organisation can deploy quickly, recover rapidly, and suffer few outages, it has the ability to reach the market before competitors and respond to customer demand quickly. It will also provide more stable and secure service. This results, ultimately, in Goldratt’s “Goal”, making more money for the business.

Such a state is not reached by simply automating, using cloud technology, or recruiting a “DevOps Engineer” – it is the culmination of great team culture, continuous improvement, feedback loops, systems thinking, and a rigorous approach to using the right technology. DevOps is not a framework (like ITIL), an industry standard, a suite of tools, or a job title.

DevOps encompasses the culture, technologies, tools, skills and processes that enable organisations to go from idea to production as rapidly as possible, incurring low risk and cost, and providing high security and reliability at scale.

The definition of DevOps itself is continually evolving and improving, and while I may offer a definition as above, it will be out of date within days of writing, because, like the technology and services we build, it is continuously in flux, and being improved by the same people practising it.

Where does DevOps go next? I believe that the scope of DevOps needs to widen. As mentioned above, a large reason why DevOps is so successful is that it’s a ground-up movement, created and progressed by the actual people doing it (unlike ITIL, for example). However, this has meant that DevOps, naturally, focusses tightly on the technological functions of an organisation.

The next phase of DevOps includes practices and approaches such as “Platform as a Product” and also broadens the scope of DevOps to the wider organisation, evolving into “digital transformation” using Andrew Clay Shafer’s 5 Elements, Jabe Bloom’s Three Economies, and the broader, cross-sectoral concepts of resilience engineering in sociotechnical systems.

2023 Update: Safety Cultures and Platform Engineering

Over the past two to three years, DevOps has seen further maturity and adaptation to new norms, driven by unprecedented global circumstances and evolving technological trends. A particularly noteworthy shift has been the focus on building robust ‘Safety Cultures’. This approach emphasizes the creation of an environment where experimentation is encouraged, failures are seen as opportunities for learning, and psychological safety is paramount. Teams are given the latitude to innovate while knowing that missteps are not only tolerated but expected as part of the process of continuous improvement. This aspect has greatly enhanced the resilience of DevOps, fostering a more transparent, responsive, and adaptive culture.

Platform Engineering has also been a rising trend, presenting a shift in how organizations perceive their development infrastructure. Rather than treating platforms as a collection of tools and services, they are viewed as integrated products that evolve with the needs of the end-users, who are the developers. This perspective empowers developers, reduces overhead, and ultimately accelerates the delivery of value to the business.

The COVID-19 pandemic brought its own set of challenges and lessons. The necessity of remote working underlined the importance of strong communication channels, reliable cloud-based tooling, and the autonomy of distributed teams. It revealed the strength of DevOps practices in enabling organizations to maintain their pace of innovation even in the face of major disruptions. Companies that had already embraced DevOps were better positioned to navigate the transition to a remote work environment, demonstrating the value of adaptability inherent in the DevOps philosophy.

As we look to the future, the trajectory of DevOps and related methodologies appears more integrated and comprehensive. The focus will likely continue to expand beyond the technological realm, permeating deeper into business strategies and driving broader digital transformation initiatives. The trends of Safety Cultures and Platform Engineering are expected to solidify, with even greater emphasis on psychological safety, learning from failures, and treating internal platforms as products.

Furthermore, the remote working lessons from the pandemic will likely catalyze a shift towards more distributed, asynchronous ways of working. We may see a rise in ‘RemoteOps’, an evolution of DevOps practices adapted for a world where remote and flexible work arrangements become the norm. In this era, principles of effective remote communication, time-zone friendly practices, and trust-based management will become critical. In essence, the future of DevOps is about expanding its boundaries, integrating more closely with business goals, and continually evolving to meet the demands of our ever-changing world.

Using Hugo and AWS to build a fast, static, easily managed and deployed website

Most of my websites are built using WordPress on Linux in AWS, using EC2 for compute, S3 for storage and Aurora for the data layer. Take a look at sono.life as an example.

For this site, I wanted to build something that aligned with, and demonstrated some of the key tenets of cloud technology – that is: scalability, resiliency, availability, and security, and was designed for the cloud, not simply in the cloud.

I chose technologies that were cloud native, were as fast as possible, easily managed, version controlled, quickly deployed, and presented over TLS. I opted for Hugo, a super-fast static website generator that is managed from the command line. It’s used by organisations such as Let’s Encrypt to build super fast, secure, reliable and scalable websites. The rest of my choices are listed below. Wherever possible, I’ve used the native AWS solution.

The whole site loads in less than half a second, and there are still improvements to be made. It may not be pretty, but it’s fast. Below is a walk-through with notes that should help you build your own Hugo site in AWS. The notes assume that you know your way around the command line, that you have an AWS account, and that you have a basic understanding of the services involved in the build. I think I’ve covered all the steps, but if you try to follow this and spot a missing one, let me know.

Notes on Build – Test – Deploy:

Hugo was installed via Homebrew to build the site. If you haven’t installed Homebrew yet, just do it. Fetch it by running:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Then install Hugo:
brew install hugo

One of the things I love about Hugo is the ability to make rapid, on-the-fly changes to the site and see the result instantly, running the Hugo server locally.

hugo server -w -D

The option -D includes drafts in the output, whilst -w watches the filesystem for changes, so you don’t need to rebuild with every small change, or even refresh in the browser.

To create content, simply run

hugo new $postname.md

Then create and edit your content, QA with the local Hugo server, and build the site when you’re happy:

hugo -v

-v for verbose, obvs.

You’ll need to install the AWS CLI, if you haven’t already.

brew install awscli

Check it worked:

aws --version

Then set it up with your AWS IAM credentials:

aws configure
AWS Access Key ID [None]: <your access key>
AWS Secret Access Key [None]: <your secret key>
Default region name [None]: <your region name>
Default output format [None]: ENTER

You don’t need to use R53 for DNS, but it doesn’t cost much and it will make your life a lot easier. Plus you can use funky features like routing policies and target health evaluation (though not when using Cloudfront distributions as a target).

Create your record set in R53. You’ll change the target to a Cloudfront distribution later on. Create the below json file with your config.

{
  "Comment": "CREATE/DELETE/UPSERT a record",
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "a.example.com",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{ "Value": "4.4.4.4" }]
    }
  }]
}
And run:
aws route53 change-resource-record-sets --hosted-zone-id ZXXXXXXXXXX --change-batch file://sample.json

Create a bucket. Your bucket name needs to match the hostname of your site, unless you want to get really hacky.

aws s3 mb s3://my.website.com --region eu-west-1
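
Since Cloudfront will later pull from the bucket’s S3 website endpoint rather than the standard S3 origin (see the custom origin note further down), you’ll also want static website hosting enabled on the bucket. A minimal sketch, assuming your index page is index.html and you add a 404.html error page:

aws s3 website s3://my.website.com/ --index-document index.html --error-document 404.html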

If you’re using Cloudfront, you’ll need to specify permissions to allow the Cloudfront service to pull from S3. Or, if you’re straight-up hosting from S3, ensure you allow the correct permissions there. There are many variations on how to do this – the AWS-recommended way would be to set up an Origin Access Identity, but that won’t work if you’re using Hugo and need to use a custom origin for Cloudfront (see below). If you don’t particularly mind visitors being able to access S3 assets directly, your S3 policy can be as below:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::example-bucket/*"]
  }]
}

Request your SSL certificate at this point too (note that a certificate used with Cloudfront must be requested in the us-east-1 region):

aws acm request-certificate --domain-name $YOUR_DOMAIN --subject-alternative-names "www.$YOUR_DOMAIN" 
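
If you requested the certificate with DNS validation (the --validation-method DNS flag), ACM won’t issue it until it can see a specific CNAME record in your zone. Here’s a sketch for pulling that record out, assuming $CERT_ARN holds the ARN returned by the request above and that the cert lives in us-east-1 for use with Cloudfront:

aws acm describe-certificate --region us-east-1 --certificate-arn $CERT_ARN \
  --query 'Certificate.DomainValidationOptions[].ResourceRecord'

Add the returned CNAME to your hosted zone in R53, and the certificate should move to “Issued” shortly afterwards.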

ACM will automatically renew your cert before it expires, so you can sleep easy at night without worrying about SSL certs expiring. That stuff you did last summer at band camp will still keep you awake though.

Note: regarding custom SSL client support, make sure to select SNI only. Supporting old steam-driven browsers on WinXP will cost you $600 a month, and I don’t think you want that.

The only way to use https with S3 is to stick a Cloudfront distribution in front of it, and by doing this you get the added bonus of a super fast CDN with over 150 edge locations worldwide.

Create your Cloudfront distribution with a json config file, or straight through the cli.

aws cloudfront create-distribution --distribution-config file://distconfig.json

Check out the AWS documentation for details on how to create your config file.

Apply your certificate to the CF distribution too, in order to serve traffic over https. You can choose to allow port 80 or redirect all requests to 443. Choose “custom” certificate to select your cert, otherwise Cloudfront will use the Amazon default one, and visitors will see a certificate mismatch when browsing to the site.

When configuring my Cloudfront distribution, I hit a few issues. First of all, it’s not possible to use the standard AWS S3 origin. You must use a custom origin (specifying the S3 website endpoint, which includes the bucket’s region, as below) in order for pretty URLs and CSS references in Hugo to work properly, i.e.

cv.tomgeraghty.co.uk.s3-website-eu-west-1.amazonaws.com 

instead of

cv.tomgeraghty.co.uk.s3.amazonaws.com

Also, make sure to specify the default root object in the CF distribution as index.html.

Now that your CF distribution is ready, anything in your S3 bucket will be cached to the CF CDN. Once the status of your distribution is “deployed”, it’s ready to go. It might take a little while at first setup, but don’t worry. Go and make a cup of tea.

Now, point your R53 record at either your S3 bucket or your Cloudfront disti. You can do this via the CLI, but doing it via the console means you can check that your target appears in the list of alias targets. Simply select “A – IPv4 address” as the record type, and choose your alias target (CF or S3) from the drop-down menu.
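
If you’d rather script this step, the change batch for an alias record looks something like the below – a sketch, where d1234abcdefgh.cloudfront.net stands in for your distribution’s domain name and Z2FDTNDATAQYW2 is the fixed hosted zone ID Route 53 uses for all Cloudfront alias targets (EvaluateTargetHealth has to be false for Cloudfront):

{
  "Comment": "Point the site at the Cloudfront distribution",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "my.website.com",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "Z2FDTNDATAQYW2",
        "DNSName": "d1234abcdefgh.cloudfront.net",
        "EvaluateTargetHealth": false
      }
    }
  }]
}

Apply it with the same route53 change-resource-record-sets command used earlier.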

Stick an index.html file in the root of your bucket, and carry out an end-to-end test by browsing to your site.

Build – Test – Deploy

Now that you have a functioning Hugo site running locally, plus S3, R53, TLS and Cloudfront configured, you’re ready to stick it all up on the internet.

Git push if you’re using Git, and deploy the public content via whichever method you choose. In my case, to the S3 bucket created earlier:

aws s3 cp public s3://$bucketname --recursive

The recursive switch ensures the subfolders and content will be copied too.

Crucially, because I’m hosting via Cloudfront, a new deploy means the old Cloudfront content will be out of date until it expires, so alongside every deploy, an invalidation is required to trigger a new fetch from the S3 origin:

aws cloudfront create-invalidation --distribution-id $cloudfrontID  --paths /\*

It’s not the cleanest way of doing it, but it’s surprisingly quick to refresh the CDN cache so it’s ok for now.

Time to choose a theme and modify the hugo config file. This is how you define how your Hugo site works.

I used the “Hermit” theme:

git clone https://github.com/Track3/hermit.git themes/hermit

But you could choose any theme you like from https://themes.gohugo.io/

Modify the important elements of the config.toml file:

baseURL = "https://$your-website-url"
languageCode = "en-us"
defaultContentLanguage = "en"
title = "$your-site-title"
theme = "$your-theme"
googleAnalytics = "$your-GA-UA-code"
disqusShortname = "$yourdiscussshortname"

Get used to running a deploy:

hugo -v
aws s3 cp public s3://your-site-name --recursive
aws cloudfront create-invalidation --distribution-id XXXXXXXXXX  --paths /\*

Or, to save time, set up npm to handle your build and deploy. Install Node and npm if you haven’t already (I’m assuming you’re going to use Homebrew again):

brew install node

Then confirm node and npm are installed by checking their versions:

npm -v

and

node -v

All good? Carry on then:

npm init

Create some handy scripts:

{
    "name": "hugobuild",
    "config": {
        "LASTVERSION": "0.1"
    },
    "version": "1.0.0",
    "description": "hugo build and deploy",
    "dependencies": {
        "dotenv": "^6.2.0"
    },
    "devDependencies": {},
    "scripts": {
        "testvariable": "echo $npm_package_config_LASTVERSION",
        "test": "echo 'I like you Clarence. Always have. Always will.'",
        "server": "hugo server -w -D -v",
        "build": "hugo -v",
        "deploy": "aws s3 cp public s3:// --recursive && aws cloudfront create-invalidation --distribution-id  --paths '/*'"
    },
    "author": "Tom Geraghty",
    "license": "ISC"
}

Then, running:

npm run server

will launch a local server running at http://localhost:1313

Then:

npm run build

will build your site ready for deployment.

And:

npm run deploy

will upload content to S3 and tell Cloudfront to invalidate old content and fetch the new stuff.

Now you can start adding content, and making stuff. Or, if you’re like me and prefer to fiddle, you can begin to implement Circle CI and other tools.

Notes: some things you might not find in other Hugo documentation:

When configuring the SSL cert – just wait, be patient for it to load. Reload the page a few times even. This gets me every time. The AWS Certificate manager service can be very slow to update.

Take a look at custom behaviours in your CF distribution for error pages, so they’re cached for less time. You don’t want 404s being displayed for content that’s actually present.

Finally, some things I’m still working on:

Cloudfront fetches content from S3 over port 80, not 443, so this wouldn’t be suitable for secure applications because it’s not end-to-end encrypted. I’m trying to think of a way around this.

I’m implementing Circle CI, just for kicks really.

Finally, invalidations. As above, if you don’t invalidate your CF disti after deployment, old content will be served until the cache expires. But invalidations are inefficient and ultimately cost (slightly) more. The solution is to implement versioned object names, though I’m yet to find a solution for this that doesn’t destroy other Hugo functionality. If you know of a clean way of doing it, please tell me 🙂

 

Compliance in DevOps: Using public cloud services and automated governance

As a DevOps engineer, you’ve achieved greatness. You’ve containerised everything, built your infrastructure and systems in the cloud and you’re deploying every day, with full test coverage and hardly any outages. You’re even starting to think you might really enjoy your job.

Then why are your compliance teams so upset?

Let’s take a step back. You know how to build secure applications, create back ups, control access to the data and document everything, and in general you’re pretty good at it. You’d do this stuff whether there were rules in place or not, right?

Not always. Back in the late ’90s, a bunch of guys in suits decided they could get rich by making up numbers in their accounts. Then in 2001 Enron filed for bankruptcy, and the suits went to jail for fraud. That resulted in the Sarbanes-Oxley Act, legislation which forces publicly listed firms in the US to implement controls to prevent fraud and enable effective audits.

Sarbanes-Oxley isn’t the only law that makes us techies do things certain ways though. Other compliance rules include HIPAA, ensuring that firms who handle clinical data do so properly; GDPR, which ensures adequate protection of EU citizens’ personal data; and PCI-DSS, which governs the use of payment card data in order to prevent fraud (and isn’t a law, but a common industry standard). Then there are countless other region and industry specific rules, regulations, accreditations and standards such as ISO 27001 and Cyber Essentials.

Aside from being good practice, the main reason you’d want to abide by these rules is to avoid losing your job and/or going to jail. It’s also worth recognising that demonstrating compliance can provide a competitive advantage over organisations that don’t comply, so it makes business sense too.

The trouble is, compliance is an old idea applied to new technology. HIPAA was enacted in 1996, Sarbanes-Oxley in 2002 and PCI DSS in 2004 (though it is frequently updated). In contrast, the AWS EC2 service only went out of beta in late 2008, and the cloud as we know it has been around for just a few years. Compliance rules are rarely written with cloud technology in mind, and compliance teams sometimes fail to keep up to date with these platforms or with modern DevOps-style practices. This can make complying with those rules tricky, if not downright impossible at times. How do you tell an auditor exactly where your data resides, if the only thing you know is that it’s in Availability Zone A in region EU-West-1? (And don’t even mention to them that one customer’s Zone A isn’t the same as another’s.)

As any tech in a regulated industry will appreciate, compliance with these rules is checked by regular, painful and disruptive audits. Invariably, audits result in compliance levels looking something like a sine wave:

This is because when an audit is announced, the pressure is suddenly on to patch the systems, resolve vulnerabilities, update documents and check procedures. Once the audit is passed, everyone relaxes a bit, patching lags behind again, documentation falls out of date and the compliance state drifts away from 100%. This begs the question, if we only become non-compliant between audits, is the answer to have really, really frequent audits?

In a sense, yes. However, we can no longer accept that tick-box spreadsheet audits and infosec sign-off at the deployment approval stage actually work. Traditional change management and compliance practices deliberately slow us down, with the intention of reducing the risk of mistakes.

This runs counter to modern DevOps approaches. We move fast, making rapid changes and allowing teams to be autonomous in their decision making. Cloud technology confuses matters even further. For example, how can you easily define how many servers you have and what state they’re in, if your autoscaling groups are constantly killing old ones and creating new ones?

From a traditional compliance perspective, this sounds like a recipe for disaster. But we know that making smaller, more frequent changes will result in lower overall risk than large, periodic changes. What’s more, we take humans out of the process wherever possible, implementing continuous integration and using automated tests to ensure quality standards are met.

From a DevOps perspective, let’s consider compliance as three pillars. The first pillar is achieving compliance. That’s the technical process of ensuring workloads and data are secure, and that everything is up to date, controlled and backed up. This bit’s relatively easy for competent techs like us.

The second pillar is about demonstrating that you’re compliant. How do you show someone else, without too much effort, that your data is secure and your backups actually work? This is a little more difficult, and far less fun.

The third pillar is maintaining compliance. This is a real challenge. How do you ensure that, with rapid change, new technology, and multiple teams involved, the system you built a year ago is still compliant now? This comes down to process and culture, and it’s the most difficult of the three pillars to achieve.

But it can be done. In DevOps and Agile culture, we shift left. We shorten feedback loops, decrease batch size, and improve quality through automated tests. This approach is now applied to security too, by embedding security tests into the development process and ensuring that it’s automated, codified, frictionless and fast. It’s not a great leap from there towards shifting compliance left too, codifying the compliance rules and embedding them within development and build cycles.

First we had Infrastructure as Code. Now we’re doing Compliance as Code. After all, what is a Standard Operating Procedure, if not a script for humans? If we can “code” humans to carry out a task in exactly the same way every time, we should be able to do it for machines.

Technologies such as AWS Config or InSpec allow us to constantly monitor our environment for divergence from a “compliant” state. For example, if a compliance rule deems that all data at rest must be encrypted, we can define that rule in the system and ensure we don’t diverge from it – if something, human or machine, creates some unencrypted storage, it will either be flagged for immediate attention or automatically encrypted.
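
As a concrete sketch of what that might look like with AWS Config (assuming a Config recorder is already running in the account), the AWS-managed ENCRYPTED_VOLUMES rule checks that attached EBS volumes are encrypted. Define the rule in a JSON file, say encrypted-volumes.json:

{
  "ConfigRuleName": "ebs-volumes-encrypted",
  "Description": "Flag any attached EBS volume that is not encrypted at rest",
  "Scope": {
    "ComplianceResourceTypes": ["AWS::EC2::Volume"]
  },
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "ENCRYPTED_VOLUMES"
  }
}

Then push it with:

aws configservice put-config-rule --config-rule file://encrypted-volumes.json

From then on, any non-compliant volume shows up against that rule, and the rule definition itself doubles as evidence of the control.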

One of the great benefits of this approach is that the “proof” of compliance is baked into your system itself. When asked by an auditor whether data is encrypted at rest, you can reassure them that it’s so by showing them your rule set. Since the rules are written as code, the documentation (the proof) is the control itself.

If you really want your compliance teams to love you (or at least quit hassling you), this automation approach can be extended to documentation. Since the entire environment can be described in code at any point in time, you can provide an auditor with point-in-time documentation of what exists in that environment now – or at any previous point, if it’s been recorded.
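
As an illustration (the resource ID below is just a placeholder), AWS Config can return the recorded configuration of a resource as it existed during any period you ask for – exactly the kind of point-in-time evidence an auditor wants:

aws configservice get-resource-config-history \
  --resource-type AWS::EC2::Instance \
  --resource-id i-0123456789abcdef0 \
  --earlier-time 2019-01-01T00:00:00Z \
  --later-time 2019-01-31T23:59:59Z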

By involving your compliance teams in developing and designing your compliance tools and systems, asking what’s important to them, and building in features that help them to do their jobs, you can enable your compliance teams to serve themselves. In a well designed system, they will be able to find the answers to their own questions, and have confidence that high-trust control mechanisms are in place.

Compliance as Code means that your environment, code, documentation and processes are continuously audited, and that third, difficult pillar of maintaining compliance becomes far easier:

This is what continuous compliance looks like. Achieve this, and you’ll see what a happy compliance team looks like too.