Fundamentals of DevOps and Software Delivery

Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing software in production, published by O’Reilly Media!

There are many guides out there on how to write software. This blog post series is a guide to software delivery: that is, all the processes, tools, and techniques that are required to run and maintain software in production on an ongoing basis. In particular, this series is a guide to DevOps, a methodology for making software delivery more efficient.

Whereas most books and talks on DevOps focus on culture, values, and organizational structure, I’ve found that when most people talk about DevOps, what they are really interested in is the mechanics of effective software delivery. That’s what this series is all about. It’s a hands-on guide that includes dozens of step-by-step examples of how to run production systems. You’ll start with the basics (an app running on a single server) and work your way up to microservices in a Kubernetes cluster with a service mesh, automated deployment pipeline, end-to-end encryption, and more. By the time you’re done with the blog post series, you will have had hands-on practice with all the core concepts and practices of modern DevOps and software delivery.

Before jumping into all this content, I want to take a moment to talk about why I felt the need to write this series—and why you may want to read it.

Why I Wrote This Series

Almost every piece of software depends on software delivery practices for deployment, maintenance, and security. And yet, I’m not aware of any hands-on guides that teach software delivery end-to-end. Just about everyone that learns software delivery today is learning it the hard way: that is, through trial and error. Unfortunately, errors in software delivery can be costly: they involve outages, data loss, and security breaches. The lack of a good way to learn software delivery is making the entire software industry slower, less effective, and less secure.

I experienced this first hand. Back in 2011, I was working at LinkedIn. From the outside, everything looked great: the company had just had its IPO, the share price was up by over 100%, revenue was growing by more than 100% year over year, and the website had over 100M members, with 2 new members joining every single second. But from the inside, the company was in turmoil. Why? Because our software delivery practices had gotten so bad that we could no longer deploy.

Back then, we would do deployments once every two weeks, and it was always a painful, tedious, slow, and error-prone affair. In 2011, we had a deployment that went so badly that we couldn’t complete it, no matter how hard we tried. We rolled out some new changes, which caused system instability; we pushed some fixes, but those caused new bugs; we pushed more fixes, but that only led to more issues. Teams worked through the night, into the next day, and we still couldn’t get things stable. In the end, after a several-day deployment nightmare, we had to roll everything back.

Here was a company worth nearly $10 billion, and we could not deploy code. To get out of this mess, we kicked off Project Inversion, which was a complete freeze on all new feature development for several months while the entire engineering, product, and design team reworked all the underlying infrastructure, tooling, and practices. The result was a huge success: months later, we were able to deploy dozens of times per day, with far fewer issues and outages, and that allowed the whole company to move faster.

Today, we might call this a "DevOps transformation" (though back then, the term "DevOps" had just appeared on the scene, so we didn’t call it that), and to get there, we had to go through a lot of pain and outages. The truth is that we didn’t know what we didn’t know. We had to go out and chat with companies across the industry, learning about trunk-based development from one company, canary deployments from another, feature toggles from another, and so on.

Sadly, even now, as I write this blog post series nearly 15 years later, relatively few developers know about these DevOps and software delivery practices. After leaving LinkedIn, I co-founded Gruntwork, where I had the opportunity to work with hundreds of companies on their DevOps and software delivery practices. What I saw was LinkedIn’s DevOps nightmare repeated over and over again at companies of all sizes. The techniques that a handful of the top tech companies had figured out were not filtering down to the rest of the industry. Most developers out there still don’t know what they don’t know.

So I decided to write a blog post series.

I hope that this series can be a small step in improving this situation. I hope that a comprehensive, hands-on overview of DevOps and software delivery will help the next generation of software companies get off on the right foot, and avoid some of the DevOps nightmares I’ve seen. I hope that instead of just hacking things together and learning things the hard way, this blog post series will allow you to learn from the experience of others, and perhaps the result will be a software industry that can build software faster, more reliably, and more securely.

If that happens, it could have a profound impact on the industry, as described next.

The Impact Of World-Class Software Delivery

The vast majority of developers have never had the opportunity to see what world-class software delivery looks like first hand. If you’re one of them, you’ll be astonished by the gap between companies with world-class software delivery processes and everyone else. It’s not a 1.1x or 1.5x improvement: it’s 10x, 100x, or more.

Table 1 shows the difference between elite performers and low performers at the four key DevOps Research and Assessment (DORA) metrics, which are a quick way to assess the performance of a software development team:

Table 1. DORA metrics performance from the 2024 State of DevOps Report
Metric	Description	Elite vs low performers
Deployment frequency	How often you deploy to production	182x more often
Lead time	How long it takes a change to go from committed to deployed	127x faster
Change failure rate	How often deployments cause failures that need immediate remediation	8x lower
Recovery time	How long it takes to recover from a failed deployment	2293x faster

These are staggering differences. To put them into perspective, we’re talking the difference between:

Deploying once per month versus many times per day.
Deployment processes that take 36 hours versus 5 minutes.
Two out of three deployments causing problems versus one out of twenty.
Outages that last 24 hours versus 2 minutes.
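
To make the four metrics in Table 1 a bit more concrete, here is a minimal Python sketch of how you might compute them from a log of your own deployments. The Deployment record, its field names, and the dora_metrics function are all hypothetical names made up for illustration (they do not come from the DORA report or from this series’s code samples), and measuring recovery time from the moment of deployment is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record format)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when the change reached production
    failed: bool = False                     # did it need immediate remediation?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period of period_days."""
    if not deployments:
        raise ValueError("need at least one deployment to compute metrics")
    failures = [d for d in deployments if d.failed]
    # Assumption: recovery time runs from the failed deployment to restoration.
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


# Made-up sample data: four deployments over four days, one of which failed.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(minutes=5)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, minutes=7)),
    Deployment(
        start + timedelta(days=2),
        start + timedelta(days=2, minutes=5),
        failed=True,
        recovered_at=start + timedelta(days=2, minutes=7),
    ),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, minutes=3)),
]

metrics = dora_metrics(sample, period_days=4)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Running something like this against your own deployment history is one rough way to answer the question of where your team sits between the two ends of the spectrum above. The medians are used here only to keep a single bad day from dominating the numbers; that is a design choice of this sketch, not a requirement.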

It’s almost a meme that developers who leave companies with world-class software delivery processes, such as Google, Facebook, Amazon, or LinkedIn (after Project Inversion), complain bitterly about how much they miss the infrastructure and tooling. That’s because they are used to a world where:

They can deploy any time they want, even thousands of times per day.
Deployments can happen in minutes, and they are 100% automated.
Problems can be detected in seconds, often before there is any user-visible impact.
Outages can be resolved in minutes, often automatically.

What do the equivalent numbers look like at your organization?

If you’re not even in the ballpark, don’t fret. The first thing you need to know is that it’s possible to achieve these results, even if you’re not a multi-billion dollar company. In fact, there are many ways to achieve these sorts of results, and each of those world-class companies does it a bit differently. That said, they also have a lot in common, and DevOps is an attempt to capture some of the common patterns.

Where DevOps Came From

DevOps emerged in the late 2000s as a response to inefficiencies in how companies delivered software. Historically, most companies had a Developer ("Dev") team responsible for writing the software, and an Operations ("Ops") team responsible for managing the hardware. In many companies, these teams worked in silos with conflicting objectives. The Ops team’s core objectives were usually security and reliability, so they were the only ones with access to production systems, and were on the hook for doing deployments and dealing with outages. The Dev team’s core objectives were usually to ship features, and to do so as quickly as possible. This arrangement often led to problems.

The Dev team, who was typically working under time pressure, with no access to production systems or on-call duties, would write the code, and with little testing, "toss it over the wall" to the Ops team. The Ops team would then deploy that code, typically via a manual process, and either from an error in this deployment process, or a bug in the code, there would often be failures and outages. The Ops team, tired of waking up at 3 a.m. to deal with outages, would end up reducing the release cadence. Unfortunately, this led to a vicious cycle: fewer releases meant each release was bigger and more complicated, which led to more outages, which led to an even slower release cadence, and so on, until the company would grind to a halt.

Although the problems with manual processes and siloing were known for many years, it wasn’t until around 2007-2008 that the software industry started to discuss these issues more openly in various conferences, meetups, and articles. These discussions coincided with, and were big drivers of, several major shifts that happened over the next decade:

The shift to the cloud: Instead of managing their own datacenters, many companies started moving to the cloud, renting servers and other infrastructure from third party providers (you’ll learn all about the cloud in Part 1). This represented a big shift for both the Ops team, who was suddenly spending most of their time dealing with software (from the cloud provider) instead of hardware (e.g., racking servers and plugging in network cables), and the Dev team, who could now access production systems through the same software tools that Ops was using.
The shift to automation: Instead of doing everything manually, many companies started to automate their processes using configuration management tools (Part 2), orchestration tools (Part 3), automated testing (Part 4), and continuous integration / continuous delivery (Part 5). These automations were another big shift for both Dev and Ops, as they defined a new set of interaction points between the two teams (a bit like an API): e.g., configuration management and orchestration tools defined how the Dev team’s code would run on the Ops team’s servers, and continuous integration / continuous delivery defined how the Dev team’s code would get integrated, tested, and deployed onto the Ops team’s servers.
The shift to cross-functional teams: Instead of working in silos, many companies started to have Dev and Ops work more closely together on cross-functional teams with shared objectives and shared tooling (e.g., the cloud and automation tooling).

All these trends together became known as the DevOps movement. DevOps isn’t the name of a team or a job title or a particular technology (though you’ll see it used for all three). Instead, it’s a set of processes, ideas, and techniques. Everyone has a slightly different definition of DevOps, but for this blog post series, I’m going to define DevOps as a methodology with the following goal:

The goal of DevOps is to make software delivery vastly more efficient.

You heard how LinkedIn’s DevOps transformation saved the company, but there are many other examples. Nordstrom found that after applying DevOps practices to its organization, it was able to increase the number of features it delivered per month by 100%, reduce defects by 50%, reduce lead times by 60%, and reduce the number of production incidents by 60% to 90%. After HP’s LaserJet Firmware division began using DevOps practices, the amount of time its developers spent on developing new features went from 5% to 40%, and overall development costs were reduced by 40%. Etsy used DevOps practices to go from stressful, infrequent deployments that caused numerous outages to deploying 25 to 50 times per day, with fewer outages.^[1]

All that said, a fair warning: while the results from adopting DevOps can be wonderful, the experience along the way can be anything but wonderful, as described next.

Watch Out for Snakes

I’m going to let you in on a little secret: we use a single word, "DevOps," to describe what’s actually dozens and dozens of largely unrelated concepts. What does the cryptography behind a TLS certificate have to do with defining a deployment pipeline in GitHub Actions YAML or backing up data from a PostgreSQL database? Not much. And yet, your typical SRE or DevOps Engineer has to deal with all of these, and countless other concepts, too.

What makes DevOps hard is not that any one of these concepts is incredibly complicated by itself, but that there are so many concepts to master, and you have to connect them all together just right. The TLS certificate must be configured just right, or your users will get scary errors that prevent them from accessing your website; your deployment pipeline must be configured just right, or your team won’t be able to deploy; your database backup must be set up just right, or you are at risk of data loss, and if you lose all your data, you may go out of business entirely. DevOps is a remarkable combination: it has an incredibly broad surface area, yet it is also a field where you have to sweat every single detail, for either you get everything connected together correctly, or nothing works at all.

I often use the analogy of a box of cables: you reach into the box, hoping to pull out just one cable, but you inevitably end up pulling out a giant mess where everything is tangled together. Unfortunately, that’s the state of DevOps today. It’s a relatively new industry, the tools and techniques we have just aren’t that mature, and it often feels like everything is broken and frustrating and hopelessly tangled.

My hope in this blog post series is, as much as I can, to untangle this mess of cables for you. To show you that these are, in fact, separate cables—separate concepts—that, in isolation, are something you can readily understand, begin to work with, and ultimately become proficient in.

But sometimes, this is hard to do. Sometimes, reaching into this box of cables feels more like reaching into a box of snakes. You just end up getting bitten. If you find yourself sitting there, staring at some nonsense error message, tearing your hair out, stressed, angry, and afraid, know this:

You are not alone.

There are thousands of other developers reaching into that box of snakes and getting bitten, every day. I’m one of them. I’ve lost more hair to DevOps than I care to admit. Even while writing this blog post series, I frequently found myself frustrated, or confused, or yelling at my screen, even though I’ve done most of these things a thousand times. That’s just how it is today.

In fact, there are a few places in this series where I haven’t been able to untangle the wires as much as I’d like: e.g., some example code that’s just too complicated and long to include in the blog post series, so I have to settle for a simpler and less realistic version, or a concept I can’t explain without introducing ten other concepts that come later, so I can only give you a partial explanation for now. In cases like these, I’ve added a "box of snakes" warning that looks like this:

Watch out for snakes: example title for the warning

An example warning. When you see these, be prepared to enter a particularly hairy and tangled corner of DevOps.

Whenever you see such a warning, understand that you’re going to see part of the picture now, but perhaps won’t be able to get the full picture until later. In fact, this is true of DevOps in general. If you’re new to DevOps, initially, it’ll all seem strange, confusing, and full of incomprehensible buzzwords. And each time you go to learn a new buzzword, you’re hit with ten more unfamiliar buzzwords, so you never feel like you’re getting the whole picture. But I promise you that if you give it enough time, you’ll eventually get over a hump, and suddenly, the pieces will start to make sense, and really come together. You need to build up a big enough base of knowledge and experience, and it’s hard going at first, but at some point, it starts to get easier. It never becomes completely easy, but you get to a point where you always feel confident that you can figure it out.

So stick with it. And watch out for those snakes.

Who Should Read This Series

This blog post series is for anyone responsible for deploying and managing apps in production—that is, anyone responsible for software delivery. This includes:

Individual contributors in operations roles: Current and aspiring Site Reliability Engineers, DevOps Engineers, Sysadmins, Operations Engineers, and Release Engineers who want to level up their knowledge of software delivery.
Individual contributors in dev roles: Software Engineers, Software Developers, Web Developers, and Full Stack Engineers who want to learn more about the operations side of the house.
Managers: Engineering Managers, Engineering Directors, CTOs, VPEs, and CIOs who want to learn how to adopt DevOps practices in their organizations.

This blog post series does not assume that you’re already an expert coder or expert sysadmin. A basic familiarity with programming, the command line, and server-based software (e.g., websites) should suffice. Everything else you need you’ll be able to pick up as you go. The only tools you need are a computer, an internet connection, and the desire to learn.

What You’ll Find in This Series

Table 2 shows a part-by-part outline of what the blog post series covers, including the key ideas you’ll explore and the hands-on examples you’ll try in each part:

Table 2. An outline of the blog post series
Part	Key ideas you’ll explore	Examples you’ll try out
Part 1, How to Deploy Your App	Personal computers vs servers; On-prem vs cloud; PaaS vs IaaS	Run an app locally; Run an app on Render; Run an app on an EC2 instance in AWS
Part 2, How to Manage Your Infrastructure as Code	Ad hoc scripts; Configuration management tools; Server templating tools; Provisioning tools	Use Bash to deploy an EC2 instance; Use Ansible to deploy an EC2 instance; Use Packer to build an AMI; Use OpenTofu to deploy an EC2 instance
Part 3, How to Manage Your Apps Using Orchestration Tools	Server orchestration; VM orchestration; Container orchestration; Serverless orchestration	Use Ansible to deploy app servers & Nginx; Use OpenTofu to deploy an ASG and ALB; Deploy a Dockerized app in Kubernetes; Deploy a serverless app with AWS Lambda
Part 4, How to Version, Build, and Test Your Code	Version control; Build systems; Dependency management; Automated testing	Store your code in GitHub; Configure your build in NPM; Set up automated tests for a Node.js app; Set up automated tests for OpenTofu code
Part 5, How to Set Up Continuous Integration (CI) and Continuous Delivery (CD)	Trunk-based development; Feature toggles; Deployment strategies, pipelines	Use OIDC with GitHub Actions and AWS; Run tests in GitHub Actions; Run deployments in GitHub Actions
Part 6, How to Work with Multiple Teams and Environments	Multiple environments; Multiple libraries; Multiple services	Create multiple AWS accounts; Configure apps for multiple environments; Deploy microservices in Kubernetes
Part 7, How to Set Up Networking	Domain Name System (DNS); Virtual private clouds (VPCs); Network access and hardening; Service discovery, service meshes	Set up a custom domain name in Route 53; Deploy a custom VPC in AWS; Use SSH to connect to a server; Use Istio as a service mesh with Kubernetes
Part 8, How to Secure Communication and Storage	Cryptography; Encryption at rest; Encryption in transit	Encrypt data with AES and RSA; Store secrets in AWS Secrets Manager; Set up HTTPS with LetsEncrypt
Part 9, How to Store Data	Relational DBs, schemas; NoSQL, NewSQL, queues, streams; File storage and CDNs; Backup and recovery	Deploy PostgreSQL using RDS; Configure RDS backup, replicas; Use Knex.js for schema migrations; Use S3 and CloudFront for static assets
Part 10, How to Monitor Your Systems	Logs and log aggregation; Metrics, dashboards, alerts; Observability and tracing	Create a dashboard in CloudWatch; Do structured logging with Node.js; Set up Route 53 health checks and alerts
Part 11, The Future of DevOps and Software Delivery	Infrastructureless; Generative AI; Shift left, supply chain security; Platform engineering	Runme; Snyk; Chainguard; Backstage

Feel free to read the blog post series from beginning to end or jump around to the parts that interest you the most. Note that the examples in each part reference and build upon the examples from the previous parts, so if you skip around, use the open source code examples (as described in Open Source Code Examples) to catch up.

On the blog post series’s website, you’ll find a dedicated page for each blog post (for example, the dedicated page for Part 1), which includes a list of recommended resources (books, blog posts, courses) you can use to go deeper on the topics covered in that blog post, as well as a list of all the tools mentioned in that blog post.

Given the breadth of DevOps, this blog post series covers a lot of ground and includes a lot of detail. To help you avoid missing the forest for the trees, I try to call out the key takeaways in each blog post as follows:

Key takeaway #1

A key takeaway from the blog post.

Pay special attention to these items, as they typically highlight the most important lessons in that post.

What You Won’t Find in This Series

This blog post series is meant to fill a specific gap: a hands-on guide to DevOps and software delivery, targeted at practitioners. This is already a huge amount of content to cover, which means there are some DevOps and software delivery topics that this series will either skip or only touch on lightly:

DevOps culture and organizational processes: Most of the DevOps books out there today primarily focus on DevOps culture and organizational processes such as cross-functional teams, capacity planning, blameless postmortems, on-call rotations, KPIs, SLOs, and error budgets, so this blog post series won’t spend much time on these items.
Server hardening: While this blog post series covers a range of security topics, I can’t cover them all. In particular, one area I won’t be able to discuss too much is how to harden your servers against attacks: e.g., OS permissions, intrusion protection, file integrity monitoring, sandboxing, hardened images, etc.
Low-level networking: This blog post series includes a post on networking, but it only focuses on higher level concepts: DNS, CDNs, VPCs, VPNs, service meshes, and basic network hardening. This post will not go into any lower-level details, such as routers, switches, links, routing protocols, and so on.
Compliance: DevOps engineers are often tasked with helping their companies meet various compliance standards and regulations, such as SOC 2, ISO 27001, HIPAA/HITRUST, PCI-DSS, and so on. While the practices I recommend in this blog post series go a long way towards setting up the kind of security posture you need to meet these compliance standards, this series is not meant to be a detailed guide towards meeting any standard in particular.
Cost optimization and performance tuning: DevOps engineers are also often asked to help optimize the company’s systems to reduce costs or improve performance. These are detailed and ever-changing topics in their own right, so this blog post series will only touch on them at a surface level.

Open Source Code Examples

This blog post series includes many examples for you to work through. You can find all these code samples in the following GitHub repository:

https://github.com/brikis98/devops-book

You might want to check out this repo before you begin reading so you can follow along with all the examples on your own computer:

$ git clone https://github.com/brikis98/devops-book.git

The code samples are organized by part (e.g., ch1, ch2), and within each part, by tool (e.g., ansible, kubernetes, tofu). For example, the example Packer template in Part 2 will be in the folder ch2/packer, and the example OpenTofu module called lambda in Part 3 will be in the folder ch3/tofu/modules/lambda.

Embedded sandbox on the O’Reilly learning platform

The version of this blog post series on the O’Reilly online learning platform includes an embedded sandbox preconfigured with all the code examples, all the tools the blog post series uses, access credentials for an AWS account, a code editor, and a terminal, so you can try out all the examples in your browser, without having to install anything.

It’s worth noting that most of the examples show you what the code looks like at the end of a part. If you want to maximize your learning, you’re better off writing the code yourself, from scratch, and checking the "official" solutions only at the end.

An important note for Windows users

While the example code included in this blog post series should work on any operating system, the series also includes many example terminal commands that you run locally. These terminal commands are mostly written in Bash, so to run them, you need a computer running Unix, Linux, or macOS; if you’re on Windows, you can use the Windows Subsystem for Linux or Cygwin.

Opinionated Code Examples

The core concepts in the blog post series (e.g., managing infrastructure as code, CI / CD, networking, and secrets management) are relatively ubiquitous and applicable across the entire software industry. The code samples, however, represent just one opinionated way to implement these core concepts. The examples are there to give you hands-on practice, and to help with learning. They are not there as a claim that this is the only way or the best way to do things.

In the real world, there is no single "best" way that applies to all circumstances. All technology choices are trade-offs, and some solutions will be a better fit in some situations than others. The goal of this blog post series is to teach you the underlying concepts and techniques of DevOps and software delivery, and not a specific set of tools or technologies, so once you understand the basics, feel free to explore other technologies and approaches, and always use your judgment to pick the right tool for the job.

A Note About Versions

Whereas the core concepts in this blog post series change only over relatively long time spans, the code samples used to demonstrate and implement the core concepts change more frequently. Therefore, it’s possible that by the time you read this, some of the examples will be out of date. I’ll try to update the examples as often as I can, but if you hit an issue, please file a bug in the series’s GitHub repo.

You Have to Get Your Hands Dirty

A book on weight lifting can teach you principles, routines, and exercises, but you have to spend hours in the gym practicing, sweating, and applying what you learned to be able to lift serious weight. Likewise, a blog post series on DevOps and software delivery can teach you principles, techniques, and tools, but you have to spend hours writing code, debugging systems, and applying what you learned to be able to achieve serious results.

That’s what the code examples in this series are for. Instead of only reading, you get to learn by doing. So don’t just skim the code examples: write the code, run it, and get it working. Moreover, you’ll see sections like the following throughout the series:

Get your hands dirty

A list of exercises to try at home.

The examples in this blog post series will get you to the point where you have something working; these "get your hands dirty" sections are an opportunity for you to take those examples and tweak them, customize them to your needs, break things, figure out how to fix them, and so on. Think of this as time spent practicing and sweating at the gym. Getting your hands dirty is when the real learning happens.

Let’s Get Started

Now that you have a basic understanding of what this blog post series is all about, it’s time to get started. And where better to begin than at the beginning: head over to Part 1, How to Deploy Your App to learn the basics of deploying apps.