Footnotes


1. From The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, and John Willis (IT Revolution Press).
2. For example, Google Cloud’s 2022 price changes increased storage pricing by as much as 50%, and Vercel’s 2024 price changes increased pricing for some users by as much as 10x.
3. The Standish Group, "CHAOS Manifesto 2013: Think Big, Act Small," 2013, https://www.standishgroup.com/sample_research_files/CM2013-8+9.pdf.
4. Dan Milstein, "How to Survive a Ground-Up Rewrite Without Losing Your Sanity," OnStartups.com, April 8, 2013, https://www.onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx.
5. This is where the term bus factor comes from. Your team’s bus factor is the number of people you can lose (e.g., because they got hit by a bus, or perhaps something less dramatic, like they changed jobs) before you can no longer operate your business. You never want to have a bus factor of 1.
6. Note that with Chef and Puppet, using a master server is the idiomatic approach, whereas with Ansible, the idiomatic approach is to use the Ansible client directly (e.g., from your own computer), and the Automation Controller is an optional, paid add-on.
7. Several new virtualization technologies have appeared in the last few years, including Firecracker and V8 isolates (full list), that offer different trade-offs in terms of security and overhead. As these technologies mature, they may be worth a deeper look.
8. On most modern operating systems, code runs in one of two "spaces": kernel space or user space. Code running in kernel space has direct, unrestricted access to all of the hardware. There are no security restrictions (i.e., you can execute any CPU instruction, access any part of the hard drive, write to any address in memory) or safety restrictions (e.g., a crash in kernel space will typically crash the entire computer), so kernel space is generally reserved for the lowest-level, most trusted functions of the OS (typically called the kernel). Code running in user space does not have any direct access to the hardware and must use APIs exposed by the OS kernel instead. These APIs can enforce security restrictions (e.g., user permissions) and safety (e.g., a crash in a user space app typically affects only that app), so just about all application code runs in user space.
9. As a general rule, containers provide isolation that’s good enough to prevent your own applications from accidentally interfering with each other, but if you need to run third-party applications (e.g., you’re building your own cloud provider) that might intentionally be performing malicious actions, you’ll want the increased isolation guarantees of a VM.
10. OpenTofu is an open source fork of Terraform that was created after Terraform moved away from an open source license. I prefer to use open source tools whenever possible, so this blog post series will use OpenTofu for example code, but most of the examples should work with Terraform as well.
11. As mentioned in Part 2, deploying and managing servers (hardware) is not what configuration management tools were designed to do, but for learning and testing, Ansible is good enough. Note that the way you’ll use Ansible to deploy multiple EC2 instances in this section is meant to showcase server orchestration in its most common form, with a fixed set of servers, and not the idiomatic approach for running multiple servers in the cloud; you’ll see the more idiomatic approach in the VM orchestration section.
12. This was one of the original motivations for container orchestration. With server and VM orchestration, where each app gets its own cluster of servers, you have to provision enough servers in each cluster to handle peak load, but when load is not at peak, as is the case most of the time, most servers sit idle. Containers allowed large companies like Google to make more efficient use of their servers by treating them as one big pool of resources, running multiple containers on each server, and moving containers around as necessary.
13. Kubernetes supports other types of Services as well. See the documentation for details.
14. For example, as of 2024, Lambda pricing was $0.0000166667 for every GB-second of execution. To put that into perspective, an app that handles 10 million requests per month could cost you less than $10/month, depending on execution time and memory usage.
15. Kubernetes Deployments support rolling and canary deployments. Tools such as Argo Rollouts provide blue-green deployments.
16. This is true up to a point. You may have to configure concurrency controls to avoid being throttled.
17. If you want persistent volumes for ASGs, you have to manually move the volumes to new instances during deployment.
18. The noisy neighbor problem is when multiple apps on the same physical server cause performance issues for each other.
19. An easy way to do multiline commit messages is to omit the -m flag when running git commit, which will result in Git opening up a file in your default text editor (see here for how to configure which editor to use), and when you edit that file and exit, Git will use the file’s contents as the commit message.
20. See "Software Defect-Removal Efficiency" by Capers Jones, IEEE Computer, April 1996 and Handbook of Walkthroughs, Inspections, and Technical Reviews: Evaluating Programs, Projects, and Products by Daniel P. Freedman and Gerald M. Weinberg, Dorset House.
21. Unit tests for application code are typically fast. Unit tests for infrastructure code are typically slow.
22. The ISS assembly took more than 40 separate space flights, and before and after each one, individuals from 16 countries were constantly collaborating, running tests, updating designs, integrating changes, and so on.
23. The mutable approach does not require extra resources. The immutable approach requires at least one extra server.
24. Note that support for automatically moving hard drives for stateful systems is limited.
25. These principles are based on the definitions in OpenGitOps.
26. For an example of a GitOps pipeline that implements most of this functionality, see Gruntwork Pipelines.
27. This data was originally compiled by Peter Norvig and popularized by Jeff Dean. The numbers in this table are based on data from Colin Scott and Systems Performance: Enterprise and the Cloud (Pearson Education) by Brendan Gregg.
28. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly).
29. Code Complete: A Practical Handbook of Software Construction by Steve McConnell (Microsoft Press).
30. Statista estimates that there are over 5 billion Internet users, but even that’s a drop in the bucket compared to the total number of devices that need IPs, including users who have multiple devices (computer, phone, tablet, TV, car), networking devices (routers, switches), IoT devices, and so on.
31. Based on IPv6 adoption data from Google and Cloudflare.
32. By default, these are dynamic IPs that are chosen at random from the pool of IPs owned by the cloud provider, so they may change every time you redeploy. If you want to use the same IP address for a long period of time, you can typically reserve a static IP for an additional fee: e.g., AWS offers Elastic IPs (EIPs), GCP offers static external IP addresses, and Azure offers static public IP addresses.
33. Don’t confuse cryptography with crypto, which these days refers to cryptocurrency, a type of digital currency.
34. As of 2024, the world’s fastest distributed computer is the Frontier system at Oak Ridge National Laboratory, which was able to perform 1.2 exaFLOPS, or about 1.2 x 10^18 floating point operations per second. That’s a remarkable accomplishment, but even if you generously assume that you could try one key per floating point operation, this system would need to run for roughly 9 trillion years to perform 2^128 floating point operations, which is 650 times longer than the age of the universe (13.8 billion years).
35. As Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." I also liked Leon Bambrick’s version of this quote: "There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
36. For example, Amazon Athena lets you use SQL to query certain file formats (e.g., CSV, JSON, Avro) in S3.
37. For example, S3 can automatically trigger a Lambda function each time you upload a file.
38. There are some exceptions, such as MongoDB, which has support for joins via the $lookup operator, although it’s more limited than the types of joins you get with relational databases.
39. Again, there are some exceptions, such as MongoDB, which has support for distributed transactions, though it’s more limited than what you get with relational databases. Moreover, transactions are not the default, but something you have to remember to use, which is quite error-prone.
40. This data comes from a blog post titled Building and deploying MySQL Raft at Meta, which explains that, to handle this level of scale, Meta had to create MySQL Raft, a consensus engine that turns MySQL into a "true distributed system," so it’s not clear if you can still call it a relational database.
41. The only other type of NoSQL data store is the graph database, which I rarely see in the wild, but if you’re interested, have a look at the list here.
42. See Data Driven Products Now! by Dan McKinley for a great write-up on data-driven product development.
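
The Lambda estimate in footnote 14 is easy to sanity-check yourself. Here is a minimal sketch, not an official pricing calculator: the per-GB-second rate is the one quoted in the footnote, while the per-request charge ($0.20 per million requests), the function name `monthly_lambda_cost`, and the workload numbers (200 ms per request, 128 MB of memory) are my own assumptions for illustration. It also ignores any free tier.

```python
# Rate quoted in footnote 14 (as of 2024): USD per GB-second of execution.
PRICE_PER_GB_SECOND = 0.0000166667
# Assumption (not stated in the footnote): USD charged per million requests.
PRICE_PER_MILLION_REQUESTS = 0.20


def monthly_lambda_cost(requests: int, avg_duration_s: float, memory_gb: float) -> float:
    """Rough monthly cost in USD for a hypothetical workload, ignoring any free tier."""
    gb_seconds = requests * avg_duration_s * memory_gb
    compute_cost = gb_seconds * PRICE_PER_GB_SECOND
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute_cost + request_cost


if __name__ == "__main__":
    # Hypothetical workload: 10 million requests, 200 ms each, 128 MB of memory.
    cost = monthly_lambda_cost(10_000_000, 0.2, 0.125)
    print(f"${cost:.2f} per month")
```

Under these assumptions the total comes out to a little over $6 per month, which is consistent with the "less than $10/month" claim. Longer execution times or larger memory settings scale the compute portion linearly.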
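
The brute-force arithmetic in footnote 34 can be reproduced in a few lines. This is a rough sketch under the same generous assumption the footnote makes (one key tried per floating point operation); the constant and function names are mine, and the year length of 365.25 days is an approximation.

```python
FLOPS = 1.2e18                            # Frontier's rate as quoted in footnote 34
KEYS = 2 ** 128                           # size of a 128-bit key space
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # approximate length of a year
AGE_OF_UNIVERSE_YEARS = 13.8e9            # figure used in footnote 34


def brute_force_years(keys: int, keys_per_second: float) -> float:
    """Years needed to try every key at the given rate."""
    return keys / keys_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = brute_force_years(KEYS, FLOPS)
    print(f"{years:.2e} years")
    print(f"{years / AGE_OF_UNIVERSE_YEARS:.0f}x the age of the universe")
```

The result is on the order of 9 x 10^12 years, or roughly 650 times the age of the universe, matching the figures in the footnote.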