1. This data was originally compiled by Peter Norvig and later popularized by Jeff Dean. The numbers in this blog post series are based on data compiled by Colin Scott on this interactive website and Brendan Gregg in his book Systems Performance: Enterprise and the Cloud (Pearson Education).
1. The 2023 State of DevOps Report does not publish raw data, only summaries, so I had to estimate the values in the "elite vs low performance" column. For example, the report says that the deployment frequency for elite performers is "multiple deploys per day," which I conservatively estimate as 2-10 deploys per day, and for low performers is "between once per week and once per month." With 5 working days per week, and 4 weeks per month, this works out to 40-200 deploys per month for elite performers, versus 1-4 deploys per month for low performers, a difference of 10-200x.
2. From The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, and John Willis (IT Revolution Press).
5. The loopback network interface is typically 127.0.0.1, so you can open http://127.0.0.1:8080 in your browser and it should give the same "Hello, World!" response.
6. One exception to this rule is to temporarily expose an app on your computer to a trusted 3rd party (typically, a coworker) for feedback. For example, you’ve built something really cool, and you want to quickly send a link to one of your colleagues so they can check it out. In that case, you can use tunneling tools such as localtunnel, ngrok, localhost.run, or btunnel to get a temporary, randomly-generated URL for testing. These are great for quick internal tests, but do not use these to share your app with the whole world.
8. The first PaaS, Zimki, was actually created by Canon (the camera company) in 2006, but shut down just a year later.
9. I’m assuming that you’re running the examples in this blog post series in an AWS account dedicated solely to learning and testing so that the broad permissions of the AdministratorAccess Managed Policy are not a big risk.
10. This is where the term bus factor comes from: your team’s bus factor is the number of people you can lose (e.g., because they got hit by a bus, or perhaps something less dramatic, like they changed jobs) before you can no longer operate your business. You never want to have a bus factor of 1.
11. Note that with Chef and Puppet, using a master server is the idiomatic approach, whereas with Ansible, the idiomatic approach is to use the Ansible client directly (e.g., from your own computer), and the Automation Controller is an optional, paid add-on.
12. On most modern operating systems, code runs in one of two "spaces": kernel space or user space. Code running in kernel space has direct, unrestricted access to all of the hardware. There are no security restrictions (i.e., you can execute any CPU instruction, access any part of the hard drive, write to any address in memory) or safety restrictions (e.g., a crash in kernel space will typically crash the entire computer), so kernel space is generally reserved for the lowest-level, most trusted functions of the OS (typically called the kernel). Code running in user space does not have any direct access to the hardware and must use APIs exposed by the OS kernel instead. These APIs can enforce security restrictions (e.g., user permissions) and safety (e.g., a crash in a user space app typically affects only that app), so just about all application code runs in user space.
13. As a general rule, containers provide isolation that’s good enough to run your own code, but if you need to run third-party code (e.g., you’re building your own cloud provider) that might actively be performing malicious actions, you’ll want the increased isolation guarantees of a VM.
14. This was one of the motivations for container orchestration: huge companies such as Google found that, with server and VM orchestration, where each app gets its own cluster of servers, you’d have to provision enough servers to handle peak load, but when load was not at peak (as is the case most of the time), most servers sit completely idle. One of the big benefits of container orchestration was that it allowed running multiple containers on each server, and moving them around quickly, so all your servers could act as one big pool of resources, which is much more efficient.
15. Kubernetes supports other types of Services as well. See the documentation for details.
16. For example, as of 2024, AWS Lambda costs $0.0000166667 for every GB-second of execution, plus $0.20 per 1M requests, with a free tier that includes 400,000 GB-seconds and one million free requests per month. If you built an app that processed three million requests per month, and this app ran on a Lambda function that used 1536 MB of memory and had an average function execution duration of 120 ms, the total cost would be less than $3 per month. See Lambda pricing for more details.
17. Kubernetes Deployments support rolling and canary deployments. Tools such as Argo Rollouts provide blue-green deployments.
19. ASG instances can have EBS Volumes attached (a topic of Part 9), but if you deploy new instances, they get new volumes.
20. The noisy neighbor problem is when multiple VMs on the same physical server cause performance issues for each other.
21. This method of calculating commit IDs is brilliant. First, it ensures that commit IDs are consistent without a central mechanism for issuing IDs: the same commit done on any computer, anywhere in the world, at any time always gets the exact same ID. Second, it ensures commits can’t be tampered with: change even 1 bit of the contents, metadata, or history, and you get a totally different SHA-1 hash. Third, it gives you an efficient way to compare commits: you can compare commits using just the IDs, without having to send the full contents; even more interestingly, you can compare the full history just by comparing IDs, as the commit ID calculation includes the ID of the previous commit, which, in turn, includes the ID of its predecessor, and so on, so two commit IDs are only equal if their entire history is equal.
22. See "Software Defect-Removal Efficiency" by Capers Jones, IEEE Computer, April 1996 and Handbook of Walkthroughs, Inspections, and Technical Reviews: Evaluating Programs, Projects, and Products by Daniel P. Freedman and Gerald M. Weinberg, Dorset House.
23. Unit tests for application code are typically fast. Unit tests for infrastructure code are typically moderately fast, at best.
24. Note that support for moving hard drives over for stateful systems is limited.
25. For an example of a full-featured deployment pipeline that implements most of this functionality out-of-the-box, see Gruntwork Pipelines.
26. For example, see GitHub’s push rulesets for a way to lock down who can edit specific file paths in a repo.
27. Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (O’Reilly).
28. Code Complete: A Practical Handbook of Software Construction by Steve McConnell (Microsoft Press).
29. Statista estimates that there are over 5 billion Internet users, but even that’s a drop in the bucket compared to the total number of devices that need IPs, including users who have multiple devices (computer, phone, tablet, TV, car), networking devices (routers, switches), IoT devices, and so on.
31. By default, these are dynamic IPs that are chosen at random from the pool of IPs owned by the cloud provider, so they may change every time you redeploy. If you want to use the same IP address for a long period of time, you can typically reserve a static IP for an additional fee: e.g., AWS offers Elastic IPs (EIPs), GCP offers static external IP addresses, and Azure offers static public IP addresses.
32. As of 2024, the world’s fastest distributed computer is the Frontier system at Oak Ridge National Laboratory, which was able to perform 1.2 exaFLOPS, or about 1.2 × 10^18 floating point operations per second. That’s a remarkable accomplishment, but even if you generously assume that you could try one key per floating point operation, this system would need to run for roughly 9 trillion years to perform 2^128 floating point operations, which is 650 times longer than the age of the universe (13.8 billion years).
33. As Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." I also liked Leon Bambrick’s version of this quote: "There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
34. There are some exceptions, such as MongoDB, which has support for joins via the $lookup operator, although it’s more limited than the types of joins you get with relational databases.
35. Again, there are some exceptions, such as MongoDB, which has support for distributed transactions, though these are also more limited than what you get with relational databases. Moreover, transactions are not the default, but something you have to remember to use, which is quite error-prone.
36. This data comes from a blog post titled Building and deploying MySQL Raft at Meta, which explains that, to handle this level of scale, Meta had to create MySQL Raft, a consensus engine that turns MySQL into a "true distributed system," so it’s not clear if you can still call it a relational database.
37. The EventBrite page for the meetup is no longer available, but you can find snapshots in the Internet Archive.
38. The only other type of NoSQL data store that I didn’t cover in this blog post is the graph database, which wasn’t motivated by the need for scalability or availability, but the need to efficiently store, query, and navigate relationship data within a graph. I rarely come across these in the wild, but if you’re interested in them, have a look at Neo4j, Amazon Neptune, and Aerospike.
42. This is also a simplification. Java and the Java Virtual Machine (JVM) have a huge number of optimizations built-in, such as the just-in-time (JIT) compiler, which can make Java faster for many use cases.
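
The AWS Lambda estimate in the pricing footnote above (three million requests, 1536 MB of memory, 120 ms average duration) can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the 2024 prices and free-tier allowances quoted in that footnote; the function name lambda_monthly_cost is hypothetical, and real bills depend on region, architecture, and rounding rules that are not modeled here.

```python
# Prices and free-tier allowances as quoted in the footnote (as of 2024).
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20
FREE_GB_SECONDS = 400_000
FREE_REQUESTS = 1_000_000


def lambda_monthly_cost(requests: int, memory_mb: int, avg_duration_ms: float) -> float:
    """Estimate a monthly Lambda bill in USD (hypothetical helper, not an AWS API)."""
    # Compute usage is billed in GB-seconds: duration (s) x memory (GB) per request.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    billable_requests = max(0, requests - FREE_REQUESTS)
    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute_cost + request_cost


cost = lambda_monthly_cost(requests=3_000_000, memory_mb=1536, avg_duration_ms=120)
print(f"${cost:.2f} per month")  # prints "$2.73 per month"
```

The workload uses 540,000 GB-seconds, of which 140,000 are billable (about $2.33), plus $0.40 for the two million requests beyond the free tier, which is consistent with the "less than $3 per month" figure.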
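
The properties described in the footnote on commit IDs can be illustrated with a short sketch. Git names each object by the SHA-1 hash of a small header followed by the object's contents, and a commit's contents include the ID of its parent. The code below assumes a simplified commit layout with a hypothetical fixed author and timestamp (Jane Doe, jane@example.com); the helper names git_object_id and make_commit are illustrative, not part of Git.

```python
import hashlib
from typing import Optional


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>' + NUL byte + body with SHA-1, as Git does for its objects."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def make_commit(tree_id: str, parent_id: Optional[str], message: str) -> str:
    """Build a simplified commit body and return its ID (illustrative helper)."""
    # Hypothetical fixed author and timestamp so the result is reproducible.
    who = "Jane Doe <jane@example.com> 1700000000 +0000"
    lines = [f"tree {tree_id}"]
    if parent_id is not None:
        lines.append(f"parent {parent_id}")  # the parent's ID is part of the hashed content
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return git_object_id("commit", "\n".join(lines).encode())


empty_tree = git_object_id("tree", b"")
first = make_commit(empty_tree, None, "Initial commit")
second = make_commit(empty_tree, first, "Second commit")

# Same contents, metadata, and history give the same ID, with no central authority.
same_again = make_commit(empty_tree, None, "Initial commit")

# Changing the first commit changes its ID and the ID of every descendant.
tampered_first = make_commit(empty_tree, None, "Initial commit!")
tampered_second = make_commit(empty_tree, tampered_first, "Second commit")

print(first == same_again, first != tampered_first, second != tampered_second)
```

Because each ID covers the parent's ID, comparing the two "second" commit IDs is enough to detect that their histories differ, without looking at the first commits at all.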
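
The brute-force estimate in the footnote about the Frontier system can be reproduced as follows. This is a rough sketch using the figures from that footnote (1.2 exaFLOPS, a 128-bit key space, a 13.8-billion-year-old universe) and the same generous assumption of one key tried per floating point operation; the 365.25-day year is an assumption made here for the conversion.

```python
KEY_SPACE = 2**128                       # number of possible 128-bit keys
OPS_PER_SECOND = 1.2e18                  # 1.2 exaFLOPS, one key per operation
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed year length
AGE_OF_UNIVERSE_YEARS = 13.8e9

years = KEY_SPACE / OPS_PER_SECOND / SECONDS_PER_YEAR
ratio = years / AGE_OF_UNIVERSE_YEARS

print(f"about {years:.1e} years, or {ratio:.0f} times the age of the universe")
```

The result is on the order of 9 × 10^12 years, which is where the "roughly 9 trillion years" and "650 times" figures come from.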