Part 2. How to Manage Your Infrastructure as Code

Update, June 25, 2024: This blog post series is now also available as a book called Fundamentals of DevOps and Software Delivery: A hands-on guide to deploying and managing production software, published by O’Reilly Media!

This is Part 2 of the Fundamentals of DevOps and Software Delivery series. In Part 1, you learned how to deploy your app using PaaS and IaaS, but it required a lot of manual steps clicking around a web UI. This is fine while you’re learning and experimenting, but if you manage everything at a company this way—what’s sometimes called ClickOps—it quickly leads to problems:

Deployments are slow and tedious

So you can’t deploy too often.

Deployments are error-prone

So you end up with lots of bugs, outages, and late-night debugging sessions.

Only one person knows how to deploy

So that person is overloaded, they never have time for long-term improvements, and if they were to leave or get hit by a bus, everything would grind to a halt.[10]

Fortunately, these days, there is a better way to do things: you can manage your infrastructure as code (IaC). Instead of clicking around manually, you use code to define, deploy, update, and destroy your infrastructure. This represents a key insight of DevOps: most tasks that you used to do manually can now be automated using code, as shown in Table 6.

Table 6. A key insight of DevOps is that you can manage almost everything as code
| Task | How to manage as code | Example | Part |
| --- | --- | --- | --- |
| Provision servers | Provisioning tools | Use OpenTofu to deploy a server | This blog post |
| Configure servers | Server templating tools | Use Packer to create an image of a server | This blog post |
| Configure apps | Configuration files and services | Read configuration from a JSON file | Part 6 |
| Configure networking | Software-defined networking | Use Kubernetes networking | Part 7 |
| Build apps | Build systems | Build your app with NPM | Part 4 |
| Test apps | Automated tests | Write automated tests using Jest | Part 4 |
| Deploy apps | Automated deployment | Do a rolling deployment with Kubernetes | Part 3 |
| Scale apps | Auto scaling | Set up auto scaling policies in AWS | Part 3 |
| Recover from outages | Auto healing | Set up liveness probes in Kubernetes | Part 3 |
| Manage databases | Schema migrations | Use Knex.js to update your database schema | Part 9 |
| Test for compliance | Policy as code | Check compliance using Open Policy Agent | Part 4 |

If you search around, you’ll quickly find that there are many tools out there that allow you to manage your infrastructure as code, including Chef, Puppet, Ansible, Pulumi, Terraform, OpenTofu, CloudFormation, Docker, Packer, and so on. Which one should you use? Many of the comparisons you find online between these tools do little more than list the general properties of each tool and make it sound like you could be equally successful with any of them. And while that’s true in theory, it’s not true in practice. There are considerable differences between these tools, and your odds of success go up significantly if you know how to pick the right tool for the job.

This blog post will help you navigate the IaC space by introducing you to the four most common categories of IaC tools:

  • Ad hoc scripts: e.g., use a Bash script to deploy a server.

  • Configuration management tools: e.g., use Ansible to deploy a server.

  • Server templating tools: e.g., use Packer to build an image of a server.

  • Provisioning tools: e.g., use OpenTofu to deploy a server.

You’ll work through examples where you deploy the same infrastructure using each of these approaches, which will allow you to see how different IaC categories perform across a variety of dimensions (e.g., verbosity, consistency, error handling, and so on), so that you can pick the right tool for the job.

Before digging into the details of various IaC tools, it’s worth asking, why bother? Learning and adopting new tools has a cost, so what are the benefits of IaC that make this worthwhile? This is the focus of the next section.

The Benefits of IaC

When your infrastructure is defined as code, you are able to use a wide variety of software engineering practices to dramatically improve your software delivery processes, including the following:

Speed and safety

If the deployment process is automated, it will be significantly faster, since a computer can carry out the deployment steps far faster than a person, and safer, given that an automated process will be more consistent, more repeatable, and not prone to manual error.

Documentation

If your infrastructure is defined as code, then the state of your infrastructure is in source files that anyone can read, rather than locked away in a single person’s head. In other words, IaC acts as documentation, allowing everyone in the organization to understand how things work.

Version control

Storing your IaC source files in version control (which you’ll do in Part 4) makes it easier to collaborate on your infrastructure, debug issues (e.g., by checking the version history to find out what changed), and resolve issues (e.g., by reverting to a previous version).

Validation

If the state of your infrastructure is defined in code, for every single change, you can perform a code review, run a suite of automated tests, and pass the code through static analysis tools—all practices that are known to significantly reduce the chance of defects (you’ll see examples of all of these practices in Part 4).

Self-service

If your infrastructure is defined in code, developers can kick off their own deployments, instead of relying on others to do it.

Reuse

You can package your infrastructure into reusable modules so that instead of doing every deployment for every product in every environment from scratch, you can build on top of known, documented, battle-tested pieces.

Happiness

There is one other important, and often overlooked, reason for why you should use IaC: happiness. Manual deployments are repetitive and tedious. Most people resent this type of work, since it involves no creativity, no challenge, and no recognition. You could deploy code perfectly for months, and no one will take notice—until that one day when you mess it up. IaC offers a better alternative that allows computers to do what they do best (automation) and developers to do what they do best (creativity).

Now that you have a sense of why IaC is so valuable, in the following sections, you’ll explore the most common categories of IaC tools, starting with ad hoc scripts.

Authenticating to AWS on the command line

To run the code examples in this blog post series that use AWS, you’ll need to authenticate to AWS on the command line. This requires using an access key, which you can create by heading over to the IAM Console, clicking Users, and selecting the IAM user you created in Part 1. Next, click on the "Security credentials" tab, scroll down to the "Access keys" section, and click "Create access key" as shown in Figure 20:

Figure 20. Create an access key for your IAM user.

On the next page, select "Command Line Interface (CLI)" as the use case, tick the Confirmation check box, and click Next. This will take you to a page where you should enter a description and click "Create access key." Finally, you’ll end up on a page that shows two values, an access key id and a secret access key. This is the only time AWS will show you these values, so make sure to save both of them in a secure password manager, such as 1Password (you’ll learn more about secrets management in Part 8).

To use the access key on the command line, you need to set the access key id and secret access key as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively:

console
$ export AWS_ACCESS_KEY_ID=(your access key id)
$ export AWS_SECRET_ACCESS_KEY=(your secret access key)

Note that these environment variables apply only to the current shell, so if you open a new terminal window, you’ll need to export these variables again.
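Because it’s easy to forget this step in a new terminal window, a small check along the following lines can save some confusing error messages. This is only a minimal sketch: the function name check_aws_env is hypothetical and not part of the sample code repo, and it confirms only that the two variables are non-empty, not that the key itself is valid.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the sample code repo): returns 0 if both
# AWS credential environment variables are set and non-empty, 1 otherwise.
check_aws_env() {
  local status=0

  if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "Missing environment variable: AWS_ACCESS_KEY_ID" >&2
    status=1
  fi

  if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "Missing environment variable: AWS_SECRET_ACCESS_KEY" >&2
    status=1
  fi

  return "$status"
}
```

You could call check_aws_env at the top of a script and exit early if it fails. To confirm that the key actually works, you can run aws sts get-caller-identity, which asks AWS which identity the credentials belong to.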

Ad Hoc Scripts

The first approach you might think of for managing your infrastructure as code is to use an ad hoc script. You take whatever task you were doing manually, break it down into discrete steps, and use your favorite scripting language (e.g., Bash, Ruby, Python) to capture each of those steps in code. When you run that code, it can automate the process of creating infrastructure for you. The best way to understand this is to try it out, so let’s go through an example of an ad hoc script written in Bash.

Example: Deploy an EC2 Instance Using a Bash Script

Example Code

As a reminder, you can find all the code examples in the blog post series’s sample code repo on GitHub.

As an example, let’s create a Bash script that automates all the manual steps you did in Part 1 to deploy a simple Node.js app in AWS. Head into the fundamentals-of-devops folder you created in Part 1 to work through the examples in this blog post series, and create a new subfolder for this part and the Bash script:

console
$ cd fundamentals-of-devops
$ mkdir -p ch2/bash

Copy the exact same user data script from Part 1 into a file called user-data.sh within the ch2/bash folder:

console
$ cp ch1/ec2-user-data-script/user-data.sh ch2/bash/

Next, create a Bash script called deploy-ec2-instance.sh, with the contents shown in Example 3:

Example 3. Bash script to deploy an EC2 instance (ch2/bash/deploy-ec2-instance.sh)
bash
#!/usr/bin/env bash

set -e

export AWS_DEFAULT_REGION="us-east-2"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
user_data=$(cat "$SCRIPT_DIR/user-data.sh")

# (1)
security_group_id=$(aws ec2 create-security-group \
  --group-name "sample-app" \
  --description "Allow HTTP traffic into the sample app" \
  --output text \
  --query GroupId)

# (2)
aws ec2 authorize-security-group-ingress \
  --group-id "$security_group_id" \
  --protocol tcp \
  --port 80 \
  --cidr "0.0.0.0/0" > /dev/null

# (3)
instance_id=$(aws ec2 run-instances \
  --image-id "ami-0900fe555666598a2" \
  --instance-type "t2.micro" \
  --security-group-ids "$security_group_id" \
  --user-data "$user_data" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=sample-app}]' \
  --output text \
  --query 'Instances[0].InstanceId')

public_ip=$(aws ec2 describe-instances \
  --instance-ids "$instance_id" \
  --output text \
  --query 'Reservations[*].Instances[*].PublicIpAddress')

# (4)
echo "Instance ID = $instance_id"
echo "Security Group ID = $security_group_id"
echo "Public IP = $public_ip"

If you’re not an expert in Bash syntax, all you have to know about this script is that it uses the AWS Command Line Interface (CLI) to do the following:

(1) Create a security group.
(2) Update the security group to allow inbound HTTP requests on port 80.
(3) Deploy an EC2 instance that uses that security group and runs the Node.js app on boot in a user data script.
(4) Output the IDs of the security group and EC2 instance and the public IP of the EC2 instance.

Watch out for snakes: these are simplified examples for learning, not for production

The examples in this blog post are still simplified for learning and not suitable for production usage, due to the security concerns and user data limitations explained in "Watch out for snakes: these examples have several problems." You’ll see how to work around some of these limitations starting in the next blog post.

If you want to try the script out, you’ll first need to give the script execute permissions:

console
$ cd ch2/bash
$ chmod u+x deploy-ec2-instance.sh

Next, authenticate to AWS as described in Authenticating to AWS on the command line, and run the script as follows:

console
$ ./deploy-ec2-instance.sh
Instance ID = i-0335edfebd780886f
Security Group ID = sg-09251ea2fe2ab2828
Public IP = 52.15.237.52

After the script finishes, give the EC2 instance a minute or two to boot up, and then try opening up http://<Public IP> in your web browser, where <Public IP> is the IP address the script outputs at the very end. You should see:

Hello, World!
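If you’d rather not keep refreshing the browser while the instance boots, you could poll the URL from the command line instead. The following is a minimal sketch, not part of the original script: wait_for_http is a hypothetical helper, and the default of 30 attempts spaced 10 seconds apart is an arbitrary assumption you may want to tune.

```shell
#!/usr/bin/env bash

# Hypothetical helper: poll a URL until it responds successfully, or give up.
# Arguments: URL, max attempts (default 30), seconds between attempts (default 10).
wait_for_http() {
  local url="$1"
  local max_attempts="${2:-30}"
  local delay="${3:-10}"
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent; -f: treat HTTP error responses (4xx/5xx) as failures;
    # --max-time: give up on a single request after 5 seconds
    if curl -sf --max-time 5 "$url" > /dev/null; then
      echo "Got a response from $url after $attempt attempt(s)"
      return 0
    fi

    echo "Attempt $attempt of $max_attempts failed" >&2
    # Only pause if there is another attempt still to come
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done

  echo "No response from $url after $max_attempts attempts" >&2
  return 1
}
```

For example, you could call wait_for_http "http://$public_ip" at the end of deploy-ec2-instance.sh, so the script returns only once the app is actually serving traffic.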

Congrats, you are now managing your infrastructure as code! Well, sort of. This script, and most ad hoc scripts, have quite a few drawbacks in terms of using them to manage infrastructure, as discussed in the next section.

Get your hands dirty

Here are a few exercises you can try at home to go deeper:

  • What happens if you run the Bash script a second time? Do you get an error? If so, why?

  • How would you have to tweak the script if you wanted to run multiple EC2 instances?

When you’re done experimenting with this script, you should manually undeploy the EC2 instance by finding it in the EC2 Console: check the top right corner to make sure you’re in the same region used by the script (us-east-2), then on the Instances page, look for the instance ID the script outputs at the end, click "Instance state," and choose "Terminate instance" in the drop down, as shown in Figure 19. This ensures that your account doesn’t start accumulating any unwanted charges.
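If you prefer the command line to the EC2 Console, the same cleanup can be sketched with the AWS CLI. Treat this as an illustration rather than part of the sample code: undeploy_sample_app is a hypothetical function, and it assumes you pass in the instance ID and security group ID that deploy-ec2-instance.sh printed. Unlike the console steps above, it also deletes the security group.

```shell
#!/usr/bin/env bash

# Hypothetical cleanup helper (not part of the original script).
# Arguments: the instance ID and security group ID printed by deploy-ec2-instance.sh.
undeploy_sample_app() {
  local instance_id="$1"
  local security_group_id="$2"
  local region="us-east-2"  # the region the deploy script uses

  aws ec2 terminate-instances \
    --region "$region" \
    --instance-ids "$instance_id" > /dev/null || return 1

  # Block until the instance is fully terminated: a security group cannot be
  # deleted while an instance is still using it.
  aws ec2 wait instance-terminated \
    --region "$region" \
    --instance-ids "$instance_id" || return 1

  aws ec2 delete-security-group \
    --region "$region" \
    --group-id "$security_group_id" > /dev/null || return 1

  echo "Terminated $instance_id and deleted $security_group_id"
}
```

Note that you have to supply both IDs yourself, because the deploy script keeps no record of what it created.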

How Ad Hoc Scripts Stack Up

Below is a list of criteria, which I’ll refer to as the IaC category criteria in this blog post, that you can use to compare different categories of IaC tools. In this section, I’ll flesh out how ad hoc scripts stack up according to the IaC category criteria; in later sections, you’ll see how the other IaC categories perform along the same criteria, giving you a consistent way to compare the different options.

CRUD

CRUD stands for create, read, update, and delete. To manage infrastructure as code, you typically need that code to support all four of these operations, whereas most ad hoc scripts only handle create. For example, this script can create a security group and EC2 instance, but if you run this script a second or third time, the script doesn’t know how to "read" the state of the world, so it has no awareness that the security group and EC2 instance already exist, and will always try to create new infrastructure from scratch. Likewise, this script has no built-in support for deleting any of the infrastructure it creates (which is why you had to terminate the EC2 instance manually). So while ad hoc scripts make it much faster to create infrastructure, they don’t really help you manage it.

Scale

Solving the CRUD problem in an ad hoc script for a single EC2 instance is hard enough, but a real architecture may contain hundreds of instances, plus databases, load balancers, networking configuration, and so on, and there’s no easy way to scale up scripts to keep track of and manage so much infrastructure.

Deployment strategies

In real-world architectures, you typically need to use various deployment strategies to roll out updates, such as zero-downtime rolling deployments, blue-green deployments, canary deployments, and so on (you’ll learn more about deployment strategies in Part 5). With ad hoc scripts, you’d have to write the logic for each deployment strategy from scratch.

Idempotency and error handling

To manage infrastructure, you typically want code that is idempotent, which means it can be re-run multiple times and still produce the desired result. Most ad hoc scripts are not idempotent and do not handle errors gracefully. If you hit an error part way through running this script, it just exits, leaving work in a partially completed state, but retaining no memory of what the script got done. If you then try to re-run the script, you’ll often get a different error because some of the partially completed work will now interfere with the new work the script is trying to do. For example, perhaps you ran the script the first time, and it created the security group called "sample-app" successfully, but when it tried to create the EC2 instance, AWS was out of capacity, and you got an error. If you wait until AWS has more capacity and try to re-run the script, you’ll now get an error as the script tries to create a security group called "sample-app" again, which isn’t allowed, as AWS requires security group names to be unique.

Consistency

The great thing about ad hoc scripts is that you can use any programming language you want, and you can write the code however you want. The terrible thing about ad hoc scripts is that you can use any programming language you want, and you can write the code however you want. I wrote the Bash script one way; you might write it another way; your coworker may choose a different language entirely. If you’ve ever had to maintain a large repository of ad hoc scripts, you know that it almost always devolves into a mess of unmaintainable spaghetti code. As you’ll see shortly, tools that are designed specifically for managing infrastructure as code often provide a single, idiomatic way to solve each problem, so that your codebase tends to be more consistent and easier to maintain.

Verbosity

The Bash script to launch a simple EC2 instance, plus the user data script, add up to around 80 lines of code—and that’s without the code for CRUD, deployment strategies, idempotency, and error handling. An ad hoc script that handles all of these properly would be hundreds or thousands of lines of code. And we’re talking about just one EC2 instance; your production infrastructure may include hundreds of instances, plus databases, load balancers, network configurations, and much more. The amount of custom code it takes to manage all of this with ad hoc scripts quickly becomes untenable. As you’ll see shortly, tools that are designed specifically for managing infrastructure as code typically provide APIs that are more concise for accomplishing common infrastructure tasks.
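To make the CRUD and idempotency points more concrete, here is a rough sketch of what the "read before create" logic might look like for the security group alone. It is an assumption-laden illustration, not code from the sample repo: ensure_security_group is a hypothetical name, and the sketch relies on the AWS CLI printing None in text output when a query matches nothing.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the ID of the security group with the given name,
# creating the group only if it does not already exist.
# Arguments: group name, group description.
ensure_security_group() {
  local name="$1"
  local description="$2"
  local group_id

  # "Read" step: look up an existing group by name. With --output text, the
  # CLI is expected to print "None" when the query matches nothing.
  group_id=$(aws ec2 describe-security-groups \
    --filters "Name=group-name,Values=$name" \
    --query 'SecurityGroups[0].GroupId' \
    --output text) || return 1

  if [ -z "$group_id" ] || [ "$group_id" = "None" ]; then
    # "Create" step: runs only when the lookup found nothing
    group_id=$(aws ec2 create-security-group \
      --group-name "$name" \
      --description "$description" \
      --query GroupId \
      --output text) || return 1
  fi

  echo "$group_id"
}
```

Even this sketch covers just one resource and only two of the four CRUD operations: it does not update the ingress rules, delete anything, or track the EC2 instance. That gives you a feel for how quickly a "complete" ad hoc script grows.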

Ad hoc scripts have always been, and will always be, a big part of software delivery. They are the glue and duct tape of the DevOps world. However, they are not the best choice as a primary tool for managing infrastructure as code.

Key takeaway #1

Ad hoc scripts are great for small, one-off tasks, but not for managing all your infrastructure as code.

If you’re going to be managing all of your infrastructure as code, you should use an IaC tool that is purpose-built for the job, such as one of the ones discussed in the next several sections.

Configuration Management Tools

After trying out ad hoc scripts, and hitting all the issues mentioned in the previous section, the software industry moved on to configuration management tools, such as Chef, Puppet, and Ansible. These tools first started to appear before cloud computing was ubiquitous, so the way they were originally designed was to assume someone else had done the work of setting up the hardware (e.g., your Ops team racked the servers in your own data center), and the primary purpose of these tools was to handle the software, including configuring the operating system, installing dependencies, deploying and updating apps, and so on.

Each configuration management tool has you write code in a different domain specific language (DSL): for example, with Chef, you write code in a DSL built on top of Ruby; with Puppet, you write code in a custom declarative language specifically designed for Puppet; with Ansible, you write code in a DSL built on top of YAML. Once you’ve written the code, most configuration management tools use a mutable infrastructure paradigm, where you have long-running servers that the configuration management tools update (mutate) over and over again, over many years. In order to update your servers, configuration management tools rely on the following two items:

Master servers

You run one or more master servers (Chef Server, Puppet Server, or Ansible Automation Controller[11]), which are responsible for communicating with the rest of your servers, tracking the state of those servers, and running a reconciliation loop that continuously ensures the configuration of each server matches your desired configuration. The master servers also typically provide a central UI and API that you can use to see the state of your servers, perform various operations, and generate reports.

Agents

Chef and Puppet require you to install custom agents (Chef Client and Puppet Agent) on each server, which are responsible for connecting to and authenticating with the master servers. You can configure the master servers to either push changes to these agents, or to have the agents pull changes from the master servers. Ansible, on the other hand, pushes changes to your servers over SSH, which is pre-installed on most servers by default (you’ll learn more about SSH in Part 7). Whether you rely on agents or SSH, this leads to a chicken-and-egg problem: in order to be able to configure your servers (with configuration management tools), you first have to configure your servers (install agents or set up SSH authentication). Solving this chicken-and-egg problem requires either manual intervention or external tools (e.g., you’ll see an example shortly of how you can use AWS APIs to configure SSH access for Ansible).

The best way to understand configuration management is to see it in action, so let’s go through an example of using Ansible.

Example: Deploy an EC2 Instance Using Ansible

To be able to use configuration management, the first thing you need is a server. If you have an existing server you can use—e.g., a physical server on-prem or a virtual server in the cloud—and you have SSH access to that server, you can skip this section, and go to the next one.

If you don’t have a server you can use, this section will show you how to deploy an EC2 instance using Ansible. Note that deploying and managing servers (hardware) is not really what configuration management tools were designed to do—later in this blog post, you’ll see how provisioning tools are typically a better fit for this task—but for spinning up a single server for learning and testing, Ansible is good enough.

Create a new folder called ansible:

console
$ cd fundamentals-of-devops
$ mkdir -p ch2/ansible
$ cd ch2/ansible

Inside the ansible folder, create an Ansible playbook called create_ec2_instance_playbook.yml, with the contents shown in Example 4:

Example 4. Ansible playbook to deploy an EC2 instance (ch2/ansible/create_ec2_instance_playbook.yml)
yaml
- name: Deploy an EC2 instance in AWS
  hosts: localhost
  gather_facts: no
  environment:
    AWS_REGION: us-east-2
  tasks:
    - name: Create security group                        # (1)
      amazon.aws.ec2_security_group:
        name: sample-app-ansible
        description: Allow HTTP and SSH traffic
        rules:
          - proto: tcp
            ports: [8080]
            cidr_ip: 0.0.0.0/0
          - proto: tcp
            ports: [22]
            cidr_ip: 0.0.0.0/0
      register: aws_security_group

    - name: Create a new EC2 key pair                    # (2)
      amazon.aws.ec2_key:
        name: ansible-ch2
        file_name: ansible-ch2.key                       # (3)
      no_log: true
      register: aws_ec2_key_pair

    - name: Create EC2 instance with Amazon Linux 2023   # (4)
      amazon.aws.ec2_instance:
        name: sample-app-ansible
        key_name: "{{ aws_ec2_key_pair.key.name }}"
        instance_type: t2.micro
        security_group: "{{ aws_security_group.group_id }}"
        image_id: ami-0900fe555666598a2
        tags:
          Ansible: ch2_instances                         # (5)

Instead of a general-purpose programming language (GPL), such as Bash or Ruby or Python, Ansible uses a DSL defined on top of YAML. The YAML in the preceding playbook does the following:

(1) Create a security group: Allow inbound HTTP requests on port 8080 and inbound SSH requests on port 22.
(2) Create an EC2 key pair: An EC2 key pair is a public/private key pair that can be used to authenticate to an EC2 instance.
(3) Save the private key: Store the private key of the EC2 key pair locally in a file called ansible-ch2.key. You’ll use this private key in the next section to authenticate to the EC2 instance.
(4) Deploy an EC2 instance: The instance uses the security group and public key from the previous steps.
(5) Tag the instance: This sets the Ansible tag on the instance to "ch2_instances." You’ll use this tag in the next section.

To run this Ansible playbook, install Ansible, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following:

console
$ ansible-playbook -v create_ec2_instance_playbook.yml

You should get log output that looks something like this (truncated for readability):

PLAY [Deploy an EC2 instance in AWS]

TASK [Create security group]
changed: [localhost] => {"changed": true, "description": "..."}

TASK [Create a new EC2 key pair]
changed: [localhost] => {"censored": "...", "changed": true}

TASK [Create EC2 instance with Amazon Linux 2023]
changed: [localhost] => {"changed": true, "instance_ids": ["..."]}

PLAY RECAP
localhost: ok=3    changed=3    unreachable=0    failed=0

Now that you have a server to work with, you can see what configuration management tools are really designed to do: configuring servers to run software.

Example: Configure a Server Using Ansible

In order for Ansible to be able to configure your servers, you have to provide an inventory, which is a file that specifies which servers you want configured, and how to connect to them. If you have a set of physical servers on-prem, you can put the IP addresses of those servers in an inventory file, as shown in Example 5:

Example 5. Example Ansible inventory file (inventory.yml)
yaml
webservers:
  hosts:
    10.16.10.1:
    10.16.10.2:
dbservers:
  hosts:
    10.16.20.1:
    10.16.20.2:
    10.16.20.3:

The preceding file organizes your servers into groups: the webservers group has two servers in it and the dbservers group has three servers. You’ll then be able to write Ansible playbooks that target specific groups.

If you are running servers in the cloud, where servers come and go often, and IP addresses change more frequently, you’re better off using an inventory plugin that can dynamically discover your servers. For example, if you deployed an EC2 instance in AWS in the previous section, you can use the aws_ec2 inventory plugin by creating a file called inventory.aws_ec2.yml with the contents shown in Example 6:

Example 6. Ansible inventory to discover EC2 instances (ch2/ansible/inventory.aws_ec2.yml)
yaml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-2
keyed_groups:
  - key: tags.Ansible        # (1)
    leading_separator: ''    # (2)

This code does the following:

(1) Create groups based on the Ansible tag of the instance. In the previous section, you set this tag to "ch2_instances," so that will be the name of the group.
(2) By default, Ansible adds a leading underscore to group names. This disables it so the group name matches the tag name.
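Before going further, it can be worth checking that the inventory plugin really discovers your instance and puts it in the group you expect. The sketch below wraps the ansible-inventory command that ships with Ansible. The helper name inventory_has_group is hypothetical, and the check assumes the --graph output marks each group as "@group_name:".

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if the inventory contains the named group.
# Arguments: inventory file, group name.
inventory_has_group() {
  local inventory_file="$1"
  local group="$2"

  # --graph prints the inventory as a tree, with groups shown as "@name:" and
  # their hosts nested below. grep -F matches the marker literally and -q
  # suppresses output, so only the exit status is used.
  ansible-inventory -i "$inventory_file" --graph | grep -qF "@${group}:"
}
```

For example, after authenticating to AWS, inventory_has_group inventory.aws_ec2.yml ch2_instances should succeed once the EC2 instance from the previous section is running. Running ansible-inventory -i inventory.aws_ec2.yml --graph on its own shows the full tree of groups and hosts.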

For each group in your inventory, you can also specify group variables to configure how to connect to the servers in that group. You define these variables in YAML files in the group_vars folder, with the name of the file set to the name of the group. For example, for the EC2 instance in the ch2_instances group, you should create a file in group_vars/ch2_instances.yml with the contents shown in Example 7:

Example 7. Group variables for the sample app (ch2/ansible/group_vars/ch2_instances.yml)
yaml
ansible_user: ec2-user                          # (1)
ansible_ssh_private_key_file: ansible-ch2.key   # (2)
ansible_host_key_checking: false                # (3)

The variables this file defines are:

(1) Use "ec2-user" as the username to connect to the EC2 instance. This is the username you need to use with Amazon Linux AMIs.
(2) Use the private key at ansible-ch2.key to authenticate to the instance. This is the private key of the EC2 key pair the playbook created in the previous section.
(3) Skip host key checking so you don’t get interactive prompts from Ansible.
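With the inventory and group variables in place, one way to confirm that Ansible can actually reach the servers is an ad hoc run of its built-in ping module. The wrapper below is only a sketch: ping_group is a hypothetical name, and its defaults assume the group and inventory file names used in this blog post.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run Ansible's built-in ping module against a group.
# Arguments: group name (default ch2_instances),
#            inventory file (default inventory.aws_ec2.yml).
ping_group() {
  local group="${1:-ch2_instances}"
  local inventory_file="${2:-inventory.aws_ec2.yml}"

  # The ping module connects to each host (over SSH here) and checks for a
  # usable Python on the target; it does not send ICMP packets.
  ansible "$group" -i "$inventory_file" -m ping
}
```

If the username, private key, or security group rules are wrong, this is likely to fail with a connection error, which is usually easier to debug here than halfway through a playbook run.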

Alright, with the inventory stuff out of the way, you can now create a playbook to configure your server to run the Node.js sample app. Create a file called configure_sample_app_playbook.yml with the contents shown in Example 8:

Example 8. The playbook to configure the server to run the sample app (ch2/ansible/configure_sample_app_playbook.yml)
yaml
- name: Configure the EC2 instance to run a sample app
  hosts: ch2_instances    # (1)
  gather_facts: true
  become: true
  roles:
    - sample-app          # (2)

This playbook does two things:

(1) Target the servers in the ch2_instances group, which should be a group with the EC2 instance you deployed in the previous section. If you are configuring some other server (e.g., your own servers on-prem), update this to the name of the group to target in your inventory file.
(2) Configure the servers using an Ansible role called sample-app, as discussed next.

An Ansible role is a structured way to organize tasks, templates, files, and other configuration you might want to apply to a server. The standard folder structure for Ansible roles looks like this:

roles
  └── <role-name>
      ├── defaults
      │   └── main.yml
      ├── files
      │   └── foo.txt
      ├── handlers
      │   └── main.yml
      ├── tasks
      │   └── main.yml
      ├── templates
      │   └── foo.txt.j2
      └── vars
          └── main.yml

Each folder has a specific purpose: e.g., the tasks folder defines tasks to run on a server; the files folder has files to copy to the server; the templates folder lets you use Jinja templates to dynamically fill in data in files; and so on. Having this standardized structure makes it easier to navigate and understand an Ansible code base.
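If you find yourself creating roles often, the folder skeleton is easy to script. Here is a minimal sketch. The function scaffold_role is hypothetical, and it creates only the folders shown above, not the files inside them. Ansible also ships a command, ansible-galaxy role init, that generates a similar, slightly larger skeleton.

```shell
#!/usr/bin/env bash

# Hypothetical helper: create the standard folder skeleton for an Ansible role
# under ./roles. Only the folders are created; the files inside are up to you.
# Arguments: role name.
scaffold_role() {
  local role_name="$1"
  local dir

  for dir in defaults files handlers tasks templates vars; do
    # -p creates missing parent folders and does nothing if the folder exists,
    # so it is safe to re-run
    mkdir -p "roles/$role_name/$dir" || return 1
  done
}
```

Because mkdir -p is a no-op for folders that already exist, running scaffold_role a second time for the same role changes nothing.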

To create the sample-app role for this playbook, create a roles/sample-app folder in the same directory as configure_sample_app_playbook.yml:

.
├── configure_sample_app_playbook.yml
├── group_vars
├── inventory.aws_ec2.yml
└── roles
    └── sample-app
        ├── files
        │   └── app.js
        └── tasks
            └── main.yml

Within roles/sample-app, you should create files and tasks subfolders, which are the only parts of the standardized role folder structure you’ll need for this simple example. Copy the Node.js sample app you saw earlier in Section 1.2.1 into files/app.js:

console
$ cp ../../ch1/sample-app/app.js roles/sample-app/files/

Next, create tasks/main.yml with the code shown in Example 9:

Example 9. Ansible role to run the Node.js sample app (ch2/ansible/roles/sample-app/tasks/main.yml)
yaml
- name: Add Node packages to yum    # (1)
  shell: curl -fsSL https://rpm.nodesource.com/setup_21.x | bash -

- name: Install Node.js
  yum:
    name: nodejs

- name: Copy sample app             # (2)
  copy:
    src: app.js
    dest: app.js

- name: Start sample app            # (3)
  shell: nohup node app.js &

This code does the following:

(1) Install Node.js: Use the shell module to run a command on the server to add Node packages to yum, and then use the yum module to install Node.js.
(2) Copy the sample app: Use the copy module to copy app.js to the server.
(3) Start the sample app: Use the shell module to execute the node binary to run the app in the background.

To run this playbook, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following command:

console
$ ansible-playbook -v -i inventory.aws_ec2.yml configure_sample_app_playbook.yml

You should get log output for each step, including a recap at the end that looks something like this:

PLAY RECAP
xxx.us-east-2.compute.amazonaws.com : ok=5    changed=4    failed=0

The value on the left, "xxx.us-east-2.compute.amazonaws.com," is a domain name you can use to access the instance. Open http://xxx.us-east-2.compute.amazonaws.com:8080 (note it’s port 8080 this time, not 80) in your web browser, and you should see:

Hello, World!

Congrats, you’re now using a configuration management tool to manage your infrastructure as code!

Get your hands dirty

Here are a few exercises you can try at home to go deeper:

  • What happens if you run the Ansible playbook a second time? How does this compare to the Bash script?

  • How would you have to change the playbook to run multiple EC2 instances?

When you’re done experimenting with Ansible, you should manually undeploy the EC2 instance by finding it in the EC2 Console: check the top right corner to make sure you’re in the same region used by the Ansible playbook (us-east-2), then on the Instances page, look for the instance named sample-app-ansible (its ID also appears in the log output of the playbook that created it), click "Instance state," and choose "Terminate instance" in the drop down, as shown in Figure 19. This ensures that your account doesn’t start accumulating any unwanted charges.

How Configuration Management Tools Stack Up

Here is how configuration management tools stack up using the IaC category criteria:

CRUD

Most configuration management tools support three of the four CRUD operations: they can create the initial configuration, read the current configuration to see if it matches the desired configuration, and if not, update the existing configuration. That said, support for read and update is a bit hit or miss. It works well for reading and updating the configuration within a server (if you use tasks that are idempotent, as you’ll see shortly), but for managing the servers themselves, or any other type of cloud infrastructure, it only works if you remember to assign each piece of infrastructure a unique name or tag, which is easy to do with just a handful of resources, but becomes more challenging at scale. Most configuration management tools do not support delete (which is why you had to undeploy the EC2 instance manually).

Scale

Most configuration management tools are designed specifically for managing multiple remote servers. For example, you could easily update the preceding Ansible code to deploy 3 EC2 instances, and Ansible will automatically configure all 3 to run the web server (you’ll see an example of this in Part 3).

Deployment strategies

Some configuration management tools have built-in support for deployment strategies. For example, Ansible has built-in support for rolling deployments, so if you deployed 20 servers, then updated the configuration in the Ansible role (e.g., to deploy a new version of the app) and re-ran Ansible, it could roll out the change in batches (e.g., updating 5 servers at a time), with zero downtime.

Idempotency and error handling

Some tasks you do with configuration management tools are idempotent, some are not. For example, the yum task in Ansible is idempotent: it only installs the software if it’s not installed already, so it’s safe to re-run that task as many times as you want. On the other hand, arbitrary shell tasks may or may not be idempotent, depending on what shell commands you execute. For example, the preceding playbook uses a shell task to directly execute the node binary, which is not idempotent. After the first run, subsequent runs of this playbook will fail, as the Node.js app is already running and listening on port 8080, so you’ll get an error about conflicting ports. In Part 3, you’ll see a better way of running apps with Ansible that is idempotent.

Consistency

Most configuration management tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on. While every developer organizes their ad hoc scripts in a different way, most configuration management tools come with a set of conventions that makes it easier to navigate and maintain the code, as you saw with the folder structure for Ansible roles.

Verbosity

Most configuration management tools provide a DSL for specifying server configuration that is more concise than the equivalent in an ad hoc script. For example, you saw Ansible’s YAML-based DSL. At first, it might not seem like the code is any shorter than the Bash script: in fact, it’s roughly equal, with around 80 lines of Bash code (script to deploy EC2 instance plus user data script) versus about 80 lines of YAML with Ansible (playbook plus role). However, the 80 lines of Ansible code are doing considerably more: the Ansible code supports most CRUD operations, deployment strategies, idempotency, scaling operations to many servers, and consistent code structure. An ad hoc script that supported all of this would be many times the length.

Configuration management tools brought a number of advantages over ad hoc scripts, but they also introduced their own drawbacks. One big drawback is that some configuration management tools have a considerable setup cost: e.g., you may need to set up master servers and ways to connect to all your other servers (agents or SSH). A second big drawback is that most configuration management tools were designed for a mutable infrastructure paradigm: this can be problematic due to configuration drift, where over time, your long-running servers can build up unique histories of changes, so each server is subtly different from the others, which can make it hard to reason about what’s deployed and debug issues.

As cloud and virtualization become more and more ubiquitous, it’s becoming more common to use an immutable infrastructure paradigm, where instead of long-running physical servers, you use short-lived virtual servers that you replace every time you do an update. This is inspired by functional programming, where variables are immutable, so after you’ve set a variable to a value, you can never change that variable again, and if you need to update something, you create a new variable. Because variables never change, it’s a lot easier to reason about your code.

The idea behind immutable infrastructure is similar: once you’ve deployed a server, you never make changes to it again. If you need to update something, such as deploying a new version of your code, you deploy a new server. Because servers never change after being deployed, it’s a lot easier to reason about what’s deployed. The typical analogy used here (my apologies to vegetarians and animal lovers) is cattle vs. pets: with mutable infrastructure, you treat your servers like pets, giving each one its own unique name, taking care of it, and trying to keep it alive as long as possible; with immutable infrastructure, you treat your servers like cattle, each one more or less indistinguishable from the others, with random or sequential IDs instead of names, and you kill them off and replace them regularly.

Key takeaway #2

Configuration management tools are great for managing the configuration of servers, but not for deploying the servers themselves, or other infrastructure.

While it’s possible to use configuration management tools with immutable infrastructure patterns, it’s not what they were originally designed for, and that led to new approaches, as discussed in the next section.

Server Templating Tools

An alternative to configuration management that has been growing in popularity recently is to use server templating tools, such as virtual machines and containers. Instead of launching a bunch of servers and configuring them by running the same code on each one, the idea behind server templating tools is to create an image of a server that captures a fully self-contained "snapshot" of the operating system (OS), the software, the files, and all other relevant details. You can then use some other IaC tool (e.g., provisioning tools, as you’ll see in the next section) to install that image on all of your servers.

Figure 21. The two main types of images: VMs, on the left, and containers, on the right. VMs virtualize the hardware, whereas containers virtualize only user space.

As shown in Figure 21, there are two categories of tools for working with images:

Virtual machines

A virtual machine emulates an entire computer system, including the hardware. You run a hypervisor, such as VMware vSphere, VirtualBox, or Parallels, to virtualize (i.e., simulate) the underlying CPU, memory, hard drive, and networking. The benefit of this is that any VM image that you run on top of the hypervisor can see only the virtualized hardware, so it’s fully isolated from the host machine and any other VM images, and it will run exactly the same way in all environments (e.g., your computer, a QA server, a production server). The drawback is that virtualizing all this hardware and running a totally separate OS for each VM incurs a lot of overhead in terms of CPU usage, memory usage, and startup time. You can define VM images as code using tools such as Packer (which you typically use to create images for production servers) and Vagrant (which you typically use to create images for local development).

Containers

A container emulates the user space of an OS.[12] You run a container engine, such as Docker or cri-o, to isolate processes, memory, mount points, and networking. The benefit of this is that any container you run on top of the container engine can see only its own user space, so it’s isolated from the host machine and other containers, and will run exactly the same way in all environments (your computer, a QA server, a production server, etc.). The drawback is that all of the containers running on a single server share that server’s OS kernel and hardware, so it’s much more difficult to achieve the level of isolation and security you get with a VM.[13] However, because the kernel and hardware are shared, your containers can boot up in milliseconds and have virtually no CPU or memory overhead. You can define container images as code using tools such as Docker.

You’ll go through an example of using container images with Docker in Part 3. In this blog post, let’s go through an example of using VM images with Packer.

Example: Create a VM Image Using Packer

As an example, let’s take a look at using Packer to create a VM image for AWS called an Amazon Machine Image (AMI). First, create a folder called packer:

console
$ cd fundamentals-of-devops
$ mkdir -p ch2/packer
$ cd ch2/packer

Next, copy the Node.js sample app you saw earlier in Section 1.2.1 into the packer folder:

console
$ cp ../../ch1/sample-app/app.js .

Create a Packer template called sample-app.pkr.hcl, with the contents shown in Example 10:

Example 10. Packer template to create AMI (ch2/packer/sample-app.pkr.hcl)
hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.3.1"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

source "amazon-ebs" "amazon_linux" { # (1)
  ami_name        = "sample-app-packer-${uuidv4()}"
  ami_description = "Amazon Linux 2023 AMI with a Node.js sample app."
  instance_type   = "t2.micro"
  region          = "us-east-2"
  source_ami      = "ami-0900fe555666598a2"
  ssh_username    = "ec2-user"
}

build { # (2)
  sources = ["source.amazon-ebs.amazon_linux"]

  provisioner "file" { # (3)
    source      = "app.js"
    destination = "/home/ec2-user/app.js"
  }

  provisioner "shell" { # (4)
    inline = [
      "curl -fsSL https://rpm.nodesource.com/setup_21.x | sudo bash -",
      "sudo yum install -y nodejs"
    ]
    pause_before = "30s"
  }
}

You create Packer templates using the HashiCorp Configuration Language (HCL) in files with a .pkr.hcl extension. The preceding template does the following:

(1) Source images: Packer will start a server running each source image you specify. The preceding code will result in Packer starting an EC2 instance running the Amazon Linux AMI you saw in the Bash and Ansible examples.
(2) Build steps: Packer then connects to the server (e.g., via SSH) and runs the build steps in the order you specified. When all the build steps have finished, Packer will take a snapshot of the server and shut the server down. The preceding example runs two build steps, as described in (3) and (4), and the snapshot it creates is an AMI that has everything installed and configured to run the sample app.
(3) File provisioner: The first build step runs a file provisioner to copy files to the server. The preceding code uses this to copy the Node.js sample app code in app.js to the server.
(4) Shell provisioner: The second build step runs a shell provisioner to execute shell commands on the server. The preceding code uses this to install Node.js.
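
The source block in Example 10 pins source_ami to one specific AMI ID, and AMI IDs differ from region to region and are superseded as new images are released. As a rough sketch of an alternative, the amazon-ebs builder also accepts a source_ami_filter block that looks up the AMI at build time. The name pattern below is an assumption about how Amazon names its Amazon Linux 2023 images, so check it against the AMIs listed in your account's EC2 Console before relying on it:

```hcl
# Sketch only: a drop-in replacement for the source block in Example 10.
source "amazon-ebs" "amazon_linux" {
  ami_name        = "sample-app-packer-${uuidv4()}"
  ami_description = "Amazon Linux 2023 AMI with a Node.js sample app."
  instance_type   = "t2.micro"
  region          = "us-east-2"
  ssh_username    = "ec2-user"

  # Look up the base AMI at build time instead of hard-coding source_ami.
  source_ami_filter {
    filters = {
      # Assumed naming pattern for Amazon Linux 2023 x86_64 images.
      name                = "al2023-ami-2023.*-x86_64"
      virtualization-type = "hvm"
      root-device-type    = "ebs"
    }
    owners      = ["amazon"] # only consider images published by Amazon
    most_recent = true       # pick the newest image that matches
  }
}
```

The trade-off is reproducibility: a filter picks up new base images automatically, whereas a hard-coded ID builds from exactly the same base every time. The rest of this post sticks with the hard-coded source_ami from Example 10.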

So this Packer template is nearly identical to the Bash script and Ansible playbook, except the result of executing Packer is not a server running your app, but the image of a server with your app and all its dependencies installed. The idea is to use other IaC tools to launch one or more servers running that image; you’ll see an example later in this blog post of using OpenTofu to launch an EC2 instance running this AMI.

If you want to try the Packer template out, install Packer, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following commands:

console
$ packer init sample-app.pkr.hcl
$ packer build sample-app.pkr.hcl

The first command, packer init, installs any plugins used in this Packer template. Packer can create images for many cloud providers (e.g., AWS, GCP, Azure), and the code for each of these providers lives not in the Packer binary itself, but in separate plugins that you install via the init command. The second command, packer build, kicks off the build process. When the build is done, which typically takes 3-5 minutes, you should see some log output that looks like this:

==> Builds finished. The artifacts of successful builds are:

--> amazon-ebs.amazon_linux: AMIs were created:

us-east-2: ami-0ee5157dd67ca79fc

Congrats, you’re now using a server templating tool to manage your server configuration as code! The ami-xxx value is the ID of the AMI that was created from this template. Save the value somewhere, as later in this post, you’ll see an example of how to deploy this AMI.

Get your hands dirty

Here are a few exercises you can try at home to go deeper:

  • What happens if you run packer build on this template a second time? Why?

  • Figure out how to update the Packer template so it builds images not only for AWS, but also images you can run on other clouds (e.g., Azure or GCP) or on your own computer (e.g., VirtualBox or Docker).

How Server Templating Tools Stack Up

How do server templating tools stack up using the IaC category criteria?

CRUD

Server templating only needs to support the create operation in CRUD. This is because server templating is a key component of the shift to immutable infrastructure: if you need to roll out a change, instead of updating an existing server, you use your server templating tool to create a new image, and deploy that image on a new server. So, with server templating, you’re always creating totally new images; there’s never a reason to read, update, or delete. That said, server templating tools aren’t used in isolation; you need some other tool to deploy these images (e.g., a provisioning tool, as you’ll see shortly), and you typically want that tool to support all CRUD operations.

Scale

Server templating tools scale very well, as you can create an image once, and then roll that same image out to 1 server or 1,000 servers, as necessary.

Deployment strategies

Server templating tools only create images; you use other tools and whatever deployment strategies those tools support to roll the new images out.

Idempotency and error handling

Server templating tools are idempotent by design. Since you create a new image every time, the tool just executes the exact same steps every time. If you hit an error part of the way through, just re-run, and try again.

Consistency

Most server templating tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on.

Verbosity

Because server templating tools don’t have to deal with most CRUD operations and are idempotent "for free," the amount of code you need is typically pretty small. Moreover, server templating tools provide concise DSLs. As a result, the code tends to be fairly short.

Key takeaway #3

Server templating tools are great for managing the configuration of servers with immutable infrastructure practices.

As I mentioned a few times, server templating tools are powerful, but they don’t work by themselves. You need another tool to actually deploy and manage the images you create, such as provisioning tools, which are the focus of the next section.

Provisioning Tools

Whereas configuration management and server templating define the code that runs on each server, provisioning tools such as OpenTofu, Terraform, CloudFormation, OpenStack Heat, and Pulumi are responsible for creating the servers themselves. In fact, you can use provisioning tools to create not only servers but also databases, caches, load balancers, queues, monitoring, subnet configurations, firewall settings, routing rules, TLS certificates, and many other aspects of your infrastructure.

Under the hood, most provisioning tools work by translating the code you write into API calls to the cloud provider you’re using. For example, if you write OpenTofu code to create a server in AWS (which you will do in the next section), when you run OpenTofu, it will parse your code, and based on the configuration you specify, make a number of API calls to AWS to create an EC2 instance, security group, etc.

That means that, unlike with configuration management tools, you don’t have to do any extra work to set up master servers or connectivity. All of this is handled using the APIs and authentication mechanisms already provided by the cloud you’re using. Let’s see this in action by going through an example with OpenTofu.

Example: Deploy an EC2 Instance Using OpenTofu

Terraform versus OpenTofu

Terraform is a popular provisioning tool that HashiCorp open sourced in 2014 under the Mozilla Public License (MPL) 2.0. In 2023, HashiCorp switched Terraform to the non-open source Business Source License (BSL). As a result, the community created OpenTofu, a fork of Terraform that remains open source under the MPL 2.0 license, and is managed by the Linux Foundation. I prefer to use open source tools whenever possible, so this blog post series will use OpenTofu for all example code, but most of the examples should work with Terraform as well.

As an example of using a provisioning tool, let’s create an OpenTofu module that can deploy an EC2 instance. You write OpenTofu modules in HCL (the same language you used with Packer), in configuration files with a .tf extension. OpenTofu will find all files with the .tf extension in a folder, so you can name the files whatever you want, but there are some standard conventions, such as putting the main resources in main.tf, input variables in variables.tf, and output variables in outputs.tf.

First, create a new tofu/ec2-instance folder for the module:

console
$ cd fundamentals-of-devops
$ mkdir -p ch2/tofu/ec2-instance
$ cd ch2/tofu/ec2-instance

Within the tofu/ec2-instance folder, create a file called main.tf, with the contents shown in Example 11:

Example 11. OpenTofu module to deploy an EC2 instance (ch2/tofu/ec2-instance/main.tf)
terraform
provider "aws" { # (1)
  region = "us-east-2"
}

resource "aws_security_group" "sample_app" { # (2)
  name        = "sample-app-tofu"
  description = "Allow HTTP traffic into the sample app"
}

resource "aws_security_group_rule" "allow_http_inbound" { # (3)
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 8080
  to_port           = 8080
  security_group_id = aws_security_group.sample_app.id
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_instance" "sample_app" { # (4)
  ami                    = var.ami_id # (5)
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.sample_app.id]
  user_data              = file("${path.module}/user-data.sh") # (6)

  tags = {
    Name = "sample-app-tofu"
  }
}

The code in main.tf does something very similar to the Bash script and Ansible playbook from earlier in the blog post:

(1) Configure the AWS provider: OpenTofu works with many providers, such as AWS, GCP, Azure, and so on. This code configures the AWS provider to use the us-east-2 (Ohio) region. AWS has datacenters all over the world, grouped into regions and availability zones. An AWS region is a separate geographic area, such as us-east-2 (Ohio), eu-west-1 (Ireland), and ap-southeast-2 (Sydney). Within each region, there are multiple isolated datacenters known as availability zones (AZs), such as us-east-2a, us-east-2b, and so on.
(2) Create a security group: For each type of provider, there are many different kinds of resources that you can create, such as servers, databases, and load balancers. The general syntax for creating a resource in OpenTofu is as follows:
terraform
resource "<PROVIDER>_<TYPE>" "<NAME>" {
  [CONFIG ...]
}

where PROVIDER is the name of a provider (e.g., aws), TYPE is the type of resource to create in that provider (e.g., security_group), NAME is an identifier you can use throughout the OpenTofu code to refer to this resource (e.g., sample_app), and CONFIG consists of one or more arguments that are specific to that resource.

The preceding code creates an aws_security_group resource, which, as you can guess from the name, is a security group that controls what network traffic can go in and out of the EC2 instance.

(3) Allow HTTP requests: Use the aws_security_group_rule resource to add a rule to the security group from (2) that allows inbound HTTP requests on port 8080.
(4) Deploy an EC2 instance: Use the aws_instance resource to create an EC2 instance that uses the security group and sets the Name tag to "sample-app-tofu."
(5) Set the AMI: The EC2 instance sets the AMI to var.ami_id. This is a reference to an input variable defined in variables.tf, as shown in Example 12.
(6) Set the user data: The EC2 instance configures a user data script by reading in the user-data.sh file shown in Example 13.

In the same folder as main.tf, create a file called variables.tf to define input variables, as shown in Example 12:

Example 12. Input variables for the OpenTofu module (ch2/tofu/ec2-instance/variables.tf)
terraform
variable "ami_id" {
  description = "The ID of the AMI to run."
  type        = string
}

As you’ll see shortly, this input variable will allow you to pass in the ID of a custom AMI to run in the EC2 instance: namely, the AMI you built from the Packer template in the previous section. You should also create a file called user-data.sh, which contains the user data script shown in Example 13:

Example 13. User data script (ch2/tofu/ec2-instance/user-data.sh)
bash
#!/usr/bin/env bash
nohup node /home/ec2-user/app.js &

Note how this user data script is much shorter than the one you saw in the Bash code. That’s because all the dependencies (Node.js) and code (app.js) are already installed in the AMI by Packer. So the only thing this user data script does is start the sample app. This is a more idiomatic way to use user data.

Finally, create a file called outputs.tf with the contents shown in Example 14:

Example 14. Output variables (ch2/tofu/ec2-instance/outputs.tf)
terraform
output "instance_id" {
  description = "The ID of the EC2 instance"
  value       = aws_instance.sample_app.id
}

output "security_group_id" {
  description = "The ID of the security group"
  value       = aws_security_group.sample_app.id
}

output "public_ip" {
  description = "The public IP of the EC2 instance"
  value       = aws_instance.sample_app.public_ip
}

Output variables let you log values and share them between modules. The preceding code defines output variables for the EC2 instance ID, security group ID, and EC2 instance public IP.

If you want to try the OpenTofu code out, install OpenTofu, authenticate to AWS as described in Authenticating to AWS on the command line, and run the following command:

console
$ tofu init

The init command installs any providers used in this OpenTofu configuration. OpenTofu works with many cloud providers, including AWS, which is used in the preceding example, as well as Azure, GCP, Alibaba Cloud, OCI, and so on. The code for each provider doesn’t live in the tofu binary, but in separate provider binaries, which you download via the init command.

Once init has completed, run the apply command to start the deployment process:

console
$ tofu apply

The first thing the apply command will do is prompt you for the ami_id value:

var.ami_id

  The ID of the AMI to run.



  Enter a value:

You can paste in the ID of the AMI you built using Packer in the previous section and hit Enter. Alternatively, if you don’t want to be prompted interactively, you can instead use the -var flag when running apply:

console
$ tofu apply -var ami_id=<YOUR_AMI_ID>

You can also set the value for any input variable foo using the environment variable TF_VAR_foo:

console
$ export TF_VAR_ami_id=<YOUR_AMI_ID>
$ tofu apply
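
A third option, sketched here as one possible way to organize things rather than something the rest of this post relies on, is a variable definitions file. OpenTofu automatically loads a file named terraform.tfvars from the folder where you run apply, so a hypothetical ch2/tofu/ec2-instance/terraform.tfvars could look like this:

```hcl
# terraform.tfvars (hypothetical file; not part of the example code in this post)
# Replace the value with the ID of the AMI you built with Packer.
ami_id = "ami-0ee5157dd67ca79fc"
```

With that file in place, a plain tofu apply should pick up the value without prompting. Unlike the -var flag or an environment variable, the value lives in a file you can review and commit alongside the module.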

The second thing the apply command will do is show you the execution plan (just plan for short), which will look something like this (truncated for readability):

OpenTofu will perform the following actions:



  # aws_instance.sample_app will be created

  + resource "aws_instance" "sample_app" {

      + ami                                  = "ami-0ee5157dd67ca79fc"

      + instance_type                        = "t2.micro"

      ... (truncated) ...

    }



  # aws_security_group.sample_app will be created

  + resource "aws_security_group" "sample_app" {

      + description            = "Allow HTTP traffic into the sample app"

      + name                   = "sample-app-tofu"

      ... (truncated) ...

    }



  # aws_security_group_rule.allow_http_inbound will be created

  + resource "aws_security_group_rule" "allow_http_inbound" {

      + from_port                = 8080

      + protocol                 = "tcp"

      + to_port                  = 8080

      + type                     = "ingress"

      ... (truncated) ...

    }



Plan: 3 to add, 0 to change, 0 to destroy.



Changes to Outputs:

  + instance_id       = (known after apply)

  + public_ip         = (known after apply)

  + security_group_id = (known after apply)

The plan lets you see what OpenTofu will do before actually making any changes, and prompts you for confirmation before continuing. This is a great way to sanity-check your code before unleashing it onto the world. The plan output is similar to the output of the diff command that is part of Unix, Linux, and git: anything with a plus sign (+) will be created, anything with a minus sign (-) will be deleted, and anything with a tilde sign (~) will be modified in place. Every time you run apply, OpenTofu will show you this execution plan; you can also generate the execution plan without applying any changes by running tofu plan instead of tofu apply.

In the preceding plan output, you can see that OpenTofu is planning on creating an EC2 instance, security group, and security group rule, which is exactly what you want. Type yes and hit Enter to let OpenTofu proceed. You should see log output that looks like this:

Do you want to perform these actions?

  OpenTofu will perform the actions described above.

  Only 'yes' will be accepted to approve.



  Enter a value: yes



aws_security_group.sample_app: Creating...

aws_security_group.sample_app: Creation complete after 2s

aws_security_group_rule.allow_http_inbound: Creating...

aws_security_group_rule.allow_http_inbound: Creation complete after 0s

aws_instance.sample_app: Creating...

aws_instance.sample_app: Still creating... [10s elapsed]

aws_instance.sample_app: Still creating... [20s elapsed]

aws_instance.sample_app: Creation complete after 22s



Apply complete! Resources: 3 added, 0 changed, 0 destroyed.



Outputs:



instance_id = "i-0a4c593f4c9e645f8"

public_ip = "3.138.110.216"

security_group_id = "sg-087227914c9b3aa1e"

You can see the three output variables from outputs.tf at the end, including the public IP address in public_ip. Wait a minute or two for the EC2 instance to boot up, copy the public_ip, open http://<public_ip>:8080 in your web browser, and you should see:

Hello, World!

Congrats, you’re using a provisioning tool to manage your infrastructure as code!

Example: Update and Destroy Infrastructure Using OpenTofu

One of the big advantages of provisioning tools is that they support not just deploying infrastructure, but also updating and destroying it. For example, now that you’ve deployed an EC2 instance using OpenTofu, make a change to the configuration, such as adding a new Test tag with the value "update," as shown in Example 15:

Example 15. Update the tags on the EC2 instance (ch2/tofu/ec2-instance/main.tf)
terraform
resource "aws_instance" "sample_app" {

  # ... (other params omitted) ...

  tags = {
    Name = "sample-app-tofu"
    Test = "update"
  }
}

Run the apply command again, and you should see output that looks like this (truncated for readability):

console
$ tofu apply

aws_security_group.sample_app: Refreshing state...
aws_security_group_rule.allow_http_inbound: Refreshing state...
aws_instance.sample_app: Refreshing state...

OpenTofu used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:
  ~ update in-place

OpenTofu will perform the following actions:

  # aws_instance.sample_app will be updated in-place
  ~ resource "aws_instance" "sample_app" {
        id   = "i-0738de27643533e98"
      ~ tags = {
            "Name" = "sample-app-tofu"
          + "Test" = "update"
        }
        # (31 unchanged attributes hidden)
        # (8 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Every time you run OpenTofu, it records information about what infrastructure it created in an OpenTofu state file. OpenTofu manages state using backends; if you don’t specify a backend, the default is to use the local backend, which stores state locally in a terraform.tfstate file in the same folder as the OpenTofu module (you’ll see how to use other backends in Part 5). This file contains a custom JSON format that records a mapping from the OpenTofu resources in your configuration files to the representation of those resources in the real world.
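
To make the default concrete, here is a minimal sketch of what spelling out the local backend explicitly might look like. You don't need this block for the examples in this post, since OpenTofu behaves this way when no backend is configured; the path shown is simply the default file name:

```hcl
terraform {
  # Spells out the default behavior: keep state in a local file
  # in the same folder as the module.
  backend "local" {
    path = "terraform.tfstate"
  }
}
```

Note that if you add or change a backend block in a module you have already initialized, you should expect to re-run tofu init before the next apply.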

When you run apply the first time on the ec2-instance module, OpenTofu records in the state file the IDs of the EC2 instance, security group, security group rules, and any other resources it created. When you run apply again, you can see "Refreshing state" in the log output, which is OpenTofu updating itself on the latest status of the world. As a result, the new plan output that you see is the diff between what’s currently deployed in the real world and what’s in your OpenTofu code. The preceding diff shows that OpenTofu wants to create a single tag called Test, which is exactly what you want, so type yes and hit Enter, and you’ll see OpenTofu perform an update operation, updating the EC2 instance with your new tag.

When you’re done testing, you can run tofu destroy to have OpenTofu undeploy everything it deployed earlier, which should give you log output that looks something like this (log output truncated for readability):

console
$ tofu destroy

OpenTofu will perform the following actions:

  # aws_instance.sample_app will be destroyed
  - resource "aws_instance" "sample_app" {
      - ami                         = "ami-0ee5157dd67ca79fc" -> null
      - associate_public_ip_address = true -> null
      - id                          = "i-0738de27643533e98" -> null
      ... (truncated) ...
    }

  # aws_security_group.sample_app will be destroyed
  - resource "aws_security_group" "sample_app" {
      - id = "sg-066de0b621838841a" -> null
      ... (truncated) ...
    }

  # aws_security_group_rule.allow_http_inbound will be destroyed
  - resource "aws_security_group_rule" "allow_http_inbound" {
      - from_port = 8080 -> null
      - protocol  = "tcp" -> null
      - to_port   = 8080 -> null
      ... (truncated) ...
    }

Plan: 0 to add, 0 to change, 3 to destroy.

Changes to Outputs:
  - instance_id       = "i-0738de27643533e98" -> null
  - public_ip         = "18.188.174.48" -> null
  - security_group_id = "sg-066de0b621838841a" -> null

When you run destroy, OpenTofu shows you a destroy plan, which tells you about all the resources it’s about to delete. This gives you one last chance to check that you really want to delete this stuff before you actually do it. It goes without saying that you should rarely, if ever, run destroy in a production environment—there’s no "undo" for the destroy command. If everything looks good, type yes and hit Enter, and in a minute or two, OpenTofu will clean up everything it deployed.
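
For resources you really don't want to lose, one guard rail worth knowing about is the prevent_destroy setting in a lifecycle block. The following is a sketch of how it might be applied to the EC2 instance from Example 11; whether it's appropriate depends on your environment, and it is not something the example code in this post uses:

```hcl
resource "aws_instance" "sample_app" {
  ami                    = var.ami_id
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.sample_app.id]
  user_data              = file("${path.module}/user-data.sh")

  tags = {
    Name = "sample-app-tofu"
  }

  # Guard rail: any plan that would destroy this resource fails with an error.
  lifecycle {
    prevent_destroy = true
  }
}
```

With this in place, tofu destroy stops with an error instead of deleting the instance, and you have to remove the setting before you can tear the resource down. Leave it out while you're working through this post, or the clean-up step above won't work.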

Get your hands dirty

Here are a few exercises you can try at home to go deeper:

  • How would you have to tweak the OpenTofu code if you wanted to run multiple EC2 instances?

  • Figure out how to configure the EC2 instance with an EC2 key pair so you can connect to it over SSH.

Example: Deploy an EC2 Instance Using an OpenTofu Module

One of OpenTofu’s more powerful features is that the modules are reusable. In a general purpose programming language (e.g., JavaScript, Python, Java), you put reusable code in a function; in OpenTofu, you put reusable code in a module. You can then use that module multiple times to spin up many copies of the same infrastructure, without having to copy/paste the code.

So far, you’ve been using the ec2-instance module as a root module, which is any module on which you run apply directly. However, you can also use it as a reusable module, which is a module meant to be included in other modules (e.g., in other root modules) as a means of code re-use.

Let’s give it a shot. First, create a folder called modules to store your reusable modules:

console
$ cd fundamentals-of-devops
$ mkdir -p ch2/tofu/modules

Next, move the ec2-instance module into the modules folder:

console
$ mv ch2/tofu/ec2-instance ch2/tofu/modules/ec2-instance

Create a folder called live to store your root modules, as these modules configure your live environments:

console
$ mkdir -p ch2/tofu/live

Inside the live folder, create a new folder called sample-app, which will house the new root module you’ll use to deploy the sample app:

console
$ mkdir -p ch2/tofu/live/sample-app
$ cd ch2/tofu/live/sample-app

In the live/sample-app folder, create a main.tf file with the initial contents shown in Example 16:

Example 16. Basic module usage (ch2/tofu/live/sample-app/main.tf)
terraform
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}

To use one module from another, all you need is the following:

  • A module block.

  • A source parameter that contains the file path of the module you want to use. The preceding code sets source to the relative file path of the ec2-instance module in the modules folder.

  • If the module defines input variables, you can set those as parameters within the module block. The ec2-instance module defines an input variable called ami_id, which you’ll need to set to the ID of the AMI you built in the server templating section earlier in this blog post.

If you were to run apply on this code, it would use the ec2-instance module code to create a single EC2 instance. But the beauty of code reuse is that you can use the module multiple times, as shown in Example 17:

Example 17. Using a module multiple times (ch2/tofu/live/sample-app/main.tf)
terraform
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
}

This code has two module blocks, so if you run apply on it, it will create two EC2 instances. If you had three module blocks, it would create three EC2 instances; four module blocks would create four EC2 instances; and so on. And, of course, you can mix and match different modules, include modules in other modules, and so on. It’s not unusual for modules to be reused dozens or hundreds of times across a company, so that you put in the work to create a module that meets your company’s needs once, and then use it over and over again.

However, there are two changes you need to make to the ec2-instance module in order for it to work effectively as a reusable module.

The first change is to namespace all the resources created by the ec2-instance module. Currently, it hard-codes all names, such as the name of the security group, to "sample-app-tofu." AWS requires security group names to be unique within a VPC, so if you ran apply on these two module blocks, the second security group would fail to create due to the name conflict. To fix this, introduce a name input variable in modules/ec2-instance/variables.tf, as shown in Example 18:

Example 18. Add a name input variable to the ec2-instance module (ch2/tofu/modules/ec2-instance/variables.tf)
terraform
variable "name" {
  description = "The base name for the instance and all other resources"
  type        = string
}

Next, update the ec2-instance module to use the name input variable everywhere that was hard-coded to "sample-app-tofu," including the aws_security_group resource and the tags in the aws_instance resource, as shown in Example 19:

Example 19. Use the name input variable in the ec2-instance module (ch2/tofu/modules/ec2-instance/main.tf)
terraform
resource "aws_security_group" "sample_app" {
  name        = var.name
  description = "Allow HTTP traffic into ${var.name}"
}

resource "aws_instance" "sample_app" {
  # ... (other params omitted) ...

  tags = {
    Name = var.name
  }
}

Now, back in sample-app/main.tf, you can set the name parameter to different values in each of the module blocks, as shown in Example 20:

Example 20. Set the name input to different values (ch2/tofu/live/sample-app/main.tf)
terraform
module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"

  name = "sample-app-tofu-1"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"

  name = "sample-app-tofu-2"
}

Now you’ll get two EC2 instances, one with all resources named "sample-app-tofu-1," and the other with all resources named "sample-app-tofu-2."

The second change is to remove the provider block from the ec2-instance module. Having a provider block inside a module isn’t wrong per se, but typically, reusable modules do not declare provider blocks, and instead inherit those from the root module. This allows the provider block to be configured in different ways in different usages of the module. For example, one usage might configure the provider to use a different region, another usage might configure it to a different AWS account, and so on. All you need to do is to move the provider block from the ec2-instance (reusable) module to the sample-app (root) module, as shown in Example 21:

Example 21. Move the provider block to the sample-app root module (ch2/tofu/live/sample-app/main.tf)
terraform
provider "aws" {
  region = "us-east-2"
}

module "sample_app_1" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"

  name = "sample-app-tofu-1"
}

module "sample_app_2" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"

  name = "sample-app-tofu-2"
}
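To illustrate why it helps to keep provider blocks in the root module, here is a rough sketch of one root module that deploys the same reusable module into two regions. It assumes a second provider configuration with the alias "west" and a module block named sample_app_west, both of which are made up for this illustration and are not part of the sample code. The second AMI ID is a placeholder: AMI IDs are region-specific, so you would need an AMI that actually exists in us-west-2.

```terraform
# Default provider configuration, used by any module that doesn't say otherwise
provider "aws" {
  region = "us-east-2"
}

# A second, aliased configuration for another region (alias name is hypothetical)
provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

module "sample_app_east" {
  source = "../../modules/ec2-instance"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-east"
}

module "sample_app_west" {
  source = "../../modules/ec2-instance"

  # Inside this module usage, "aws" refers to the aliased us-west-2 configuration
  providers = {
    aws = aws.west
  }

  # Placeholder: must be an AMI that exists in us-west-2
  ami_id = "ami-0123456789abcdef0"
  name   = "sample-app-tofu-west"
}
```

If the ec2-instance module still declared its own provider block, every usage would be locked to whatever region that block hard-coded.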

One last step: create an outputs.tf file in the sample-app folder with the contents shown in Example 22:

Example 22. Proxy the output variables from the ec2-instance module (ch2/tofu/live/sample-app/outputs.tf)
terraform
output "sample_app_1_public_ip" {
  description = "The public IP of the sample-app-1 instance"
  value       = module.sample_app_1.public_ip
}

output "sample_app_2_public_ip" {
  description = "The public IP of the sample-app-2 instance"
  value       = module.sample_app_2.public_ip
}

output "sample_app_1_instance_id" {
  description = "The ID of the sample-app-1 instance"
  value       = module.sample_app_1.instance_id
}

output "sample_app_2_instance_id" {
  description = "The ID of the sample-app-2 instance"
  value       = module.sample_app_2.instance_id
}

The preceding code "proxies" the output variables from the underlying ec2-instance module usages so that you can see those outputs when you run apply on the sample-app root module.

OK, you’re finally ready to run this code:

console
$ tofu init
$ tofu apply

When apply completes, you should have two EC2 instances running, and the output variables should show their IPs and instance IDs. If you wait a minute or two for the instances to boot up, and open http://<IP>:8080 in your browser, where <IP> is the public IP of either instance, you should see the familiar "Hello, World!" text. When you’re done experimenting, run tofu destroy to clean everything up again.
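The two near-identical module blocks in this example are fine for learning, but as the number of copies grows, you may prefer a single module block with for_each. The following is a sketch of that alternative, not what the sample code repo does; the module name sample_app and the output name public_ips are assumptions for illustration:

```terraform
provider "aws" {
  region = "us-east-2"
}

module "sample_app" {
  source = "../../modules/ec2-instance"

  # Create one instance of the module per name in this set
  for_each = toset(["sample-app-tofu-1", "sample-app-tofu-2"])

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"

  # each.key is the current element of the set
  name = each.key
}

output "public_ips" {
  description = "Map from instance name to public IP"
  value       = { for name, app in module.sample_app : name => app.public_ip }
}
```

With this layout, adding a third instance is a matter of adding one more name to the set. This also relies on the second change you made earlier: a module that declares its own provider block can't be used with for_each or count.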

Example: Deploy an EC2 Instance Using an OpenTofu Module from GitHub

There’s one more trick with OpenTofu modules: the source parameter can be set not only to a local file path, but also to a URL. For example, the blog post series’s sample code repo in GitHub includes an ec2-instance module that is more or less identical to your own ec2-instance module. You can use the module from the series’s sample code repo by setting the source parameter to a URL, as shown in Example 23:

Example 23. Set source to GitHub URLs (ch2/tofu/live/sample-app/main.tf)
terraform
module "sample_app_1" {
  source = "github.com/brikis98/devops-book//ch2/tofu/modules/ec2-instance"

  # ... (other params omitted) ...
}

The preceding code sets the source URL to a GitHub URL. Note the intentional use of two slashes (//): the part to the left of the two slashes specifies the GitHub repo and the part to the right of the two slashes specifies the subfolder within that repo.

Run init on this code one more time:

console
$ tofu init

Initializing the backend...
Initializing modules...
Downloading git::https://github.com/brikis98/devops-book.git...
Downloading git::https://github.com/brikis98/devops-book.git...

Initializing provider plugins...

The init command is responsible for downloading provider code and module code, and you can see in the preceding output that, this time, it downloaded the module code from GitHub. If you now run apply, you should get the exact same two EC2 instances as before. When you’re done experimenting, run destroy to clean everything up.
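One thing to be aware of with URL sources is that, by default, init fetches whatever is on the repo's default branch at that moment. Git-based sources accept a ref query parameter that pins the module to a specific tag, branch, or commit. The sketch below shows the syntax only: the tag v1.0.0 is a hypothetical value, and you would need to check which tags actually exist in the repo before using it.

```terraform
module "sample_app_1" {
  # "?ref=..." pins the module to a Git ref; v1.0.0 is a hypothetical tag
  source = "github.com/brikis98/devops-book//ch2/tofu/modules/ec2-instance?ref=v1.0.0"

  # TODO: fill in with your own AMI ID!
  ami_id = "ami-09a9ad4735def0515"
  name   = "sample-app-tofu-1"
}
```

After changing the ref, you need to re-run tofu init so that OpenTofu downloads the module code at the new version.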

You’ve now seen the power of reusable modules. A common pattern at many companies is for the Ops team to define and manage a library of vetted, reusable OpenTofu modules—e.g., one module to deploy servers, another to deploy databases, another to configure networking, and so on—and for the Dev teams to use these modules as a self-service way to deploy and manage the infrastructure they need for their apps.

This blog post series will make use of this pattern in future blog posts: instead of writing every line of code from scratch, you’ll be able to use modules directly from this series’s sample code repo to deploy the infrastructure you need for each post.

Get your hands dirty

Here are a few exercises you can try at home to go deeper:

  • Update your ec2-instance module to make it more configurable: e.g., add input variables to configure the instance type it uses, the port it opens up for HTTP requests, and so on.

  • Use data sources in OpenTofu to find the ID of your AMI automatically, instead of manually hard-coding the ID.

How Provisioning Tools Stack Up

So, how do provisioning tools stack up using the IaC category criteria from before?

CRUD

Most provisioning tools fully support all four CRUD operations. For example, you just saw OpenTofu create an EC2 instance, read the EC2 instance state, update the EC2 instance (to add a tag), and delete the EC2 instance.

Scale

Provisioning tools scale very well. For example, the self-service approach mentioned in the previous section—where you have a library of reusable modules managed by Ops and used by Devs to deploy the infrastructure they need—can scale to thousands of developers and tens of thousands of resources, something that would be a nightmare to manage with ad hoc scripts.

Deployment strategies

Provisioning tools typically let you use whatever deployment strategies are supported by the underlying infrastructure. For example, OpenTofu allows you to use instance refresh to do a zero-downtime, rolling deployment for groups of servers in AWS; you’ll try out an example of this in Part 3.

Idempotency and error handling

Whereas most ad hoc scripts are procedural, where you specify step by step how to achieve some desired end state, most provisioning tools are declarative, where you specify the end state you want, and the provisioning tool automatically figures out how to get you from your current state to that desired end state. As a result, most provisioning tools are idempotent and can handle errors automatically. For example, you already saw in the CRUD discussion that you can re-run OpenTofu multiple times, and it will refresh its state, and come up with an execution plan to try to make the state of the world match the desired state in your code, handling changes in your code, changes in the outside world, and errors along the way.

Consistency

Most provisioning tools enforce a consistent, predictable structure to the code, including documentation, file layout, clearly named parameters, secrets management, and so on.

Verbosity

The declarative nature of provisioning tools and the custom DSLs they provide typically result in concise code, especially considering that code supports all CRUD operations, deployment strategies, scale, idempotency, and error handling out-of-the-box. The OpenTofu code for deploying an EC2 instance is about half the length of the Bash code, even though it does considerably more, and the more complex the infrastructure you’re managing, the larger this gap becomes.

Provisioning tools should be your go-to option for managing infrastructure. Moreover, many provisioning tools can manage not only traditional infrastructure (e.g., servers), but also many other aspects of software delivery. For example, you can use OpenTofu to manage your version control system (e.g., using the GitHub provider), metrics (e.g., using the Grafana provider), and your on-call rotation (e.g., using the PagerDuty provider), tying them all together with code.
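To make the version control example a bit more concrete, here is a small sketch of managing a GitHub repository with the GitHub provider. The organization name example-org and the repository name and description are made up for illustration, and the provider is assumed to read its credentials from the environment (e.g., a GITHUB_TOKEN environment variable):

```terraform
terraform {
  required_providers {
    github = {
      source = "integrations/github"
    }
  }
}

provider "github" {
  # Hypothetical organization; replace with your own org or user
  owner = "example-org"
}

# A repository defined, updated, and deleted through code rather than the web UI
resource "github_repository" "sample_app" {
  name        = "sample-app"
  description = "Sample app repo managed by OpenTofu"
  visibility  = "private"
}
```

The workflow is the same one you used for EC2 instances: run init, review the plan, and apply.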

Key takeaway #4

Provisioning tools are great for deploying and managing servers and infrastructure.

Although I’ve been comparing IaC tools this entire blog post, the reality is that you’ll probably need to use multiple IaC tools together, as discussed in the next section.

Using Multiple IaC Tools Together

Each of the tools you’ve seen in this blog post has strengths and weaknesses. No one of them can do it all, so for most real-world scenarios, you’ll need to use several different tools, and it’s your job to pick the right tool(s) for the job.

Key takeaway #5

You usually need to use multiple IaC tools together to manage your infrastructure.

The following sections show three common combinations I’ve seen work well at a number of companies.

Provisioning Plus Configuration Management

Example: OpenTofu and Ansible. You use OpenTofu to deploy all the underlying infrastructure, including the network topology, data stores, load balancers, and servers. You then use Ansible to deploy your apps on top of those servers, as depicted in Figure 22:

Figure 22. OpenTofu deploys the infrastructure, including servers, and Ansible deploys apps onto those servers.

This is an easy approach to get started with and there are many ways to get Ansible and OpenTofu to work together (e.g., OpenTofu adds tags to your servers, and Ansible uses an inventory plugin to automatically discover servers with those tags). The major downside is that using Ansible typically means mutable infrastructure, rather than immutable, so as your codebase, infrastructure, and team grow, maintenance and debugging can become more difficult.
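As a sketch of the OpenTofu half of the tag-based handoff described above: the instance below carries a tag that an Ansible inventory plugin (such as amazon.aws.aws_ec2) could be configured to group hosts by. The tag key Ansible, its value, and the instance_type are assumptions chosen for illustration, and the Ansible inventory configuration itself is not shown.

```terraform
provider "aws" {
  region = "us-east-2"
}

resource "aws_instance" "sample_app" {
  # TODO: fill in with your own AMI ID!
  ami           = "ami-09a9ad4735def0515"
  instance_type = "t2.micro" # assumption for illustration

  tags = {
    Name = "sample-app"

    # Hypothetical tag that the Ansible inventory could filter or group on
    Ansible = "sample_app_instances"
  }
}
```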

Provisioning Plus Server Templating

Example: OpenTofu and Packer. You use Packer to package your apps as VM images. You then use OpenTofu to deploy servers with these VM images and the rest of your infrastructure, including the network topology, data stores, and load balancers, as illustrated in Figure 23:

Figure 23. OpenTofu deploys the infrastructure, including servers, and Packer creates the VMs that run on those servers.

This is also an easy approach to get started with. In fact, you already had a chance to try this combination out earlier in this post. Moreover, this is an immutable infrastructure approach, which will make maintenance easier. The main drawback is that VMs can take a long time to build and deploy, which slows down iteration speed.

Provisioning Plus Server Templating Plus Orchestration

Orchestration tools, such as Kubernetes, Nomad, and OpenShift, help you deploy and manage apps on top of your infrastructure. You’ll do a deep-dive on orchestration in Part 3.

Example: OpenTofu, Packer, Docker, and Kubernetes. You use Packer to create a VM image that has Docker and Kubernetes agents installed. You then use OpenTofu to deploy a cluster of servers, each of which runs this VM image, and the rest of your infrastructure, including the network topology, data stores, and load balancers. Finally, when the cluster of servers boots up, it forms a Kubernetes cluster that you use to run and manage your Dockerized applications, as shown in Figure 24:

Figure 24. OpenTofu deploys the infrastructure, including servers; Packer creates the VMs that run on those servers; and Kubernetes manages those VMs as a cluster for running Docker containers.

The advantage of this approach is that Docker images build fairly quickly, you can run and test them on your local computer, and you can take advantage of all of the built-in functionality of Kubernetes, including various deployment strategies, auto healing, auto scaling, and so on. The drawback is the added complexity, both in terms of extra infrastructure to run (Kubernetes clusters are difficult and expensive to deploy and operate, though most major cloud providers now provide managed Kubernetes services, which can offload some of this work) and in terms of several extra layers of abstraction (Kubernetes, Docker, Packer) to learn, manage, and debug.

Adopting IaC

At the beginning of this blog post, you heard about all the benefits of IaC (self-service, speed and safety, code reuse, and so on), but it’s important to understand that adopting IaC has significant costs, too. Not only do your team members have to learn new tools and techniques, they also have to get used to a totally new way of working. It’s a big shift to go from the old-school sysadmin approach of spending all day managing infrastructure manually and directly (e.g., connect to a server and update its configuration) to the new DevOps approach of spending all day coding and making changes indirectly (e.g., write some code and let an automated process apply the changes).

Key takeaway #6

Adopting IaC requires more than just introducing a new tool or technology: it also requires changing the culture and processes of the team.

Changing culture and processes is a significant undertaking, especially at larger companies. Because every team’s culture and processes are different, there’s no one-size-fits-all way to do it, but here are a few tips that will be useful in most situations:

Focus on the most important problems

It might be slightly heretical for the author of a book on DevOps to say this, but not every team needs IaC. Adopting IaC has a relatively high cost, and although it will pay off in the long term for some scenarios, it won’t for others. For example, if your team is spending all of its time dealing with bugs and outages that result from a manual deployment process, then it might make sense to prioritize IaC, but if you’re at a tiny startup where one person can easily manage all your infrastructure, or you’re working on a prototype or side project that might be thrown away in a few months, managing infrastructure by hand may be the right choice.

Don’t adopt IaC, or any other practice, just because you read somewhere that it’s a "best practice." Instead, identify the problems your team has, and always focus on solving the most important ones. As you saw in Section 1.1.2, at a certain scale, most companies face problems that are best solved by IaC, but until you get to that scale and start hitting those problems, it’s OK to focus on other priorities.

Work incrementally

Even if you do prioritize adopting IaC, or any other practice, don’t try to do it all in one massive step. Instead, whenever you adopt any new practice, do it incrementally, as you learned in Part 1: break up the work into small steps, each of which brings value by itself. For example, don’t try to do one giant project where you try to migrate all of your infrastructure to IaC by writing tens of thousands of lines of code. Instead, use an iterative process where you identify the most problematic part of your infrastructure (e.g., the part that is causing the most bugs and outages), fix the problems in that part (e.g., perhaps by migrating that part to IaC), and repeat.

Give your team the time to learn

If you want your team to adopt IaC, then you need to be willing to dedicate sufficient time and resources to it. If your team doesn’t get the time and resources that it needs, then your IaC migration is likely to fail. One scenario I’ve seen many times is that no one on the team has any clue how to use IaC properly, so you end up with a jumble of messy, buggy, unmaintainable code that causes more problems than it solves; another common scenario is that part of the team knows how to do IaC properly, and they write thousands of lines of beautiful code, but the rest of the team has no idea how to use it, so they continue making changes manually, which invalidates most of the benefits of IaC. If you decide to prioritize IaC, then I recommend that (a) you get everyone bought in, (b) you make learning resources available, such as classes, documentation, video tutorials, and, of course, this blog post series, and (c) you provide sufficient dedicated time for team members to ramp up before you start using IaC everywhere.

Get the right people on the team

If you want to be able to use infrastructure as code, you have to learn how to write code. In fact, as you saw at the beginning of the blog post, a key shift with modern DevOps is managing more and more as code, so as a company adopts more DevOps practices, strong coding skills become more and more important. If you have team members who are not strong coders, be aware that some will be able to level up (given sufficient time and resources, as per the previous point), but some will not, which means you may have to hire new developers with coding skills for your team.

Conclusion

You now understand how to manage your infrastructure as code. Instead of clicking around a web UI, which is tedious and error-prone, you can automate the process, making it faster and more reliable. Moreover, whereas manual deployments always require someone at your company to do the busywork, with IaC, you can reuse code written by others, including both open source code (e.g., Ansible Galaxy, Docker Hub, Terraform Registry) and commercial code (e.g., Gruntwork Infrastructure as Code Library). This also includes the examples in the rest of this blog post series, most of which will be defined as code: you’ll see snippets of the code in the series itself, and you can find the fully-working examples in the sample code repo in GitHub.

To help you pick the right category of IaC tool, here are the 6 key takeaways you’ve seen throughout the blog post:

  • Ad hoc scripts are great for small, one-off tasks, but not for managing all your infrastructure as code.

  • Configuration management tools are great for managing the configuration of servers, but not for deploying the servers themselves, or other infrastructure.

  • Server templating tools are great for managing the configuration of servers with immutable infrastructure practices.

  • Provisioning tools are great for deploying and managing servers and infrastructure.

  • You usually need to use multiple IaC tools together to manage your infrastructure.

  • Adopting IaC requires more than just introducing a new tool or technology: it also requires changing the culture and processes of the team.

If the job you’re doing is provisioning infrastructure, you’ll probably want to use a provisioning tool. If the job you’re doing is configuring servers, you’ll probably want to use a server templating or configuration management tool. And as most real-world software delivery setups require you to do multiple jobs, you’ll most likely have to combine several tools together: e.g., provisioning plus server templating.

It’s worth remembering that there is also a lot of variety within an IaC category: e.g., there are big differences between Ansible and Chef within the configuration management category, and between OpenTofu and CloudFormation within the provisioning tool category. For a more detailed comparison, have a look at this comparison of Chef, Puppet, Ansible, Pulumi, CloudFormation, and Terraform/OpenTofu.

Going deeper on OpenTofu / Terraform

Many of the examples in the rest of this blog post series involve provisioning infrastructure, and I use OpenTofu as the provisioning tool for most of these examples, so you may want to become more familiar with this toolset. The best way to do that, with apologies for a bit of self-promotion, is to grab a copy of my other book, Terraform: Up & Running.

Being able to use code to run a server is a huge advantage over managing it manually, but a single server is also a single point of failure. What if it crashes? What if the load exceeds the capacity of a single server? How do you roll out changes without downtime? These topics are the focus of Part 3, How to Deploy Your Apps.
