The Road to 1.0: Terragrunt Feature Flags
Yousif Akbar
Principal Software Engineer
Published February 6, 2025

We’re proud to announce first class support for Feature Flags in Terragrunt! Terragrunt feature flags improve the speed and safety with which engineers can integrate their IaC.

Feature flags allow engineers to integrate incomplete work without introducing undue risk to the infrastructure they manage, decouple infrastructure release from deployment, and codify important information about the evolution of their IaC.

This will be a hands-on walk-through: if you follow along, you’ll set up real AWS infrastructure and try out some relatively new Terragrunt functionality that you may not have encountered before, even if you have plenty of experience with the tool.

Before we start building, we’ll explore why we built feature flags into Terragrunt, how they work, and some common patterns. If you would like to skip straight into building the project, click here.

Why feature flags?

Feature flags were added to Terragrunt as a consequence of working with Gruntwork customers who needed the ability to operate at massive scale (one customer in particular was grappling with over 200 repositories, each containing hundreds of Terragrunt units, and their infrastructure estate is still growing).

Working with customers using Terragrunt at this scale, we learned of scenarios where users were encountering friction between different platform engineers due to points of integration across units. Terragrunt does a good job of reducing the blast radius of infrastructure changes, and part of that is how it handles errors in the DAG. When Terragrunt encounters an error in a unit, it prevents all of its dependents from running, preventing unexpected errors in downstream dependents that expected their dependency to be working correctly.

In very large code bases, users were slowing down code integration for fear that a mistake in the configuration of one unit could have cascading errors on others. Because Terragrunt prevents dependents from running when a dependency throws an error, a single failure also blocks the dependents of those dependents, and so on down the graph. Customers with long chains of dependencies needed to ensure that dependents deep in the dependency graph continued to run correctly, even when introducing potentially flaky changes to a high-level dependency.

As a company whose primary purpose is to support customers on their DevOps journeys, this smelled of an anti-pattern. Slowing down integration for fear of failure is often a doom spiral, as integrations become less granular over time, which makes them riskier, which makes them scarier to perform. Luckily, we also know of a common, industry-standard technique for making it safer to integrate early and often: feature flags!

Feature flags are a common tool in the DevOps world, as they allow users to continuously integrate their code with each other, and deploy code updates frequently without immediately altering runtime behavior. Developers can deploy features that are disabled by default, and decide to enable them at a later time (and roll them back if necessary). This is commonly referred to as decoupling deployment from release.

Collaboration with enterprise customers

Before proceeding with discussing the design of feature flags, it might be interesting to learn about the collaborative process with enterprise customers that led to the decisions we made while designing them.

Terragrunt is, and always will be, a free, open source product. We’ve never monetized any Terragrunt feature, and we believe that the Terragrunt community shouldn’t ever be stuck working with a sub-optimal or incomplete product because they aren’t Gruntwork customers.

To enable continued investment in Terragrunt, Gruntwork does two things:

  1. We leverage Terragrunt in our paid offerings like Pipelines, Patcher and Account Factory, creating a virtuous cycle where improving Terragrunt improves our paid products, which lets us invest more in Terragrunt.
  2. We collaborate with enterprises that purchase Commercial Terragrunt Support to provide guidance on optimal usage of Terragrunt, prioritized feature requests and bug fixes.

It was in collaboration with that second category of customers that we designed feature flags. These customers frequently operate at scales where Terragrunt is a vital backbone of their business. They require the most out of Terragrunt, and their needs for flexibility, reliability and performance push us to make Terragrunt the best product it can be for everyone. When they have problems that require a change in Terragrunt, we collaborate with them to design and test changes so that we know they work at scale before they’re generally available.

Designing feature flags

Prior to the introduction of a first-class notion of feature flags in Terragrunt, there was already tooling available for users to dynamically control the behavior of Terragrunt at runtime. HCL functions like get_env and run_cmd allow Terragrunt configuration to behave dynamically based on context, and users could use that functionality to build their own feature flags.

Users could, for example, construct Terragrunt configuration like the following:

locals {
  version = get_env("FOO_VERSION", "v1.2.3")
}

terraform {
  source = "github.com/acme/our-cool-modules//foo?ref=${local.version}"
}

They could then dynamically toggle the version of the foo module in different environments: opting in to a less stable version of the module in lower environments and working out the kinks before making it the default version.
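For example, a teammate could opt in to a newer version for a single run with nothing more than an environment variable (a sketch of the manual approach; v1.3.0 stands in for whatever pre-release version is being tested):

# Opt in to a newer module version for this run only; the default stays v1.2.3.
$ FOO_VERSION=v1.3.0 terragrunt apply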

The problem with such an approach is that:

  1. It requires fairly complex usage of Terragrunt. You have to know how to use those capabilities to build your own feature flag implementation, and decide on scalable patterns/naming conventions.
  2. Nothing semantically tells you that feature flags are being used. You could be using get_env for a whole host of reasons, so what would tell you that FOO_VERSION in particular was being used as a feature flag? Moreover, how can a platform team confidently identify all the feature flags in use, and determine their current values?
  3. This is still missing features customers were asking for. How can you ensure that flaky features don’t tank the whole dependency graph? How can you avoid running units in certain environments if they aren’t ready yet?

To address those concerns and to build a solution that would scale effectively, we did some internal design, then proposed RFC #3134 to iterate on the design in public. As a consequence of that work, we ended up delivering three heavily interrelated configurations.

1. The feature block

The first configuration we added to Terragrunt was the feature block.

It looks like this:

feature "feature_name" {
default = "feature value"
}

Feature blocks work very similarly to OpenTofu/Terraform variables. They are labeled values that you declare in your HCL configuration and can access as expressions elsewhere in that configuration.

The same problem we addressed above with get_env can instead be solved like this:

feature "foo_version" {
default = "v1.2.3"
}

terraform {
source = "github.com/acme/our-cool-modules//foo?ref=${feature.foo_version.value}"
}

At runtime, users can use the --feature flag to control the value of the version like so:

$ terragrunt --feature foo_version=v1.3.0 apply

# Like all Terragrunt flags, you can also use an environment variable instead.
$ TERRAGRUNT_FEATURE=foo_version=v1.3.0 terragrunt apply

This has some useful advantages over the “manual” approach to building out feature flags.

  1. Semantically, you know exactly what foo_version is. It is clearly configured to be a feature flag, and there are only two ways to set its value at runtime.
  2. As a platform team, you don’t have to search across a wide range of environment variables, flags, etc. to determine the values of your feature flags. They are either the default value of the feature block, or they were explicitly set using the --feature flag (see the sketch after this list for one way to enumerate them).
  3. Feature flags always have a valid value by default. Other team members might not set FOO_VERSION before running a unit that looks for it, and it’s valid Terragrunt syntax to leave the fallback value of get_env unset.
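As a rough sketch of that enumeration, a platform team could lean on the fact that every flag has to be declared in a feature block (this assumes flags are declared directly in terragrunt.hcl files):

# List every declared feature flag in the repository.
$ grep -rn 'feature "' --include='terragrunt.hcl' .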

Note that Terragrunt feature flags focus entirely on dynamically configuring Terragrunt configuration. They might drive the values of things like module versions, inputs, or anything else, but they are exclusively read by Terragrunt. Other configurations like inputs specifically control OpenTofu/Terraform variables, and are meant to be read there. This allows feature flag values to be used in a wider range of use-cases, as we’ll discuss later.

With the feature block designed, we felt like we had a solid mechanism for users to codify dynamic Terragrunt configurations as feature flags. There were still pieces missing, however. How can a user leverage the status of a feature flag to do things like ignore flaky features, avoid an early exit when an error occurs in the dependency graph, or avoid using an incomplete unit altogether? To address those problems, we designed two more configurations, errors and exclude.

2. The errors block

Very early on in Terragrunt’s history, the retryable_errors attribute was introduced. Given Terragrunt’s role as an IaC orchestrator, it’s been a valuable tool for platform teams, as IaC can be flaky, and often you just need to try again to get past an intermittent network outage.

As we talked through customer requirements for feature flags, however, we came to realize that it was insufficient to solve the type of problem they were facing. At scale, users need the ability to handle errors a little more gracefully. They need the ability to not just retry errors, but control their error recovery strategies, suppress expected errors and signal to external systems when failure has taken place.

As such, we introduced the errors block to replace legacy error handling techniques. It looks like this:

errors {
  # Retry block for transient errors
  retry "transient_errors" {
    retryable_errors   = [".*Error: transient network issue.*"]
    max_attempts       = 3
    sleep_interval_sec = 5
  }

  # Ignore block for known safe-to-ignore errors
  ignore "known_safe_errors" {
    ignorable_errors = [
      ".*Error: safe warning.*",
      "!.*Error: do not ignore.*"
    ]
    message = "Ignoring safe warning errors"
    signals = {
      alert_team = false
    }
  }
}

Users now have the ability to retry specific errors with their own retry intervals, ignore specific categories of errors, and signal to external systems when they’ve ignored those errors.

When combined with feature flags, users get to author configurations like this:

feature "flaky_foo" {
default = false
}

locals {
foo_version = feature.flaky_foo ? "v1.2.4" : "v1.2.3"
}

terraform {
source = "github.com/acme/our-cool-modules//foo?ref=${local.foo_version}"
}

errors {
# Ignore block for known safe-to-ignore errors
ignore "known_safe_errors" {
ignorable_errors = !feature.flaky_foo.value ? [] : [
".*Error: known error from buggy foo module.*",
]
message = "Foo is still unstable"
signals = {
foo_unstable = true
}
}
}

Now, the authors of the foo module can codify the fact that the v1.2.4 version is unstable, and that errors in the unit consuming it should be suppressed. The rest of the dependency graph will continue to run in the event of an error, a custom warning will be emitted to the logs (Foo is still unstable), and a special error-signals.json file will be written to the filesystem right next to the terragrunt.hcl file of the failing unit, looking like this:

{
  "foo_unstable": true
}

The platform team managing CI/CD for the IaC can include logic to parse any error-signals.json files written to disk after a Terragrunt run, and know whether any errors were suppressed.
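A minimal sketch of that CI/CD logic might look like the following, assuming jq is available on the runner:

# Scan for any error-signals.json files written during the run and
# print which units had suppressed errors, along with their signals.
while IFS= read -r signals_file; do
  echo "Suppressed errors in $(dirname "$signals_file"):"
  jq . "$signals_file"
done < <(find . -name 'error-signals.json')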

This met the need to enable early and frequent integration of flaky units by teams, but it didn’t address the need to totally exclude them from graphs. For that, we delivered the final piece of the RFC, the exclude block.

3. The exclude block

Similar to the retryable_errors attribute, Terragrunt had an attribute that almost met the needs of our customers, but didn’t quite: the skip attribute. As the name implies, it allows users to specify in their Terragrunt configuration that they want a unit skipped while processing the run queue. The value can be determined dynamically, allowing users to write HCL expressions that determine whether the unit is skipped.

The problem with this attribute was its inflexibility. What if you only want to skip some commands? What if you also want to exclude the unit’s dependencies from the run queue?

To address these shortcomings, we introduced the exclude block. It looks like this:

exclude {
  if                   = true
  actions              = ["plan", "apply"]
  exclude_dependencies = false
}

By expanding the configuration options for exclusion in the run queue, users are given more flexibility to control how units are excluded and what that exclusion does to the rest of the run queue.

When combined with feature flags, exclude blocks look like this:

terraform {
  source = "github.com/acme/our-cool-modules//foo?ref=v1.2.4"
}

feature "enable_foo" {
  default = false
}

exclude {
  if                   = !feature.enable_foo.value
  actions              = ["all"]
  exclude_dependencies = !feature.enable_foo.value
}

Using this combination allows users to integrate incomplete units into the team’s codebase while leaving them disabled by default. Note that this setup also excludes all of the unit’s dependencies, excising the unit and everything it depends on from the run queue.
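When the unit is finally ready to go live, flipping the flag at the command line is enough to pull it (and its dependencies) back into the run queue, with something like:

# Include the foo unit and its dependencies in this run.
$ terragrunt run-all apply --feature enable_foo=true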

User & maintainer feature flags

Delivering feature flags got us thinking about how we control features internally in the Terragrunt codebase, and how we can use feature flags to give users a better experience as the codebase evolves. This is especially true as we look to make stability guarantees after the release of 1.0. In addition to introducing “user” feature flags, we’ve also introduced “maintainer” feature flags.

The feature block allows you, as the user of Terragrunt, to configure feature flags and opt-in to dynamic changes to runtime behavior using the Terragrunt CLI. We can think of these as “user” feature flags. Terragrunt users control what these feature flags do, what their defaults are, and how they are adjusted.

The other kind of feature flags Terragrunt has are “maintainer” feature flags, like Experiments and Strict Mode. These are feature flags that Terragrunt maintainers author, and expose opinionated controls for adjusting them. To avoid confusion, we typically don’t call these feature flags, but they do behave as feature flags in the Terragrunt codebase.

Experiments

Experiments are opt-in experimental functionality that Terragrunt maintainers aim to support by default in the future.

To encourage continuous integration and gather feedback as early as possible, features that are unstable, potentially unfinished and evolving are released behind these flags so that you can try out new features while we build Terragrunt in public.

We document experiments thoroughly, with details on what they do, when they’ll be stabilized and how to give feedback. This allows us to publicly communicate what unstable features are in active development, what needs to be done to make them stable, and gives users a consistent location to share their feedback on experiments.

Make sure that you keep those same considerations in mind when authoring your own feature flags. How do users know when feature flags can be removed? What is the status of the feature flag, and who can teams talk to about making its behavior the default?

Strict Mode

Strict controls are opt-in breaking changes that Terragrunt will make the default in the future.

To maximize backwards compatibility while allowing Terragrunt to evolve, breaking changes are gated behind opt-in strict controls until the next breaking release, rather than being pushed out immediately in Terragrunt releases.

On the road to 1.0, some of these breaking changes will be gradually released in minor versions, as 1.0 marks the beginning of guarantees laid out in The Road to Terragrunt 1.0. After 1.0, this is how users will be able to confidently use Terragrunt without expecting breaking changes in any minor release. As maintainers find that certain breaking changes can result in a better experience for Terragrunt users, those breaking changes will be gated behind opt-in strict controls until the next major release.

Similarly, maintainers encourage you to keep those considerations in mind for your feature flags. What kinds of breaking changes do you want to roll out in a controlled fashion with feature flags? How do you coordinate with your team to make sure that requisite breaking changes are made before removing feature flags, and making their behavior the default?

Where we stand today

I’m happy to report that our customers are able to scale their infrastructure estate faster, and with greater confidence as a consequence of their partnership with Gruntwork. Developers are better able to integrate their updates into large infrastructure estates early and often.

What’s likely more important to you is that these advancements improve Terragrunt for all users, free of charge. You can benefit from the outcomes of these advancements, if you learn how to take advantage of them. To help you adopt them, let’s get started with our demo project.

Project overview

The example infrastructure we’re going to create here consists of some simple AWS Lambda functions and DynamoDB tables. If you don’t have an AWS account, don’t worry! You can simply read along to learn about the functionality described here. I’m provisioning real infrastructure just to make things more interesting.

I’m going to provision two Lambda functions, one in each of my dev and prod environments. Each Lambda function will also be accompanied by a DynamoDB table used to store state.

Note that I’ll also be using the stacks experiment as part of this tutorial to keep the examples shorter, and because I’m excited about the functionality. If you’re not familiar with what Terragrunt Stacks are, I recommend reading the dedicated post on them in this series:

https://blog.gruntwork.io/the-road-to-terragrunt-1-0-stacks-cd97f11ef565

Setting up the project

To get started with our walk-through, we’ll create a directory where we can store our project. I’ve also provided the final version of the project as a public GitHub repository that you can clone to try out here.

$ mkdir feature-flags-blog-demo
$ cd feature-flags-blog-demo

The way I like to pin versions of tools in my projects is to use Mise. You’re free to use whatever you like, or not pin the versions of the tools you’re using at all. Just note the versions I’m using, and know that you might see different behavior if you use different versions.

$ mise use terragrunt@0.72.2 opentofu@1.9.0 go@1.23.4
$ cat mise.toml
[tools]
go = "1.23.4"
opentofu = "1.9.0"
terragrunt = "0.72.2"

I’m just pinning the latest versions of these tools at the time of writing.

Creating the modules

The first thing I like to do when provisioning some new infrastructure is to author the OpenTofu modules I’m going to use. As a Terragrunt user, if I can design these modules well, I’ll be confident that I can reliably deploy them going forward.

$ mkdir -p modules/api modules/db

To save space, I’ll just go over the most interesting part of the modules. You can take a look at them here if you want to see all the details of how they’re put together.

# modules/db/main.tf
resource "aws_dynamodb_table" "table" {
name         = var.name
billing_mode = var.billing_mode
hash_key     = var.hash_key
attribute {
name = var.hash_key
type = var.hash_key_type
}
}

The DynamoDB table is pretty simple. Just a generic table with a single attribute, which we’ll use as the key for storing values.

# modules/api/main.tf
resource "aws_lambda_function" "lambda" {
function_name = var.name
role          = aws_iam_role.role.arn
handler       = var.handler
runtime       = var.runtime
architectures = var.architectures

filename         = var.filename
source_code_hash = filebase64sha256(var.filename)

environment {
variables = {
DYNAMODB_TABLE = var.dynamodb_table
}
}
}

resource "aws_lambda_function_url" "url" {
function_name      = aws_lambda_function.lambda.function_name
authorization_type = "NONE"
}

Here you can see a simple implementation of the aws_lambda_function resource, and an aws_lambda_function_url resource to expose the Lambda function as an API accessible over HTTP.

Note the DYNAMODB_TABLE environment variable, defined to let the API know which table it should connect to.

Writing the Lambda source code

Lambda functions can be built using a variety of languages. For the sake of this post, I’ll choose Go.

Create the src directory where source code will live and initialize the Go module.

$ mkdir src
$ cd src
$ go mod init handler

To avoid bloating this blog post with boring code, I’ll link to the main.go file that has the logic for the Lambda function.

This is the request handler, to give you a sense of what it does:

func handleRequest(ctx context.Context, event events.LambdaFunctionURLRequest) (events.LambdaFunctionURLResponse, error) {
	method := event.RequestContext.HTTP.Method

	switch method {
	case "GET":
		return handleGet(ctx)
	case "POST":
		return handlePost(ctx)
	default:
		return events.LambdaFunctionURLResponse{
			StatusCode: 405,
			Body:       "Method Not Allowed",
		}, nil
	}
}

It just checks the method of the incoming HTTP request, then either increments (POST) or returns (GET) a counter.

Make sure that you run the following after adding the file to update the go.mod file with relevant dependencies.

$ go mod tidy

Setting up root Terragrunt configuration

Now, you’re ready to start adding some Terragrunt configuration.

First, create the root.hcl file containing the shared configuration that all your Terragrunt units will include.

locals {
  project_name = "feature-flags-blog-demo"
  region       = "us-east-1"
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    // Make sure to change this to a globally unique bucket name!
    // Someone else following this post might have used exactly this value.
    bucket         = "${local.project_name}-tfstate"
    key            = "${path_relative_to_include()}/tofu.tfstate"
    region         = local.region
    encrypt        = true
    dynamodb_table = "tfstate-lock"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"
}
EOF
}

This will handle critical configuration like remote state and provider configurations. It also provides a convenient place to store globally valid values, like the name of the project, which we’ll access later as well.

Setting up environment Terragrunt configuration

To finish scaffolding our stacks, let’s go ahead and create the directories where we’re going to be provisioning our infrastructure, then populate an environment.hcl which will have values that are valid within the scope of each environment:

$ mkdir -p live/dev/service live/prod/service

# live/dev/environment.hcl
locals {
  environment = "dev"
}

# live/prod/environment.hcl
locals {
  environment = "prod"
}

The environments don’t differ in behavior beyond the names of the resources they provision, so this will do for now.

Authoring the units

Next, let’s author the units we are going to provision in each one of these environments.

$ mkdir -p units/api units/db

I’ll show you the entirety of the configurations for the units, as they consist of only three files.

Later on, we’ll reference them in our stack configurations.

The db unit

# units/db/terragrunt.hcl
include "root" {
  path   = find_in_parent_folders("root.hcl")
  expose = true
}

locals {
  project_name = include.root.locals.project_name

  environment_hcl = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
  environment     = local.environment_hcl.locals.environment

  parent_dir = get_parent_terragrunt_dir("root")

  cur_dir_name = basename(get_terragrunt_dir())
}

terraform {
  source = "${local.parent_dir}/modules/db"
}

inputs = {
  name = "${local.project_name}-${local.cur_dir_name}-${local.environment}"

  hash_key      = "Id"
  hash_key_type = "S"
}

Here, we are doing a couple things:

  1. Including root configurations from root.hcl, resolving any remote backend or provider configurations that this unit might need, and leveraging that project_name local mentioned earlier.
  2. Fetching values from the context of the unit, like the name of the project, the environment, the current directory of the unit and the path to the directory containing the root configuration.
  3. Specifying the OpenTofu module source for the unit as the db module in the top-level modules directory.
  4. Setting inputs that will be passed to the OpenTofu module.

The api unit

# units/api/terragrunt.hcl
include "root" {
  path   = find_in_parent_folders("root.hcl")
  expose = true
}

locals {
  project_name = include.root.locals.project_name

  environment_hcl = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
  environment     = local.environment_hcl.locals.environment

  parent_dir = get_parent_terragrunt_dir("root")

  dist_dir   = "${local.parent_dir}/dist"
  source_dir = "${local.parent_dir}/src"

  package_script = "./package.sh"

  cur_dir_name = basename(get_terragrunt_dir())
}

terraform {
  source = "${local.parent_dir}/modules/api"

  before_hook "package" {
    commands = ["plan", "apply"]
    execute  = [local.package_script, local.source_dir, local.dist_dir]
  }
}

dependency "db" {
  config_path = "../../storage/db"

  # Mock outputs allow us to continue to plan the api module
  # even though the db module has not yet been applied.
  mock_outputs_allowed_terraform_commands = ["plan", "destroy"]
  mock_outputs = {
    name = "mock-table"
    arn  = "arn:aws:dynamodb:us-west-2:123456789012:table/mock-table"
  }
}

inputs = {
  name = "${local.project_name}-${local.cur_dir_name}-${local.environment}"

  filename = "${local.dist_dir}/package.zip"

  dynamodb_table = dependency.db.outputs.name
  dynamodb_arn   = dependency.db.outputs.arn
}

The configuration for the api unit is a little longer, but it also does a little more. In addition to doing very similar things to the db unit like including root.hcl, fetching contextual values, specifying the OpenTofu module and setting inputs, it also does the following:

  1. Hooks in the package.sh script which will be doing the packaging of the Lambda function before any plan or apply. We’ll take a look at that next.
  2. Configures the db dependency so that the api unit can take advantage of the outputs exposed there.

The package.sh script

#!/usr/bin/env bash

set -euo pipefail

package() {
  local -r src_directory="$1"
  local -r dist_directory="$2"

  pushd "$src_directory" >/dev/null
  GOOS=linux GOARCH=amd64 go build -o bootstrap
  zip "$dist_directory"/package.zip bootstrap
  popd >/dev/null
}

main() {
  local -r src_directory="$(realpath "$1")"
  local dist_directory="$2"

  mkdir -p "$dist_directory"
  dist_directory="$(realpath "$dist_directory")"

  package "$src_directory" "$dist_directory"
}

: "${1?"Usage: $0 <src_directory> <dist_directory>"}"
: "${2?"Usage: $0 <src_directory> <dist_directory>"}"

main "$@"

This package.sh script simply builds and packages the Lambda function as required by AWS: it compiles an output binary named bootstrap and places it at the root of a zip file, named package.zip here.

Note that this file is located at units/api/package.sh in the project, right next to the units/api/terragrunt.hcl file. Units are directories, not individual files, so we can store accompanying files that are tightly coupled to the orchestration of our IaC right next to the terragrunt.hcl files required for a unit to be valid.

Also note that it’s fairly common in OpenTofu modules to perform this kind of Lambda packaging within the module itself using local-exec, etc. I personally find the separation of concerns cleaner when doing this logic with a Terragrunt before_hook (and avoid using local-exec whenever possible), but nothing about Terragrunt prevents you from delegating that responsibility to OpenTofu if you prefer.

Adding stacks

This is the last piece of IaC before we dig into feature flags, and the simplest. What we want to do now is place a terragrunt.stack.hcl file everywhere we want to provision the stack of api + db units, so that we can repeatedly provision the exact same infrastructure.

$ mkdir -p live/dev/service live/prod/service

# live/dev/service/terragrunt.stack.hcl
# AND live/prod/service/terragrunt.stack.hcl

locals {
  units_dir = "${dirname(find_in_parent_folders("root.hcl"))}/units"
}

unit "api" {
  source = "${local.units_dir}/api"
  path   = "services/api"
}

unit "db" {
  source = "${local.units_dir}/db"
  path   = "storage/db"
}

Take note that both stacks contain identical code. All of their dynamic behavior is defined in the terragrunt.hcl files of the units, and since we want to provision the same units in each stack, we can use the same terragrunt.stack.hcl file in both places.

Provisioning the stacks

We’re now ready to provision the stacks.

First, make sure you’re authenticated with an AWS account where you want these resources provisioned, and run the following to opt-in to the stacks experiment for the lifetime of your shell session:

$ export TERRAGRUNT_EXPERIMENT=stacks

This makes it so that you don’t have to add the --experiment flag each time you run a stack command.
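If you’d rather not export the environment variable, you can pass the flag explicitly on each invocation instead, e.g.:

$ terragrunt stack run apply --experiment stacks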

Next, run the following in the live/dev/service directory:

# live/dev/service
$ terragrunt stack run apply
09:51:38.142 INFO   Generating stack from ./terragrunt.stack.hcl
09:51:38.143 INFO   Processing unit api
09:51:38.143 INFO   Processing unit db
09:51:38.157 INFO   The stack at ./.terragrunt-stack will be processed in the following order for command apply:
Group 1
- Module ./.terragrunt-stack/storage/db

Group 2
- Module ./.terragrunt-stack/services/api

Are you sure you want to run 'terragrunt apply' in each folder of the stack described above? (y/n) y

...

09:52:27.507 STDOUT [.terragrunt-stack/services/api] tofu: Apply complete! Resources: 6 added, 0 changed, 0 destroyed.
09:52:27.507 STDOUT [.terragrunt-stack/services/api] tofu:
09:52:27.507 STDOUT [.terragrunt-stack/services/api] tofu: Outputs:
09:52:27.508 STDOUT [.terragrunt-stack/services/api] tofu: url = "https://nu6lzajlb2b56quxitqwmt53ti0lbqng.lambda-url.us-east-1.on.aws/"

If you respond “yes”, you’ll have a fully provisioned stack!

As of the writing of this post, the stack output command hasn’t been created yet (remember that we’re building this in public), but luckily the design of stacks makes it so that we don’t even need that command!

We can navigate directly to the unit where we want to acquire outputs, and fetch them there:

# live/dev/service/.terragrunt-stack/services/api
$ url=$(terragrunt output -raw url)

You can test out the API by making curl requests to that URL.

$ curl -s "$url"
{"count":0}
$ curl -s "$url"
{"count":0}
$ curl -s -XPOST "$url"
{"count":1}
$ curl -s -XPOST "$url"
{"count":2}
$ curl -s -XPOST "$url"
{"count":3}
$ curl -s "$url"
{"count":3}

Go ahead and provision live/prod/service as well, so that both stacks are up. That will allow us to demonstrate rolling out our new feature progressively through each environment.

The CPU architecture feature flag

The feature flag we’ll be adding in this post is a simple flag that controls the CPU architecture of our Lambda function. Right now, it’s an x86-based Lambda function, but to get the cost and performance benefits of using Graviton chips that Amazon offers, we’ll want to switch to an ARM-based Lambda function.

To do this, we’ll need to make two changes simultaneously:

  1. We need to change the architectures input for the api unit so that it uses arm64.
  2. We need to change the packaging script to compile an ARM binary for the Lambda handler.

In a production setting, you wouldn’t want to make a change like this before testing it out in lower environments, as there may be incompatibilities in your code with different CPU architectures. Let’s take a look at how we can roll out our change in a controlled fashion using feature flags.

Adding the feature flag

To add the feature flag, update the units/api/terragrunt.hcl file so that it looks like the following:

include "root" {
path   = find_in_parent_folders("root.hcl")
expose = true
}

locals {
project_name = include.root.locals.project_name

environment_hcl = read_terragrunt_config(find_in_parent_folders("environment.hcl"))
environment     = local.environment_hcl.locals.environment

parent_dir   = get_parent_terragrunt_dir("root")

dist_dir       = "${local.parent_dir}/dist"
source_dir     = "${local.parent_dir}/src"

package_script = "./package.sh"

cur_dir_name = basename(get_terragrunt_dir())
}

# vvv New Code vvv
feature "architecture" {
default = "x86_64"
}
# ^^^ New Code ^^^

terraform {
source = "${local.parent_dir}/modules/api"

before_hook "package" {
commands = ["plan", "apply"]
# vvv Updated Code vvv
execute  = [local.package_script, local.source_dir, local.dist_dir, feature.architecture.value]
# ^^^ Updated Code ^^^
}
}

dependency "db" {
config_path = "../../storage/db"

# Mock outputs allow us to continue to plan on the apply of the api module
# even though the db module has not yet been applied.
mock_outputs_allowed_terraform_commands = ["plan", "destroy"]
mock_outputs = {
name = "mock-table"
arn  = "arn:aws:dynamodb:us-west-2:123456789012:table/mock-table"
}
}

inputs = {
name = "${local.project_name}-${local.cur_dir_name}-${local.environment}"

filename = "${local.dist_dir}/package.zip"

dynamodb_table = dependency.db.outputs.name
dynamodb_arn   = dependency.db.outputs.arn

# vvv New Code vvv
architectures = [feature.architecture.value]
# ^^^ New Code ^^^
}

This adds an architecture feature flag that optionally adjusts the CPU architecture of the Lambda function, via the architectures input.

Note that we also adjusted the before_hook here to pass the feature value to package.sh. Let’s take a look at how the script changed:

#!/usr/bin/env bash

set -euo pipefail

package() {
  local -r src_directory="$1"
  local -r dist_directory="$2"
  # vvv New Code vvv
  local -r architecture="$3"
  # ^^^ New Code ^^^

  pushd "$src_directory" >/dev/null
  # vvv New Code vvv
  GOOS=linux GOARCH="$architecture" go build -o bootstrap
  # ^^^ New Code ^^^
  zip "$dist_directory"/package.zip bootstrap
  popd >/dev/null
}

main() {
  local -r src_directory="$(realpath "$1")"
  local dist_directory="$2"
  local architecture="$3"

  mkdir -p "$dist_directory"
  dist_directory="$(realpath "$dist_directory")"

  # vvv New Code vvv
  # Normalize architecture value
  case "$architecture" in
  "x86_64")
    architecture="amd64"
    ;;
  "arm64")
    architecture="arm64"
    ;;
  *)
    echo "Unsupported architecture: $architecture"
    exit 1
    ;;
  esac
  # ^^^ New Code ^^^

  package "$src_directory" "$dist_directory" "$architecture"
}

: "${1?"Usage: $0 <src_directory> <dist_directory> <architecture>"}"
: "${2?"Usage: $0 <src_directory> <dist_directory> <architecture>"}"
: "${3?"Usage: $0 <src_directory> <dist_directory> <architecture>"}"

main "$@"

We now accept an additional architecture parameter that will control the compilation target for our go build command.

Using the feature flag

To use the feature flag, pass --feature architecture=arm64 when applying in the dev environment:

$ terragrunt stack run apply --feature architecture=arm64
...
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:       ~ architectures                  = [
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:           - "x86_64",
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:           + "arm64",
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:         ]
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:         id                             = "feature-flags-blog-demo-api-dev"
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:       ~ last_modified                  = "2025-01-28T15:05:17.893+0000" -> (known after apply)
10:33:58.141 STDOUT [.terragrunt-stack/service/api] tofu:       ~ source_code_hash               = "UYCUTX6q4288OPtU1T0iaFERz9HQACxp2gxOI33Hruk=" -> "ihdhvsJANdTU/7SbHX1hwIzxinEpD5gIOJ9oJdwIFXM="
...

To check that nothing exploded, we can go ahead and test our API again:

$ curl -s "$url"
{"count":3}
$ curl -s -XPOST "$url"
{"count":4}
$ curl -s "$url"
{"count":4}

In one fell swoop, we both changed an orchestration detail for how our Lambda function was packaged, and changed an associated input for our OpenTofu module in the unit.

Subsequent applies in this environment without the feature flag will roll back to the previous CPU architecture, and we can decide how to roll out this update in higher environments as need be. Importantly, we haven’t blocked any other updates to our infrastructure. The source code for the Lambda function can continue to be updated in production, and we can make tweaks to configurations like memory allocation, etc. without also introducing the CPU architecture change.
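When we’re happy with the results in dev, rolling the change out to production is just a matter of applying with the same flag from the prod stack (mirroring the dev invocation above):

# live/prod/service
$ terragrunt stack run apply --feature architecture=arm64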

Using feature flags in CI/CD

One thing to consider is how you want to leverage these kinds of feature flags in your continuous integration environments.

Maybe you have a continuous deployment workflow where you always deploy to dev for any new pull request, but deploy to prod on merge to the main branch. In that scenario, you can configure your CI/CD pipeline to set the TERRAGRUNT_FEATURE environment variable for pull requests, but leave it unset in main.

For example:

$ export TERRAGRUNT_FEATURE="architecture=arm64"

When you’re confident that production is ready for the new CPU architecture, update the environment variable for production and re-apply your changes (with no code changes) to roll it out. Rolling back is just a matter of removing the environment variable in production and re-applying.
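Putting the pull-request/main pattern together in a CI script might look roughly like this (BRANCH_NAME is a placeholder for whatever variable your CI system exposes):

# Enable the new architecture only on non-main branches; main keeps the default.
if [ "${BRANCH_NAME:-main}" != "main" ]; then
  export TERRAGRUNT_FEATURE="architecture=arm64"
fi

terragrunt stack run apply --terragrunt-non-interactive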

Using feature flags in Gruntwork Pipelines

If you’re a Gruntwork Pipelines customer, you can use feature flags out of the box today!

Take advantage of the env attribute in your pipelines configuration like so:

# .gruntwork/config.yml
pipelines:
  env:
    - name: TERRAGRUNT_FEATURE
      value: architecture=arm64

This allows you to control your Terragrunt feature flags without the need to make any changes to your CI/CD code. Pipelines will include more integrations with Terragrunt feature flags in the future, so stay tuned for more updates!

Using external feature flag services

Many teams prefer to use an externally managed feature flag service, as it provides much richer tooling for controlling the values of feature flags and their rollout.

Integrating with a service like that could take the form of unit configurations using a run_cmd HCL function like so:

feature "architecture" {
default = run_cmd("--terragrunt-quiet", "<command-to-fetch-feature-flag-value>")
}

Or a step in the CI/CD pipeline to set the value before calling Terragrunt:

$ export TERRAGRUNT_FEATURE="architecture=$(<command-to-fetch-feature-flag-value>)"
$ terragrunt stack run apply --terragrunt-non-interactive

Closing Thoughts

Hopefully, you’re excited about Terragrunt feature flags. If so, give Terragrunt a star on GitHub and start educating your team on Terragrunt feature flags!

If you need any help with feature flags, or want to ask about best practices, make sure to check out our community support documentation for places to join the conversation.