Deploying the Reference Architecture from scratch

This document is a guide to how to deploy the entire Reference Architecture, or one of the environments in the Reference Architecture (e.g., stage or prod), from scratch. This is useful if you want to know how to quickly set up and tear down environments.

  1. Deployment Order
  2. Build AMIs
  3. Build Docker images
  4. Build lambda functions
  5. Create EC2 Key Pairs
  6. Configure Terraform backends
  7. Configure the VPN server
  8. Create data store passwords
  9. Import Route 53 hosted zones
  10. Create TLS certs
  11. Create an IAM User for KMS
  12. Run Terragrunt

Deployment Order

If you are deploying the entire Reference Architecture from scratch, then you should be aware of the various dependencies that exist between the accounts. In order to ensure that all the dependent resources exist, we recommend deploying the accounts in the following order:

  1. security
  2. shared-services
  3. dev, stage, and prod. These can be done in parallel.

Build AMIs

All the EC2 Instances in the Reference Architecture (e.g., the ECS Cluster instances, the OpenVPN server, etc.) run Amazon Machine Images (AMIs) that are defined as code using Packer. You will find the Packer templates (.json files) in the infrastructure-modules-multi-account-acme repo. You will also find that the corresponding Terraform modules (e.g., the ecs-cluster module) expect an ami_id parameter to be set in terragrunt.hcl.

Prerequisite: Deploy Packer Build VPC

All the AMIs are built using an EC2 Instance that is temporarily deployed into a VPC in the account and region that will own the AMI. The VPC and subnet are selected using the tag gruntwork.io/allow-packer=true, so you need to ensure that a VPC with a public subnet tagged this way exists in the target region of the account that will own the AMI.

The easiest way to do this is to deploy a VPC using the networking/vpc-mgmt infrastructure module in the target region. This module tags the VPC and its public subnets appropriately for use with Packer.
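
For example, a terragrunt.hcl for such a VPC might look like the sketch below. The module ref and input names here are assumptions for illustration only; check the networking/vpc-mgmt module for its actual source URL and variables.

# Hypothetical terragrunt.hcl for a temporary Packer build VPC
terraform {
  source = "git::git@github.com:acme/infrastructure-modules-multi-account-acme.git//networking/vpc-mgmt?ref=v0.0.1"
}

include {
  path = find_in_parent_folders()
}

inputs = {
  # Input names are illustrative, not the module's real API
  vpc_name         = "packer-build"
  cidr_block       = "10.100.0.0/18"
  num_nat_gateways = 0
}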

Running Packer

  1. Authenticate to the AWS account. See Accounts and Auth.
  2. Run packer build <PATH-TO-TEMPLATE>. E.g., packer build infrastructure-modules-multi-account-acme/services/ecs-cluster/packer/ecs-node.json.
  3. At the end of the build, Packer will output the new AMI ID. You can use this in the ami_id parameter of terragrunt.hcl files, as in the sketch below.
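
For example, the relevant part of an ecs-cluster terragrunt.hcl would look roughly like this (the AMI ID is a placeholder and the rest of the file is omitted):

inputs = {
  # AMI ID output by the Packer build
  ami_id = "ami-0123456789abcdef0"

  # ... other module parameters ...
}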

AMIs and AWS regions

Note that AMIs live in a specific AWS region, so if you're deploying to multiple regions, you will have to build AMIs for each region separately. If this is a common task, you can specify multiple builders in the Packer template, one for each region, and the builds will run in parallel.

Sharing AMIs across AWS accounts

You can share AMIs across AWS accounts to avoid having to rebuild the same AMI over and over again. For example, you could build the ECS AMI once in the shared-services account and deploy that same AMI in dev, stage, and prod. To enable this, set the ami_users parameter to the IDs of the accounts that should have access to the AMI. For example, to give accounts 11111111111 and 22222222222 access to an AMI:

{
  "builders": [{
    "type": "amazon_ebs",
    "ami_users": ["11111111111", "22222222222"]
  }]
}

AMIs and encryption

If you want to encrypt the root volume of your EC2 Instances (e.g., for end-to-end encryption and compliance purposes), you will need to set the encrypt_boot parameter to true in the Packer template:

{
  "builders": [{
    "type": "amazon_ebs",
    "encrypt_boot": true
  }]
}

Note that encrypted AMIs may NOT be shared with other AWS accounts!

Build Docker images

If you're using Docker, the sample apps in the Reference Architecture will try to deploy Docker images. You will need to:

  1. Build the Docker images
  2. Tag them with a version number of some sort
  3. Push the images to your Docker Registry (typically ECR)
  4. Fill in the Docker image name and version number in the terragrunt.hcl files in infrastructure-live-multi-account-acme

The instructions for building, tagging, and pushing the Docker images are in the READMEs of the sample-app-frontend-multi-account-acme and sample-app-backend-multi-account-acme repos.
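
For illustration, the relevant inputs in a service's terragrunt.hcl might look like the sketch below; the parameter names (image, image_version) are assumptions and depend on the module you're deploying.

inputs = {
  # Placeholder ECR repo and tag: use the image you built and pushed above
  image         = "<YOUR_ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/sample-app-frontend"
  image_version = "v1.0.0"
}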

Build lambda functions

The Reference Architecture includes several sample Lambda functions under infrastructure-modules-multi-account-acme/lambda. These show examples of how to use Lambda to perform various tasks without having to manage any servers.

One of the Lambda functions in infrastructure-modules-multi-account-acme/lambda requires an extra build step to create its deployment package before you can deploy it:

./infrastructure-modules-multi-account-acme/lambda/long-running-scheduled/src/build.sh

At the end, the script will output the path of the resulting deployment package, plus instructions telling you to set an environment variable with that path. Make sure to follow those instructions!

Create EC2 Key Pairs

The Reference Architecture installs ssh-grunt on every EC2 Instance so that each developer can use their own username and key to SSH to servers (see SSH and VPN). However, we still recommend associating an EC2 Key Pair with your EC2 Instances as an emergency backup, in case there is some sort of issue with ssh-grunt.

We typically recommend creating at least 2 Key Pairs:

  1. For the OpenVPN server.
  2. For all other services.

To create an EC2 Key Pair:

  1. Go to the Key Pair section of the EC2 Console.
  2. Click "Create Key Pair."
  3. Enter a name for the Key Pair.
  4. Save the Key Pair to disk. Do NOT share this Key Pair with anyone else; it's only for emergency backup!
  5. Add a passphrase to the Key Pair: ssh-keygen -p -f <KEY_PAIR_PATH>.
  6. Change permissions on the Key Pair: chmod 400 <KEY_PAIR_PATH>.
  7. Pass the Key Pair name to the appropriate parameter in terragrunt.hcl in infrastructure-live-multi-account-acme; typically, this parameter will be called ssh_key_name, keypair_name, or cluster_instance_keypair_name (see the sketch below). Ensure you only use the OpenVPN keypair for the OpenVPN server.
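
For example, a sketch of the relevant input for an ECS cluster (the parameter name varies by module, as noted above, and the Key Pair name is a placeholder):

inputs = {
  # Name of the EC2 Key Pair created in the console (emergency backup only)
  cluster_instance_keypair_name = "stage-services-backup"
}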

Configure Terraform backends

The Reference Architecture uses an S3 backend to store Terraform State, with DynamoDB for locking. We recommend storing the Terraform State for each AWS account in a separate S3 bucket and DynamoDB table. You will need to fill in the name and region of the S3 bucket and DynamoDB table in the top-level terragrunt.hcl for that account in infrastructure-live-multi-account-acme.

When you run Terragrunt, if the S3 bucket or DynamoDB table don't already exist, they will be created automatically.
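
As a rough sketch, the remote_state block in that terragrunt.hcl looks something like the following; the bucket, table, and region values are placeholders:

remote_state {
  backend = "s3"
  config = {
    # Placeholder names: use a separate bucket and lock table per AWS account
    bucket         = "acme-stage-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-stage-terraform-locks"
  }
}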

Configure the VPN server

The Reference Architecture includes an OpenVPN server. The very first time you deploy the server, it will create the Public Key Infrastructure (PKI) it will use to sign certificates. This process is very CPU intensive and, on t2.micro EC2 Instances, it can take hours, as it seems to exceed the burst balance almost immediately.

To avoid this, we recommend initially deploying the OpenVPN server with a larger instance (t2.medium can generate the PKI in 1-2 minutes). Once the PKI has been generated, you can downgrade to a smaller instance again to save money.
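
For example, an illustrative sketch of the relevant input in the OpenVPN server's terragrunt.hcl (the parameter name is an assumption and depends on the module):

inputs = {
  # Larger instance for the initial deploy so PKI generation finishes quickly;
  # switch back to a smaller instance type once the PKI exists
  instance_type = "t2.medium"
}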

Create data store passwords

Some of the data stores used in the Reference Architecture, such as RDS databases, require that you set a password in the Terraform code. We do NOT recommend putting that password, in plaintext, directly in the code. Instead, we recommend:

  1. Create a long, strong, random password. Preferably 30+ characters.

  2. Store the password in a secure secrets manager.

  3. Every time you go to deploy the data store, set the password as an environment variable that Terraform can find (see Terraform environment variables). For example, for RDS DBs, you typically set the TF_VAR_master_password environment variable:

    export TF_VAR_master_password=(...)
    

Import Route 53 hosted zones

The Reference Architecture configures DNS entries using Route 53. Each domain name will live in a Public Hosted Zone that is either created automatically if you bought a domain name through Route 53, or manually if you are using Route 53 to manage DNS for a domain name bought externally.

If the Public Hosted Zone already exists, you will need to use the import command to put it under Terraform control. Go to the route53-public module in infrastructure-live-multi-account-acme for the account you're deploying and run:

terragrunt import aws_route53_zone.primary_domain <HOSTED_ZONE_ID>

Where HOSTED_ZONE_ID is the primary ID of your Hosted Zone, which you can find in the AWS Console (it typically looks something like Z1AB1Z2CDE3FG4).

Create an IAM User for KMS

The Reference Architecture uses KMS to encrypt and decrypt secrets. When you create a new Customer Master Key (CMK) in KMS, you must assign at least one IAM User as an "administrator" for the CMK. If there are no admins, then the CMK—and any secrets encrypted with it—may become completely inaccessible!

Therefore, you will need to create an IAM User, either in the same AWS account (for single-account deployments) or in the security account (for multi-account deployments), and provide that IAM User's ARN to the cmk_administrator_iam_arns parameter of the kms-master-key module.
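
For example, a minimal sketch of the relevant input in the kms-master-key module's terragrunt.hcl (the ARN is a placeholder):

inputs = {
  # IAM User(s) allowed to administer the CMK; if this list ever becomes empty,
  # the CMK and anything encrypted with it may become unmanageable
  cmk_administrator_iam_arns = ["arn:aws:iam::111122223333:user/kms-admin"]
}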

Create TLS certs

Public-facing TLS certs

The Reference Architecture will automatically use TLS certs from the AWS Certificate Manager (ACM) with each of your public load balancers (networking/alb-public in infrastructure-live-multi-account-acme) and CloudFront distributions (services/static-website in infrastructure-live-multi-account-acme). If you are deploying with totally new domain names, you will need to:

  1. Request a certificate from ACM for the AWS region(s) you are deploying to. The Terraform modules in the Reference Architecture typically look for a wildcard certificate of the format *.<your-domain-name> (e.g., *.acme.com), so make sure to request a wildcard certificate (one way to request such a cert with Terraform is sketched after this list). If you don't want to use wildcard certs, update the code in infrastructure-live-multi-account-acme and infrastructure-modules-multi-account-acme accordingly.
  2. If this is a certificate for a domain name managed in Route 53, we recommend using DNS to validate domain ownership, as it only takes a couple clicks.
  3. If you are using CloudFront, you must also request a certificate in us-east-1, no matter what region you are deploying to. See the cloudfront module for more info.
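
If you prefer to request the certificate with Terraform instead of the ACM console, a minimal sketch using the standard aws_acm_certificate resource (this is not part of the Reference Architecture modules, and the domain is a placeholder) looks like:

resource "aws_acm_certificate" "wildcard" {
  # Wildcard certificate validated via DNS records in Route 53
  domain_name       = "*.acme.com"
  validation_method = "DNS"
}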

Self-signed TLS certs for your apps

If you want to use end-to-end encryption, you will need to generate self-signed TLS certs for your apps so that other apps or the load balancer can send traffic to them over TLS. If you are unfamiliar with how TLS certificates work, start with the Background documentation.

There are many ways to generate a certificate and use it with your apps, but the easiest option is:

  1. Use the create-tls-cert.sh script provided in infrastructure-modules-multi-account-acme. This script will automatically create a CA cert and a TLS cert signed by that CA, and encrypt the private key using gruntkms with the KMS master key for the appropriate environment.

  2. Package the public and encrypted private key of the TLS cert with the app (e.g., update your Dockerfile or Packer template to package the TLS cert). You can take a look at sample-app-backend-multi-account-acme for an example of how to package the certificates.

  3. Use gruntkms to decrypt the private key just before the app boots (the run-app.sh script already does this).

  4. During boot, configure your app to load the public and private key of the TLS cert and listen for TLS connections. How you do this is app-specific.

  5. If you have other apps that are going to talk to your app directly (e.g., via service discovery), distribute the public key of the CA to those apps so they can validate your app's cert.

Self-signed TLS certs for your internal load balancers

If you want to use end-to-end encryption, you will need to generate self-signed TLS certs for your internal load balancers so that your apps can send requests to those load balancers over TLS. If you are unfamiliar with how TLS certificates work, start with the Background documentation.

There are many ways to generate a certificate and use it with a load balancer, but the easiest option is:

  1. Use the private-tls-cert module to generate the certificates. Configure the TLS cert with a domain name you are going to be using for your load balancer (see the next steps for how this domain name will work). This should give you back a public and private key for the TLS certificate and a public key for the CA.

  2. Create an internal domain name (e.g., acme-multi-account.internal) using Route 53 Private Hosted Zones. This domain name should match the one in the TLS certificate you created. See networking/route53-private in infrastructure-live-multi-account-acme.

  3. Import your certificate into the AWS Certificate Manager (ACM). One way to do this is with the aws CLI import-certificate command, passing it the public key of the certificate using the --certificate argument, the private key via the --private-key argument, and the CA's public key using the --certificate-chain argument:

    aws acm import-certificate \
      --region us-east-1 \
      --certificate file://cert.pem \
      --private-key file://cert.key \
      --certificate-chain file://ca.pem
    

    You'll need to do this in each region where you're going to deploy a load balancer. Once you've imported the certificate everywhere, you may want to delete the private key so no one else can access it. If you're going to keep the private key around, then make sure to use gruntkms to encrypt it with KMS.

  4. Create your load balancer with an HTTPS listener and set the certificate ARN to the ARN of the certificate you just imported into ACM. You can find the ARN automatically using the aws_acm_certificate data source.

  5. Create a Route 53 A Record that points at your load balancer with the domain name and Private Hosted Zone you created in step 2. You can use the aws_route53_record resource to create the record and find the Hosted Zone ID of your Private Hosted Zone automatically using the aws_route53_zone data source. A sketch of steps 4 and 5 follows this list.

  6. You'll want to distribute the public key of the CA to any app that is going to talk to your load balancer so that it can use it to validate the TLS certificate.
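
The following is a rough Terraform sketch of steps 4 and 5, using the standard aws_acm_certificate, aws_lb, and aws_route53_zone data sources; all names and domains are placeholders, and the HTTPS listener itself is omitted:

# Look up the certificate you imported into ACM (step 4); its ARN would be
# passed to the load balancer's HTTPS listener
data "aws_acm_certificate" "internal" {
  domain = "*.acme-multi-account.internal"
}

# Look up the Private Hosted Zone created in step 2
data "aws_route53_zone" "internal" {
  name         = "acme-multi-account.internal."
  private_zone = true
}

# Look up the internal load balancer by name (placeholder)
data "aws_lb" "internal" {
  name = "alb-internal"
}

# Step 5: point an internal domain name at the load balancer
resource "aws_route53_record" "internal_alb" {
  zone_id = data.aws_route53_zone.internal.zone_id
  name    = "api.acme-multi-account.internal"
  type    = "A"

  alias {
    name                   = data.aws_lb.internal.dns_name
    zone_id                = data.aws_lb.internal.zone_id
    evaluate_target_health = false
  }
}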

Self-signed TLS certs for Java-based apps

If you want to use end-to-end encryption, you will need to generate self-signed TLS certs for Java-based apps. Java uses a special format for managing TLS certificates (KeyStores and TrustStores) and requires an extra step to encode the generated certificates. If you are unfamiliar with how TLS certificates work, start with the Background documentation.

  1. Use the generate-trust-stores.sh script provided in infrastructure-modules-multi-account-acme. This script will automatically create a CA cert and a TLS cert signed by that CA, and encode them into a Key Store and Trust Store. The Key Store will be locked with a password that is generated by the script and stored in AWS Secrets Manager. The script will also encrypt the Key Store password using gruntkms with the KMS master key for the appropriate environment and output the KMS-encrypted ciphertext.

  2. Package the Key Store and Trust Store into the AMI for use with the Java app. You can take a look at the kafka module for an example of how to package the certificates, including specific parameters to use for generating the Key Store and Trust Store.

Run Terragrunt

Now that you have all the prerequisites out of the way, you can finally use Terragrunt to deploy everything!

Authenticate

If you're creating a totally new AWS account, the easiest way to do the initial deployment is to create a temporary IAM User in that account with admin access. Create Access Keys for that IAM User and set them as environment variables:

export AWS_ACCESS_KEY_ID=(your access key for this account)
export AWS_SECRET_ACCESS_KEY=(your secret key for this account)

Once everything is deployed, you can delete this IAM user, and access the account via IAM roles (see the cross-account-iam-access module for details).

If you're using an AWS account that already exists and has already been configured with cross-account IAM roles as part of the multi-account Reference Architecture setup, you should set environment variables for the security account:

export AWS_ACCESS_KEY_ID=(your access key for the security account)
export AWS_SECRET_ACCESS_KEY=(your secret key for the security account)

You should then set the TERRAGRUNT_IAM_ROLE environment variable to the ARN of an IAM role in the account you're deploying to that grants you administrative access. Typically, you'll want the allow-full-access-from-other-accounts IAM role:

export TERRAGRUNT_IAM_ROLE="arn:aws:iam::<ACCOUNT_ID>:role/allow-full-access-from-other-accounts"

Run apply-all

To deploy the entire account in a single command, you can use apply-all. For example, to deploy the stage account:

cd infrastructure-live-multi-account-acme/stage
terragrunt apply-all

You may want to run Terragrunt with the --terragrunt-non-interactive flag to avoid any interactive prompts:

terragrunt apply-all \
  --terragrunt-non-interactive

If you want to deploy code from your local checkout of infrastructure-modules-multi-account-acme, rather than a versioned release, use the --terragrunt-source parameter:

terragrunt apply-all \
  --terragrunt-non-interactive \
  --terragrunt-source ../../infrastructure-modules-multi-account-acme

If you want to deploy just a single module, use terragrunt apply:

cd infrastructure-live-multi-account-acme/stage/us-east-1/stage/services/ecs-cluster
terragrunt apply

Deployment order

Note that, in general, there are no dependencies between different AWS accounts, so you can deploy them in any order. The only exception to this is the security account in the multi-account setup. This account defines all IAM Users, Groups, and the S3 bucket used for CloudTrail audit logs, so it must always be deployed first.

Within an AWS account, there are many deployment dependencies (e.g., almost everything depends on the VPC being deployed first), all of which should be defined in the dependencies blocks of terragrunt.hcl files. Terragrunt takes these dependencies into account automatically and should deploy everything in the right order.
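
For example, a module that depends on the VPC declares that dependency in its terragrunt.hcl roughly like this (the path is illustrative):

dependencies {
  # Relative path to the module that must be applied first
  paths = ["../../networking/vpc-app"]
}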

Expected errors

Due to bugs in Terraform, you will most likely hit some of the following (harmless) errors:

  1. TLS handshake timeouts downloading Terraform providers or remote state. See https://github.com/hashicorp/terraform/issues/15817.

  2. "A separate request to update this alarm is in progress". See https://github.com/terraform-providers/terraform-provider-aws/issues/422.

  3. "Error loading modules: module xxx: not found, may need to run 'terraform init'". This typically happens if you run apply-all, change the version of a module you're using, and run apply-all again. Unfortunately, Terragrunt is not yet smart enough to automatically download the updated module (see https://github.com/gruntwork-io/terragrunt/issues/388). Easiest workaround for now is to set TERRAGRUNT_SOURCE_UPDATE=true to force Terragrunt to redownload everything:

    TERRAGRUNT_SOURCE_UPDATE=true terragrunt apply-all
    

If you hit any of these issues—and you'll almost certainly hit one of the first two—simply re-run apply-all and they should go away.

Next steps

Now that you know how to deploy the Reference Architecture, the next thing to learn is how to undeploy the Reference Architecture.
