
Gruntwork Newsletter, June 2019

Yevgeniy Brikman
Co-Founder
Published June 14, 2019

Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.

Hello Grunts,

In the last month, we integrated Kubernetes into the Gruntwork Reference Architecture, wrote a new blog post series on how to build an end-to-end production-grade architecture on AWS, updated Terratest and Terragrunt and many of our modules to work with Terraform 0.12 (which is now out officially!), updated the Reference Architecture to use OpenJDK, fixed a security bug in the SQS module, and much more.

As always, if you have any questions or need help, email us at support@gruntwork.io!

Gruntwork Updates

Kubernetes in the Gruntwork Reference Architecture

Motivation: For the last few years, the Gruntwork Reference Architecture has supported Auto Scaling Groups (ASGs) and Elastic Container Service (ECS) as the primary ways to run workloads. As Kubernetes has grown in popularity, we have received a steadily growing number of requests to add support for it as a first-class offering. Today, we’re excited to announce that we can now offer Kubernetes as a new option for running workloads in the Gruntwork Reference Architecture!

Solution: We’ve created a number of production-grade modules for running Kubernetes on AWS and integrated them into the Reference Architecture (including hooking them into monitoring, alerting, networking, CI/CD, and so on). Under the hood, we run the Kubernetes control plane on top of Amazon’s Elastic Kubernetes Service (EKS), so it’s a fully managed service. On top, we run Helm and Tiller to make it easy to deploy and manage workloads in your Kubernetes cluster. And in between, we’ve spent a lot of time configuring everything for high availability (see our Zero Downtime Server Updates For Your Kubernetes Cluster blog post series), scalability, security (including TLS auth, namespaces, and strict RBAC controls), and testing (see our blog post, Automated Testing for Kubernetes and Helm Charts using Terratest).

If you’re a Gruntwork customer, you can see an example of what the Reference Architecture integration looks like in our Acme Company examples below (and if you’re not a customer, sign up now to get access!):

  • eks-cluster: A module to manage the EKS cluster with its workers.
  • eks-core-services: A module to deploy and manage core administrative services on your EKS cluster.
  • k8s-namespace-with-tiller: A module to provision a new Kubernetes Namespace with a deployed Tiller (Helm Server) so that you can use helm to install services into that Namespace.
  • k8s-service: A module for deploying a Dockerized app onto Kubernetes using helm.
  • k8s-tiller: A module for managing Tiller (Helm Server).

See also the corresponding changes in the infrastructure-live repository to see how they are deployed.

What to do about it: We can deploy a Kubernetes-based Reference Architecture for you in about one day as part of the Gruntwork Subscription. Alternatively, if you’re already a subscriber, check out the links in the previous section to learn how to deploy Kubernetes into your existing infrastructure-modules and infrastructure-live repos. Let us know how it works for you, and if you’ve got any comments or questions, contact us at support@gruntwork.io!

AWS Reference Architecture Overview Blog Post Series

Motivation: Many people have asked us about the details of what it takes to go to production on AWS. We’ve captured these details in the Gruntwork Reference Architecture, but haven’t done a great job of explaining what those details include. Potential customers wanted to know the specific components of the architecture and how they were set up before purchasing. Existing customers wanted to know about some of the design choices we made.

Solution: We wrote a new blog post series, How to Build an End to End Production-Grade Architecture on AWS! The series builds up to the Reference Architecture by walking through the concerns you need to address when going to production on AWS. It covers both which infrastructure components to choose (e.g., Kubernetes, VPCs, KMS, Jenkins, etc.) and why those choices make sense.

What to do about it: Click on the links below and start reading!

  1. Part 1: Network Configuration, Kubernetes, Microservices, and Load Balancing
  2. Part 2: CI/CD, Multiple Accounts, Secrets Management, CDN, VPN, and Monitoring
  3. Part 3: Bootstrap Your Production-Grade Infrastructure in a Day

Terraform 0.12 updates

Motivation: Terraform 0.12 final is now out (see the DevOps News section below), so we’ve been hard at work updating all of our modules and tooling to work with it.

Solution: Here are the latest updates:

Terragrunt v0.19.0 and above now supports Terraform 0.12! As a bonus, we’re now using HCL2 syntax with Terragrunt, which (a) makes your code cleaner and (b) allows you to use built-in Terragrunt functions everywhere in your Terragrunt configuration! Make sure to read the migration guide for upgrade instructions. Also, check out Terragrunt: how to keep your Terraform code DRY and maintainable for an overview of how to use Terragrunt in 2019.

Terratest v0.15.8 and above now supports Terraform 0.12. See below for more info on Terratest updates.

Infrastructure as Code Library: we’ve updated a number of modules in the Infrastructure as Code Library (see the module version compatibility chart), but we still have quite a few more to go. Note that these are backwards incompatible releases, so the latest versions of our modules will no longer support Terraform 0.11.

What to do about it: Since we are still in the process of upgrading all of our modules to work with Terraform 0.12, and since the upgrade process is backwards incompatible, for the time being, we recommend that you continue to use Terraform 0.11.x. Once everything is ready to go with Terraform 0.12.x, we’ll send out full upgrade instructions. We know you’re excited to upgrade, so we’re making every effort to have everything ready by the end of June, but take that as a good faith estimate, and be aware of the usual caveats about DevOps time estimates and yak shaving!

Important SQS Security Fix

Motivation: We discovered that our sqs module had a very unsafe default configuration that allowed unauthenticated incoming requests from any IP.

Solution: We’ve updated the sqs module so that IP-based access is now disabled completely by default. Unless you intend to allow unauthenticated IP-based access, we strongly recommend updating to this new version. If you do need to allow IP-based access, set apply_ip_queue_policy to true and specify the IPs that should be able to access the queue via allowed_cidr_blocks.

What to do about it: We strongly recommend updating to package-messaging, v0.2.0 ASAP.

Reference Architecture Open JDK Fix

Motivation: Last month we discovered that Oracle changed their policies to require authentication for all downloads of their JDK, which broke our install-oracle-jdk module. As a solution, we introduced an install-open-jdk module and updated all our Java-based infrastructure packages to use it: Kafka, Zookeeper, and ELK. However, customers were asking how to apply these changes to their Reference Architectures.

Solution: This month, we updated our Reference Architecture examples to point to the install-open-jdk module wherever they previously referenced install-oracle-jdk. If you use Kafka, Zookeeper, or ELK, you will want to apply the same update to your Packer templates.

What to do about it: Check out this commit for an example of the locations you will need to update.

Reference Architecture Script local readonly Bug Fix

Motivation: In our bash scripts for the Reference Architecture, we have been using local readonly to mark variables as locally scoped and immutable. However, this does not do what you might expect: the local builtin treats readonly as just another variable name, so the variables end up locally scoped but still mutable.

Solution: We updated all our bash scripts in the Reference Architecture to replace the usage of local readonly with local -r. We also took care to mark read-only arrays using local -r -a.
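
To make the difference concrete, here is a minimal bash sketch. The function and variable names are hypothetical and are not taken from the Reference Architecture scripts. It relies on the fact that local accepts a list of names, so local readonly name=... declares a variable called readonly alongside name rather than making name read-only.

```shell
#!/usr/bin/env bash

# Buggy form: `local` treats `readonly` as just another variable name,
# so `name` is declared local but stays writable.
with_local_readonly() {
  local readonly name="first"
  name="second"   # succeeds, because name is not read-only
  echo "$name"
}

# Fixed form: -r marks the local variable as read-only.
with_local_r() {
  local -r name="first"
  # Try the reassignment in a subshell so the resulting error
  # cannot abort the rest of this function.
  if (name="second") 2>/dev/null; then
    echo "reassigned"
  else
    echo "$name"
  fi
}

with_local_readonly   # prints "second"
with_local_r          # prints "first"
```

For arrays, the same pattern applies with local -r -a, for example local -r -a names=("a" "b").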

What to do about it: Check out this commit for an example of the locations you will need to update.

Terratest updates

Motivation: We needed to make a number of Terratest updates, including support for our ongoing work to update to Terraform 0.12, improved GCP support, and new features for working around flaky tests.

Solution: We’ve made the following updates:

  • terratest, v0.15.6: You can now specify -var-file in the packer module to use json files as variable input. Check out our example usage.
  • terratest, v0.15.7: Added support for deleting SSH Public Keys attached to a Google user identity.
  • terratest, v0.15.8: Fixed a bug where GCP credentials authentication could fail intermittently; we now handle this with a retry loop. Also fixed a regression introduced in v0.15.2 that broke the handling of lists and maps in the vars for Terraform.
  • terratest, v0.15.9: Improved the resiliency of the GCP methods for obtaining an OAuth2 token by adding retries. This helps work around intermittent “TLS handshake timeout” errors. Fix a bug in how Terratest was setting -backend-config parameters during terraform init. We were using a space as a separator, but Terraform requires using equals.
  • terratest, v0.15.10: Improves stability of the terratest CI build.
  • terratest, v0.15.11: Added GetEc2InstanceIdsByFilters, which provides an interface to retrieve EC2 instances by defining filters as a map. This release also introduced functionality for testing DynamoDB.
  • terratest, v0.15.12: Added support for testing Terragrunt. Check out the release notes for more info.
  • terratest, v0.15.13: Fixed the terraform.OutputList and terraform.OutputMap methods to work with Terraform 0.12.
  • terratest, v0.16.0: Added a new DoWithRetryableErrors method that takes in a map of retryable errors and an action to execute, and if the action returns an error, retries it if the error or the action's stdout/stderr matches one of the retryable errors. Updated the terraform code to use this DoWithRetryableErrors method under the hood for retries. Added support for retryable errors for Packer builds via the new RetryableErrors, MaxRetries, and TimeBetweenRetries settings in packer.Options.
  • terratest, v0.16.1: NewAuthenticatedSession in modules/aws now supports returning credentials set by assuming a role. This can be done by setting the environment variable TERRATEST_IAM_ROLE to the ARN of the IAM role that should be assumed. When this env var is not set, it reverts to the old behavior of looking up credentials from the default location.
  • terratest, v0.17.0: InitAndPlan and InitAndPlanE now return the text output from stdout and stderr, instead of the exit code as an integer. The original versions that returned the exit code have been renamed to InitAndPlanWithExitCode and InitAndPlanWithExitCodeE. As a part of this, introduced Plan and PlanE functions, which can be used to just run terraform plan. These will return the stdout and stderr outputs.
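
The retry-on-known-errors idea behind DoWithRetryableErrors in v0.16.0 can be sketched in bash as follows. This is not Terratest's Go API: retry_on_known_errors and its arguments are hypothetical names used for illustration, and a single extended regular expression stands in for the map of retryable errors.

```shell
#!/usr/bin/env bash

# Run a command, retrying only when it fails AND its combined stdout/stderr
# matches a known-transient error pattern (an extended regular expression).
# Usage: retry_on_known_errors MAX_RETRIES SLEEP_SECONDS PATTERN COMMAND [ARGS...]
retry_on_known_errors() {
  local -r max_retries="$1"
  local -r sleep_seconds="$2"
  local -r retryable_pattern="$3"
  shift 3

  local attempt=0
  local output
  local exit_code

  while true; do
    # Capture stdout and stderr together, then record the exit code.
    output="$("$@" 2>&1)"
    exit_code=$?
    printf '%s\n' "$output"

    # Success, or an error we do not recognize as transient: stop here.
    if [[ "$exit_code" -eq 0 ]] || ! [[ "$output" =~ $retryable_pattern ]]; then
      return "$exit_code"
    fi

    # The error is retryable, but we may have run out of attempts.
    if (( attempt >= max_retries )); then
      return "$exit_code"
    fi

    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
}
```

A call such as retry_on_known_errors 3 10 'TLS handshake timeout' some_command would run some_command (a placeholder) up to four times in total, pausing 10 seconds between attempts, and would give up immediately on any error that does not match the pattern.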

Open source updates

Other updates

  • module-aws-monitoring, v0.12.4: This release adds conditional support for the logs/load-balancer-access-logs module. You can now set create_resources = false on the module call to avoid creating the S3 bucket.
  • module-aws-monitoring, v0.12.5: This release fixes the principal on logs/load-balancer-access-logs module’s policy so that NLBs can write to the S3 bucket.
  • module-aws-monitoring, v0.12.6: Fix the period setting for the SQS alarm to use a minimum of 5 minutes rather than 1 minute, as SQS metrics are only collected once every 5 minutes, so trying to alert more often doesn't work.
  • module-security, v0.16.3: You can now tell the iam-groups module to not create the "access-all" group by setting the new input variable should_create_iam_group_cross_account_access_all to false. This can help work around an AWS limitation where we exceed the max IAM policy length.
  • module-security, v0.16.4: You can now configure an optional SNS delivery notification topic for the cloudtrail module using a new sns_delivery_topic input variable.
  • package-elk, v0.2.9: Switched the elasticsearch-cluster-backup and elasticsearch-cluster-restore modules over to using Node 8.10 as the runtime, as 6.10 has been deprecated. The runtime is now also configurable via the lambda_runtime input variable.
  • module-ecs, v0.13.4: All the ECS service modules now allow you to optionally specify a custom prefix to use for the IAM execution role. The default value is to use the service name as before.
  • module-server, v0.6.2: The attach-eni script is now compatible with Ubuntu 18.04.
  • module-vpc, v0.5.8: var.custom_tags now propagate to EIP resources created in the VPCs.

DevOps News

Terraform 0.12 (and 0.12.1) is out!

What happened: HashiCorp has released Terraform 0.12 final. They also followed up shortly after with 0.12.1, which fixes some important bugs.

Why it matters: Terraform 0.12 brings with it a number of powerful new features, but will also require a significant upgrade.

What to do about it: See the “Terraform 0.12 updates” section above.

Amazon MSK is now Generally Available

What happened: Amazon’s managed Kafka service, MSK, is now generally available in all AWS accounts.

Why it matters: Before, MSK was only available in “preview mode” to select accounts. The service is now a bit more mature and available everywhere as a managed way to run Apache Kafka (and Apache ZooKeeper).

What to do about it: Give MSK a shot and let us know what you think! We do not have a dedicated module for it, but you can try out the aws_msk_cluster resource to deploy it yourself.

ECS now supports increased ENI limits with awsvpc networking mode

What happened: AWS has added support for trunking, which allows certain instance types to have a higher ENI limit for ECS Tasks in awsvpc networking mode.

Why it matters: When using awsvpc networking mode, each ECS Task gets its own IP address by way of an Elastic Network Interface (ENI). Under the hood, each ECS Task runs on an EC2 Instance, and those instances typically had very low limits on how many ENIs you could attach (e.g., typically only 1–2 until you got to really large instance types). That meant you would often run out of ENIs long before you ran out of CPU or memory resources. Now, if you enable the new awsvpcTrunking mode, certain instance types will allow you to attach 3–8x as many ENIs as before, allowing you to make much better use of your CPU and memory resources.
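
As a rough sketch, the account-level opt-in can be done with the AWS CLI's put-account-setting-default command. The wrapper function and the AWS_CLI override variable below are hypothetical, added only so the command can be dry-run without touching an AWS account. Check the announcement for the supported instance types and any other prerequisites.

```shell
#!/usr/bin/env bash

# Set awsvpcTrunking as the default ECS account setting in the current region.
# AWS_CLI is a hypothetical override: set it to `echo` to print the command
# instead of running it against AWS.
enable_eni_trunking() {
  "${AWS_CLI:-aws}" ecs put-account-setting-default \
    --name awsvpcTrunking \
    --value enabled
}
```

Running AWS_CLI=echo enable_eni_trunking prints the arguments that would be passed to the CLI, which is a cheap way to review the call before applying it.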

What to do about it: Check out the announcement blog post for instructions.

AWS Lambda adds support for Node.js v10, deprecates Node.js v6

What happened: AWS Lambda now allows you to use Node.js v10 as a runtime, while the older Node.js v6 runtime is now deprecated.

Why it matters: If you were using Node.js v6, you need to update immediately, as it will stop working soon. Node.js v10 includes a number of performance improvements and is generally a safe upgrade.

What to do about it: If you’re using package-lambda to manage your Lambda functions, update your runtime parameter to nodejs10.x.

Security Updates

Below is a list of critical security updates that may impact your services. We notify Gruntwork customers of these vulnerabilities as soon as we know of them via the Gruntwork Security Alerts mailing list. It is up to you to scan this list and decide which of these apply and what to do about them, but most of these are severe vulnerabilities, and we recommend patching them ASAP.

Multiple CPU Vulnerabilities

  • On May 14th, 2019, multiple teams of security researchers around the world independently discovered various CPU vulnerabilities:
      • The RIDL and Fallout speculative execution attacks allow attackers to leak confidential data across arbitrary security boundaries on a victim system, for instance compromising data held in the cloud or leaking your information to malicious websites.
      • The ZombieLoad attack uncovers a novel Meltdown-type effect in the processor’s previously unexplored fill-buffer logic.
      • Store-To-Leak Forwarding exploits CPU optimizations introduced by the store buffer to break address randomization, monitor the operating system, or leak data when combined with Spectre gadgets.