Gruntwork Newsletter, October 2018

Yevgeniy Brikman
Co-Founder
Published September 11, 2018

Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.

Hello Grunts,

In the last month, we hit a big milestone at Gruntwork: $1 million in annual recurring revenue! Then, we got right back to work, and made a huge number of updates, including making major changes to our ELK code to work around NLB limitations, updating Terratest so it can take a “snapshot” of your configs and logs to make it easier to debug test failures, updating Terragrunt so it automatically retries on errors that are known to be transient, fixing the perpetual diffs issue with S3 bucket lifecycle settings, adding support for Oracle Cloud Infrastructure to Terratest, and a huge number of other fixes and improvements. In other news, you can now use Yubikeys with AWS and the Oracle JDK now requires a paid support contract for production usage, so you may need to change JDKs soon.

As always, if you have any questions or need help, email us at support@gruntwork.io!

Gruntwork Updates

Gruntwork is now generating $1 million in annual recurring revenue

Motivation: Our mission is to make it 10x easier to understand, build, and deploy software. To do that at scale, we realized that we needed to build a sustainable company.

Solution: We created Gruntwork and began offering access to world-class infrastructure code, DevOps software, training, and support as a part of a subscription. This subscription is now bringing in over $1 million in annual recurring revenue (ARR). We are deeply grateful to our customers for making this possible.

What to do about it: Check out How we got to $1 million in annual recurring revenue with $0 in fundraising for all the details.

Major Release: ELK Package

Motivation: While using our ELK code over the last couple of months, we hit a few limitations with using an NLB as the load balancer of choice for our inter-cluster communication.

Solution: We replaced the NLB with an ALB for communication between clusters. However, since Filebeat can only communicate with Logstash on a pure TCP protocol, and the ALB only supports HTTP/HTTPS, we can’t use the ALB with Filebeat. To get around this issue, we came up with an auto discovery mechanism that resides on the application server. It runs as a cron job on the server, periodically looking up Logstash EC2 instance IPs using the AWS APIs, updating the Filebeat configuration with the IPs of the returned instances, and restarting Filebeat to load the new configuration. We also rely on Filebeat’s built-in load balancing feature to distribute requests among the Logstash instances.
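The discovery step described above can be sketched roughly as follows. This is a minimal Python illustration, not the code that ships in package-elk: the function names, the default port 5044, and the choice to pass the Logstash IPs in as an argument (instead of calling the AWS APIs) are all assumptions made to keep the example self-contained. It renders the output.logstash section of a Filebeat config with load balancing enabled and reports whether the result differs from the current config.

```python
from typing import Iterable, Optional, Tuple


def render_logstash_output(ips: Iterable[str], port: int = 5044) -> str:
    """Render the output.logstash section of a Filebeat config as YAML text.

    The port default is an assumption for this sketch, not a value from package-elk.
    """
    # Sort and de-duplicate so the same set of instances always renders identically.
    hosts = ", ".join(f'"{ip}:{port}"' for ip in sorted(set(ips)))
    return (
        "output.logstash:\n"
        f"  hosts: [{hosts}]\n"
        "  loadbalance: true\n"
    )


def refresh_config(current: Optional[str], ips: Iterable[str]) -> Tuple[str, bool]:
    """Return the new config text and whether it differs from the current one."""
    new_config = render_logstash_output(ips)
    return new_config, new_config != current
```

A cron-driven script could call refresh_config with the IPs returned by its AWS lookup, write the new text to disk, and restart Filebeat. Restarting only when the second return value is True is a design choice of this sketch that avoids needless restarts.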

What to do about it: This is a major backwards-incompatible change, and special care needs to be taken to ensure a smooth upgrade. The following steps are a good starting point:

  1. Remove your use of the nlb module and replace it with an alb. See example here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L436
  2. Replace your use of the load-balancer-target-group module with the newly added load-balancer-alb-target-group module. See an example of using the new module here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L71
  3. Update the various target_group_arns arguments passed to the cluster modules. See example here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L40
  4. If you’re using SSL with the ALB, take note of the ALB upgrade notes: module-load-balancer, v0.12.0

Terratest can now help take a snapshot of your config/logs

Motivation: When an infrastructure test fails, to understand what went wrong, you typically need the logs and config files from your deployed apps and services. Currently, getting at this information is a bit of a pain: you’d need some way to run the tests, “pause” (i.e., not tear down) the infrastructure after a failure, ssh to individual instances, and then view the logs and config files to see what went wrong. This is hard to do, especially when your tests are running automatically on a CI server.

Solution: Terratest can now automate the task of taking a “snapshot” of your whole deployment by grabbing a copy of log files, config files, and any other files useful for debugging. If you configure your CI server correctly, you can make this “snapshot” easy to browse. For example, when one of our ELK automated tests fails, here is how we can use CircleCI to debug what went wrong:

What to do about it: Update your code to use Terratest v0.13.0 and then take a look at our example readme for a full walk-through of the functionality and how to use it.

Terragrunt will now automatically retry on transient errors

Motivation: Occasionally, when you run a command like terraform apply, you get a transient/intermittent error, such as a TLS handshake timeout or CloudWatch concurrency error. If you just re-run apply, the error goes away, but having to deal with these intermittent failures is frustrating, especially in CI environments, and especially when running many commands at once (e.g., via apply-all).

Solution: We’ve updated Terragrunt to automatically retry commands when you hit an error that is known to be transient! There’s nothing for you to do to enable it: if Terragrunt recognizes the error, it will automatically re-run the last command up to a configurable number of times (default is 3) with a configurable sleep between retries (default is 5 seconds). You can find the list of known transient errors in auto_retry_options.go. We will add support for specifying a custom list of retryable errors in the future (if you want this feature soon, PRs are very welcome!).
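The retry behavior described above can be sketched as follows. This is a minimal Python illustration and not Terragrunt's Go implementation: the function name, the single example pattern, and the reading of "up to 3 times" as three total attempts are assumptions of the sketch. The real list of transient errors lives in auto_retry_options.go.

```python
import re
import time
from typing import Callable, Sequence, Tuple

# Hypothetical pattern list for illustration only.
TRANSIENT_ERROR_PATTERNS = [r"TLS handshake timeout"]


def run_with_retry(
    command: Callable[[], Tuple[int, str]],
    patterns: Sequence[str] = TRANSIENT_ERROR_PATTERNS,
    max_attempts: int = 3,
    sleep_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Run command, re-running it while its output matches a known transient error."""
    for attempt in range(1, max_attempts + 1):
        exit_code, output = command()
        if exit_code == 0:
            return exit_code, output
        is_transient = any(re.search(p, output) for p in patterns)
        # Give up on unknown errors, or once the attempts are used up.
        if not is_transient or attempt == max_attempts:
            return exit_code, output
        sleep(sleep_seconds)
    raise ValueError("max_attempts must be at least 1")
```

Here command is any callable that returns an exit code and the combined output of one run. Errors that match no pattern are returned immediately, so genuine failures are not retried.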

What to do about it: Give Terragrunt v0.17.0 a shot and see if it makes your Terraform usage a little more stable and reliable. Check out the Auto Retry docs for more details, including how to configure retries and sleeps, and how to disable retry functionality if, for some reason, it doesn’t work with your use cases.

Fix perpetual diff errors with S3 buckets

Motivation: For a while, some of our modules that used S3 buckets with lifecycle settings would always show a diff when you ran plan, even though nothing had changed.

Solution: Thanks to the help of one of our customers, we believe we’ve figured out the cause: you should not set both the expired_object_delete_marker and days parameters in an expiration block. We’ve fixed this issue in our load-balancer-access-logs and cloudtrail modules.
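As a rough illustration of the rule above, the following Python sketch flags an expiration block that sets both parameters. It is a hypothetical lint-style helper, not part of any Gruntwork module, and it assumes the expiration block has already been parsed into a dictionary keyed by the Terraform parameter names.

```python
from typing import Any, List, Mapping


def expiration_conflicts(expiration: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one S3 lifecycle expiration block."""
    problems = []
    # Any explicitly set value counts, including expired_object_delete_marker = false.
    has_days = expiration.get("days") is not None
    has_marker = expiration.get("expired_object_delete_marker") is not None
    if has_days and has_marker:
        problems.append(
            "expiration sets both 'days' and 'expired_object_delete_marker'; keep only one"
        )
    return problems
```

An empty list means the block follows the rule; a non-empty list names the conflict.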

What to do about it: To pick up these fixes, update to module-aws-monitoring, v0.9.3 and module-security, v0.15.2.

Terratest now supports OCI

Motivation: Terratest is Gruntwork’s Swiss Army knife for infrastructure testing. Last month, we updated Terratest with support for testing infrastructure on Google Cloud Platform (GCP). This month, someone wanted to use Terratest to test infrastructure on Oracle Cloud Infrastructure (OCI).

Solution: Terratest now has initial support for OCI! Check out packer_oci_example_test.go for an example.

What to do about it: Grab Terratest v0.12.0 and take the oci package for a spin.

Jenkins backup cleanup fix

Motivation: There was a bug in how we configured the code that cleans up old backups for Jenkins in the Reference Architecture. As a result, backups wouldn’t be cleaned up, and more and more snapshots would pile up over time.

Solution: The fix requires tweaking the value of a single parameter, delete_older_than, from 15 to 15d, as shown in this commit in the Acme sample Reference Architecture.

What to do about it: If you’re using Jenkins with the Reference Architecture:

  1. Update your delete_older_than parameter as shown above.
  2. Publish a new version of your infrastructure-modules repo.
  3. Run terragrunt apply in your infrastructure-live repo to deploy the changes.

Package SAM updates

Motivation: There were several small bugs and no way to pass environment variables to AWS SAM CLI while testing locally.

Solution: We implemented some bug fixes and also added support for passing environment variables to AWS SAM CLI through the Swagger file.

What to do about it: To pick up these fixes, update to package-sam, v0.1.7.

Gruntwork Houston updates

We’ve made a number of updates to Gruntwork Houston in the last month:

  • Documentation for Okta: We’ve added step-by-step documentation for how to use Okta as an identity provider with Houston so that you can log in to AWS via the web, CLI, VPN, and SSH using your Okta credentials.
  • houston-cli, v0.0.7: Added the ability to create and set up the houston configuration from the command line using the newly introduced houston configure command.
  • houston-cli, v0.0.8: Improved help text output and bugfix to houstonUrl in config file to allow trailing slashes.

Are you interested in joining the Houston beta? Email us at info@gruntwork.io!

ELK updates

In addition to the NLB replacement mentioned at the top of this newsletter, we also made a number of other updates to package-elk in the last month:

  • package-elk: v0.2.1: Added iam_role_id as an output variable for the logstash-cluster module. This variable is useful for adding ssh-grunt IAM policies to this ASG.
  • package-elk: v0.2.2: Added a missing = character to a Terraform local declaration. The behavior was inconsistent: some customers reported issues as a result, while other tests ran and passed without issue.
  • package-elk: v0.2.3: Added options to Kibana cluster module to pass in ui & ssh security group ids (along with num of ui & ssh security group ids)
  • package-elk: v0.2.4: Added pass through plumbing in logstash-cluster for passing through allowed security groups for collectd and beats to the underlying logstash-security-group-rules module. This is very handy for specifying allowed security groups without having to have a 2nd logstash-security-groups module.
  • package-elk: v0.2.5: Fixed improperly passing allowed_ssh_security_group_ids to aws_launch_configuration resources in both kibana and elastalert modules. Also added proper plumbing for allow_ssh_from_security_group_ids to be specified in the elastalert module and then be passed all the way through to the underlying elastalert-security-group-rules module
  • package-elk: v0.2.6: This release addresses issue: #57. kibana-cluster will now create egress rules for the security group that it creates. Stabilized the ELK tests. Added better documentation/clarified examples with our AMI and example code READMEs
  • package-elk: v0.2.7: Pass through the security groups allowed for service discovery so that we can set that right on the main module. Also renamed vars.tf to variables.tf

Terragrunt updates

We made a number of other updates to Terragrunt in the last month:

  • Terragrunt, v0.16.9: Add support for force_path_style in the S3 config. Add support for skipping S3 bucket versioning via the skip_bucket_versioning config.
  • Terragrunt, v0.16.10: Terragrunt will now properly respect the shared_credentials_file config for S3 backends, using it when creating S3 buckets and DynamoDB tables.
  • Terragrunt, v0.16.11: You can now tell Terragrunt to exclude specific subdirectories when running the xxx-all commands (e.g., apply-all) by using the --terragrunt-exclude-dir flag. This flag supports wildcard expressions and may be specified multiple times.
  • Terragrunt, v0.16.12: Fix the prevent_destroy flag so it works even when configs are inherited from a parent .tfvars file.
  • Terragrunt, v0.16.13: When you use extra_arguments, Terragrunt will no longer pass -var or -var-file arguments to Terraform when you call apply with a plan file.
  • Terragrunt, v0.16.14: This is a follow-up to v0.16.13 that fixes a bug where -var and -var-file were still passed if you called apply with a plan file and other arguments in between (e.g., terragrunt apply <other args> <plan file>).

Terratest updates

We made a number of other updates to Terratest in the last month:

  • Terratest, v0.10.4: Terratest now has methods for running terraform plan and extracting the exit code, including InitAndPlan and PlanExitCode.
  • Terratest, v0.12.1: Added new helper methods ScpFileFrom and ScpDirFrom that transfer files from remote EC2 instances to the local machine. The main idea is to make it easy to tell Terratest to grab all of the log and config files from your app running on a remote machine when a test is going to fail. We already had methods in Terratest that would grab the contents of those files and return them as a string. The new methods expand on that functionality and make it easy to grab and archive all of the log and configuration files on your CI of choice.
  • Terratest, v0.12.2: Added the WorkspaceSelectOrNew method that can be used to create and select Terraform workspaces at test time.

Other open source updates

  • terraform-aws-consul, v0.4.0: Important updates to the way security group rules are managed in the consul-cluster, consul-security-group-rules, and consul-client-security-group-rules modules.
  • terraform-aws-vault, v0.10.3: The vault-security-group-rules module now adds a self rule so that Vault servers can talk to each other via their API port.
  • terraform-aws-nomad, v0.4.5: You can now add EBS Volumes to your Nomad cluster by configuring the new ebs_block_device parameter in the nomad-cluster module.
  • gruntwork-cli: v0.2.0: Added a custom HelpPrinter function that will wrap help text at specified line width, while preserving indentations in the output table. To use, you can call entrypoint.NewApp() to construct the cli app which will take care of applying the modifications, or manually apply the changes yourself on the cli app. You can also modify the line width by changing entrypoint.HelpTextLineWidth (defaults to 80).

Other updates

  • module-load-balancer, v0.12.0: Updated the ALB module to accept two new variables https_listener_ports_and_acm_ssl_certs_num and https_listener_ports_and_ssl_certs_num to specify the length of the mappings between ports and their associated (non)acm certificates. This allows the values of the mappings to be dependent on dynamic resources. See: hashicorp/terraform#11482
  • module-ecs, v0.8.4: The ecs-service-with-discovery module now outputs the security group ID via the output variable ecs_task_security_group_id.
  • module-ecs, v0.8.5: You can now configure volumes for the ecs-service module using the new volumes parameter.
  • package-lambda, v0.2.3: Add a new parameter called wait_for to the lambda module. None of the resources in the module will be created until wait_for is resolved, which allows you to execute other steps (e.g., create zip file) before this module runs. This is a workaround for the lack of depends_on for modules in Terraform.
  • module-asg: v0.6.16: Handled a possible concurrency issue that can cause a fatal exception while multiple processes attempt to unzip the boto3 library zip file in get-desired-capacity.py. We will now attempt to unzip the archive and catch any exception, and if it is the exception related to our concurrency issue, simply sleep for 5 seconds and try again.
  • cloud-nuke, v0.1.3: This release improves the nuking strategy by ensuring deletion functions don’t return when they encounter an error on just a single resource. It also adds the ability to nuke ASG launch configurations.
  • module-aws-monitoring, v0.9.2: The cloudwatch-log-aggregation-scripts, cloudwatch-memory-disk-metrics-scripts, and syslog modules now support Amazon Linux 2.
  • package-openvpn, v0.8.0: package-openvpn now uses bash-commons under the hood. The behavior is identical, but you must now install bash-commons before installing any of the package-openvpn modules.

DevOps News

AWS now supports Yubikey for MFA

What happened: AWS now supports the Yubikey as a Multi-Factor Auth device.

Why it matters: The Yubikey is a tiny hardware USB device that supports a range of security functionality, including generating one-time passwords that can be used for Multi-Factor Authentication (MFA). It’s easier to use and (arguably) more secure than other MFA options, such as using the Google Authenticator app on your phone.

The way it works is you (or your company) buy a Yubikey and register it with (a) Yubico’s online service and (b) the online service you’re trying to log into, such as AWS. Then, whenever you log into that online service, it will ask you not only for a username and password, but also for a Yubikey token. To enter the token, you simply click on the text field in your browser and push a button on the Yubikey itself, and it will automatically enter the token for you (the Yubikey behaves as a USB keyboard), without you having to take your phone out of your pocket or type anything in manually. The web service will then check your token with the Yubikey service, and if it’s valid, allow you to log in.

What to do about it: If you wish to start using a Yubikey with AWS, follow the instructions here.

As of version 11, Oracle JDK will no longer be free

What happened: Oracle has released Java 11, but the terms come with a catch: you may no longer use Oracle’s JDK for commercial or production purposes without a paid support contract from Oracle.

Why it matters: For many years, the Oracle JDK was the recommended JDK for most Java apps, as it was the best maintained, had all the bells and whistles, and gave you the option to purchase support from Oracle. While you can still use the Oracle JDK for developing, testing, prototyping, and learning, the support contract is now no longer optional for production or commercial usage.

What to do about it: If you don’t want to pay Oracle for a support contract, you need to move to one of the flavors of OpenJDK.

The good news is that OpenJDK is more or less identical to Oracle JDK these days, so this should not generally cause issues. We will be updating our code (namely, the JDK installer in package-zookeeper) to use one of the OpenJDK flavors in the future.

RDS now supports deletion protection

What happened: Amazon has added support for deletion protection for RDS and Aurora databases.

Why it matters: You can turn on deletion protection with a single click (or single line of code). Once enabled, if you try to delete a database with deletion protection, you get an error (the only way to delete such a database is to explicitly disable deletion protection). This provides an extra sanity check to help protect your production databases from accidental deletion (e.g., accidental terraform destroy).

What to do about it: You can enable deletion protection via the UI now. We’ll be exposing a flag to enable this feature in module-data-storage in the future (if you need it sooner, PRs are welcome!).
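For those who prefer a script to the UI, the "single line of code" could look something like the Python sketch below. It assumes an RDS client created with a version of boto3 recent enough to accept the DeletionProtection parameter on modify_db_instance. The helper name is hypothetical, and Aurora clusters would need the equivalent cluster-level call instead.

```python
def enable_deletion_protection(rds_client, db_instance_identifier: str):
    """Turn on deletion protection for a single RDS instance.

    rds_client is expected to behave like boto3.client("rds").
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_identifier,
        DeletionProtection=True,
        ApplyImmediately=True,
    )
```

With boto3 installed and credentials configured, usage would be along the lines of enable_deletion_protection(boto3.client("rds"), "example-db"), where "example-db" is a placeholder identifier.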

ElastiCache for Redis now supports read replicas for sharded Redis

What happened: Amazon has announced that ElastiCache for Redis now supports adding and removing read replica nodes for both sharded and non-sharded Redis clusters.

Why it matters: This makes it easier to scale your reads and improve availability for your Redis Cluster environments without requiring manual steps or needing to make application changes.

What to do about it: Check out the announcement blog post for the details.

Security Updates

Below is a list of critical security updates that may impact your services. We notify Gruntwork customers of these vulnerabilities as soon as we know of them via the Gruntwork Security Alerts mailing list. It is up to you to scan this list and decide which of these apply and what to do about them, but most of these are severe vulnerabilities, and we recommend patching them ASAP.

Jenkins

  • Jenkins Security Advisory 2018-09-25: A number of vulnerabilities have been found in Jenkins plugins. We did not notify the Gruntwork Security Alerts mailing list, as most of these vulnerabilities are of “low” or “medium” severity, except for one: the Monitoring Plugin has a vulnerability that allows an attacker to send crafted requests to a web application for extraction of secrets from the file system, server-side request forgery, or denial-of-service attacks. If you are using this plugin, we recommend updating immediately.