Apache Kafka and Confluent Tools

Deploy a cluster of Kafka brokers. Optionally deploy Confluent tools such as Schema Registry, REST Proxy, and Kafka Connect.

Run Kafka Script

This folder contains a script for configuring and running Kafka. Typically, you would run this script while your server is booting to start Kafka. This script assumes that the following are already installed:

  1. Kafka: see the install-kafka module.

  2. Supervisord: see the install-supervisord module in package-zookeeper.

  3. AWS CLI.

  4. jq.

The run-kafka script will generate a Kafka configuration file (see Kafka config docs below for details) and then use Supervisord to start Kafka.

This script has been tested on the following operating systems:

  • Amazon Linux
  • Ubuntu

There is a good chance it will work on Debian, CentOS, and RHEL as well, but our automated testing for this module does not cover these other distros at the moment.

Quick start

The easiest way to install the run-kafka script is with the Gruntwork Installer:

gruntwork-install \
  --module-name "run-kafka" \
  --repo "https://github.com/gruntwork-io/package-kafka" \
  --tag "v0.0.4"

We recommend running this module, as well as install-kafka, as part of a Packer template to create an Amazon Machine Image (AMI) (see the kafka-ami example for fully working sample code).

Install Command Line Arguments

You can install the run-kafka script by running the install.sh file in the run-kafka module folder. The install.sh script requires the following arguments:

  • --config-dir-src: The directory containing the Kafka config files to copy.
  • --log4j-config-dir-src: The directory containing the Log4j config files to copy.

In addition, the following optional arguments are accepted:

  • --install-dir: The directory where the run-kafka files should be installed. Default: /opt/kafka.
  • --user: The user who will be set as the owner of --install-dir. Default: kafka.

Run install.sh with the --help option or see the source code to see all additional arguments. If you wish to use SSL with Kafka, see the SSL Settings section below for additional arguments that are accepted.
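For example, a direct invocation of install.sh might look like the following (the source directories are placeholders for wherever your config files live):

install.sh \
  --config-dir-src /tmp/config/kafka/config \
  --log4j-config-dir-src /tmp/config/kafka/log4j \
  --install-dir /opt/kafka \
  --user kafka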

If you're using gruntwork-install to install this module, you can pass these arguments using --module-param arguments. Example:

gruntwork-install \
  --module-name "run-kafka" \
  --repo "https://github.com/gruntwork-io/package-kafka" \
  --tag "v0.0.4" \
  --module-param "config-dir-src=/tmp/config/kafka/config" \
  --module-param "log4j-config-dir-src=/tmp/config/kafka/log4j"

run-kafka command line arguments

When you run the run-kafka script, you must provide exactly one of the following arguments:

  • --zookeeper-eni-tag: The name and value of a tag, in the format name=value, that can be used to find ZooKeeper server ENI IPs.

  • --zookeeper-connect: A comma-separated list of the IPs of ZooKeeper nodes to connect to.

In addition, you will most likely want to explicitly specify the following optional arguments:

  • --config-path: The path to the Kafka config file. Default: /opt/kafka/config/dev.kafka.properties

  • --log4j-config-path: The path to the Log4j config file. Default: /opt/kafka/config/dev.log4j.properties

Although the above arguments are optional, in practice, a single server often contains configuration files for many environments (e.g. dev, stage, prod), and you can use these arguments to specify exactly which environment's configuration file should be used.
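For example, assuming the script ends up on the PATH after installation, a boot script on a prod server might run something like this (the tag name and file names are illustrative):

run-kafka \
  --zookeeper-eni-tag ServerGroupName=zookeeper-prod \
  --config-path /opt/kafka/config/prod.kafka.properties \
  --log4j-config-path /opt/kafka/config/prod.log4j.properties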

To see all other parameters accepted by run-kafka, run the script with the --help flag. Also, see the Kafka config docs below for the highlights.

Kafka config

The run-kafka script dynamically fills in the most important values in a Kafka configuration file. The script focuses primarily on values that differ from environment to environment (e.g., dev, stage, prod); to see how to set other values, see Other settings below.

Here are the key items to pay attention to:

  • ZooKeeper IPs
  • Broker ID
  • Number of partitions
  • Replication
  • Availability
  • JVM memory settings
  • SSL settings
  • Log directories
  • Other settings

ZooKeeper IPs

Kafka uses a ZooKeeper cluster for coordination. You can provide the IPs of the ZooKeeper nodes manually using the --zookeeper-connect argument, or you can let the run-kafka script discover them automatically. For automatic discovery, specify the --zookeeper-eni-tag argument with the name and value, in the format name=value, of a tag used on the ZooKeeper ENIs. This option assumes the ZooKeeper cluster is deployed using package-zookeeper, which uses the server-group module under the hood; that module assigns an ENI with special tags to each ZooKeeper server. See Server IPs and IDs for more info.
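For instance, the two modes look like this (the tag name and IP addresses are made up):

# Discover the ZooKeeper nodes automatically via ENI tags
run-kafka --zookeeper-eni-tag ServerGroupName=zookeeper-prod

# Or pass the ZooKeeper IPs explicitly
run-kafka --zookeeper-connect 10.0.0.4,10.0.0.5,10.0.0.6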

Broker ID

Every Kafka broker needs a unique ID. By default, the run-kafka script automatically figures out the Broker ID by looking up the ServerGroupIndex tag for the current server. This tag is set by the server-group module to a unique integer for each server in the server group. You can override this value with a custom broker ID by specifying the --broker-id argument.
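As a rough sketch of how such a lookup can work (this is illustrative, not the module's exact implementation), here is one way to read the ServerGroupIndex tag using the AWS CLI and jq, both of which the script already requires:

# Hypothetical sketch: look up this instance's ServerGroupIndex tag
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION="${AZ%?}"  # e.g., us-east-1a -> us-east-1
BROKER_ID=$(aws ec2 describe-tags \
  --region "$REGION" \
  --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=ServerGroupIndex" \
  --output json \
  | jq -r '.Tags[0].Value')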

Number of partitions

Every topic in Kafka consists of one or more partitions, and the number of partitions is one of the most important settings for determining the throughput, availability, and end-to-end latency of that topic. See How to choose the number of topics/partitions in a Kafka cluster? for more info.

You specify the number of partitions when creating a topic (e.g., by using the kafka-topics.sh script). You can also use the --num-partitions argument in the run-kafka script to configure the default number of partitions for automatically created topics (i.e., topics that are created when a producer first tries to write to them).
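For example, here is how you might create a topic with an explicit partition count (the topic name and ZooKeeper address are made up; depending on your Kafka version, kafka-topics.sh takes either --zookeeper or --bootstrap-server):

kafka-topics.sh --create \
  --zookeeper 10.0.0.4:2181 \
  --topic example-topic \
  --partitions 6 \
  --replication-factor 3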

Replication

The main way Kafka achieves durability and availability for your data is to replicate that data across multiple Kafka brokers. That way, if one broker goes down, the data is still available on the other brokers.

You specify replication settings when creating a topic (e.g., by using the kafka-topics.sh script). You can also specify the following replication settings using the run-kafka script:

  • --replication-factor: The default replication factor for auto-created topics.
  • --offsets-replication-factor: The replication factor for the internal __consumer_offsets topic.
  • --transaction-state-replication-factor: The replication factor for the internal __transaction_state topic.

In production, you typically want to set all of these to > 1 to ensure data isn't lost if a single broker dies.
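For example, a production-leaning invocation might set all three replication factors to 3 (other arguments omitted for brevity; the tag name is illustrative):

run-kafka \
  --zookeeper-eni-tag ServerGroupName=zookeeper-prod \
  --replication-factor 3 \
  --offsets-replication-factor 3 \
  --transaction-state-replication-factor 3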

Availability

The number of brokers you are running and the replication settings for your topics will have the biggest influence on your Kafka cluster's availability in the face of outages. There are two other important settings you can set via the run-kafka script:

  • --min-in-sync-replicas: The number of replicas that must be in-sync when a producer sets acks to all. For example, if you have 3 replicas, and you set this setting to 2, the producer will wait for 2 of the replicas to acknowledge they received the write before the producer considers the write successful. Setting this to a higher value (e.g., 3 for a topic with 3 replicas) reduces the chance of data loss, but it also reduces availability, as even a single broker going down means writes will fail. Setting this to 2 for a replication factor of 3 is common.

  • --unclean-leader-election: If set to true, an out-of-sync replica will be elected as leader when there is no live in-sync replica (ISR). This preserves the availability of the partition, but there is a chance of data loss. If set to false and there are no live in-sync replicas, Kafka returns an error and the partition will be unavailable. In general, if you're optimizing for availability, set this setting to true; if you're optimizing for reducing data loss, set this setting to false.
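For example, to favor durability over availability with 3 replicas per topic (the tag name is illustrative):

run-kafka \
  --zookeeper-eni-tag ServerGroupName=zookeeper-prod \
  --min-in-sync-replicas 2 \
  --unclean-leader-election false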

JVM memory settings

By default, we configure Kafka to run with 6g of memory. You can override this with the --memory argument. If you wish to override all JVM settings for Kafka, you can use the --jvm-opts argument.
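For example (the JVM flags shown are just an illustration, and the exact format --jvm-opts expects may differ):

# Give Kafka 8g of memory instead of the default 6g
run-kafka --memory 8g --zookeeper-eni-tag ServerGroupName=zookeeper-prod

# Or take full control of the JVM settings
run-kafka --jvm-opts "-Xms4g -Xmx4g" --zookeeper-eni-tag ServerGroupName=zookeeper-prod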

SSL settings

By default, Kafka brokers communicate over plaintext. If you wish to enable SSL, set the --enable-ssl argument to true. For SSL to work, you need the following:

  • A Key Store that contains an SSL certificate and a Trust Store that contains the CA that signed that SSL certificate. You can use the generate-key-stores module to generate the Key Store and Trust Store, and you can install them on your server by passing the --key-store-path and --trust-store-path arguments, respectively, to the install-kafka module.

  • You must use the --key-store-password argument to provide the run-kafka script with the password you used when creating the Key Store.

  • You must use the --trust-store-password argument to provide the run-kafka script with the password you used when creating the Trust Store.
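Putting these together, an SSL-enabled invocation might look like this (the passwords are placeholders; in practice, fetch them from a secrets store rather than hard-coding them in a script):

run-kafka \
  --zookeeper-eni-tag ServerGroupName=zookeeper-prod \
  --enable-ssl true \
  --key-store-password "KEY_STORE_PASSWORD_HERE" \
  --trust-store-password "TRUST_STORE_PASSWORD_HERE"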

Log directories

By default, the run-kafka script will configure Kafka to store its log data in /opt/kafka/kafka-logs/data. We strongly recommend mounting a separate EBS Volume at /opt/kafka/kafka-logs, as every write to Kafka gets persisted to disk, and you get much better performance if there is no contention for the hard drive from other processes.

You can override the log directories via the --log-dirs argument.
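For example, a boot script could format and mount a dedicated EBS Volume before starting Kafka (the device name varies by instance type and how the volume was attached):

# Illustrative: mount a dedicated volume for Kafka's data logs
sudo mkfs -t ext4 /dev/xvdf        # first boot only
sudo mkdir -p /opt/kafka/kafka-logs
sudo mount /dev/xvdf /opt/kafka/kafka-logs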

Other settings

Kafka has many, many configuration settings. The run-kafka script gives you a convenient way to set just a few of the most important ones, and especially those that may differ from environment to environment. To set other types of settings, your best bet is to put them into a custom server.properties file and to install that file using the install-kafka module by setting the --config argument. You can find the default server.properties file used by install-kafka here.
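For example, assuming install-kafka exposes its --config argument as a module parameter the same way run-kafka does, you could install a custom file while baking your AMI (the local path is a placeholder):

gruntwork-install \
  --module-name "install-kafka" \
  --repo "https://github.com/gruntwork-io/package-kafka" \
  --tag "v0.0.4" \
  --module-param "config=/tmp/config/kafka/server.properties"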

Please note that the run-kafka script does a simple search and replace using sed to fill in run-time properties, so at run time it will replace existing settings in, or append new settings to, your custom server.properties file.
