Consul Auto-Join with Cloud Metadata
We work in a world of distributed systems which operate in rapidly changing environments. Servers come and go, move across regions and distribution groups, and somehow still need to communicate and connect to one another.
To solve this problem, HashiCorp created Consul, which, among many other things, provides a service registry and service discovery. Application instances register themselves with Consul, and dependent instances query Consul to discover each other. Since Consul itself is a distributed system, this creates a chicken-and-egg problem: how do you bootstrap your service discovery?
» Automation Challenges
How do you discover your service discovery? Traditionally this has been a challenge for distributed systems. The technique often involves spinning up a cluster in one operation and then performing a second operation once the IP addresses are known to join the nodes together. This two-step approach not only makes automation challenging, but also raises questions about the behavior of the system when losing a node. Autoscaling could bring another node online, but an operator would still need to manually join the node to the cluster.
» Consul Auto-Join for EC2
Consul 0.7.1 introduced new functionality which allows it to discover other agents using cloud metadata. This blog post explores leveraging AWS metadata to auto-join and auto-scale a Consul cluster.
The latest documentation for Consul shows new options we can specify in the Consul configuration file or startup parameters.
- -retry-join-ec2-tag-key: The Amazon EC2 instance tag key to filter on. When used with -retry-join-ec2-tag-value, Consul will attempt to join EC2 instances with the given tag key and value on startup.
- -retry-join-ec2-tag-value: The Amazon EC2 instance tag value to filter on.
- -retry-join-ec2-region: (Optional) The Amazon EC2 region to use. If not specified, Consul will use the local instance's EC2 metadata endpoint to discover the region.
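As a rough illustration, the options above could be assembled into an agent command line as follows. This is only a sketch: the helper name build_agent_args and the tag key and value (consul_join, demo-cluster) are made up for this example, and a real agent needs further flags (data directory, server mode, and so on) that are omitted here.

```python
from typing import List, Optional


def build_agent_args(tag_key: str, tag_value: str,
                     region: Optional[str] = None) -> List[str]:
    """Assemble `consul agent` arguments for EC2 tag-based auto-join.

    Hypothetical helper for illustration; only the three -retry-join-ec2-*
    flags come from the Consul documentation quoted above.
    """
    args = [
        "consul", "agent",
        f"-retry-join-ec2-tag-key={tag_key}",
        f"-retry-join-ec2-tag-value={tag_value}",
    ]
    # The region is optional: when it is omitted, Consul discovers it from
    # the local instance's EC2 metadata endpoint.
    if region is not None:
        args.append(f"-retry-join-ec2-region={region}")
    return args


# Assumed tag key/value; any pair works as long as every node shares it.
print(" ".join(build_agent_args("consul_join", "demo-cluster")))
print(" ".join(build_agent_args("consul_join", "demo-cluster", "eu-west-1")))
```

Leaving the region out is usually the simpler choice on EC2, since the same configuration can then be reused unchanged in every region.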
The new feature requires permission to read the AWS instance state, and there are a variety of options available to grant these permissions.
- Static credentials (from the config file)
- Environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials or the path specified by AWS_SHARED_CREDENTIALS_FILE)
- ECS task role metadata (container-specific)
- EC2 instance role metadata
The startup process for the AWS instance is as follows:
- The instance bootstraps and installs Consul
- The init system starts Consul with the configuration to join via EC2 metadata
- On start, Consul calls the EC2 API (ec2:DescribeInstances) to list instances and their tags
- Consul extracts the private IP addresses of other EC2 instances which have the configured tag key and tag value
- Consul runs consul join on those private IP addresses
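The tag-filtering step in the list above can be sketched in Python. This is not Consul's own code (Consul is written in Go); it is a minimal illustration that assumes a response shaped like the EC2 DescribeInstances result (Reservations, Instances, Tags, PrivateIpAddress). The function name, the sample data, and the tag key and value are hypothetical.

```python
from typing import Any, Dict, List


def private_ips_for_tag(response: Dict[str, Any],
                        tag_key: str, tag_value: str) -> List[str]:
    """Return private IPs of instances carrying the given tag key and value."""
    ips: List[str] = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # EC2 reports tags as a list of {"Key": ..., "Value": ...} pairs.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            ip = instance.get("PrivateIpAddress")
            if tags.get(tag_key) == tag_value and ip:
                ips.append(ip)
    return ips


# Hand-written sample in the DescribeInstances shape (not real API output).
sample_response = {
    "Reservations": [
        {"Instances": [
            {"PrivateIpAddress": "10.1.1.241",
             "Tags": [{"Key": "consul_join", "Value": "demo-cluster"}]},
            {"PrivateIpAddress": "10.1.2.24",
             "Tags": [{"Key": "consul_join", "Value": "demo-cluster"}]},
        ]},
        {"Instances": [
            # Different tag value: must not be joined.
            {"PrivateIpAddress": "10.1.9.9",
             "Tags": [{"Key": "consul_join", "Value": "other-cluster"}]},
            # No tags at all: must not be joined.
            {"PrivateIpAddress": "10.1.9.10"},
        ]},
    ]
}

print(private_ips_for_tag(sample_response, "consul_join", "demo-cluster"))
```

The resulting list corresponds to the addresses that would be passed to consul join in the final step.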
The method we are using in this example is the EC2 instance role metadata. By assigning the ec2:DescribeInstances permission to the instance's IAM role, we can give Consul this permission without granting any other control over your AWS account.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
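As a quick sanity check on the policy above, a short script can confirm that it allows nothing beyond ec2:DescribeInstances. The helper name allowed_actions is hypothetical, and the sketch only inspects Allow statements in the JSON document; it does not evaluate the policy the way IAM itself would.

```python
import json
from typing import Set

# The policy document from this post, embedded as a string.
POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
"""


def allowed_actions(policy_json: str) -> Set[str]:
    """Collect every action named in an Allow statement of the policy."""
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    # IAM accepts either a single statement object or a list of them.
    if isinstance(statements, dict):
        statements = [statements]
    actions: Set[str] = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        # "Action" may be a single string or a list of strings.
        if isinstance(action, str):
            action = [action]
        actions.update(action)
    return actions


print(allowed_actions(POLICY))
```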
» Auto-Joining in Action
The repository at https://github.com/hashicorp/consul-ec2-auto-join-example includes a Terraform configuration to demonstrate this functionality. To start and bootstrap the cluster, modify the file terraform.tfvars to add your AWS credentials and default region, and then run terraform plan and terraform apply to create the cluster.
aws_region = "eu-west-1"
aws_access_key = "[AWS_ACCESS_KEY]"
aws_secret_key = "[AWS_SECRET]"
Once this is all up and running, you will see some output from Terraform showing the IP addresses of the created agents and servers.
Outputs:
clients = [
34.253.136.132,
34.252.238.49
]
servers = [
34.251.206.78,
34.249.242.227,
34.253.133.165
]
After provisioning, you can log in to one of the client nodes via SSH using an IP address from the Terraform output.
$ ssh [email protected]
The cluster should be auto-joined, since the instances share the same auto-join tag value.
Running the consul members command will show all members of the cluster and their status (both clients and servers).
$ consul members
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
This cluster automatically bootstrapped with no human intervention, but what about failure scenarios?
Without the auto-join functionality, scaling Consul servers can be challenging and often involves operator participation. With the new auto-join functionality, scaling up or down is easy: we do not have to do anything to join the new nodes. To demonstrate this, edit the terraform.tfvars file, increase the number of server instances to 5, and re-run terraform plan and terraform apply.
$ terraform plan
Plan: 2 to add, 0 to change, 0 to destroy.
$ terraform apply
Apply complete! Resources: 2 added, 0 changed, 0 destroyed.
The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.
State path: terraform.tfstate
Outputs:
clients = [
34.253.136.132,
34.252.238.49
]
servers = [
34.251.206.78,
34.249.242.227,
34.253.133.165,
34.252.132.0,
34.253.148.148
]
Run consul members again after the new servers have finished provisioning. It might take a few seconds for the new servers to join the cluster, but they will then appear in the member list:
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   alive   server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   alive   server  0.7.5  2         dc1
The same applies when scaling down: there is no need to manually remove nodes, so long as we stay above the originally configured minimum number of servers (3 in this example). To demonstrate this functionality, decrease the number of servers in the terraform.tfvars file and run terraform plan and terraform apply again. The deprovisioned server nodes will show in the members list as failed, but the cluster will remain fully operational.
Node                  Address          Status  Type    Build  Protocol  DC
consul-blog-client-0  10.1.1.189:8301  alive   client  0.7.5  2         dc1
consul-blog-client-1  10.1.2.187:8301  alive   client  0.7.5  2         dc1
consul-blog-server-0  10.1.1.241:8301  alive   server  0.7.5  2         dc1
consul-blog-server-1  10.1.2.24:8301   alive   server  0.7.5  2         dc1
consul-blog-server-2  10.1.1.26:8301   alive   server  0.7.5  2         dc1
consul-blog-server-3  10.1.2.44:8301   failed  server  0.7.5  2         dc1
consul-blog-server-4  10.1.1.75:8301   failed  server  0.7.5  2         dc1
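To see why the cluster stays operational here, it can help to count the servers in the member list. The sketch below parses the output shown above and compares the number of alive servers against a majority quorum of (N / 2) + 1. It assumes that the failed servers still count toward the server total, and the function names are hypothetical; this is a rough check, not how Consul itself tracks Raft peers.

```python
from typing import Dict, List, Tuple

# The member list from this post, after scaling back down.
MEMBERS_OUTPUT = """\
Node Address Status Type Build Protocol DC
consul-blog-client-0 10.1.1.189:8301 alive client 0.7.5 2 dc1
consul-blog-client-1 10.1.2.187:8301 alive client 0.7.5 2 dc1
consul-blog-server-0 10.1.1.241:8301 alive server 0.7.5 2 dc1
consul-blog-server-1 10.1.2.24:8301 alive server 0.7.5 2 dc1
consul-blog-server-2 10.1.1.26:8301 alive server 0.7.5 2 dc1
consul-blog-server-3 10.1.2.44:8301 failed server 0.7.5 2 dc1
consul-blog-server-4 10.1.1.75:8301 failed server 0.7.5 2 dc1
"""


def parse_members(output: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated `consul members` output into row dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def server_quorum_status(members: List[Dict[str, str]]) -> Tuple[int, int, bool]:
    """Return (alive servers, quorum size, whether quorum is met)."""
    servers = [m for m in members if m["Type"] == "server"]
    alive = sum(1 for m in servers if m["Status"] == "alive")
    # Majority quorum; assumes failed servers are still counted as peers.
    quorum = len(servers) // 2 + 1
    return alive, quorum, alive >= quorum


alive, quorum, ok = server_quorum_status(parse_members(MEMBERS_OUTPUT))
print(f"alive servers: {alive}, quorum: {quorum}, quorum met: {ok}")
```

With five known servers the quorum is three, so the three remaining alive servers are just enough; losing one more before the failed entries are cleaned up would not be.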
» Summary
The Consul EC2 auto-join functionality enables seamless bootstrapping and auto-scaling of Consul clusters by leveraging cloud metadata. This post shows the functionality using AWS EC2, but the same functionality is also available for Google Cloud, and Consul's roadmap includes adding support for additional cloud providers in the future. We hope you enjoy this new functionality and look forward to future improvements.