
Unlocking Terraform's S3 Backend: Going DynamoDB-Free #599

Open
ravinitp opened this issue Sep 27, 2023 · 18 comments
Assignees
Labels
accepted (This issue has been accepted for implementation.) · enhancement (New feature or request) · needs-rfc (This issue needs an RFC prior to being accepted or, if it's accepted, prior to being implemented.)

Comments

@ravinitp

Summary

This RFC proposes a significant enhancement to Terraform's S3 backend configuration. The objective is to provide a DynamoDB-free alternative for state-file locking, making infrastructure management more flexible and cost-efficient.

Configuration Update: Modify the Terraform s3 backend to include a new lock_storage_type option in the backend configuration block. Users can choose between two values:

  1. "DynamoDB" (the default): The traditional DynamoDB-based state locking mechanism.
  2. "S3Bucket": The new DynamoDB-free alternative for state locking.

Problem Statement

DynamoDB has been Terraform's go-to solution for state locking, and it has served its purpose well. However, there are a few reasons why some users seek alternatives:

  1. Cost Implications
    DynamoDB can incur additional costs, especially for users with substantial workloads. These costs might not always align with the organization’s budget constraints.

  2. Complex Setup
    Setting up DynamoDB tables, managing permissions, and ensuring high availability can be complex and time-consuming.

  3. External Dependency
    DynamoDB introduces an external dependency into your Terraform workflow, making it harder to manage everything within your infrastructure code.

For Terraform users who prefer simplicity and cost-efficiency, the dependence on DynamoDB for state locking has been a recurring challenge. DynamoDB adds operational overhead and incurs additional costs, making it less attractive for smaller projects or cost-conscious organisations.

User-facing description

To enable this, users simply set lock_storage_type = "S3Bucket" in their Terraform configuration. State locking then uses the S3 bucket itself, with no DynamoDB table required.
Compatibility: this option must work with AWS S3 buckets and remain backward compatible with existing configurations.

How to Use
I have added a dedicated section explaining the new lock_storage_type option and providing usage examples.
Example:

terraform {
  backend "s3" {
    bucket                      = "terraform-backend-ravi"
    region                      = "ap-south-1"
    skip_region_validation      = true
    skip_credentials_validation = true
    force_path_style            = true
    key                         = "terraform.tfstate"
    lock_storage_type           = "S3Bucket"
    access_key                  = "<access_key>"
    secret_key                  = "<secret_key>"
  }
}

Technical Description

I have done a PoC here: opentofu/pull/595

Rationale and alternatives

The DynamoDB-free state locking customisation brings several benefits:

Cost Savings
By eliminating DynamoDB, users can significantly reduce their infrastructure costs, making Terraform more budget-friendly.

Simplified Setup
Say goodbye to the complexities of setting up and managing DynamoDB tables. The new configuration is straightforward and easy to implement.

Reduced External Dependencies
With this customisation, users reduce external dependencies in their Terraform setup, allowing for a more self-contained infrastructure-as-code workflow.

Improved Flexibility
Users gain more flexibility in choosing the best state locking solution for their specific use case.

Why is this solution better than alternative solutions to this problem, if there are any?
Terraform's GCS backend uses the same mechanism: lock files stored in the storage bucket itself.

Downsides

None identified.

Unresolved Questions

NA

Related Issues

NA

Proof of Concept

#595

@ravinitp added the pending-decision and rfc labels (Sep 27, 2023)
@cube2222 (Contributor)

Hey, thanks for creating the issue @ravinitp!

Right now the PoC here does a separate read and write, which is prone to a data race (write skew).

E.g. GCS has conditional writes, so the implementation in OpenTofu uses that for creating lockfiles.

Based on some googling, it seems like S3 doesn't support this. Second comment here, also underlying article.

@yaronya added the enhancement and frozen labels (Sep 27, 2023)
@ravinitp (Author)

Hi @cube2222
I have written https://github.com/ravinitp/s3-object-lock-demo.
This proof of concept (PoC) demonstrates object locking in AWS S3 using a minimalistic Go tool. It allows users to acquire and release locks on S3 objects, using versioning-enabled buckets for safe, coordinated access. Please review it.

The locking mechanism in this project follows these steps:

  1. Copy the object from <OBJECT_KEY> to <OBJECT_KEY>.lock to create a locked version of the object. Take note of the version of the locked object.

  2. Delete the original object <OBJECT_KEY>.

  3. Check the version of <OBJECT_KEY>.lock. If it matches the version recorded in step 1, the lock is granted; otherwise, someone else has acquired the lock. This design guarantees the elimination of race conditions when acquiring the lock.
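The three steps above can be sketched as a small, self-contained simulation. This is an illustrative Python model (the actual PoC is written in Go): FakeVersionedBucket and try_acquire_lock are hypothetical names, and the in-memory bucket assumes each individual request is atomic and strongly consistent, which is precisely the property questioned later in this thread for S3-compatible stores.

```python
import itertools
import threading


class FakeVersionedBucket:
    """In-memory stand-in for a versioning-enabled S3 bucket (illustration only)."""

    def __init__(self):
        self._objects = {}               # key -> (version_id, body); latest version only
        self._versions = itertools.count(1)
        self._mutex = threading.Lock()   # models per-request atomicity on the server side

    def put_object(self, key, body):
        with self._mutex:
            version = f"v{next(self._versions)}"
            self._objects[key] = (version, body)
            return version

    def copy_object(self, src, dst):
        with self._mutex:
            if src not in self._objects:
                raise KeyError(src)      # source already deleted by a competing worker
            _, body = self._objects[src]
            version = f"v{next(self._versions)}"
            self._objects[dst] = (version, body)
            return version

    def delete_object(self, key):
        with self._mutex:
            self._objects.pop(key, None)

    def latest_version(self, key):
        with self._mutex:
            return self._objects[key][0]


def try_acquire_lock(bucket, key):
    """The copy/delete/verify sequence described in the steps above."""
    try:
        my_version = bucket.copy_object(key, key + ".lock")    # step 1: copy, note version
    except KeyError:
        return False                                           # source gone: lost the race
    bucket.delete_object(key)                                  # step 2: delete the original
    return bucket.latest_version(key + ".lock") == my_version  # step 3: verify the version
```

Because every worker deletes the source object before verifying, the copy window closes as soon as any worker reaches step 2, so at most one worker can observe its own version as the latest.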

@yaronya yaronya assigned cube2222 and unassigned yaronya Sep 30, 2023
@eranelbaz eranelbaz assigned marcinwyszynski and unassigned cube2222 Oct 1, 2023
@ziggythehamster

For anyone else that has been around AWS for a long time, I thought I should point out that S3 now has read-after-write consistency instead of eventual consistency. The S3+DynamoDB combination in Terraform was created when S3 had eventual consistency, where the above solution would not work.

@cam72cam (Member)

cam72cam commented Mar 29, 2024

I've closed the PoC PR as it does not adequately address the consistency concerns and it is very out of date.

Although this may be possible with the current AWS S3 implementation, we also need to consider what happens when people try to use this feature with S3-compatible services.

@cube2222 (Contributor)

I'd be curious to hear from people who are upvoting this: what is your main use-case here? If you don't want to write a comment, feel free to react to mine with the relevant emoji.

🎉 Want to use this with AWS S3, for any of the benefits listed above (simplicity, cost-savings, etc.)

🚀 Want to use it with s3-compatible state storage elsewhere, due to benefits like getting state locking which is now not possible in such a scenario (as you don't have dynamodb elsewhere).

The reason I'm asking is that I'm worried most people care about the latter, not the former. The problem with the latter is that the correctness of this state-locking approach is heavily dependent on the consistency guarantees of your object storage backend. If those guarantees aren't good enough, then the locking will be broken for you, in a way that's non-deterministic and extremely hard to debug. Overall, I'm worried about the size of footgun that we'd be introducing here.

@ravinitp (Author)

Hi @cube2222 ,

Thank you for your comment and the thoughtful considerations regarding the use cases and potential challenges.

I am the one who created this RFC, and I wanted to address your concerns directly. Through my testing, I have implemented a mechanism that ensures consistency in the locking process, which mitigates the risks associated with the consistency guarantees of the S3 backend.

If the OpenTofu core team prefers not to modify the existing S3 backend, I propose creating a separate backend that supports using S3 for state storage along with an integrated lock mechanism. I am fully committed to developing this solution end-to-end.

If others in the community support this idea, please react to this comment.

@cube2222 (Contributor)

cube2222 commented May 20, 2024

I haven't read your PoC in-depth, but your lock function relies on DeleteObject being strongly consistent. If two processes can delete a single object before the delete goes through, then they can both get the lock. In general, if your s3-compatible object storage is eventually-consistent, all bets are off.

In the case of S3 itself, it seems like it supports strong consistency for deletes, at least based on "S3 delivers strong read-after-write consistency for any storage request", even though it's not explicitly listed in:

After a successful write of a new object or an overwrite of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with all changes reflected.

so it would definitely have to be double-checked with AWS support.

Either way, I've left my comment to better understand other voters' use-cases for this. We're not planning to accept big changes to state backends before we decide on #382, so this is purely exploratory.

@ravinitp (Author)

ravinitp commented May 22, 2024

Hi @cube2222
Thank you for your insights and concerns.

Please review the entire lock function implementation. It performs the following actions:

  1. Creates a copy of the state file and retrieves the latest version ID of the backup (main.go#62).
  2. Deletes the state file (main.go#73).
  3. Reads the backup file (main.go#83) and matches its version ID against the one obtained in step 1 (main.go#89).
  4. If the version IDs match, the lock is acquired by the current user; otherwise, someone else has acquired the lock.

I also recommend reviewing the main_test.go file. This lock mechanism has been tested with 100 concurrent threads and has demonstrated consistent performance.

I understand the concerns regarding eventual consistency in S3-compatible storage solutions. However, based on AWS documentation, S3 provides strong read-after-write consistency for all storage requests, including delete operations. Nonetheless, double-checking with AWS support for absolute clarity is a prudent step.

Also, I want to get it implemented till my last breath. Please prove me wrong if you see any inconsistency in the lock mechanism.

Your feedback is invaluable as we explore potential solutions.

Note: the above solution only works with a versioning-enabled AWS S3 bucket.

@cam72cam (Member)

S3 provides strong read-after-write consistency

That is the AWS implementation, not the protocol itself that other providers implement. Many other providers use the S3 API to wrap other, less consistent storage options.

Note: the above solution only works with a versioning-enabled AWS S3 bucket.

Not everyone uses a versioned bucket (although it is highly recommended).

I like the idea of giving users additional flexibility, but want to make sure that anything related to consistency and safety is thoroughly thought out and documented.

Also, I want to get it implemented till my last breath.

I truly appreciate the enthusiasm :)

We are having discussions on if/when #382 will be implemented as it would allow cloud providers and organizations to tailor backends to their specific storage solution and consistency needs.

@stevehipwell

This should now be possible given the announcement that S3 now supports conditional writes.

@cam72cam (Member)

The interesting component will be if/how we support this with other S3-compatible storage solutions, though it will probably be worth adding support and letting folks opt-in if they choose.

@cam72cam cam72cam assigned ollevche and unassigned marcinwyszynski Aug 22, 2024
@cam72cam added the needs-rfc label and removed the rfc and frozen labels (Aug 22, 2024)
@cam72cam (Member)

Assigning @ollevche to document the options currently available to us in the S3 API. This will eventually be turned into an RFC that the community can weigh in on. If we implement this, it will be an opt-in feature.

@skyzyx

skyzyx commented Aug 29, 2024

@cam72cam said:

The interesting component will be if/how we support this with other S3-compatible storage solutions, though it will probably be worth adding support and letting folks opt-in if they choose.

IF OpenTofu were to move forward with this, there are a few things I'd be thinking about:

  1. "Support" does not mean that it works. Support refers to who is on-the-hook when something goes wrong. OpenTofu could choose to support Amazon S3, but not other S3-compatible APIs.

    A user, however, could choose to support themselves with S3-compatible backends. It would be helpful for OpenTofu (the subject-matter experts) to explain in plain English what end-users should look for when validating whether their preferred storage backend provides the necessary functionality (e.g., conditional writes, strong read-after-write consistency).

  2. A possible option, if cost/resource-effective, could be to write some automated smoke tests against a smattering of popular S3-compatible services. If the tests pass, the storage backend provides the guarantees. If the tests fail, the storage backend does not. Perhaps a Markdown document could be updated with the results and used as a reference.

    This could be provided as a guide to less risk-averse users, irrespective of how OpenTofu chooses to define "support" for these backends.

  3. As a one-man startup leveraging multiple cloud service providers, with budget alerts configured, I find that the DynamoDB locking barely registers as a blip in cost.

Like @cube2222 said earlier this year:

The reason I'm asking is that I'm worried most people care about [wanting to use it with S3-compatible state storage elsewhere], not [using this with the real Amazon S3]. The problem with [S3-compatible state storage] is that the correctness of this state-locking approach is heavily dependent on the consistency guarantees of your object storage backend. If those guarantees aren't good enough, then the locking will be broken for you, in a way that's non-deterministic and extremely hard to debug. Overall, I'm worried about the size of footgun that we'd be introducing here.

IMO, this is the salient point. If you're using AWS for realsies, then the cost benefits seem negligible to me. If you're using an S3-compatible backend, it is far less likely to have the guarantees that make this useful.

IF there is an interest in moving forward, I'd be curious about replacing DynamoDB with another OSS backend that has strong consistency guarantees that may NOT be an S3-compatible storage backend. Like Redis or SQLite or something (not that these are ideal solutions; just examples).

@stevehipwell

@skyzyx I generally agree with your points above, but you've missed what I think is the real driving factor here: simplicity and alignment. Knowing that I'm using AWS S3, I don't want the cognitive load of having to consider DynamoDB, even if cost isn't a factor. I also want AWS S3 to be aligned with the other object-storage providers, where no additional systems are required.

There are a number of options to make misconfiguration less likely, and it should be trivial to actually test a backend, either via a new command or as part of the current usage pattern.

Anecdotally, we nearly had to use Azure Storage for our backends based on the perceived benefit of not having to run DynamoDB, despite our SMEs preferring AWS S3.

@b-milescu

Can we take a look at this? hashicorp/terraform#35661

@abstractionfactory (Contributor)

@b-milescu I raised this with the core team for discussion, thank you for pointing this out. We won't take a look at the HashiCorp PR directly, but we'll consider if adding feature-parity is something we should do.

@stevehipwell

Just to reinforce my comment above: we've been undertaking some TF maintenance where end-users have been required to manipulate their TF state. These are competent engineers who use TF day in, day out, and I was shocked at the number of cases where DynamoDB was missed, resulting in significant organisational overhead. IMHO, an S3-only state system would have been much simpler for them to work with and reason about.

@cam72cam added the accepted label and removed the pending-decision label (Nov 5, 2024)
@pdecat (Contributor)

pdecat commented Nov 26, 2024

Probably useful here:

Amazon S3 adds new functionality for conditional writes
Posted on: Nov 25, 2024

Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it. This helps you coordinate simultaneous writes to the same object and prevents multiple concurrent writers from unintentionally overwriting the object without knowing the state of its content. You can use this capability by providing the ETag of an object using S3 PutObject or CompleteMultipartUpload API requests in both S3 general purpose and directory buckets.

https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/

Also interesting, and maybe to recommend during the setup of the S3 bucket storing the states:

Amazon S3 now supports enforcement of conditional write operations for S3 general purpose buckets
Posted on: Nov 25, 2024

Amazon S3 now supports enforcement of conditional write operations for S3 general purpose buckets using bucket policies. With enforcement of conditional writes, you can now mandate that S3 check the existence of an object before creating it in your bucket. Similarly, you can also mandate that S3 check the state of the object’s content before updating it in your bucket. This helps you to simplify distributed applications by preventing unintentional data overwrites, especially in high-concurrency, multi-writer scenarios.

https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-enforcement-conditional-write-operations-general-purpose-buckets/
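As a rough illustration of how these announcements could translate into DynamoDB-free locking: a PutObject with If-None-Match: * fails with HTTP 412 (PreconditionFailed) when the object already exists, so at most one writer can create the lock object. The Python sketch below is hypothetical (try_acquire_lock, release_lock, and the in-memory FakeS3 stand-in are invented names); a real implementation would use a boto3 client, which accepts an IfNoneMatch argument to put_object, and would catch botocore.exceptions.ClientError instead of the stand-in exception.

```python
import json


class PreconditionFailed(Exception):
    """Stand-in for botocore's ClientError with an HTTP 412 / PreconditionFailed code."""


class FakeS3:
    """Minimal in-memory client honoring the If-None-Match precondition (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body, IfNoneMatch=None):
        if IfNoneMatch == "*" and (Bucket, Key) in self._objects:
            raise PreconditionFailed(Key)   # object already exists: reject the write
        self._objects[(Bucket, Key)] = Body

    def delete_object(self, Bucket, Key):
        self._objects.pop((Bucket, Key), None)


def try_acquire_lock(s3, bucket, key, holder):
    """Create <key>.tflock only if it does not exist; at most one caller can succeed."""
    try:
        s3.put_object(Bucket=bucket, Key=key + ".tflock",
                      Body=json.dumps({"holder": holder}).encode(),
                      IfNoneMatch="*")      # conditional write: create-if-absent
        return True
    except PreconditionFailed:
        return False                        # someone else already holds the lock


def release_lock(s3, bucket, key):
    """Delete the lock object so the next try_acquire_lock can succeed."""
    s3.delete_object(Bucket=bucket, Key=key + ".tflock")
```

Unlike the earlier copy/delete/verify scheme, this needs no bucket versioning, though it still relies on the backend actually enforcing the precondition, which is exactly what the bucket-policy enforcement announcement above would let operators mandate.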
