
Unlocking Terraform's S3 Backend: Going DynamoDB-Free #599

Open
ravinitp opened this issue Sep 27, 2023 · 18 comments
Assignees
Labels
accepted (This issue has been accepted for implementation.) · enhancement (New feature or request) · needs-rfc (This issue needs an RFC prior to being accepted or, if it's accepted, prior to being implemented.)

Comments

@ravinitp

Summary

This RFC proposes a significant enhancement to Terraform's S3 backend configuration. The objective is to provide a DynamoDB-free alternative for state-file locking, making infrastructure management more flexible and cost-efficient.

Configuration Update: Modify the Terraform s3 backend to include a new lock_storage_type option in the backend configuration block. Users can choose between two values:

  1. "DynamoDB" (the default): The traditional DynamoDB-based state locking mechanism.
  2. "S3Bucket": The new DynamoDB-free alternative for state locking.

Problem Statement

DynamoDB has been Terraform's go-to solution for state locking, and it has served its purpose well. However, there are a few reasons why some users seek alternatives:

  1. Cost Implications
    DynamoDB can incur additional costs, especially for users with substantial workloads. These costs might not always align with the organization’s budget constraints.

  2. Complex Setup
    Setting up DynamoDB tables, managing permissions, and ensuring high availability can be complex and time-consuming.

  3. External Dependency
    DynamoDB introduces an external dependency into your Terraform workflow, making it harder to manage everything within your infrastructure code.

For Terraform users who prefer simplicity and cost-efficiency, the dependence on DynamoDB for state locking has been a recurring challenge. DynamoDB adds operational overhead and incurs additional costs, making it less attractive for smaller projects or cost-conscious organisations.

User-facing description

To enable this, users simply set lock_storage_type = "S3Bucket" in their Terraform configuration. State locking then uses the S3 bucket itself, with no DynamoDB table required.
Compatibility: this option must work with AWS S3 buckets and remain backward compatible with existing configurations.

How to Use
I have added a dedicated section explaining the new lock_storage_type option and providing usage examples.
Example:

terraform {
  backend "s3" {
    bucket                      = "terraform-backend-ravi"
    region                      = "ap-south-1"
    skip_region_validation      = true
    skip_credentials_validation = true
    force_path_style            = true
    key                         = "terraform.tfstate"
    lock_storage_type           = "S3Bucket"
    access_key                  = "<access_key>"
    secret_key                  = "<secret_key>"
  }
}

Technical Description

I have done a PoC here: opentofu/pull/595

Rationale and alternatives

The DynamoDB-free state locking customisation brings several benefits:

Cost Savings
By eliminating DynamoDB, users can significantly reduce their infrastructure costs, making Terraform more budget-friendly.

Simplified Setup
Say goodbye to the complexities of setting up and managing DynamoDB tables. The new configuration is straightforward and easy to implement.

Reduced External Dependencies
With this customisation, users reduce external dependencies in their Terraform setup, allowing for a more self-contained infrastructure-as-code workflow.

Improved Flexibility
Users gain more flexibility in choosing the best state locking solution for their specific use case.

Why is this solution better than alternative solutions to this problem, if there are any?
Terraform's GCS backend uses the same mechanism: lock files stored in the storage bucket itself.

Downsides

None identified.

Unresolved Questions

NA

Related Issues

NA

Proof of Concept

#595

@ravinitp added the pending-decision and rfc labels (Sep 27, 2023)
@cube2222 (Contributor)

Hey, thanks for creating the issue @ravinitp!

Right now the PoC here does a separate read and write, which is prone to a data race (write skew).

E.g. GCS has conditional writes, so the implementation in OpenTofu uses that for creating lockfiles.

Based on some googling, it seems like S3 doesn't support this. Second comment here, also underlying article.

@yaronya added the enhancement and frozen labels (Sep 27, 2023)
@ravinitp (Author)

Hi @cube2222
I have written https://github.com/ravinitp/s3-object-lock-demo.
This proof of concept (PoC) demonstrates object locking in AWS S3 using a minimalistic Go tool. It allows users to acquire and release locks on S3 objects, using versioning-enabled buckets for safe, coordinated access. Please review it.

The locking mechanism in this project follows these steps:

  1. Copy the object from <OBJECT_KEY> to <OBJECT_KEY>.lock to create a locked version of the object. Take note of the version of the locked object.

  2. Delete the original object <OBJECT_KEY>.

  3. Check the version of <OBJECT_KEY>.lock. If it matches the version recorded in step 1, the lock is granted; otherwise, someone else has acquired the lock. This design guarantees the elimination of race conditions when acquiring the lock.
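The three steps above can be sketched as a small, self-contained simulation. This is an illustrative Python model (the actual PoC is written in Go): FakeVersionedBucket and try_acquire_lock are hypothetical names, and the in-memory bucket assumes each individual request is atomic and strongly consistent, which is precisely the property questioned later in this thread for S3-compatible stores.

```python
import itertools
import threading


class FakeVersionedBucket:
    """In-memory stand-in for a versioning-enabled S3 bucket (illustration only)."""

    def __init__(self):
        self._objects = {}               # key -> (version_id, body); latest version only
        self._versions = itertools.count(1)
        self._mutex = threading.Lock()   # models per-request atomicity on the server side

    def put_object(self, key, body):
        with self._mutex:
            version = f"v{next(self._versions)}"
            self._objects[key] = (version, body)
            return version

    def copy_object(self, src, dst):
        with self._mutex:
            if src not in self._objects:
                raise KeyError(src)      # source already deleted by a competing worker
            _, body = self._objects[src]
            version = f"v{next(self._versions)}"
            self._objects[dst] = (version, body)
            return version

    def delete_object(self, key):
        with self._mutex:
            self._objects.pop(key, None)

    def latest_version(self, key):
        with self._mutex:
            return self._objects[key][0]


def try_acquire_lock(bucket, key):
    """The copy/delete/verify sequence described in the steps above."""
    try:
        my_version = bucket.copy_object(key, key + ".lock")    # step 1: copy, note version
    except KeyError:
        return False                                           # source gone: lost the race
    bucket.delete_object(key)                                  # step 2: delete the original
    return bucket.latest_version(key + ".lock") == my_version  # step 3: verify the version
```

Because every worker deletes the source object before verifying, the copy window closes as soon as any worker reaches step 2, so at most one worker can observe its own version as the latest.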

@yaronya yaronya assigned cube2222 and unassigned yaronya Sep 30, 2023
@eranelbaz eranelbaz assigned marcinwyszynski and unassigned cube2222 Oct 1, 2023
@ziggythehamster

For anyone else that has been around AWS for a long time, I thought I should point out that S3 now has read-after-write consistency instead of eventual consistency. The S3+DynamoDB combination in Terraform was created when S3 had eventual consistency, where the above solution would not work.

@cam72cam (Member)

cam72cam commented Mar 29, 2024

I've closed the PoC PR as it does not adequately address the consistency concerns and it is very out of date.

Although this may be possible with the current AWS S3 implementation, we also need to consider what happens when people try to use this feature with S3-compatible services.

@cube2222 (Contributor)

I'd be curious to hear from people who are upvoting this: what is your main use-case here? If you don't want to write a comment, feel free to react to mine with the relevant emoji.

🎉 Want to use this with AWS S3, for any of the benefits listed above (simplicity, cost-savings, etc.)

🚀 Want to use it with s3-compatible state storage elsewhere, due to benefits like getting state locking which is now not possible in such a scenario (as you don't have dynamodb elsewhere).

The reason I'm asking is that I'm worried most people care about the latter, not the former. The problem with the latter is that the correctness of this state-locking approach is heavily dependent on the consistency guarantees of your object storage backend. If those guarantees aren't good enough, then the locking will be broken for you, in a way that's non-deterministic and extremely hard to debug. Overall, I'm worried about the size of footgun that we'd be introducing here.

@ravinitp (Author)

Hi @cube2222 ,

Thank you for your comment and the thoughtful considerations regarding the use cases and potential challenges.

I am the one who created this RFC, and I wanted to address your concerns directly. Through my testing, I have implemented a mechanism that ensures consistency in the locking process, which mitigates the risks associated with the consistency guarantees of the S3 backend.

If the OpenTofu core team prefers not to modify the existing S3 backend, I propose creating a separate backend that supports using S3 for state storage along with an integrated lock mechanism. I am fully committed to developing this solution end-to-end.

If others in the community support this idea, please react to this comment.

@cube2222 (Contributor)

cube2222 commented May 20, 2024

I haven't read your PoC in-depth, but your lock function relies on DeleteObject being strongly consistent. If two processes can delete a single object before the delete goes through, then they can both get the lock. In general, if your s3-compatible object storage is eventually-consistent, all bets are off.

In the case of S3 itself, it seems like it supports strong consistency for deletes, at least based on "S3 delivers strong read-after-write consistency for any storage request", even though it's not explicitly listed in:

After a successful write of a new object or an overwrite of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with all changes reflected.

so it would definitely have to be double-checked with AWS support.

Either way, I've left my comment to better understand other voters' use-cases for this. We're not planning to accept big changes to state backends before we decide on #382, so this is purely exploratory.

@ravinitp (Author)

ravinitp commented May 22, 2024

Hi @cube2222
Thank you for your insights and concerns.

Please review the entire lock function implementation. It performs the following actions:

  1. Creates a copy of the state file and retrieves the latest version ID of the backup (main.go#62).
  2. Deletes the state file (main.go#73).
  3. Reads the backup file (main.go#83) and matches its version ID against the one obtained in step 1 (main.go#89).
  4. If the version IDs match, the lock is acquired by the current user; otherwise, someone else has acquired the lock.

I also recommend reviewing the main_test.go file. This lock mechanism has been tested with 100 concurrent threads and has demonstrated consistent performance.

I understand the concerns regarding eventual consistency in S3-compatible storage solutions. However, based on AWS documentation, S3 provides strong read-after-write consistency for all storage requests, including delete operations. Nonetheless, double-checking with AWS support for absolute clarity is a prudent step.

Also, I want to get it implemented till my last breath. Please prove me wrong if you see any inconsistency in the lock mechanism.

Your feedback is invaluable as we explore potential solutions.

Note: the above solution only works with a versioning-enabled AWS S3 bucket.

@cam72cam (Member)

S3 provides strong read-after-write consistency

That is the AWS implementation, not the protocol itself that other providers implement. Many other providers use the S3 API to wrap other, less consistent storage options.

Note: the above solution only works with a versioning-enabled AWS S3 bucket.

Not everyone uses a versioned bucket (although it is highly recommended).

I like the idea of giving users additional flexibility, but want to make sure that anything related to consistency and safety is thoroughly thought out and documented.

Also, I want to get it implemented till my last breath.

I truly appreciate the enthusiasm :)

We are having discussions on if/when #382 will be implemented as it would allow cloud providers and organizations to tailor backends to their specific storage solution and consistency needs.

@stevehipwell

This should now be possible given the announcement that S3 now supports conditional writes.

@cam72cam (Member)

The interesting component will be if/how we support this with other S3-compatible storage solutions, though it will probably be worth adding support and letting folks opt-in if they choose.

@cam72cam cam72cam assigned ollevche and unassigned marcinwyszynski Aug 22, 2024
@cam72cam added the needs-rfc label and removed the rfc and frozen labels (Aug 22, 2024)
@cam72cam (Member)

Assigning @ollevche to document the options currently available to us in the S3 API. This will eventually be turned into an RFC that the community can weigh in on. If we implement this, it will be an opt-in feature.

@skyzyx

skyzyx commented Aug 29, 2024

@cam72cam said:

The interesting component will be if/how we support this with other S3-compatible storage solutions, though it will probably be worth adding support and letting folks opt-in if they choose.

IF OpenTofu were to move forward with this, there are a few things I'd be thinking about:

  1. "Support" does not mean that it works. Support refers to who is on-the-hook when something goes wrong. OpenTofu could choose to support Amazon S3, but not other S3-compatible APIs.

    A user, however, could choose to support themselves with S3-compatible backends. It would be helpful for OpenTofu (the subject-matter experts) to explain in plain English what end-users should look for when validating whether their preferred storage backend provides the necessary functionality (e.g., conditional writes, strong read-after-write consistency).

  2. A possible option, if cost/resource-effective, could be to write some automated smoke tests against a smattering of popular S3-compatible services. If the tests pass, the storage backend provides the guarantees. If the tests fail, the storage backend does not. Perhaps a Markdown document could be updated with the results and used as a reference.

    This could be provided as a guide to less risk-averse users, irrespective of how OpenTofu chooses to define "support" for these backends.

  3. As a one-man startup leveraging multiple cloud service providers, with budget alerts configured, I find that the DynamoDB locking barely registers as a blip in cost.

Like @cube2222 said earlier this year:

The reason I'm asking is that I'm worried most people care about [wanting to use it with S3-compatible state storage elsewhere], not [using this with the real Amazon S3]. The problem with [S3-compatible state storage] is that the correctness of this state-locking approach is heavily dependent on the consistency guarantees of your object storage backend. If those guarantees aren't good enough, then the locking will be broken for you, in a way that's non-deterministic and extremely hard to debug. Overall, I'm worried about the size of footgun that we'd be introducing here.

IMO, this is the salient point. If you're using AWS for realsies, then the cost benefits seem negligible to me. If you're using an S3-compatible backend, it is far less likely to have the guarantees that make this useful.

IF there is an interest in moving forward, I'd be curious about replacing DynamoDB with another OSS backend that has strong consistency guarantees that may NOT be an S3-compatible storage backend. Like Redis or SQLite or something (not that these are ideal solutions; just examples).

@stevehipwell

@skyzyx I generally agree with your points above, but you've missed what I think is the real driving factor here: simplicity and alignment. Knowing that I'm using AWS S3, I don't want the cognitive load of having to consider DynamoDB, even if cost isn't a factor. I also want AWS S3 to be aligned with the other object-storage providers, where no additional systems are required.

There are a number of options to make misconfiguration less likely, and it should be trivial to actually test a backend, either via a new command or as part of the current usage pattern.

Anecdotally, we nearly had to use Azure Storage for our backends based on the perceived benefit of not having to run DynamoDB, despite our SMEs preferring AWS S3.

@b-milescu

Can we take a look at this? hashicorp/terraform#35661

@abstractionfactory (Contributor)

@b-milescu I raised this with the core team for discussion, thank you for pointing this out. We won't take a look at the HashiCorp PR directly, but we'll consider if adding feature-parity is something we should do.

@stevehipwell

Just to reinforce my comment above: we've been undertaking some TF maintenance where end-users have been required to manipulate their TF state. These are competent engineers who use TF day in, day out, and I was shocked at the number of cases where DynamoDB was missed, resulting in significant organisational overhead. IMHO, an S3-only state system would have been much simpler for them to work with and reason about.

@cam72cam added the accepted label and removed the pending-decision label (Nov 5, 2024)
@pdecat (Contributor)

pdecat commented Nov 26, 2024

Probably useful here:

Amazon S3 adds new functionality for conditional writes
Posted on: Nov 25, 2024

Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it. This helps you coordinate simultaneous writes to the same object and prevents multiple concurrent writers from unintentionally overwriting the object without knowing the state of its content. You can use this capability by providing the ETag of an object using S3 PutObject or CompleteMultipartUpload API requests in both S3 general purpose and directory buckets.

https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/

Also interesting, and maybe to recommend during the setup of the S3 bucket storing the states:

Amazon S3 now supports enforcement of conditional write operations for S3 general purpose buckets
Posted on: Nov 25, 2024

Amazon S3 now supports enforcement of conditional write operations for S3 general purpose buckets using bucket policies. With enforcement of conditional writes, you can now mandate that S3 check the existence of an object before creating it in your bucket. Similarly, you can also mandate that S3 check the state of the object’s content before updating it in your bucket. This helps you to simplify distributed applications by preventing unintentional data overwrites, especially in high-concurrency, multi-writer scenarios.

https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-enforcement-conditional-write-operations-general-purpose-buckets/
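As a rough illustration of how these announcements could translate into DynamoDB-free locking: a PutObject with If-None-Match: * fails with HTTP 412 (PreconditionFailed) when the object already exists, so at most one writer can create the lock object. The Python sketch below is hypothetical (try_acquire_lock, release_lock, and the in-memory FakeS3 stand-in are invented names); a real implementation would use a boto3 client, which accepts an IfNoneMatch argument to put_object, and would catch botocore.exceptions.ClientError instead of the stand-in exception.

```python
import json


class PreconditionFailed(Exception):
    """Stand-in for botocore's ClientError with an HTTP 412 / PreconditionFailed code."""


class FakeS3:
    """Minimal in-memory client honoring the If-None-Match precondition (illustration only)."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body, IfNoneMatch=None):
        if IfNoneMatch == "*" and (Bucket, Key) in self._objects:
            raise PreconditionFailed(Key)   # object already exists: reject the write
        self._objects[(Bucket, Key)] = Body

    def delete_object(self, Bucket, Key):
        self._objects.pop((Bucket, Key), None)


def try_acquire_lock(s3, bucket, key, holder):
    """Create <key>.tflock only if it does not exist; at most one caller can succeed."""
    try:
        s3.put_object(Bucket=bucket, Key=key + ".tflock",
                      Body=json.dumps({"holder": holder}).encode(),
                      IfNoneMatch="*")      # conditional write: create-if-absent
        return True
    except PreconditionFailed:
        return False                        # someone else already holds the lock


def release_lock(s3, bucket, key):
    """Delete the lock object so the next try_acquire_lock can succeed."""
    s3.delete_object(Bucket=bucket, Key=key + ".tflock")
```

Unlike the earlier copy/delete/verify scheme, this needs no bucket versioning, though it still relies on the backend actually enforcing the precondition, which is exactly what the bucket-policy enforcement announcement above would let operators mandate.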
