Unlocking Terraform's S3 Backend: Going DynamoDB-Free #599
Comments
Hey, thanks for creating the issue @ravinitp! Right now the PoC here does a separate read and write, which is prone to a data race (write skew). For comparison, GCS has conditional writes, so the implementation in OpenTofu uses them for creating lockfiles. Based on some googling, it seems like S3 doesn't support this. Second comment here, also underlying article. |
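To make the write-skew concern concrete, here is a minimal sketch that replays the problematic interleaving against a toy in-memory store. It is not the PoC's code and does not call any S3 API: `memStore`, `racyAcquire`, and `conditionalAcquire` are hypothetical names, and `PutIfAbsent` merely stands in for whatever create-if-absent primitive a backend might offer.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is a hypothetical in-memory stand-in for an object store.
type memStore struct {
	mu      sync.Mutex
	objects map[string]string
}

func newMemStore() *memStore {
	return &memStore{objects: make(map[string]string)}
}

// Exists is a plain read: it reports whether key is present.
func (s *memStore) Exists(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.objects[key]
	return ok
}

// Put writes unconditionally, overwriting any existing object.
func (s *memStore) Put(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[key] = value
}

// PutIfAbsent writes only if key does not exist yet.
// The existence check and the write happen as one atomic step.
func (s *memStore) PutIfAbsent(key, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.objects[key]; ok {
		return false
	}
	s.objects[key] = value
	return true
}

// racyAcquire replays the bad interleaving: both clients read
// before either one writes, so both conclude the lock is free.
func racyAcquire(s *memStore, key string) (aHolds, bHolds bool) {
	aSeesFree := !s.Exists(key)
	bSeesFree := !s.Exists(key)
	if aSeesFree {
		s.Put(key, "client-a")
		aHolds = true
	}
	if bSeesFree {
		s.Put(key, "client-b")
		bHolds = true
	}
	return aHolds, bHolds
}

// conditionalAcquire has both clients attempt an atomic create instead.
func conditionalAcquire(s *memStore, key string) (aHolds, bHolds bool) {
	aHolds = s.PutIfAbsent(key, "client-a")
	bHolds = s.PutIfAbsent(key, "client-b")
	return aHolds, bHolds
}

func main() {
	a, b := racyAcquire(newMemStore(), "state.tflock")
	fmt.Println("read-then-write:", a, b)
	a, b = conditionalAcquire(newMemStore(), "state.tflock")
	fmt.Println("conditional write:", a, b)
}
```

With the separate read and write, both clients end up believing they hold the lock; with the atomic create, only the first one does. The sketch says nothing about what a real bucket guarantees, only why the primitive matters.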
Hi @cube2222 The locking mechanism in this project follows these steps:
|
For anyone else that has been around AWS for a long time, I thought I should point out that S3 now has read-after-write consistency instead of eventual consistency. The S3+DynamoDB combination in Terraform was created when S3 had eventual consistency, where the above solution would not work. |
I've closed the PoC PR as it does not adequately address the consistency concerns and it is very out of date. Although this may be possible with the current AWS S3 implementation, we also need to consider what happens when people try to use this feature with S3-compatible services. |
I'd be curious to hear, from people who are upvoting this, what your main use-case is here. If you don't want to write a comment, feel free to react to mine with the relevant emoji.
🎉 Want to use this with AWS S3, for any of the benefits listed above (simplicity, cost-savings, etc.)
🚀 Want to use it with S3-compatible state storage elsewhere, to get state locking, which is currently not possible in such a scenario (as you don't have DynamoDB elsewhere).
The reason I'm asking is that I'm worried most people care about the latter, not the former. The problem with the latter is that the correctness of this state-locking approach is heavily dependent on the consistency guarantees of your object storage backend. If those guarantees aren't good enough, then the locking will be broken for you, in a way that's non-deterministic and extremely hard to debug. Overall, I'm worried about the size of the footgun that we'd be introducing here. |
Hi @cube2222, thank you for your comment and the thoughtful considerations regarding the use cases and potential challenges. I am the one who created this RFC, and I wanted to address your concerns directly. Through my testing, I have implemented a mechanism that ensures consistency in the locking mechanism, which mitigates the risks associated with the consistency guarantees of the S3 backend. If the OpenTofu core team prefers not to modify the existing S3 backend, I propose creating a separate backend that supports using S3 for state storage along with an integrated lock mechanism. I am fully committed to developing this solution end-to-end. If others in the community support this idea, please react to this comment. |
I haven't read your PoC in-depth, but your In the case of S3 itself it seems like it supports strong consistency for deletes, at least based on
so it would definitely have to be double-checked with AWS support. Either way, I've left my comment to better understand other voters' use-cases for this. We're not planning to accept big changes to state backends before we decide on #382, so this is purely exploratory. |
Hi @cube2222, I request you to review the entire lock function implementation. It performs the following actions: Creates a copy of the state file and retrieves the latest version ID of the backup (main.go#62). I also recommend reviewing the main_test.go file. This lock mechanism has been tested with 100 concurrent threads and has demonstrated consistent performance. I understand the concerns regarding eventual consistency in S3-compatible storage solutions. However, based on AWS documentation, S3 provides strong read-after-write consistency for all storage requests, including delete operations. Nonetheless, double-checking with AWS support for absolute clarity is a prudent step. Also, I want to get it implemented till my last breath. Please prove me wrong if you see any inconsistency in the lock mechanism. Your feedback is invaluable as we explore potential solutions. Note: the above solution only works with a versioning-enabled AWS S3 bucket. |
That is the AWS implementation, not the protocol itself that other providers implement. Many other providers use the S3 API to wrap other, less consistent storage options.
Not everyone uses a versioned bucket (although it is highly recommended). I like the idea of giving users additional flexibility, but want to make sure that anything related to consistency and safety is thoroughly thought out and documented.
I truly appreciate the enthusiasm :) We are having discussions on if/when #382 will be implemented as it would allow cloud providers and organizations to tailor backends to their specific storage solution and consistency needs. |
This should now be possible given the announcement that S3 now supports conditional writes. |
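As a rough illustration of how a lock built on a create-if-absent primitive could behave under contention, here is a sketch using an in-memory model rather than a real bucket. Everything in it is an assumption made for illustration: `lockStore`, `ErrLocked`, `contend`, and the lock key are hypothetical, no AWS SDK call is used, and the holder check in `Unlock` is modelled as atomic, which a real backend may or may not provide.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrLocked is returned when the lock object already exists.
var ErrLocked = errors.New("state is locked by another holder")

// lockStore is a hypothetical in-memory model of a bucket that
// supports create-if-absent semantics for lock objects.
type lockStore struct {
	mu    sync.Mutex
	locks map[string]string // lock key -> holder ID
}

func newLockStore() *lockStore {
	return &lockStore{locks: make(map[string]string)}
}

// Lock creates the lock object only if none exists for key.
func (s *lockStore) Lock(key, holder string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.locks[key]; held {
		return ErrLocked
	}
	s.locks[key] = holder
	return nil
}

// Unlock removes the lock only if it is held by the given holder.
func (s *lockStore) Unlock(key, holder string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current, held := s.locks[key]
	if !held || current != holder {
		return errors.New("lock is not held by " + holder)
	}
	delete(s.locks, key)
	return nil
}

// contend starts n goroutines that race for the same lock and
// returns the IDs of those that acquired it.
func contend(s *lockStore, key string, n int) []string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		winners []string
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if s.Lock(key, id) == nil {
				mu.Lock()
				winners = append(winners, id)
				mu.Unlock()
			}
		}(fmt.Sprintf("client-%d", i))
	}
	wg.Wait()
	return winners
}

func main() {
	s := newLockStore()
	winners := contend(s, "env/prod.tflock", 100)
	fmt.Println("clients holding the lock:", len(winners))

	// Only the recorded holder can release the lock.
	fmt.Println("foreign unlock rejected:", s.Unlock("env/prod.tflock", "intruder") != nil)
	fmt.Println("holder unlock ok:", s.Unlock("env/prod.tflock", winners[0]) == nil)
}
```

The point of the model is the invariant, not the storage: however the 100 attempts interleave, at most one create can succeed. Whether a given S3-compatible service upholds that invariant is exactly the open question in this thread.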
The interesting component will be if/how we support this with other S3-compatible storage solutions, though it will probably be worth adding support and letting folks opt-in if they choose. |
Assigning @ollevche to document the options currently available to us in the S3 API. This will eventually be turned into an RFC that the community can weigh in on. If we implement this, it will be an opt-in feature. |
@cam72cam said:
IF OpenTofu were to move forward with this, there are a few things I'd be thinking about:
Like @cube2222 said earlier this year:
IMO, this is the salient point. If you're using AWS for realsies, then the cost benefits seem negligible to me. If you're using an S3-compatible backend, it is far less likely to have the guarantees that make this useful. IF there is an interest in moving forward, I'd be curious about replacing DynamoDB with another OSS backend that has strong consistency guarantees that may NOT be an S3-compatible storage backend. Like Redis or SQLite or something (not that these are ideal solutions; just examples). |
@skyzyx I generally agree with your points above, but you've missed what I think is the real driving factor here: simplicity and alignment. Knowing that I'm using AWS S3, I don't want the cognitive load of having to consider DynamoDB, even if it's not a cost factor. I also want AWS S3 to be aligned with the other object storage providers, where no additional systems are required. There are a number of options to make it less likely that this is misconfigured, but it should be trivial to actually test a backend, either as a new command or as part of the current usage pattern. Anecdotally, we nearly had to use Azure storage for our backends based on the perceived benefit of not having to run DynamoDB, despite the SMEs preferring AWS S3. |
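One way such a backend test might look is a small probe that attempts the same conditional create twice and refuses to proceed if the second attempt is accepted. The sketch below is only that, a sketch: `ConditionalStore`, `ProbeConditionalWrites`, `ErrUnsafeBackend`, and the two toy stores are hypothetical names, not part of OpenTofu or any SDK. A passing probe would be a necessary check at best, since a sequential test cannot prove that a backend stays consistent under real concurrency.

```go
package main

import (
	"errors"
	"fmt"
)

// ConditionalStore is a hypothetical interface for the operations
// the probe needs from a state storage backend.
type ConditionalStore interface {
	// PutIfAbsent must create key only if it does not already exist,
	// and report whether the object was created.
	PutIfAbsent(key, value string) (created bool, err error)
	Delete(key string) error
}

// strictStore models a backend that honours the condition.
type strictStore struct{ objects map[string]string }

func (s *strictStore) PutIfAbsent(key, value string) (bool, error) {
	if _, ok := s.objects[key]; ok {
		return false, nil
	}
	s.objects[key] = value
	return true, nil
}

func (s *strictStore) Delete(key string) error {
	delete(s.objects, key)
	return nil
}

// lenientStore models a backend that silently ignores the
// condition and overwrites the existing object.
type lenientStore struct{ objects map[string]string }

func (s *lenientStore) PutIfAbsent(key, value string) (bool, error) {
	s.objects[key] = value
	return true, nil
}

func (s *lenientStore) Delete(key string) error {
	delete(s.objects, key)
	return nil
}

// ErrUnsafeBackend signals that a second conditional create succeeded.
var ErrUnsafeBackend = errors.New("backend accepted a second conditional create; locking would be unsafe")

// ProbeConditionalWrites creates probeKey, tries to create it again,
// and reports an error if the backend allowed the second create.
func ProbeConditionalWrites(s ConditionalStore, probeKey string) error {
	created, err := s.PutIfAbsent(probeKey, "probe-1")
	if err != nil {
		return fmt.Errorf("first probe write: %w", err)
	}
	if !created {
		return fmt.Errorf("probe key %q already exists", probeKey)
	}
	defer s.Delete(probeKey) // best-effort cleanup of the probe object

	created, err = s.PutIfAbsent(probeKey, "probe-2")
	if err != nil {
		return fmt.Errorf("second probe write: %w", err)
	}
	if created {
		return ErrUnsafeBackend
	}
	return nil
}

func main() {
	strict := &strictStore{objects: map[string]string{}}
	lenient := &lenientStore{objects: map[string]string{}}
	fmt.Println("strict backend:", ProbeConditionalWrites(strict, ".lock-probe"))
	fmt.Println("lenient backend:", ProbeConditionalWrites(lenient, ".lock-probe"))
}
```

Running the probe when a backend is first configured would at least catch services that accept the request but drop the condition.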
Can we take a look at this? hashicorp/terraform#35661 |
@b-milescu I raised this with the core team for discussion, thank you for pointing this out. We won't take a look at the HashiCorp PR directly, but we'll consider if adding feature-parity is something we should do. |
Just to reinforce my comment above: we've been undertaking some TF maintenance where end-users have been required to manipulate their TF state. These are competent engineers who use TF day in, day out, and I was shocked at the number of cases where DynamoDB was missed, resulting in significant organisational overhead. IMHO an S3-only state system would have been much simpler for them to work with and reason about. |
Probably useful here:
https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/ Also interesting, and maybe to recommend during the setup of the S3 bucket storing the states:
|
Summary
This RFC proposes a significant enhancement to Terraform's S3 backend configuration. The objective is to provide a DynamoDB-free alternative for state file locking, making infrastructure management more flexible and cost-efficient.
Configuration update: modify Terraform's S3 backend to include a new lock_storage_type option in the s3 backend configuration block. Users can then choose between two options:
Problem Statement
DynamoDB has been Terraform's go-to solution for state locking, and it has served its purpose well. However, there are a few reasons why some users seek alternatives:
Cost Implications
DynamoDB can incur additional costs, especially for users with substantial workloads. These costs might not always align with the organization’s budget constraints.
Complex Setup
Setting up DynamoDB tables, managing permissions, and ensuring high availability can be complex and time-consuming.
External Dependency
DynamoDB introduces an external dependency into your Terraform workflow, making it harder to manage everything within your infrastructure code.
For Terraform users who prefer simplicity and cost-efficiency, the dependence on DynamoDB for state locking has been a recurring challenge. DynamoDB adds operational overhead and incurs additional costs, making it less attractive for smaller projects or cost-conscious organisations.
User-facing description
To implement this change, users simply need to set lock_storage_type = "S3Bucket" in their Terraform configuration. This will enable state locking using the S3 bucket itself, without the need for a DynamoDB table.
Compatibility: We need to ensure that this customisation is fully compatible with AWS S3 buckets and is backward compatible.
How to Use
I have added a dedicated section explaining the new lock_storage_type option and providing usage examples.
example
Technical Description
I have done a POC here
opentofu/pull/595
Rationale and alternatives
The DynamoDB-free state locking customisation brings several benefits:
Cost Savings
By eliminating DynamoDB, users can reduce their infrastructure costs, making Terraform more budget-friendly.
Simplified Setup
Say goodbye to the complexities of setting up and managing DynamoDB tables. The new configuration is straightforward and easy to implement.
Reduced External Dependencies
With this customisation, users reduce external dependencies in their Terraform setup, allowing for a more self-contained infrastructure as code.
Improved Flexibility
Users gain more flexibility in choosing the best state locking solution for their specific use case.
Why is this solution better than alternative solutions to this problem, if there are any?
Terraform backend GCS
GCS uses the same mechanism
Downsides
No
Unresolved Questions
NA
Related Issues
NA
Proof of Concept
#595