
Question regarding lock mechanism #2169

Closed

BenjaminDecreusefond opened this issue Nov 14, 2024 · 14 comments
Labels
enhancement (New feature or request) · pending-decision (This issue has not been accepted for implementation nor rejected; it's still open to discussion.)

Comments

@BenjaminDecreusefond

BenjaminDecreusefond commented Nov 14, 2024

OpenTofu Version

OpenTofu v1.8.3
on darwin_arm64
+ provider registry.opentofu.org/alekc/kubectl v2.1.3
+ provider registry.opentofu.org/hashicorp/archive v2.6.0
+ provider registry.opentofu.org/hashicorp/aws v5.75.1
+ provider registry.opentofu.org/hashicorp/helm v2.16.1
+ provider registry.opentofu.org/hashicorp/kubernetes v2.33.0
+ provider registry.opentofu.org/integrations/github v6.3.1
+ provider registry.opentofu.org/mrparkers/keycloak v4.4.0

The problem in your OpenTofu project

Hello!

I’d like to get your insights on the locking mechanism in tofu. At our company, we use a single Terraform configuration to manage multiple environments, selecting specific environments through different variable files. All init and plan operations are run within the same working directory. We’re using an open-source remote backend called Terrakube, similar to Terraform Enterprise (TFE) with concepts like workspaces and working directories.

Our goal is to create a TFE-like VCS integration solution by triggering a Lambda function on pull request events. This setup works well when we run init and plan sequentially for each environment in a folder. However, with a large number of environments, sequential execution becomes time-consuming. Ideally, we want to run several tofu plan operations in parallel within the same directory.

The main challenge here is that the state file’s lock mechanism prevents parallel runs, as only one operation can access the state at a time.

Could you suggest a way to refactor this setup to allow for parallel plan operations while respecting the state locking requirements?

Attempted Solutions

Tried to make a copy of the working directory for each env, but the remote workspace working directory is unable to use TF_CLI_ARGS_plan.

Proposal

My idea is that, for remote backends, we rename .terraform and .terraform.lock to, for instance, .terraform-<workspace-name> and .terraform.lock.<workspace-name>. This would allow managing and applying different state files at the same time. It's just a proposal; I'm really not sure it is doable, nor that it's a good idea!

Lemme know! :)

References

No response

@BenjaminDecreusefond BenjaminDecreusefond added enhancement New feature or request pending-decision This issue has not been accepted for implementation nor rejected. It's still open to discussion. labels Nov 14, 2024
@abstractionfactory
Contributor

Hello @BenjaminDecreusefond, thank you for the issue. I think this may be closely related to other locking-related issues, but I'll let people more knowledgeable than me in this area say more about that. In the meantime, I've queued this issue up for the core team to discuss; please bear with us until we get to it.

@apparentlymart
Contributor

Hi @BenjaminDecreusefond! Thanks for starting this discussion.

From the perspective of OpenTofu CLI, the state locking mechanism is already expecting the locks to be separate for each workspace, because the locking API belongs to what we call a "state manager" and each one of those manages the state for only one workspace. The backend API includes a method which takes a workspace name and returns a state manager, and then OpenTofu calls the lock method on that state manager.

Of course, that doesn't necessarily guarantee that the underlying state manager implementation will also treat them as separate: it's possible to implement this API in a way that acquires a broader lock so that a lock on any workspace effectively locks them all.
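
To make that shape concrete, here is a simplified Go sketch of the relationship described above. This is an illustration of the concept only; the names and signatures are approximations, not the real OpenTofu source:

```go
package sketch

// Backend hands out a separate state manager per workspace, which is
// why locks are naturally scoped per workspace from the CLI's side.
type Backend interface {
	// StateMgr takes a workspace name and returns the state manager
	// responsible for that single workspace's state.
	StateMgr(workspace string) (StateManager, error)
}

// StateManager owns the state of exactly one workspace, including its lock.
type StateManager interface {
	// Lock acquires this workspace's lock and returns an ID that must
	// be passed back to Unlock to release it.
	Lock(info *LockInfo) (id string, err error)
	Unlock(id string) error
}

// LockInfo records who is locking and for what operation.
type LockInfo struct {
	ID        string
	Operation string
	Who       string
}
```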

You mentioned that you are using the remote backend. That backend effectively just delegates the locking request to the remote API, and so it's the server of the API that would decide what granularity to use for locks.

Is it possible that the cross-workspace locking is something that Terrakube is doing in its server, rather than something OpenTofu is doing client-side? I'm not familiar with Terrakube so I don't know if this is true, but just want to try to understand exactly where this constraint is coming from so we can figure out what it would take to weaken it in the way you described.

@apparentlymart
Contributor

apparentlymart commented Nov 14, 2024

Putting the question of lock granularity aside for the moment, I think there is another potential improvement we could consider:

Today's state locking model is a single advisory lock that only one client can hold at a time, regardless of what they are intending to do with the state.

In principle we could rework the locking API to differentiate between read-only (shared) locks and writable (exclusive) locks. A suitably-advanced state storage implementation could then choose to allow multiple read-only locks to be held at the same time, but to treat writable locks as exclusive both to other write locks and to other read-only locks.

We could then arrange for OpenTofu to acquire a writable lock only for the commands that will both read and write state during their work, which includes (but isn't necessarily limited to):

  • tofu apply (both in the interactive mode where it creates a plan and prompts for approval, and in the mode where it's applying a plan that was previously saved)
  • all of the tofu state subcommands that make direct modifications to the state
  • tofu import
  • tofu init when it is dealing with "backend migration"

In particular, I think it should be safe for commands like tofu plan, tofu output, tofu show and tofu console to do their work with only a shared read-only lock, and so these commands would be able to run concurrently with themselves and with each other.

In the saved plan workflow, OpenTofu saves enough information in the plan file to detect if the state has changed since the plan was created, so there is no need to hold a writable lock across both the plan and apply phases if you are willing to tolerate an error when someone tries to apply a stale plan.

I think we would still need to give concrete implementations the option of treating all locks as exclusive, because not all state storage implementations have a good way to implement a read-write lock, but if we include that information in our lock requests then each implementation can presumably decide whether or not to differentiate between the two lock types.
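
To illustrate, here's a hypothetical Go sketch of what such a lock request might look like. Nothing like this exists in the current API, and all names here are made up for illustration:

```go
package sketch

// LockMode distinguishes shared (read-only) from exclusive (writable)
// locks in this hypothetical API.
type LockMode int

const (
	// LockShared could be held by many readers at once, e.g. for
	// tofu plan, tofu output, tofu show, or tofu console.
	LockShared LockMode = iota
	// LockExclusive would be held by a single writer, e.g. for
	// tofu apply or tofu import, and excludes shared holders too.
	LockExclusive
)

// LockRequest is what a client would send to the state storage.
type LockRequest struct {
	Mode      LockMode
	Operation string
	Who       string
}

// A storage implementation with no way to express read-write locks can
// simply ignore Mode and treat every request as exclusive, preserving
// today's behavior.
```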

If we decide that this is worth pursuing then we should probably include it in #2157. In particular, hopefully we'd design it into the new plugin-based state storage API from the outset so that we don't need to make a breaking protocol change.

@BenjaminDecreusefond
Author

BenjaminDecreusefond commented Nov 15, 2024

Yep! I totally agree with you on the second part. Being able to run read-only commands concurrently would be really helpful!

However, I'm not sure I understand your explanation that the lock mechanism is applied by Terrakube rather than by Tofu. From my perspective, when you run tofu plan with a remote backend, Tofu uploads the content of the current directory to the remote workspace, where it is processed by the remote API. Since the API has everything it needs to perform the plan, I don't understand why it would need to put a lock on the local file?

To me, if a new plan is run it will be queued until the previous plan finishes? Maybe I misunderstood something?

Thanks!
Benjamin

@cam72cam
Member

@Cipher-08 I've removed your post as it appears to be an LLM summary of past comments with no additional substance.

@opentofu opentofu deleted a comment from Cipher-08 Nov 15, 2024
@BenjaminDecreusefond
Author

BenjaminDecreusefond commented Nov 15, 2024

Hello!

Thanks everyone for your answers! I think I managed to solve the issue with Terraform environment variables!
First, thanks Cipher for your answer! Unfortunately, we would like to avoid splitting our infrastructure definition into multiple folders for the same project, for the following reasons:

  • It creates many duplicates of our infrastructure definition, which makes it more difficult to manage
  • Developing with this philosophy means copy-pasting our dev code into the production code, which inevitably results in human errors and drift between staging and prod

For us this is a discouraged approach, and we would like to avoid it as much as possible! :)

Nonetheless, I pursued my research and found out that we can use environment variables with OpenTofu! In particular, TF_DATA_DIR allows us to define the directory where the .terraform folder will be written during tofu plan. Assuming we have this file structure:

├── backend.tf
├── envs
│   ├── command
│   │   └── variables.tfvars
│   └── common.tfvars
├── terrakube.tf
└── variables.tf

Assuming we have an environment named command and would like to load the variables file for that env, we run the following command:
TF_DATA_DIR=./envs/command/.terraform tofu plan -var-file="envs/command/variables.tfvars"

This way we keep the same file structure, and we can plan several environments concurrently!

Since this issue seems to have initiated a lock-mechanism refactor I will leave it open, but feel free to close it if you want! :)

Thanks for your help!
Best regards,
Benjamin

@apparentlymart
Contributor

Hi @BenjaminDecreusefond! I'm glad you found a working solution.

If you'd be willing, I'd still like to understand more about what you are doing here, since the solution you've described doesn't match my current understanding of how the system works. 🤔

You mentioned that you are using Terrakube and so I'd been assuming you were doing something like what they describe in CLI-driven Workflow, with a backend "remote" block referring to your Terrakube server.

If that were true then the state locking would be delegated to the remote server, rather than enforced on your local system. The TF_DATA_DIR directory would not change the behavior because the locking does not use anything in that directory aside from using the $TF_DATA_DIR/terraform.tfstate file to decide which API endpoints to connect to.
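
For illustration, that cache file contains roughly the following (an abbreviated sketch with placeholder values, not an exact dump): it records which backend and which server to talk to, but holds no actual resource state.

```json
{
  "version": 3,
  "backend": {
    "type": "remote",
    "config": {
      "hostname": "terrakube.example.com",
      "organization": "my-org",
      "workspaces": { "name": "staging" }
    }
  }
}
```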

So I assume that something special is happening in your case, and I'd like to understand what that is to make sure you're depending on an intended behavior of the system, rather than on a coincidence that might change in future if either OpenTofu or Terrakube's behavior changes, or on something that might make the behavior unsafe for you in practice. 😬


I found some code in Terrakube that seems to implement the workspace locking API, and it does seem like it treats locking as a per-workspace problem. Therefore I would not expect any operation on a specific workspace to block any operation on any other workspace.

From your original description I assumed you were talking about different workspace operations blocking one another, but on re-read it occurs to me that you might've been asking about multiple operations in the same workspace. Is that true?

If that is true then that would explain why the operations were blocking each other before, but it still doesn't explain why selecting a different TF_DATA_DIR avoided the problem, because I would've expected all of those data dirs to contain essentially the same $TF_DATA_DIR/terraform.tfstate, pointing to the same workspace on the same Terrakube server, and so these would all still end up trying to lock the same workspace. 🤔

Do you have any ideas about what incorrect assumption I might be making here? This is the first time I've learned about Terrakube so I may be totally misunderstanding how it works or how you are using it. 😖

@BenjaminDecreusefond
Author

Hi @apparentlymart! I'll try to be as clear as possible! :)

Terrakube supports two approaches to workspaces. One is a CLI-driven workflow and the other is a VCS-driven workflow. In our case we are using the VCS-driven workflow, and the idea was to recreate a TFE-like VCS system for the Terrakube environment.

Then we had the idea to create a Lambda triggered by a webhook. The Lambda would go into the modified directories and run a speculative plan on the affected project. In the tree structure above, it is important to note that each folder inside the envs/* directory has its own dedicated workspace. Therefore, assuming we have envs/staging and envs/production folders, we have two workspaces, one for staging and one for production.

The issue I had in the first place was that when I ran two tofu plan commands concurrently in the AWS Lambda, one for staging and one for production, a lock mechanism prevented me from running both at the same time.
Believe it or not, and I'm not sure I understand why, but when I ran two tofu plan commands concurrently on my local machine I did not hit the lock issue. Maybe it is because I ran them in two different shell windows?

I think there might be some confusion about the piece of code you found in Terrakube. Terrakube offers two ways to apply Terraform: locally (on your computer) or remotely (a Terrakube job handles it). The code you pointed out is used for local runs, which seems coherent: while our local machine is applying, we don't want others who are applying to override our lock. However, at our company we use remote plans, which means that state reads/writes are managed by Terrakube.

My guess is that using TF_DATA_DIR allows us to store the .terraform folder in an isolated place, which was not the case before, when the .terraform folders of staging and production were overlapping each other. And I think the state lock is managed by Terrakube during the plan!

I tried to be as clear as possible!
Lemme know if you have questions! :)

Benjamin

@apparentlymart
Contributor

Thanks for that extra context, @BenjaminDecreusefond.

I think that all explains why the API code I was looking at is not important, but it leaves one question unanswered: why did the two tofu plan runs in AWS Lambda conflict with one another, even though they were using separate workspaces?

I have no answer to that question. I can't think of any reason why that should be true.

I also still don't understand why a separate TF_DATA_DIR helps, because OpenTofu does not use anything in that directory directly for state locking. The TF_DATA_DIR is used for a few different purposes by OpenTofu, but the only one that's even slightly related to state locking is $TF_DATA_DIR/terraform.tfstate, and even that isn't directly used for locking: it just tells OpenTofu what backend config to use to obtain the state manager that will ultimately deal with the locking.

Perhaps I should just leave this unexplained. 😀 I just worry a little that your solution isn't working the way you think it is and that you might get surprised later if something changes. 😬

@BenjaminDecreusefond
Author

BenjaminDecreusefond commented Nov 16, 2024

Hi @apparentlymart!

Looking back at the error, I think I know why it fixed the issue!

│ Error: Error locking state: Error acquiring the state lock: resource temporarily unavailable
│ Lock Info:
│   ID:        681e4111-d9a2-2e55-47df-fe783bbe4a91
│   Path:      .terraform/terraform.tfstate
│   Operation: backend from plan
│   Who:       [email protected]
│   Version:   1.8.4
│   Created:   2024-11-14 09:10:11.343848201 +0000 UTC
│   Info:
│
│ OpenTofu acquires a state lock to protect the state from being written
│ by multiple users at the same time. Please resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended.

As we can see in the error, the lock is indeed on .terraform/terraform.tfstate, which explains why moving it to a separate folder per environment fixed the issue!

@apparentlymart
Contributor

Ahh, okay! That does seem to explain it.

What you've encountered here is not actually the normal state locking, but instead the code that interacts with the backend configuration cached in .terraform/terraform.tfstate, handled by the command/clistate package.

The use of "state" to refer to this concept is some technical debt resulting from it being derived from a very old version of the state snapshot format that OpenTofu no longer uses for any purpose other than this special .terraform/terraform.tfstate file.

It's always confusing that this code generates messages referring to this file as "state"; these messages are all just inherited from the older code that this was derived from.
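
As a minimal sketch of why two concurrent runs in one working directory collide here (assuming a Unix-style non-blocking file lock, which is illustrative only and not the actual clistate code): the second process to request the lock gets EAGAIN, which the OS reports as "resource temporarily unavailable", matching the error message above.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open(".terraform/terraform.tfstate")
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// LOCK_EX|LOCK_NB: request an exclusive lock without blocking.
	// If another process already holds it, Flock returns EAGAIN,
	// i.e. "resource temporarily unavailable".
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		fmt.Println("lock failed:", err)
		return
	}
	fmt.Println("lock acquired")
}
```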

This issue seems like a good prompt to review these legacy codepaths and understand what locking patterns they are using, and whether it's actually necessary to take and hold locks here. And along with that, also a good reminder to clean up all of this legacy messaging about "states" that often causes confusion like this.

@BenjaminDecreusefond
Author

Yes, I do agree with you!

Also, I think it would be nice to remove the locking mechanism for read-only commands, as you mentioned earlier!

Regards!
Benjamin

@apparentlymart
Contributor

apparentlymart commented Nov 19, 2024

Thanks for confirming, @BenjaminDecreusefond.

I think then we probably need to decide what to do with this issue. 🤔

What you originally discussed here was, it turns out, related to some awkward legacy behavior in the code that manages the "backend configuration state" file ($TF_DATA_DIR/terraform.tfstate), which currently seems to prevent running any two tofu commands that interact with the backend in the same working directory at the same time.

However, you found that you can use the TF_DATA_DIR environment variable to work around that, and the code in question is quite old and crufty so I expect we're unlikely to prioritize reworking it and potentially breaking it unless this were a common problem. Therefore my instinct is to retitle this issue so that it's more clearly a problem statement rather than a question and encourage others to vote on it if they would benefit from it being improved.

We also have the separate question of whether we want to switch from a pure mutex to a rwlock-style locking strategy for the real OpenTofu state, as stored through a remote backend. Since we already have #2157 working towards a vision for how the backend concept might evolve in future, I think I'm just going to leave myself a comment there for now to remind me to add something to that RFC encouraging a future author of a "plugin-based state storage" RFC to consider whether we ought to design a shared vs. exclusive lock representation into the plugin API.

Does that all seem reasonable to you?

@apparentlymart
Contributor

Hi again!

Since you found a suitable workaround for your situation, and since I already captured the question of shared vs. exclusive locks in a comment over in #2157, I'm going to close this issue now.

I think there is still a valid question about whether the filesystem-level locking of $TF_DATA_DIR/terraform.tfstate is actually useful, whether it's implemented correctly, and whether it might also benefit from a shared vs. exclusive lock distinction, but we weren't able to form a coherent enough question to frame that as a standalone issue covering it, and so we'll likely also revisit that part as we continue working on #2157 since this local tracking file could potentially become completely irrelevant under some possible new designs for state storage.

Thanks for your patience while we worked out what the problem was here!

@apparentlymart closed this as not planned Nov 26, 2024