Soft state locking #30277

Open
deberon opened this issue Dec 30, 2021 · 7 comments
Labels
enhancement, new (new issue not yet triaged)

Comments


deberon commented Dec 30, 2021

Current Terraform Version

N/A

Use-cases

When designing deployment pipelines, it would be nice to have the ability to completely lock the state between a plan and apply stage. That way no other modifications to the state can be made while a change is pending.

Attempted Solutions

I looked for ways to manually lock the state and didn't find anything apart from manually editing the state file.

Proposal

Having the ability to initiate a "soft" lock on the state would be very helpful logistically when designing CI/CD pipelines. The existing whole-workspace locking mechanism is sufficient; I am only proposing the ability to turn it on and off manually. Here is an example workflow, assuming a manual intervention step between the plan and apply stages:

plan stage

terraform plan -out plan.out -soft-lock-id <lock_id>

apply stage

terraform apply plan.out

failure catching stage

terraform state unlock <lock_id>

I would also propose the addition of the following terraform state subcommands:

  • terraform state lock <lock_id>
  • terraform state unlock <lock_id>

Based on my understanding, this might work with the following changes.

State file

{
  "version": 4,
+ "soft_lock_id": "<lock_id>",
  "terraform_version": "1.1.2",
  "serial": 1,
  "lineage": "676fba34-18c2-25bc-b542-eafc3190dd35",
  "outputs": {
    "hi": {
      "value": "hi",
      "type": "string"
    }
  },
  "resources": []
}

Plan output

The output file generated by terraform plan -out could include an additional file called lock_id. The contents of this file would be the lock_id that is currently locking the state. The apply command could either process the lock_id from this file, or from a parameter passed at runtime: terraform apply -soft-lock-id=<lock_id>

I believe this would allow clients to completely ignore soft locks since having additional keys in the state file shouldn't cause any parsing problems (maybe? this is an assumption on my part) and the backend locking mechanism will still be in place for critical state locking. I believe this would make the feature opt-in and backwards compatible.
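To illustrate the opt-in behavior described above, here is a minimal POSIX shell sketch of how a pipeline wrapper might honor a soft lock. It is only an assumption-laden illustration: `soft_lock_id` is the key proposed in this issue and is not something Terraform reads or writes today, and the function names `soft_lock_id_from_state` and `soft_lock_permits` are hypothetical.

```shell
# Sketch only: "soft_lock_id" is the key proposed in this issue; Terraform
# does not write or read it today.

# Print the soft lock ID recorded in a state file, or nothing if absent.
# Uses sed rather than a JSON parser, so it assumes the pretty-printed,
# one-key-per-line layout shown in the state file example.
soft_lock_id_from_state() {
    sed -n 's/^[[:space:]]*"soft_lock_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1
}

# Succeed if the state carries no soft lock, or if the caller presents the
# same lock ID that is recorded in the state. The body runs in a subshell
# so the helper variable does not leak into the caller.
soft_lock_permits() (
    held_id=$(soft_lock_id_from_state "$1")
    [ -z "$held_id" ] || [ "$held_id" = "${2:-}" ]
)
```

A wrapper could save the state with `terraform state pull > state.json` and then run `soft_lock_permits state.json "$LOCK_ID" && terraform apply plan.out`. A client that never calls the check simply ignores the soft lock, which is the backwards-compatible behavior the proposal aims for.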

References

  • 28710
@deberon deberon added enhancement new new issue not yet triaged labels Dec 30, 2021
@deberon deberon changed the title from "Token based state locking" to "Soft state locking" Dec 30, 2021
Author

deberon commented Dec 30, 2021

Per the contribution guide I am intending to implement this change via PR.

Contributor

apparentlymart commented Jan 4, 2022

Hi @deberon! Thanks for this proposal.

I've not thought through this all completely yet, so this is just an initial thought and I'd love to hear what others on our team think here too.

An important thing to consider here is that not all backends are able to sustain a lock without a running process actually "holding" the lock. For example, the local backend uses flock (or similar on other platforms) so the lock is released implicitly when the CLI process exits, and I believe the "consul" backend needs to hold open a TCP socket to a Consul server in order to sustain the lock.

One way we could address that is to allow backends to each individually opt in to supporting this sort of explicit locking, and thus it can be left unsupported (with an explicit error message) on backends that can't support it.

Another possibility would be to have a command you can run which stays running as the means to hold the lock, and then you release the lock by interrupting that process. That would then work for all backends in principle, but would still require a means like you proposed here for other commands running in the same directory to be able to use the same lock.
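The "command that stays running" idea can be sketched in POSIX shell as follows. This is only a model of a process-held lock, assuming the `flock(1)` utility from util-linux is installed; it is not how any Terraform backend is implemented, and `hold_lock` and `try_lock` are hypothetical names.

```shell
# Sketch only: models "a process holds the lock" with flock(1) from
# util-linux. It is not how any Terraform backend is implemented.

# hold_lock LOCKFILE
# Take an exclusive lock on LOCKFILE and keep it until this process is
# interrupted. Run it in the background or in its own shell, because it
# ends by calling exit.
hold_lock() {
    exec 9>"$1" || return 1
    if ! flock -n 9; then
        echo "lock already held: $1" >&2
        return 1
    fi
    echo "lock held; interrupt this process to release it"
    trap 'exit 0' INT TERM
    # Close fd 9 for sleep so only this shell keeps the lock open.
    while :; do sleep 1 9>&-; done
}

# try_lock LOCKFILE
# Succeed only if no other process currently holds the lock. The lock
# taken here is dropped again as soon as the subshell exits.
try_lock() {
    ( exec 9>"$1" && flock -n 9 )
}
```

Running `hold_lock /tmp/ws.lock &` keeps the lock until that background process receives SIGTERM or exits, after which `try_lock /tmp/ws.lock` succeeds again. As the comment above notes, other commands would still need some way to share the held lock, which this sketch does not address.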

The backend locking semantics tend to be deceptively complex in spite of the relatively simple API, so I expect there are some other similar subtleties to consider here, but this was the one that came to my mind while initially thinking about this.

@sameershah21

If the backend locking mechanisms are too complex to extend, one alternative I have used to avoid needing a lock (in AWS) is IAM. I have found the following setup helpful. Note that these steps are for AWS, but something similar can be done in other clouds (Azure, GCP, etc.).
Create:

  • IAM role for remote state access within the s3 bucket object
  • Trust Policy for who can assume the role (Only CI/CD Principal)
  • Bucket object access permissions policy for write access
  • IAM role policy attachments for the above

Now, when a CI/CD pipeline runs, it uses the CI/CD principal with the assumed role to lock and access the bucket object. No other principal can write to this S3 bucket object.
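The setup listed above could be sketched roughly like this. The two policy documents follow the standard IAM JSON policy format, but the function names, role name, bucket, and key are placeholders I made up for illustration, and the permissions shown are a minimal guess rather than the exact set a given backend configuration needs.

```shell
# Sketch only: function names and all ARNs/bucket names are placeholders.

# Trust policy: only the given CI/CD principal may assume the role.
render_trust_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "$1" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Permissions policy: read and write access to a single state object.
render_state_write_policy() {
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::$1/$2"
    }
  ]
}
EOF
}
```

The rendered files could then be passed to the AWS CLI, for example `aws iam create-role --role-name tf-state-writer --assume-role-policy-document file://trust.json` followed by `aws iam put-role-policy --role-name tf-state-writer --policy-name state-write --policy-document file://state-write.json`. This only restricts writers if no other principal has been granted write access to the object.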

I am not sure how feasible this solution is for every use case, but it has worked well for me over the years, so I wanted to share it.

Member

jbardin commented Jan 5, 2022

For some background, the original design intent of the state locking mechanism was only to guard the state data against concurrent modification (early on, users concurrently working with S3 without some sort of global orchestration mechanism would find themselves with the wrong state at times). The -lock-timeout option was only added after the fact once we could be sure that it didn't impose any undue restrictions on backend maintenance. We purposely did not create a model which allowed persistent locks, not only because not all implementations could maintain such locks, but also because it was outside the design goals of the Terraform CLI. Locking the state is only part of implementing a complete workflow in Terraform, hence the workflow tooling should also manage the various levels of synchronization.

Once we have a new interface designed for remote state storage, we can document a more precise contract for the locking mechanisms. Hopefully in the process we can simplify things a bit for implementors, though the semantics may change slightly in the process.

Author

deberon commented Jan 5, 2022

@apparentlymart I'm not proposing any changes to the existing locking mechanism. Instead I'm suggesting a process that writes an arbitrary lock id directly into the state (my proposal adds a new top level key to the state object). So the state can be unlocked (from a backend perspective) while still allowing an external process to make a claim to the state. This claim can even be ignored by terraform apply by default, which would maintain existing functionality. My worry is that this might compromise some understanding of when and how changes are written to the state.

Thanks for taking a look at this!


n2N8Z commented Feb 19, 2023

#17203


darpham commented Oct 25, 2024

If it's of any use to folks stumbling across this, here's a bash/zsh compatible interactive script to more easily unlock state. It could be updated to run in CI/CD in an automated fashion with some pre-validation - YMMV, and use with due caution.

tf_unlock() {
    local LOCK_ID
    local ERROR_OUTPUT
    local LOCK_DETAILS

    # Capture both stdout and stderr
    ERROR_OUTPUT=$(terraform plan -json 2>&1)

    # Check if there's a lock error
    if [[ $ERROR_OUTPUT == *"Error acquiring the state lock"* ]]; then
        LOCK_DETAILS=$(echo "$ERROR_OUTPUT" | jq -r '.diagnostic.detail // empty' 2>/dev/null)
        if [[ -z "$LOCK_DETAILS" ]]; then
            LOCK_DETAILS=$(echo "$ERROR_OUTPUT" | grep -A10 "Lock Info:")
        fi
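        LOCK_ID=$(echo "$LOCK_DETAILS" | awk '/ID:/ {print $2; exit}')
        # Refuse to continue if no lock ID could be parsed from the output
        if [[ -z "$LOCK_ID" ]]; then
            echo "Could not parse a lock ID from the error output." >&2
            return 1
        fi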

        echo "State is locked. Lock details:"
        echo "$LOCK_DETAILS"
        echo

        echo -n "Do you want to unlock this state? Type 'yes' to confirm: "
        # -r keeps backslashes in the typed answer literal
        read -r response
        if [[ "$response" == "yes" ]]; then
            echo "Attempting to unlock..."
            if terraform force-unlock -force "${LOCK_ID}"; then
                echo "Terraform state has been successfully unlocked!"
            else
                echo "Failed to unlock the state. Please check the error message above." >&2
                return 1
            fi
        else
            echo "Unlock cancelled."
            return 0
        fi
    elif [[ $ERROR_OUTPUT == *"Error:"* ]]; then
        echo "Error occurred while checking Terraform state:" >&2
        echo "$ERROR_OUTPUT" >&2
        return 1
    else
        echo "State is not locked. No action needed."
    fi
}
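For the CI/CD pre-validation mentioned above, one possible sketch is to check who holds the lock before unlocking anything. The helper names `lock_info_field` and `lock_owned_by` are hypothetical, and the parsing assumes the lock error contains a "Lock Info:" block with one `Key: value` pair per line; verify that against the output of your Terraform version before relying on it.

```shell
# Sketch only: assumes a "Lock Info:" block with "Key: value" lines.

# lock_info_field FIELD
# Print the value of one field (e.g. ID, Who, Operation) from the lock
# error text read on stdin. Scans every word on a line so that a leading
# border character in front of the key does not break the match.
lock_info_field() {
    awk -v key="$1:" '
        /Lock Info:/ { in_block = 1; next }
        in_block {
            for (i = 1; i < NF; i++) {
                if ($i == key) {
                    out = $(i + 1)
                    for (j = i + 2; j <= NF; j++) out = out " " $j
                    print out
                    exit
                }
            }
        }
    '
}

# lock_owned_by EXPECTED_WHO
# Succeed only if the lock described on stdin is held by EXPECTED_WHO.
lock_owned_by() {
    [ "$(lock_info_field Who)" = "$1" ]
}
```

A pipeline could capture the plan output once, run it through `lock_owned_by "$CI_IDENTITY"` (a variable name assumed here), and only call `terraform force-unlock` when the check passes, so that a lock taken by a person is never broken automatically.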

6 participants