Skip to content

Conversation

@andyzhangx
Copy link
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR changes timeout value in csi plugin from 15s to 2min which fixes the timeout issue

attach/WaitForAttach timeout value is too small in CSI driver, it's all 15s which is really too small:

csiTimeout = 15 * time.Second

Compared to built-in driver timeout, it's really too small:
podAttachAndMountTimeout is 2 minutes, total waitForAttachTimeout is 10 minutes

Which issue(s) this PR fixes:

Fixes #79528

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

changes timeout value in csi plugin from 15s to 2min which fixes the timeout issue

/kind bug
/assign @msau42
/priority important-soon
/sig storage

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 28, 2019
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jun 28, 2019
@k8s-ci-robot k8s-ci-robot requested review from humblec and rootfs June 28, 2019 14:50
@andyzhangx
Copy link
Member Author

cc @davidz627

// for consistency.
csiAddrTemplate = "/var/lib/kubelet/plugins/%v/csi.sock"
csiTimeout = 15 * time.Second
csiTimeout = 2 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is the best default value. If the csiTimeout can be configurable like other csi sidecar do that will be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hoyho yes, but now I would like cherry-pick this fix to old release from 1.13 to 1.15, 15s is really too small, it makes the timeout a lot. And when user switch to use CSI driver, they won't be happy to see that timeout issue every time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with you. For our CSI driver, 15s is not enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also meet some issues caused by the default time.

LGTM

@davidz627
Copy link
Contributor

The same timeout is used for Attach/Mount/Stage/Block etc. Do we want to split it up into different timeouts for different operations that tend to take different amount of times? I also agree with the premise 15 seconds is far too short.

@msau42
Copy link
Member

msau42 commented Jun 29, 2019

Also keep in mind that the CSI driver should be idempotent and handle another call coming in on the same volume even if one is already in progress.

@andyzhangx
Copy link
Member Author

The same timeout is used for Attach/Mount/Stage/Block etc. Do we want to split it up into different timeouts for different operations that tend to take different amount of times? I also agree with the premise 15 seconds is far too short.

I think currently let's change the default timeout value to 2min, for other operations, like Mount, 15s is still too short if there is a large disk to format, 2min is much safer. If we found we need different timeouts in the future, then let's split them up at that time.

@andyzhangx andyzhangx changed the title fix: change timeout value in csi plugin fix: change default timeout value in csi plugin Jun 29, 2019
@davidz627
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2019
Copy link
Contributor

@cwdsuzhou cwdsuzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also cherrypick it to 1.12 1.11

// for consistency.
csiAddrTemplate = "/var/lib/kubelet/plugins/%v/csi.sock"
csiTimeout = 15 * time.Second
csiTimeout = 2 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also meet some issues caused by the default time.

LGTM

@andyzhangx
Copy link
Member Author

cc @jsafrane @saad-ali for approval, thanks.

@andyzhangx
Copy link
Member Author

/assign @jsafrane @saad-ali

@andyzhangx
Copy link
Member Author

could you also cherrypick it to 1.12 1.11

1.12, 1.11 is already code freezed, so the latest cherry pick version should be 1.13. @cwdsuzhou

@jsafrane
Copy link
Member

jsafrane commented Jul 3, 2019

I must miss something. The linked issue mentions Attach/WaitForAttach timeouts. Taking Attach as example, each Attach() call waits for 15 seconds for a volume to get attached and then it returns an error to A/D controller. A/D controller waits for an exponential backoff and re-tries calling Attach(). Initial backoff delay is 0.5s and doubles with each error. So for the first two minutes we have:

15s waiting for attachment
0.5s sleep
15s waiting for attachment
1s sleep
...
15s waiting for attachment
16s sleep

(=121.5s total)
In these two minutes, 90 seconds the controller is actively waiting for the attachment and only 31.5s sleeping.

If the volume is attached in first 48.5 seconds, it takes only 2 second max for the controller to notice a volume is attached (that's the max sleep in 3 iterations).

The real difference is the last sleep. If a volume gets attached 90.5s after start, the controller waits for extra 16 seconds to notice it. Is it really the case we are trying to fix here? What storage backend takes 90s to attach a volume?

WaitForAttach: I may be mistaken here, I think kubelet calls WaitForAttach after A/D controller reports a volume as attached. A VolumeAttachment should already be in the right state and it should be only a matter of syncing API server watch(), which should take fraction of a second in average case.

@andyzhangx
Copy link
Member Author

15s waiting for attachment

Hi @jsafrane we are talking about the 15s waiting for attachment value, for a few CSI drivers, the first 15s is too short, there would be timeout error, while it could succeed finally by retry. So make the default timeout value as 120s(identical to built-in driver) would be better.

@cwdsuzhou
Copy link
Contributor

@jsafrane @andyzhangx

I have sent a PR #75963 in the past to expose the maxBackoff time which may help decrease the time mount waitis if attach lasts for long time. However, it is hard to be merged.

@andyzhangx
Copy link
Member Author

@jsafrane @andyzhangx

I have sent a PR #75963 in the past to expose the maxBackoff time which may help decrease the time mount waitis if attach lasts for long time. However, it is hard to be merged.

Actually timeout value in the built-in driver is 2min (per

// podAttachAndMountTimeout is the maximum amount of time the
// WaitForAttachAndMount call will wait for all volumes in the specified pod
// to be attached and mounted. Even though cloud operations can take several
// minutes to complete, we set the timeout to 2 minutes because kubelet
// will retry in the next sync iteration. This frees the associated
// goroutine of the pod to process newer updates if needed (e.g., a delete
// request to the pod).
// Value is slightly offset from 2 minutes to make timeouts due to this
// constant recognizable.
podAttachAndMountTimeout = 2*time.Minute + 3*time.Second
), while csi driver plugin makes it as only 15s, I think this is a regression. 2min is a suitable value for most drivers, although retry will succeed in the end.

@jsafrane
Copy link
Member

jsafrane commented Jul 3, 2019

So the only problem here is a timeout error visible in kubectl describe pod?

@andyzhangx
Copy link
Member Author

So the only problem here is a timeout error visible in kubectl describe pod?

Yes

@jsafrane
Copy link
Member

jsafrane commented Jul 3, 2019

Oh, I expected there is some real issue :-)

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2019
@andyzhangx
Copy link
Member Author

Oh, I expected there is some real issue :-)

/approve

Thanks, this will make some users nervous if first attach volume failed every time.

@k8s-ci-robot k8s-ci-robot merged commit a3be4b6 into kubernetes:master Jul 3, 2019
k8s-ci-robot added a commit that referenced this pull request Jul 4, 2019
…9529-upstream-release-1.15

Automated cherry pick of #79529: fix: change timeout value in csi plugin
k8s-ci-robot added a commit that referenced this pull request Aug 7, 2019
…9529-upstream-release-1.13

Automated cherry pick of #79529: fix: change timeout value in csi plugin
k8s-ci-robot added a commit that referenced this pull request Aug 7, 2019
…9529-upstream-release-1.14

Automated cherry pick of #79529: fix: change timeout value in csi plugin
mars1024 pushed a commit to mars1024/kubernetes that referenced this pull request Nov 26, 2019
mars1024 pushed a commit to mars1024/kubernetes that referenced this pull request Nov 26, 2019
…35774_20190702

Merge branch 'cherry-pick-kubernetesgh-79529 of [email protected]:antstack/cafe-kubernetes.git into EI61435774_20190702

Signed-off-by: 子波 <[email protected]>
mars1024 pushed a commit to mars1024/kubernetes that referenced this pull request Nov 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

attach/WaitForAttach timeout value is too small in CSI driver

8 participants