rest: retry on connection refused and apiserver shutdown #75368
Conversation
/assign @logicalhan
@smarterclayton this is in keeping with a pull you wrote a while back I think. @kubernetes/sig-api-machinery-misc
// Thus in case of "GET" operations, we simply retry it.
// We are not automatically retrying "write" operations, as
// they are not idempotent.
-	if !net.IsConnectionReset(err) || r.verb != "GET" {
+	if !net.IsConnectionReset(err) || !IsConnectionRefused(err) || !isAPIServerShutdown(err) || r.verb != "GET" {
this logic isn't right... this will not retry unless a single error is simultaneously a connection reset AND a connection refused AND an "apiserver is shutting down" error.
this demonstrates the existing retry logic isn't exercised in tests :-/
unit tests did fail
it is, the unit test failed, mea culpa.
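A minimal sketch of the corrected guard, assuming the helper names from the diff above (net.IsConnectionReset, IsConnectionRefused, isAPIServerShutdown); illustrative only, not the final upstream code:

// shouldRetry reports whether a failed request should be retried.
// Any ONE of the transient errors is enough to retry; the version in the
// diff above accidentally required all of them to be true at once.
func shouldRetry(verb string, err error) bool {
	if verb != "GET" {
		// write operations are not idempotent, so never retry them automatically
		return false
	}
	return net.IsConnectionReset(err) ||
		IsConnectionRefused(err) ||
		isAPIServerShutdown(err)
}

The guard in the request loop would then read `if !shouldRetry(r.verb, err) { ... }`.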
Force-pushed from 28a598d to c7c19b8.
Force-pushed from c7c19b8 to d910cea.
Force-pushed from d910cea to ecc4c87.
@@ -38,6 +38,7 @@ func WithWaitGroup(handler http.Handler, longRunning apirequest.LongRunningReque

	if !longRunning(req, requestInfo) {
		if err := wg.Add(1); err != nil {
			w.Header().Add("Retry-After", "1")
			http.Error(w, "apiserver is shutting down.", http.StatusInternalServerError)
can we also switch this to a Status object?
I would do this as a follow-up (it might require more plumbing, and there are more places in the code that return plain-text errors via http.Error()).
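For illustration, a rough sketch of what replying with a machine-readable Status object instead of a plain-text http.Error could look like; writeShutdownStatus is a hypothetical helper, and the eventual follow-up might use the apiserver's existing response-writing plumbing instead:

import (
	"encoding/json"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// writeShutdownStatus sets Retry-After and writes a metav1.Status body
// so clients get a structured error rather than plain text.
func writeShutdownStatus(w http.ResponseWriter) {
	status := metav1.Status{
		TypeMeta: metav1.TypeMeta{Kind: "Status", APIVersion: "v1"},
		Status:   metav1.StatusFailure,
		Reason:   metav1.StatusReasonServiceUnavailable,
		Message:  "apiserver is shutting down",
		Code:     http.StatusInternalServerError,
	}
	w.Header().Set("Retry-After", "1")
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusInternalServerError)
	_ = json.NewEncoder(w).Encode(&status)
}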
is 1 appropriate?
@smarterclayton my assumption is that in an HA cluster there are 2 other endpoints that client-go can reconnect to when it gets an apiserver-shutdown error, so a faster retry should lead to faster success?
Force-pushed from ecc4c87 to 4e49c60.
// TODO: Should we clean the original response if it exists?
resp = &http.Response{
	StatusCode: http.StatusInternalServerError,
	Header:     http.Header{"Retry-After": []string{"1"}},
1 seems too short
1 is pre-existing, and fires on connection reset errors... 1 second doesn't seem awful for that, but it does maybe seem too short for "apiserver is shutting down".
I would guess that on apiserver shutdown, client-go retries the request and possibly hits a different endpoint that is not shutting down (in an HA environment).
Will this retry infinitely? Connection refused could be non-transient as well. We should at least fail eventually.
if seconds, wait := checkWait(resp); wait && retries < maxRetries {
limits us to maxRetries
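To make the bound concrete, here is a simplified, hypothetical sketch of a retry loop that honors Retry-After but is capped by maxRetries; checkWaitSeconds stands in for the real checkWait helper, and this is not the actual client-go code:

import (
	"net/http"
	"strconv"
	"time"
)

// doWithRetry retries a GET-style request while the server asks us to wait
// via Retry-After, giving up after maxRetries attempts.
func doWithRetry(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	for retries := 0; ; retries++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		// checkWaitSeconds reads Retry-After and reports whether the
		// server asked the client to back off and retry.
		seconds, wait := checkWaitSeconds(resp)
		if !wait || retries >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()
		time.Sleep(time.Duration(seconds) * time.Second)
	}
}

func checkWaitSeconds(resp *http.Response) (int, bool) {
	switch resp.StatusCode {
	case http.StatusInternalServerError, http.StatusTooManyRequests:
		seconds, err := strconv.Atoi(resp.Header.Get("Retry-After"))
		if err == nil && seconds > 0 {
			return seconds, true
		}
	}
	return 0, false
}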
Review with whitespace ignored; this is all pre-existing.
Force-pushed from 4e49c60 to 9fda5a4.
Force-pushed from 9fda5a4 to 9f4cf47.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
huh, can I force this to re-open?
@@ -38,7 +43,13 @@ func WithWaitGroup(handler http.Handler, longRunning apirequest.LongRunningReque

	if !longRunning(req, requestInfo) {
		if err := wg.Add(1); err != nil {
			http.Error(w, "apiserver is shutting down.", http.StatusInternalServerError)
			// When apiserver is shutting down, signal clients to retry
			w.Header().Add("Retry-After", "1")
Add a comment explaining that the expectation is that with a load-balancer, you'll hit a different server, so a tight retry is good for client responsiveness.
comment added
Force-pushed from 35db800 to 377ad81.
Force-pushed from 377ad81 to edc52e3.
// Returns if the given err is "connection reset by peer" error.
func IsConnectionReset(err error) bool {
// Returns true if the given err is "connection refused" error.
func IsConnectionRefused(err error) bool {
@mfojtik in your rebase, these got reordered. Check to see if this change is even needed anymore.
This is weird, but I see both functions (reset and refused) are present in the file and needed; not sure why this was reordered.
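For context, a sketch of how such helpers are commonly implemented by unwrapping wrapped errors down to a syscall errno; the real util/net implementation may unwrap slightly differently, and isErrno is a hypothetical helper here:

import (
	"net"
	"net/url"
	"os"
	"syscall"
)

// isErrno unwraps common url/net/os error wrappers and reports whether
// the underlying error is the given errno.
func isErrno(err error, errno syscall.Errno) bool {
	if urlErr, ok := err.(*url.Error); ok {
		err = urlErr.Err
	}
	if opErr, ok := err.(*net.OpError); ok {
		err = opErr.Err
	}
	if osErr, ok := err.(*os.SyscallError); ok {
		err = osErr.Err
	}
	return err == errno
}

// IsConnectionReset returns true for "connection reset by peer" errors.
func IsConnectionReset(err error) bool {
	return isErrno(err, syscall.ECONNRESET)
}

// IsConnectionRefused returns true for "connection refused" errors.
func IsConnectionRefused(err error) bool {
	return isErrno(err, syscall.ECONNREFUSED)
}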
Force-pushed from 57f8bb7 to 14aea7c.
Force-pushed from 14aea7c to a3c82e8.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, mfojtik. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Before this PR:
After this PR:
This makes kubectl get against an invalid host/port retry many times, and ultimately error with a confusing "InternalError".
opened #87510 to track this
What type of PR is this?
/kind bug
What this PR does / why we need it:
Client-go currently seems to retry only on connection reset in the case of GET requests. There are more transient errors that we could retry on instead of making users implement retry in their code. Two examples I can think of are "connection refused" errors and "apiserver is shutting down"... Both should be transient and I think we can safely retry those.
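To illustrate the caller-side boilerplate this avoids, a hedged sketch of what consumers would otherwise write themselves using wait.ExponentialBackoff and the util/net helpers; getPods is a hypothetical stand-in for any idempotent GET against the apiserver:

import (
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
)

// getWithRetry shows the retry loop callers end up writing when client-go
// does not retry transient errors for them.
func getWithRetry() error {
	backoff := wait.Backoff{Steps: 5, Duration: 100 * time.Millisecond, Factor: 2.0}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		err := getPods() // hypothetical idempotent GET
		if err == nil {
			return true, nil // success, stop retrying
		}
		if utilnet.IsConnectionReset(err) || utilnet.IsConnectionRefused(err) {
			return false, nil // transient, try again after backoff
		}
		return false, err // permanent error, give up immediately
	})
}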