-
Notifications
You must be signed in to change notification settings - Fork 42.1k
fix get azure accounts timeout issue when there is no out-bound IP #74191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix get azure accounts timeout issue when there is no out-bound IP #74191
Conversation
|
Question, does the call to |
yes, and it's for unmanaged disk which is corner scenario now, while for compatibility, I still keep it in the code. So is setting timeout as 1 min ok in your scenario? @djsly |
105d50d to
5d4dacd
Compare
|
/hold |
|
@andyzhangx if you are saying that there is no choice but to query the list of StorageAccount at initiation time, then I guess it's good. |
|
I was looking at the code, and I see that the Couldnt the call to Food for though... |
5d4dacd to
65350c7
Compare
|
@djsly thanks for your code analysis, you are right: I have changed the code according to your request. PTAL, thanks. |
andyzhangx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for @djsly 's suggestion, this is an old issue which should be fixed in earlier version. I will cherry pick to earlier version when this PR get merged.
/hold cancel
|
@andyzhangx , perfect! thanks for the cherry picks. |
65350c7 to
1566024
Compare
|
hello, I tested 1.13.3 + patch on a VM that doesn't have an outbound IP. We have a Standard ILB in k8s with a single frontend from the vnet's subnet. This time, there isn't any timeouts I still have a problem where all my nodes are showing up as |
|
ok the new call now that timesout after 10mins is: |
Problem here is that the source of the call is in k8s, not Can't we use instanceMetadata for this ? I'm not sure yet, why the full list of VmSizes in a SubID is required. |
|
so I now understand better how this works, Since the instance metadata doesn't provide the I would try your timeout logic you initially wrote but I can't find the commit anymore, must have been squashed. Do we know if the instance metadata team could augment the metadata with this information ? |
|
Sure this is good to go. I guess we will need to open another one for the issue with the
I will leave that to your discretion to see if you prefer to manage two PRs with multiple cherry picks. Or if it would be best tackle all those issues within the same PR, since in the end all this work is related to Kubelet not starting in a timely fashion when there isn't any outbound IP |
|
OK I tested the context.WithTimeout() on the VirtualMachineSizesClient API call and this seem to be the last long blocking call . You can see bellow that the node isn't registered for 15sec, until the first API call timesout. |
|
also, before my patch, we can see that the calls were failing every 10mins (I'm guessing kubelet will keep retrying) and after I patched, on existing Nodes, the error messages is every ~15sec |
|
@djsly Thanks for validating this. @andyzhangx Could you fix the issue within same PR? They are actually same issue, but just different APIs. And we need to cherry pick the changes to stable releases. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyzhangx The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep c.accouts as nil here? or else, it won't try again on errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@feiskyer yes, won't try again on error, it would make new map and create new storage account(don't use existing accounts)
pkg/volume/azure_dd/azure_dd.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add same timeout to getAllStorageAccounts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@feiskyer fixed.
pkg/volume/azure_dd/azure_dd.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: define a consts for timeout, e.g. listAPITimeout = 60*time.Second
|
/milestone v1.14 |
add timeout for getAllStorageAccounts
28a9aa8 to
8cd09bb
Compare
feiskyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
Thanks guys! |
…4191-upstream-release-1.13 Automated cherry pick of #74191: remove get azure accounts in the init process set timeout
…4191-upstream-release-1.11 Automated cherry pick of #74191: remove get azure accounts in the init process set timeout
…4191-upstream-release-1.12 Automated cherry pick of #74191: remove get azure accounts in the init process set timeout
What type of PR is this?
/kind bug
What this PR does / why we need it:
make
getAllStorageAccountsfunc happen only when there is create unmanaged disk request, for managed disk cluster, it's not necessary to makegetAllStorageAccountsfunc run.Which issue(s) this PR fixes:
Fixes #74190
Original issue: kubernetes-sigs/cloud-provider-azure#86
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/kind bug
/assign @feiskyer
/priority important-soon
/sig azure
cc @djsly