Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix sync loop health check #47124

Merged

Conversation

andyxning
Copy link
Member

@andyxning andyxning commented Jun 7, 2017

This PR will do error logging about the fall behind sync for kubelet instead of sync loop healthz checking.

The reason is kubelet can not do sync loop and therefore can not update sync loop time when there is any runtime error, such as docker hung.

When there is any runtime error, according to current implementation, kubelet will not do sync operation and thus kubelet's sync loop time will not be updated. This will make when there is any runtime error, kubelet will also return non 200 response status code when accessing healthz endpoint. This is contrary with #37865 which prevents kubelet from being killed when docker hangs.

Release note:

fix sync loop health check with seperating runtime errors

/cc @yujuhong @Random-Liu @dchen1107

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 7, 2017
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 7, 2017
@andyxning andyxning force-pushed the remove_sync_loop_health_check branch from 10346b9 to 0009b68 Compare June 7, 2017 14:52
@andyxning
Copy link
Member Author

@k8s-bot pull-kubernetes-e2e-gce-etcd3 test this

@yujuhong yujuhong self-assigned this Jun 9, 2017
@@ -1805,6 +1805,21 @@ func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHand
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}

kl.syncLoopMonitorCheck()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why you moved the check here.

How about moving the kl.syncLoopMonitor operations to this loop instead? This way, it gets updated even if runtime is not functioning.

Copy link
Member Author

@andyxning andyxning Jun 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong

How about moving the kl.syncLoopMonitor operations to this loop instead? This way, it gets updated even if runtime is not functioning.

There are two update to thee syncLoopMonitor. One is at he start of syncLoopIteration, the other is at the end of syncLoopIteration. I am not quite sure why we need the two updation. Maybe the end update of syncLoopMonitor is to indicate the long running of sync operation.
If that is true, we may need to do this:

kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
	break
}
kl.syncLoopMonitor.Store(kl.clock.Now())

I am not sure why you moved the check here.

The intention is to remove sync loop healthz endpoint and instead just logging this situation. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code snippet looks good.

I think we still want the healthz checkpoint in case the goroutine hangs, but we don't have to fail the health check when runtime is not ready like you said :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. Will update the code later.

Copy link
Member Author

@andyxning andyxning Jun 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong Done. I have also update the PR's title and release note. PTAL. :)

@andyxning andyxning force-pushed the remove_sync_loop_health_check branch from 0009b68 to 4b772a9 Compare June 10, 2017 02:05
@andyxning andyxning changed the title remove sync loop health check fix sync loop health check Jun 10, 2017
@andyxning
Copy link
Member Author

@k8s-bot pull-kubernetes-unit test this

@andyxning
Copy link
Member Author

@k8s-bot pull-kubernetes-bazel test this

@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 10, 2017
@andyxning andyxning force-pushed the remove_sync_loop_health_check branch from 4b772a9 to 96cb439 Compare June 10, 2017 03:26
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 10, 2017
@yujuhong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyxning, yujuhong

Associated issue: 37865

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2017
@andyxning
Copy link
Member Author

andyxning commented Jun 13, 2017

@yujuhong Thanks for review. There is still a problem that the PR has not been automatically merged. :)

@yujuhong yujuhong added this to the v1.7 milestone Jun 13, 2017
@yujuhong
Copy link
Contributor

/cc @dchen1107

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 47000, 47188, 47094, 47323, 47124)

k8s-github-robot pushed a commit that referenced this pull request Jun 13, 2017
Automatic merge from submit-queue (batch tested with PRs 47000, 47188, 47094, 47323, 47124)

fix sync loop health check

This PR will do error logging about the fall behind sync for kubelet instead of sync loop healthz checking.

The reason is kubelet can not do sync loop and therefore can not update sync loop time when there is any runtime error, such as docker hung. 

When there is any runtime error, according to current implementation, kubelet will not do sync operation and thus kubelet's sync loop time will not be updated. This will make when there is any runtime error, kubelet will also return non 200 response status code when accessing healthz endpoint. This is contrary with #37865 which prevents kubelet from being killed when docker hangs.

**Release note**:
```release-note
fix sync loop health check with seperating runtime errors
```

/cc @yujuhong @Random-Liu @dchen1107
@k8s-github-robot k8s-github-robot merged commit 17244ea into kubernetes:master Jun 13, 2017
@andyxning andyxning deleted the remove_sync_loop_health_check branch June 13, 2017 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants