Fix 'Schedulercache is corrupted' error #55262

liggitt · 2017-11-07T19:30:01Z

If an Assume()ed pod is Add()ed with a different nodeName, the podStates view of the pod is not corrected to reflect the actual nodeName. On the next Update(), the scheduler observes the mismatch and the process exits.
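To make the failure mode concrete, here is a minimal sketch of the sequence described above. It is a toy model, not the real schedulercache: the `pod` and `toyCache` types, the `syncOnAdd` switch, and `replay` are hypothetical names introduced only for illustration, and the mismatch is returned as an error rather than exiting the process.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for the real pod object; only the
// fields needed for the illustration are modeled.
type pod struct {
	key      string
	nodeName string
}

// toyCache is a simplified model, not the real scheduler cache.
type toyCache struct {
	podStates   map[string]*pod // the cache's view of each pod
	assumedPods map[string]bool // pods assumed but not yet confirmed by an add
	syncOnAdd   bool            // true models the corrected behavior
}

var errCorrupted = errors.New("pod updated on a different node than previously added to")

func newToyCache(syncOnAdd bool) *toyCache {
	return &toyCache{
		podStates:   map[string]*pod{},
		assumedPods: map[string]bool{},
		syncOnAdd:   syncOnAdd,
	}
}

// Assume records the node the scheduler expects the pod to land on.
func (c *toyCache) Assume(p pod) {
	c.podStates[p.key] = &p
	c.assumedPods[p.key] = true
}

// Add confirms an assumed pod. Without syncOnAdd, the stored view keeps
// the assumed nodeName even if the pod was actually added elsewhere.
func (c *toyCache) Add(p pod) {
	if curr, ok := c.podStates[p.key]; ok && c.assumedPods[p.key] {
		delete(c.assumedPods, p.key)
		if c.syncOnAdd {
			// Replace the assumed view with the pod that was actually added.
			*curr = p
		}
		return
	}
	c.podStates[p.key] = &p
}

// Update reports a mismatch between the stored view and the incoming pod.
func (c *toyCache) Update(oldPod, newPod pod) error {
	curr, ok := c.podStates[oldPod.key]
	if !ok {
		return fmt.Errorf("pod %s is not in the cache", oldPod.key)
	}
	if curr.nodeName != newPod.nodeName {
		return errCorrupted
	}
	*curr = newPod
	return nil
}

// replay assumes a pod on node-a, adds it on node-b, then updates it.
func replay(syncOnAdd bool) error {
	c := newToyCache(syncOnAdd)
	c.Assume(pod{key: "ns/p", nodeName: "node-a"})
	added := pod{key: "ns/p", nodeName: "node-b"}
	c.Add(added)
	return c.Update(added, added)
}

func main() {
	fmt.Println("without sync:", replay(false))
	fmt.Println("with sync:", replay(true))
}
```

In this model, `replay(false)` returns `errCorrupted` because the stored view still says node-a, while `replay(true)` returns nil because the view was refreshed when the pod was added.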

Fixed 'Schedulercache is corrupted' error in kube-scheduler

liggitt · 2017-11-07T20:12:01Z

cc @kubernetes/sig-scheduling-bugs @kubernetes/sig-scheduling-pr-reviews

liggitt · 2017-11-07T20:12:31Z

/retest

timothysc · 2017-11-07T20:17:38Z

/assign @wojtek-t

timothysc

/lgtm - minor comment.

timothysc · 2017-11-07T20:35:49Z

plugin/pkg/scheduler/schedulercache/cache_test.go

+				t.Fatalf("AddPod failed: %v", err)
+			}
+		}
+		cache.cleanupAssumedPods(now.Add(2 * ttl))


Why do you up the ttl?

this is simulating passing of time to ensure the added pod doesn't expire. was copy/paste from previous testcase, though, and not really relevant to this one, so I can remove it
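For readers unfamiliar with the expiry mechanism being simulated, a rough sketch follows. It assumes a simplified model in which an assumed pod gets a deadline of now plus ttl and a confirmed (added) pod has no deadline; `expiringCache`, `entry`, and `simulateExpiry` are hypothetical names, and the real cache may set deadlines at a different point.

```go
package main

import (
	"fmt"
	"time"
)

// entry holds the expiry deadline of a cached pod. A nil deadline means
// the pod was confirmed by an add and never expires in this model.
type entry struct {
	deadline *time.Time
}

// expiringCache is a hypothetical, simplified cache keyed by pod name.
type expiringCache struct {
	ttl  time.Duration
	pods map[string]*entry
}

// assume stores the pod with a deadline of now + ttl.
func (c *expiringCache) assume(key string, now time.Time) {
	d := now.Add(c.ttl)
	c.pods[key] = &entry{deadline: &d}
}

// add confirms the pod and clears its deadline.
func (c *expiringCache) add(key string) {
	if e, ok := c.pods[key]; ok {
		e.deadline = nil
		return
	}
	c.pods[key] = &entry{}
}

// cleanup drops every pod whose deadline is before now.
func (c *expiringCache) cleanup(now time.Time) {
	for key, e := range c.pods {
		if e.deadline != nil && now.After(*e.deadline) {
			delete(c.pods, key)
		}
	}
}

// simulateExpiry assumes two pods, confirms one, then advances the
// clock by 2*ttl and reports which pods are still cached.
func simulateExpiry() (confirmedKept, unconfirmedKept bool) {
	now := time.Now()
	c := &expiringCache{ttl: 10 * time.Second, pods: map[string]*entry{}}
	c.assume("confirmed", now)
	c.assume("unconfirmed", now)
	c.add("confirmed")
	c.cleanup(now.Add(2 * c.ttl))
	_, confirmedKept = c.pods["confirmed"]
	_, unconfirmedKept = c.pods["unconfirmed"]
	return confirmedKept, unconfirmedKept
}

func main() {
	confirmed, unconfirmed := simulateExpiry()
	fmt.Println("confirmed pod kept:", confirmed)
	fmt.Println("unconfirmed pod kept:", unconfirmed)
}
```

Passing a time of now plus twice the ttl to the cleanup is a way to simulate elapsed time without sleeping in the test.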

k8s-cherrypick-bot · 2017-11-07T20:36:21Z

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

liggitt · 2017-11-07T20:41:39Z

updated to remove unrelated ttl bit from unit test

cblecker · 2017-11-07T21:09:17Z

CRD Flake
/test pull-kubernetes-unit

bsalamat

Your change looks good to me. I just wonder whether Scheduler resource accounting remains valid in a scenario where a pod is assumed to a different node than the one it is bound to.

bsalamat · 2017-11-07T20:32:26Z

plugin/pkg/scheduler/schedulercache/cache_test.go

+		}
+		for _, podToUpdate := range tt.podsToUpdate {
+			if err := cache.UpdatePod(podToUpdate[0], podToUpdate[1]); err != nil {
+				t.Fatalf("AddPod failed: %v", err)


s/AddPod/UpdatePod/

liggitt · 2017-11-07T21:18:50Z

fixed gofmt error and test message, re-tagging

liggitt · 2017-11-07T21:29:11Z

I just wonder whether Scheduler resource accounting remains valid in a scenario where a pod is assumed to a different node than the one it is bound to.

It was just the podStates' version of the pod that wasn't getting updated. The removePod(currState.pod)/addPod(pod) dance corrects the per-node accounting already (and the unit test I added demonstrates the per-node accounting is corrected):

kubernetes/plugin/pkg/scheduler/schedulercache/cache.go

Lines 234 to 241 in 454074d

    
    case ok && cache.assumedPods[key]:
        if currState.pod.Spec.NodeName != pod.Spec.NodeName {
            // The pod was added to a different node than it was assumed to.
            glog.Warningf("Pod %v assumed to a different node than added to.", key)
            // Clean this up.
            cache.removePod(currState.pod)
            cache.addPod(pod)
        }
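The effect of that remove/add pair on per-node accounting can be sketched in isolation. This is an illustrative model only: `podInfo`, `nodeAccounting`, `reconcile`, and `movedTotals` are hypothetical names, and a single requested-CPU counter per node stands in for whatever the real cache tracks.

```go
package main

import "fmt"

// podInfo is a hypothetical, reduced pod: a key, a node, and a CPU request.
type podInfo struct {
	key      string
	nodeName string
	milliCPU int64
}

// nodeAccounting tracks the requested CPU per node name.
type nodeAccounting struct {
	requested map[string]int64
}

// addPod charges the pod's request to its node.
func (a *nodeAccounting) addPod(p podInfo) {
	a.requested[p.nodeName] += p.milliCPU
}

// removePod releases the pod's request from its node and drops empty nodes.
func (a *nodeAccounting) removePod(p podInfo) {
	a.requested[p.nodeName] -= p.milliCPU
	if a.requested[p.nodeName] == 0 {
		delete(a.requested, p.nodeName)
	}
}

// reconcile moves the request when the pod was added to a node other
// than the one it was assumed on.
func reconcile(a *nodeAccounting, assumed, added podInfo) {
	if assumed.nodeName != added.nodeName {
		a.removePod(assumed)
		a.addPod(added)
	}
}

// movedTotals assumes a 500m pod on node-a, adds it on node-b, and
// returns the requested CPU on each node after reconciliation.
func movedTotals() (nodeA, nodeB int64) {
	a := &nodeAccounting{requested: map[string]int64{}}
	assumed := podInfo{key: "ns/p", nodeName: "node-a", milliCPU: 500}
	added := podInfo{key: "ns/p", nodeName: "node-b", milliCPU: 500}
	a.addPod(assumed)
	reconcile(a, assumed, added)
	return a.requested["node-a"], a.requested["node-b"]
}

func main() {
	nodeA, nodeB := movedTotals()
	fmt.Println("node-a requested:", nodeA)
	fmt.Println("node-b requested:", nodeB)
}
```

The request ends up charged only to the node the pod was actually added to, which matches the point above that the per-node totals were already correct and only the stored pod view was stale.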

cblecker · 2017-11-07T22:08:18Z

@liggitt looks like there's still a gofmt issue on ./plugin/pkg/scheduler/schedulercache/cache_test.go

bsalamat · 2017-11-07T22:09:59Z

Thanks, @liggitt!

/lgtm

liggitt · 2017-11-07T23:53:06Z

actually fixed gofmt error

cblecker · 2017-11-07T23:54:57Z

reapplying LGTM
/lgtm

k8s-github-robot · 2017-11-07T23:55:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, cblecker, liggitt, timothysc

Associated issue: 50916

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~plugin/pkg/scheduler/OWNERS~~ [bsalamat,timothysc]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

cblecker · 2017-11-07T23:55:34Z

uhhh..

W1107 23:53:54.719] Run: ('bash', '-c', 'cd kubernetes && ./hack/jenkins/test-dockerized.sh')
W1107 23:53:54.722] bash: line 0: cd: kubernetes: No such file or directory

That's an odd unit test failure. Let's retry.
/test pull-kubernetes-unit

cblecker · 2017-11-08T00:09:17Z

unit test failing: #55276

cblecker · 2017-11-08T00:11:03Z

fingers crossed
/test pull-kubernetes-unit

liggitt · 2017-11-08T01:54:46Z

TestCRD flake
/test pull-kubernetes-unit

k8s-github-robot · 2017-11-08T06:36:02Z

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2017-11-08T07:23:32Z

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

…2-upstream-release-1.7 Automatic merge from submit-queue. Automated cherry pick of #55262 Cherry pick of #55262 on release-1.7. #55262: Fix 'Schedulercache is corrupted' error

k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2017

k8s-github-robot assigned bsalamat and k82cn Nov 7, 2017

liggitt mentioned this pull request Nov 7, 2017

Scheduler dies with "Schedulercache is corrupted" #50916

Closed

k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/bug Categorizes issue or PR as related to a bug. labels Nov 7, 2017

k8s-ci-robot assigned wojtek-t Nov 7, 2017

timothysc modified the milestone: v1.9 Nov 7, 2017

timothysc added the cherrypick-candidate label Nov 7, 2017

timothysc added this to the v1.8 milestone Nov 7, 2017

timothysc approved these changes Nov 7, 2017

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2017

k8s-cherrypick-bot removed the cherrypick-candidate label Nov 7, 2017

timothysc added the cherrypick-candidate label Nov 7, 2017

liggitt force-pushed the schedulercache branch from e747e16 to a9b08fa Compare November 7, 2017 20:41

This was referenced Nov 7, 2017

Automated cherry pick of #55262 #55266

Merged

Automated cherry pick of #55262 #55267

Merged

jpbetz added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 7, 2017

bsalamat reviewed Nov 7, 2017

liggitt force-pushed the schedulercache branch from a9b08fa to 5048838 Compare November 7, 2017 21:18

k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2017

liggitt added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2017

Fix 'Schedulercache is corrupted' error

a366e6c

liggitt force-pushed the schedulercache branch from 5048838 to a366e6c Compare November 7, 2017 23:49

k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2017

k8s-ci-robot assigned cblecker Nov 7, 2017

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2017

jpbetz added a commit that referenced this pull request Nov 8, 2017

Merge pull request #55266 from liggitt/automated-cherry-pick-of-#5526…

861dab5

…2-upstream-release-1.8 Automated cherry pick of #55262

k8s-github-robot merged commit 33f873d into kubernetes:master Nov 8, 2017

liggitt deleted the schedulercache branch November 9, 2017 13:14
