
[Flaky test] gce-cos-master-alpha-features (ci-kubernetes-e2e-gci-gce-alpha-features) #85414

Closed
alenkacz opened this issue Nov 18, 2019 · 4 comments · Fixed by #85527
@alenkacz
Contributor

Which jobs are failing:
gce-cos-master-alpha-features

Which test(s) are failing:
[sig-network] Networking should recreate its iptables rules if they are deleted [Disruptive]

Since when has it been failing:
As far back as testgrid history goes; it's roughly a 10% flake.

Testgrid link:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features

Reason for failure:

I1118 07:39:08.800] STEP: Deleting pod execpod-6v4xj in namespace nettest-1831
I1118 07:39:08.844] STEP: verifying that kubelet rules are eventually recreated
I1118 07:44:10.242] Nov 18 07:44:10.241: FAIL: kubelet did not recreate its iptables rules
I1118 07:44:10.243] Unexpected error:
I1118 07:44:10.244]     <*errors.errorString | 0xc0000d9950>: {
I1118 07:44:10.244]         s: "timed out waiting for the condition",
I1118 07:44:10.244]     }
I1118 07:44:10.244]     timed out waiting for the condition
I1118 07:44:10.244] occurred
I1118 07:44:10.244] STEP: deleting ReplicationController iptables-flush-test in namespace nettest-1831, will wait for the garbage collector to delete the pods
I1118 07:44:10.455] Nov 18 07:44:10.455: INFO: Deleting ReplicationController iptables-flush-test took: 41.88912ms
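(The "timed out waiting for the condition" text is the stock error message from the polling helpers in k8s.io/apimachinery, so the test is looping on a condition check that never succeeds within its timeout, roughly five minutes here, 07:39 to 07:44. A minimal sketch of that pattern, with `rulesRecreated` as a hypothetical stand-in for the test's real condition:)

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// rulesRecreated is a hypothetical stand-in for the test's real condition:
// it would re-read iptables state on the node and report whether the
// kubelet rules have reappeared.
func rulesRecreated() (bool, error) {
	return false, nil // never true: reproduces the timeout seen in the log
}

func main() {
	// Poll every 10s for up to 5 minutes; if the condition never returns
	// true, PollImmediate fails with wait.ErrWaitTimeout, whose message is
	// "timed out waiting for the condition".
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, rulesRecreated)
	fmt.Println(err)
}
```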

Anything else we need to know:
/cc @kubernetes/ci-signal
/sig network
/priority important-soon
/milestone v1.17

@alenkacz alenkacz added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Nov 18, 2019
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Nov 18, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 18, 2019
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 18, 2019
@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Nov 18, 2019
@danwinship
Contributor

/assign

@danwinship
Contributor

FTR, I'm pretty sure that the feature is working fine and it's only the test that is broken. The kubelet logs (eg, here) show that it is noticing the deleted chains and reloading:

I1119 10:56:23.337984    1325 iptables.go:523] iptables canary mangle/KUBE-KUBELET-CANARY deleted
I1119 10:56:23.341665    1325 iptables.go:549] Reloading after iptables flush
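(For context: the canary mechanism installs a throwaway chain, KUBE-KUBELET-CANARY, and periodically checks that it still exists; if it has vanished, iptables is assumed to have been flushed and the reload functions are re-run. A rough standalone sketch of that pattern, shelling out to iptables(8) directly rather than using the real utiliptables interface; `monitor` and `reloadFuncs` are illustrative names, not the actual kubelet code:)

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

const canaryChain = "KUBE-KUBELET-CANARY"

// chainExists asks iptables whether the canary chain is still present in
// the mangle table; `iptables -t mangle -L <chain> -n` exits non-zero if
// the chain does not exist.
func chainExists() bool {
	return exec.Command("iptables", "-t", "mangle", "-L", canaryChain, "-n").Run() == nil
}

// monitor (re)installs the canary, then polls it; once the canary
// disappears, something flushed iptables, so every reload func is re-run.
func monitor(reloadFuncs []func()) {
	for {
		// -N fails harmlessly if the chain already exists.
		exec.Command("iptables", "-t", "mangle", "-N", canaryChain).Run()
		for chainExists() {
			time.Sleep(10 * time.Second)
		}
		log.Printf("iptables canary %s deleted; reloading", canaryChain)
		for _, f := range reloadFuncs {
			f()
		}
	}
}

func main() {
	monitor([]func(){func() { log.Print("re-adding KUBE-MARK-DROP rules") }})
}
```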

And if it failed to recreate any of the rules, it would log errors about it:

	if _, err := kl.iptClient.EnsureRule(utiliptables.Append, utiliptables.TableNAT, KubeMarkDropChain, "-j", "MARK", "--set-xmark", dropMark); err != nil {
		klog.Errorf("Failed to ensure marking rule for %v: %v", KubeMarkDropChain, err)
		return
	}

and we don't see those errors in the logs. So I think there's just a bug in the test case. Trying to reproduce now in a debug PR...

@aojea
Member

aojea commented Nov 22, 2019

@danwinship my apologies, this is the cause of the flakiness
#84422

The test tries to find the KUBE-MARK-DROP rule

if strings.Contains(result.Stdout, "\n-A KUBE-MARK-DROP ") {

but kube-proxy deletes the rule, because we are not recreating it as we do with KUBE-MARK-MASQ (see the sketch of the test's check at the end of this comment)

// NB: THIS MUST MATCH the corresponding code in the kubelet

Since #84422 is only needed for kube-proxy with dual stack in iptables mode, and that PR is not merged, we can revert it.

The other option is making the chains in kubelet and kube-proxy different:
#82125 (comment)
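(For reference, the failing check amounts to scanning iptables-save output for an appended rule in the KUBE-MARK-DROP chain, so anything that deletes that chain between kubelet's reload and the test's read makes the poll time out. A hedged sketch of that style of check; the real test collects the output from the node over SSH, and `hasKubeMarkDropRule` is a hypothetical local equivalent:)

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasKubeMarkDropRule runs iptables-save on the nat table and looks for an
// appended rule in the KUBE-MARK-DROP chain, mirroring the strings.Contains
// check quoted from the e2e test above.
func hasKubeMarkDropRule() (bool, error) {
	out, err := exec.Command("iptables-save", "-t", "nat").Output()
	if err != nil {
		return false, err
	}
	return strings.Contains(string(out), "\n-A KUBE-MARK-DROP "), nil
}

func main() {
	ok, err := hasKubeMarkDropRule()
	fmt.Println(ok, err)
}
```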
