Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make iptables and ipvs modes of kube-proxy MASQUERADE --random-fully if possible #78547

Merged
merged 1 commit into from
Sep 3, 2019

Conversation

MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented May 30, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change

/kind bug

/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
This PR makes kube-proxy in iptables or ipvs mode include --random-fully in the iptables rule it emits to -j MASQUERADE, if the local version of iptables supports that additional flag. This greatly reduces the problems caused by a Linux kernel bug that bites when multiple flows are started at the same time. Due to this kernel bug the flows can get mapped to the same port, and all but one of them will suffer packet drops. This flag makes the port choice fully random, greatly reducing the probability of this port collision and consequent packet drop.

Which issue(s) this PR fixes:

Fixes #76699

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/network Categorizes an issue or PR as relevant to SIG Network. labels May 30, 2019
@k8s-ci-robot k8s-ci-robot requested review from cmluciano and dnardo May 30, 2019 20:43
@MikeSpreitzer
Copy link
Member Author

@kubernetes/sig-network-bugs

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/ipvs and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 31, 2019
@MikeSpreitzer
Copy link
Member Author

/remove-label needs-sig

@MikeSpreitzer
Copy link
Member Author

/woof

@k8s-ci-robot
Copy link
Contributor

@MikeSpreitzer: dog image

In response to this:

/woof

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cmluciano
Copy link

/assign @cmluciano

@@ -308,6 +313,16 @@ func NewProxier(ipt utiliptables.Interface,
klog.Warning("missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended")
}

masqueradeRandomFully := false
iptVersionString, err := ipt.GetVersion()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we log this error on startup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Done.

masqueradeRandomFully := false
iptVersionString, err := ipt.GetVersion()
if err == nil {
iptVersion, err := utilversion.ParseGeneric(iptVersionString)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call is duplicated with the first check on L317.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble understanding the words here; this line invokes a different function than line 317. Can you please elaborate on the problem?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got my lines confused. I meant to comment on L318-328 where we run ParseGeneric a few times in different nested ifs. Can we consolidate these lines and utilize something like version.Compare. It might just be me but I got a little lost walking through all of these nested conditions. I also had to reread the final comparison on L326 a few times before I understood that we are setting that flag to true based on the other conditions.

Perhaps @dcbw or @danwinship will have other thoughts on if it looks ok to them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmluciano : in lines 318--328 there are two different strings parsed with utilversion.ParseGeneric. We coud remove one of those by using version.Compare --- which does a parse internally --- instead of version.AtLeast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship : it looks like #78547 (comment) was intended to be attached here, right?

I am a little surprised that pkg/util/iptables does not cache the version string. Is that deliberate preparation for coping with live kernel patching? Should we continue to eschew caching the version string and things derived from it for that reason?

There is no direct external consumer for the "supports check" bit, so "SupportsRandomFully" (data bit or query function) would be the first of its kind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship : it looks like #78547 (comment) was intended to be attached here, right?

oops. yes

I am a little surprised that pkg/util/iptables does not cache the version string. Is that deliberate preparation for coping with live kernel patching?

Nothing so planned. Originally it was only needed in one place. Later whoever added the second usage must have felt it would be easier to just call the "get version" function again rather than cache it. it's not like it takes that long to run "iptables --version", and you only end up doing it a few times anyway...

There is no direct external consumer for the "supports check" bit, so "SupportsRandomFully" (data bit or query function) would be the first of its kind.

There are external consumers (like pkg/proxy/iptables/proxier.go), they just have to implement it by getting the iptables version and manually comparing it against utiliptables.MinCheckVersion...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored, consolidating the version string parsing and comparison logic.

"-A", string(kubePostroutingChain),
"-m", "comment", "--comment", `"kubernetes service traffic requiring SNAT"`,
"-m", "mark", "--mark", proxier.masqueradeMark,
"-j", "MASQUERADE",
}...)
}
if proxier.masqueradeRandomFully {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an e2e to be sure this is parsed correctly when this feature is enabled

Copy link
Member Author

@MikeSpreitzer MikeSpreitzer Jun 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the existing code, if there is any error in any generated rule then (a) none of the generate rules will take effect (and thus the kube-proxy will be completely ineffective) and (b) an error message will be logged.

Presumably there are already e2e tests that exercise kube-proxy functionality. So those will test the rules that the revised code generates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be the case, could you reference the kube-proxy test that leverages this generated code for the test cases

I'm mainly trying to find unit test that just validates the expected lines added by this flag or an e2e that enables the flag and ensures the baseline iptables functionality is correct. I can imagine that having the e2e for the scale tests would be useful to see if we can replicate the issue without this flag enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code isn't that confusing, though maybe it would be nice to just have pkg/util/iptables/ just check the version once and export a bunch of variables like SupportsCheck and SupportsRandomFully for everyone else to use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the original point, of testing:

  1. I added a unit test that the generated -j MASQUERADE includes --random-fully if the iptables version supports it.

  2. The existing e2e tests surely test that all the generated rules parse, for the version of iptables where the tests run. If any rule does not parse then none are installed, which causes massive damage --- for example, no client will be able to reach the kubernetes Service.

@MikeSpreitzer
Copy link
Member Author

/retest

@MikeSpreitzer MikeSpreitzer force-pushed the fix-76699 branch 3 times, most recently from ad402d1 to 7740a36 Compare June 29, 2019 00:27
@MikeSpreitzer
Copy link
Member Author

/test pull-kubernetes-verify

2 similar comments
@MikeSpreitzer
Copy link
Member Author

/test pull-kubernetes-verify

@MikeSpreitzer
Copy link
Member Author

/test pull-kubernetes-verify

@MikeSpreitzer
Copy link
Member Author

/retest

@MikeSpreitzer MikeSpreitzer changed the title Make iptables mode of kube-proxy MASQUERADE --random-fully if possible Make iptables and ipvs modes of kube-proxy MASQUERADE --random-fully if possible Jul 2, 2019
@MikeSpreitzer
Copy link
Member Author

@danwinship , @cmluciano : I think I have addressed all the comments. Look good?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cblecker, MikeSpreitzer, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MikeSpreitzer
Copy link
Member Author

@cblecker : thanks!

@MikeSpreitzer
Copy link
Member Author

/retest

…ng if possible

Work around Linux kernel bug that sometimes causes multiple flows to
get mapped to the same IP:PORT and consequently some suffer packet
drops.

Also made the same update in kubelet.

Also added cross-pointers between the two bodies of code, in comments.

Some day we should eliminate the duplicate code.  But today is not
that day.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 2, 2019
@MikeSpreitzer
Copy link
Member Author

Dang! I had to push another revision to fix a mis-spelling.

@MikeSpreitzer
Copy link
Member Author

@cblecker

@cblecker
Copy link
Member

cblecker commented Sep 2, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 2, 2019
@MikeSpreitzer
Copy link
Member Author

thanks again!

@lachie83
Copy link
Member

lachie83 commented Sep 2, 2019

👋 1.16 release lead here. I've been asked to consider this PR to be added to the 1.16 milestone. Can you please answer the following? Does this fix address a bug that was introduced by an enhancement in 1.16? If not, are you planning on getting this cherry-picked in the current supported releases? What are the associated risks of merging this?

@MikeSpreitzer
Copy link
Member Author

@lachie83 : the bug that this fixes was reported in April, but has been around longer than that. This is a work-around for a Linux kernel deficiency or bug that has been there forever (as far as I know). Yes, cherry-picking this to supported releases would be good. Regarding risks: this is the usual work-around for the kernel issue, and I am not aware of any reported problems with this work-around.

@MikeSpreitzer
Copy link
Member Author

@lachie83, another thought regarding risks: I am not sure how -j MASQUERADE --random-fully behaves when the kernel is running low on available ports; there could be a risk there.

@dannyk81
Copy link

dannyk81 commented Sep 2, 2019

@lachie83 @MikeSpreitzer I'd also comment that this flag has become standard (for a while now) in recent releases of CNI plugins such as Calico and Flannel that create iptables rules.

Calico: projectcalico/felix#1901
Flannel: flannel-io/flannel#1040

Edit: considering the "illusive" impact of the underlying issue this flag fixes, I would be very pleased to see this in 1.16 as well as 1.15 🙏

@MikeSpreitzer
Copy link
Member Author

Also FYI: Here is an explanation of the problem and how --random-fully works around it: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 . The code cited in reference (3) includes the loop at https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_proto_common.c#L87-L94 , which answers the question of how --random-fully reacts to high utilization of the port range: no worse than the alternatives. The only difference is where the search starts.

@MikeSpreitzer
Copy link
Member Author

That same code structure is in the latest kernel in the 4.X line.
The 5.X line is different.

@lachie83
Copy link
Member

lachie83 commented Sep 3, 2019

Thanks for the updates. Can we get issue #76699 appropriately labelled with the correct priority from sig-network so that I can evaluate the urgency of getting this milestoned in 1.16?

@MikeSpreitzer
Copy link
Member Author

@lachie83 : can you please elaborate on what that means / how to get it done?

@MikeSpreitzer
Copy link
Member Author

BTW, on further investigation: it looks like in the 5.X kernel they have made this randomization the normal behavior; the --random-fully option is still accepted but no longer makes a difference.

@lachie83
Copy link
Member

lachie83 commented Sep 3, 2019

/milestone v1.16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow connect (TCP retransmits) via NodePort