Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube-proxy: filter INPUT as well as OUTPUT #43972

Merged

Conversation

thockin
Copy link
Member

@thockin thockin commented Apr 3, 2017

We need to apply filter rules on the way in (nodeports) and out (cluster
IPs). Testing here is insufficient to have caught this - will come back
for that.

Fixes #43969

@justinsb since you have the best repro, can you test? It passes what I think is repro.

@ethernetdan we will want this in 1.6.x

Fix bug with service nodeports that have no backends not being rejected, when they should be. This is not a regression vs v1.5 - it's a fix that didn't quite fix hard enough.

kube-proxy handling of services with no endpoints now applies to both INPUT and OUTPUT chains, preventing socket leak.

@thockin thockin added this to the v1.6 milestone Apr 3, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 3, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 3, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 3, 2017
@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 3, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@ncdc
Copy link
Member

ncdc commented Apr 3, 2017

cc @eparis @dcbw @danwinship @sttts

@thockin
Copy link
Member Author

thockin commented Apr 4, 2017

@k8s-bot gce etcd3 e2e test this

@thockin
Copy link
Member Author

thockin commented Apr 4, 2017

@k8s-bot cvm gce e2e test this

@justinsb
Copy link
Member

justinsb commented Apr 4, 2017

The bad news:

> iptables -t filter --list-rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION
-N KUBE-FIREWALL
-N KUBE-SERVICES
-A INPUT -j KUBE-FIREWALL
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-SERVICES -p tcp -m comment --comment "default/guestbook: has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 30849 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 100.70.10.247/32 -p tcp -m comment --comment "default/guestbook: has no endpoints" -m tcp --dport 3000 -j REJECT --reject-with icmp-port-unreachable

The good news: if I then do iptables -A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES the count stops going up.

I think you also want it here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L793

We need to apply filter rules on the way in (nodeports) and out (cluster
IPs).  Testing here is insufficient to have caught this - will come back
for that.
@thockin thockin force-pushed the fix-43969-proxy-filter-input branch from b7894cd to 9a423b6 Compare April 4, 2017 03:50
@thockin
Copy link
Member Author

thockin commented Apr 4, 2017

urg, I thought I hit both places.

@thockin
Copy link
Member Author

thockin commented Apr 4, 2017

re-pushed

@thockin
Copy link
Member Author

thockin commented Apr 6, 2017

ping @justinsb

@justinsb
Copy link
Member

justinsb commented Apr 6, 2017

It works for me - thanks for fixing :-)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 6, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb, thockin

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit bc8a755 into kubernetes:master Apr 6, 2017
k8s-github-robot pushed a commit that referenced this pull request May 26, 2017
…72-upstream-release-1.6

Automatic merge from submit-queue

Automated cherry pick of #43972

Cherry pick of #43972 on release-1.6.

#43972: kube-proxy: filter INPUT as well as OUTPUT

```release-note
kube-proxy handling of services with no endpoints now applies to both INPUT and OUTPUT chains, preventing socket leak
```
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.6" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

k8s-github-robot pushed a commit that referenced this pull request Feb 23, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Split KUBE-SERVICES chain to re-shrink the INPUT chain

**What this PR does / why we need it**:
#43972 added an iptables rule "`-A INPUT -j KUBE-SERVICES`" to make NodePort ICMP rejection work. (Previously the KUBE-SERVICES chain was only run from OUTPUT, not INPUT.) #44547 extended that patch for ExternalIP rejection as well.

However, the KUBE-SERVICES chain may potentially have a very large number of ICMP reject rules for plain ClusterIP services (the ones that get run from OUTPUT), and it seems that for some reason the kernel is much more sensitive to the length of the INPUT chain than it is to the length of the OUTPUT chain. So a node that worked fine with kube 1.6 (when KUBE-SERVICES was only run from OUTPUT) might fall over with kube 1.7 (with KUBE-SERVICES being run from both INPUT and OUTPUT).

(Specifically, a node with about 5000 ClusterIP reject rules that ran fine with OpenShift 3.6 [kube 1.6] slowed almost to a complete halt with OpenShift 3.7 [kube 1.7].)

This PR fixes things by splitting out the "new" part of KUBE-SERVICES (NodePort and ExternalIP reject rules) into a separate KUBE-EXTERNAL-SERVICES chain run from INPUT, and moves KUBE-SERVICES back to being only run from OUTPUT. (So, yes, this assumes that you don't have 5000 NodePort/ExternalIP services, but, if you do, there's not much we can do, since those rules *have* to be run on the INPUT side.)

Oh, and I left in the code to clean up the "`-A INPUT -j KUBE-SERVICES`" rule even though we don't generate it any more, so it gets fixed on upgrade.

**Release note**:
```release-note
Reorganized iptables rules to fix a performance regression on clusters with thousands of services.
```

@kubernetes/sig-network-bugs @kubernetes/rh-networking
spiliopoulos added a commit to Asana/kubernetes that referenced this pull request Feb 19, 2019
Task: https://app.asana.com/0/181015504277729/1108553820840840/f
Original Issue: kubernetes#31435
Backported fix: kubernetes#43972

kube-proxy: filter INPUT as well as OUTPUT

We need to apply filter rules on the way in (nodeports) and out (cluster
IPs).  Testing here is insufficient to have caught this - will come back
for that.

(cherry picked from commit 9a423b6)

Original Author: Tim Hockin <[email protected]>
@thockin thockin deleted the fix-43969-proxy-filter-input branch August 14, 2019 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePort rejection should be on input chain
7 participants