
misc iptables proxy fixes #106030

Merged

Conversation

danwinship
Contributor

@danwinship commented Oct 31, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes four iptables proxier bugs:

  • The sync_proxy_rules_iptables_total metric was off by one.
  • It was generating affinity rules for all endpoints rather than only Ready endpoints, meaning clients with affinity might have new connections routed to a no-longer-serving endpoint.
  • It was generating endpoint chains for all endpoints (whether or not they actually got used), resulting in rule bloat.
  • It only output -j REJECT rules for services with no endpoints at all, rather than outputting them for services with no usable endpoints (see the sketch below).
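
For illustration, here is a minimal, self-contained sketch of the "usable endpoint" distinction behind the last two bullets. This is not this PR's actual code; the type and helper names are invented, loosely following the proxier logic quoted later in the thread.

package main

import "fmt"

// endpointState mirrors the EndpointSlice conditions the proxier looks at;
// it is a stand-in for this sketch, not a kube-proxy type.
type endpointState struct {
	ready, serving, terminating bool
}

// hasUsableEndpoints reports whether any endpoint can actually receive traffic:
// Ready endpoints always count; Serving+Terminating endpoints only count when
// the ProxyTerminatingEndpoints gate is on and the service uses
// externalTrafficPolicy: Local.
func hasUsableEndpoints(eps []endpointState, nodeLocalExternal, terminatingGate bool) bool {
	for _, ep := range eps {
		if ep.ready {
			return true
		}
		if terminatingGate && nodeLocalExternal && ep.serving && ep.terminating {
			return true
		}
	}
	return false
}

func main() {
	// A service whose only endpoint is not yet Ready: no usable endpoints,
	// so the proxier should emit a -j REJECT rule rather than nothing.
	eps := []endpointState{{ready: false, serving: false, terminating: false}}
	fmt.Println("emit REJECT rule:", !hasUsableEndpoints(eps, false, false)) // true
}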

There are some other known bugs which this PR does not fix.

Which issue(s) this PR fixes:

Fixes #106182

Does this PR introduce a user-facing change?

The kube-proxy sync_proxy_rules_iptables_total metric now gives
the correct number of rules, rather than being off by one.

Fixed multiple iptables proxy regressions introduced in 1.22:

  - When using Services with SessionAffinity, client affinity for an
    endpoint now gets broken when that endpoint becomes non-ready
    (rather than continuing until the endpoint is fully deleted).

  - Traffic to a service IP now starts getting rejected (as opposed to
    merely dropped) as soon as there are no longer any *usable*
    endpoints, rather than waiting until all of the terminating
    endpoints have terminated even when those terminating endpoints
    were not being used.

  - Chains for endpoints that won't be used are no longer output to
    iptables, saving a bit of memory/time/cpu.

/sig network
/priority important-soon
/cc @andrewsykim @robscott

@k8s-ci-robot added size/L, kind/bug, release-note, sig/network, priority/important-soon, cncf-cla: yes, and needs-triage labels Oct 31, 2021
@k8s-ci-robot
Contributor

@danwinship: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Oct 31, 2021
@aojea
Member

aojea commented Oct 31, 2021

It's not only session affinity, it's everything. Right now kube-proxy doesn't consider whether the endpoint is ready; as a consequence there are more issues like this:

#105657

Easy to reproduce with an init container that keeps the pod from becoming ready; the iptables rules get installed anyway:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
        active: "yes"
    spec:
      initContainers:
        - name: pause-service
          image: busybox:stable
          command: ['sh', '-c', 'echo Pausing start. && sleep 100']
      containers:
      - name: test
        image: httpd
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: test
spec:
  type: ClusterIP
  selector:
    app: test
    active: "yes"
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

@aojea
Member

aojea commented Oct 31, 2021

We also have the tests wrong:

-A KUBE-SEP-VLJB2F747S6W7EX4 -m comment --comment ns1/svc1 -s 10.0.1.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-VLJB2F747S6W7EX4 -m comment --comment ns1/svc1 -m tcp -p tcp -j DNAT --to-destination 10.0.1.4:80

Addresses: []string{"10.0.1.4"},
Conditions: discovery.EndpointConditions{Ready: utilpointer.BoolPtr(false)},
NodeName: utilpointer.StringPtr("node4"),
}},

@aojea
Member

aojea commented Oct 31, 2021

Hmm, it seems I was wrong about that.
The rules are there, but the service is not jumping to them.

@aojea
Member

aojea commented Oct 31, 2021

I think we need something like:

diff --git a/pkg/proxy/iptables/proxier.go b/pkg/proxy/iptables/proxier.go
index 79bb5b6c793..0b13ce46dd0 100644
--- a/pkg/proxy/iptables/proxier.go
+++ b/pkg/proxy/iptables/proxier.go
@@ -1002,7 +1002,19 @@ func (proxier *Proxier) syncProxyRules() {
                // Service does not have conflicting configuration such as
                // externalTrafficPolicy=Local.
                allEndpoints = proxy.FilterEndpoints(allEndpoints, svcInfo, proxier.nodeLabels)
-               hasEndpoints := len(allEndpoints) > 0
+               hasEndpoints := false
+               for _, e := range allEndpoints {
+                       if e.IsReady() {
+                               hasEndpoints = true
+                               break
+                       }
+                       if utilfeature.DefaultFeatureGate.Enabled(features.ProxyTerminatingEndpoints) {
+                               if svc.NodeLocalExternal() && e.IsServing() && e.IsTerminating() {
+                                       hasEndpoints = true
+                                       break
+                               }
+                       }
+               }
 
                svcChain := svcInfo.servicePortChainName
                if hasEndpoints {
diff --git a/pkg/proxy/iptables/proxier_test.go b/pkg/proxy/iptables/proxier_test.go
index 85f40337ddf..81b142f1f25 100644
--- a/pkg/proxy/iptables/proxier_test.go
+++ b/pkg/proxy/iptables/proxier_test.go
@@ -2938,6 +2938,77 @@ COMMIT
        assert.NotEqual(t, expectedIPTablesWithSlice, fp.iptablesData.String())
 }
 
+func TestEndpointSliceE2EReject(t *testing.T) {
+       expectedIPTablesWithSlice := `*filter
+:KUBE-SERVICES - [0:0]
+:KUBE-EXTERNAL-SERVICES - [0:0]
+:KUBE-FORWARD - [0:0]
+:KUBE-NODEPORTS - [0:0]
+-A KUBE-SERVICES -m comment --comment "ns1/svc1 has no endpoints" -m tcp -p tcp -d 172.20.1.1/32 --dport 0 -j REJECT
+-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
+COMMIT
+*nat
+:KUBE-SERVICES - [0:0]
+:KUBE-NODEPORTS - [0:0]
+:KUBE-POSTROUTING - [0:0]
+:KUBE-MARK-MASQ - [0:0]
+-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
+-A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
+-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
+-A KUBE-MARK-MASQ -j MARK --or-mark 0x4000
+-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
+COMMIT
+`
+
+       ipt := iptablestest.NewFake()
+       fp := NewFakeProxier(ipt)
+       // EndpointSlice metadata and ports omitted in this sketch
+       endpointSlice := &discovery.EndpointSlice{
+               AddressType: discovery.AddressTypeIPv4,
+               Endpoints: []discovery.Endpoint{{
+                       Addresses:  []string{"10.0.1.1"},
+                       Conditions: discovery.EndpointConditions{Ready: utilpointer.BoolPtr(false)},
+                       NodeName:   utilpointer.StringPtr(testHostname),
+               }, {
+                       Addresses:  []string{"10.0.1.4"},
+                       Conditions: discovery.EndpointConditions{Ready: utilpointer.BoolPtr(false)},
+                       NodeName:   utilpointer.StringPtr("node4"),
+               }},
+       }
+
+       fp.OnEndpointSliceAdd(endpointSlice)
+       fp.syncProxyRules()
+       assert.Equal(t, expectedIPTablesWithSlice, fp.iptablesData.String())
+}
+

@aojea
Member

aojea commented Oct 31, 2021

/assign @aojea @thockin @robscott

@aojea
Member

aojea commented Oct 31, 2021

And this to fix #105657:

diff --git a/pkg/proxy/endpoints.go b/pkg/proxy/endpoints.go
index 945237dd1ce..2a01b95d746 100644
--- a/pkg/proxy/endpoints.go
+++ b/pkg/proxy/endpoints.go
@@ -121,7 +121,13 @@ func (info *BaseEndpointInfo) Port() (int, error) {
 
 // Equal is part of proxy.Endpoint interface.
 func (info *BaseEndpointInfo) Equal(other Endpoint) bool {
-       return info.String() == other.String() && info.GetIsLocal() == other.GetIsLocal()
+       return info.String() == other.String() &&
+               info.GetIsLocal() == other.GetIsLocal() &&
+               info.GetNodeName() == other.GetNodeName() &&
+               info.GetZone() == other.GetZone() &&
+               info.IsReady() == other.IsReady() &&
+               info.IsTerminating() == other.IsTerminating() &&
+               info.IsServing() == other.IsServing()
 }
 
 // GetNodeName returns the NodeName for this endpoint.
@@ -561,8 +567,20 @@ func detectStaleConnections(oldEndpointsMap, newEndpointsMap EndpointsMap, stale
                        continue
                }
 
+               epReady := 0
+               for _, ep := range epList {
+                       if ep.IsReady() {
+                               epReady++
+                       }
+               }
+               oldEpReady := 0
+               for _, ep := range oldEndpointsMap[svcPortName] {
+                       if ep.IsReady() {
+                               oldEpReady++
+                       }
+               }
                // For udp service, if its backend changes from 0 to non-0. There may exist a conntrack entry that could blackhole traffic to the service.
-               if len(epList) > 0 && len(oldEndpointsMap[svcPortName]) == 0 {
+               if epReady > 0 && oldEpReady == 0 {
                        *staleServiceNames = append(*staleServiceNames, svcPortName)
                }
        }

Ideally we should rewrite k8s.io/kubernetes/pkg/proxy/endpoints_test.go to test EndpointSlices instead of Endpoints and cover these new cases about Ready.

Dan, do you want to split the load?

@jayunit100
Member

cc @knabben

@danwinship
Contributor Author

It's not only session affinity, it's everything

it's not everything...

Hmm, it seems I was wrong about that.
The rules are there, but the service is not jumping to them.

Yes, because it writes out the chains before it figures out whether it's going to need to use the terminating endpoints. I have a fix for that coming later.

+                       if ep.IsReady() {
+                               epReady++

That's still not right, because it's not taking ProxyTerminatingEndpoints into account.

In fact, endpoints.go can't figure out on its own which endpoints are actually being used, because it doesn't know whether the service is externalTrafficPolicy: Local or not...

Dan, do you want to split the load?

I'm in the middle of refactoring the iptables syncProxyRules to fix internalTrafficPolicy (#100313). I need to think about how to rearrange stuff to get some of the fixes in without refactoring first so it will be backportable.

@danwinship
Contributor Author

/hold
may not make sense to merge this if other closely-related fixes are coming imminently...

@k8s-ci-robot added the do-not-merge/hold label Nov 1, 2021
@danwinship
Contributor Author

danwinship commented Nov 1, 2021

In fact, endpoints.go can't figure out on its own which endpoints are actually being used, because it doesn't know whether the service is externalTrafficPolicy: Local or not...

And TopologyAwareHints screws this up as well. To compute the list of stale endpoints, you need information from both the EndpointSlices and the Services.

Except it's even worse than that, because we can now have different stale endpoints for internal and external connections:

  • UDP Service X has externalTrafficPolicy: Local and (implicitly) internalTrafficPolicy: Cluster. It has both internal and external clients. ProxyTerminatingEndpoints is enabled.
  • The sole endpoint for the service on Node Y gets deleted and its EndpointSlice becomes Ready: false, Serving: true, Terminating: true. It thus stops being a valid endpoint for internal (Cluster policy) traffic from Node Y, but does not stop being a valid endpoint for external (Local policy) traffic that reaches Node Y.
  • At this point Node Y needs to delete conntrack entries from internal sources to the cluster IP or LB IP, but needs to not delete conntrack entries from external sources to the LB IP.

But we can't implement "invalidate conntrack entries from internal sources, but not conntrack entries from external sources", because the definition of "internal" and "external" is abstracted by the LocalDetector, which might distinguish internal from external using arbitrary iptables rules that have no equivalent in conntrack -D. (eg, matching by source interface name).

I think this is only a problem when ProxyTerminatingEndpoints is enabled; if it's disabled then the set of endpoints for a Local-traffic-policy service will always be a subset of the endpoints for a Cluster-traffic-policy service, so we can do the cleanup correctly. (We don't do the cleanup correctly currently, but we could, as opposed to in the ProxyTerminatingEndpoints case, where there is no easy fix at all.)
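
A minimal, self-contained sketch of that internal/external divergence. The names and logic here are simplified assumptions (topology hints and the "only fall back to terminating endpoints when nothing is ready" rule are ignored); it illustrates the situation, it is not kube-proxy's actual code.

package main

import "fmt"

// serviceTrafficPolicy and endpointConditions are stand-ins for the Service
// and EndpointSlice data kube-proxy consults; they are not real kube-proxy types.
type serviceTrafficPolicy struct {
	externalPolicyLocal bool // externalTrafficPolicy: Local
	internalPolicyLocal bool // internalTrafficPolicy: Local
}

type endpointConditions struct {
	ready, serving, terminating, local bool
}

// usedForExternalTraffic: with ProxyTerminatingEndpoints enabled, a
// serving-but-terminating local endpoint still counts for Local external policy.
func usedForExternalTraffic(svc serviceTrafficPolicy, ep endpointConditions, terminatingGate bool) bool {
	if svc.externalPolicyLocal && !ep.local {
		return false
	}
	return ep.ready || (terminatingGate && svc.externalPolicyLocal && ep.serving && ep.terminating)
}

// usedForInternalTraffic: internal traffic only uses Ready endpoints.
func usedForInternalTraffic(svc serviceTrafficPolicy, ep endpointConditions) bool {
	if svc.internalPolicyLocal && !ep.local {
		return false
	}
	return ep.ready
}

func main() {
	svc := serviceTrafficPolicy{externalPolicyLocal: true} // internalTrafficPolicy defaults to Cluster
	ep := endpointConditions{serving: true, terminating: true, local: true}
	// The sole endpoint on this node is terminating: still used for external
	// traffic, no longer used for internal traffic, so the conntrack cleanup
	// needed on this node differs for the two kinds of sources.
	fmt.Println(usedForExternalTraffic(svc, ep, true)) // true
	fmt.Println(usedForInternalTraffic(svc, ep))       // false
}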

Ideas:

  1. Drop ProxyTerminatingEndpoints, revert the associated code, go back to the drawing board...
  2. Change the rules of ProxyTerminatingEndpoints so that it applies to Cluster services as well as Local ones so internal and external can never have disjoint endpoints. This is probably an API break, so we'd have to make it an opt-in feature on the service? (ie, you'd have to set publishTerminatingAddresses: true on the service)
  3. We can fix the cleanup of "black hole" conntrack entries by always preceding every UDP -j REJECT rule with a -j NOTRACK rule so we don't end up creating the conntrack entries in the first place... That doesn't seem that bad in terms of either added iptables rules (most people won't have lots and lots of services with no endpoints) or added kernel filtering load (most people won't have lots and lots of clients trying to connect to services with no endpoints so the conntrack rules won't help much anyway). We can't NOTRACK the rules for accepted connections though, so I don't think this will completely solve the problem for us. (But maybe it's a good idea for the rejected connections anyway?)
  4. We could make the proxy set conntrack labels on UDP traffic going through it, to record whether it came from an internal or external source, so that later we can use conntrack -D --label ... to delete conntrack entries from either only internal or only external sources. (This seems like it should work based on the iptables and conntrack man pages, but I'm not 100% sure.)

@danwinship
Contributor Author

/hold cancel
There don't seem to be any other places in syncProxyRules itself that are confusing allEndpoints and readyEndpoints, so I think this fix (as a minimal, backport-able fix) is independent of any endpoint-tracking fixes, internalTrafficPolicy fixes, or ProxyTerminatingEndpoints fixes.

@k8s-ci-robot removed the do-not-merge/hold label Nov 1, 2021
@aojea
Member

aojea commented Nov 1, 2021

There is an additional bug: if a service has ONLY non-ready endpoints, it doesn't insert the reject rule.
#106030 (comment)

I think we can assume that the stale/conntrack problems were solved by the previous logic, if we are able to figure out how to solve the traffic-locality problem that you mention, of course.

At this point ProxyTerminatingEndpoints is alpha, so we have some room there ...

@jayunit100
Member

fixes #106182

{
name: "basic test using each match type",
input: `
*filter
Member


This is pedantic but:

Could we use artificial input here (and less of it?) that is easier to verify visually?


// Scan the endpoints list to see what we have. "hasEndpoints" will be true
// if there are any usable endpoints for this service anywhere in the cluster.
var hasEndpoints, hasLocalReadyEndpoints, hasLocalServingTerminatingEndpoints bool
Member


I think we should do some careful refactoring of all this code (I started some in #106158) but we can do those as small followups. Balancing readability of this code with the desire to DRY is hard.

Member

@thockin left a comment


Does IPVS get ready/terminating/localterminating right? Seems like an opportunity to write this once.

@thockin
Member

thockin commented Nov 5, 2021

I'm going to LGTM - we can do followups

@thockin
Member

thockin commented Nov 5, 2021

/lgtm

@k8s-ci-robot added the lgtm label Nov 5, 2021
@aojea
Member

aojea commented Nov 5, 2021

I'm going to LGTM - we can do followups

We also have to review the in-flight KEPs. I don't know if it was Dan or you or both who already commented on this, but reviewing them I'm really concerned about future conflicts and interactions, given the number of KEPs that touch kube-proxy: https://github.com/orgs/kubernetes/projects/10

@robscott
Member

robscott commented Nov 5, 2021

@aojea agree that we need to be careful here; there are a lot of changes coming to kube-proxy, and the result could be inconsistencies like the ones this PR fixes, plus non-iptables proxiers falling behind. #100313 covers the need for hints and traffic policy overlap, and @danwinship did a great job of describing the changes we'd need in that issue. With that said, this is all getting sufficiently complex that it's going to be difficult to keep all proxiers in sync.

@aojea
Member

aojea commented Nov 5, 2021

How hard can it be to merge all the KEPs and break them down into subtasks, so we can work in different streams without breaking each other's work?

I can see two clear themes:

  • loadbalancer service type changes
  • cluster traffic behavior (internal, external, ...)

@k8s-ci-robot merged commit cb040e5 into kubernetes:master Nov 5, 2021
@k8s-ci-robot added this to the v1.23 milestone Nov 5, 2021
@robscott
Member

robscott commented Nov 5, 2021

At this point @danwinship has volunteered to take on #100313 which I think is really your second bullet point. I'm not sure who's working on the LoadBalancer/NodePort type changes.

@aojea
Member

aojea commented Nov 5, 2021

if Dan is on it we are in good hands ;)

@danwinship deleted the session-affinity-readiness branch November 6, 2021 02:16
@jayunit100
Member

jayunit100 commented Nov 7, 2021

Hiya Rob, good point WRT other proxiers. The userspace and Windows proxiers will be a lot better off in KPNG long term, because of what Rob mentioned; we are always playing catch-up with those.

We're deprecating userspace now and we have MRs in flight for both of these to run in kpng:

I.e., here's our first Windows one: kubernetes-sigs/kpng#83

And the userspace one: kubernetes-sigs/kpng#55

In that case the in-tree proxiers could be whittled down to just iptables and IPVS, which work quite well and aren't too hard to keep in sync.

@aojea
Member

aojea commented Nov 10, 2021

@danwinship how far back do we have to backport this?

@danwinship
Contributor Author

just 1.22; that's when terminating endpoint support was first added

@danwinship
Contributor Author

(and only the last commit actually has bugfixes; the rest is all unit test changes, and I'm not sure if they'll backport easily, but it's not really necessary to backport them if they don't)

(well, ok, the "fix the metric" commit is a bugfix too, but it's not worth backporting)

@aojea
Member

aojea commented Nov 10, 2021

Let me know if you can do it; next Friday is the last day for the cherry-pick:

The cherry-pick deadline for the 1.22, 1.21 and 1.20 branches is next Friday, November 12, 2021, EOD PT.

@aojea
Member

aojea commented Nov 12, 2021

Let me know if you can do it; next Friday is the last day for the cherry-pick:

The cherry-pick deadline for the 1.22, 1.21 and 1.20 branches is next Friday, November 12, 2021, EOD PT.

Cherry-pick to 1.22: #106373
