misc iptables proxy fixes #106030
Conversation
@danwinship: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
It's not only session affinity, it's everything: right now kube-proxy doesn't consider whether the endpoint is ready, and as a consequence there are more issues like this one. It's easy to reproduce with an init container that holds the pod back from becoming ready; the iptables rules are installed anyway:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
        active: "yes"
    spec:
      initContainers:
      - name: pause-service
        image: busybox:stable
        command: ['sh', '-c', 'echo Pausing start. && sleep 100']
      containers:
      - name: test
        image: httpd
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: test
spec:
  type: ClusterIP
  selector:
    app: test
    active: "yes"
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
We also have the tests wrong: kubernetes/pkg/proxy/iptables/proxier_test.go lines 2981–2982 and lines 3038–3041 (at 08bf546).
Hmm, it seems I was wrong about that.
I think we need something like:

diff --git a/pkg/proxy/iptables/proxier.go b/pkg/proxy/iptables/proxier.go
index 79bb5b6c793..0b13ce46dd0 100644
--- a/pkg/proxy/iptables/proxier.go
+++ b/pkg/proxy/iptables/proxier.go
@@ -1002,7 +1002,19 @@ func (proxier *Proxier) syncProxyRules() {
 		// Service does not have conflicting configuration such as
 		// externalTrafficPolicy=Local.
 		allEndpoints = proxy.FilterEndpoints(allEndpoints, svcInfo, proxier.nodeLabels)
-		hasEndpoints := len(allEndpoints) > 0
+		hasEndpoints := false
+		for _, e := range allEndpoints {
+			if e.IsReady() {
+				hasEndpoints = true
+				break
+			}
+			if utilfeature.DefaultFeatureGate.Enabled(features.ProxyTerminatingEndpoints) {
+				if svcInfo.NodeLocalExternal() && e.IsServing() && e.IsTerminating() {
+					hasEndpoints = true
+					break
+				}
+			}
+		}

 		svcChain := svcInfo.servicePortChainName
 		if hasEndpoints {

diff --git a/pkg/proxy/iptables/proxier_test.go b/pkg/proxy/iptables/proxier_test.go
index 85f40337ddf..81b142f1f25 100644
--- a/pkg/proxy/iptables/proxier_test.go
+++ b/pkg/proxy/iptables/proxier_test.go
@@ -2938,6 +2938,77 @@ COMMIT
 	assert.NotEqual(t, expectedIPTablesWithSlice, fp.iptablesData.String())
 }

+func TestEndpointSliceE2EReject(t *testing.T) {
+	expectedIPTablesWithSlice := `*filter
+:KUBE-SERVICES - [0:0]
+:KUBE-EXTERNAL-SERVICES - [0:0]
+:KUBE-FORWARD - [0:0]
+:KUBE-NODEPORTS - [0:0]
+-A KUBE-SERVICES -m comment --comment "ns1/svc1 has no endpoints" -m tcp -p tcp -d 172.20.1.1/32 --dport 0 -j REJECT
+-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
+-A KUBE-FORWARD -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
+COMMIT
+*nat
+:KUBE-SERVICES - [0:0]
+:KUBE-NODEPORTS - [0:0]
+:KUBE-POSTROUTING - [0:0]
+:KUBE-MARK-MASQ - [0:0]
+-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
+-A KUBE-POSTROUTING -j MARK --xor-mark 0x4000
+-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
+-A KUBE-MARK-MASQ -j MARK --or-mark 0x4000
+-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
+COMMIT
+`
+
+	ipt := iptablestest.NewFake()
+	fp := NewFakeProxier(ipt)
+
+	// (service setup and EndpointSlice metadata elided in the original comment)
+	endpointSlice := &discovery.EndpointSlice{
+		AddressType: discovery.AddressTypeIPv4,
+		Endpoints: []discovery.Endpoint{{
+			Addresses:  []string{"10.0.1.1"},
+			Conditions: discovery.EndpointConditions{Ready: utilpointer.BoolPtr(false)},
+			NodeName:   utilpointer.StringPtr(testHostname),
+		}, {
+			Addresses:  []string{"10.0.1.4"},
+			Conditions: discovery.EndpointConditions{Ready: utilpointer.BoolPtr(false)},
+			NodeName:   utilpointer.StringPtr("node4"),
+		}},
+	}
+
+	fp.OnEndpointSliceAdd(endpointSlice)
+	fp.syncProxyRules()
+	assert.Equal(t, expectedIPTablesWithSlice, fp.iptablesData.String())
+}
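Separately, the readiness predicate that the proxier.go hunk above inlines could be factored into a small helper. This is just an illustrative sketch, not kube-proxy code: the helper name and the trimmed-down Endpoint interface are invented, standing in for k8s.io/kubernetes/pkg/proxy.Endpoint.

package proxy

// Endpoint is a hypothetical, trimmed-down stand-in for the real
// proxy.Endpoint interface; only the methods this sketch needs.
type Endpoint interface {
	IsReady() bool
	IsServing() bool
	IsTerminating() bool
}

// hasUsableEndpoints reports whether any endpoint can actually receive
// traffic: either ready, or (when the ProxyTerminatingEndpoints feature
// gate is on and the service uses externalTrafficPolicy=Local) still
// serving while terminating.
func hasUsableEndpoints(eps []Endpoint, nodeLocalExternal, terminatingGateOn bool) bool {
	for _, e := range eps {
		if e.IsReady() {
			return true
		}
		if terminatingGateOn && nodeLocalExternal && e.IsServing() && e.IsTerminating() {
			return true
		}
	}
	return false
}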
And this to fix #105657:

diff --git a/pkg/proxy/endpoints.go b/pkg/proxy/endpoints.go
index 945237dd1ce..2a01b95d746 100644
--- a/pkg/proxy/endpoints.go
+++ b/pkg/proxy/endpoints.go
@@ -121,7 +121,13 @@ func (info *BaseEndpointInfo) Port() (int, error) {

 // Equal is part of proxy.Endpoint interface.
 func (info *BaseEndpointInfo) Equal(other Endpoint) bool {
-	return info.String() == other.String() && info.GetIsLocal() == other.GetIsLocal()
+	return info.String() == other.String() &&
+		info.GetIsLocal() == other.GetIsLocal() &&
+		info.GetNodeName() == other.GetNodeName() &&
+		info.GetZone() == other.GetZone() &&
+		info.IsReady() == other.IsReady() &&
+		info.IsTerminating() == other.IsTerminating() &&
+		info.IsServing() == other.IsServing()
 }
// GetNodeName returns the NodeName for this endpoint.
@@ -561,8 +567,20 @@ func detectStaleConnections(oldEndpointsMap, newEndpointsMap EndpointsMap, stale
 			continue
 		}

+		epReady := 0
+		for _, ep := range epList {
+			if ep.IsReady() {
+				epReady++
+			}
+		}
+		oldEpReady := 0
+		for _, ep := range oldEndpointsMap[svcPortName] {
+			if ep.IsReady() {
+				oldEpReady++
+			}
+		}
 		// For udp service, if its backend changes from 0 to non-0. There may exist a conntrack entry that could blackhole traffic to the service.
-		if len(epList) > 0 && len(oldEndpointsMap[svcPortName]) == 0 {
+		if epReady > 0 && oldEpReady == 0 {
 			*staleServiceNames = append(*staleServiceNames, svcPortName)
 		}
 	}

Ideally we should rewrite […]. Dan, do you want to split the load?
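To see why the Equal change matters for #105657: with the old Equal, an update that only flips an endpoint's readiness compares as equal and gets dropped as a no-op. A rough test sketch (not from the PR; it assumes BaseEndpointInfo's exported Endpoint and Ready fields as they existed in k8s.io/kubernetes/pkg/proxy at the time):

package proxy

import "testing"

// Rough sketch: with the patched Equal above, two endpoints that differ
// only in readiness must not compare equal, so condition-only
// EndpointSlice updates are no longer discarded.
func TestEqualConsidersReadiness(t *testing.T) {
	a := &BaseEndpointInfo{Endpoint: "10.0.1.1:80", Ready: true}
	b := &BaseEndpointInfo{Endpoint: "10.0.1.1:80", Ready: false}
	if a.Equal(b) {
		t.Fatal("endpoints differing only in readiness should not compare Equal")
	}
}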
cc @knabben
it's not everything...
Yes, because it writes out the chains before it figures out whether it's going to need to use the terminating endpoints. I have a fix for that coming later.
That's still not right, because it's not taking […] into account. In fact, […]
I'm in the middle of refactoring the iptables […]
/hold
And […] except it's even worse than that, because we can now have different stale endpoints for internal and external connections:
But we can't implement "invalidate conntrack entries from internal sources, but not conntrack entries from external sources", because the definition of "internal" and "external" is abstracted away by the LocalDetector, which might distinguish internal from external using arbitrary iptables rules that have no equivalent in […]. I think this is only a problem when […]. Ideas: […]
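To make the LocalDetector constraint concrete, here is a deliberately simplified sketch (not the real interface from k8s.io/kubernetes/pkg/proxy/util/iptables, which differs in detail): "internal" only exists as iptables match arguments, so there is no list of internal addresses to feed into a per-source conntrack invalidation.

package main

// localDetector can only express "internal" as iptables match arguments
// (e.g. "-s 10.0.0.0/8"). There is no inverse "enumerate the internal
// addresses" operation, which is exactly the problem described above.
type localDetector interface {
	ifLocal() []string    // iptables args matching cluster-internal traffic
	ifNotLocal() []string // iptables args matching everything else
}

// clusterCIDRDetector is the easy case: internal == within the cluster CIDR.
// Other detectors might match on interfaces or marks with no CIDR at all.
type clusterCIDRDetector struct{ cidr string }

func (d clusterCIDRDetector) ifLocal() []string    { return []string{"-s", d.cidr} }
func (d clusterCIDRDetector) ifNotLocal() []string { return []string{"!", "-s", d.cidr} }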
/hold cancel
There is an additional bug: if a service has ONLY non-ready endpoints, it doesn't insert the reject rule. I think we can assume that the stale/conntrack problems were solved by the previous logic, if we are able to figure out how to solve the traffic locality problem that you mention, of course. At this point, […]
fixes #106182
{
	name: "basic test using each match type",
	input: `
*filter
This is pedantic, but could we use artificial input here (and less of it) that is easier to verify visually?
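For instance, a hypothetical minimal input (rules invented here, not taken from the PR) could exercise each match type in a handful of lines:

// Hypothetical artificial test input in the spirit of this suggestion;
// the chain name and rules are made up rather than real kube-proxy output.
const artificialInput = `*filter
:TEST-CHAIN - [0:0]
-A TEST-CHAIN -m comment --comment "a comment match" -j ACCEPT
-A TEST-CHAIN -p tcp --dport 80 -j DROP
-A TEST-CHAIN -s 10.0.0.0/8 -j RETURN
COMMIT
`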
// Scan the endpoints list to see what we have. "hasEndpoints" will be true
// if there are any usable endpoints for this service anywhere in the cluster.
var hasEndpoints, hasLocalReadyEndpoints, hasLocalServingTerminatingEndpoints bool
I think we should do some careful refactoring of all this code (I started some in #106158) but we can do those as small followups. Balancing readability of this code with the desire to DRY is hard.
Does IPVS get ready/terminating/localterminating right? Seems like an opportunity to write this once.
I'm going to LGTM - we can do followups.
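"Write this once" would presumably mean a shared helper in pkg/proxy that both the iptables and IPVS proxiers call. A hypothetical sketch (the name and signature are invented; Endpoint is the trimmed-down interface from the earlier sketch, extended with GetIsLocal() bool):

// categorizeEndpoints is an invented illustration of sharing the
// ready/terminating/local-terminating logic across proxiers: collect the
// ready endpoints, and fall back to terminating-but-still-serving ones
// only when an externalTrafficPolicy=Local service has no ready endpoints.
func categorizeEndpoints(eps []Endpoint, localOnly bool) []Endpoint {
	var ready, terminating []Endpoint
	for _, e := range eps {
		switch {
		case e.IsReady():
			ready = append(ready, e)
		case localOnly && e.IsServing() && e.IsTerminating():
			terminating = append(terminating, e)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	// Fallback: better to route to a draining endpoint than to drop traffic.
	return terminating
}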
/lgtm
We also have to review the in-flight KEPs. I don't know if it was Dan or you or both who commented on this already, but reviewing them I'm really concerned about future conflicts and interactions among the many KEPs that touch kube-proxy: https://github.com/orgs/kubernetes/projects/10
@aojea agree that we need to be careful here; there are a lot of changes coming to kube-proxy, and the result could be inconsistencies like these fixes plus non-iptables proxiers falling behind. #100313 covers the need for hints and traffic policy overlap, and @danwinship did a great job at describing the changes we'd need in that issue. With that said, this is all getting sufficiently complex that it's going to be difficult to keep all proxiers in sync.
How hard can it be to merge all the KEPs and break them down into different subtasks so we can work in different streams without breaking each other? I can see clearly 2 themes: […]
At this point @danwinship has volunteered to take on #100313, which I think is really your second bullet point. I'm not sure who's working on the LoadBalancer/NodePort type changes.
If Dan is on it we are in good hands ;)
Hiya Rob, good point WRT other proxiers. The userspace and Windows proxiers will be a lot better off in KPNG long term because of what Rob mentioned; we are always playing catch-up with those. We're deprecating userspace now and we have MRs in flight for both of these to run in KPNG: here's our first Windows one, kubernetes-sigs/kpng#83, and the userspace one, kubernetes-sigs/kpng#55. In that case the in-tree proxiers could be whittled down to just iptables and IPVS, which work quite well and aren't too hard to keep in sync.
@danwinship until which version do we have to backport this?
just 1.22; that's when terminating endpoint support was first added
(and only the last commit actually has bugfixes; the rest is all unit test changes, and I'm not sure if they'll backport easily, but it's not really necessary to backport them if they don't) (well, ok, the "fix the metric" commit is a bugfix too, but it's not worth backporting)
Let me know if you can do it; next Friday is the last day for the cherry-pick.
cherry-pick to 1.22: #106373
…0-upstream-release-1.22 Automated cherry pick of #106030: proxy/iptables: Fix sync_proxy_rules_iptables_total
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes four iptables proxier bugs:

- the sync_proxy_rules_iptables_total metric was off-by-one
- it was outputting -j REJECT rules only for services with no endpoints at all, rather than for services with no usable endpoints
- […]
- […]

There are some other known bugs which this PR does not fix.
Which issue(s) this PR fixes:
Fixes #106182
Does this PR introduce a user-facing change?
/sig network
/priority important-soon
/cc @andrewsykim @robscott