Ensure deletion of pods in queues and cache #106102
Conversation
@alculquicondor: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @Huang-Wei
  // An assumed pod won't have Delete/Remove event. It needs to have Add event
  // before Remove event, in which case the state would change from Assumed to Added.
- case ok && !cache.assumedPods.Has(key):
+ case ok:
  	if currState.pod.Spec.NodeName != pod.Spec.NodeName {
  		klog.Errorf("Pod %v was assumed to be on %v but got added to %v", key, pod.Spec.NodeName, currState.pod.Spec.NodeName)
  		klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
this klog.Fatalf here is worrying... is currState.pod or pod from an informer cache? is it possible for there to be a mismatch where one has an empty NodeName and one has a non-empty NodeName?
currState.pod is the internal cache and pod is from the event (so from the informer in the case of DeletedFinalStateUnknown).
If the pod was assumed, currState.pod would have a node name and pod might not, if events were missed. I think we can skip the Fatal in that case.
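For illustration, a minimal sketch of what skipping the Fatal could look like. This mirrors the shape of the final diff shown further down; the identifiers are the ones from the snippet above:

if currState.pod.Spec.NodeName != pod.Spec.NodeName {
	klog.Errorf("Pod %v was assumed to be on %v but got added to %v", key, pod.Spec.NodeName, currState.pod.Spec.NodeName)
	if pod.Spec.NodeName != "" {
		// An empty NodeName means the event likely carried a stale object
		// from the informer cache after a missed Delete; only a conflicting,
		// non-empty NodeName still indicates real cache corruption.
		klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
	}
}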
@ahg-g @Huang-Wei wdyt?
we can skip the fatal here.
Fine with skipping the fatal.
Done
pkg/scheduler/eventhandlers.go (outdated)
- if pod, ok := t.Obj.(*v1.Pod); ok {
- 	return assignedPod(pod)
+ if _, ok := t.Obj.(*v1.Pod); ok {
+ 	// We don't know if the Pod was assigned or not. Attempt cleanup.
Suggested change:
- // We don't know if the Pod was assigned or not. Attempt cleanup.
+ // The carried obj may be stale, so we don't use it to check if
+ // it's assigned or not. Will attempt to cleanup later anyways.
Done
@@ -549,6 +547,7 @@ func (cache *schedulerCache) RemovePod(pod *v1.Pod) error {
 		return err
 	}
 	delete(cache.podStates, key)
+	delete(cache.assumedPods, key)
Now, L545~L550 can be shortened to:
if err := cache.expirePod(key, currState); err != nil {
return err
}
Done
  // An assumed pod won't have Delete/Remove event. It needs to have Add event
  // before Remove event, in which case the state would change from Assumed to Added.
- case ok && !cache.assumedPods.Has(key):
+ case ok:
Can we have this UTed? (i.e., pod deletions w/ and w/o the pod key in assumedPods.)
I added a UT already.
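For reference, an illustrative sketch of such a test; this is not the exact upstream test code, and it assumes the package-internal newSchedulerCache constructor. It exercises RemovePod both while the pod is still assumed and after it was confirmed with AddPod:

import (
	"testing"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestRemovePodClearsAssumed(t *testing.T) {
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Namespace: "ns", Name: "p", UID: "uid-p"},
		Spec:       v1.PodSpec{NodeName: "node-1"},
	}
	for name, confirmAdd := range map[string]bool{
		"assumed only":       false,
		"assumed then added": true,
	} {
		t.Run(name, func(t *testing.T) {
			c := newSchedulerCache(10*time.Second, time.Second, nil)
			if err := c.AssumePod(pod); err != nil {
				t.Fatalf("AssumePod: %v", err)
			}
			if confirmAdd {
				// Simulate the Add event: the state goes Assumed -> Added.
				if err := c.AddPod(pod); err != nil {
					t.Fatalf("AddPod: %v", err)
				}
			}
			// With this PR, removal succeeds in both cases and clears the
			// key from both podStates and assumedPods.
			if err := c.RemovePod(pod); err != nil {
				t.Fatalf("RemovePod: %v", err)
			}
		})
	}
}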
@alculquicondor: you cannot LGTM your own PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both. The pod could be assumed, so deletion of assumed pods from the cache should work.
Change-Id: I11ce9785de603924fc121fe2fa6ed5cb1e16922f
(force-pushed from b8654c3 to ff741f6)
if _, ok := t.Obj.(*v1.Pod); ok {
	// The carried object may be stale, so we don't use it to check if
	// it's assigned or not. Attempting to cleanup anyways.
	return true
just to make sure, so when both filters (this and the one below) return true, both handlers will be invoked, right?
correct. The handlers are independent of each other.
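For context, a condensed sketch of that wiring, paraphrasing pkg/scheduler/eventhandlers.go rather than quoting it: two FilteringResourceEventHandlers are registered on the same pod informer, one maintaining the cache and one maintaining the scheduling queue, and an event is delivered to every handler whose FilterFunc returns true.

// Handler 1: assigned pods maintain the scheduler cache.
podInformer.AddEventHandler(cache.FilteringResourceEventHandler{
	FilterFunc: func(obj interface{}) bool {
		switch t := obj.(type) {
		case *v1.Pod:
			return assignedPod(t)
		case cache.DeletedFinalStateUnknown:
			// The carried object may be stale, so don't use it to decide
			// whether the pod was assigned; attempt cleanup regardless.
			_, ok := t.Obj.(*v1.Pod)
			return ok
		default:
			return false
		}
	},
	Handler: cache.ResourceEventHandlerFuncs{
		AddFunc:    sched.addPodToCache,
		UpdateFunc: sched.updatePodInCache,
		DeleteFunc: sched.deletePodFromCache,
	},
})
// Handler 2 (not shown): unassigned pods maintain the scheduling queue,
// and its filter likewise returns true for DeletedFinalStateUnknown, so a
// single delete event can trigger cleanup in both places.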
A couple of thoughts:
When the DeletionTimestamp is set, the kubelet might still be trying to stop the pod. We actually want to know when the pod is terminated, which is either a Delete event or the pod reaching the Succeeded or Failed phase. The latter is filtered at the informer level, using a FieldSelector, so it happens in the server, although I admit that I don't know all the intricacies of how this filtering relates to events.
(Referenced: kubernetes/pkg/scheduler/scheduler.go, lines 677 to 683 at f5dd4d2)
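The referenced snippet builds the pod informer with that field selector, roughly like this (an approximation of the linked lines, not a verbatim quote):

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	coreinformers "k8s.io/client-go/informers/core/v1"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// Succeeded/Failed pods are excluded by the list/watch itself, so the
// filtering happens server-side and terminated pods never reach the
// scheduler's event handlers.
func newPodInformer(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	selector := fmt.Sprintf("status.phase!=%v,status.phase!=%v", v1.PodSucceeded, v1.PodFailed)
	tweakListOptions := func(options *metav1.ListOptions) {
		options.FieldSelector = selector
	}
	return coreinformers.NewFilteredPodInformer(cs, metav1.NamespaceAll, resyncPeriod, nil, tweakListOptions)
}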
Basically: events filtered at the server side -> events go to scheduler -> scheduler local filters -> add/update/delete logic with cache or scheduling queue.
Anything else to merge this on master? We can decide later whether to merge to 1.22 and older versions.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, alculquicondor
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
@alculquicondor Hi, can you describe in detail how this scenario occurred, or how to reproduce it? I have encountered this (or something like it) before in our production cluster, but we were unable to reproduce it to prove whether the event was actually missed.
I don't actually have a reliable repro either. It only happened in a cluster with a lot of pod creations, where it's more likely for kube-scheduler to miss actual events from the watcher. We have since increased logging to v=3 to capture information about events. Can you do the same?
Should we wait for the 1.23 release before backporting to 1.22 and lower?
+ if pod.Spec.NodeName != "" {
+ 	// An empty NodeName is possible when the scheduler misses a Delete
+ 	// event and it gets the last known state from the informer cache.
  	klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
if we backport this PR, is it possible to hit this fatal without the fix in #105913?
I guess this was the case already in previous versions
For pod.Spec.NodeName to have a node name, it means that we already saw the pod before and scheduled it for a 2nd time. So I don't think we would hit this.
I took a second look at the code; I think it is safe to backport this.
@ahg-g @alculquicondor this is not needed in the 1.20 branch?
Created #106695 for 1.20 |
…of-#106102-upstream-release-1.22 Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
…of-#106102-upstream-release-1.21 Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
@alculquicondor Did this solve your problem? I've been seeing the same issue.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both.
The pod could be assumed, so deletion of assumed pods from the cache should work.
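As a sketch of the resulting delete path: the real code keeps the two handlers separate (see the discussion above); they are folded into one function here purely for illustration, and the Scheduler field and method names are assumptions:

func onPodDelete(sched *Scheduler, obj interface{}) {
	var pod *v1.Pod
	switch t := obj.(type) {
	case *v1.Pod:
		pod = t
	case cache.DeletedFinalStateUnknown:
		var ok bool
		if pod, ok = t.Obj.(*v1.Pod); !ok {
			return
		}
	default:
		return
	}
	// The (possibly stale) object can't tell us whether the pod sat in the
	// scheduling queue or in the cache, so issue the deletion in both.
	if err := sched.SchedulingQueue.Delete(pod); err != nil {
		klog.Errorf("deleting pod from queue: %v", err)
	}
	if err := sched.Cache.RemovePod(pod); err != nil {
		klog.Errorf("deleting pod from cache: %v", err)
	}
}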
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
In one of our clusters, the scheduler seemed to have failed to remove a recently bound Pod from the cache, causing other Pods to be unschedulable. The case above seems like a possible culprit, but we couldn't confirm it 100%.
I want to backport this to all supported versions.
Does this PR introduce a user-facing change?