
Ensure deletion of pods in queues and cache #106102

Merged

Conversation

alculquicondor
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both.

The pod could be assumed, so deletion of assumed pods from the cache should work.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

In one of our clusters, the scheduler seemed to have failed to remove a recently bound Pod from the cache, causing other Pods to be unschedulable. The case above seems like a possible culprit, but we couldn't confirm it 100%.

I want to backport this to all supported versions.

Does this PR introduce a user-facing change?

Ensure Pods are removed from the scheduler cache when the scheduler misses deletion events due to transient errors

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2021
@k8s-ci-robot
Contributor

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 2, 2021
@alculquicondor
Member Author

/assign @Huang-Wei

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2021
@alculquicondor
Member Author

cc @ahg-g @liggitt

```diff
 // An assumed pod won't have Delete/Remove event. It needs to have Add event
 // before Remove event, in which case the state would change from Assumed to Added.
-case ok && !cache.assumedPods.Has(key):
+case ok:
 	if currState.pod.Spec.NodeName != pod.Spec.NodeName {
 		klog.Errorf("Pod %v was assumed to be on %v but got added to %v", key, pod.Spec.NodeName, currState.pod.Spec.NodeName)
 		klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
```
Member

@liggitt liggitt Nov 2, 2021

this klog.Fatalf here is worrying... is currState.pod or pod from an informer cache?

is it possible for there to be a mismatch where one has an empty NodeName and one has a non-empty NodeName?

Member Author

currState.pod is the internal cache and pod is from the event (so from the informer in the case of DeletedFinalStateUnknown)

If the pod was assumed, currState.pod would have a name and pod might not, if events were missed. I think we can skip the Fatal in that case.

Member

we can skip the fatal here.

Member

Fine with skip fatal.

Member Author

Done

```diff
-if pod, ok := t.Obj.(*v1.Pod); ok {
-	return assignedPod(pod)
+if _, ok := t.Obj.(*v1.Pod); ok {
+	// We don't know if the Pod was assigned or not. Attempt cleanup.
```
Member

Suggested change

```diff
-// We don't know if the Pod was assigned or not. Attempt cleanup.
+// The carried obj may be stale, so we don't use it to check if
+// it's assigned or not. Will attempt to cleanup later anyways.
```

Member Author

Done

```diff
@@ -549,6 +547,7 @@ func (cache *schedulerCache) RemovePod(pod *v1.Pod) error {
 		return err
 	}
 	delete(cache.podStates, key)
+	delete(cache.assumedPods, key)
```
Member

Now, L545~L550 can be shortened to:

```go
if err := cache.expirePod(key, currState); err != nil {
	return err
}
```

Member Author

Done

```diff
 // An assumed pod won't have Delete/Remove event. It needs to have Add event
 // before Remove event, in which case the state would change from Assumed to Added.
-case ok && !cache.assumedPods.Has(key):
+case ok:
```
Member

Can we have this unit-tested? (i.e., pod deletions with and without the pod key in assumedPods.)

Member Author

I added a UT already.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 3, 2021
@k8s-ci-robot
Contributor

@alculquicondor: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Member Author

/retest

When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both.

The pod could be assumed, so deletion of assumed pods from the cache should work.

Change-Id: I11ce9785de603924fc121fe2fa6ed5cb1e16922f
```go
if _, ok := t.Obj.(*v1.Pod); ok {
	// The carried object may be stale, so we don't use it to check if
	// it's assigned or not. Attempting to cleanup anyways.
	return true
```
Member

just to make sure, so when both filters (this and the one below) return true, both handlers will be invoked, right?

Member Author

correct. The handlers are independent of each other.

@ahg-g
Member

ahg-g commented Nov 3, 2021

A couple of thoughts:

  1. Why not simply delete the pod from the cache whenever the pod DeletionTimestamp is set? why limit it to DeletedFinalStateUnknown?
  2. I think we should not backport this, let it soak for a while first, the existing logic is not new, so no harm in taking it slowly.

@alculquicondor
Member Author

Why not simply delete the pod from the cache whenever the pod DeletionTimestamp is set? why limit it to DeletedFinalStateUnknown?

When the DeletionTimestamp is set, the kubelet might still be trying to stop the pod. We actually want to know when the pod is terminated, which is either a Delete or it reaches the Succeeded or Failed phase. The latter is filtered at the informer level, using a FieldSelector, so it happens in the server. Although I admit that I don't know all the intricacies of how this filtering relates to events.

```go
func newPodInformer(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	selector := fmt.Sprintf("status.phase!=%v,status.phase!=%v", v1.PodSucceeded, v1.PodFailed)
	tweakListOptions := func(options *metav1.ListOptions) {
		options.FieldSelector = selector
	}
	return coreinformers.NewFilteredPodInformer(cs, metav1.NamespaceAll, resyncPeriod, nil, tweakListOptions)
}
```

@Huang-Wei
Member

Although I admit that I don't know all the intricacies of how this filtering relates to events.

Basically: events are filtered on the server side -> events go to the scheduler -> the scheduler's local filters run -> add/update/delete logic runs against the cache or the scheduling queue.

@alculquicondor
Member Author

alculquicondor commented Nov 5, 2021

Anything else to merge this on master?

We can decide later whether to merge to 1.22 and older versions.

@ahg-g
Member

ahg-g commented Nov 6, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit d92a443 into kubernetes:master Nov 6, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 6, 2021
@denkensk
Member

When the client misses a delete event from the watcher

@alculquicondor Hi, can you describe in detail how this scenario occurred or how to reproduce it? I have encountered this, or something like it, before in our production cluster, but we were unable to reproduce it to prove whether the event was actually missed.

@alculquicondor
Member Author

I don't actually have a reliable repro either. It only happened in a cluster with a lot of pod creations, where it's more likely for kube-scheduler to miss actual events from the watcher. We have increased the logging since then to v=3 to have information about events. Can you do the same?

@alculquicondor
Member Author

Should we wait for the 1.23 release before backporting to 1.22 and lower?

```go
if pod.Spec.NodeName != "" {
	// An empty NodeName is possible when the scheduler misses a Delete
	// event and it gets the last known state from the informer cache.
	klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
```
Member

If we backport this PR, is it possible to hit this fatal without the fix in #105913?

Member

I guess this was the case already in previous versions

Member Author

For pod.Spec.NodeName to have a node name, it means that we already saw the pod before and scheduled it for a 2nd time. So I don't think we would hit this.

@ahg-g
Member

ahg-g commented Nov 25, 2021

I took a second look at the code, I think it is safe to backport this.

@cpanato
Member

cpanato commented Nov 26, 2021

@ahg-g @alculquicondor this is not needed in the 1.20 branch?

@alculquicondor
Member Author

Created #106695 for 1.20

k8s-ci-robot added a commit that referenced this pull request Nov 26, 2021
…of-#106102-upstream-release-1.22

Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
k8s-ci-robot added a commit that referenced this pull request Nov 26, 2021
…of-#106102-upstream-release-1.21

Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
@lpitstickpstg

lpitstickpstg commented Jun 2, 2022

@alculquicondor Did this solve your problem? I've been seeing the same issue.
I thought we'd want to return false on line 276 to trigger the delete handler to cleanup the cache.


9 participants