
Ensure deletion of pods in queues and cache #106102

Merged

Conversation

alculquicondor
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both.

The pod could be assumed, so deletion of assumed pods from the cache should work.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

In one of our clusters, the scheduler seemed to have failed to remove a recently bound Pod from the cache, causing other Pods to be unschedulable. The case above seems like a possible culprit, but we couldn't confirm it 100%.

I want to backport this to all supported versions.

Does this PR introduce a user-facing change?

Ensure Pods are removed from the scheduler cache when the scheduler misses deletion events due to transient errors

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2021
@k8s-ci-robot
Contributor

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 2, 2021
@alculquicondor
Member Author

/assign @Huang-Wei

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 2, 2021
@alculquicondor
Member Author

cc @ahg-g @liggitt

```diff
 // An assumed pod won't have Delete/Remove event. It needs to have Add event
 // before Remove event, in which case the state would change from Assumed to Added.
-case ok && !cache.assumedPods.Has(key):
+case ok:
 	if currState.pod.Spec.NodeName != pod.Spec.NodeName {
 		klog.Errorf("Pod %v was assumed to be on %v but got added to %v", key, pod.Spec.NodeName, currState.pod.Spec.NodeName)
 		klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
```
Member

@liggitt liggitt Nov 2, 2021

this klog.Fatalf here is worrying... is currState.pod or pod from an informer cache?

is it possible for there to be a mismatch where one has an empty NodeName and one has a non-empty NodeName?

Member Author

currState.pod is the internal cache and pod is from the event (so from the informer in the case of DeletedFinalStateUnknown)

If the pod was assumed, currState.pod would have a name and pod might not, if events were missed. I think we can skip the Fatal in that case.

Member

we can skip the fatal here.

Member

Fine with skip fatal.

Member Author

Done

```diff
-if pod, ok := t.Obj.(*v1.Pod); ok {
-	return assignedPod(pod)
+if _, ok := t.Obj.(*v1.Pod); ok {
+	// We don't know if the Pod was assigned or not. Attempt cleanup.
```
Member

Suggested change

```diff
-// We don't know if the Pod was assigned or not. Attempt cleanup.
+// The carried obj may be stale, so we don't use it to check if
+// it's assigned or not. Will attempt to cleanup later anyways.
```

Member Author

Done

```diff
@@ -549,6 +547,7 @@ func (cache *schedulerCache) RemovePod(pod *v1.Pod) error {
 		return err
 	}
 	delete(cache.podStates, key)
+	delete(cache.assumedPods, key)
```
Member

Now, L545~L550 can be shortened to:

```go
if err := cache.expirePod(key, currState); err != nil {
	return err
}
```

Member Author

Done

```diff
 // An assumed pod won't have Delete/Remove event. It needs to have Add event
 // before Remove event, in which case the state would change from Assumed to Added.
-case ok && !cache.assumedPods.Has(key):
+case ok:
```
Member

Can we have this unit-tested? (i.e., pod deletions with and without the pod key in assumedPods.)

Member Author

I added a UT already.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 3, 2021
@k8s-ci-robot
Contributor

@alculquicondor: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Member Author

/retest

When the client misses a delete event from the watcher, it will use the last state of the pod in the informer cache to produce a delete event. At that point, it's not clear if the pod was in the queues or the cache, so we should issue a deletion in both.

The pod could be assumed, so deletion of assumed pods from the cache should work.

Change-Id: I11ce9785de603924fc121fe2fa6ed5cb1e16922f
```go
if _, ok := t.Obj.(*v1.Pod); ok {
	// The carried object may be stale, so we don't use it to check if
	// it's assigned or not. Attempting to cleanup anyways.
	return true
```
Member

just to make sure, so when both filters (this and the one below) return true, both handlers will be invoked, right?

Member Author

correct. The handlers are independent of each other.

@ahg-g
Member

ahg-g commented Nov 3, 2021

A couple of thoughts:

  1. Why not simply delete the pod from the cache whenever the pod DeletionTimestamp is set? why limit it to DeletedFinalStateUnknown?
  2. I think we should not backport this, let it soak for a while first, the existing logic is not new, so no harm in taking it slowly.

@alculquicondor
Member Author

Why not simply delete the pod from the cache whenever the pod DeletionTimestamp is set? why limit it to DeletedFinalStateUnknown?

When the DeletionTimestamp is set, the kubelet might still be trying to stop the pod. We actually want to know when the pod is terminated, which is either a Delete or it reaches the Succeeded or Failed phase. The latter is filtered at the informer level, using a FieldSelector, so it happens in the server. Although I admit that I don't know all the intricacies of how this filtering relates to events.

```go
func newPodInformer(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	selector := fmt.Sprintf("status.phase!=%v,status.phase!=%v", v1.PodSucceeded, v1.PodFailed)
	tweakListOptions := func(options *metav1.ListOptions) {
		options.FieldSelector = selector
	}
	return coreinformers.NewFilteredPodInformer(cs, metav1.NamespaceAll, resyncPeriod, nil, tweakListOptions)
}
```

@Huang-Wei
Member

Although I admit that I don't know all the intricacies of how this filtering relates to events.

Basically: events are filtered on the server side -> events go to the scheduler -> the scheduler's local filters run -> add/update/delete logic runs against the cache or the scheduling queue.

@alculquicondor
Member Author

alculquicondor commented Nov 5, 2021

Anything else to merge this on master?

We can decide later whether to merge to 1.22 and older versions.

@ahg-g
Member

ahg-g commented Nov 6, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot k8s-ci-robot merged commit d92a443 into kubernetes:master Nov 6, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 6, 2021
@denkensk
Member

When the client misses a delete event from the watcher

@alculquicondor Hi, can you describe in detail how this scenario occurred or how to reproduce it? I have encountered this, or something like it, before in our production cluster, but we were unable to reproduce it to prove whether the event was actually missed.

@alculquicondor
Member Author

I don't actually have a reliable repro either. It only happened in a cluster with a lot of pod creations, where it's more likely for kube-scheduler to miss actual events from the watcher. We have increased the logging since then to v=3 to have information about events. Can you do the same?

@alculquicondor
Member Author

Should we wait for the 1.23 release before backporting to 1.22 and lower?

```go
if pod.Spec.NodeName != "" {
	// An empty NodeName is possible when the scheduler misses a Delete
	// event and it gets the last known state from the informer cache.
	klog.Fatalf("Schedulercache is corrupted and can badly affect scheduling decisions")
```
Member

If we backport this PR, is it possible to hit this fatal without the fix in #105913?

Member

I guess this was the case already in previous versions

Member Author

For pod.Spec.NodeName to have a node name, it means that we already saw the pod before and scheduled it for a 2nd time. So I don't think we would hit this.

@ahg-g
Member

ahg-g commented Nov 25, 2021

I took a second look at the code, I think it is safe to backport this.

@cpanato
Member

cpanato commented Nov 26, 2021

@ahg-g @alculquicondor this is not needed in the 1.20 branch?

@alculquicondor
Member Author

Created #106695 for 1.20

k8s-ci-robot added a commit that referenced this pull request Nov 26, 2021
…of-#106102-upstream-release-1.22

Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
k8s-ci-robot added a commit that referenced this pull request Nov 26, 2021
…of-#106102-upstream-release-1.21

Automated cherry pick of #106102: Ensure deletion of pods in queues and cache
@lpitstickpstg

lpitstickpstg commented Jun 2, 2022

@alculquicondor Did this solve your problem? I've been seeing the same issue.
I thought we'd want to return false on line 276 to trigger the delete handler to cleanup the cache.


9 participants