Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: Admission must exclude completed pods and avoid races #104577

Merged

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Aug 25, 2021

Fixes two issues with how the pod worker refactor calculated the
pods that admission could see (GetActivePods() and
filterOutTerminatedPods())

First, completed pods must be filtered from the "desired" state
for admission, which arguably should be happening earlier in
config. Exclude the two terminal pods states from GetActivePods()

Second, the previous check introduced with the pod worker lifecycle
ownership changes was subtly wrong for the admission use case.
Admission has to include pods that haven't yet hit the pod worker,
which CouldHaveRunningContainers was filtering out (because the
pod worker hasn't seen them). Introduce a weaker check -
IsPodKnownTerminated() - that returns true only if the pod is in
a known terminated state (no running containers AND known to pod
worker). This weaker check may only be called from components that
need admitted pods, not other kubelet subsystems.

This commit does not fix the long standing bug that force deleted
pods are omitted from admission checks, which must be fixed by
having GetActivePods() also include pods "still terminating".

/kind bug
/kind regression

Fixes #104560

Fix a 1.22 regression where the Kubelet failed to exclude already completed pods from calculations about how many resources it was currently using when deciding whether to allow more pods.

Needs additional e2es to prevent future regressions.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 25, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2021
@rphillips
Copy link
Member

/triage accepted
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 25, 2021
@smarterclayton
Copy link
Contributor Author

Still needs a unit test to exercise the admission filter here. Can add the node e2e here or separately for completed pod (although for regression purposes I'd be ok with that separate)

@rphillips
Copy link
Member

rphillips commented Aug 25, 2021

/hold for the test or top level approver

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 25, 2021
@smarterclayton smarterclayton changed the title kubelet: Admission must exclude completed tests and avoid races kubelet: Admission must exclude completed pods and avoid races Aug 25, 2021
Fixes two issues with how the pod worker refactor calculated the
pods that admission could see (GetActivePods() and
filterOutTerminatedPods())

First, completed pods must be filtered from the "desired" state
for admission, which arguably should be happening earlier in
config. Exclude the two terminal pods states from GetActivePods()

Second, the previous check introduced with the pod worker lifecycle
ownership changes was subtly wrong for the admission use case.
Admission has to include pods that haven't yet hit the pod worker,
which CouldHaveRunningContainers was filtering out (because the
pod worker hasn't seen them). Introduce a weaker check -
IsPodKnownTerminated() - that returns true only if the pod is in
a known terminated state (no running containers AND known to pod
worker). This weaker check may only be called from components that
need admitted pods, not other kubelet subsystems.

This commit does not fix the long standing bug that force deleted
pods are omitted from admission checks, which must be fixed by
having GetActivePods() also include pods "still terminating".
@bobbypage
Copy link
Member

bobbypage commented Aug 25, 2021

Thanks @smarterclayton for the quick investigation and proposed fix.

I validated your PR against the minimal repro test created in #104560 (comment) and can confirm the PR here results in pods no longer being rejected by kubelet.

$ gh pr checkout 104577
# cherry-pick e2e test (https://github.com/bobbypage/kubernetes/blob/b8e2c5bec8f12d866493ecfccede3669377a7f0c/test/e2e/scheduling/job-completions.go)
$ git cherry-pick b8e2c5bec8f12d866493ecfccede3669377a7f0c
$ kind build node-image
$ kind create cluster --config ${HOME}/dev_scripts/kind-config-e2e.yaml --image kindest/node:latest
$ ./_output/bin/e2e.test -kubeconfig ${HOME}/.kube/kind-test-config -ginkgo.focus="\[Feature:JobWithCompletions\]" --num-nodes=1 2>&1 | tee -i "/tmp/build-log.txt"

Test passes - build log - https://gist.github.com/c59e092d6befdb74fab35caeed2571c1

if kl.podWorkers.IsPodKnownTerminated(p.UID) {
continue
}
if p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious for my own understanding, why is it necessary to also check the pod phase in addition to IsPodKnownTerminated?

Is it not guaranteed that if status.IsTerminated() for the pod worker then the pod is in succeeded or failed phase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pod worker is not guaranteed to track terminal pods - once a pod is terminal, the worker shuts down and then the metadata is eventually removed.

@rphillips
Copy link
Member

@bobbypage I ported your test case over into #104611. Any other changes you would like to see?

@harche
Copy link
Contributor

harche commented Aug 30, 2021

@smarterclayton @bobbypage @rphillips I was able to write a node e2e to test this PR - #104658

@mrunalp
Copy link
Contributor

mrunalp commented Aug 30, 2021

/lgtm

Tagging this and then we can pick one of the e2e PRs on top.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 30, 2021
@bobbypage
Copy link
Member

bobbypage commented Aug 30, 2021

Thanks @harche for creating node e2e test.

I suggest we proceed with merging this PR as is and then either pick up @harche 's node e2e test as followup or the original one I made as @rphillips suggested in #104611.

@bobbypage
Copy link
Member

/lgtm

@rphillips
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 30, 2021
@k8s-ci-robot k8s-ci-robot merged commit bbbeceb into kubernetes:master Aug 30, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Aug 30, 2021
k8s-ci-robot added a commit that referenced this pull request Aug 31, 2021
…04577-upstream-release-1.22

Automated cherry pick of #104577: kubelet: Admission must exclude completed pods and avoid
@harche
Copy link
Contributor

harche commented Aug 31, 2021

Thanks @harche for creating node e2e test.

I suggest we proceed with merging this PR as is and then either pick up @harche 's node e2e test as followup or the original one I made as @rphillips suggested in #104611.

@bobbypage I closed the with node e2e because it wasn't really reproducing the scenario correctly, I would vote for your original test case which went through scheduler which is now part of this PR, #104611

@smarterclayton
Copy link
Contributor Author

Found another gap, #104817 is also needed when a pod is rejected. The test we add also needs to test that rejected pods (not just completed pods) are part of admission failures.

Once we verified #104817 it would also need to be backported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1.22 regression: Kubelet rejects pods that use resources that should be freed by completed pods
6 participants