Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure terminal pods maintain terminal status #105462

Merged
merged 3 commits into from Oct 5, 2021

Conversation

ehashman
Copy link
Member

@ehashman ehashman commented Oct 4, 2021

Alternative to #105411. This is a slightly tighter patch and appears to also fix the issue.

What type of PR is this?

/kind bug
/sig node
/priority critical-urgent

What this PR does / why we need it:

Avoids overwriting statuses or phase on terminated pods.

Which issue(s) this PR fixes:

Fixes #105358

Special notes for your reviewer

This avoids an isTerminated check, which we may not need. We can fix the container restart count issue separately.

Does this PR introduce a user-facing change?

Evicted and other terminated pods will no longer revert to Running phase

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2021
@ehashman
Copy link
Member Author

ehashman commented Oct 4, 2021

/test pull-kubernetes-node-kubelet-eviction
/test pull-kubernetes-node-kubelet-serial

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 4, 2021
@@ -622,6 +622,64 @@ func TestTerminatePod(t *testing.T) {
assert.Equal(t, newStatus.Message, firstStatus.Message)
}

func TestTerminatePodWaiting(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically copied from test above, but changed (init)container 3 to waiting status

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 4, 2021
@ehashman ehashman changed the title [WIP] Ensure terminal pods maintain terminal status Ensure terminal pods maintain terminal status Oct 4, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 4, 2021
@rphillips
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2021
@ehashman ehashman added this to Needs Reviewer in SIG Node PR Triage Oct 4, 2021
@rphillips
Copy link
Member

/retest-required

@ehashman
Copy link
Member Author

ehashman commented Oct 5, 2021

Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.668493     248 config.go:383] "Receiving a new pod" pod="pods-7829/pod-submit-status-0-13"
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669167     248 kubelet.go:2055] "SyncLoop ADD" source="api" pods=[pods-7829/pod-submit-status-0-13]
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669398     248 pod_workers.go:535] "Pod is being synced for the first time" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669503     248 pod_workers.go:742] "Processing pod event" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7 updateType=0
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669541     248 kubelet.go:1533] "syncPod enter" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669559     248 kubelet_pods.go:1437] "Generating pod status" pod="pods-7829/pod-submit-status-0-13"
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.669600     248 kubelet_pods.go:1449] "Got phase for pod" pod="pods-7829/pod-submit-status-0-13" oldPhase=Pending phase=Pending
Oct 05 17:41:03 kind-worker kubelet[248]: I1005 17:41:03.694537     248 kubelet.go:2071] "SyncLoop DELETE" source="api" pods=[pods-7829/pod-submit-status-0-13]

The node got a DELETE before it even started to spin up the pod... looks like it wasn't able to sync out the final status due to client-side throttling:

Oct 05 17:41:05 kind-worker kubelet[248]: I1005 17:41:05.527801     248 pod_workers.go:931] "Pod is complete and the worker can now stop" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7
Oct 05 17:41:05 kind-worker kubelet[248]: I1005 17:41:05.527832     248 pod_workers.go:811] "Processing pod event done" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7 updateType=2
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.645158     248 request.go:597] Waited for 386.835047ms due to client-side throttling, not priority and fairness, request: GET:https://kind-control-plane:6443/api/v1/namespaces/pods-7829/pods/pod-submit-status-0-13
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.843151     248 request.go:597] Waited for 154.773669ms due to client-side throttling, not priority and fairness, request: PATCH:https://kind-control-plane:6443/api/v1/namespaces/pods-7829/pods/pod-submit-status-0-13/status
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.860368     248 status_manager.go:622] "Patch status for pod" pod="pods-7829/pod-submit-status-0-13" patch="{\"metadata\":{\"uid\":\"a409f0f6-ac57-4d5e-95e1-97e074ea74b7\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-10-05T17:41:03Z\",\"status\":\"True\",\"type\":\"Initialized\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-10-05T17:41:03Z\",\"message\":\"containers with unready status: [busybox]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-10-05T17:41:03Z\",\"message\":\"containers with unready status: [busybox]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"k8s.gcr.io/e2e-test-images/busybox:1.29-2\",\"imageID\":\"\",\"lastState\":{},\"name\":\"busybox\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"terminated\":{\"exitCode\":137,\"finishedAt\":null,\"message\":\"The container could not be located when the pod was terminated\",\"reason\":\"ContainerStatusUnknown\",\"startedAt\":null}}}],\"hostIP\":\"172.18.0.2\",\"podIP\":\"10.244.1.111\",\"podIPs\":[{\"ip\":\"10.244.1.111\"}],\"startTime\":\"2021-10-05T17:41:03Z\"}}"
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.860488     248 status_manager.go:631] "Status for pod updated successfully" pod="pods-7829/pod-submit-status-0-13" statusVersion=4 status={Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-10-05 17:41:03 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-10-05 17:41:03 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [busybox]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-10-05 17:41:03 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [busybox]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-10-05 17:41:03 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.18.0.2 PodIP:10.244.1.111 PodIPs:[{IP:10.244.1.111}] StartTime:2021-10-05 17:41:03 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:busybox State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:ContainerStatusUnknown,Message:The container could not be located when the pod was terminated,StartedAt:0001-01-01 00:00:00 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,ContainerID:,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:k8s.gcr.io/e2e-test-images/busybox:1.29-2 ImageID: ContainerID: Started:0xc0014d25a8}] QOSClass:Burstable EphemeralContainerStatuses:[]}
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.860884     248 kubelet_pods.go:951] "Pod is terminated and all resources are reclaimed" pod="pods-7829/pod-submit-status-0-13"
Oct 05 17:41:09 kind-worker kubelet[248]: I1005 17:41:09.861840     248 kubelet.go:2068] "SyncLoop RECONCILE" source="api" pods=[pods-7829/pod-submit-status-0-13]
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.042381     248 request.go:597] Waited for 181.389859ms due to client-side throttling, not priority and fairness, request: DELETE:https://kind-control-plane:6443/api/v1/namespaces/pods-7829/pods/pod-submit-status-0-13
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.102100     248 kubelet.go:2071] "SyncLoop DELETE" source="api" pods=[pods-7829/pod-submit-status-0-13]
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.102555     248 pod_workers.go:557] "Pod is finished processing, no further updates" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.116535     248 status_manager.go:650] "Pod fully terminated and removed from etcd" pod="pods-7829/pod-submit-status-0-13"
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.119678     248 kubelet.go:2065] "SyncLoop REMOVE" source="api" pods=[pods-7829/pod-submit-status-0-13]
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.119730     248 kubelet.go:1910] "Pod has been deleted and must be killed" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7
Oct 05 17:41:10 kind-worker kubelet[248]: I1005 17:41:10.119758     248 pod_workers.go:557] "Pod is finished processing, no further updates" pod="pods-7829/pod-submit-status-0-13" podUID=a409f0f6-ac57-4d5e-95e1-97e074ea74b7

This just seems like accidental bad timing to me and unrelated to the PR... thoughts?

Attached:
pod-submit-status-0-13.log

from https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/105462/pull-kubernetes-e2e-kind/1445439136864407552/artifacts/logs/kind-worker/kubelet.log

- Update test to account for late synced statuses
- Terminated containers are not running
- Add missing format values to test logline
@@ -425,7 +427,7 @@ var _ = SIGDescribe("Pods Extended", func() {
if !hasTerminalPhase {
var names []string
for _, status := range pod.Status.ContainerStatuses {
if status.State.Terminated != nil || status.State.Running != nil {
if status.State.Running != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was just wrong before, but we never hit it because we were failing to sync out terminated container statuses. Now that we're doing that correctly, a Pending pod with a terminated container (perhaps because it was deleted before it reached Running) would cause this test to fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this line came from #88440 which explicitly says in the description "The kubelet will consider any container without the waiting or terminated state to be still running, like it does for other operations." which doesn't match the code here.

@ehashman
Copy link
Member Author

ehashman commented Oct 5, 2021

/test pull-kubernetes-node-kubelet-eviction
/test pull-kubernetes-node-kubelet-serial

@ehashman
Copy link
Member Author

ehashman commented Oct 5, 2021

Passed tests 🥳

The serial tests will fail, just wanna take a quick look for signal.

@smarterclayton
Copy link
Contributor

Are there any other optional jobs we should run?

@ehashman
Copy link
Member Author

ehashman commented Oct 5, 2021

Don't think so.

@smarterclayton
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit 907d62e into kubernetes:master Oct 5, 2021
SIG Node PR Triage automation moved this from Needs Reviewer to Done Oct 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Oct 5, 2021
k8s-ci-robot added a commit that referenced this pull request Oct 8, 2021
…5462-upstream-release-1.22

Automated cherry pick of #105462: Ensure terminal pods maintain terminal status
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

Pods go into error state after kubelet-initiated eviction
5 participants