
client-go: deltaFIFO trace slow handlers #103917

Merged
1 commit merged on Aug 5, 2021

Conversation

@aojea (Member) commented Jul 26, 2021

If the informer handlers are slow at processing objects, the DeltaFIFO
blocks the queue and the StreamWatchers cannot add new elements to it,
creating contention and causing various problems, such as high memory usage.
The problem is not easy to identify from a user perspective: typically
you can use pprof to spot high memory usage in the StreamWatchers or a
handler consuming most of the CPU time, but users should not have to
profile the Go binary to find that out.

Metrics were disabled on the reflector because of memory leaks, and
monitoring the queue depth alone does not give a good signal either, since it never grows very high.

However, we can trace slow handlers and inform users about the problem.

/kind cleanup

Fixes #103789

troubleshooting: informers log handlers that take more than 100 milliseconds to process an object if the DeltaFIFO queue starts to grow beyond 10 elements.
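For readers who want to see the shape of the change: the PR wires the tracing helper from k8s.io/utils/trace into the DeltaFIFO Pop path. Below is a minimal sketch of that pattern, assuming illustrative names (popAndProcess, id, depth and process are stand-ins around a handler call, not the exact client-go diff):

package main

import (
	"fmt"
	"time"

	utiltrace "k8s.io/utils/trace"
)

// popAndProcess is a hypothetical stand-in for the DeltaFIFO Pop path.
// It starts a trace only when the queue is already backing up (depth > 10)
// and logs it only if handling one item takes longer than 100ms, matching
// the thresholds described in the release note above.
func popAndProcess(id string, depth int, item interface{}, process func(interface{}) error) error {
	if depth > 10 {
		trace := utiltrace.New("DeltaFIFO Pop Process",
			utiltrace.Field{Key: "ID", Value: id},
			utiltrace.Field{Key: "Depth", Value: depth},
			utiltrace.Field{Key: "Reason", Value: "slow event handlers, please consider standard handler to workqueue pattern"})
		defer trace.LogIfLong(100 * time.Millisecond)
	}
	return process(item)
}

func main() {
	// A deliberately slow handler triggers the trace output when popAndProcess runs.
	slow := func(obj interface{}) error {
		time.Sleep(200 * time.Millisecond)
		fmt.Println("processed", obj)
		return nil
	}
	_ = popAndProcess("default/example", 12, "some-delta", slow)
}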

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 26, 2021
@aojea (Member, author) commented Jul 26, 2021

/assign @deads2k @wojtek-t

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 26, 2021
Comment on lines 542 to 543
// Only log traces if the queue depth is greater than 10 and it takes more than
// 1 second to process one item from the queue.
@aojea (Member, author) commented Jul 26, 2021

I made up the number based on local experimentation: sending 300 events to the queue with a 1-second delay in one of the handlers, the depth never went higher than ~30 (but this is a fairly arbitrary experiment, as there can be many factors).
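A rough way to reproduce that kind of experiment (a sketch only; the fake clientset, the ConfigMaps resource, and the timings are assumptions, not the exact setup used here) is to register a deliberately slow handler on a shared informer and flood it with objects:

package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)

	// A handler that takes ~1s per event, mimicking the slow handler in the
	// experiment described in the comment above.
	factory.Core().V1().ConfigMaps().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			time.Sleep(1 * time.Second)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Send ~300 events; with the tracing from this PR in place, slow-handler
	// traces would show up in the logs while the queue backs up.
	for i := 0; i < 300; i++ {
		cm := &v1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("cm-%d", i), Namespace: "default"}}
		if _, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}

	// Let the handler work through part of the backlog before exiting.
	time.Sleep(10 * time.Second)
}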

@dcbw (Member) commented Jul 27, 2021

Nice, this would have helped me debug some things recently :)
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
utiltrace.Field{Key: "ID", Value: id},
utiltrace.Field{Key: "Depth", Value: depth},
utiltrace.Field{Key: "Reason", Value: "slow event handlers, please consider standard handler to workqueue pattern,"})
defer trace.LogIfLong(1 * time.Second)
A reviewer (Member) commented on the lines above:
I think that 1s is too conservative. I agree that for depth we may want to keep the threshold relatively low (for the reasons described above), but for the latency of a handler we want to be much more aggressive.

In general, wherever possible we say that a handler should never block and should be as fast as possible (in the majority of cases it should just put an item on a queue for offline processing). In heavily loaded clusters, we may easily get tens or hundreds of incoming events per second for different kinds of resources.

So I would suggest going much lower than that [I would say that 100ms is the maximum, but I would even go to 25-50ms].
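The "standard handler to workqueue pattern" named in the trace message looks roughly like this (a minimal sketch under the usual client-go conventions; the queue setup and worker loop are illustrative, not code from this PR):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The event handlers only enqueue a key and return immediately, so the
	// informer's delivery loop (and the DeltaFIFO behind it) is never blocked
	// by expensive per-object work.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	handler := cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	}
	_ = handler // in a real controller: informer.AddEventHandler(handler)

	// The slow work happens in a worker goroutine that drains the workqueue,
	// instead of inside the event handler itself.
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			func() {
				defer queue.Done(key)
				time.Sleep(500 * time.Millisecond) // stand-in for the real reconcile work
				fmt.Println("processed", key)
				queue.Forget(key)
			}()
		}
	}()

	queue.Add("default/example") // toy item so the sketch does something when run
	time.Sleep(time.Second)
	queue.ShutDown()
}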

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
@wojtek-t (Member) commented:
/lgtm
/approve

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2021
@fedebongio (Contributor) commented:
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 27, 2021
@evanlixin commented:
I find that it doesn't solve the problem at the root, even if the informer handler is replaced with a FIFO queue.

@dims (Member) commented Aug 5, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 5, 2021
@aojea (Member, author) commented Aug 5, 2021

/retest

@aojea (Member, author) commented Aug 5, 2021

I find that it doesn't solve the problem at the root, even if the informer handler is replaced with a FIFO queue.

This is not meant to solve the problem, it is just there to inform the user where the problem is.

Development

Successfully merging this pull request may close these issues.

StreamWatcher memory leak ?