
client-go: deltaFIFO trace slow handlers #103917

Merged
1 commit merged on Aug 5, 2021

Conversation

@aojea (Member) commented Jul 26, 2021

If the informer handlers are slow at processing objects, the DeltaFIFO
blocks the queue and the StreamWatchers cannot add new elements to it,
creating contention and causing various problems, such as high memory usage.
The problem is not easy to identify from a user perspective: typically
you can use pprof to spot high memory usage in the StreamWatchers or a
handler consuming most of the CPU time, but users should not have to
profile the Go binary to find that out.

Metrics were disabled on the reflector because of memory leaks, and
monitoring the queue depth alone does not give a good signal either, since it never grows very high.

However, we can trace slow handlers and inform users about the problem.

/kind cleanup

Fixes #103789

troubleshooting: informers log handlers that take more than 100 milliseconds to process an object if the DeltaFIFO queue starts to grow beyond 10 elements.
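For readers who want to see the shape of the change: the PR wires the tracing helper from k8s.io/utils/trace into the DeltaFIFO Pop path. Below is a minimal sketch of that pattern, assuming illustrative names (popAndProcess, id, depth and process are stand-ins around a handler call, not the exact client-go diff):

package main

import (
	"fmt"
	"time"

	utiltrace "k8s.io/utils/trace"
)

// popAndProcess is a hypothetical stand-in for the DeltaFIFO Pop path.
// It starts a trace only when the queue is already backing up (depth > 10)
// and logs it only if handling one item takes longer than 100ms, matching
// the thresholds described in the release note above.
func popAndProcess(id string, depth int, item interface{}, process func(interface{}) error) error {
	if depth > 10 {
		trace := utiltrace.New("DeltaFIFO Pop Process",
			utiltrace.Field{Key: "ID", Value: id},
			utiltrace.Field{Key: "Depth", Value: depth},
			utiltrace.Field{Key: "Reason", Value: "slow event handlers, please consider standard handler to workqueue pattern"})
		defer trace.LogIfLong(100 * time.Millisecond)
	}
	return process(item)
}

func main() {
	// A deliberately slow handler triggers the trace output when popAndProcess runs.
	slow := func(obj interface{}) error {
		time.Sleep(200 * time.Millisecond)
		fmt.Println("processed", obj)
		return nil
	}
	_ = popAndProcess("default/example", 12, "some-delta", slow)
}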

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 26, 2021
@aojea (Member, author) commented Jul 26, 2021

/assign @deads2k @wojtek-t

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 26, 2021
Comment on lines 542 to 543
// Only log traces if the queue depth is greater than 10 and it takes more than
// 1 second to process one item from the queue.
@aojea (Member, author) commented Jul 26, 2021

I made up the number based on local experimentation: sending 300 events to the queue with a 1-second delay in one of the handlers, the depth never went higher than ~30 (but this is a fairly arbitrary experiment, as there can be many factors).
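A rough way to reproduce that kind of experiment (a sketch only; the fake clientset, the ConfigMaps resource, and the timings are assumptions, not the exact setup used here) is to register a deliberately slow handler on a shared informer and flood it with objects:

package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 0)

	// A handler that takes ~1s per event, mimicking the slow handler in the
	// experiment described in the comment above.
	factory.Core().V1().ConfigMaps().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			time.Sleep(1 * time.Second)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Send ~300 events; with the tracing from this PR in place, slow-handler
	// traces would show up in the logs while the queue backs up.
	for i := 0; i < 300; i++ {
		cm := &v1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("cm-%d", i), Namespace: "default"}}
		if _, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}

	// Let the handler work through part of the backlog before exiting.
	time.Sleep(10 * time.Second)
}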

@dcbw (Member) commented Jul 27, 2021

Nice, this would have helped me debug some things recently :)
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
utiltrace.Field{Key: "ID", Value: id},
utiltrace.Field{Key: "Depth", Value: depth},
utiltrace.Field{Key: "Reason", Value: "slow event handlers, please consider standard handler to workqueue pattern,"})
defer trace.LogIfLong(1 * time.Second)
A reviewer (Member) commented on the lines above:
I think that 1s is too conservative. I agree that for depth we may want to keep the threshold relatively low (for the reasons described above), but for the latency of a handler we want to be much more aggressive.

In general, wherever possible we say that a handler should never block and should be as fast as possible (in the majority of cases it should just put an item on a queue for offline processing). In heavily loaded clusters, we may easily get tens or hundreds of incoming events per second for different kinds of resources.

So I would suggest going much lower than that [I would say that 100ms is the maximum, but I would even go to 25-50ms].
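The "standard handler to workqueue pattern" named in the trace message looks roughly like this (a minimal sketch under the usual client-go conventions; the queue setup and worker loop are illustrative, not code from this PR):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The event handlers only enqueue a key and return immediately, so the
	// informer's delivery loop (and the DeltaFIFO behind it) is never blocked
	// by expensive per-object work.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	handler := cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	}
	_ = handler // in a real controller: informer.AddEventHandler(handler)

	// The slow work happens in a worker goroutine that drains the workqueue,
	// instead of inside the event handler itself.
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			func() {
				defer queue.Done(key)
				time.Sleep(500 * time.Millisecond) // stand-in for the real reconcile work
				fmt.Println("processed", key)
				queue.Forget(key)
			}()
		}
	}()

	queue.Add("default/example") // toy item so the sketch does something when run
	time.Sleep(time.Second)
	queue.ShutDown()
}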

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
@wojtek-t (Member) commented:
/lgtm
/approve

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2021
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2021
@fedebongio (Contributor) commented:
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 27, 2021
@evanlixin commented:
I find that it doesn't solve the problem at the root, even if the informer handler is replaced with a FIFO queue.

@dims (Member) commented Aug 5, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 5, 2021
@aojea (Member, author) commented Aug 5, 2021

/retest

@aojea (Member, author) commented Aug 5, 2021

I find that it doesn't solve the problem at the root, even if the informer handler is replaced with a FIFO queue.

This is not meant to solve the problem, it is just there to inform the user where the problem is.

Development

Successfully merging this pull request may close these issues.

StreamWatcher memory leak ?