client-go: deltaFIFO trace slow handlers #103917
Conversation
// Only log traces if the queue depth is greater than 10 and it takes more than
// 1 second to process one item from the queue.
I made up the number based on local experimentation: sending 300 events to the queue and adding a 1 second delay to one of the handlers, the depth never went higher than ~30 (but this was a fairly arbitrary experiment, as there can be many factors).
Nice, this would have helped me debug some things recently :)
utiltrace.Field{Key: "ID", Value: id}, | ||
utiltrace.Field{Key: "Depth", Value: depth}, | ||
utiltrace.Field{Key: "Reason", Value: "slow event handlers, please consider standard handler to workqueue pattern,"}) | ||
defer trace.LogIfLong(1 * time.Second) |
I think that 1s is too conservative. I agree that with depth we may want to keep that relatively low (for the reasons described above), but for the latency of the handler we want to be much more aggressive.
In general, wherever possible we're saying that handlers should never block and should be as fast as possible (in the majority of cases, just put the item onto a queue for offline processing). In heavily loaded clusters, we may easily get tens or hundreds of incoming events per second for different kinds of resources.
So I would suggest going much lower than that [I would say that 100ms is the maximum, but I would even go to 25-50ms].
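For reference, here is a minimal sketch of the handler-to-workqueue pattern referred to above, where the informer handlers only enqueue object keys and the slow work runs in a separate worker goroutine. The `setupHandlers` name and the `process` callback are illustrative, not part of this PR:

```go
package example

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// setupHandlers wires an informer to a workqueue: the event handlers only
// enqueue object keys, so DeltaFIFO's Pop is never blocked by slow processing.
func setupHandlers(informer cache.SharedIndexInformer, process func(key string)) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key) // cheap: just hand the key off
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})

	// A single worker for brevity; real controllers usually run several.
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			// The slow work happens here, off the informer's delivery path.
			process(key.(string))
			queue.Done(key)
		}
	}()

	return queue
}
```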
/lgtm
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aojea, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/triage accepted
I find that it doesn't solve the problem at the root, even if the informer handler is replaced with a FIFO queue.
/ok-to-test
/retest
This is not meant to solve the problem; it is just to inform the user where the problem is.
If the informer handlers are slow processing the objects, the DeltaFIFO
blocks the queue and the streamWatchers cannot add new elements to the
queue, creating contention and causing different problems, like high
memory usage.
The problem is not easy to identify from a user perspective. Typically
you can use pprof to identify high memory usage in the StreamWatchers
or some handler consuming most of the CPU time, but users should not
have to profile the Go binary in order to know that.
Metrics were disabled on the reflector because of memory leaks, and
monitoring the queue depth can't give a good signal either, since it never goes high.
However, we can trace slow handlers and inform users about the problem.
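For illustration, a minimal sketch (not the exact diff in this PR) of how such a trace can wrap the processing of a popped item, using k8s.io/utils/trace. The `popWithTrace` helper is hypothetical, and the trace name, the `Reason` string, the `depth > 10` guard and the 100ms threshold simply follow the numbers discussed in the review comments above:

```go
package example

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// popWithTrace is a simplified stand-in for DeltaFIFO.Pop: it wraps the
// processing of a single item in a trace and only logs it when the queue is
// deep and the handler took longer than the threshold.
func popWithTrace(id string, depth int, process func() error) error {
	if depth > 10 {
		trace := utiltrace.New("DeltaFIFO Pop Process",
			utiltrace.Field{Key: "ID", Value: id},
			utiltrace.Field{Key: "Depth", Value: depth},
			utiltrace.Field{Key: "Reason", Value: "slow event handlers blocking the queue"})
		defer trace.LogIfLong(100 * time.Millisecond)
	}
	return process()
}
```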
/kind cleanup
Fixes #103789