Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move a number of scheduler metrics to STABLE #106266

Merged
merged 1 commit into from Nov 11, 2021

Conversation

ahg-g
Copy link
Member

@ahg-g ahg-g commented Nov 9, 2021

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Marks a number of scheduler metrics as stable. Those has been around for quite sometime and some are using them to create scheduler SLOs

Which issue(s) this PR fixes:

Follow up to #105861

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Graduating `pod_scheduling_duration_seconds`, `pod_scheduling_attempts`, `framework_extension_point_duration_seconds`, `plugin_execution_duration_seconds` and `queue_incoming_pods_total` metrics to stable.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/documentation Categorizes issue or PR as related to documentation. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 9, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2021
@k8s-ci-robot
Copy link
Contributor

@ahg-g: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 9, 2021
@@ -143,7 +143,7 @@ var (
// Start with 0.01ms with the last bucket being [~22ms, Inf). We use a small factor (1.5)
// so that we have better granularity since plugin latency is very sensitive.
Buckets: metrics.ExponentialBuckets(0.00001, 1.5, 20),
StabilityLevel: metrics.ALPHA,
StabilityLevel: metrics.STABLE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of a way to validate if this metric is accurate and useful?

Currently, we post at most 1000 samples every second. Since there's no particular logic to decide which samples should proceed, I fear that the information might be skewed towards certain extension point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not be skewed because we either sample the whole cycle or not. We can increase the sampling rate if we wish to make it more accurate, but in the end it depends on the number of instances we are sampling from (so number of scheduling cycles), and that depends on the cluster.

I don't think it provides useful information fleet-wide, but when debugging a specific cluster that we know uses heavily a specific feature, this metric can provide useful information on how much time that feature is adding overhead on the scheduling cycle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not be skewed because we either sample the whole cycle or not.

Why do you say that?

Copy link
Member Author

@ahg-g ahg-g Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't sample based on the extension point, we either time for all extension points or for none of them in a scheduling cycle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the sampling is random

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to recap from offline discussion: There are 2 things controlling what gets recorded:

My only concern is that the buffer might cause certain extension points to not be recorded because they are not called often. I think the most important one is the preemption PostFilter plugin. I would be more comfortable graduating this metric if we can ensure that the buffer is enough to fit an entire scheduling cycle for a 5k cluster with a default percentageOfNodesToScore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think we should just drop the graduation of this one for now, for the filter plugin, it would be ideal if we can measure for all nodes, not per node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 that would reduce a lot of measuring points.

Thanks!

@dgrisonnet
Copy link
Member

Looks good to me from an instrumentation perspective.

Leaving it up to scheduler owners to judge whether the metrics that are proposed for graduation are actually used.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 9, 2021
@alculquicondor
Copy link
Member

You probably need to update test/instrumentation/testdata/stable-metrics-list.yaml

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 9, 2021
@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2021
@ahg-g
Copy link
Member Author

ahg-g commented Nov 10, 2021

@alculquicondor unit test fixed

@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2021
@ahg-g
Copy link
Member Author

ahg-g commented Nov 10, 2021

/retest

@kerthcet
Copy link
Member

Also looks good to me, not label lgtm for I'm not a reviewer now.

@ahg-g
Copy link
Member Author

ahg-g commented Nov 11, 2021

/assign @logicalhan

to approve "test/instrumentation/testdata/OWNERS"

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 11, 2021
@ahg-g
Copy link
Member Author

ahg-g commented Nov 11, 2021

/retest

@k8s-ci-robot k8s-ci-robot merged commit c9a245f into kubernetes:master Nov 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants