Move a number of scheduler metrics to STABLE #106266
Conversation
@ahg-g: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. The instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/scheduler/metrics/metrics.go (outdated)
@@ -143,7 +143,7 @@ var (
  // Start with 0.01ms with the last bucket being [~22ms, Inf). We use a small factor (1.5)
  // so that we have better granularity since plugin latency is very sensitive.
  Buckets: metrics.ExponentialBuckets(0.00001, 1.5, 20),
- StabilityLevel: metrics.ALPHA,
+ StabilityLevel: metrics.STABLE,
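For context, the histogram touched by the diff above is declared roughly as follows. This is a sketch assuming the k8s.io/component-base/metrics API that the scheduler uses; the variable name, metric name, help text, and label set are illustrative and may differ from the actual definition in pkg/scheduler/metrics/metrics.go:

// PluginExecutionDuration is an illustrative reconstruction of the metric
// whose StabilityLevel is changed in the diff above.
PluginExecutionDuration = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem: "scheduler",
		Name:      "plugin_execution_duration_seconds",
		Help:      "Duration for running a plugin at a specific extension point.",
		// Start with 0.01ms with the last bucket being [~22ms, Inf). We use a small factor (1.5)
		// so that we have better granularity since plugin latency is very sensitive.
		Buckets:        metrics.ExponentialBuckets(0.00001, 1.5, 20),
		StabilityLevel: metrics.STABLE,
	},
	[]string{"plugin", "extension_point", "status"})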
Can you think of a way to validate if this metric is accurate and useful?
Currently, we post at most 1000 samples every second. Since there's no particular logic to decide which samples should proceed, I fear that the information might be skewed towards certain extension points.
It will not be skewed, because we either sample the whole cycle or not. We can increase the sampling rate if we wish to make it more accurate, but in the end it depends on the number of instances we are sampling from (i.e., the number of scheduling cycles), and that depends on the cluster.
I don't think it provides useful information fleet-wide, but when debugging a specific cluster that we know heavily uses a specific feature, this metric can provide useful information on how much overhead that feature adds to the scheduling cycle.
It will not be skewed because we either sample the whole cycle or not.
Why do you say that?
Because we don't sample based on the extension point: we either time all extension points or none of them in a scheduling cycle.
Also, the sampling is random
To recap from the offline discussion: there are two things controlling what gets recorded:
- a random sampling decision per scheduling cycle (kubernetes/pkg/scheduler/scheduler.go, line 441 in 6ac2d8e):
  state.SetRecordPluginMetrics(rand.Intn(100) < pluginMetricsSamplePercent)
- a buffer:
  bufferCh: make(chan *frameworkMetric, bufferSize),

My only concern is that the buffer might cause certain extension points to not be recorded because they are not called often; I think the most important one is the preemption PostFilter plugin. I would be more comfortable graduating this metric if we can ensure that the buffer is large enough to fit an entire scheduling cycle for a 5k-node cluster with the default percentageOfNodesToScore. (A sketch of how the two mechanisms interact follows below.)
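Here is a minimal, self-contained sketch of the two mechanisms being discussed. The names frameworkMetric, observePlugin, the constants, and the plugin/extension-point strings are illustrative and not taken from the real scheduler code; only the per-cycle random sampling check and the buffered channel mirror what is quoted above:

package main

import (
	"math/rand"
	"time"
)

// frameworkMetric is an illustrative stand-in for the value that the async
// metric recorder pushes onto its buffered channel.
type frameworkMetric struct {
	plugin         string
	extensionPoint string
	value          float64 // observed duration in seconds
}

const (
	pluginMetricsSamplePercent = 10   // illustrative sampling rate
	bufferSize                 = 1000 // illustrative buffer capacity
)

// Once the buffer is full, further samples are dropped. This is the source of
// the concern that rarely-called extension points (e.g. the preemption
// PostFilter plugin) might be under-represented in the histogram.
var bufferCh = make(chan *frameworkMetric, bufferSize)

// observePlugin records one plugin's latency, but only if the enclosing
// scheduling cycle was selected for sampling.
func observePlugin(record bool, plugin, extensionPoint string, d time.Duration) {
	if !record {
		return // the whole cycle is skipped, not individual extension points
	}
	m := &frameworkMetric{plugin: plugin, extensionPoint: extensionPoint, value: d.Seconds()}
	select {
	case bufferCh <- m: // buffered; a separate goroutine would flush these to the histogram
	default: // buffer full: the sample is silently dropped
	}
}

func main() {
	// The sampling decision is made once per scheduling cycle, mirroring
	// state.SetRecordPluginMetrics(rand.Intn(100) < pluginMetricsSamplePercent).
	recordPluginMetrics := rand.Intn(100) < pluginMetricsSamplePercent

	start := time.Now()
	// ... run a plugin for one extension point ...
	observePlugin(recordPluginMetrics, "ExamplePlugin", "Filter", time.Since(start))
}

Because the decision is made once per cycle, every extension point in a sampled cycle is timed, which is why the sampling itself does not skew the per-extension-point distribution; only the fixed-size buffer can drop samples.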
OK, I think we should just drop the graduation of this one for now. For the filter plugin, it would be ideal if we could measure across all nodes, not per node.
+1, that would greatly reduce the number of measuring points.
Thanks!
Looks good to me from an instrumentation perspective. Leaving it up to the scheduler owners to judge whether the metrics proposed for graduation are actually used.
You probably need to update
/lgtm
@alculquicondor unit test fixed
/lgtm
/retest
Also looks good to me; not adding the lgtm label since I'm not a reviewer now.
/assign @logicalhan to approve "test/instrumentation/testdata/OWNERS"
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, logicalhan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Marks a number of scheduler metrics as stable. These have been around for quite some time, and some users are already relying on them to create scheduler SLOs.
Which issue(s) this PR fixes:
Follow up to #105861
Special notes for your reviewer:
Does this PR introduce a user-facing change?