[PodSecurity] Metrics improvements #105898
Conversation
@tallclair: GitHub didn't allow me to request PR reviews from the following users: jyz0309. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@tallclair: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```go
		return "namespace"
	default:
		// Assume any other resource is a valid input to pod-security, and therefore a controller.
		return "controller"
```
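For context, the mapping under discussion can be sketched as a standalone helper. This is a hypothetical sketch: the function name and the plural resource strings are assumptions, not the actual code in the PodSecurity admission package.

```go
package main

import "fmt"

// resourceLabel maps a request's resource to the value recorded in the
// pod_security metric's resource label. Anything that isn't a pod or a
// namespace is assumed to carry a pod template, i.e. a controller.
func resourceLabel(resource string) string {
	switch resource {
	case "pods":
		return "pod"
	case "namespaces":
		return "namespace"
	default:
		// Assume any other resource is a valid input to pod-security,
		// and therefore a controller.
		return "controller"
	}
}

func main() {
	fmt.Println(resourceLabel("pods"))        // pod
	fmt.Println(resourceLabel("deployments")) // controller
}
```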
This matches the spec described in the KEP, but should we call this `workload` instead, to match the term used in the documentation?
I'm fine with there being some mismatches, especially if that's what's in the KEP.
The workload resources page is at https://kubernetes.io/docs/concepts/workloads/controllers/
Am I right that each of these "controller" checks is really looking at a pod template? If I were going to rename, I'd rename to something that mentions `pod_template`. Just a personal opinion here.
I don't feel strongly about the controller vs. workload label value. Calling it `pod_template` might be confusing, since there's a specific PodTemplate kind.
Overall lgtm. Since the metrics calls are pretty dispersed, how's the unit test coverage of the various places we record them?
Looking good from a metrics perspective.
```diff
@@ -117,6 +119,11 @@ func (s *Server) Start(ctx context.Context) error {
 	// debugging or proxy purposes. The API server will not connect to an http webhook.
 	mux.HandleFunc("/", s.HandleValidate)
 
+	// Serve the global metrics registry.
+	metrics.LegacyMustRegister()
```
Do we really need the global registry here? Since you already added the logic to handle the `/metrics` endpoint, it would be a good opportunity to move away from the global registry and add a local KubeRegistry. This would prevent being polluted by metrics registered to the legacyregistry that wouldn't be meaningful to the webhook.
Good call. I dumped the metrics from the webhook, and there was a lot of irrelevant stuff there. For now I'm using a new registry with just the pod_security metrics. We may want to add back some of the relevant pieces later though (go metrics, server metrics).
You can add the go metrics right now by registering the client_golang GoCollector.
As for server metrics, it is a bit more complex since they aren't dissociated from the legacyregistry, so if we really want them the best option would be to still serve metrics from the global registry. That's what we are doing in metrics-server, which is an aggregated API for which server metrics are really useful: https://github.com/kubernetes-sigs/metrics-server/blob/bfa76316118b4e4993fd16c1e6df501109b0cc18/pkg/server/config.go#L95-L110. So we end up serving metrics from both the legacyregistry and the KubeRegistry, but having them separated into two registries should make it easier to integrate server metrics once they are decoupled from the global registry.
```diff
 }
 
-func (r PrometheusRecorder) RecordEvaluation(decision Decision, policy api.LevelVersion, evalMode Mode, attrs api.Attributes) {
+func (r *PrometheusRecorder) MustRegister(registerFunc func(...metrics.Registerable)) {
+	r.registerOnce.Do(func() {
```
I don't have too much knowledge about the PodSecurity plugin, but can we really have multiple plugins registering the metrics on the same server?
Are you asking why `registerOnce` is necessary here? TBH, I'm not sure, but I kept getting panics in the integration testing. Maybe something to do with the way integration tests are run in process?
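The guard being discussed is a plain `sync.Once`. A minimal sketch of why it turns a duplicate registration (as can happen when an in-process integration test constructs the path twice) into a no-op instead of a panic; the `fakeRegistry` and metric name here are stand-ins:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeRegistry panics on duplicate registration, like a Prometheus registry.
type fakeRegistry struct{ seen map[string]bool }

func (r *fakeRegistry) MustRegister(name string) {
	if r.seen[name] {
		panic("duplicate registration: " + name)
	}
	r.seen[name] = true
}

// recorder guards registration with sync.Once, so a second MustRegister call
// is silently skipped rather than panicking.
type recorder struct{ registerOnce sync.Once }

func (rec *recorder) MustRegister(reg *fakeRegistry) {
	rec.registerOnce.Do(func() {
		reg.MustRegister("pod_security_evaluations_total")
	})
}

func main() {
	reg := &fakeRegistry{seen: map[string]bool{}}
	rec := &recorder{}
	rec.MustRegister(reg) // registers the metric
	rec.MustRegister(reg) // no-op instead of panic
	fmt.Println("ok")
}
```

Note the trade-off raised later in the thread: the same effect could be achieved by registering exactly once at the call site instead of inside the recorder.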
Yes, that's why I was asking, because from an external point of view it didn't seem possible for the metrics to be registered more than once. If you were seeing panics in the integration tests, I guess it means that it is relevant here 🤷‍♂️
It's not really the recorder's responsibility to limit itself to registering once... I suggested moving this to a single registration to the legacy registry in the kube-apiserver.
Ready for another round of reviews.
Updated tests to explicitly expect the cached metrics, and eliminated the dependency on the prometheus model.
/hold
Force-pushed from c543dc4 to ade88ed.
This is ready for another review.
needs `update-vendor.sh` run
lgtm otherwise
Done. Also squashed commits & rebased.
/hold cancel
/retest
Update pod security metrics to match the spec in the KEP.
Rebased. Also had to update the test to check specifically for
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: liggitt, tallclair. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
- `/metrics` endpoint on the webhook
- `operation` -> `request_operation` for the label name
- `pod` or `controller` (see comment below)

Which issue(s) this PR fixes:
Fixes #103207
Special notes for your reviewer:
Question 1: Do we want to consider the short-circuit case an evaluation, for monitoring purposes? Doing so adds a second alloc in the common path and doubles the evaluation time. See benchmarks below.
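The alloc/latency trade-off in Question 1 is the kind of thing Go's `testing.Benchmark` can quantify without a full test binary. A hypothetical harness: `evaluate` and its extra label-slice allocation are stand-ins for the real short-circuit path, not the PodSecurity code.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the allocation alive so escape analysis can't elide it.
var sink []string

// evaluate is a stand-in for the PodSecurity evaluation path.
// recordShortCircuit simulates recording the short-circuit case as an
// evaluation, which costs an extra allocation for the metric labels.
func evaluate(recordShortCircuit bool) {
	if recordShortCircuit {
		sink = make([]string, 4)
	}
}

func main() {
	for _, record := range []bool{false, true} {
		r := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				evaluate(record)
			}
		})
		fmt.Printf("record=%v: %d allocs/op\n", record, r.AllocsPerOp())
	}
}
```

Running both variants side by side makes the "second alloc in the common path" visible directly in the `allocs/op` column.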
Question 2: I was corrected in the docs PR (kubernetes/website#29124 (review), @sftim) that we should refer to the "pod template resources" as workload resources. Do we want to update the resource label on the metric to match (`controller` -> `workload`)?

Benchmarks:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig auth
/milestone v1.23
/assign @liggitt
/cc @jyz0309