Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health checks for controller managers. #104667

Merged
merged 4 commits into from Sep 3, 2021
Merged

Health checks for controller managers. #104667

merged 4 commits into from Sep 3, 2021

Conversation

jiahuif
Copy link
Member

@jiahuif jiahuif commented Aug 30, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR is recreated from #95454 to support health checks for each controller, in controller managers.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

health check of kube-controller-manager now includes each controller.

That MAY be true for cloud-controller-manager. CCM SHOULD include health checks. However, impl. of CCM is up to each cloud provider.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 30, 2021
@fedebongio
Copy link
Contributor

/assign @cheftako @jpbetz @leilajal
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. area/cloudprovider and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 31, 2021
// HealthCheckable defines a controller that allows the controller manager
// to include it in the health checks.
//
// If a controller implements HealthCheckable, and the returned HealthChecker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor: These details seem to belong with the godoc for the below HealthChecker() func declaration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is both. At the type level, the point is that the controller still has a chance to disable health check even if it implements this interface.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add that to HealthChecker() too

if ctrl != nil {
if healthCheckable, ok := ctrl.(controller.HealthCheckable); ok {
if realCheck := healthCheckable.HealthChecker(); realCheck != nil {
check = realCheck
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could two controllers provide health checks with the same name? What would happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard healthz mechanism disallows that. It will panic when two checks try to mount on the same path. e.g. two checks named "foo", first mounts on "/healthz/foo", second mounts on the same path, which will panic (from implementation of mux).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortunately, if two controllers provide the same health checks, the controller manager would panic on start, catching the error during any e2e test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good. Would it possible for us to validate the names in the provider checks match the controller name? If not, what we have might be okay, but if we enforce the naming that would be great.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is another way. We won't let the controller decide the name (with a new interface instead of healthz.HealthChecker). The controller manager decide the name of a check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent.

@jpbetz
Copy link
Contributor

jpbetz commented Sep 2, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 2, 2021
@jpbetz
Copy link
Contributor

jpbetz commented Sep 2, 2021

/assign @cheftako I think this is ready for approval.

@cheftako
Copy link
Member

cheftako commented Sep 3, 2021

/lgtm
/approve

@jiahuif
Copy link
Member Author

jiahuif commented Sep 3, 2021

/assign @sttts

For vendor/OWNERS, could you take a look at the change in vendor/modules.txt? It is a reference to the new package added in this PR.

@liggitt
Copy link
Member

liggitt commented Sep 3, 2021

/approve
for vendor package

this doesn't actually implement any health checks yet, right?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, jiahuif, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 3, 2021
@jiahuif
Copy link
Member Author

jiahuif commented Sep 3, 2021

@liggitt
Right, this is for the health-check support. I would not have the expertise to add checks to any controller (especially these in KCM) that I did not maintain. As the code is shared between kube-controller-manager and cloud-controller-manager, I can go ahead and work on controllers within cloud-controller-manager.

@k8s-ci-robot k8s-ci-robot merged commit 30f3511 into kubernetes:master Sep 3, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 3, 2021
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants