Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve dynamic cert file change detection #104102

Merged
merged 1 commit into from Aug 5, 2021

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Aug 3, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

DynamicFileCAContent and DynamicCertKeyPairContent used periodical job to check whether the file content has changed, leading to 1 minute of delay in worst case. This patch improves it by leveraging fsnotify watcher. The content change will be reflected immediately.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is also a TODO @deads2k left in the code.

Does this PR introduce a user-facing change?

CA, certificate and key bundles for the generic-apiserver based servers will be reloaded immediately after the files are changed.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 3, 2021
@k8s-ci-robot k8s-ci-robot requested a review from enj August 3, 2021 14:04
@k8s-ci-robot k8s-ci-robot requested review from SataQiu and a team August 3, 2021 14:04
@k8s-ci-robot k8s-ci-robot added area/dependency Issues or PRs related to dependency changes area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 3, 2021
@caesarxuchao
Copy link
Member

/assign @deads2k

could you help triage? Thanks.

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 3, 2021

// TODO this can be wired to an fsnotifier as well.
// start the loop that watches the CA file until stopCh is closed.
go wait.NonSlidingUntil(func() { c.watchCAFile(stopCh) }, time.Second, stopCh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since watchCAFile only returns on error, let's make the retry every minute instead of every second to avoid warm loops.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why non-sliding? I think we want the timer to interval to start when the function ends, not when it starts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. watchCAFile also returns on file removal. The 1 second interval and non-sliding is for the mounted volume case: if a secret includes the CA bundle is mounted in a Pod, updating the secret leads to a file replacement, not content update. fsnotify watcher will receive a fsnotify.Remove event of the old file. If we use one minute as the interval and slidingUntil, it will restart the watch after one minute even the new file is ready immediately.
Maybe I could handle file removal in watchCAFile and start a new watch immediately without relying on the external loop. Then the external loop handles error case only and can wait longer interval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the change, PTAL.

@tnqn tnqn force-pushed the dynamic-file branch 2 times, most recently from bb24290 to 5eeae97 Compare August 4, 2021 06:07
if err := w.Add(c.filename); err != nil {
return fmt.Errorf("error adding watch for file %s: %v", c.filename, err)
}
c.queue.Add(workItemKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this before you remove the file and add it back. Doing it here means you could return an error and not trigger the workqueue for another minute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, a trigger after starting watch is required to prevent the same race condition above. There is also a trigger in L188, which will trigger reload as long as the current file has any update. The two triggers can be merged by using defer but the routine will need to be wrapped in a function then. Please let me know which way you would prefer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this a little to guarantee a resync will be triggered after any event is received and it will be executed after starting new watch if there is.

if err := w.Add(e.Name); err != nil {
return fmt.Errorf("error adding watch for file %s: %v", e.Name, err)
}
c.queue.Add(workItemKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remind me how cert/key mismatches are handled. A code reference comment right here about how it will handled when one is updated before the other would be helpful. Previously, at a one minute polling cycle it was pretty unlikely, but I'm hoping past-me did something clever.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You definitely handled this well: the certificate and key are validated before they are used. If either is missing or they don't match, an error will be returned and workqueue will reschedule it later with rate limiter. I will add comment to explain it.

func (c *DynamicCertKeyPairContent) loadCertKeyPair() error {
cert, err := ioutil.ReadFile(c.certFile)
if err != nil {
return err
}
key, err := ioutil.ReadFile(c.keyFile)
if err != nil {
return err
}
if len(cert) == 0 || len(key) == 0 {
return fmt.Errorf("missing content for serving cert %q", c.Name())
}
// Ensure that the key matches the cert and both are valid
_, err = tls.X509KeyPair(cert, key)
if err != nil {
return err
}

@dims dims removed the area/dependency Issues or PRs related to dependency changes label Aug 5, 2021
func (c *DynamicFileCAContent) handleWatchEvent(e fsnotify.Event, w *fsnotify.Watcher) error {
// This should be executed after restarting the watch (if applicable) to ensure no file event will be missing.
defer c.queue.Add(workItemKey)
if e.Op&(fsnotify.Remove|e.Op&fsnotify.Rename) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you'll be pushing an update anyway, the project standard is to invert this check and return nil, so i think that's

if condition <=0{
  return nil
}

w.remove
w.add

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, thanks for the suggestion.

@k8s-ci-robot k8s-ci-robot added the area/dependency Issues or PRs related to dependency changes label Aug 5, 2021
DynamicFileCAContent and DynamicCertKeyPairContent used periodical job
to check whether the file content has changed, leading to 1 minute of
delay in worst case. This patch improves it by leveraging fsnotify
watcher. The content change will be reflected immediately.
@deads2k
Copy link
Contributor

deads2k commented Aug 5, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, tnqn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit 98bd200 into kubernetes:master Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Aug 5, 2021
@tnqn tnqn deleted the dynamic-file branch August 6, 2021 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/dependency Issues or PRs related to dependency changes area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants