New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve dynamic cert file change detection #104102
Conversation
/assign @deads2k could you help triage? Thanks. /triage accepted |
|
||
// TODO this can be wired to an fsnotifier as well. | ||
// start the loop that watches the CA file until stopCh is closed. | ||
go wait.NonSlidingUntil(func() { c.watchCAFile(stopCh) }, time.Second, stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since watchCAFile
only returns on error, let's make the retry every minute instead of every second to avoid warm loops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why non-sliding? I think we want the timer to interval to start when the function ends, not when it starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. watchCAFile
also returns on file removal. The 1 second interval and non-sliding is for the mounted volume case: if a secret includes the CA bundle is mounted in a Pod, updating the secret leads to a file replacement, not content update. fsnotify watcher will receive a fsnotify.Remove
event of the old file. If we use one minute as the interval and slidingUntil, it will restart the watch after one minute even the new file is ready immediately.
Maybe I could handle file removal in watchCAFile
and start a new watch immediately without relying on the external loop. Then the external loop handles error case only and can wait longer interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the change, PTAL.
bb24290
to
5eeae97
Compare
if err := w.Add(c.filename); err != nil { | ||
return fmt.Errorf("error adding watch for file %s: %v", c.filename, err) | ||
} | ||
c.queue.Add(workItemKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this before you remove the file and add it back. Doing it here means you could return an error and not trigger the workqueue for another minute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, a trigger after starting watch is required to prevent the same race condition above. There is also a trigger in L188, which will trigger reload as long as the current file has any update. The two triggers can be merged by using defer
but the routine will need to be wrapped in a function then. Please let me know which way you would prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this a little to guarantee a resync will be triggered after any event is received and it will be executed after starting new watch if there is.
staging/src/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go
Show resolved
Hide resolved
if err := w.Add(e.Name); err != nil { | ||
return fmt.Errorf("error adding watch for file %s: %v", e.Name, err) | ||
} | ||
c.queue.Add(workItemKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind me how cert/key mismatches are handled. A code reference comment right here about how it will handled when one is updated before the other would be helpful. Previously, at a one minute polling cycle it was pretty unlikely, but I'm hoping past-me did something clever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You definitely handled this well: the certificate and key are validated before they are used. If either is missing or they don't match, an error will be returned and workqueue will reschedule it later with rate limiter. I will add comment to explain it.
kubernetes/staging/src/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_serving_content.go
Lines 79 to 96 in f1c8176
func (c *DynamicCertKeyPairContent) loadCertKeyPair() error { | |
cert, err := ioutil.ReadFile(c.certFile) | |
if err != nil { | |
return err | |
} | |
key, err := ioutil.ReadFile(c.keyFile) | |
if err != nil { | |
return err | |
} | |
if len(cert) == 0 || len(key) == 0 { | |
return fmt.Errorf("missing content for serving cert %q", c.Name()) | |
} | |
// Ensure that the key matches the cert and both are valid | |
_, err = tls.X509KeyPair(cert, key) | |
if err != nil { | |
return err | |
} |
func (c *DynamicFileCAContent) handleWatchEvent(e fsnotify.Event, w *fsnotify.Watcher) error { | ||
// This should be executed after restarting the watch (if applicable) to ensure no file event will be missing. | ||
defer c.queue.Add(workItemKey) | ||
if e.Op&(fsnotify.Remove|e.Op&fsnotify.Rename) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you'll be pushing an update anyway, the project standard is to invert this check and return nil, so i think that's
if condition <=0{
return nil
}
w.remove
w.add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, thanks for the suggestion.
DynamicFileCAContent and DynamicCertKeyPairContent used periodical job to check whether the file content has changed, leading to 1 minute of delay in worst case. This patch improves it by leveraging fsnotify watcher. The content change will be reflected immediately.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, tnqn The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
DynamicFileCAContent and DynamicCertKeyPairContent used periodical job to check whether the file content has changed, leading to 1 minute of delay in worst case. This patch improves it by leveraging fsnotify watcher. The content change will be reflected immediately.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This is also a TODO @deads2k left in the code.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: