Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apiserver: graceful termination with new option shutdown-send-retry-after #101257

Merged
merged 2 commits into from Aug 10, 2021

Conversation

tkashem
Copy link
Contributor

@tkashem tkashem commented Apr 19, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

Introduce a new server run option shutdown-send-retry-after

If true the HTTP Server will continue listening until all non longrunning request(s) in flight 
have been drained, during this window all incoming requests will be rejected with
a status code 429 and a 'Retry-After' response header.

The default value is set to false to maintain current behavior.

With this option, we now have two modes of graceful termination:

  • shutdown-send-retry-after=true: we initiate shutdown of the HTTP Server when all non longrunning requests in-flight have been drained.
  • shutdown-send-retry-after=false: we initiate shutdown of the HTTP Server as soon as shutdown-delay-duration has elapsed. (This is the current behavior)

The following happens with shutdown-send-retry-after=true:

  • Do not call http Server.Shutdown immediately after ShutdownDelayDuration elapses. Rather, wait for all non longrunning requests to drain. (the requests that are being counted by WithWaitGroup filter)
  • Immediately after ShutdownDelayDuration elapses, any new incoming request should be answered with a 429 and the following headers:
  • Connection: close and
  • Retry-After: N

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

a new server run option 'shutdown-send-retry-after'  has been introduced. If true the HTTP Server
will continue listening until all non longrunning request(s) in flight have been drained, during this window all 
incoming requests will be rejected with a status code 429 and a 'Retry-After' response header.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Apr 19, 2021
@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 19, 2021
@tkashem
Copy link
Contributor Author

tkashem commented Apr 19, 2021

/test all

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 19, 2021
@tkashem
Copy link
Contributor Author

tkashem commented Apr 19, 2021

/assign @smarterclayton

@tkashem tkashem changed the title [WIP]: graceful termination [WIP]: graceful termination with delayed server shutdown Apr 19, 2021
@smarterclayton
Copy link
Contributor

Hrm, I'd prefer 429 - try again (every request after shutdown we reject is " come back later) and connection close ensures that.

@tkashem
Copy link
Contributor Author

tkashem commented Apr 19, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2021
@tkashem tkashem changed the title [WIP]: graceful termination with delayed server shutdown [WIP]: [DO NOT REVIEW] graceful termination with delayed server shutdown May 11, 2021
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 11, 2021
@tkashem tkashem force-pushed the graceful-termintaion branch 2 times, most recently from ffc30cb to f8c0a38 Compare May 12, 2021 18:32

func (f retryAfterConditionFunc) ShouldRespondWithRetryAfter() (*retryAfterParams, bool) {
return f()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd construct. Just use retryAfterConditionFunc() directly. No need for a method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


func withRetryAfter(handler http.Handler, excludeFn excludeFunc, retryAfterFn retryAfterConditionFunc) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
retryAfterParams, sendRetryAfter := retryAfterFn.ShouldRespondWithRetryAfter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryAfterFn()

@tkashem
Copy link
Contributor Author

tkashem commented Jul 16, 2021

/retest

@tkashem tkashem force-pushed the graceful-termintaion branch 4 times, most recently from 962e00d to e305345 Compare July 16, 2021 19:54
@tkashem tkashem changed the title apiserver: graceful termination with new option keep-listening-during-shutdown apiserver: graceful termination with new option shutdown-send-retry-after Jul 16, 2021
// net/http waits for 1s for the peer to respond to a GO_AWAY frame, so
// we should wait for a minimum of 2s
stopHttpServerCh = drainedCh.Signaled()
shutdownTimeout = 2 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the long-running connections should have a chance to terminate gracefully.
with shutdownTimeout = 2 * time.Second this is impossible.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 6, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 9, 2021
// NOTE: both WithRetryAfter and WithWaitGroup must use the same exact isRequestExemptFunc 'isRequestExemptFromRetryAfter,
// otherwise SafeWaitGroup might wait indefinitely and will prevent the server from shutting down gracefully.
func WithRetryAfter(handler http.Handler, shutdownDelayDurationElapsedCh <-chan struct{}) http.Handler {
return withRetryAfter(handler, isRequestExemptFromRetryAfter, newRetryAfterFunc(shutdownDelayDurationElapsedCh))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move the isRequestExemptFromRetryAfter comment above this line, out of the func doc


func withRetryAfter(handler http.Handler, isRequestExemptFn isRequestExemptFunc, shouldRespondWithRetryAfterFn shouldRespondWithRetryAfterFunc) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
retryAfterParams, sendRetryAfter := shouldRespondWithRetryAfterFn()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

params, send := .... The func is about retryAfter. No need for fully qualified var names.

// newRetryAfterFunc builds a shouldRespondWithRetryAfterFunc that starts
// sending Retry-After response to all incoming requests once the shut down
// delay duration (ShutdownDelayDuration) has elapsed.
func newRetryAfterFunc(shutdownDelayDurationElapsedCh <-chan struct{}) shouldRespondWithRetryAfterFunc {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't checked where this is used outside of WithRetryAfter. Just inline this there. This abstraction does not help, but rather distributes complexity.

// NOTE: both 'WithRetryAfter' and 'WithWaitGroup' filters should use this function
// to exempt the set of requests from being rejected or tracked.
func isRequestExemptFromRetryAfter(r *http.Request) bool {
if isKubeApiserverUserAgent(r) || hasExemptPathPrefix(r) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return isKubeApiserverUserAgent(r) || hasExemptPathPrefix(r)

add a new mode for graceful termination with the new server run option
'shutdown-send-retry-after'
- shutdown-send-retry-after=true: we initiate shutdown of the
  HTTP Server when all in-flight request(s) have been drained. during
  this window all incoming requests are rejected with status code
  429 and the following response headers:
    - 'Retry-After: N' - client should retry after N seconds
    - 'Connection: close' - tear down the TCP connection
- shutdown-send-retry-after=false: we initiate shutdown of the
  HTTP Server as soon as shutdown-delay-duration has elapsed. This
  is in keeping with the current behavior.
@sttts
Copy link
Contributor

sttts commented Aug 10, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sttts, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants