Add periodic etcd scraping to integration tests #106190
Conversation
A possible alternative approach would be to run the regular Prometheus server to do the scraping. In order to draft that, I would have to understand the best way to run that server. I can think of two approaches, each of which involves things I do not know.
Force-pushed 02e8a45 to 30d649b.
there are some shellcheck errors
metrics are obtained in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/106190/pull-kubernetes-integration/1456747488651251712/artifacts/etcd-scrapes/scrapes.tgz, maybe we can compress without the full path?
naive question: how can I visualize those dumps, is there any tool to import those files into Prometheus locally? @serathius do you know? /assign @thockin @BenTheElder
... to help understand where and why things go bad.
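One low-ceremony way to skim the dumps without importing them into Prometheus at all is plain shell over the text-format scrape files. A hedged sketch — the directory layout, file names, and sample values below are illustrative stand-ins created by the script itself, not what the CI artifact actually contains:

```shell
# Fabricate a tiny stand-in for a directory of saved Prometheus
# text-format scrapes (the real ones come from scrapes.tgz).
scrape_dir=$(mktemp -d)
printf 'etcd_server_proposals_committed_total 100\n' > "$scrape_dir/001.scrape"
printf 'etcd_server_proposals_committed_total 150\n' > "$scrape_dir/002.scrape"

# Walk the scrapes in file order and print one counter's samples,
# which is often enough to spot when a metric starts climbing.
for f in "$scrape_dir"/*.scrape; do
  awk -v src="$(basename "$f")" \
    '$1 == "etcd_server_proposals_committed_total" {print src, $2}' "$f"
done
```

Pointing an actual local Prometheus at the files is nicer for graphs, but for a first look at "where things go bad" this kind of one-liner is usually sufficient.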
Force-pushed 30d649b to dc07025.
Revised to make the scrapes easier to handle.
Force-pushed cfd9ef0 to 292a53c.
Force-pushed d000802 to 888df4d.
hack/serve-prom-scrapes.sh (Outdated):

    find_and_transform
    while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done
Suggested change:

    - while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done
    + while true; do nc -l -p "$port_num" < "$response_file" > /dev/null; sleep 10; done

that -N is not present in nmap netcat
I think nc -l -p is the most universal combination of flags.
Oh, that's annoying. It is missing in MacOS too. If I just remove the -N, this script does not work on MacOS. Does it work for you? What OS are you using? Which netcat package? I got this working on Ubuntu 20.04 using netcat-openbsd 1.206-1ubuntu1. Between that and MacOS, I can tell which is which using man nc | grep -w -E -N; what does that get for you?
oh, so this is the problem:

    -N      shutdown(2) the network socket after EOF on the input. Some servers require this to finish their work.

then keep the -N, it is the one implemented in the original netcat
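One way to reconcile the two netcats discussed above is to probe the installed nc for -N support and build the flag list accordingly. A hedged sketch, not what the PR's script actually does — the man-page probe and the variable name are assumptions, and on systems without man pages it simply falls back to the portable flags:

```shell
# Probe the installed nc's man page for a -N flag (OpenBSD netcat has it;
# nmap's ncat and some others do not), then pick listen flags accordingly.
if man nc 2>/dev/null | grep -q -w -e -N; then
  nc_listen_flags="-l -N -p"   # OpenBSD netcat: shutdown(2) the socket after EOF
else
  nc_listen_flags="-l -p"      # netcats without -N
fi
echo "$nc_listen_flags"
```

The probe mirrors the `man nc | grep -w -E -N` trick mentioned earlier in the thread; `-e -N` just keeps grep from parsing the pattern as an option.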
this works fine, awesome job, Mike, just the netcat flag comment. For docs purposes, you can also use Grafana dashboards to check the metrics:
Speaking of docs, I was wondering where the right place is to document the new scripts. They are non-trivial, and deserve a bit more documentation than the brief release-notes mention that I drafted.
Force-pushed 888df4d to c00d4c3.
The force-push to c00d4c3800f adds some more flexibility to the scripts.
Force-pushed 2baa374 to 8e1e930.
The last few force-pushes have been tweaking the log message(s) about networking.
@@ -64,6 +64,9 @@ runTests() {
   kube::log::status "Starting etcd instance"
   CLEANUP_REQUIRED=1
   kube::etcd::start
+  # shellcheck disable=SC2034
+  local ETCD_SCRAPE_PID # Set in kube::etcd::start_scraping, used in cleanup
+  kube::etcd::start_scraping
The idea of doing this scrape using a simple curl for integration tests seems lightweight enough to try. I'm not entirely sure how you will stitch this back together with which tests were running at a particular time, but given that it doesn't really add to our dependency stack, I'm ok giving it a try.
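The periodic-curl idea might look roughly like the following. This is a hedged illustration only, not the PR's actual kube::etcd::start_scraping implementation — the endpoint URL, 10-second interval, output directory, and helper names are all assumptions:

```shell
# Directory to collect the periodic scrapes into (illustrative location).
scrape_dir=$(mktemp -d)

scrape_etcd_once() {
  # One scrape: fetch etcd's Prometheus metrics page into a timestamped file.
  # 127.0.0.1:2379 is an assumed client URL, -sf keeps failures quiet but nonzero.
  curl -sf http://127.0.0.1:2379/metrics > "$scrape_dir/$(date +%s).scrape"
}

start_scraping() {
  # Scrape every 10 seconds in the background; the caller records the PID
  # (cf. ETCD_SCRAPE_PID in the diff above) so cleanup can kill the loop.
  while true; do scrape_etcd_once; sleep 10; done &
  ETCD_SCRAPE_PID=$!
}
```

Stitching results back to individual tests would then rely on comparing the file timestamps against the test log's timestamps, as discussed below.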
the job stores the etcd logs and the test output with timestamps; it is not great but better than before 🤷 we can correlate those logs with the graphs obtained here
In the case I was looking at, there was a log with timestamps.
Force-pushed 8e1e930 to 4ca4ccd.
The force-push to 4ca4ccd picks up some shell lint fixes and tweaks logging again.
/retest
/test pull-kubernetes-integration
/lgtm
/assign @deads2k
a lightweight way to keep the etcd metrics for integration tests seems worth trying. /approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, MikeSpreitzer

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What type of PR is this?
/kind feature
/kind flake
What this PR does / why we need it:
This PR adds periodic Prometheus scraping of the etcd server used in integration testing.
This is helpful because we suspect integration tests are flaky due to etcd overload.
See #106012 (comment)
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig testing
/sig api-machinery
/cc @aojea
@wojtek-t
@tkashem