Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add periodic etcd scraping to integration tests #106190

Merged

Conversation

MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Nov 5, 2021

What type of PR is this?

/kind feature
/kind flake

What this PR does / why we need it:

This PR adds periodic Prometheus scraping of the etcd server used in integration testing.
This is helpful because we suspect integration tests are flaky due to etcd overload.
See #106012 (comment)

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Integration testing now takes periodic Prometheus scrapes from the etcd server.
There is a new script ,`hack/run-prometheus-on-etcd-scrapes.sh`, that runs a containerized Prometheus server against an archive of such scrapes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig testing
/sig api-machinery
/cc @aojea
@wojtek-t
@tkashem

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 5, 2021
@MikeSpreitzer
Copy link
Member Author

A possible alternative approach would be to run the regular Prometheus server to do the scraping. In order to draft that, I would have to understand the best way to run that server. I can think of two approaches, each of which involves things I do not know.

  1. Run Prometheus in a container. What --- if any --- is the right way to add running containers to the integration tests? BTW, Added section on how CI runs integration tests community#6203 begs a related question.
  2. Build and install Prometheus as part of running the integration tests. https://github.com/prometheus/prometheus#building-from-source says this requires npm >= v7 and NodeJS >= v16; are these dependencies already satisfied or reasonable to introduce?

@aojea
Copy link
Member

aojea commented Nov 5, 2021

there are some shellcheck errrors

In ./hack/lib/etcd.sh line 120:
tar czf "${ETCD_SCRAPE_DIR}/scrapes.tgz" "${ETCD_SCRAPE_DIR}"/.scrape && rm "${ETCD_SCRAPE_DIR}"/.scrape || :
^-- SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

In ./hack/make-rules/test-integration.sh line 67:
local ETCD_SCRAPE_PID
^-------------^ SC2034: ETCD_SCRAPE_PID appears unused. Verify use (or export if used externally).

metrics are obtained in https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/106190/pull-kubernetes-integration/1456747488651251712/artifacts/etcd-scrapes/scrapes.tgz , maybe we can compress without the full path?
right now it expands to:

./logs/artifacts/etcd-scrapes/224457.scrape

naive question, how can I visualize those dumps , is there any tool to import those files into prometheus locally? @serathius do you know?

/assign @thockin @BenTheElder
for bash review

.. to help understand where and why things go bad.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 6, 2021
@MikeSpreitzer
Copy link
Member Author

MikeSpreitzer commented Nov 6, 2021

Revised to make the scrapes easier to handle.
Added a simple server for scrape files that exports their contents via the usual Prometheus web interface.
Added a simple script that puts the above together with simple scripting for running a Prometheus server against it.

@MikeSpreitzer MikeSpreitzer force-pushed the integration-scrape-etcd branch 4 times, most recently from cfd9ef0 to 292a53c Compare November 7, 2021 09:24
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2021
@MikeSpreitzer MikeSpreitzer force-pushed the integration-scrape-etcd branch 2 times, most recently from d000802 to 888df4d Compare November 8, 2021 00:00

find_and_transform

while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done
while true; do nc -l -p "$port_num" < "$response_file" > /dev/null; sleep 10; done

that -N is not present in nmap netcat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think nc -l -p is the most universal combination of flags

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's annoying. It is missing in MacOS too. If I just remove the -N, this script does not work on MacOS. Does it work for you? What OS are you using? Which netcat package? I got this working on Ubuntu 20.04 using netcat-openbsd 1.206-1ubuntu1. Between that and MacOS, I can tell which is which using man nc | grep -w -E -N; what does that get for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, so this is the problem

     -N      shutdown(2) the network socket after EOF on the input.  Some servers require this to finish their work.

then keep the -N , is the one implemented in the original netcat

@aojea
Copy link
Member

aojea commented Nov 8, 2021

this works fine, awesome job, Mike, just the netcat flag comment

For docs purposes, you can also use grafana dashboards to check the metrics:

  1. run another container with grafana
docker run -p 3000:3000 grafana/grafana
  1. login with admin:admin on localhost:3000
  2. setup local prometheus as data source
  3. install an etcd dashboard, (id 3070 Etcd by Prometheus) per example
  4. Check all the metrics of etcd during the test
    image

@MikeSpreitzer
Copy link
Member Author

Speaking of doc, I was wondering where is the right place to document the new scripts. They are non-trivial, and deserve a bit more documentation than the brief release notes mention that I drafted.

@MikeSpreitzer
Copy link
Member Author

MikeSpreitzer commented Nov 8, 2021

The force-push to c00d4c3800f adds some more flexibility to the scripts.
Now they also work on my Mac, albeit with the occasional mysterious print of the scrape request message from Prometheus.

@MikeSpreitzer MikeSpreitzer force-pushed the integration-scrape-etcd branch 3 times, most recently from 2baa374 to 8e1e930 Compare November 8, 2021 17:20
@MikeSpreitzer
Copy link
Member Author

The last few force-pushes have been tweaking up the log message(s) about networking.

@MikeSpreitzer
Copy link
Member Author

@deads2k
@tkashem

@@ -64,6 +64,9 @@ runTests() {
kube::log::status "Starting etcd instance"
CLEANUP_REQUIRED=1
kube::etcd::start
# shellcheck disable=SC2034
local ETCD_SCRAPE_PID # Set in kube::etcd::start_scraping, used in cleanup
kube::etcd::start_scraping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of doing this scrape using a simple curl for integration tests seems lightweight enough to try. I'm not entirely sure how you will stitch this back together with which tests were running at a particular time, but given that it doesn't really add to our dependency stack, I'm ok giving it a try.

Copy link
Member

@aojea aojea Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the job stores the etcd logs and the test output with timestamps, it is not great but better than before 🤷 we can correlate those logs with the graphs obtained here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case I was looking at, there was a log with timestamps.

@MikeSpreitzer
Copy link
Member Author

The force-push to 4ca4ccd picks some shell lint and tweaks logging again.

@MikeSpreitzer
Copy link
Member Author

/retest

@MikeSpreitzer
Copy link
Member Author

/test pull-kubernetes-integration

@aojea
Copy link
Member

aojea commented Nov 9, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2021
@fedebongio
Copy link
Contributor

/assign @deads2k
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2021
@deads2k
Copy link
Contributor

deads2k commented Nov 9, 2021

a lightweight way to keep the etc metrics for integration tests seems worth trying.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, MikeSpreitzer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0abc054 into kubernetes:master Nov 9, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 9, 2021
@MikeSpreitzer MikeSpreitzer deleted the integration-scrape-etcd branch November 10, 2021 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants