Add periodic etcd scraping to integration tests #106190

MikeSpreitzer · 2021-11-05T20:03:42Z

What type of PR is this?

/kind feature
/kind flake

What this PR does / why we need it:

This PR adds periodic Prometheus scraping of the etcd server used in integration testing.
This is helpful because we suspect integration tests are flaky due to etcd overload.
See #106012 (comment)
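
For readers who have not seen the change itself, periodic scraping can be sketched roughly as below. This is an illustrative sketch, not the PR's code: the function names, the metrics URL, the 30-second interval, and the `HHMMSS.scrape` file naming are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: names, URL, and interval are assumptions, not the PR's code.

# Fetch one snapshot of the metrics endpoint into a timestamped file.
scrape_once() {
  local url="$1" out_dir="$2"
  local stamp
  stamp="$(date +%H%M%S)"
  # -s: silent, -f: fail on HTTP errors; a failed scrape leaves no file behind.
  if curl -sf "${url}" > "${out_dir}/${stamp}.scrape.tmp"; then
    mv "${out_dir}/${stamp}.scrape.tmp" "${out_dir}/${stamp}.scrape"
  else
    rm -f "${out_dir}/${stamp}.scrape.tmp"
    return 1
  fi
}

# Scrape forever at a fixed interval; intended to be run in the background.
scrape_loop() {
  local url="$1" out_dir="$2" interval="${3:-30}"
  mkdir -p "${out_dir}"
  while true; do
    scrape_once "${url}" "${out_dir}" || true
    sleep "${interval}"
  done
}

# Example usage (not executed here; the URL is an assumed local etcd endpoint):
#   scrape_loop "http://127.0.0.1:2379/metrics" /tmp/etcd-scrapes 30 &
#   scrape_pid=$!
#   ... run the tests ...
#   kill "${scrape_pid}"
```

Running the loop in the background and killing it during cleanup keeps the scraper independent of the test process itself.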

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Integration testing now takes periodic Prometheus scrapes from the etcd server.
There is a new script, `hack/run-prometheus-on-etcd-scrapes.sh`, that runs a containerized Prometheus server against an archive of such scrapes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/sig testing
/sig api-machinery
/cc @aojea
@wojtek-t
@tkashem

MikeSpreitzer · 2021-11-05T20:18:01Z

A possible alternative approach would be to run the regular Prometheus server to do the scraping. In order to draft that, I would have to understand the best way to run that server. I can think of two approaches, each of which involves things I do not know.

1. Run Prometheus in a container. What, if anything, is the right way to add running containers to the integration tests? BTW, community#6203 ("Added section on how CI runs integration tests") raises a related question.
2. Build and install Prometheus as part of running the integration tests. https://github.com/prometheus/prometheus#building-from-source says this requires npm >= v7 and NodeJS >= v16; are these dependencies already satisfied, or reasonable to introduce?

aojea · 2021-11-05T23:29:52Z

There are some shellcheck errors:

In ./hack/lib/etcd.sh line 120:
tar czf "${ETCD_SCRAPE_DIR}/scrapes.tgz" "${ETCD_SCRAPE_DIR}"/.scrape && rm "${ETCD_SCRAPE_DIR}"/.scrape || :
^-- SC2015: Note that A && B || C is not if-then-else. C may run when A is true.

In ./hack/make-rules/test-integration.sh line 67:
local ETCD_SCRAPE_PID
^-------------^ SC2034: ETCD_SCRAPE_PID appears unused. Verify use (or export if used externally).
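
To illustrate the SC2015 note, here is a small standalone demonstration (not taken from the PR) of how `A && B || C` differs from an explicit `if`. The variable names are made up for the example.

```shell
#!/usr/bin/env bash
# A && B || C: C also runs when A succeeds but B fails.
chain_result="$(true && false || echo "C ran")"

# Explicit if/then/else: the else branch runs only when A itself fails.
if_result="$(if true; then false; else echo "C ran"; fi)"

echo "chain: '${chain_result}'"   # prints: chain: 'C ran'
echo "if:    '${if_result}'"      # prints: if:    ''
```

In the flagged line the trailing `|| :` looks intended to swallow any failure, so the chain may be deliberate; an explicit `if` simply makes that intent visible to both readers and shellcheck.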

The metrics are available at https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/106190/pull-kubernetes-integration/1456747488651251712/artifacts/etcd-scrapes/scrapes.tgz ; maybe we can archive them without the full path?
Right now it expands to:

./logs/artifacts/etcd-scrapes/224457.scrape
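
One way to avoid the full path in the archive is `tar -C`, which changes directory before adding members. The sketch below is an assumption about how this could be done, not the code that landed; `archive_scrapes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: archive *.scrape files with paths relative to the scrape directory.
archive_scrapes() {
  local dir="$1"
  local names=()
  local f
  for f in "${dir}"/*.scrape; do
    [ -e "${f}" ] || continue          # skip the literal pattern if nothing matched
    names+=("$(basename "${f}")")
  done
  [ "${#names[@]}" -gt 0 ] || return 0  # nothing to archive
  # -C changes into ${dir} first, so members are stored as bare file names.
  if tar -czf "${dir}/scrapes.tgz" -C "${dir}" "${names[@]}"; then
    # Remove the originals only after the archive was written successfully.
    (cd "${dir}" && rm -f -- "${names[@]}")
  fi
}
```

With this shape, unpacking the archive yields entries such as `224457.scrape` rather than `./logs/artifacts/etcd-scrapes/224457.scrape`.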

Naive question: how can I visualize those dumps? Is there any tool to import those files into Prometheus locally? @serathius, do you know?

/assign @thockin @BenTheElder
for bash review

MikeSpreitzer · 2021-11-06T06:14:17Z

Revised to make the scrapes easier to handle.
Added a simple server for scrape files that exports their contents via the usual Prometheus web interface.
Added a simple script that puts the above together with simple scripting for running a Prometheus server against it.
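
As a rough idea of what serving a scrape over plain netcat involves, the sketch below wraps a metrics body in a minimal HTTP response that a listener such as `nc -l` could replay. It is an assumption-laden illustration, not the contents of `hack/serve-prom-scrapes.sh`: `build_response` is a hypothetical helper, and the real script's transformation of scrape files is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch: wrap a Prometheus text-format body in a minimal HTTP/1.1 response.
build_response() {
  local body_file="$1" response_file="$2"
  local length
  # wc -c may pad its output with spaces on some platforms; strip them.
  length="$(wc -c < "${body_file}" | tr -d ' ')"
  {
    printf 'HTTP/1.1 200 OK\r\n'
    printf 'Content-Type: text/plain; version=0.0.4\r\n'
    printf 'Content-Length: %s\r\n' "${length}"
    printf 'Connection: close\r\n'
    printf '\r\n'
    cat "${body_file}"
  } > "${response_file}"
}
```

The resulting file can then be fed to a listening netcat on standard input, so that each incoming scrape request receives the same canned response.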

aojea · 2021-11-08T10:00:59Z

hack/serve-prom-scrapes.sh

+
+find_and_transform
+
+while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done


Suggested change

-while true; do nc -l -N -p "$port_num" < "$response_file" > /dev/null; sleep 10; done
+while true; do nc -l -p "$port_num" < "$response_file" > /dev/null; sleep 10; done

that -N is not present in nmap netcat

I think nc -l -p is the most universal combination of flags

Oh, that's annoying. It is missing in MacOS too. If I just remove the -N, this script does not work on MacOS. Does it work for you? What OS are you using? Which netcat package? I got this working on Ubuntu 20.04 using netcat-openbsd 1.206-1ubuntu1. Between that and MacOS, I can tell which is which using man nc | grep -w -E -N; what does that get for you?

oh, so this is the problem

-N shutdown(2) the network socket after EOF on the input. Some servers require this to finish their work.

Then keep the -N; it is the one implemented in the original netcat.

aojea · 2021-11-08T10:11:01Z

this works fine, awesome job, Mike, just the netcat flag comment

For docs purposes, you can also use grafana dashboards to check the metrics:

1. Run another container with Grafana:

   docker run -p 3000:3000 grafana/grafana

2. Log in with admin:admin on localhost:3000.
3. Set up the local Prometheus as a data source.
4. Install an etcd dashboard, for example ID 3070 ("Etcd by Prometheus").
5. Check all the etcd metrics recorded during the test.

MikeSpreitzer · 2021-11-08T14:56:10Z

Speaking of docs, I was wondering where the right place is to document the new scripts. They are non-trivial and deserve more documentation than the brief release-note mention that I drafted.

MikeSpreitzer · 2021-11-08T16:22:35Z

The force-push to c00d4c3800f adds some more flexibility to the scripts.
Now they also work on my Mac, albeit with the occasional mysterious print of the scrape request message from Prometheus.

MikeSpreitzer · 2021-11-08T17:20:54Z

The last few force-pushes have been tweaking up the log message(s) about networking.

MikeSpreitzer · 2021-11-08T18:41:22Z

@deads2k
@tkashem

deads2k · 2021-11-08T19:13:00Z

hack/make-rules/test-integration.sh

@@ -64,6 +64,9 @@ runTests() {
  kube::log::status "Starting etcd instance"
  CLEANUP_REQUIRED=1
  kube::etcd::start
+  # shellcheck disable=SC2034
+  local ETCD_SCRAPE_PID # Set in kube::etcd::start_scraping, used in cleanup
+  kube::etcd::start_scraping


The idea of doing this scrape using a simple curl for integration tests seems lightweight enough to try. I'm not entirely sure how you will stitch this back together with which tests were running at a particular time, but given that it doesn't really add to our dependency stack, I'm ok giving it a try.

the job stores the etcd logs and the test output with timestamps, it is not great but better than before 🤷 we can correlate those logs with the graphs obtained here

In the case I was looking at, there was a log with timestamps.

MikeSpreitzer · 2021-11-08T20:28:45Z

The force-push to 4ca4ccd picks some shell lint and tweaks logging again.

MikeSpreitzer · 2021-11-08T21:38:35Z

/retest

MikeSpreitzer · 2021-11-08T21:38:55Z

/test pull-kubernetes-integration

aojea · 2021-11-09T10:24:55Z

/lgtm

fedebongio · 2021-11-09T21:09:55Z

/assign @deads2k
/triage accepted

deads2k · 2021-11-09T21:16:08Z

A lightweight way to keep the etcd metrics for integration tests seems worth trying.

/approve

k8s-ci-robot · 2021-11-09T21:16:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, MikeSpreitzer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~hack/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested a review from aojea November 5, 2021 20:03

MikeSpreitzer force-pushed the integration-scrape-etcd branch from 02e8a45 to 30d649b Compare November 5, 2021 22:17

Add periodic etcd scraping to integration tests

dc07025

.. to help understand where and why things go bad.

MikeSpreitzer force-pushed the integration-scrape-etcd branch from 30d649b to dc07025 Compare November 6, 2021 01:33

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 6, 2021

MikeSpreitzer force-pushed the integration-scrape-etcd branch 4 times, most recently from cfd9ef0 to 292a53c Compare November 7, 2021 09:24

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2021

MikeSpreitzer force-pushed the integration-scrape-etcd branch 2 times, most recently from d000802 to 888df4d Compare November 8, 2021 00:00

aojea reviewed Nov 8, 2021

MikeSpreitzer force-pushed the integration-scrape-etcd branch from 888df4d to c00d4c3 Compare November 8, 2021 16:21

MikeSpreitzer force-pushed the integration-scrape-etcd branch 3 times, most recently from 2baa374 to 8e1e930 Compare November 8, 2021 17:20

deads2k reviewed Nov 8, 2021

Add serving of scrapes as Prometheus metrics

4ca4ccd

MikeSpreitzer force-pushed the integration-scrape-etcd branch from 8e1e930 to 4ca4ccd Compare November 8, 2021 20:28

k8s-ci-robot assigned aojea Nov 9, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2021

k8s-ci-robot assigned deads2k Nov 9, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2021

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2021

k8s-ci-robot merged commit 0abc054 into kubernetes:master Nov 9, 2021

k8s-ci-robot added this to the v1.23 milestone Nov 9, 2021

MikeSpreitzer deleted the integration-scrape-etcd branch November 10, 2021 04:25

aojea mentioned this pull request Jan 29, 2022

pull-kubernetes-integration timing out #107857

Closed

aojea mentioned this pull request Mar 31, 2022

pull-kubernetes-integration failing ~60-90% of runs - updated 2022-03-31 #109038

Closed
