feat(ops/pipelines): support buildkite retries

cl/12228 enabled automatic retries for some flaky tests, which
generally worked, as can be seen in
https://buildkite.com/tvl/depot/builds/35893

However, "🦆" still reports as failing, because it checks whether the
number of failed steps is nonzero, and a step that failed before being
retried still counts as failed.

We cannot check the overall status of the build, as it is still
"RUNNING" at that point. Instead of counting all failed steps so far, we
can query all failed jobs and then filter out the ones that were already
retried.

Change-Id: Ib9d27587c8a8ba7970850812c4302fecdc4482e7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/12233
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Florian Klink 2024-08-18 19:17:23 +03:00 committed by flokli
parent 98863e7312
commit bb5d7c9678

@@ -88,10 +88,12 @@ steps:
 continue_on_failure: true
 # Exit with success or failure depending on whether any other steps
-# failed.
+# failed (but not retried).
 #
 # This information is checked by querying the Buildkite GraphQL API
-# and fetching the count of failed steps.
+# and fetching all failed steps, then filtering out the ones that were
+# retried (retried jobs create new jobs, which would also show up in the
+# query).
 #
 # This step must be :duck: (yes, really!) because the post-command
 # hook will inspect this name.
@@ -109,8 +111,8 @@ steps:
 readonly FAILED_JOBS=$(curl 'https://graphql.buildkite.com/v1' \
 --silent \
 -H "Authorization: Bearer $(cat ${BUILDKITE_TOKEN_PATH})" \
--d "{\"query\": \"query BuildStatusQuery { build(uuid: \\\"$BUILDKITE_BUILD_ID\\\") { jobs(passed: false) { count } } }\"}" | \
-jq -r '.data.build.jobs.count')
+-d "{\"query\": \"query BuildStatusQuery { build(uuid: \\\"$BUILDKITE_BUILD_ID\\\") { jobs(passed: false, first: 500 ) { edges { node { ... on JobTypeCommand { retried } } } } } }\"}" | \
+jq -r '.data.build.jobs.edges | map(select(.node.retried == false)) | length')
 echo "$$FAILED_JOBS build jobs failed."
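
To see what the new jq filter counts, the sketch below runs it against a hand-written response instead of calling the API. The JSON in `sample_response` is a hypothetical example shaped like the query's selection set, not captured Buildkite output, and `count_unretried_failures` is a helper name made up for this illustration. It assumes `jq` is installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical response (an assumption, not real Buildkite output):
#   - a failed job that was retried,
#   - a failed job that was not retried,
#   - a job that is not a JobTypeCommand, so its node has no "retried" field.
sample_response='{
  "data": { "build": { "jobs": { "edges": [
    { "node": { "retried": true } },
    { "node": { "retried": false } },
    { "node": {} }
  ] } } }
}'

# Same filter as in the pipeline step: keep only nodes whose "retried"
# field is exactly false, then count them.
count_unretried_failures() {
  jq -r '.data.build.jobs.edges | map(select(.node.retried == false)) | length'
}

FAILED_JOBS=$(printf '%s' "$sample_response" | count_unretried_failures)
echo "$FAILED_JOBS build jobs failed."
```

Only the second job is counted here. The retried job is dropped, and so is the node without a `retried` field, because jq evaluates `null == false` as false. Note that the sketch uses a single `$` since it is a plain script; the `$$` in the pipeline step is there so that Buildkite does not interpolate the variable at upload time.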