tvl-depot/ops/pipelines/static-pipeline.yaml
Vincent Ambo 38ec27e834 fix(ops/pipelines): Chunk build pipeline into multiple uploads
The number of jobs in the depot pipeline is reaching the limit of
what the Buildkite backend will accept in a single pipeline
upload. Based on a conversation with their support, my understanding
is that this is due to internal locking mechanisms at Buildkite.

To work around this, we can instead split the pipeline into several
smaller chunks that are uploaded serially.

This commit introduces logic to chunk the pipeline accordingly. The
chunk size chosen is 256 for now (a multiple of our number of agents,
which is useful if we can get builds from the first chunk to start
before the next ones are uploaded).

Note that this chunk size is significantly below even the current
number of targets (~460 as of this commit), but choosing a lower chunk
size might alleviate problems we've been seeing with timeouts during
pipeline uploads.
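
For illustration, the chunking amounts to splitting the list of
generated steps into fixed-size slices, each of which is emitted as
its own chunk-*.json file. A minimal sketch of such a helper in Nix
(the name chunksOf and the example values are assumptions for
illustration, not the actual depot code):

    let
      lib = (import <nixpkgs> { }).lib;

      # Split a list into chunks of at most n elements, preserving order.
      chunksOf = n: list:
        if list == [ ] then [ ]
        else [ (lib.take n list) ] ++ chunksOf n (lib.drop n list);
    in
      chunksOf 3 [ 1 2 3 4 5 6 7 ]  # => [ [ 1 2 3 ] [ 4 5 6 ] [ 7 ] ]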

Change-Id: I77030aaf8b874c330218b78c77d15216e13b9af7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/4332
Tested-by: BuildkiteCI
Reviewed-by: wpcarro <wpcarro@gmail.com>
Autosubmit: tazjin <mail@tazj.in>
2021-12-15 15:49:40 +00:00

# This file defines the static Buildkite pipeline which attempts to
# create the dynamic pipeline of all depot targets.
#
# If something fails during the creation of the pipeline, the fallback
# is executed instead which will simply report an error to Gerrit.
---
steps:
- label: ":llama:"
command: |
set -ue
nix-build -A ops.pipelines.depot -o pipeline --show-trace
# Steps need to be uploaded in reverse order because pipeline
# upload prepends instead of appending.
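      # ($$ escapes the Buildkite agent's own variable interpolation
      # at upload time, so the shell receives a plain $chunk below.)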
      ls pipeline/chunk-*.json | tac | while read chunk; do
        buildkite-agent pipeline upload $$chunk
      done

  # Wait for all previous steps to complete.
  - wait: null
    continue_on_failure: true

  # Exit with success or failure depending on whether any other steps
  # failed.
  #
  # This information is checked by querying the Buildkite GraphQL API
  # and fetching the count of failed steps.
  #
  # This step must be :duck: (yes, really!) because the post-command
  # hook will inspect this name.
  #
  # Note that this step has requirements for the agent environment,
  # which are enforced in our NixOS configuration:
  #
  # * curl and jq must be on the $PATH of build agents
  # * besadii configuration must be readable to the build agents
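  #
  # As an illustrative sketch only (the real logic lives in besadii,
  # not in this file), such a post-command hook might gate on the
  # label like this:
  #
  #   if [[ "$BUILDKITE_LABEL" == ":duck:" ]]; then
  #     # ... report the overall build status back to Gerrit ...
  #   fi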
- label: ":duck:"
key: ":duck:"
command: |
set -ueo pipefail
readonly FAILED_JOBS=$(curl 'https://graphql.buildkite.com/v1' \
--silent \
-H "Authorization: Bearer $(cat /run/agenix/buildkite-graphql-token)" \
-d "{\"query\": \"query BuildStatusQuery { build(uuid: \\\"$BUILDKITE_BUILD_ID\\\") { jobs(passed: false) { count } } }\"}" | \
jq -r '.data.build.jobs.count')
echo "$$FAILED_JOBS build jobs failed."
if (( $$FAILED_JOBS > 0 )); then
exit 1
fi
  # After duck, on success, create a gcroot if the build branch is
  # canon.
  #
  # We care that this anchors *most* of the depot; in practice it is
  # unimportant if there is a build race and we get +-1 of the
  # targets.
  #
  # Unfortunately this requires a third evaluation of the graph, but
  # since it happens after :duck: it should not affect the timing of
  # status reporting back to Gerrit.
  - label: ":anchor:"
    if: "build.branch == 'refs/heads/canon'"
    command: |
      nix-instantiate -A ci.gcroot --add-root /nix/var/nix/gcroots/depot/canon
    depends_on:
      - step: ":duck:"
        allow_failure: false

  # Create a revision number for the current commit for builds on
  # canon.
  #
  # This writes data back to Gerrit using the Buildkite agent
  # credentials injected through a git credentials helper.
  #
  # Revision numbers are defined as the number of commits in the
  # lineage of HEAD, following only the first parent of merges.
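  #
  # For example, a canon commit with 12345 first-parent ancestors
  # would be pushed to refs/r/12345.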
- label: ":git:"
if: "build.branch == 'refs/heads/canon'"
command: |
git -c 'credential.helper=gerrit-creds' \
push origin "HEAD:refs/r/$(git rev-list --count --first-parent HEAD)"