docs(tvix/docs/TODO): extend O11Y section

Expand on tvix-tracing crate strategy, add some more context regarding
OTLP and span propagation.

Change-Id: Ice55c116c20aaf60531100465192ce11969551ac
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11750
Autosubmit: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
Reviewed-by: Simon Hauser <simon.hauser@helsinki-systems.de>
Reviewed-by: flokli <flokli@flokli.de>
Florian Klink 2024-06-05 10:37:35 +02:00 committed by clbot
parent 41e2fd7fa5
commit 0ea55c767a


@ -140,9 +140,36 @@ logs etc, but this is something requiring a lot of designing.
- Some work ongoing on the worker operation parsing (griff, picnoir)
### O11Y
- gRPC trace propagation (cl/10532)
- `tracing-tracy` (cl/10952)
- `[tracing-]indicatif` for progress/log reporting (floklis stash)
- unification into `tvix-tracing` crate, currently a lot of boilerplate
in `tvix-store` CLI entrypoint, and half of the boilerplate copied over to
`tvix-cli`.
- `[tracing-]indicatif` for progress/log reporting (cl/11747)
- Currently there's a lot of boilerplate in the `tvix-store` CLI entrypoint,
and half of the boilerplate copied over to `tvix-cli`.
Setup of the tracing things should be unified into the `tvix-tracing` crate,
maybe including some of the CLI parameters (@simon).
Or maybe drop `--log-level` entirely, and use the `RUST_LOG` env var
exclusively? Setting `debug` or `trace` level across all crates is a bit
useless, and `RUST_LOG` can be much more granular…
- The OTLP stack is quite spammy if there's no OTLP collector running on
localhost.
https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
mentions a `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
suggest enabling OTLP by default.
We currently have a `--otlp` cmdline arg which explicitly needs to be set to
false to stop it, in line with that "enabled by default" philosophy.
Research whether we can be less spammy. While OTLP support is
feature-flagged, it should not get in the way too much, so we can actually
have it compiled in most of the time.
- gRPC trace propagation (cl/10532 + @simon)
We need to wire trace propagation into our gRPC clients, so if we collect
traces both for the client and server they will be connected.
- Fix OTLP sending batches on shutdown.
It seems for short-lived CLI invocations we don't end up receiving all spans.
Ensure we flush these on ctrl-c, and regular process termination.
See https://github.com/open-telemetry/opentelemetry-rust/issues/1395#issuecomment-2045567608
for some context.
Later:
- Trace propagation for HTTP clients too, using
https://www.w3.org/TR/trace-context/ or https://www.w3.org/TR/baggage/,
whichever makes more sense.
Candidates: nix+http(s) protocol, object_store crates.
- (`tracing-tracy` (cl/10952))
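
For context on the HTTP trace propagation item: the W3C Trace Context spec linked above carries the trace ID and parent span ID in a `traceparent` header of the form `00-<trace-id>-<parent-id>-<flags>`. The following is a minimal, std-only sketch of formatting and parsing that header. The `TraceParent` type and its methods are hypothetical names for illustration; this is not how tvix or the OpenTelemetry crates implement it, and a real integration would likely go through an existing propagator rather than hand-rolled parsing.

```rust
/// Hypothetical, simplified model of a W3C `traceparent` header (version 00).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub flags: u8,
}

impl TraceParent {
    /// Render as `00-<32 hex digits>-<16 hex digits>-<2 hex digits>`.
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.parent_id, self.flags
        )
    }

    /// Parse a version-00 header value; returns `None` for anything malformed.
    pub fn parse(value: &str) -> Option<Self> {
        let mut parts = value.trim().split('-');
        let version = parts.next()?;
        let trace_id = parts.next()?;
        let parent_id = parts.next()?;
        let flags = parts.next()?;

        // Version 00 has exactly four fields with fixed widths.
        if parts.next().is_some() || version != "00" {
            return None;
        }
        if trace_id.len() != 32 || parent_id.len() != 16 || flags.len() != 2 {
            return None;
        }

        // The spec requires lowercase hex digits.
        let is_lower_hex =
            |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !(is_lower_hex(trace_id) && is_lower_hex(parent_id) && is_lower_hex(flags)) {
            return None;
        }

        let parsed = TraceParent {
            trace_id: u128::from_str_radix(trace_id, 16).ok()?,
            parent_id: u64::from_str_radix(parent_id, 16).ok()?,
            flags: u8::from_str_radix(flags, 16).ok()?,
        };

        // All-zero trace or parent IDs are invalid.
        if parsed.trace_id == 0 || parsed.parent_id == 0 {
            return None;
        }
        Some(parsed)
    }

    /// Whether the "sampled" bit (lowest bit of the flags byte) is set.
    pub fn sampled(&self) -> bool {
        self.flags & 0x01 != 0
    }
}
```

A client would attach `to_header()` as the `traceparent` header on outgoing requests, and a server would call `parse` on the incoming value to continue the same trace, which is what connects client and server spans.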