docs(tvix/docs/TODO): extend O11Y section
Expand on tvix-tracing crate strategy, add some more context regarding
OTLP and span propagation.

Change-Id: Ice55c116c20aaf60531100465192ce11969551ac
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11750
Autosubmit: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
Reviewed-by: Simon Hauser <simon.hauser@helsinki-systems.de>
Reviewed-by: flokli <flokli@flokli.de>
parent 41e2fd7fa5
commit 0ea55c767a
1 changed file with 33 additions and 6 deletions
@@ -140,9 +140,36 @@ logs etc, but this is something requiring a lot of designing.
 - Some work ongoing on the worker operation parsing (griff, picnoir)
 
 ### O11Y
-- gRPC trace propagation (cl/10532)
-- `tracing-tracy` (cl/10952)
-- `[tracing-]indicatif` for progress/log reporting (floklis stash)
-- unification into `tvix-tracing` crate, currently a lot of boilerplate
-  in `tvix-store` CLI entrypoint, and half of the boilerplate copied over to
-  `tvix-cli`.
+- `[tracing-]indicatif` for progress/log reporting (cl/11747)
+- Currently there's a lot of boilerplate in the `tvix-store` CLI entrypoint,
+  and half of the boilerplate copied over to `tvix-cli`.
+  Setup of the tracing things should be unified into the `tvix-tracing` crate,
+  maybe including some of the CLI parameters (@simon).
+  Or maybe drop `--log-level` entirely, and only use `RUST_LOG` env
+  exclusively? `debug`,`trace` level across all crates is a bit useless, and
+  `RUST_LOG` can be much more granular…
+- The OTLP stack is quite spammy if there's no OTLP collector running on
+  localhost.
+  https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
+  mentions a `OTEL_SDK_DISABLED` env var, but it defaults to false, so they
+  suggest enabling OTLP by default.
+  We currently have a `--otlp` cmdline arg which explicitly needs to be set to
+  false to stop it, in line with that "enabled by default" philosophy
+  Do some research if we can be less spammy. While OTLP support is
+  feature-flagged, it should not get in the way too much, so we can actually
+  have it compiled in most of the time.
+- gRPC trace propagation (cl/10532 + @simon)
+  We need to wire trace propagation into our gRPC clients, so if we collect
+  traces both for the client and server they will be connected.
+- Fix OTLP sending batches on shutdown.
+  It seems for short-lived CLI invocations we don't end up receiving all spans.
+  Ensure we flush these on ctrl-c, and regular process termination.
+  See https://github.com/open-telemetry/opentelemetry-rust/issues/1395#issuecomment-2045567608
+  for some context.
+
+Later:
+- Trace propagation for HTTP clients too, using
+  https://www.w3.org/TR/trace-context/ or https://www.w3.org/TR/baggage/,
+  whichever makes more sense.
+  Candidates: nix+http(s) protocol, object_store crates.
+- (`tracing-tracy` (cl/10952))
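For context on the trace propagation items in this change: the W3C Trace Context spec linked above carries the trace in a `traceparent` header of the form `00-<trace-id>-<parent-id>-<flags>`, which a client would attach to outgoing requests so the server's spans join the same trace. Below is a minimal std-only sketch of that encoding. The `TraceContext` type and its methods are hypothetical names for illustration; they are not part of tvix or of any OpenTelemetry crate, and the sketch only handles version `00` and the `sampled` flag.

```rust
/// Hypothetical illustration type; not taken from tvix or OpenTelemetry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: u128,
    pub parent_id: u64,
    pub sampled: bool,
}

impl TraceContext {
    /// Render a version-00 `traceparent` header value:
    /// 2 hex digits version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.parent_id, self.sampled as u8
        )
    }

    /// Parse a version-00 `traceparent` header value.
    /// Returns `None` for anything malformed (simplification: other versions
    /// are rejected rather than parsed leniently).
    pub fn from_header(value: &str) -> Option<Self> {
        let mut parts = value.trim().split('-');
        let version = parts.next()?;
        let trace_id = parts.next()?;
        let parent_id = parts.next()?;
        let flags = parts.next()?;
        if version != "00" || parts.next().is_some() {
            return None;
        }
        if trace_id.len() != 32 || parent_id.len() != 16 || flags.len() != 2 {
            return None;
        }
        // The spec requires lowercase hex digits in every field.
        let is_lower_hex =
            |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
        if !(is_lower_hex(trace_id) && is_lower_hex(parent_id) && is_lower_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
        let parent_id = u64::from_str_radix(parent_id, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero trace-id or parent-id is invalid per the spec.
        if trace_id == 0 || parent_id == 0 {
            return None;
        }
        Some(Self {
            trace_id,
            parent_id,
            sampled: (flags & 0x01) == 1,
        })
    }
}
```

A client would call `to_header()` and send the result as request metadata; the server would call `from_header()` and use the parsed IDs as the parent of its own span. In practice the OpenTelemetry propagator machinery would presumably do this rather than hand-written code.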