diff --git a/users/zseri/dbwospof.md b/users/zseri/dbwospof.md new file mode 100644 index 000000000..f1d68cde0 --- /dev/null +++ b/users/zseri/dbwospof.md @@ -0,0 +1,112 @@ +# distributed build without single points of failure + +## problem statement +> If we want to distribute a build across several build nodes, and want to avoid +> a "single point of failure", what needs to be considered? + +## motivation + +* distribute the build across several build nodes, because some packages take + extremely long to build + (e.g. `firefox`, `thunderbird`, `qtwebengine`, `webkitgtk`, ...) +* avoid a centralised setup like e.g. with Hydra, because we want to keep using + an on-demand workflow as usual with Nix + (e.g. `nixos-rebuild` on each host when necessary). + +## list of abbreviations + +
+
CA
content-addressed
+
drv
derivation
+
FOD
fixed-output derivation
+
IHA
input-hash-addressed
+
inhash
input hash
+
outhash
output hash
+
+ +## build graph + +The build graph can't be easily distributed. It is instead left on the coordinator, +and the build nodes just get individual build jobs, which just consist of +derivations (and some information about how to get the inputs from some central +or distributed store (e.g. Ceph), this may be transmitted "out of band"). + +## inhash-exclusive + +It is necessary that each derivation build is exclusive in the sense that +the same derivation is never build multiple times simultaneously, because +this otherwise either wastes compute resources (obviously) and, in the case +of non-deterministic builds, increases complexity +(the store needs to decide which result to prefer, and the build nodes with +"losing" build results need to pull the "winning" build results from the store, +replacing the local version). Although this might be unnecessary in case +of IHA drvs, enforcing it always reduces the amount of possible suprising +results when mixing CA drvs and IHA drvs. + +## what can be meaningfully distributed + +The following is strongly opinionated, but I consider the following +(based upon the original build graph implementation from yzix 12.2021): +* We can't push the entire build graph to each build node, because they would + overlap 100%, and thus create extreme contention on the inhash-exclusive lock +* We could try to split the build graph into multiple parts with independent + inputs (partitioning), but this can be really complex, and I'm not sure + if it is worth it... This also basically excludes the yzix node types + [ `Eval`, `AssertEqual` ] (should be done by the evaluator). + Implementing this option however would make an abort of a build graph + (the simple variant does not kill running tasks, + just stop new tasks from being scheduled) really hard, and complex to get right. +* It does not make sense to distribute "node tasks" across build nodes which + almost exclusively interact with the store, and are not CPU-bound, but I/O bound. + This applies to most, if not all, useful FODs. It applies to the yzix node types + [ `Dump`, `UnDump`, `Fetch`, `Require` ] (should be performed by evaluator+store). +* TODO: figure out how to do forced rebuilds (e.g. also prefer a node which is not + the build node of the previous realisation of that task) + +## coarse per-derivation workflow + +``` + derivation + | | + | | + key build + | | + | | + V V + inhash outhash + | (either CA or IHA) + \ / + \ / + \ / + realisation +``` + +## build results + +Just for completeness, two build results are currently considered: + +* success: the build succeeded, and the result is uploaded to the central store +* failure: the build failed (e.g. build process terminated via error exit code or was killed) +* another case might be "partial": the build succeeded, but uploading to the + central store failed (the result is only available on the build node that built it). + This case is interesting, because we don't need to rerun the build, just the upload step + needs to be fixed/done semi-manually (e.g. maybe the central store ran out of storage, + or the network was unavailable) + +## build task queue + +It is naïve to think that something like a queue via `rabbitmq` (`AMQP`) or `MQTT` +suffices, because some requirements are missing: + +1. some way to push build results to the clients, and these should be associated + to the build inputs (a hacky way might use multiple queues for that, e.g. + a `tasks` input queue and a `done` output queue). +2. some way to lock "inhashes" (see section inhash-exclusive). + +The second point is somewhat easy to realise using `etcd`, and using the `watch` +mechanism it can be used to simulate a queue, and the inhash-addressing of +queued derivations can be seamlessly integrated. + +TODO: maybe we want to adjust the priorities of tasks in the queue, but Nix currently +doesn't seem to do this, so consider this only when it starts to make sense as a +performance or lag optimization.