docs(tvix/eval): document inner workings of the new VM loop

Change-Id: I8224bf039f739c401900b5a2ddc839810c87cf6e
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8226
Tested-by: BuildkiteCI
Reviewed-by: Adam Joseph <adam@westernsemico.com>
This commit is contained in:
Vincent Ambo 2023-03-07 19:54:57 +03:00 committed by tazjin
parent d456705352
commit 466e6dc265

View file

@ -1,68 +1,295 @@
tvix-eval VM loop
=================
Date: 2023-02-14
Author: tazjin
This document describes the new tvix-eval VM execution loop implemented in the
chain focusing around cl/8104.
## Background
The VM loop implemented in `src/vm.rs` currently has a couple of functions:
The VM loop implemented in Tvix prior to cl/8104 had several functions:
1. Advance the instruction pointer and execute instructions, in a loop
(unsurprisingly).
1. Advancing the instruction pointer for a chunk of Tvix bytecode and
executing instructions in a loop until a result was yielded.
2. Tracking Nix call frames as functions/thunks are entered/exited.
2. Tracking Nix call frames as functions/thunks were entered/exited.
3. Catch trampoline requests returned from instruction executions, and
resuming of executions afterwards.
3. Catching trampoline requests returned from instructions to force suspended
thunks without increasing stack size *where possible*.
4. Invoking the inner trampoline loop, handling actions and
continuations from the trampoline.
4. Handling trampolines through an inner trampoline loop, switching between a
code execution mode and execution of subsequent trampolines.
The current implementation of the trampoline logic was added on to the
existing VM, which recursed for thunk forcing. As a result, it is
currently a little difficult to understand what exactly is going on in
the VM loop and how the trampoline logic works.
This implementation of the trampoline logic was added on to the existing VM,
which previously always recursed for thunk forcing. There are some cases (for
example values that need to be forced *inside* of the execution of a builtin)
where trampolines could not previously be used, and the VM recursed anyways.
This has also led to several bugs, for example: b/251, b/246, b/245,
and b/238.
As a result of this trampoline logic being added "on top" of the existing VM
loop the code became quite difficult to understand. This led to several bugs,
for example: b/251, b/246, b/245, and b/238.
These bugs are very tricky to deal with, as we have to try and make
the VM do things that are somewhat difficult to fit into the current
model. We could of course keep extending the trampoline logic to
accomodate all sorts of concepts (such as finalisers), but that seems
difficult.
These bugs were tricky to deal with, as we had to try and make the VM do
things that are somewhat difficult to fit into its model. We could of course
keep extending the trampoline logic to accommodate all sorts of concepts (such
as finalisers), but that seems like it does not solve the root problem.
There are also additional problems, such as forcing inside of builtin
implementations, which leads to a situation where we would like to
suspend the builtin and return control flow to the VM until the value
is forced.
## New VM loop
## Proposal
In cl/8104, a unified new solution is implemented with which the VM is capable
of evaluating everything without increasing the call stack size.
We rewrite parts of the VM loop to elevate trampolining and
potentially other modes of execution to a top-level concept of the VM.
This is done by introducing a new frame stack in the VM, on which execution
frames are enqueued that are either:
We achieve this by replacing the current concept of call frames with a
"VM context" (naming tbd), which can represent multiple different
states of the VM:
1. A bytecode frame, consisting of Tvix bytecode that evaluates compiled Nix
code.
2. A generator frame, consisting of some VM logic implemented in pure Rust
code that can be *suspended* when it hits a point where the VM would
previously need to recurse.
1. Tvix code execution, equivalent to what is currently a call frame,
executing bytecode until the instruction pointer reaches the end of
a chunk, then returning one level up.
We do this by making use of the `async` *keyword* in Rust, but notably
*without* introducing asynchronous I/O or concurrency in tvix-eval (the
complexity of which is currently undesirable for us).
2. Trampolining the forcing of a thunk, equivalent to the current
trampolining logic.
Specifically, when writing a Rust function that uses the `async` keyword, such
as:
3. Waiting for the result of a trampoline, to ensure that in case of
nested thunks all representations are correctly transformed.
```rust
async fn some_builtin(input: Value) -> Result<Value, ErrorKind> {
let mut out = NixList::new();
4. Trampolining the execution of a builtin. This is not in scope for
the initial implementation, but it should be conceptually possible.
for element in input.to_list()? {
let result = do_something_that_requires_the_vm(element).await;
out.push(result);
}
It is *not* in scope for this proposal to enable parallel suspension
of thunks. This is a separate topic which is discussed in [waiting for
the store][wfs].
Ok(out)
}
```
The compiler actually generates a state-machine under-the-hood which allows
the execution of that function to be *suspended* whenever it hits an `await`.
We use the [`genawaiter`][] crate that gives us a data structure and simple
interface for getting instances of these state machines that can be stored in
a struct (in our case, a *generator frame*).
The execution of the VM then becomes the execution of an *outer loop*, which
is responsible for selecting the next generator frame to execute, and two
*inner loops*, which drive the execution of a bytecode frame or generator
frame forward until it either yields a value or asks to be suspended in favour
of another frame.
All "communication" between frames happens solely through values left on the
stack: Whenever a frame of either type runs to completion, it is expected to
leave a *single* value on the stack. It follows that the whole VM, upon
completion of the last (or initial, depending on your perspective) frame
yields its result as the return value.
The core of the VM restructuring is cl/8104, unfortunately one of the largest
single commit changes we've had to make yet, as it touches pretty much all
areas of tvix-eval. The introduction of the generators and the
message/response system we built to request something from the VM, suspend a
generator, and wait for the return is in cl/8148.
The next sections describe in detail how the three different loops work.
### Outer VM loop
The outer VM loop is responsible for selecting the next frame to run, and
dispatching it correctly to inner loops, as well as determining when to shut
down the VM and return the final result.
```
╭──────────────────╮
╭────────┤ match frame kind ├──────╮
│ ╰──────────────────╯ │
│ │
┏━━━━━━━━━━━━┷━━━━━┓ ╭───────────┴───────────╮
───►┃ frame_stack.pop()┃ ▼ ▼
┗━━━━━━━━━━━━━━━━━━┛ ┏━━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━┓
▲ ┃ bytecode frame ┃ ┃ generator frame ┃
│ ┗━━━━━━━━┯━━━━━━━┛ ┗━━━━━━━━┯━━━━━━━━┛
│[yes, cont.] │ │
│ ▼ ▼
┏━━━━━━━━┓ │ ╔════════════════╗ ╔═════════════════╗
◄───┨ return ┃ │ ║ inner bytecode ║ ║ inner generator ║
┗━━━━━━━━┛ │ ║ loop ║ ║ loop ║
▲ │ ╚════════╤═══════╝ ╚════════╤════════╝
│ ╭────┴─────╮ │ │
│ │ has next │ ╰───────────┬───────────╯
[no] ╰───┤ frame? │ │
╰────┬─────╯ ▼
│ ┏━━━━━━━━━━━━━━━━━┓
│ ┃ frame completed ┃
╰─────────────────────────┨ or suspended ┃
┗━━━━━━━━━━━━━━━━━┛
```
Initially, the VM always pops a frame from the frame stack and then inspects
the type of frame it found. As a consequence the next frame to execute is
always the frame at the top of the stack, and setting up a VM initially for
code execution is done by leaving a bytecode frame with the code to execute on
the stack and passing control to the outer loop.
Control is dispatched to either of the inner loops (depending on the type of
frame) and the cycle continues once they return.
When an inner loop returns, it has either finished its execution (and left its
result value on the *value stack*), or its frame has requested to be
suspended.
Frames request suspension by re-enqueueing *themselves* through VM helper
methods, and then leaving the frame they want to run *on top* of themselves in
the frame stack before yielding control back to the outer loop.
The inner control loops inform the outer loops about whether the frame has
been *completed* or *suspended* by returning a boolean.
### Inner bytecode loop
The inner bytecode loop drives the execution of some Tvix bytecode by
continously looking at the next instruction to execute, and dispatching to the
instruction handler.
```
┏━━━━━━━━━━━━━┓
◄──┨ return true ┃
┗━━━━━━━━━━━━━┛
╔════╧═════╗
║ OpReturn ║
╚══════════╝
╰──┬────────────────────────────╮
│ ▼
│ ╔═════════════════════╗
┏━━━━━━━━┷━━━━━┓ ║ execute instruction ║
───►┃ inspect next ┃ ╚══════════╤══════════╝
┃ instruction ┃ │
┗━━━━━━━━━━━━━━┛ │
▲ ╭─────┴─────╮
╰──────────────────────┤ suspends? │
[no] ╰─────┬─────╯
┏━━━━━━━━━━━━━━┓ │
◄──┨ return false ┃───────────────────────╯
┗━━━━━━━━━━━━━━┛ [yes]
```
With this refactoring, the compiler now emits a special `OpReturn` instruction
at the end of bytecode chunks. This is a signal to the runtime that the chunk
has completed and that its current value should be returned, without having to
perform instruction pointer arithmetic.
When `OpReturn` is encountered, the inner bytecode loop returns control to the
outer loop and informs it (by returning `true`) that the bytecode frame has
completed.
Any other instruction may also request a suspension of the bytecode frame (for
example, instructions that need to force a value). In this case the inner loop
is responsible for setting up the frame stack correctly, and returning `false`
to inform the outer loop of the suspension
### Inner generator loop
The inner generator loop is responsible for driving the execution of a
generator frame by continously calling [`Gen::resume`][] until it requests a
suspension (as a result of which control is returned to the outer loop), or
until the generator is done and yields a value.
```
┏━━━━━━━━━━━━━┓ [Done]
◄──┨ return true ┃ ◄───────────────────╮
┗━━━━━━━━━━━━━┛ │
╭──────────────────┴─────────╮
╭───────┤ inspect generator response │
│ ╰──────────────────┬─────────╯
┏━━━━━━━━┷━━━━━━━━┓ │
──►┃ gen.resume(msg) ┃ │[Yielded]
┗━━━━━━━━━━━━━━━━━┛ │
▲ ╭──────┴─────╮
│ [yes] │ same-frame │
╰───────────────────┤ request? │
╰──────┬─────╯
┏━━━━━━━━━━━━━━┓ │
◄──┨ return false ┃ ◄──────────────────╯
┗━━━━━━━━━━━━━━┛ [no]
```
On each execution of a generator frame, `resume_with` is called with a
[`VMResponse`][] (i.e. a message *from* the VM *to* the generator). For a newly
created generator, the initial message is just `Empty`.
A generator may then respond by signaling that it has finished execution
(`Done`), in which case the inner generator loop returns control to the outer
loop and informs it that this generator is done (by returning `true`).
A generator may also respond by signaling that it needs some data from the VM.
This is implemented through a request-response pattern, in which the generator
returns a `Yielded` message containing a [`VMRequest`][]. These requests can be
very simple ("Tell me the current store path") or more complex ("Call this Nix
function with these values").
Requests are divided into two classes: Same-frame requests (requests that can be
responded to *without* returning control to the outer loop, i.e. without
executing a *different* frame), and multi-frame generator requests. Based on the
type of request, the inner generator loop will either handle it right away and
send the response in a new `resume_with` call, or return `false` to the outer
generator loop after setting up the frame stack.
Most of this logic is implemented in cl/8148.
[`Gen::resume`]: https://docs.rs/genawaiter/0.99.1/genawaiter/rc/struct.Gen.html#method.resume_with
[`VMRequest`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L44
[`VMResponse`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L169
## Advantages & Disadvantages of the approach
This approach has several advantages:
* The execution model is much simpler than before, making it fairly
straightforward to build up a mental model of what the VM does.
* All "out of band requests" inside the VM are handled through the same
abstraction (generators).
* Implementation is not difficult, albeit a little verbose in some cases (we
can argue about whether or not to introduce macros for simplifying it).
* Several parts of the VM execution are now much easier to document,
potentially letting us onboard tvix-eval contributors faster.
* The linear VM execution itself is much easier to trace now, with for example
the `RuntimeObserver` (and by extension `tvixbolt`) giving much clearer
output now.
But it also comes with some disadvantages:
* Even though we "only" use the `async` keyword without a full async-I/O
runtime, we still encounter many of the drawbacks of the fragmented Rust
async ecosystem.
The biggest issue with this is that parts of the standard library become
unavailable to us, for example the built-in `Vec::sort_by` can no longer be
used for sorting in Nix because our comparators themselves are `async`.
This led us to having to implement some logic on our own, as the design of
`async` in Rust even makes it difficult to provide usecase-generic
implementations of concepts like sorting.
* We need to allocate quite a few new structures on the heap in order to drive
generators, as generators involve storing `Future` types (with unknown
sizes) inside of structs.
In initial testing this seems to make no significant difference in
performance (our performance in an actual nixpkgs-eval is still bottlenecked
by I/O concerns and reference scanning), but is something to keep in mind
later on when we start optimising more after the low-hanging fruits have
been reaped.
## Alternatives considered
@ -70,7 +297,19 @@ the store][wfs].
implementation to accomodate problems as they show up. This is not
preferred as the code is already getting messy.
2. ... ?
2. Making tvix-eval a fully `async` project, pulling in something like Tokio
or `async-std` as a runtime. This is not preferred due to the massively
increased complexity of those solutions, and all the known issues of fully
buying in to the async ecosystem.
tvix-eval fundamentally should work for use-cases besides building Nix
packages (e.g. for `//tvix/serde`), and its profile should be as slim as
possible.
[wfs]: https://docs.google.com/document/d/1Zuw9UdMy95hcqsd-KudTw5yeeUkEXrqOi0rooGB7GWA/edit
3. Convincing the Rust developers that Rust needs a way to guarantee
constant-stack-depth tail calls through something like a `tailcall`
keyword.
4. ... ?
[`genawaiter`]: https://docs.rs/genawaiter/