docs(tvix/eval): document inner workings of the new VM loop
Change-Id: I8224bf039f739c401900b5a2ddc839810c87cf6e
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8226
Tested-by: BuildkiteCI
Reviewed-by: Adam Joseph <adam@westernsemico.com>

tvix-eval VM loop
=================

Date: 2023-02-14
Author: tazjin

This document describes the new tvix-eval VM execution loop implemented in the
chain of changes centred around cl/8104.

## Background

The VM loop implemented in Tvix prior to cl/8104 had several functions:

1. Advancing the instruction pointer for a chunk of Tvix bytecode and
   executing instructions in a loop until a result was yielded.

2. Tracking Nix call frames as functions/thunks were entered/exited.

3. Catching trampoline requests returned from instructions to force suspended
   thunks without increasing stack size *where possible*.

4. Handling trampolines through an inner trampoline loop, switching between a
   code execution mode and execution of subsequent trampolines.

This implementation of the trampoline logic was added on to the existing VM,
which previously always recursed for thunk forcing. There are some cases (for
example, values that need to be forced *inside* of the execution of a builtin)
where trampolines could not previously be used, and the VM recursed anyway.

As a result of this trampoline logic being added "on top" of the existing VM
loop, the code became quite difficult to understand. This led to several bugs,
for example: b/251, b/246, b/245, and b/238.

These bugs were tricky to deal with, as we had to try and make the VM do
things that are somewhat difficult to fit into its model. We could of course
keep extending the trampoline logic to accommodate all sorts of concepts (such
as finalisers), but that would not solve the root problem.

## New VM loop

In cl/8104, a unified new solution is implemented with which the VM is capable
of evaluating everything without increasing the call stack size.

This is done by introducing a new frame stack in the VM, on which execution
frames are enqueued that are either:

1. A bytecode frame, consisting of Tvix bytecode that evaluates compiled Nix
   code.
2. A generator frame, consisting of some VM logic implemented in pure Rust
   code that can be *suspended* when it hits a point where the VM would
   previously need to recurse.

We do this by making use of the `async` *keyword* in Rust, but notably
*without* introducing asynchronous I/O or concurrency in tvix-eval (the
complexity of which is currently undesirable for us).

Specifically, when writing a Rust function that uses the `async` keyword, such
as:

```rust
async fn some_builtin(input: Value) -> Result<Value, ErrorKind> {
    let mut out = NixList::new();

    for element in input.to_list()? {
        let result = do_something_that_requires_the_vm(element).await;
        out.push(result);
    }

    Ok(Value::List(out))
}
```

the compiler actually generates a state machine under the hood which allows
the execution of that function to be *suspended* whenever it hits an `await`.

We use the [`genawaiter`][] crate, which gives us a data structure and a simple
interface for getting instances of these state machines that can be stored in
a struct (in our case, a *generator frame*).

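To make this concrete, the sketch below hand-rolls, using only the standard
library, the machinery that `genawaiter` otherwise provides generically: an
`async` block acting as a generator frame, suspended and resumed by a driver
loop through a shared request/response cell. All names in it (`Chan`, `Co`,
`YieldFut`, `drive`) are invented for illustration and are not the actual
tvix-eval types:

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A cell shared between the driver (the "VM") and the suspended async block
// (the "generator frame"), carrying the request/response messages.
#[derive(Default)]
struct Chan {
    request: RefCell<Option<String>>,
    response: RefCell<Option<String>>,
}

// The future returned by `Co::yield_`: it suspends once (returning control to
// the driver) and, when polled again, yields the driver's response.
struct YieldFut {
    chan: Rc<Chan>,
    suspended: bool,
}

impl Future for YieldFut {
    type Output = String;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<String> {
        if !self.suspended {
            self.suspended = true;
            return Poll::Pending; // suspension point: unwinds to the driver
        }
        Poll::Ready(self.chan.response.borrow_mut().take().expect("no response"))
    }
}

// Handle given to the generator for sending requests to the driver.
struct Co {
    chan: Rc<Chan>,
}

impl Co {
    fn yield_(&self, request: &str) -> YieldFut {
        *self.chan.request.borrow_mut() = Some(request.to_string());
        YieldFut { chan: self.chan.clone(), suspended: false }
    }
}

// A do-nothing waker: we drive the future by hand, so wakeups are not needed.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn drive() -> String {
    let chan = Rc::new(Chan::default());
    let co = Co { chan: chan.clone() };

    // The "generator frame": straight-line Rust code with a suspension point.
    let mut generator = Box::pin(async move {
        let path = co.yield_("store-path?").await;
        format!("result: {path}")
    });

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);

    // The driver loop: poll the state machine, answering requests until done.
    loop {
        match generator.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => {
                let request = chan.request.borrow_mut().take().expect("no request");
                assert_eq!(request, "store-path?");
                *chan.response.borrow_mut() = Some("/nix/store/abc".to_string());
            }
        }
    }
}

fn main() {
    println!("{}", drive());
}
```

The `Poll::Pending` returned from the custom future is exactly the suspension
point: the whole `async` block unwinds back to `poll`'s caller without
consuming call stack, and resuming it is just another `poll` call.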
The execution of the VM then becomes the execution of an *outer loop*, which
is responsible for selecting the next frame to execute, and two *inner
loops*, which drive the execution of a bytecode frame or generator frame
forward until it either yields a value or asks to be suspended in favour of
another frame.

All "communication" between frames happens solely through values left on the
stack: whenever a frame of either type runs to completion, it is expected to
leave a *single* value on the stack. It follows that the whole VM, upon
completion of the last (or initial, depending on your perspective) frame,
yields its result as the return value.

The core of the VM restructuring is cl/8104, unfortunately one of the largest
single-commit changes we have had to make yet, as it touches pretty much all
areas of tvix-eval. The introduction of the generators, and of the
message/response system we built to request something from the VM, suspend a
generator, and wait for the return, is in cl/8148.

The next sections describe in detail how the three different loops work.

### Outer VM loop

The outer VM loop is responsible for selecting the next frame to run,
dispatching it correctly to the inner loops, and determining when to shut
down the VM and return the final result.

```
                          ╭──────────────────╮
                 ╭────────┤ match frame kind ├──────╮
                 │        ╰──────────────────╯      │
                 │                                  │
    ┏━━━━━━━━━━━━┷━━━━━┓           ╭───────────┴───────────╮
───►┃ frame_stack.pop()┃           ▼                       ▼
    ┗━━━━━━━━━━━━━━━━━━┛       ┏━━━━━━━━━━━━━━━━┓      ┏━━━━━━━━━━━━━━━━━┓
                 ▲             ┃ bytecode frame ┃      ┃ generator frame ┃
                 │             ┗━━━━━━━━┯━━━━━━━┛      ┗━━━━━━━━┯━━━━━━━━┛
                 │[yes, cont.]          │                       │
                 │                      ▼                       ▼
    ┏━━━━━━━━┓   │             ╔════════════════╗      ╔═════════════════╗
◄───┨ return ┃   │             ║ inner bytecode ║      ║ inner generator ║
    ┗━━━━━━━━┛   │             ║      loop      ║      ║      loop       ║
        ▲        │             ╚════════╤═══════╝      ╚════════╤════════╝
        │   ╭────┴─────╮                │                       │
        │   │ has next │                ╰───────────┬───────────╯
   [no] ╰───┤ frame?   │                            │
            ╰────┬─────╯                            ▼
                 │                         ┏━━━━━━━━━━━━━━━━━┓
                 │                         ┃ frame completed ┃
                 ╰─────────────────────────┨ or suspended    ┃
                                           ┗━━━━━━━━━━━━━━━━━┛
```

Initially, the VM always pops a frame from the frame stack and then inspects
the type of frame it found. As a consequence, the next frame to execute is
always the frame at the top of the stack, and setting up the VM for code
execution is done by leaving a bytecode frame with the code to execute on the
stack and passing control to the outer loop.

Control is dispatched to either of the inner loops (depending on the type of
frame) and the cycle continues once they return.

When an inner loop returns, it has either finished its execution (and left its
result value on the *value stack*), or its frame has requested to be
suspended.

Frames request suspension by re-enqueueing *themselves* through VM helper
methods, and then leaving the frame they want to run *on top* of themselves in
the frame stack before yielding control back to the outer loop.

The inner control loops inform the outer loop about whether the frame has
been *completed* or *suspended* by returning a boolean.

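As a toy model of this dance, the following sketch (with invented frame
types, not the real VM's) shows a frame suspending itself by re-enqueueing
itself, plus the frame whose value it needs on top, before yielding to the
outer loop:

```rust
// Toy model of the outer VM loop. `Frame` stands in for the real bytecode
// and generator frames; all names here are invented for illustration.
enum Frame {
    // A "bytecode" frame that just leaves a constant on the value stack.
    Const(i64),
    // A frame that needs another frame's result before it can complete.
    AddOne { waiting: bool },
}

// Runs one frame. Returns `true` if the frame completed (leaving a single
// value on the value stack), `false` if it suspended itself.
fn run_frame(frame: Frame, frames: &mut Vec<Frame>, values: &mut Vec<i64>) -> bool {
    match frame {
        Frame::Const(n) => {
            values.push(n);
            true
        }
        Frame::AddOne { waiting: false } => {
            // Suspension: re-enqueue ourselves, then put the frame we are
            // waiting for *on top*, and yield control to the outer loop.
            frames.push(Frame::AddOne { waiting: true });
            frames.push(Frame::Const(41));
            false
        }
        Frame::AddOne { waiting: true } => {
            // Resumed: the frame we waited for left its result on the stack.
            let v = values.pop().unwrap();
            values.push(v + 1);
            true
        }
    }
}

// The outer loop: pop the top frame, dispatch it, repeat until no frames
// remain; the last frame's result is the VM's return value.
fn run_vm(mut frames: Vec<Frame>) -> i64 {
    let mut values = Vec::new();
    while let Some(frame) = frames.pop() {
        let _completed = run_frame(frame, &mut frames, &mut values);
    }
    values.pop().expect("completed frames leave a single value")
}

fn main() {
    println!("{}", run_vm(vec![Frame::AddOne { waiting: false }]));
}
```

Note that suspension needs no unwinding machinery at all: a suspended frame
is simply data sitting in the frame stack until the outer loop pops it again.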
### Inner bytecode loop

The inner bytecode loop drives the execution of some Tvix bytecode by
continuously looking at the next instruction to execute, and dispatching to
the instruction handler.

```
   ┏━━━━━━━━━━━━━┓
◄──┨ return true ┃
   ┗━━━━━━━━━━━━━┛
          ▲
     ╔════╧═════╗
     ║ OpReturn ║
     ╚══════════╝
          ▲
          ╰──┬───────────────────────────────╮
             │                               ▼
             │                    ╔═════════════════════╗
    ┏━━━━━━━━┷━━━━━┓              ║ execute instruction ║
───►┃ inspect next ┃              ╚══════════╤══════════╝
    ┃ instruction  ┃                         │
    ┗━━━━━━━━━━━━━━┛                         │
           ▲                           ╭─────┴─────╮
           ╰───────────────────────────┤ suspends? │
                          [no]         ╰─────┬─────╯
                                             │
                                             │
   ┏━━━━━━━━━━━━━━┓                          │
◄──┨ return false ┃──────────────────────────╯
   ┗━━━━━━━━━━━━━━┛              [yes]
```

With this refactoring, the compiler now emits a special `OpReturn` instruction
at the end of bytecode chunks. This is a signal to the runtime that the chunk
has completed and that its current value should be returned, without having to
perform instruction pointer arithmetic.

When `OpReturn` is encountered, the inner bytecode loop returns control to the
outer loop and informs it (by returning `true`) that the bytecode frame has
completed.

Any other instruction may also request a suspension of the bytecode frame (for
example, instructions that need to force a value). In this case the inner loop
is responsible for setting up the frame stack correctly, and for returning
`false` to inform the outer loop of the suspension.

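A minimal sketch of such an inner bytecode loop, with a made-up
four-instruction set standing in for the real Tvix opcodes:

```rust
// Sketch of the inner bytecode loop with an invented instruction set.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Force,  // stand-in for any instruction that must suspend the frame
    Return, // emitted by the compiler at the end of every chunk
}

struct BytecodeFrame {
    chunk: Vec<Op>,
    ip: usize,
}

// Returns `true` when the frame ran to completion (`Op::Return`), `false`
// when an instruction requested suspension.
fn run_bytecode(frame: &mut BytecodeFrame, stack: &mut Vec<i64>) -> bool {
    loop {
        let op = frame.chunk[frame.ip];
        frame.ip += 1;
        match op {
            Op::Push(n) => stack.push(n),
            Op::Add => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push(a + b);
            }
            // In the real VM this arm would first re-enqueue this frame and
            // the thunk to force before informing the outer loop.
            Op::Force => return false,
            Op::Return => return true,
        }
    }
}

fn main() {
    let mut frame = BytecodeFrame {
        chunk: vec![Op::Push(20), Op::Push(22), Op::Add, Op::Return],
        ip: 0,
    };
    let mut stack = Vec::new();
    let completed = run_bytecode(&mut frame, &mut stack);
    println!("completed={completed}, result={}", stack.pop().unwrap());
}
```

Because the frame stores its own instruction pointer, a suspended frame can
later be resumed by simply re-entering this loop at the saved `ip`.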

### Inner generator loop

The inner generator loop is responsible for driving the execution of a
generator frame by continuously calling [`Gen::resume`][] until it requests a
suspension (as a result of which control is returned to the outer loop), or
until the generator is done and yields a value.

```
   ┏━━━━━━━━━━━━━┓               [Done]
◄──┨ return true ┃ ◄───────────────────╮
   ┗━━━━━━━━━━━━━┛                     │
                                       │
                                       │
                    ╭──────────────────┴─────────╮
            ╭───────┤ inspect generator response │
            │       ╰──────────────────┬─────────╯
   ┏━━━━━━━━┷━━━━━━━━┓                 │
──►┃ gen.resume(msg) ┃                 │[Yielded]
   ┗━━━━━━━━━━━━━━━━━┛                 │
            ▲                   ╭──────┴─────╮
            │    [yes]          │ same-frame │
            ╰───────────────────┤ request?   │
                                ╰──────┬─────╯
   ┏━━━━━━━━━━━━━━┓                    │
◄──┨ return false ┃ ◄──────────────────╯
   ┗━━━━━━━━━━━━━━┛                [no]
```

On each execution of a generator frame, `resume_with` is called with a
[`VMResponse`][] (i.e. a message *from* the VM *to* the generator). For a newly
created generator, the initial message is just `Empty`.

A generator may then respond by signaling that it has finished execution
(`Done`), in which case the inner generator loop returns control to the outer
loop and informs it that this generator is done (by returning `true`).

A generator may also respond by signaling that it needs some data from the VM.
This is implemented through a request-response pattern, in which the generator
returns a `Yielded` message containing a [`VMRequest`][]. These requests can be
very simple ("Tell me the current store path") or more complex ("Call this Nix
function with these values").

Requests are divided into two classes: same-frame requests (requests that can be
responded to *without* returning control to the outer loop, i.e. without
executing a *different* frame), and multi-frame generator requests. Based on the
type of request, the inner generator loop will either handle it right away and
send the response in a new `resume_with` call, or return `false` to the outer
loop after setting up the frame stack.

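The dispatch described above can be sketched roughly as follows; the real
`VMRequest` and `VMResponse` enums have many more variants, and the generator
here is faked with a closure instead of a `genawaiter` state machine:

```rust
// Simplified request/response types, invented for this sketch.
enum VMRequest {
    StorePath,            // same-frame: answerable without leaving this frame
    CallFunction(String), // multi-frame: another frame must run first
}

enum VMResponse {
    Empty,
    Path(String),
}

// Resuming the fake generator returns `None` when it is done, or
// `Some(request)` when it yielded a request.
fn inner_generator_loop(
    mut resume: impl FnMut(VMResponse) -> Option<VMRequest>,
) -> bool {
    // A newly created generator is resumed with the `Empty` message.
    let mut msg = VMResponse::Empty;
    loop {
        match resume(msg) {
            // `Done`: tell the outer loop this frame completed.
            None => return true,
            // Same-frame request: answer immediately and resume again.
            Some(VMRequest::StorePath) => {
                msg = VMResponse::Path("/nix/store".to_string());
            }
            // Multi-frame request: the real VM would now set up the frame
            // stack before returning `false` ("suspended") to the outer loop.
            Some(VMRequest::CallFunction(_)) => return false,
        }
    }
}

fn main() {
    // A generator that asks for the store path once, then finishes.
    let mut step = 0;
    let done = inner_generator_loop(|msg| {
        step += 1;
        match step {
            1 => {
                assert!(matches!(msg, VMResponse::Empty));
                Some(VMRequest::StorePath)
            }
            _ => {
                assert!(matches!(msg, VMResponse::Path(_)));
                None
            }
        }
    });
    println!("completed: {done}");
}
```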
Most of this logic is implemented in cl/8148.

[`Gen::resume`]: https://docs.rs/genawaiter/0.99.1/genawaiter/rc/struct.Gen.html#method.resume_with
[`VMRequest`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L44
[`VMResponse`]: https://cs.tvl.fyi/depot@2696839770c1ccb62929ff2575a633c07f5c9593/-/blob/tvix/eval/src/vm/generators.rs?L169

## Advantages & Disadvantages of the approach

This approach has several advantages:

* The execution model is much simpler than before, making it fairly
  straightforward to build up a mental model of what the VM does.

* All "out of band requests" inside the VM are handled through the same
  abstraction (generators).

* The implementation is not difficult, albeit a little verbose in some cases
  (we can argue about whether or not to introduce macros for simplifying it).

* Several parts of the VM execution are now much easier to document,
  potentially letting us onboard tvix-eval contributors faster.

* The linear VM execution itself is much easier to trace, with for example the
  `RuntimeObserver` (and by extension `tvixbolt`) giving much clearer output
  than before.

But it also comes with some disadvantages:

* Even though we "only" use the `async` keyword without a full async I/O
  runtime, we still encounter many of the drawbacks of the fragmented Rust
  async ecosystem.

  The biggest issue with this is that parts of the standard library become
  unavailable to us; for example, the built-in `Vec::sort_by` can no longer be
  used for sorting in Nix because our comparators themselves are `async`.

  This has led us to implement some logic on our own, as the design of `async`
  in Rust makes it difficult to provide use-case-generic implementations of
  concepts like sorting.

* We need to allocate quite a few new structures on the heap in order to drive
  generators, as generators involve storing `Future` types (with unknown
  sizes) inside of structs.

  In initial testing this seems to make no significant difference in
  performance (our performance in an actual nixpkgs-eval is still bottlenecked
  by I/O concerns and reference scanning), but it is something to keep in mind
  later on, when we start optimising more after the low-hanging fruit has been
  reaped.

## Alternatives considered

1. Keeping the current trampoline approach and extending its
   implementation to accommodate problems as they show up. This is not
   preferred as the code is already getting messy.

2. Making tvix-eval a fully `async` project, pulling in something like Tokio
   or `async-std` as a runtime. This is not preferred due to the massively
   increased complexity of those solutions, and all the known issues of fully
   buying in to the async ecosystem.

   tvix-eval fundamentally should work for use-cases besides building Nix
   packages (e.g. for `//tvix/serde`), and its profile should be as slim as
   possible.

3. Convincing the Rust developers that Rust needs a way to guarantee
   constant-stack-depth tail calls through something like a `tailcall`
   keyword.

4. ... ?

[`genawaiter`]: https://docs.rs/genawaiter/
[wfs]: https://docs.google.com/document/d/1Zuw9UdMy95hcqsd-KudTw5yeeUkEXrqOi0rooGB7GWA/edit