Known Optimisation Potential
============================

There are several areas of the Tvix evaluator code base where
potentially large performance gains can be achieved through
optimisations that we are already aware of.

The shape of most optimisations is that of moving more work into the
compiler to simplify the runtime execution of Nix code. This leads, in
some cases, to drastically higher complexity in both the compiler
itself and in invariants that need to be guaranteed between the
runtime and the compiler.

For this reason, and because we lack the infrastructure to adequately
track their impact (WIP), we have not yet implemented these
optimisations, but note the most important ones here.

* Use "open upvalues" [hard]

  Right now, Tvix will immediately close over all upvalues that are
  created and clone them into the `Closure::upvalues` array.

  Instead of doing this, we can statically determine most locals that
  are closed over *and escape their scope* (similar to how the
  `compiler::scope::Scope` struct currently tracks whether locals are
  used at all).

  If we implement the machinery to track this, we can implement some
  upvalues at runtime by simply sticking stack indices in the upvalue
  array and only copy the values where we know that they escape.
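
  A minimal sketch of the open/closed split, assuming a stack of plain
  `i64` values. The `Upvalue` enum and the `resolve` and `close`
  helpers are hypothetical names for illustration and are not Tvix's
  actual types.

  ```rust
  /// Hypothetical upvalue representation (not Tvix's actual types).
  #[derive(Debug, Clone, PartialEq)]
  enum Upvalue {
      /// The captured local still lives on the stack: store only its index.
      Open(usize),
      /// The captured local escapes its scope: the value was copied out.
      Closed(i64),
  }

  /// Reads the value an upvalue currently refers to.
  fn resolve(upvalue: &Upvalue, stack: &[i64]) -> i64 {
      match upvalue {
          Upvalue::Open(idx) => stack[*idx],
          Upvalue::Closed(value) => *value,
      }
  }

  /// Copies the value out of the stack. In the scheme described above this
  /// would only be emitted for locals that are known to escape.
  fn close(upvalue: &mut Upvalue, stack: &[i64]) {
      if let Upvalue::Open(idx) = *upvalue {
          *upvalue = Upvalue::Closed(stack[idx]);
      }
  }
  ```

  An open upvalue costs one index and no clone. The copy happens only
  in `close`, which is the step the static analysis would let us skip
  for locals that never escape.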

* Avoid `with` value duplication [easy]

  If a `with` makes use of a local identifier in a scope that cannot
  close before the `with` (e.g. not across `LambdaCtx` boundaries), we
  can avoid the allocation of the phantom value and the duplication of
  the `NixAttrs` value on the stack. In this case we simply push the
  stack index of the known local.

* Multiple attribute selection [medium]

  An instruction could be introduced that avoids repeatedly pushing an
  attribute set onto the stack and popping it again if multiple keys
  are being selected from it. This occurs, for example, when
  inheriting from an attribute set or when binding function formals.
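
  One possible shape for such an instruction, sketched on a toy value
  stack. `Value` and `op_select_many` are hypothetical stand-ins and
  not part of Tvix; error handling is reduced to strings.

  ```rust
  use std::collections::BTreeMap;

  /// Toy value type for illustration only.
  #[derive(Debug, Clone, PartialEq)]
  enum Value {
      Int(i64),
      Attrs(BTreeMap<String, Value>),
  }

  /// Hypothetical instruction: pops one attribute set and pushes the value
  /// of every requested key, in order, without re-pushing the set per key.
  fn op_select_many(stack: &mut Vec<Value>, keys: &[&str]) -> Result<(), String> {
      let attrs = match stack.pop() {
          Some(Value::Attrs(attrs)) => attrs,
          other => return Err(format!("expected attribute set, got {:?}", other)),
      };

      for key in keys {
          let value = attrs
              .get(*key)
              .cloned()
              .ok_or_else(|| format!("attribute '{}' missing", key))?;
          stack.push(value);
      }

      Ok(())
  }
  ```

  The attribute set is taken off the stack exactly once, regardless of
  how many keys are selected.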

* Split closure/function representation [easy]

  Functions have fewer fields that need to be populated at runtime
  than closures and can directly use the `value::function::Lambda`
  representation where possible.

* Apply `compiler::optimise_select` to other set operations [medium]

  In addition to selects, statically known attribute resolution could
  also be used for things like `?` or `with`. The latter might be a
  little more complicated but is worth investigating.

* Inline fully applied builtins with equivalent operators [medium]

  Some `builtins` have equivalent operators, e.g. `builtins.sub`
  corresponds to the `-` operator, `builtins.hasAttr` to the `?`
  operator etc. These operators additionally compile to a primitive
  VM opcode, so they should be at least as cheap as a builtin
  application.

  If the compiler encounters a fully applied builtin (i.e. no
  currying is occurring) and the `builtins` global is unshadowed, it
  could compile the equivalent operator bytecode instead. For
  example, `builtins.sub 20 22` would be compiled as `20 - 22`.
  This would ensure that equivalent `builtins` can also benefit
  from special optimisations we may implement for certain operators
  (in the absence of currying). E.g. we could optimise access
  to the `builtins` attribute set which a call to
  `builtins.getAttr "foo" builtins` should also profit from.
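
  The rewrite can be sketched on a toy AST. Everything below (`Expr`,
  `inline_builtin`, `is_builtin` and the `builtins_shadowed` flag) is
  hypothetical and only covers `builtins.sub`; the real compiler works
  on a different syntax tree and emits bytecode rather than a new
  expression.

  ```rust
  /// Toy expression type for illustration only.
  #[derive(Debug, Clone, PartialEq)]
  enum Expr {
      Int(i64),
      Ident(String),
      /// `set.attr`
      Select(Box<Expr>, String),
      /// `function argument`
      Apply(Box<Expr>, Box<Expr>),
      /// `lhs - rhs`
      Sub(Box<Expr>, Box<Expr>),
  }

  /// Returns true if `expr` is exactly `builtins.<name>`.
  fn is_builtin(expr: &Expr, name: &str) -> bool {
      match expr {
          Expr::Select(set, attr) => {
              attr.as_str() == name
                  && matches!(&**set, Expr::Ident(id) if id.as_str() == "builtins")
          }
          _ => false,
      }
  }

  /// Rewrites a fully applied `builtins.sub a b` into `a - b`, but only if
  /// the caller has established that `builtins` is not shadowed.
  fn inline_builtin(expr: Expr, builtins_shadowed: bool) -> Expr {
      if builtins_shadowed {
          return expr;
      }

      match expr {
          Expr::Apply(outer_fn, rhs) => match *outer_fn {
              Expr::Apply(inner_fn, lhs) if is_builtin(&inner_fn, "sub") => {
                  Expr::Sub(lhs, rhs)
              }
              // Partially applied or unrelated call: leave it untouched.
              other => Expr::Apply(Box::new(other), rhs),
          },
          other => other,
      }
  }
  ```

  A partially applied `builtins.sub 20` has only one `Apply` layer and
  therefore falls through unchanged, which matches the "no currying"
  condition above.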

* Avoid nested `VM::run` calls [hard]

  Currently, when encountering Nix-native callables (thunks, closures),
  the VM's run loop will nest and return the value of the nested call
  frame one level up. This makes the Rust call stack almost mirror the
  Nix call stack, which is usually undesirable.

  It is possible to detect situations where this is avoidable and
  instead set up the VM in such a way that it continues and produces
  the desired result in the same run loop, but this is tricky to get
  right, especially while other parts are still in flux.

  For details consult the commit with Gerrit change ID
  `I96828ab6a628136e0bac1bf03555faa4e6b74ece`, in which the initial
  attempt at doing this was reverted.
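
  The general technique is to keep call frames in an explicit stack
  and let a single loop drive all of them. The toy VM below is a
  sketch of that idea only; `Op`, `Frame` and `run` are hypothetical
  and far simpler than the real VM (no thunks, no builtins, no error
  reporting).

  ```rust
  /// Toy instruction set for illustration only.
  #[derive(Debug, Clone, Copy)]
  enum Op {
      Push(i64),
      Add,
      /// Calls the chunk with the given index.
      Call(usize),
  }

  /// A call frame: which chunk is executing and where in it we are.
  struct Frame {
      chunk: usize,
      ip: usize,
  }

  /// Runs `entry` to completion in a single loop. A call pushes a frame
  /// instead of recursing into `run`, so the Rust stack depth stays flat.
  fn run(chunks: &[Vec<Op>], entry: usize) -> Option<i64> {
      let mut stack: Vec<i64> = Vec::new();
      let mut frames = vec![Frame { chunk: entry, ip: 0 }];

      while let Some(frame) = frames.last_mut() {
          let code = &chunks[frame.chunk];
          if frame.ip >= code.len() {
              // End of chunk: return to the calling frame.
              frames.pop();
              continue;
          }

          let op = code[frame.ip];
          frame.ip += 1;

          match op {
              Op::Push(n) => stack.push(n),
              Op::Add => {
                  let b = stack.pop()?;
                  let a = stack.pop()?;
                  stack.push(a + b);
              }
              Op::Call(target) => frames.push(Frame { chunk: target, ip: 0 }),
          }
      }

      stack.pop()
  }
  ```

  The frame stack still grows with the depth of the Nix call stack,
  but it lives on the heap rather than on the Rust call stack.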

* Avoid thunks if only identifier closing is required [medium]

  Some constructs, like `with`, mostly do not change runtime behaviour
  if thunked. However, they are wrapped in thunks to ensure that
  deferred identifiers are resolved correctly.

  This can be avoided, as we statically analyse the scope and should
  be able to tell whether any such logic is required.

* Intern literals [easy]

  Currently, the compiler emits a separate entry in the constant
  table for each literal. So the program `1 + 1 + 1` will have
  three entries in its `Chunk::constants` instead of only one.
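
  A minimal sketch of interning, assuming integer constants only. The
  `Constants` type is a hypothetical stand-in for the constant table
  and not the actual `Chunk` implementation.

  ```rust
  use std::collections::HashMap;

  /// Hypothetical, simplified constant table.
  #[derive(Default)]
  struct Constants {
      values: Vec<i64>,
      /// Maps an already stored constant to its index in `values`.
      index: HashMap<i64, usize>,
  }

  impl Constants {
      /// Returns the index of `value`, adding a new entry only if the
      /// constant has not been seen before.
      fn intern(&mut self, value: i64) -> usize {
          if let Some(&idx) = self.index.get(&value) {
              return idx;
          }

          let idx = self.values.len();
          self.values.push(value);
          self.index.insert(value, idx);
          idx
      }
  }
  ```

  With this, the three literals in `1 + 1 + 1` share a single entry.
  A real table would need a hashable key for every kind of literal;
  `f64`, for example, does not implement `Hash` in Rust.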

* Do some list and attribute set operations in place [hard]

  Algorithms that cannot do most of their work inside `builtins` like `map`,
  `filter` or `foldl'` usually perform terribly if they use data structures like
  lists and attribute sets.

  `builtins` can do work in place on a copy of a `Value`, but naïvely expressed
  recursive algorithms will usually use `//` and `++` to make a single change to
  a `Value` at a time, requiring a full copy of the data structure each time.
  It would be a big improvement if we could do some of these operations in place
  without requiring a new copy.

  There are probably two approaches: We could determine statically if a value is
  reachable from elsewhere and emit a special in-place instruction if not. An
  easier alternative is probably to rely on reference counting at runtime: If no
  other reference to a value exists, we can extend the list or update the
  attribute set in place.

  An **alternative** to this is using [persistent data
  structures](https://en.wikipedia.org/wiki/Persistent_data_structure) or at the
  very least [immutable data structures](https://docs.rs/im/latest/im/) that can
  be copied more efficiently than the stock structures we are using at the
  moment.
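
  The reference-counting approach can be sketched with the standard
  library's `Rc::make_mut`, which mutates in place when the pointer is
  unique and clones the contents otherwise. The `append` function and
  the use of `Rc<Vec<i64>>` as a list representation are assumptions
  for illustration.

  ```rust
  use std::rc::Rc;

  /// Appends one element to a reference-counted list, standing in for
  /// `list ++ [ item ]`.
  fn append(mut list: Rc<Vec<i64>>, item: i64) -> Rc<Vec<i64>> {
      // `make_mut` only clones the vector if another `Rc` points to it.
      // If this is the sole reference, the push happens in place.
      Rc::make_mut(&mut list).push(item);
      list
  }
  ```

  If the list is shared, other holders keep seeing the old contents,
  so the observable semantics stay the same as with a full copy.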

* Skip finalising unfinalised thunks or non-thunks instead of crashing [easy]

  Currently `OpFinalise` crashes the VM if it is called on values that don't
  need to be finalised. This helps catch miscompilations where `OpFinalise`
  operates on the wrong `StackIdx`. In the case of function argument patterns,
  however, this means extra VM stack and instruction overhead for dynamically
  determining whether finalisation is necessary. This wouldn't be necessary
  if `OpFinalise` were simply a no-op on any values that don't need to be
  finalised (anymore).
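
  The trade-off between the two behaviours, sketched on a toy value
  type. `Value`, `finalise_strict` and `finalise_lenient` are
  hypothetical, and finalisation is reduced to flipping a flag, which
  is much simpler than what the VM actually does.

  ```rust
  /// Toy value type for illustration only.
  #[derive(Debug, PartialEq)]
  enum Value {
      Int(i64),
      Thunk { finalised: bool },
  }

  /// Strict variant: anything that is not an unfinalised thunk is an
  /// error, which surfaces miscompilations early.
  fn finalise_strict(value: &mut Value) -> Result<(), String> {
      match *value {
          Value::Thunk { finalised: false } => {
              *value = Value::Thunk { finalised: true };
              Ok(())
          }
          _ => Err(format!("cannot finalise {:?}", value)),
      }
  }

  /// Lenient variant: values that need no finalisation are left alone,
  /// so the compiler does not have to guard the instruction.
  fn finalise_lenient(value: &mut Value) {
      if let Value::Thunk { finalised } = value {
          *finalised = true;
      }
  }
  ```

  The lenient variant removes the need for runtime bookkeeping around
  the instruction, at the cost of no longer detecting a wrong stack
  index.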

* Phantom binding for from expression of inherits [easy]

  The from expression of an inherit is reevaluated for each inherit. This can
  be demonstrated using the following Nix expression which, counter-intuitively,
  will print “plonk” twice.

  ```nix
  let
    inherit (builtins.trace "plonk" { a = null; b = null; }) a b;
  in
  builtins.seq a (builtins.seq b null)
  ```

  In most Nix code, the from expression is just an identifier, so it is not
  terribly inefficient, but in some cases a more expensive expression may
  be used. We should create a phantom binding for the from expression that
  is reused in the inherits, so only a single thunk is created for the from
  expression.
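
  The difference can be modelled outside of Nix by counting how often
  the from expression is evaluated. In this sketch the from expression
  is a Rust closure, and `inherit_naive` and `inherit_phantom` are
  hypothetical helpers; laziness and thunks are not modelled.

  ```rust
  use std::collections::BTreeMap;

  type Attrs = BTreeMap<String, i64>;

  /// Models the current behaviour: the from expression is evaluated once
  /// per inherited name.
  fn inherit_naive(from: &dyn Fn() -> Attrs, names: &[&str]) -> Vec<Option<i64>> {
      names.iter().map(|name| from().get(*name).copied()).collect()
  }

  /// Models the proposed behaviour: the from expression is evaluated once,
  /// bound to a phantom, and every inherited name selects from it.
  fn inherit_phantom(from: &dyn Fn() -> Attrs, names: &[&str]) -> Vec<Option<i64>> {
      let phantom = from();
      names.iter().map(|name| phantom.get(*name).copied()).collect()
  }
  ```

  Both functions return the same values. Only the number of
  evaluations of `from` differs, which is what the `builtins.trace`
  call in the Nix example makes visible.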