Commit graph

8 commits

Author SHA1 Message Date
Florian Klink
0dea3086f5 docs(tvix/cli/refscan): fix comment
The comment still mentions Aho-Corasick, which isn't correct.

Change-Id: I846a2ca9ea7075c2456ca6ef04a132ff6161227a
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8991
Autosubmit: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-07-31 09:01:34 +00:00
Linus Heckemann
26fed61750 fix(tvix/cli): fix refscan when no paths are referenced
Before, the construction of a TwoByteWM would panic when no patterns
were provided, as in `tvix --expr 'builtins.toFile "snens" "soos"'`.

Change-Id: I25ed498c475523aec5baf8683b23059fadabb21c
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8697
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-06-02 13:19:56 +00:00
Vincent Ambo
38e8c2e959 fix(tvix/cli): keep tracking full paths in known_paths
We need to distinguish explicitly between the paths used for the
scanner, and the paths that populate the derivation inputs. The full
paths must be accessible from the result of the refscanner to populate
drv fields correctly.

This was previously hidden by debug changes that masked actual IO
operations with no-ops.

Change-Id: I037af6e6bbe2b573034d695f8779bee1b56bc125
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8022
Reviewed-by: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
2023-02-02 23:37:34 +00:00
Vincent Ambo
9d6f29a72b refactor(tvix/cli): use Wu-Manber string scanning for drv references
Switch out the string-scanning algorithm used in the reference scanner.

The construction of aho-corasick automata made up the vast majority of
runtime when evaluating nixpkgs previously. While the actual scanning
with a constructed automaton is relatively fast, we almost never scan
for the same set of strings twice and the cost is not worth it.

An algorithm that better matches our needs is the Wu-Manber multiple
string match algorithm, which works efficiently on *long* and *random*
strings of the *same length*, which describes store paths (up to their
hash component).

This switches the refscanner crate to a Rust implementation[0][1] of
this algorithm.

This has several implications:

1. This crate does not provide a way to scan streams. I'm not sure if
   this is an inherent problem with the algorithm (probably not, but
   it would need buffering). Either way, related functions and
   tests (which were actually unused) have been removed.

2. All strings need to be of the same length. For this reason, we
   truncate the known paths after their hash part (they are still
   unique, of course).

3. Passing an empty set of matches, or a match that is shorter than
   the length of a store path, causes the crate to panic. We safeguard
   against this by completely skipping the refscanning if there are no
   known paths (i.e. when evaluating the first derivation of an eval),
   and by bailing out of scanning a string that is shorter than a
   store path.

On the upside, this reduces overall runtime to less 1/5 of what it was
before when evaluating `pkgs.stdenv.drvPath`.

[0]: Frankly, it's a random, research-grade MIT-licensed
     crate that I found on Github:

     https://github.com/jneem/wu-manber

[1]: We probably want to rewrite or at least fork the above crate, and
     add things like a three-byte wide scanner. Evaluating large
     portions of nixpkgs can easily lead to more than 65k derivations
     being scanned for.

Change-Id: I08926778e1e5d5a87fc9ac26e0437aed8bbd9eb0
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8017
Tested-by: BuildkiteCI
Reviewed-by: flokli <flokli@flokli.de>
2023-02-02 17:50:44 +00:00
Vincent Ambo
9cb3daee20 refactor(tvix/cli): reference scanner owns all the strings
This gets very complex very quickly otherwise, as all the construction
paths for a reference scanner and all the access patterns for the
KnownPaths structure are not yet fully understood.

Change-Id: Ibadf1f18b476695f3c286fc6896ae557760edf63
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7827
Reviewed-by: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
2023-01-17 10:31:42 +00:00
Vincent Ambo
f27f5ef0c9 feat(tvix/cli): add known_paths module
This module implements types used to track the set of known paths in
the context of an evaluation.

These are used to determine the build references of a derivation.

Change-Id: I81e15ae33632784e699128916485751613b231a3
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7816
Tested-by: BuildkiteCI
Reviewed-by: flokli <flokli@flokli.de>
2023-01-17 10:20:38 +00:00
Vincent Ambo
e63bff5545 feat(tvix/refscan): implement reference scanning over data streams
Using yet more machinery from the pretty comprehensive aho_corasick
crate, this makes it possible to pass anything implementing `io::Read`
to the `ReferenceScanner` to accumulate matches.

Change-Id: I5b0e28eb44ea4df24010f40831e29f2cbb8c1f80
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7810
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
2023-01-11 16:03:51 +00:00
Vincent Ambo
3045645df0 feat(tvix/cli): implement initial refscan module
This module implements a ReferenceScanner struct which uses the
aho_corasick crate to scan string inputs for known, non-overlapping
candidates (store paths, in our case).

I experimented with several different APIs, and landed on this version
with an initial accumulator in the scanner. The scanner is
instantiated from the candidates and "fed" all the strings, then
consumed by the caller to retrieve the result.

Right now only things that look vaguely like bytestrings can be fed to
the scanner, there is no streaming support in the API yet.

Change-Id: I7782f0f0df5fc64bccd813aa14712f5525b0168c
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7808
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: flokli <flokli@flokli.de>
Tested-by: BuildkiteCI
2023-01-11 16:03:51 +00:00