docs(tvix/castore/blobstore): reorganize docs

docs/verified-streaming.md explained how CDC and verified streaming can work together, but didn't really highlight enough how chunking in general also helps with seeking. In addition, a lot of the thoughts w.r.t. the BlobStore protocol, both gRPC and Rust traits, as well as why there's no support for seeking directly in gRPC, as well as how clients should behave w.r.t. chunked fetching was missing, or mixed together with the verified streaming bits. While there is no verified streaming version yet, a chunked one is coming soon, and documenting this a bit better is gonna make it easier to understand, as well as provide some lookout on where this is heading. Change-Id: Ib11b8ccf2ef82f9f3a43b36103df0ad64a9b68ce Reviewed-on: https://cl.tvl.fyi/c/depot/+/10733 Autosubmit: flokli <flokli@flokli.de> Tested-by: BuildkiteCI Reviewed-by: Connor Brewster <cbrewster@hey.com>
2024-02-02 15:51:44 +02:00 · 2024-02-02 15:51:44 +02:00 · 40d81d0c74
commit 40d81d0c74
parent 35e7b8a1a8
3 changed files with 251 additions and 121 deletions
--- a/tvix/castore/docs/blobstore-chunking.md
+++ b/tvix/castore/docs/blobstore-chunking.md
@ -0,0 +1,147 @@
+# BlobStore: Chunking & Verified Streaming
+
+`tvix-castore`'s BlobStore is a content-addressed storage system, using [blake3]
+as hash function.
+
+Returned data is fetched by using the digest as lookup key, and can be verified
+to be correct by feeding the received data through the hash function and
+ensuring it matches the digest initially used for the lookup.
+
+This means, data can be downloaded by any untrusted third-party as well, as the
+received data is validated to match the digest it was originally requested with.
+
+However, for larger blobs of data, having to download the entire blob at once is
+wasteful, if we only care about a part of the blob. Think about mounting a
+seekable data structure, like loop-mounting an .iso file, or doing partial reads
+in a large Parquet file, a column-oriented data format.
+
+> We want to have the possibility to *seek* into a larger file.
+
+This however shouldn't compromise on data integrity properties - we should not
+need to trust a peer we're downloading from to be "honest" about the partial
+data we're reading. We should be able to verify smaller reads.
+
+Especially when substituting from an untrusted third-party, we want to be able
+to detect quickly if that third-party is sending us wrong data, and terminate
+the connection early.
+
+## Chunking
+In content-addressed systems, this problem has historically been solved by
+breaking larger blobs into smaller chunks, which can be fetched individually,
+and making a hash of *this listing* the blob digest/identifier.
+
+ - BitTorrent for example breaks files up into smaller chunks, and maintains
+   a list of sha1 digests for each of these chunks. Magnet links contain a
+   digest over this listing as an identifier. (See [bittorrent-v2][here for
+   more details]).
+   With the identifier, a client can fetch the entire list, and then recursively
+   "unpack the graph" of nodes, until it ends up with a list of individual small
+   chunks, which can be fetched individually.
+ - Similarly, IPFS with its IPLD model builds up a Merkle DAG, and uses the
+   digest of the root node as an identitier.
+
+These approaches solve the problem of being able to fetch smaller chunks in a
+trusted fashion. They can also do some deduplication, in case there's the same
+leaf nodes same leaf nodes in multiple places.
+
+However, they also have a big disadvantage. The chunking parameters, and the
+"topology" of the graph structure itself "bleed" into the root hash of the
+entire data structure itself.
+
+Depending on the chunking parameters used, there's different representations for
+the same data, causing less data sharing/reuse in the overall system, in terms of how
+many chunks need to be downloaded vs. are already available locally, as well as
+how compact data is stored on-disk.
+
+This can be workarounded by agreeing on only a single way of chunking, but it's
+not pretty and misses a lot of deduplication potential.
+
+### Chunking in Tvix' Blobstore
+tvix-castore's BlobStore uses a hybrid approach to eliminate some of the
+disadvantages, while still being content-addressed internally, with the
+highlighted benefits.
+
+It uses [blake3] as hash function, and the blake3 digest of **the raw data
+itself** as an identifier (rather than some application-specific Merkle DAG that
+also embeds some chunking information).
+
+BLAKE3 is a tree hash where all left nodes fully populated, contrary to
+conventional serial hash functions. To be able to validate the hash of a node,
+one only needs the hash of the (2) children [^1], if any.
+
+This means one only needs to the root digest to validate a constructions, and these
+constructions can be sent [separately][bao-spec].
+
+This relieves us from the need of having to encode more granular chunking into
+our data model / identifier upfront, but can make this a mostly a transport/
+storage concern.
+
+For the some more description on the (remote) protocol, check
+`./blobstore-protocol.md`.
+
+#### Logical vs. physical chunking
+
+Due to the properties of the BLAKE3 hash function, we have logical blocks of
+1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
+chunk sizes w.r.t. what "physical chunks" are sent over the wire between peers,
+or are stored on-disk.
+
+The only thing we need to be able to read and verify an arbitrary byte range is
+having the covering range of aligned 1K blocks, and a construction from the root
+digest to the 1K block.
+
+Note the intermediate hash tree can be further trimmed, [omitting][bao-tree]
+lower parts of the tree while still providing verified streaming - at the cost
+of having to fetch larger covering ranges of aligned blocks.
+
+Let's pick an example. We identify each KiB by a number here for illustrational
+purposes.
+
+Assuming we omit the last two layers of the hash tree, we end up with logical
+4KiB leaf chunks (`bao_shift` of `2`).
+
+For a blob of 14 KiB total size, we could fetch logical blocks `[0..=3]`,
+`[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
+
+`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
+
+Assuming the server now informs us about the following physical chunking:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]`
+```
+
+If our application now wants to arbitrarily read from 0 until 4 (inclusive):
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+ |-------------|
+
+```
+
+…we need to fetch physical chunks `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`.
+
+
+`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
+interested in.
+
+We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
+can assemble `[ 4 5 6 7 ]` to verify both logical chunks:
+
+```
+[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]
+^       ^           ^     ^
+|----4KiB----|------4KiB-----|
+```
+
+Each physical chunk fetched can be validated to have the blake3 digest that was
+communicated upfront, and can be stored in a client-side cache/storage, so
+subsequent / other requests for the same data will be fast(er).
+
+---
+
+[^1]: and the surrounding context, aka position inside the whole blob, which is available while verifying the tree
+[bittorrent-v2]: https://blog.libtorrent.org/2020/09/bittorrent-v2/
+[blake3]: https://github.com/BLAKE3-team/BLAKE3
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
--- a/tvix/castore/docs/blobstore-protocol.md
+++ b/tvix/castore/docs/blobstore-protocol.md
@ -0,0 +1,104 @@
+# BlobStore: Protocol / Composition
+
+This documents describes the protocol that BlobStore uses to substitute blobs
+other ("remote") BlobStores.
+
+How to come up with the blake3 digest of the blob to fetch is left to another
+layer in the stack.
+
+To put this into the context of Tvix as a Nix alternative, a blob represents an
+individual file inside a StorePath.
+In the Tvix Data Model, this is accomplished by having a `FileNode` (either the
+`root_node` in a `PathInfo` message, or a individual file inside a `Directory`
+message) encode a BLAKE3 digest.
+
+However, the whole infrastructure can be applied for other usecases requiring
+exchange/storage or access into data of which the blake3 digest is known.
+
+## Protocol and Interfaces
+As an RPC protocol, BlobStore currently uses gRPC.
+
+On the Rust side of things, every blob service implements the
+[`BlobService`](../src/blobservice/mod.rs) async trait, which isn't
+gRPC-specific.
+
+This `BlobService` trait provides functionality to check for existence of Blobs,
+read from blobs, and write new blobs.
+It also provides a method to ask for more granular chunks if they are available.
+
+In addition to some in-memory, on-disk and (soon) object-storage-based
+implementations, we also have a `BlobService` implementation that talks to a
+gRPC server, as well as a gRPC server wrapper component, which provides a gRPC
+service for anything implementing the `BlobService` trait.
+
+This makes it very easy to talk to a remote `BlobService`, which does not even
+need to be written in the same language, as long it speaks the same gRPC
+protocol.
+
+It also puts very little requirements on someone implementing a new
+`BlobService`, and how its internal storage or chunking algorithm looks like.
+
+The gRPC protocol is documented in `../protos/rpc_blobstore.proto`.
+Contrary to the `BlobService` trait, it does not have any options for seeking/
+ranging, as it's more desirable to provide this through chunking (see also
+`./blobstore-chunking.md`).
+
+## Composition
+Different `BlobStore` are supposed to be "composed"/"layered" to express
+caching, multiple local and remote sources.
+
+The fronting interface can be the same, it'd just be multiple "tiers" that can
+respond to requests, depending on where the data resides. [^1]
+
+This makes it very simple for consumers, as they don't need to be aware of the
+entire substitutor config.
+
+The flexibility of this doesn't need to be exposed to the user in the default
+case; in most cases we should be fine with some form of on-disk storage and a
+bunch of substituters with different priorities.
+
+### gRPC Clients
+Clients are encouraged to always read blobs in a chunked fashion (asking for a
+list of chunks for a blob via `BlobService.Stat()`, then fetching chunks via
+`BlobService.Read()` as needed), instead of directly reading the entire blob via
+`BlobService.Read()`.
+
+In a composition setting, this provides opportunity for caching, and avoids
+downloading some chunks if they're already present locally (for example, because
+they were already downloaded by reading from a similar blob earlier).
+
+It also removes the need for seeking to be a part of the gRPC protocol
+alltogether, as chunks are supposed to be "reasonably small" [^2].
+
+There's some further optimization potential, a `BlobService.Stat()` request
+could tell the server it's happy with very small blobs just being inlined in
+an additional additional field in the response, which would allow clients to
+populate their local chunk store in a single roundtrip.
+
+## Verified Streaming
+As already described in `./docs/blobstore-chunking.md`, the physical chunk
+information sent in a `BlobService.Stat()` response is still sufficient to fetch
+in an authenticated fashion.
+
+The exact protocol and formats are still a bit in flux, but here's some notes:
+
+ - `BlobService.Stat()` request gets a `send_bao` field (bool), signalling a
+   [BAO][bao-spec] should be sent. Could also be `bao_shift` integer, signalling
+   how detailed (down to the leaf chunks) it should go.
+   The exact format (and request fields) still need to be defined, edef has some
+   ideas around omitting some intermediate hash nodes over the wire and
+   recomputing them, reducing size by another ~50% over [bao-tree].
+ - `BlobService.Stat()` response gets some bao-related fields (`bao_shift`
+   field, signalling the actual format/shift level the server replies with, the
+   actual bao, and maybe some format specifier).
+   It would be nice to also be compatible with the baos used by [iroh], so we
+   can provide an implementation using it too.
+
+---
+
+[^1]: We might want to have some backchannel, so it becomes possible to provide
+      feedback to the user that something is downloaded.
+[^2]: Something between 512K-4M, TBD.
+[bao-spec]: https://github.com/oconnor663/bao/blob/master/docs/spec.md
+[bao-tree]: https://github.com/n0-computer/bao-tree
+[iroh]: https://github.com/n0-computer/iroh
--- a/tvix/castore/docs/verified-streaming.md
+++ b/tvix/castore/docs/verified-streaming.md
@ -1,121 +0,0 @@
-# Verified streaming
-
-`//tvix/castore` is a content-addressed storage system, using [blake3] as hash
-function.
-
-This means returned data is fetched by using the digest as lookup key, and can
-be verified to be correct by feeding the received data through the hash function
-and ensuring it matches the digest initially used for the lookup.
-
-This means, data can be downloaded by any untrusted third-party as well, as the
-received data is validated to match the digest it was originally requested with.
-
-However, for larger blobs of data, having to download the entire blob to be able
-to determine whether it's correct before being able to return it to an upper
-layer takes a lot of time, and is wasteful, if we're only interested in a small
-portion of it.
-
-Especially when substituting from an untrusted third-party, we want to be able
-to detect quickly if that third-party is sending us wrong data, and terminate
-the connection early.
-
-## Chunking
-
-This problem has historically been solved by exchanging a list of smaller
-chunks, which can be fetched individually.
-
-BitTorrent for example breaks files up into smaller chunks, and maintains a list
-of sha1 digests for each of these chunks. After the list has been fetched, this
-allows fetching smaller parts of data selectively from untrusted third-parties.
-
-Similarly, IPFS uses its IPLD model to content-address a Merkle DAG of chunk
-nodes.
-
-While these approaches solve the problem of being able to fetch smaller chunks,
-they have a big disadvantage: the chunking parameters, and the topology of
-the graph structure itself "bleed" into the hash of the entire data structure
-itself.
-
-This comes with some disadvantages:
-
-Depending on the chunking parameters used, there's different representations for
-the same data, causing less data sharing/reuse in the overall content- addressed
-system, both when downloading data from third-parties, as well as benefiting
-from data already available locally.
-
-This can be workarounded by agreeing on only single way of chunking, but it's
-not pretty.
-
-## Chunking in tvix-castore
-
-tvix-castore uses BLAKE3 as a digest function, which internally uses a fixed
-chunksize of 1024 bytes.
-
-BLAKE3 is a tree hash where all left nodes fully populated, contrary to
-conventional serial hash functions. To be able to validate the hash of a node,
-one only needs the hash of the (2) children, if any.
-
-This means one only needs to the root digest to validate a construction, and
-lower levels of the tree can be omitted.
-
-This relieves us from the need of having to communicate more granular chunking
-upfront, and making it part of our data model.
-
-## Logical vs. physical chunking
-
-Due to the properties of the BLAKE3 hash function, we have logical blocks of
-1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
-chunk sizes.
-
-The only thing we need to be able to read and verify an arbitrary byte range is
-having the covering range of aligned 1K blocks.
-
-## Actual implementation
-
- -> BlobService.Read() gets the capability to read chunks as well
- -> BlobService.Stat() can reply with a list of chunks.
-      rq params: send_bao bool
-         server should be able to offer bao all the way down to 1k
-         some open questions w.r.t sending the whole bao until there, or just
-         all the hashes on the "most granular" level
-         -> we can recreate everything above up to the root hash.
-         -> can we maybe add this to n0-computer/bao-tree as another outboard format?
-      resp:
-        - bao_shift: how many levels on the bottom were skipped.
-          0 means send all the leaf node hashes (1K block size)
-        - "our bao": blake3 digests for a given static chunk size + path down to the last leaf node and its data (length proof)
-        - list of (b3digest,size) of all physical chunks.
-          The server can do some CDC on ingestion, and communicate these chunks here.
-          Chunk sizes should be a "reasonable size", TBD, probably something between 512K-4M
-
-Depending on the bao depth received from the server, we end up with a logical
-size of chunks that can be fetched in an authenticated fashion.
-
-Assuming the bao chunk size received is 4(*1KiB bytes) (`bao_shift` of 2), and a
-total blob size of 14 (*1KiB bytes), we can fetch
-`[0..=3]`, `[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
-
-`[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
-
-Assuming the server now informs us about the following physical chunking:
-
-`[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]`
-
-To read from 0 until 4 (inclusive), we need to fetch physical chunks
-`[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`.
-
-`[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
-interested in.
-
-We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
-can assemble `[ 4 5 6 7 ]` to verify that logical chunk.
-
-Each physical chunk fetched can be validated to have the blake3 digest that was
-communicated upfront, and can be stored in a client-side cache/storage.
-
-If it's not there, the client can use the `BlobService.Read()` interface with
-the *physical chunk digest*.
-
---
-
-[blake3]: https://github.com/BLAKE3-team/BLAKE3