docs(tvix/castore): add notes on verified streaming

These are still a bit WIP, but should already explain the idea better than not having any documentation about how we intend to implement this at all. Let's put this into the repo, as a working document. Co-Authored-By: edef <edef@edef.eu> Change-Id: Ic0c5195cdf0a33c8382af1b5cff9064fe6ef9ffb Reviewed-on: https://cl.tvl.fyi/c/depot/+/9894 Autosubmit: flokli <flokli@flokli.de> Reviewed-by: tazjin <tazjin@tvl.su> Tested-by: BuildkiteCI
2023-11-02 10:33:44 +02:00 · 2023-11-02 10:33:44 +02:00 · 5a0ae8312c
commit 5a0ae8312c
parent beae3a4bf1
1 changed files with 121 additions and 0 deletions
--- a/tvix/castore/docs/verified-streaming.md
+++ b/tvix/castore/docs/verified-streaming.md
@ -0,0 +1,121 @@
 # Verified streaming
 `//tvix/castore` is a content-addressed storage system, using [blake3] as hash
 function.
 This means returned data is fetched by using the digest as lookup key, and can
 be verified to be correct by feeding the received data through the hash function
 and ensuring it matches the digest initially used for the lookup.
 This means, data can be downloaded by any untrusted third-party as well, as the
 received data is validated to match the digest it was originally requested with.
 However, for larger blobs of data, having to download the entire blob to be able
 to determine whether it's correct before being able to return it to an upper
 layer takes a lot of time, and is wasteful, if we're only interested in a small
 portion of it.
 Especially when substituting from an untrusted third-party, we want to be able
 to detect quickly if that third-party is sending us wrong data, and terminate
 the connection early.
 ## Chunking
 This problem has historically been solved by exchanging a list of smaller
 chunks, which can be fetched individually.
 BitTorrent for example breaks files up into smaller chunks, and maintains a list
 of sha1 digests for each of these chunks. After the list has been fetched, this
 allows fetching smaller parts of data selectively from untrusted third-parties.
 Similarly, IPFS uses its IPLD model to content-address a Merkle DAG of chunk
 nodes.
 While these approaches solve the problem of being able to fetch smaller chunks,
 they have a big disadvantage: the chunking parameters, and the topology of
 the graph structure itself "bleed" into the hash of the entire data structure
 itself.
 This comes with some disadvantages:
 Depending on the chunking parameters used, there's different representations for
 the same data, causing less data sharing/reuse in the overall content- addressed
 system, both when downloading data from third-parties, as well as benefiting
 from data already available locally.
 This can be workarounded by agreeing on only single way of chunking, but it's
 not pretty.
 ## Chunking in tvix-castore
 tvix-castore uses BLAKE3 as a digest function, which internally uses a fixed
 chunksize of 1024 bytes.
 BLAKE3 is a tree hash where all left nodes fully populated, contrary to
 conventional serial hash functions. To be able to validate the hash of a node,
 one only needs the hash of the (2) children, if any.
 This means one only needs to the root digest to validate a construction, and
 lower levels of the tree can be omitted.
 This relieves us from the need of having to communicate more granular chunking
 upfront, and making it part of our data model.
 ## Logical vs. physical chunking
 Due to the properties of the BLAKE3 hash function, we have logical blocks of
 1KiB, but this doesn't necessarily imply we need to restrict ourselves to these
 chunk sizes.
 The only thing we need to be able to read and verify an arbitrary byte range is
 having the covering range of aligned 1K blocks.
 ## Actual implementation
 -> BlobService.Read() gets the capability to read chunks as well
 -> BlobService.Stat() can hand out a list of chunks with a given chunksize if asked so.
      rq params: send_bao bool
         server should be able to offer bao all the way down to 1k
         some open questions w.r.t sending the whole bao until there, or just
         all the hashes on the "most granular" level
         -> we can recreate everything above up to the root hash.
         -> can we maybe add this to n0-computer/bao-tree as another outboard format?
      resp:
        - bao_shift: how many levels on the bottom were skipped.
          0 means send all the leaf node hashes (1K block size)
        - "our bao": blake3 digests for a given static chunk size + path down to the last leaf node and its data (length proof)
        - list of (b3digest,size) of all physical chunks.
          The server can do some CDC on ingestion, and communicate these chunks here.
          Chunk sizes should be a "reasonable size", TBD, probably something between 512K-4M
 Depending on the bao depth received from the server, we end up with a logical
 size of chunks that can be fetched in an authenticated fashion.
 Assuming the bao chunk size received is 4(*1KiB bytes) (`bao_shift` of 2), and a
 total blob size of 14 (*1KiB bytes), we can fetch
 `[0..=3]`, `[4..=7]`, `[8..=11]` and `[12..=13]` in an authenticated fashion:
 `[ 0 1 2 3 ] [ 4 5 6 7 ] [ 8 9 10 11 ] [ 12 13 ]`
 Assuming the server now informs us about the following physical chunking:
 `[ 0 1 ] [ 2 3 4 5 ] [ 6 ] [ 7 8 ] [ 9 10 11 12 13 14 15 ]`
 To read from 0 until 4 (inclusive), we need to fetch physical chunks
 `[ 0 1 ]`, `[ 2 3 4 5 ]` and `[ 6 ] [ 7 8 ]`.
 `[ 0 1 ]` and `[ 2 3 4 5 ]` are obvious, they contain the data we're
 interested in.
 We however also need to fetch the physical chunks `[ 6 ]` and `[ 7 8 ]`, so we
 can assemble `[ 4 5 6 7 ]` to verify that logical chunk.
 Each physical chunk fetched can be validated to have the blake3 digest that was
 communicated upfront, and can be stored in a client-side cache/storage.
 If it's not there, the client can use the `BlobService.Read()` interface with
 the *physical chunk digest*.
 ---
 [blake3]: https://github.com/BLAKE3-team/BLAKE3