tvl-depot/tvix/castore/docs/why-not-git-trees.md
Ben Webb 23e0973cdf docs(tvix): fix some typos across various documents
Fix some typos found while reading various documents, mostly those
relating to the castore.

Here is a summary of the edits.

- fix broken link between documents in the store and castore directories
- clarify expression in castore's data model document that indicates
  that the *name* of each child node of a directory must be unique
  across all three lists of children
- add missing closing parenthesis in castore's data model document
- replace "how" with "what" in the phrase "unclear how a ... would even
  look like" in castore's why-not-git-trees document
- remove unnecessary articles in castore's blobstore chunking document
- add missing "y" to "optionall" in eval's compilation of bindings
  document

Change-Id: I1997ea91bb4e9c40abcd81e0cde9405968580ba6
Reviewed-on: https://cl.tvl.fyi/c/depot/+/11763
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2024-06-08 21:17:56 +00:00

57 lines
No EOL
2.7 KiB
Markdown

## Why not git tree objects?
We've been experimenting with (some variations of) the git tree and object
format, and ultimately decided against using it as an internal format, and
instead adapted the one documented in the other documents here.
While the tvix-store API protocol shares some similarities with the format used
in git for trees and objects, the git one has shown some significant
disadvantages:
### The binary encoding itself
#### trees
The git tree object format is a very binary, error-prone and
"made-to-be-read-and-written-from-C" format.
Tree objects are a combination of null-terminated strings, and fields of known
length. References to other tree objects use the literal sha1 hash of another
tree object in this encoding.
Extensions of the format/changes are very hard to do right, because parsers are
not aware they might be parsing something different.
The tvix-store protocol uses a canonical protobuf serialization, and uses
the [blake3][blake3] hash of that serialization to point to other `Directory`
messages.
It's both compact and with a wide range of libraries for encoders and decoders
in many programming languages.
The choice of protobuf makes it easy to add new fields, and make old clients
aware of some unknown fields being detected [^adding-fields].
#### blob
On disk, git blob objects start with a "blob" prefix, then the size of the
payload, and then the data itself. The hash of a blob is the literal sha1sum
over all of this - which makes it something very git specific to request for.
tvix-store simply uses the [blake3][blake3] hash of the literal contents
when referring to a file/blob, which makes it very easy to ask other data
sources for the same data, as no git-specific payload is included in the hash.
This also plays very well together with things like [iroh][iroh-discussion],
which plans to provide a way to substitute (large)blobs by their blake3 hash
over the IPFS network.
In addition to that, [blake3][blake3] makes it possible to do
[verified streaming][bao], as already described in other parts of the
documentation.
The git tree object format uses sha1 both for references to other trees and
hashes of blobs, which isn't really a hash function to fundamentally base
everything on in 2023.
The [migration to sha256][git-sha256] also has been dead for some years now,
and it's unclear what a "blake3" version of this would even look like.
[bao]: https://github.com/oconnor663/bao
[blake3]: https://github.com/BLAKE3-team/BLAKE3
[git-sha256]: https://git-scm.com/docs/hash-function-transition/
[iroh-discussion]: https://github.com/n0-computer/iroh/discussions/707#discussioncomment-5070197
[^adding-fields]: Obviously, adding new fields will change hashes, but it's something that's easy to detect.