Commit graph

6 commits

Author SHA1 Message Date
sterni
750ef6c693 feat(sterni/nix/utf8): check if codepoint valid/encodeable
* Enforce the U+0000 to U+10FFFF range in `count` and throw an error if
  the given codepoint exceeds the range (encoding U+0000 won't work of
  course, but this is Nix's fault…).

* Check if the produced bytes are well formed and output an error if
  not. This indicates that the codepoint can't be encoded as UTF-8, like
  U+D800 which is reserved for UTF-16.

Change-Id: I18336e527484580f28cbfe784d51718ee15c5477
2021-11-25 12:15:35 +01:00
sterni
87a0aaa77d feat(sterni/nix/utf8): implement UTF-8 encoding
This implementation is still a bit rough as it doesn't check if the
produced string is valid UTF-8 which may happen if an invalid Unicode
codepoint is passed.

Change-Id: Ibaa91dafa8937142ef704a175efe967b62e3ee7b
2021-11-25 12:15:35 +01:00
sterni
ab92c42f59 feat(sterni/nix/utf8): allow decoding the empty string
Change-Id: I8de9cd28c822ac5befbcd16e118440cd13cd86e9
2021-11-23 14:23:41 +01:00
Profpatsch
eb41eef612 chore(nix): move rustSimple from users.Profpatsch.writers
I think it’s solid enough to use in a wider context.

Change-Id: If53e8bbb6b90fa88d73fb42730db470e822ea182
Reviewed-on: https://cl.tvl.fyi/c/depot/+/3055
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
Reviewed-by: lukegb <lukegb@tvl.fyi>
2021-04-24 10:23:55 +00:00
sterni
8d4b2f3d54 refactor(sterni): use pkgs over third_party to import from nixpkgs
This should ease migrating to a distinction between depot.third_party
and pkgs (as in nixpkgs) in the future.

Ref cl/2910, b/108.

Change-Id: I53a854071fddd7c0d0526cc4c5b16998202082c6
Reviewed-on: https://cl.tvl.fyi/c/depot/+/2913
Tested-by: BuildkiteCI
Reviewed-by: tazjin <mail@tazj.in>
2021-04-10 11:40:18 +00:00
sterni
b810c46a45 feat(users/sterni/nix/utf8): pure nix utf-8 decoder
users.sterni.nix.utf8 implements UTF-8 decoding in pure nix. We
implement the decoding as a simple state machine which is fed one byte
at a time. Decoding whole strings is possible by subsequently calling
step. This is done in decode which uses builtins.foldl' to get around
recursion restrictions and a neat trick using builtins.deepSeq puck
showed me limiting the size of the thunks in a foldl' (which can also
cause a stack overflow).

This makes decoding arbitrarily large UTF-8 files into codepoints using
nix theoretically possible, but it is not really practical: Decoding a
36KB LaTeX file I had lying around takes ~160s on my laptop.

Change-Id: Iab8c973dac89074ec280b4880a7408e0b3d19bc7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/2590
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
2021-03-05 11:07:41 +00:00