Commit graph

7 commits

Author SHA1 Message Date
sterni
750ef6c693 feat(sterni/nix/utf8): check if codepoint valid/encodeable
* Enforce the U+0000 to U+10FFFF range in `count` and throw an error if
  the given codepoint exceeds the range (encoding U+0000 won't work of
  course, but this is Nix's fault…).

* Check if the produced bytes are well formed and output an error if
  not. This indicates that the codepoint can't be encoded as UTF-8, like
  U+D800 which is reserved for UTF-16.

Change-Id: I18336e527484580f28cbfe784d51718ee15c5477
2021-11-25 12:15:35 +01:00
sterni
0e9c770972 refactor(sterni/nix/utf8): let wellFormedByte check first byte
Previously we would check the first byte only when trying to figure out
the predicate for the second byte. If the first byte was invalid, we'd
then throw with a helpful error message. However this made
wellFormedByte a very weird function.

At the expense of doing the same check twice, we now check the first
byte, when it is first passed, and always return a boolean.

Change-Id: I32ab6051c844711849e5b4a115e2511b53682baa
2021-11-25 12:15:35 +01:00
sterni
87a0aaa77d feat(sterni/nix/utf8): implement UTF-8 encoding
This implementation is still a bit rough as it doesn't check if the
produced string is valid UTF-8 which may happen if an invalid Unicode
codepoint is passed.

Change-Id: Ibaa91dafa8937142ef704a175efe967b62e3ee7b
2021-11-25 12:15:35 +01:00
sterni
9370ea5e33 chore(sterni/nix/utf8): remove decodeSafe
This is not really used anywhere and kind of useless. A better
decodeSafe would never return null and instead make use of replacement
characters to represent invalid bytes in the input.

Change-Id: Ib4111529bf0e472dbfa720a5d0b939c2d2511de5
2021-11-25 12:15:35 +01:00
sterni
ab92c42f59 feat(sterni/nix/utf8): allow decoding the empty string
Change-Id: I8de9cd28c822ac5befbcd16e118440cd13cd86e9
2021-11-23 14:23:41 +01:00
sterni
8615322bc8 refactor(sterni/nix/utf8): use genericClosure for decoding iteration
builtins.genericClosure is a quite powerful (and undocumented) Nix
primop: It repeatedly applies a function to values it produces and
collects them into a list. Additionally individual results can be
identified via a key attribute.

Since genericClosure only ever creates a single list value internally,
we can eliminate a huge performance bottleneck when building a list in a
recursive algorithm: list concatenation. Because Nix needs to copy the
entire chunk of memory used internally to represent the list, building
big lists one element at a time grinds Nix to a halt.

After rewriting decode using genericClosure decoding the LaTeX source
of my 20 page term paper now takes 2s instead of 14min.

Change-Id: I33847e4e7dd95d7f4d78ac83eb0d74a9867bfe80
2021-11-23 14:22:24 +01:00
sterni
b810c46a45 feat(users/sterni/nix/utf8): pure nix utf-8 decoder
users.sterni.nix.utf8 implements UTF-8 decoding in pure nix. We
implement the decoding as a simple state machine which is fed one byte
at a time. Decoding whole strings is possible by subsequently calling
step. This is done in decode which uses builtins.foldl' to get around
recursion restrictions and a neat trick using builtins.deepSeq puck
showed me limiting the size of the thunks in a foldl' (which can also
cause a stack overflow).

This makes decoding arbitrarily large UTF-8 files into codepoints using
nix theoretically possible, but it is not really practical: Decoding a
36KB LaTeX file I had lying around takes ~160s on my laptop.

Change-Id: Iab8c973dac89074ec280b4880a7408e0b3d19bc7
Reviewed-on: https://cl.tvl.fyi/c/depot/+/2590
Tested-by: BuildkiteCI
Reviewed-by: sterni <sternenseemann@systemli.org>
2021-03-05 11:07:41 +00:00