feat(users/Profpatsch/blog): Idealized Conflang blogpost
Change-Id: Iec01bc2a1a5ed5d21627ba1f9554ec4b3a8fc4ad
Reviewed-on: https://cl.tvl.fyi/c/depot/+/5480
Tested-by: BuildkiteCI
Reviewed-by: Profpatsch <mail@profpatsch.de>

parent ea871fdcd7
commit 535ad8732a

2 changed files with 311 additions and 4 deletions
users/Profpatsch/blog/default.nix

@@ -26,11 +26,11 @@ let
   # /notes/*
   notes = [
     {
-      route = [ "notes" "preventing-oom" ];
-      name = "Preventing out-of-memory (OOM) errors on Linux";
+      route = [ "notes" "an-idealized-conflang" ];
+      name = "An Idealized Configuration Language";
       page = { cssFile }: markdownToHtml {
-        name = "preventing-oom";
-        markdown = ./notes/preventing-oom.md;
+        name = "an-idealized-conflang";
+        markdown = ./notes/an-idealized-conflang.md;
         inherit cssFile;
       };
     }
@@ -43,6 +43,15 @@ let
         inherit cssFile;
       };
     }
+    {
+      route = [ "notes" "preventing-oom" ];
+      name = "Preventing out-of-memory (OOM) errors on Linux";
+      page = { cssFile }: markdownToHtml {
+        name = "preventing-oom";
+        markdown = ./notes/preventing-oom.md;
+        inherit cssFile;
+      };
+    }
   ];

   projects = [
users/Profpatsch/blog/notes/an-idealized-conflang.md (new file, 298 lines)

@@ -0,0 +1,298 @@
tags: netencode, json
date: 2022-03-31
certainty: likely
status: initial
title: An idealized Configuration Language

# An Idealized Configuration Language

JSON brought us one step closer to what an idealized configuration language is,
which I define as “data, stripped of all externalities of the system it is working in”.

Specifically, JSON is very close to what I consider the minimal set of properties needed to represent structured data.

## A short history, according to me

In the beginning, Lisp defined s-expressions as a stand-in for an actual syntax.
Then, people figured out that they are also a way to represent structured data.
S-expressions have scalars, which can be nested into lists, recursively.

```
(this is (a (list) (of lists)))
```

This provides the first three rules of our idealized language:

1. A **scalar** is a primitive value that is domain-specific.
   We can assume a bunch of bytes here, or a piece of text, or an integer.

2. A **list** gives an ordering to `0..n` (or `1..n`) values.

3. Both a scalar and a list are the *same kind* of “thing” (from here on called **value**);
   lists can be created from arbitrary values *recursively*
   (for example scalars, or lists of scalars and other lists).
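
As a rough illustration, these three rules already fit into a single recursive data type.
Here is a minimal sketch in Rust (the names are mine, and `Vec<u8>` is just an assumption standing in for “a bunch of bytes”):

```
/// Rules 1–3: a value is either a scalar or a list of values, recursively.
enum Value {
    Scalar(Vec<u8>),  // 1. a domain-specific primitive
    List(Vec<Value>), // 2. and 3. an ordered list of arbitrary values
}
```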

Later, ASN.1 came along and had the important insight that the same idealized data structure
can be represented in different fashions,
for example as a binary-efficient version and a human-readable format.

Then, XML “graced” the world for a decade or two, and the main lesson from it was
that you don’t want to mix markup languages and configuration languages,
and that you don’t want a committee to design these things.

---

In the meantime, Brendan Eich designed Javascript. Its prototype-based object system
arguably stripped down the rituals of existing OO-systems.
Douglas Crockford later extracted the object format (minus functions) into a syntax, and we got JSON.

```
{
  "foo": [
    { "nested": "attrs" },
    "some text"
  ],
  "bar": 42
}
```

JSON adds another fundamental idea into the mix:

4. **Records** are unordered collections of `name`/`value` pairs.
   A `name` is defined to be a unicode string, that is, a semantic descriptor of the nested `value`.

Unfortunately, the JSON syntax does not actually specify any semantics for records (`objects` in JSON lingo);
in particular it does not say what it means if a `name` appears twice in one record.

If records can have multiple entries with the same `name`, suddenly ordering becomes important!
But wait, remember that earlier we defined *lists* to be the way to impose an ordering on values.
So in order to rectify that problem, we say that

5. A `name` can only appear in a record *once*; names must be unique.
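
Continuing the hypothetical Rust sketch from above, the JSON model adds a record case.
Using a map keyed by `name` enforces rule 5 (unique names) and forgets any insertion order:

```
use std::collections::BTreeMap;

/// Rules 1–5: scalars, lists, and records with unique names.
enum Value {
    Scalar(Vec<u8>),
    List(Vec<Value>),
    Record(BTreeMap<String, Value>), // 4. and 5. name → value, names unique
}
```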

This is the current state of the programming community at large,
where most “modern” configuration languages basically use a version of the JSON model
as their underlying data structure. (However, not all of them use the same version.)

## Improving JSON’s data model

We are not yet at the final “idealized” configuration language, though.

Modern languages like Standard ML define their data types as a mixture of

* *records* (“structs” in the C lingo)
* and *sums* (which you can think of as enums that can hold more `value`s inside them)

This allows us to express the common pattern where some fields in a record are only meaningful
if another field—the so-called `tag` field—is set to a specific value.

An easy example: a request that can fail with an error message or succeed with a result.

You could model that as

```
{
  "was_error": true,
  "error_message": "there was an error"
}
```

or

```
{
  "was_error": false,
  "result": 42
}
```

in your JSON representation.

But in an ML-like language (like, for example, Rust), you would instead model it as

```
enum RequestResult {
    Error { error_message: String },
    Success { result: i64 },
}
```

where the distinction between `Error` and `Success` makes it clear that `error_message` and `result`
only exist in one of these cases, not the other.

We *can* encode exactly that idea into JSON in multiple ways, but there is no “blessed” way.

For example, another way to encode the above would be

```
{
  "Error": {
    "error_message": "there was an error"
  }
}
```

and

```
{
  "Success": {
    "result": 42
  }
}
```

Particularly notice the difference between the language representation, where the type is “closed” (only `Success` or `Error` can happen),
and the data representation, where the type is “open” (more cases could potentially exist).

This is an important difference from a type system:
our idealized configuration language just gives more structure to a bag of data,
it does not restrict which value can be where.
Think of a value in a unityped language, like Python.

So far we have the notion of

1. a scalar (a primitive)
2. a list (ordering on values)
3. a record (an unordered collection of named values)

and in order to get the “open” `tag`ged enumeration values, we introduce

4. a `tag`, which gives a name to a value

We can then redefine `record` to mean “an unordered collection of `tag`ged values”,
which further reduces the number of concepts needed.

And that’s it, this is the full idealized configuration language.
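
To make that concrete, here is the hypothetical Rust sketch again, extended with the `tag` case and with `Record` expressed as a collection of tagged values (all names are still my own, not part of any spec):

```
use std::collections::BTreeMap;

/// The full idealized model: scalars, lists, tags, and records of tagged values.
enum Value {
    Scalar(Vec<u8>),
    List(Vec<Value>),
    Tag(String, Box<Value>),         // a single name attached to a single value
    Record(BTreeMap<String, Value>), // unordered collection of tagged values (the key is the tag)
}
```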

## Some examples of data modelling with tags

This is all well and good, but what does it look like in practice?

For these examples I will be using JSON with a new `< "tag": value >` syntax
to represent `tag`s.

From a compatibility standpoint, `tag`s (or sum types) have properties dual to those of record types.

With a record, when a producer *adds* a field, the consumer will still be able to handle the record (provided the semantics of the existing fields are not changed by the new field).

With a tag, when a producer *removes* a tag, the consumer will still be able to handle the value. It might do one “dead” check on the removed `tag`, but it can still handle the remaining ones just fine.
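
A rough sketch of that duality, assuming `serde_json` for the JSON handling (the field and tag names are made up):

```
use serde_json::Value;

/// A record consumer that only reads the fields it knows about keeps working
/// when the producer *adds* a field.
fn read_record(record: &Value) -> Option<i64> {
    record.get("result")?.as_i64()
}

/// A tag consumer keeps working when the producer *removes* a tag:
/// the corresponding match arm simply becomes dead code.
fn read_tagged(tag: &str, value: &Value) -> String {
    match tag {
        "Success" => format!("result: {}", value["result"]),
        "Error" => format!("error: {}", value["error_message"]), // dead if "Error" is never produced anymore
        other => format!("unhandled tag: {}", other),
    }
}
```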

<!-- TODO: some illustration here -->

An example of how that is applied in practice: in `protobuf3`, the fields of a record are *always* optional fields.

We can model optional fields by wrapping them in `< "Some": value >` or `< "None": {} >` (where the actual value of the `None` is ignored or always an empty record).

So a protobuf with the fields `foo: int` and `bar: string` has to be parsed by the receiver as containing *four* possibilities:

| № | foo | bar |
|--:|-----|-----|
| 1 | `<"None":{}>` | `<"None":{}>` |
| 2 | `<"Some":42>` | `<"None":{}>` |
| 3 | `<"None":{}>` | `<"Some":"x">` |
| 4 | `<"Some":42>` | `<"Some":"x">` |

Now, iff the receiver actually handles all four possibilities
(and doesn’t just crash if a field is not set, as is customary in million-dollar-mistake languages),
it’s easy to see how removing a field from the producer is semantically equal to always setting it to `<"None":{}>`.
Since all receivers should be ready to receive `None` for every field, this provides a simple forward-compatibility scheme.
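
A hypothetical sketch of such a receiver in Rust, with `Option` playing the role of the `<"Some"|"None">` wrappers (the message type is made up):

```
/// A made-up message with two protobuf3-style optional fields.
struct Msg {
    foo: Option<i64>,
    bar: Option<String>,
}

/// Handles all four possibilities explicitly; a producer that stops sending
/// `foo` is indistinguishable from one that always sends `None`.
fn describe(msg: &Msg) -> String {
    match (&msg.foo, &msg.bar) {
        (None, None) => "neither field set".to_string(),
        (Some(foo), None) => format!("only foo = {}", foo),
        (None, Some(bar)) => format!("only bar = {:?}", bar),
        (Some(foo), Some(bar)) => format!("foo = {}, bar = {:?}", foo, bar),
    }
}
```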

We can abstract this to any kind of tag value:
if you start with “more” tags, you give yourself space to remove some of them later without breaking compatibility, which is typically called “forward compatibility”.

## To empty list/record or not to

Something to think about is whether records and lists should be defined
to always contain at least one element.

As it stands, JSON has multiple ways of expressing the “empty value”:

* `null`
* `[]`
* `{}`
* `""`
* *leave out the field*

and two of those come from the possibility of having empty structured values.

## Representations of this language

This line of thought originally fell out of me designing [`netencode`](https://code.tvl.fyi/tree/users/Profpatsch/netencode/README.md)
as a small human-debuggable format for pipeline serialization.

In addition to the concepts mentioned here (especially tags),
it provides a better set of scalars than JSON (specifically arbitrary bytestrings),
but it cannot practically be written or modified by hand,
which might be a good thing depending on how you look at it.

---

The way that is compatible with the rest of the ecosystem is probably to use a subset of JSON
to represent our idealized language.

There are multiple ways of encoding tags in JSON, each of which has its pros and cons.

The most common is probably the “tag field” variant, where the tag is pulled into the nested record:

```
{
  "_tag": "Success",
  "result": 42
}
```

This has the advantage that people know how to deal with it and that it’s easy to “just add another field”,
plus it is backward-compatible if you had a record in the first place.

It has multiple disadvantages, however:

* If your value wasn’t a record before (e.g. an int), you have to put it in a record and assign an arbitrary name to its field
* People are not forced to “unwrap” the tag first, so they are going to forget to check it
* The magic “_tag” name cannot be used by any of the record’s fields
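
For what it’s worth, this variant corresponds to what `serde` (in Rust) calls an internally tagged enum. A rough sketch, reusing the `RequestResult` type from above and assuming `serde`/`serde_json` as dependencies:

```
use serde::Deserialize;

/// Decodes e.g. `{"_tag": "Success", "result": 42}`.
#[derive(Deserialize)]
#[serde(tag = "_tag")]
enum RequestResult {
    Error { error_message: String },
    Success { result: i64 },
}
```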

An in-between version of this with fewer downsides is to always push another JSON record onto the stack:

```
{
  "tag": "Success",
  "value": {
    "result": 42
  }
}
```

This makes it harder for people to miss checking the `tag`, but still possible of course.
It also makes it easy to inspect the contents of `value` without knowing the
exhaustive list of `tag`s, which can be useful in practice (though often not sound!).
It also gets rid of the “_tag” field name clash problem.

Disadvantages:

* Breaks backwards compatibility with an existing record-based approach if you want to introduce `tag`s
* Verbosity of the representation
* Hard to distinguish a record that happens to have `tag` and `value` fields from a `tag`ged value (though you know the type layout of your data on a higher level, don’t you? ;) )
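
In `serde` terms this is the adjacently tagged representation; a minimal sketch under the same assumptions as above:

```
use serde::Deserialize;

/// Decodes e.g. `{"tag": "Success", "value": {"result": 42}}`.
#[derive(Deserialize)]
#[serde(tag = "tag", content = "value")]
enum RequestResult {
    Error { error_message: String },
    Success { result: i64 },
}
```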

The final, “most pure” representation is the one I gave in the original introduction:

```
{
  "Success": {
    "result": 42
  }
}
```

Now you *have* to match on the `tag` name first, before you can actually access your data,
and it’s less verbose than the above representation.

Disadvantages:

* You also have to *know* which `tag`s to expect; it’s harder to query, because you need to extract the keys and values from the dict and then take the first one.
* Doing a “tag backwards compat” check is harder,
  because you can’t just check whether `_tag` or `tag`/`value` are the keys in the dict.
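
Incidentally, this is also `serde`’s default (externally tagged) representation for enums. The “take the first one” dance from the first disadvantage could look roughly like this hypothetical helper, again assuming `serde_json`:

```
use serde_json::Value;

/// Unwraps the "pure" representation: a record with exactly one tag/value pair.
fn untag(v: &Value) -> Option<(&String, &Value)> {
    let record = v.as_object()?;
    if record.len() != 1 {
        return None; // not a tagged value (or ambiguous)
    }
    record.iter().next()
}
```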