Commit graph

20 commits

Author SHA1 Message Date
Vincent Ambo
0a27344953 chore(3p/sources): bump nixpkgs & overlays (2023-02-07)
Included fixes:

* //3p/overlays: tdlib override no longer needed (bump has landed upstream)
* //corp/{predlozhnik,tvixbolt}: bump wasm-bindgen to match nixpkgs

Home-manager has not been bumped, as it has introduced an
incompatibility with Nix 2.3.

Change-Id: I96ac3462b82c73db1ba23be03d7968f10abc9b53
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8033
Tested-by: BuildkiteCI
Reviewed-by: flokli <flokli@flokli.de>
Reviewed-by: sterni <sternenseemann@systemli.org>
2023-02-07 13:46:35 +00:00
Vincent Ambo
d05c380504 fix(corp/data-import): rank is an integer field
Change-Id: Ifc9cd46e5b5521096db19628bd8bcf026106dcc9
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7926
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-25 10:46:12 +00:00
Vincent Ambo
192dac5a74 feat(corp/data-import): map OR word types to sets of OC grammemes
Change-Id: I674f3a66fcd65314431a2ebd747e3830aa2dd7a1
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7924
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
2023-01-24 22:41:21 +00:00
Vincent Ambo
80723b708d feat(corp/data-import): map OC lemma grammemes to OR form types
Change-Id: Ie804d185269336b0d9fe417754e5e795918e65b8
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7923
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-24 22:41:19 +00:00
Vincent Ambo
8d594658ab feat(corp/data-import): map OC word grammemes to OR form types
This table maps the grammemes for individual word forms (*not* for
lemmata in either corpus!) to the corresponding grammemes from the
other dataset.

These have drastically different shapes, so the mapping is not
perfect, but will help in determining which forms are intended to be
the same on both sides.

Change-Id: Ib0717e2f7a79d96bcb5e955a20f551e391fcd759
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7918
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
Autosubmit: tazjin <tazjin@tvl.su>
2023-01-24 22:41:19 +00:00
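
The form-level mapping described in 8d594658ab can be pictured as a lookup from a set of OpenCorpora grammemes to one OpenRussian form type. This is only a minimal sketch: the table entries, the `OC_TO_OR` and `KNOWN` names, and the `map_form` helper are illustrative assumptions, not taken from the commit, and the real table is likely larger and less regular.

```python
from typing import Iterable, Optional

# Hypothetical excerpt of a mapping table. Keys are sets of OpenCorpora
# grammemes for one word form; values are OpenRussian-style form types.
# All entries are illustrative assumptions, not copied from the commit.
OC_TO_OR = {
    frozenset({"sing", "nomn"}): "ru_noun_sg_nom",
    frozenset({"sing", "gent"}): "ru_noun_sg_gen",
    frozenset({"plur", "nomn"}): "ru_noun_pl_nom",
    frozenset({"plur", "gent"}): "ru_noun_pl_gen",
}

# Grammemes that take part in the mapping; all others are ignored.
KNOWN = frozenset().union(*OC_TO_OR)


def map_form(grammemes: Iterable[str]) -> Optional[str]:
    """Return the mapped form type, or None if the form has no counterpart."""
    key = frozenset(grammemes) & KNOWN
    return OC_TO_OR.get(key)


if __name__ == "__main__":
    print(map_form(["inan", "sing", "gent"]))
    print(map_form(["plur", "loct"]))
```

Returning `None` for unmapped forms reflects the commit's remark that the mapping is not perfect: forms without a counterpart are simply left unmatched rather than forced into a wrong slot.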
Vincent Ambo
ed8dd4acd7 feat(corp/data-import): add import of OR 'translations' table
The original dataset contains translations into different languages,
but only the English ones are imported here.

Note that translations are for lemmata only.

Change-Id: Ifb9c32c25fda44c38ad899efca9d205c520c0fa3
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7895
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-22 16:13:09 +00:00
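
As a rough illustration of the filtering described in ed8dd4acd7, the sketch below keeps only English rows from a translations table. The CSV layout and the column names `lang`, `word_id` and `tl` are assumptions made for this example; the commit does not state the actual format.

```python
import csv
import io
from typing import List, Tuple


def english_translations(csv_text: str) -> List[Tuple[int, str]]:
    """Return (word_id, translation) pairs for rows whose language is 'en'.

    The column names used here are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (int(row["word_id"]), row["tl"])
        for row in reader
        if row["lang"] == "en"
    ]


if __name__ == "__main__":
    sample = (
        "id,lang,word_id,tl\n"
        "1,en,10,house\n"
        "2,de,10,Haus\n"
        "3,en,11,water\n"
    )
    for word_id, translation in english_translations(sample):
        print(word_id, translation)
```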
Vincent Ambo
8eeb5d3bcc feat(corp/data-import): add import of OR 'words_forms' table
This is the table holding the full set of morphological forms for all
the words from the lemmata table (which the dataset does not call by
that name).

Change-Id: I6f5be673c5f59f11e36bd8c8c935844a7d4fd170
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7894
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-21 17:49:33 +00:00
Vincent Ambo
429c0d00c4 feat(corp/data-import): add import of OpenRussian 'words' table
This is actually the lemmata table of this corpus, not the forms of
all words (they're in a separate table).

Change-Id: I89a2c2817ccce840f47406fa2a636f4ed3f49154
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7893
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-21 17:49:33 +00:00
Vincent Ambo
ee0c0ee951 chore(corp/data-import): make OR data archive available in env
Change-Id: Idacf42743051eae0cf7010f952a4f91af17ad708
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7892
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-21 17:49:33 +00:00
Vincent Ambo
0dfe460fbb docs(corp/data-import): document OpenRussian format
This is the second dataset I want to integrate as it contains some
more practically useful, but somewhat less structured, information.

Change-Id: Ib46b2597a33e76f59e030f889a0961ecc5a144eb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7873
Tested-by: BuildkiteCI
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 21:58:02 +00:00
Vincent Ambo
db26825eec chore(corp/data-import): namespace tables for OpenCorpora data
I'm changing strategy: both OC and another dataset will be imported
before the data is normalised any further, as the normalisation might
be easier to do in a set of table-constructing queries inside of
SQLite with all raw data in place.

Change-Id: I26b41af80586fc1bfd8e26a6be20579068a82507
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7872
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 21:58:02 +00:00
Vincent Ambo
c891833414 feat(corp/data-import): build morphology database in derivation
This makes the actual imported database of the ~whole Russian
language (all lemmas, grammemes, forms etc.) a Nix build target which
is built in CI.

This still needs schema normalisation (it's fairly directly mapped to
the raw data), but it's already starting to be a useful data set.

This also happens to be a pretty cool demonstration of the power of
Nix. You can do `nix-build -A corp.russian.data-import.database` and
out comes a perfectly valid SQLite database with a valid external data
import!

Change-Id: I5d6d15e67d0e4a7ff590fad06252be34f5d561fd
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7866
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 15:44:06 +00:00
Vincent Ambo
0ed6583edc feat(corp/data-import): let users specify output path
Change-Id: I61ad021c7a5318b099f3adc8bc6aedef65500974
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7865
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 15:44:06 +00:00
Vincent Ambo
476e312c06 feat(corp/data-import): parse and import links
Change-Id: Iebdbc8f884f28064d7b00b8f8808b5030fa3d05c
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7864
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 15:44:06 +00:00
Vincent Ambo
dc55ea3201 feat(corp/data-import): parse and import link types
Change-Id: Iae01d1dc6894117dc693b4690d8bc79861212ae6
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7863
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 15:44:06 +00:00
Vincent Ambo
3f0b1d8e0b fix(corp/data-import): commit the final transaction, too
Otherwise up to 1000 elements might be missing.

Change-Id: I20d6238424eec27f0e758e7737c9c31bcb81b23d
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7862
Tested-by: BuildkiteCI
Reviewed-by: tazjin <tazjin@tvl.su>
2023-01-18 15:44:06 +00:00
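
The bug fixed in 3f0b1d8e0b is a common one in batched imports: if a transaction is committed only when a batch fills up, the trailing partial batch is never written. The sketch below shows the pattern under stated assumptions. The batch size of 1000 is inferred from the commit message, while the `lemmas` table and the `insert_batched` helper are hypothetical and not taken from the tool itself.

```python
import sqlite3
from typing import Iterable, Tuple

BATCH_SIZE = 1000  # assumption, inferred from "up to 1000 elements"


def insert_batched(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[int, str]],
    batch_size: int = BATCH_SIZE,
) -> None:
    """Insert rows, committing after each full batch and once at the end."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lemmas (id INTEGER PRIMARY KEY, lemma TEXT)"
    )
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO lemmas (id, lemma) VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()
            pending = 0
    # Commit the final, possibly partial batch. Without this call the
    # rows inserted since the last full batch are rolled back on close.
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    insert_batched(connection, ((i, f"word{i}") for i in range(2500)))
    print(connection.execute("SELECT COUNT(*) FROM lemmas").fetchone()[0])
```

With 2500 input rows and a batch size of 1000, the last 500 rows depend entirely on the final `commit()`.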
Vincent Ambo
6986aa5824 feat(corp/data-import): insert OpenCorpora data into SQLite
This is an initial and kind of dumb table structure, but there's some
massaging that needs to be done before this makes more sense.

Change-Id: I441288b684ef86be507099bcc4ebf984598789c8
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7861
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 15:44:06 +00:00
Vincent Ambo
485c3cc912 feat(corp/data-import): parse lemmas from OpenCorpora dump
Change-Id: I1e4efcfc8e555f61578b563411d5e6ed9590d8e8
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7860
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 01:10:37 +00:00
Vincent Ambo
ee7616d956 feat(corp/russian/data-import): new OpenCorpora data import tool
Adds the beginning of a tool which can import OpenCorpora data into a
SQLite database. This is quite a lot of toil and there's probably a
better way to do this, but overall becoming this intimately familiar
with the data structures is quite helpful for understanding what I
can/can't do with only this dataset.

Change-Id: Ieab33a8ce07ea4ac87917b9c8132226bbc6523b1
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7859
Reviewed-by: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-18 01:10:37 +00:00
Vincent Ambo
aa96e25bbc chore(tazjin/predlozhnik): move to //corp
This is currently hosted by the company, and I'm assigning my
copyright to the company, which also runs an ad placement on the page.

Note that the NixOS module for hosting it has not been moved yet.

Change-Id: Iba9e1cab9370faa79e43c3344fbfbbbabead50b3
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7857
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
2023-01-17 18:23:52 +00:00