docs(corp/data-import): document OpenRussian format

This is the second dataset I want to integrate as it contains some
more practically useful, but somewhat less structured, information.

Change-Id: Ib46b2597a33e76f59e030f889a0961ecc5a144eb
Reviewed-on: https://cl.tvl.fyi/c/depot/+/7873
Tested-by: BuildkiteCI
Autosubmit: tazjin <tazjin@tvl.su>
Reviewed-by: tazjin <tazjin@tvl.su>
parent db26825eec
commit 0dfe460fbb
1 changed file with 53 additions and 4 deletions
@@ -1,10 +1,10 @@
-//! This program imports Russian language data from OpenCorpora
-//! ("Открытый корпус") into a SQLite database that can be used for
-//! [//corp/russian][corp-russian] projects.
+//! This program imports Russian language data from OpenCorpora ("Открытый
+//! корпус") and OpenRussian into a SQLite database that can be used
+//! for [//corp/russian][corp-russian] projects.
 //!
 //! [corp-russian]: https://at.tvl.fyi/?q=%2F%2Fcorp%2Frussian
 //!
-//! Ideally, running this on an OpenCorpora dump should yield a fully
+//! Ideally, running this on intact dumps should yield a fully
 //! functional SQLite database compatible with all other tools
 //! consuming it.
 //!
@@ -53,6 +53,55 @@
 //!
 //! For example, a relationship `cardinal/ordinal` might be established
 //! between the lemmas "два" and "второй".
+//!
+//! ## OpenRussian format
+//!
+//! The [OpenRussian](https://en.openrussian.org/dictionary) project
+//! lets users export its database as a set of CSV files. For our
+//! purposes, we download the files using `<tab>` separators.
+//!
+//! Whereas OpenCorpora opts for a flat structure with a "tag" system
+//! (through its flexible grammemes), OpenRussian has a fixed, predefined
+//! structure into which it sorts some words with their morphologies.
+//! The OpenRussian database is much smaller as of January 2023 (~1.7
+//! million words vs. >5 million for OpenCorpora), but some of the
+//! information is much more practically useful.
+//!
+//! Two very important bits of information OpenRussian has are accent
+//! marks (most tables containing actual words have a normal form
+//! containing an accent mark, and a "bare" form without) and
+//! translations into English and German.
+//!
+//! The full dump includes the following tables (and some more):
+//!
+//! * `words`: List of lemmas in the corpus, with various bits of
+//! metadata as well as hand-written notes.
+//!
+//! * `adjectives`: Contains IDs for words that are adjectives.
+//!
+//! * `nouns`: IDs for words that are nouns, plus noun metadata (e.g.
+//! gender, declinability).
+//!
+//! * `verbs`: IDs of words that are verbs, including their aspect and
+//! the "partnered" verb in the other aspect.
+//!
+//! * `words_forms`: Contains all inflected forms of the lemmas from
+//! `words`, including information about their grammemes and accent
+//! marks.
+//!
+//! * `words_rels`: Relations between words, such as "synonyms" or a
+//! more general relation between two words.
+//!
+//! * `translations`: Contains translations tagged by target language,
+//! as well as examples and (occasionally) additional information.
+//!
+//! These tables also contain data, but have not been analysed
+//! yet:
+//!
+//! * `expressions_words`
+//! * `sentences`
+//! * `sentences_translations`
+//! * `sentences_words`
 
 use log::{error, info};
 use rusqlite::{Connection, Result};
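
The doc comment in this change describes the OpenRussian export as tab-separated files in which word tables carry both an accented and a "bare" form. As a rough illustration only, the sketch below reads one such row and derives the bare form from the accented one. The `WordRow` struct, the column order (`id`, `bare`, `accented`), and the way the accent is encoded (an apostrophe or U+0301 after the stressed vowel) are all assumptions made for this example; none of them is taken from the importer in this commit or from the documented OpenRussian schema.

```rust
/// One row of a hypothetical `words` export. The column names and their
/// order are assumptions for illustration, not the real OpenRussian schema.
#[derive(Debug, PartialEq)]
struct WordRow {
    id: u64,
    bare: String,
    accented: String,
}

/// Parses one tab-separated line into a `WordRow`.
/// Returns `None` if a column is missing or the ID is not a number.
fn parse_word_row(line: &str) -> Option<WordRow> {
    let mut cols = line.split('\t');
    let id = cols.next()?.trim().parse::<u64>().ok()?;
    let bare = cols.next()?.to_string();
    let accented = cols.next()?.to_string();
    Some(WordRow { id, bare, accented })
}

/// Removes stress marks from an accented form.
/// Assumption: the mark is either an apostrophe or U+0301 (combining acute
/// accent) placed after the stressed vowel.
fn strip_accent(accented: &str) -> String {
    accented
        .chars()
        .filter(|c| *c != '\'' && *c != '\u{301}')
        .collect()
}
```

Calling `parse_word_row` on each line of a downloaded file and comparing `strip_accent(&row.accented)` with `row.bare` would be one way to sanity-check a dump before importing it, provided the assumed encoding matches the real data.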