FreEM LPM


- WARNING: This repository replaces [LEM17](https://github.com/e-ditiones/LEM17), which is no longer maintained.

FreEM LPM (Lemmas, POS, Morphology): linguistically annotated corpora of modern French (16th-18th c.).

For more information about FreEM corpora, cf. our website.


«Sisyphe portant CornMol» (Titian, Prado Museum, Madrid, Spain, Source: Wikipedia).

Data

We provide:

  1. Several authority lists, two of them derived from LGeRM.
  • One list contains only proper nouns (proper), with the latest entries added at the end.
  • One list contains all the other lemmas (authority), with the latest entries added at the end.
  • One list contains all the foreign words (foreign), with the latest entries added at the end.
  • Each file has a _processed version with all the entries in alphabetical order, after checking that no entry appears twice.
  • In addition to these three files, numbers contains Latin and Arabic numerals and alphabet contains single Latin letters.
  2. Training data:
  • CornMol is a gold corpus, to be published.
  • FranText is a corpus taken from the open data of FranText and aligned with our lemmatisation standards.
  • presto_gold is a gold corpus used by the Presto project to train their TreeTagger model, converted to CATTEX and lightly corrected to match our authority lists.
  • presto_max contains all the modern (16th-18th c.) texts of the Presto project, with heavily corrected lemmas. Each round of annotation/correction is numbered (v2, v3…).
  3. Out-of-domain testing data for 16th, 17th, 18th, 19th and 20th c. French:
  • For historical reasons, the data are split into theatrical and non-theatrical texts.
  • The same data exist in two versions, normalised and original (the 19th and 20th c. texts are identical in both; only the 16th, 17th and 18th c. texts differ).
  4. The Models folder contains all the models produced with our data.
|-Authority_list
  |-authority_processed
  |-authority
  |-propres_processed
  |-propres
  |-foreign
|-Data
  |-CornMol_gold
  |-FranText
  |-presto_max
  |-presto_gold
|-Data_outOfDomain
  |-Data_outOfDomain_normalised
    |-theatre_normalised
    |-varia_normalised
  |-Data_outOfDomain_original
    |-theatre_original
    |-varia_original
|-Models
  |-train_1
  |-train_2
    |-Models
      |-lemma.tar
      |-pos.tar
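
As an illustration of how a _processed list relates to its source list, here is a minimal Python sketch. It assumes each authority file holds one entry per line, which is not stated in this README, and the function name `process_entries` is hypothetical. It is not the script used to build the files in this repository.

```python
from typing import Iterable, List


def process_entries(entries: Iterable[str]) -> List[str]:
    """Return the entries deduplicated and sorted alphabetically.

    Assumption: one entry per line; blank lines are ignored.
    """
    # Strip surrounding whitespace so that "avoir " and "avoir" count as one entry
    cleaned = (entry.strip() for entry in entries)
    # A set keeps each entry once; empty strings are dropped
    unique = {entry for entry in cleaned if entry}
    # Plain code-point order; a locale-aware collation may suit accented lemmas better
    return sorted(unique)


if __name__ == "__main__":
    raw = ["venir", "avoir", "aller", "avoir", ""]
    for entry in process_entries(raw):
        print(entry)
```

Sorting by code point places accented initials after unaccented ones, so a collation-aware sort may be needed if the order matters for French.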

Use the lemmatiser

To use the model,

  1. Create a virtual environment (virtualenv env) and activate it (source env/bin/activate)
  2. Install Pie-extended: pip install pie-extended
  3. Download the freem model: pie-extended download freem
  4. Use the freem model: pie-extended tag freem your_file.txt

Do note that pie-extended includes a tokeniser dedicated to (early-)modern French.
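
Once a file has been tagged, the annotations can be loaded for further processing. The sketch below assumes the tagger writes a tab-separated file with a header row. The column names in the sample (token, lemma, POS) and the function name `read_tagged` are illustrative assumptions, so check them against the file you actually obtain.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_tagged(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated annotations into one dict per token.

    Assumption: the first line is a header naming the columns.
    """
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]


if __name__ == "__main__":
    # Hypothetical sample standing in for the content of a tagged file
    sample = "token\tlemma\tPOS\nLes\tle\tDETdef\nhommes\thomme\tNOMcom\n"
    for row in read_tagged(io.StringIO(sample)):
        print(row["token"], row["lemma"], row["POS"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, encoding="utf-8", newline="")`.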

Warnings

The morphology is provided but has not been carefully proofread.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Presto and LGeRM data are licensed under a Creative Commons Attribution 4.0 International Licence.

Contribute

If you want to contribute, you can do so by cloning the repository and sending us a pull request, or by sending an email to simon.gabay[at]unige.ch.

Cite this repository

@software{gabay_simon_2022_6481300,
  author       = {Gabay, Simon and
                  Clérice, Thibault and
                  Gille Levenson, Matthias and
                  Camps, Jean-Baptiste and
                  Tanguy, Jean-Baptiste},
  title        = {{FreEM-corpora/FreEMlpm: FreEM LPM (Lemma, POS-
                   tags, Morphology) corpus}},
  month        = apr,
  year         = 2022,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {4.0.1},
  doi          = {10.5281/zenodo.6481300},
  url          = {https://doi.org/10.5281/zenodo.6481300}
}
@article{jdmdh:7161,
  TITLE = {{Corpus and Models for Lemmatisation and POS-tagging of Classical French
  Theatre}},
  AUTHOR = {Jean-Baptiste Camps and Simon Gabay and Paul Fièvre and Thibault Clérice and Florian Cafiero},
  URL = {https://jdmdh.episciences.org/7161},
  DOI = {10.46298/jdmdh.6485},
  JOURNAL = {{Journal of Data Mining \& Digital Humanities}},
  VOLUME = {{2021}},
  YEAR = {2021},
  MONTH = Feb,
  KEYWORDS = {Computer Science - Computation and Language},
}
@inproceedings{gabay:hal-03018381,
  TITLE = {{Standardizing linguistic data: method and tools for annotating (pre-orthographic) French}},
  AUTHOR = {Gabay, Simon and Cl{\'e}rice, Thibault and Camps, Jean-Baptiste and Tanguy, Jean-Baptiste and Gille-Levenson, Matthias},
  URL = {https://hal.archives-ouvertes.fr/hal-03018381},
  BOOKTITLE = {{Proceedings of the 2nd International Digital Tools \& Uses Congress (DTUC '20)}},
  ADDRESS = {Hammamet, Tunisia},
  YEAR = {2020},
  MONTH = Oct,
  DOI = {10.1145/3423603.3423996},
  KEYWORDS = {linguistic annotation ; pre-orthographic language ; lemmatisation ; POS-tagging ; Lemmatisation ; Etiquetage morpho-syntaxique ; POStagging ; Lemmatisation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03018381/file/Lemmatisation.pdf},
  HAL_ID = {hal-03018381},
  HAL_VERSION = {v1},
}

Please keep me posted if you use this data!

Contact

simon.gabay[at]unige.ch