FreEM LPM

- WARNING: This repository replaces [LEM17](https://github.com/e-ditiones/LEM17), which is no longer maintained.

FreEM LPM (Lemmas, POS, Morphology): linguistically annotated corpora of modern French (16th-18th c.).

For more information about FreEM corpora, cf. our website.

«Sisyphe portant CornMol» (Titian, Prado Museum, Madrid, Spain, Source: Wikipedia).

Data

We provide:

Several authority lists, two deriving from LGeRM.

One list contains only proper nouns (proper), with the latest additions at the end.
One list contains all the other lemmas (authority), with the latest additions at the end.
One list contains all the foreign words (foreign), with the latest additions at the end.
Each file has a _processed version with all the entries in alphabetical order, after checking that no entry appears twice.
In addition to these three files, numbers contains Latin and Arabic numerals, and alphabet contains single Latin letters.
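The _processed step described above (remove duplicate entries, then sort alphabetically) can be sketched as follows. This is only an illustration, not the script used to build the files in this repository: it assumes one entry per line, the function name process_entries is hypothetical, and the sort is a plain code-point sort, which may order accented letters differently from a French collation.

```python
def process_entries(lines):
    """Return the unique, non-empty entries of `lines` in sorted order."""
    # A set drops duplicate entries; blank lines are skipped.
    entries = {line.strip() for line in lines if line.strip()}
    # sorted() compares by Unicode code point, not by French collation rules.
    return sorted(entries)


# Hypothetical input: latest entries appended at the end, one duplicate.
raw = ["vertu", "aimer", "vertu", "", "roi"]
processed = process_entries(raw)
print(processed)
```

To apply it to a file, the lines could be read with open(path, encoding="utf-8") and the result written back one entry per line.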

Training data:

CornMol is a gold corpus to be published
FranText is a corpus taken from the open data of FranText and aligned on our lemmatisation standards.
presto_gold is a gold corpus used by the Presto project to train their TreeTagger model, converted to CATTEX and lightly corrected to match our authority lists.
presto_max contains all the modern (16th-18th c.) texts of the Presto project, with heavily corrected lemmas. Each round of annotation/correction is numbered (v2, v3…).

Out-of-domain testing data for 16th, 17th, 18th, 19th and 20th c. French

For historical reasons, the data are split into theatrical and non-theatrical texts.
The same data exist in two versions, normalised and original (only the 16th, 17th and 18th c. texts differ; the 19th and 20th c. texts are identical in both).

The Models folder contains all the models produced with our data.

|-Authority_list
  |-authority_processed
  |-authority
  |-propres_processed
  |-propres
  |-foreign
|-Data
  |-CornMol_gold
  |-FranText
  |-presto_max
  |-presto_gold
|-Data_outOfDomain
  |-Data_outOfDomain_normalised
    |-theatre_normalised
    |-varia_normalised
  |-Data_outOfDomain_original
    |-theatre_original
    |-varia_original
|-Models
  |-train_1
  |-train_2
    |-Models
      |-lemma.tar
      |-pos.tar

Use the lemmatiser

To use the model:

Create a virtual environment (virtualenv env) and activate it (source env/bin/activate)
Install Pie-extended: pip install pie-extended
Download the freem model: pie-extended download freem
Use the freem model: pie-extended tag freem your_file.txt

Do note that pie-extended includes a tokeniser dedicated to (early-)modern French.
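Once a text has been tagged, the annotations can be loaded for further processing. The sketch below is a minimal example under stated assumptions: it supposes the tagger output is a tab-separated table with a header row, and the column names token, lemma and POS are assumptions to check against your own output file. The sample rows are hypothetical and only stand in for a real file.

```python
import csv
import io

# Hypothetical sample standing in for a tagged file; real output may
# use different column names or contain additional columns.
SAMPLE = "token\tlemma\tPOS\nLe\tle\tDETdef\nroy\troi\tNOMcom\n"


def read_annotations(handle):
    """Read (token, lemma, POS) triples from a tab-separated table with a header."""
    # QUOTE_NONE keeps quotation marks in the text from being treated as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["token"], row["lemma"], row["POS"]) for row in reader]


rows = read_annotations(io.StringIO(SAMPLE))
for token, lemma, pos in rows:
    print(f"{token}\t{lemma}\t{pos}")
```

For a real file, replace io.StringIO(SAMPLE) with open(path, encoding="utf-8", newline="").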

Warnings

The morphology is provided but has not been carefully proofread.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Presto and LGeRM data are licensed under a Creative Commons Attribution 4.0 International Licence.

Contribute

If you want to contribute, you can clone the repository and send us a pull request, or send an email to simon.gabay[at]unige.ch.

Cite this repository

@software{gabay_simon_2022_6481300,
  author       = {Gabay, Simon and
                  Clérice, Thibault and
                  Gille Levenson, Matthias and
                  Camps, Jean-Baptiste and
                  Tanguy, Jean-Baptiste},
  title        = {{FreEM-corpora/FreEMlpm: FreEM LPM (Lemma, POS-
                   tags, Morphology) corpus}},
  month        = apr,
  year         = 2022,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {4.0.1},
  doi          = {10.5281/zenodo.6481300},
  url          = {https://doi.org/10.5281/zenodo.6481300}
}

@article{jdmdh:7161,
  TITLE = {{Corpus and Models for Lemmatisation and POS-tagging of Classical French
  Theatre}},
  AUTHOR = {Jean-Baptiste Camps and Simon Gabay and Paul Fièvre and Thibault Clérice and Florian Cafiero},
  URL = {https://jdmdh.episciences.org/7161},
  DOI = {10.46298/jdmdh.6485},
  JOURNAL = {{Journal of Data Mining \& Digital Humanities}},
  VOLUME = {{2021}},
  YEAR = {2021},
  MONTH = Feb,
  KEYWORDS = {Computer Science - Computation and Language},
}

@inproceedings{gabay:hal-03018381,
  TITLE = {{Standardizing linguistic data: method and tools for annotating (pre-orthographic) French}},
  AUTHOR = {Gabay, Simon and Cl{\'e}rice, Thibault and Camps, Jean-Baptiste and Tanguy, Jean-Baptiste and Gille-Levenson, Matthias},
  URL = {https://hal.archives-ouvertes.fr/hal-03018381},
  BOOKTITLE = {{Proceedings of the 2nd International Digital Tools \& Uses Congress (DTUC '20)}},
  ADDRESS = {Hammamet, Tunisia},
  YEAR = {2020},
  MONTH = Oct,
  DOI = {10.1145/3423603.3423996},
  KEYWORDS = {linguistic annotation ; pre-orthographic language ; lemmatisation ; POS-tagging ; Lemmatisation ; Etiquetage morpho-syntaxique ; POStagging ; Lemmatisation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03018381/file/Lemmatisation.pdf},
  HAL_ID = {hal-03018381},
  HAL_VERSION = {v1},
}

Please keep me posted if you use this data!

Contact

simon.gabay[at]unige.ch
