Corpus for a language model of (early) modern French

FreEM-corpora/FreEMmax_OA

FreEM max OA

A Large Corpus for Early modern French - open access

For more information about FreEM corpora, cf. our website.

Description

This repo contains documents from very different sources (Wikipedia, scraping, XML…).

  • Documents found online or given by colleagues are stored in the 0_source folder. Those given in .doc / .txt format or found online are loosely encoded in TEI.
  • New teiHeader elements, containing limited but highly structured information, are available in the 1_header folder. These headers are used to generate the table of contents.
  • The final corpus can be generated with python3 build.py. For legal reasons, we are not allowed to modify the files ourselves, so we provide a script that modifies them. This script creates:
  1. a new version of the transcriptions with minimal TEI encoding, adapted to a dedicated ODD/schema;
  2. cleaned .txt files.

After running the script, the repository contains the following data:

 |-0_source
  |-(.*).xml
  |-(.*).xml
  |-(.*).xml
  |- ...
 |-1_header
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |- ...
 |-2_TEI
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |- ...
 |-3_TXT
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |- ...
 |-ODD
   |-ODD_clean.ODD
   |-out
     |-ODD_clean.rng
 |-scripts
   |-build.xsl.xsl
   |-make_TOC.xsl
   |-1to2.xsl
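
As a quick sanity check after running the build, the layout above can be verified with a few lines of Python. This is only a sketch under an assumption that the README does not state explicitly: that each header in 1_header shares its file stem with the corresponding file in 2_TEI and 3_TXT. The function name missing_outputs is hypothetical and is not part of the repository's scripts.

```python
from pathlib import Path


def missing_outputs(repo_root):
    """List header stems that have no counterpart in 2_TEI or 3_TXT.

    Assumption (not confirmed by the README): files in 1_header, 2_TEI
    and 3_TXT that belong together share the same stem.
    """
    root = Path(repo_root)
    headers = {p.stem for p in (root / "1_header").glob("*_dAlembert.xml")}
    tei = {p.stem for p in (root / "2_TEI").glob("*_dAlembert.xml")}
    txt = {p.stem for p in (root / "3_TXT").glob("*_dAlembert.txt")}
    return {
        "2_TEI": sorted(headers - tei),
        "3_TXT": sorted(headers - txt),
    }
```

Calling missing_outputs(".") from the repository root returns a dictionary with one list per output folder; two empty lists mean that every header has both a TEI and a TXT counterpart.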

Table of contents

A list of the files is available here.

Warning

This corpus is the open access version of the FreEM max corpus. Some (important) corpora have been withdrawn from the available data.

Licences

Licences vary from one file and one project to another. Please pay attention to the <licence> element in the <teiHeader>.
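
Because the licence has to be checked file by file, a small script can help. The sketch below reads the first <licence> element of a TEI file using Python's standard library. It assumes that the files use the standard TEI namespace (http://www.tei-c.org/ns/1.0) and that the licence may carry a target attribute; the function names licence_of and report_licences are hypothetical and are not part of this repository.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used by the corpus files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def licence_of(xml_data):
    """Return (target, text) of the first <licence> element, or None."""
    root = ET.fromstring(xml_data)
    licence = root.find(".//tei:licence", TEI_NS)
    if licence is None:
        return None
    # Collapse internal whitespace so multi-line licence text stays readable.
    text = " ".join("".join(licence.itertext()).split())
    return licence.get("target"), text


def report_licences(folder):
    """Print the licence found in every .xml file of a folder."""
    for path in sorted(Path(folder).glob("*.xml")):
        print(path.name, licence_of(path.read_bytes()))
```

For example, report_licences("2_TEI") prints one line per generated file; files for which None is reported have no <licence> element under the assumed namespace and should be checked by hand.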

Cite this repository

@software{gabay_simon_2022_6481135,
  author       = {Gabay, Simon and
                  Bartz, Alexandre and
                  Gambette, Philippe and
                  Chagué, Alix},
  title        = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large
                   Corpus for Early modern French - Open access
                   version}},
  month        = apr,
  year         = 2022,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.6481135},
  url          = {https://doi.org/10.5281/zenodo.6481135}
}
@inproceedings{gabay:hal-03596653,
  TITLE = {{From FreEM to D'AlemBERT}},
  AUTHOR = {Gabay, Simon and Ortiz Suarez, Pedro and Bartz, Alexandre and Chagu{\'e}, Alix and Bawden, Rachel and Gambette, Philippe and Sagot, Beno{\^i}t},
  URL = {https://hal.inria.fr/hal-03596653},
  NOTE = {8 pages, 2 figures, 4 tables},
  BOOKTITLE = {{Proceedings of the 13th Language Resources and Evaluation Conference}},
  ADDRESS = {Marseille, France},
  ORGANIZATION = {{European Language Resources Association}},
  YEAR = {2022},
  MONTH = Jun,
  HAL_ID = {hal-03596653},
  HAL_VERSION = {v1},
}

Please keep me posted if you use this data!

Contact

simon.gabay[at]unige.ch