A Large Corpus for Early modern French - open access
For more information about FreEM corpora, cf. our website.
This repo contains documents from very different sources (Wikipedia, web scraping, XML…).
- Documents found online or given by colleagues are stored in the 0_source folder. Those given in .doc/.txt format or found online are loosely encoded in TEI.
- New teiHeader elements, with limited but highly structured information, are available in the 1_header folder. These headers are used to generate the table of contents.
- The final corpus can be generated with python3 build.py. For legal reasons, we are not allowed to modify the files ourselves, so we provide the script that modifies them. This script creates:
  - a new version of the transcriptions with a minimal TEI encoding, adapted to a dedicated ODD/schema;
  - cleaned .txt files.
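To give a rough idea of what moving from a minimal TEI transcription to a cleaned .txt file can involve, here is a minimal sketch. It is not the repository's build.py: the function name `tei_body_to_text`, the sample document, and the choice to keep only the content of the body with whitespace normalised are all assumptions made for illustration. It also assumes the files declare the standard TEI namespace.

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be the one declared in the corpus files).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_to_text(xml_string: str) -> str:
    """Return the whitespace-normalised text of the TEI body, ignoring the header.

    Hypothetical helper: the real pipeline may clean the text differently.
    """
    root = ET.fromstring(xml_string)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # Concatenate every text node under <body>, then collapse runs of whitespace.
    return re.sub(r"\s+", " ", "".join(body.itertext())).strip()


# Invented sample, for illustration only.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Premier   paragraphe.</p>
  <p>Second paragraphe.</p></body></text>
</TEI>"""

print(tei_body_to_text(sample))
```

The header is skipped on purpose, so that metadata such as the title does not leak into the plain text.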
After execution of the script, we obtain the following data:
|-0_source
   |-(.*).xml
   |-(.*).xml
   |-(.*).xml
   |- ...
|-1_header
   |-(.*)_dAlembert.xml
   |-(.*)_dAlembert.xml
   |-(.*)_dAlembert.xml
   |- ...
|-2_TEI
   |-(.*)_dAlembert.xml
   |-(.*)_dAlembert.xml
   |-(.*)_dAlembert.xml
   |- ...
|-3_TXT
   |-(.*)_dAlembert.txt
   |-(.*)_dAlembert.txt
   |-(.*)_dAlembert.txt
   |-(.*)_dAlembert.txt
   |- ...
|-ODD
   |-ODD_clean.ODD
   |-out
      |-ODD_clean.rng
|-scripts
   |-build.xsl.xsl
   |-make_TOC.xsl
   |-1to2.xsl
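A quick way to check that the build produced the folders listed in the tree above is to count the files in each of them. The sketch below is only a suggestion: the helper name `count_stage_files` is hypothetical, the folder names and extensions are taken from the tree, and the search is recursive on the assumption that some folders may contain subfolders.

```python
from pathlib import Path

# Folder names and file extensions as they appear in the tree above.
STAGES = {
    "0_source": ".xml",
    "1_header": ".xml",
    "2_TEI": ".xml",
    "3_TXT": ".txt",
}


def count_stage_files(repo_root: Path) -> dict:
    """Count the files with the expected extension in each stage folder.

    Hypothetical helper: a missing folder is reported as 0 rather than raising.
    """
    counts = {}
    for folder, suffix in STAGES.items():
        directory = repo_root / folder
        if directory.is_dir():
            counts[folder] = sum(1 for _ in directory.rglob(f"*{suffix}"))
        else:
            counts[folder] = 0
    return counts


if __name__ == "__main__":
    # Run from the root of the repository after python3 build.py.
    for folder, n in count_stage_files(Path(".")).items():
        print(f"{folder}: {n} file(s)")
```

If 2_TEI or 3_TXT report 0 files, the build script has probably not been run yet.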
A list of the files is available here.
This corpus is the open access version of the FreEM max corpus. Some (important) corpora have been withdrawn from the openly available data.
Licences vary from one file and one project to another. Please pay attention to the <licence> element in the <teiHeader>.
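Since the licence has to be checked file by file, it may be convenient to read it programmatically. The following is a minimal sketch, assuming the files declare the standard TEI namespace and that the licence may carry the TEI target attribute. The function name `read_licences` and the sample document (including its licence) are invented for illustration and do not describe any actual file of the corpus.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be the one declared in the corpus files).
TEI = "http://www.tei-c.org/ns/1.0"


def read_licences(xml_string: str) -> list:
    """Return (target, text) pairs for every <licence> element in a TEI document.

    Hypothetical helper: target is None when the attribute is absent.
    """
    root = ET.fromstring(xml_string)
    licences = []
    for licence in root.iter(f"{{{TEI}}}licence"):
        # Join all text nodes and collapse whitespace.
        text = " ".join("".join(licence.itertext()).split())
        licences.append((licence.get("target"), text))
    return licences


# Invented sample, for illustration only.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><publicationStmt><availability>
    <licence target="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</licence>
  </availability></publicationStmt></fileDesc></teiHeader>
  <text><body><p/></body></text>
</TEI>"""

for target, text in read_licences(sample):
    print(target, "|", text)
```

To apply it to a file on disk, read the file content first and pass it to the function. An empty result means that no licence element was found and the file should be checked by hand.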
@software{gabay_simon_2022_6481135,
author = {Gabay, Simon and
Bartz, Alexandre and
Gambette, Philippe and
Chagué, Alix},
title = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large
Corpus for Early modern French - Open access
version}},
month = apr,
year = 2022,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.6481135},
url = {https://doi.org/10.5281/zenodo.6481135}
}
@inproceedings{gabay:hal-03596653,
TITLE = {{From FreEM to D'AlemBERT}},
AUTHOR = {Gabay, Simon and Ortiz Suarez, Pedro and Bartz, Alexandre and Chagu{\'e}, Alix and Bawden, Rachel and Gambette, Philippe and Sagot, Beno{\^i}t},
URL = {https://hal.inria.fr/hal-03596653},
NOTE = {8 pages, 2 figures, 4 tables},
BOOKTITLE = {{Proceedings of the 13th Language Resources and Evaluation Conference}},
ADDRESS = {Marseille, France},
ORGANIZATION = {{European Language Resources Association}},
YEAR = {2022},
MONTH = Jun,
HAL_ID = {hal-03596653},
HAL_VERSION = {v1},
}
Please keep me posted if you use this data!
simon.gabay[at]unige.ch