Transcription Workflow
Moravian Lives Workflow: Transcription Desk to TEI-XML, then export (Diane Jakacki, 05/31/2018)
- As transcriptions are completed in Transcription Desk (TD), they are added to the "ML: Transcription Desk - Transcribed Status" spreadsheet: https://docs.google.com/spreadsheets/d/1vXQrEDowHRYyQ0U3YRECVxLIbidgsglk3qd-8-FhR4k/edit?usp=sharing
- Transcribed text is 'scraped' from the individual files in Mediawiki (admin only) and compiled in oXygen into a single P5 TEI-XML file that holds all memoir pages, with the ML teiHeader and schema. [links here]
  - Because the markup in the Transcription Desk raw view is written for HTML/browser viewing, all HTML tags must be stripped out.
  - 'Semantic' tags in TD are XML but not TEI-compliant, so they must be converted with find/replace.
  - Some tags and characters imposed by Mediawiki (e.g. some UTF-8 letters) must be removed or converted for the XML files to validate.
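The three clean-up steps above could be scripted rather than done by hand. The following is only a minimal Python sketch, not the project's actual procedure: the TD tag names in `TAG_MAP` (`person`, `place`), the list of HTML tags, and the treatment of `&nbsp;` are all assumptions made for illustration, since the real find/replace list is not recorded on this page.

```python
import re

# Hypothetical mapping from TD tag names to TEI element names.
# The project's real find/replace list is not given on this page.
TAG_MAP = {"person": "persName", "place": "placeName"}

# HTML tags assumed to be presentational only (illustrative list).
HTML_TAGS = ("br", "span", "div", "b", "i", "u", "font")

# Anything outside the character range that XML 1.0 allows.
_INVALID_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_html(text):
    """Remove opening, closing, and self-closing tags listed in HTML_TAGS."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(HTML_TAGS)
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def convert_semantic_tags(text):
    """Rename TD tags to their TEI equivalents, keeping any attributes."""
    for old, new in TAG_MAP.items():
        pattern = r"<(/?)%s\b([^>]*)>" % re.escape(old)
        text = re.sub(pattern, r"<\g<1>%s\g<2>>" % new, text)
    return text


def remove_invalid_chars(text):
    """Replace the HTML-only entity &nbsp; and drop characters XML 1.0 forbids."""
    text = text.replace("&nbsp;", " ")
    return _INVALID_XML.sub("", text)


def clean_page(raw):
    """Run the three clean-up steps in the order listed in the workflow."""
    return remove_invalid_chars(convert_semantic_tags(strip_html(raw)))
```

A page would be passed through `clean_page` before being pasted into the TEI file; the result should still be validated in oXygen, since a regex pass cannot guarantee well-formed XML.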
Memoir pages are structured so:
`<text type="memoir">`
`<body xml:lang="eng">`
`<div type="page" n="$">`
`<head>[NAME] memoir, page 1`
`<seg ana="LOCATOR from GITHUB"/>`
`<graphic url="http://moravianlives.bucknell.edu/##.jpg"/>`
`</head>`
`<p>`
`</p>`
`</div>`
`<!-- repeated for each page -->`
`</body>`
`</text>`
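As a rough illustration of how the page structure above repeats, here is a small Python sketch that builds the same element tree with the standard library. It is not part of the project's tooling. The function name `build_memoir_text`, the tuple layout of `pages`, and the reading of `n="$"` as the running page number are assumptions, and the sketch treats each page's transcription as plain text in a single `<p>`.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_memoir_text(name, pages):
    """Build a <text type="memoir"> element with one <div type="page"> per page.

    pages is a list of (locator, image_url, paragraph_text) tuples
    (a hypothetical layout chosen for this sketch).
    """
    text = ET.Element("text", {"type": "memoir"})
    body = ET.SubElement(text, "body", {"{%s}lang" % XML_NS: "eng"})
    for n, (locator, image_url, paragraph) in enumerate(pages, start=1):
        # Assumption: the "$" placeholder in n="$" stands for the page number.
        div = ET.SubElement(body, "div", {"type": "page", "n": str(n)})
        head = ET.SubElement(div, "head")
        head.text = "%s memoir, page %d" % (name, n)
        ET.SubElement(head, "seg", {"ana": locator})
        ET.SubElement(head, "graphic", {"url": image_url})
        # Plain text only: any markup in the string would be escaped here.
        p = ET.SubElement(div, "p")
        p.text = paragraph
    return text


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    sample = [("loc-1", "http://example.org/1.jpg", "First page text.")]
    print(ET.tostring(build_memoir_text("Anna", sample), encoding="unicode"))
```

The teiHeader and schema association are left out; in the workflow described here they come from the ML template in oXygen.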
The GitHub repo contains XSLT files (in "XSLT for all TEI files") that convert TEI to plain HTML, produce tokenized TXT, and extract entities (currently people and places, but this can be extended). The XSLT directory sits at the top level of the repo. The TEI, XML, XSLT, CSS, and TXT files are stored in the TEI Memoir directory structure.
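The repo's XSLT files are not reproduced here. To show what "extracting entities" amounts to, the following Python sketch does a comparable job with the standard library. It assumes that people and places are tagged with the TEI elements `persName` and `placeName`, which this page does not confirm; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the sketch works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def extract_entities(tei_xml, wanted=("persName", "placeName")):
    """Return a dict of element name -> list of distinct text values, in document order."""
    root = ET.fromstring(tei_xml)
    found = {name: [] for name in wanted}
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments or processing instructions, if present
        name = local_name(el.tag)
        if name in found:
            # Collapse internal whitespace so line breaks do not create duplicates.
            value = " ".join("".join(el.itertext()).split())
            if value and value not in found[name]:
                found[name].append(value)
    return found
```

Extending the extraction to other entity types would mean adding element names to `wanted`, for example `orgName`, provided the transcriptions actually use them.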