This repository was archived by the owner on Feb 29, 2020. It is now read-only.

Transcription Workflow

Diane Jakacki edited this page May 31, 2018 · 5 revisions

Moravian Lives

Workflow: Transcription Desk to TEI-XML, then export (Diane Jakacki, 05/31/2018)

TD/Mediawiki Export

  • As transcriptions are completed in Transcription Desk (TD), they are added to the "ML: Transcription Desk - Transcribed Status" spreadsheet: https://docs.google.com/spreadsheets/d/1vXQrEDowHRYyQ0U3YRECVxLIbidgsglk3qd-8-FhR4k/edit?usp=sharing
  • Transcribed text is 'scraped' from individual files in Mediawiki (admin only) and compiled in oXygen into a single P5 TEI-XML file that holds all memoir pages, with the ML teiHeader and schema. [links here]
  • Since the markup in the Transcription Desk / raw view is created for HTML/browser viewing, all HTML tags must be stripped out.
    • 'Semantic' tags in TD are XML but not TEI-compliant, so they must be converted with find/replace.
    • Some markup imposed by Mediawiki (e.g. &nbsp; entities, some UTF-8 letters) must be removed or converted for the XML files to validate.
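
The clean-up steps above are done by hand with find/replace in oXygen. As a rough illustration only, the same three passes could be scripted in Python. The tag names in `TD_TO_TEI` and `HTML_TAGS` are hypothetical placeholders (this page does not list the real TD tag set), and treating `&nbsp;` as the offending Mediawiki markup is an assumption.

```python
import re

# Hypothetical mapping from TD 'semantic' tag names to TEI element names;
# the real TD tag set is not listed on this page.
TD_TO_TEI = {"person": "persName", "place": "placeName"}

# HTML presentation tags to drop (an assumed, incomplete list).
HTML_TAGS = ("br", "span", "div", "b", "i", "u", "sup", "sub")


def clean_page(raw: str) -> str:
    """Strip HTML tags, rename TD semantic tags, convert HTML-only entities."""
    text = raw
    # 1. Remove opening, closing, and self-closing HTML tags; keep their text.
    html_pattern = r"</?(?:%s)\b[^>]*>" % "|".join(HTML_TAGS)
    text = re.sub(html_pattern, "", text, flags=re.IGNORECASE)
    # 2. Rename each semantic tag (opening and closing) to its TEI equivalent.
    for td_name, tei_name in TD_TO_TEI.items():
        text = re.sub(r"<(/?)%s\b" % re.escape(td_name), r"<\1%s" % tei_name, text)
    # 3. &nbsp; is not a predefined XML entity, so swap in the numeric
    #    character reference, which any XML parser accepts.
    text = text.replace("&nbsp;", "&#160;")
    return text
```

A script like this only helps if the mapping is complete; any TD tag missing from it passes through unchanged and will still fail validation against the ML schema.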

XML to TEI

  • Memoir pages are structured as follows:

    <text type="memoir">
      <body xml:lang="eng">
        <div type="page" n="$">
          <head>[NAME] memoir, page 1
            <seg ana="LOCATOR from GITHUB"/>
            <graphic url="http://moravianlives.bucknell.edu/##.jpg"/>
          </head>
          <p>

          </p>
        </div>
        <!-- repeated for each page -->
      </body>
    </text>
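
To make the page structure concrete, here is a minimal Python sketch that builds the same skeleton with the standard library. It is not the project's tooling (compilation is done in oXygen). The function name `build_memoir` and its arguments are hypothetical, and it assumes that the `n` placeholder stands for the page number. It also omits the teiHeader and the TEI namespace, and stores each transcription as plain text rather than marked-up content.

```python
import xml.etree.ElementTree as ET

# Clark-notation name for the xml:lang attribute; ElementTree serializes it
# with the reserved "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_memoir(name, pages):
    """Build <text type="memoir"> from (locator, image_url, transcription) tuples."""
    text = ET.Element("text", {"type": "memoir"})
    body = ET.SubElement(text, "body", {XML_LANG: "eng"})
    for number, (locator, image_url, transcription) in enumerate(pages, start=1):
        # One <div type="page"> per page; n is assumed to be the page number.
        div = ET.SubElement(body, "div", {"type": "page", "n": str(number)})
        head = ET.SubElement(div, "head")
        head.text = f"{name} memoir, page {number}"
        ET.SubElement(head, "seg", {"ana": locator})
        ET.SubElement(head, "graphic", {"url": image_url})
        paragraph = ET.SubElement(div, "p")
        paragraph.text = transcription
    return text
```

Calling `ET.tostring(build_memoir(...), encoding="unicode")` returns the serialized XML, which can then be pasted under the ML teiHeader and validated against the schema.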

TEI to Multiple Formats

The GitHub repo contains XSLT files (in "XSLT for all TEI files") that convert TEI to plain HTML and tokenized TXT, and that extract entities (currently people and places, but this can be extended). The XSLT directory sits at the top level of the repo. The TEI, XML, XSLT, CSS, and TXT files are stored in the TEI Memoir directory structure.
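
The XSLT files in the repo are the actual conversion tools. Purely to illustrate what the entity and tokenized-text outputs involve, here is a hedged Python sketch. It assumes people and places are tagged with the standard TEI elements `persName` and `placeName` (this page does not say which elements the ML files use) and takes "tokenized" to mean simple word tokens; both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def extract_entities(root):
    """Collect the text of persName and placeName elements (assumed tag names)."""
    buckets = {"persName": "people", "placeName": "places"}
    entities = {"people": [], "places": []}
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        bucket = buckets.get(local_name(elem.tag))
        if bucket:
            # Join nested text and normalize internal whitespace.
            entities[bucket].append(" ".join("".join(elem.itertext()).split()))
    return entities


def tokenize(root):
    """Return word tokens from every <p>, one simple reading of 'tokenized TXT'."""
    tokens = []
    for elem in root.iter():
        if isinstance(elem.tag, str) and local_name(elem.tag) == "p":
            tokens.extend(re.findall(r"\w+", "".join(elem.itertext())))
    return tokens
```

Both functions take a parsed root element, for example from `ET.parse(path).getroot()`, and ignore namespaces so they work whether or not the file declares the TEI namespace. Extending the extraction to other entity types would mean adding entries to `buckets`.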