add report option (#98)
dwhieb authored Jan 27, 2022
1 parent c3f3f37 commit c05372a
Showing 3 changed files with 33 additions and 9 deletions.
9 changes: 6 additions & 3 deletions README.md
@@ -8,11 +8,10 @@ The database uses the [Data Format for Digital Linguistics][DaFoDiL] (DaFoDiL) a

<!-- TOC -->
- [Sources](#sources)
- [Project Requirements](#project-requirements)
- [Process](#process)
- [Style Guide](#style-guide)
- [The Database](#the-database)
- [Building the Database](#building-the-database)
- [Building & Updating the Database](#building--updating-the-database)
- [Steps to incrementally update the production database](#steps-to-incrementally-update-the-production-database)
- [Tests](#tests)
<!-- /TOC -->

@@ -76,6 +75,10 @@ To build and/or update the database, follow the steps below. Each of these steps

Entries from individual sources are **not** imported as main entries in the ALTLab database. Instead they are stored as subentries (using the `dataSources` field). The import script merely matches entries from individual sources to a main entry, or creates a main entry if none exists. An aggregation script then does the work of combining information from each of the subentries into a main entry (see the next step).

Each import step prints a table to the console, showing how many entries from the original data source were unmatched.

When importing the Maskwacîs database, you can add an `-r` or `--report` flag to output a list of unmatched entries to a file. The flag takes the file path as its argument.

6. Aggregate the data from the individual data sources: `node bin/aggregate.js <inputPath> <outputPath>` (the output path can be the same as the input path; this will overwrite the original).

7. For convenience, you can perform all the above steps with a single command in the terminal: `npm run build` | `yarn build`. In order for this command to work, you will need each of the following files to be present in the `/data` directory, with these exact filenames:
1 change: 1 addition & 0 deletions bin/import-MD.js
@@ -6,6 +6,7 @@ import program from 'commander';
program
.arguments(`<mdPath> <databasePath> [fstPath]`)
.usage(`convert-md <mdPath> <databasePath> [fstPath]`)
.option(`-r, --report <reportPath>`, `generate report of unmatched entries`)
.action(importMD);

program.parse(process.argv);
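
The handler registered with `.action(importMD)` needs to accept the new option without breaking callers that pass only paths. Below is a minimal sketch of that calling convention, assuming the CLI framework hands the parsed options to the handler as an object after the positional arguments. `describeImport` and the file names are hypothetical stand-ins, not part of this commit.

```javascript
// Hypothetical stand-in for the import handler: same parameter shape, no file access.
function describeImport(mdPath, dbPath, fstPath, { report } = {}) {
  return {
    mdPath,
    dbPath,
    fstPath,
    // `report` stays undefined unless a report path was supplied.
    writesReport: Boolean(report),
    reportPath: report,
  };
}

// Called with paths only: the `= {}` default keeps the destructuring from throwing.
const plain = describeImport(`md.tsv`, `database.ndjson`);

// Called the way a CLI run with `--report unmatched.tsv` is assumed to call it.
const withReport = describeImport(`md.tsv`, `database.ndjson`, undefined, { report: `unmatched.tsv` });

console.log(plain.writesReport);    // false
console.log(withReport.reportPath); // unmatched.tsv
```

Defaulting the fourth parameter to `{}` is what keeps the option backwards compatible: scripts that import the function and call it with two or three paths behave exactly as before.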
32 changes: 26 additions & 6 deletions lib/import/MD.js
@@ -1,8 +1,9 @@
-import createSpinner from 'ora';
-import DatabaseIndex from '../utilities/DatabaseIndex.js';
-import readNDJSON from '../utilities/readNDJSON.js';
-import { Transducer } from 'hfstol';
-import writeNDJSON from '../utilities/writeNDJSON.js';
+import createSpinner from 'ora';
+import { createWriteStream } from 'fs';
+import DatabaseIndex from '../utilities/DatabaseIndex.js';
+import readNDJSON from '../utilities/readNDJSON.js';
+import { Transducer } from 'hfstol';
+import writeNDJSON from '../utilities/writeNDJSON.js';

function getPos(str) {
if (!str) return ``;
@@ -32,8 +33,11 @@ function updateEntry(dbEntry, mdEntry) {
* Imports the MD entries into the ALTLab database.
* @param {String} mdPath
* @param {String} dbPath
* @param {String} [fstPath]
* @param {Object} [options={}]
* @param {String} [options.report] The path where the report of unmatched entries should be written. If omitted, no report is generated.
*/
-export default async function importMD(mdPath, dbPath, fstPath) {
+export default async function importMD(mdPath, dbPath, fstPath, { report } = {}) {

const readDatabaseSpinner = createSpinner(`Reading databases.`).start();

@@ -141,4 +145,20 @@ export default async function importMD(mdPath, dbPath, fstPath) {
'Entries without a match:': unmatched.length,
});

if (report) {

const reportSpinner = createSpinner(`Generating report of unmatched entries.`).start();
const writeStream = createWriteStream(report);

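// Every line, including the header, ends with a newline so that each unmatched entry gets its own row.
writeStream.write(`head\tPOS\toriginal\n`);

for (const { head, original, pos } of unmatched) {
writeStream.write(`${ head.md }\t${ pos }\t${ original }\n`);
}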

writeStream.end();
reportSpinner.succeed();

}

}
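
One way to make the report format easy to check is to build the tab-separated text with a pure function before anything is written to disk. The sketch below assumes each unmatched entry has the shape `{ head: { md }, pos, original }`, mirroring the fields destructured in the report loop. `formatReport` and the sample entries are made up for illustration and are not part of this commit.

```javascript
// Build the report text: one header line, then one tab-separated line per entry.
// Assumed entry shape: { head: { md: String }, pos: String, original: String }
function formatReport(unmatched) {
  const header = [`head`, `POS`, `original`].join(`\t`);
  const rows = unmatched.map(({ head, original, pos }) => [head.md, pos, original].join(`\t`));
  // Terminate every line, including the last, with a newline.
  return [header, ...rows].join(`\n`) + `\n`;
}

// Placeholder entries, not real dictionary data.
const sample = [
  { head: { md: `example-head-1` }, pos: `N`, original: `raw line 1` },
  { head: { md: `example-head-2` }, pos: `V`, original: `raw line 2` },
];

const text = formatReport(sample);
console.log(text);
```

The resulting string could then be written in one call, or streamed line by line as in the commit. Keeping the formatting separate from the I/O means the column order and line endings can be tested without touching the file system.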
