diff --git a/README.md b/README.md
index 5b2a80f..ded8d01 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,28 @@
-# AcrosticScout
+# AcrosticSleuth

-AcrosticScout is a program for identifying and ranking acrostics.
+AcrosticSleuth is a program for identifying and ranking acrostics.
 At a high level, the tool works by comparing the probability of random occurrence with the probability that a sequence of characters forms a meaningful word or phrase in the target language.
-AcrosticScout is optimized to quickly process gigabytes of text.
-With the help of AcrosticScout, we have been able to discover multiple previously unknown acrostics, including the English philosopher's Thomas Hobbes signature in *The Elements of Law* (THOMAS[OF]HOBBES).
+AcrosticSleuth is optimized to quickly process gigabytes of text.
+With the help of AcrosticSleuth, we have been able to discover multiple previously unknown acrostics, including the English philosopher's Thomas Hobbes signature in *The Elements of Law* (THOMAS[OF]HOBBES).
 You can read more about the methodology in our upcoming paper ([preprint]()).

 ### Table of contents

-- [What languages does AcrosticScout support?](#what-languages-does-acrosticscout-support)
-- [How to install and use AcrosticScout?](#how-to-install-and-use-acrosticscout)
+- [What languages does AcrosticSleuth support?](#what-languages-does-acrosticsleuth-support)
+- [How to install and use AcrosticSleuth?](#how-to-install-and-use-acrosticsleuth)
 - [Hello World example](#hello-world-example)
-- [How was AcrosticScout evaluated?](#how-was-acrosticscout-evaluated)
+- [How was AcrosticSleuth evaluated?](#how-was-acrosticsleuth-evaluated)
 - [How to reproduce our results?](#how-to-reproduce-our-results)
 - [How to cite this?](#how-to-cite-this)

-## What languages does AcrosticScout support?
-AcrosticScout currently support **English, French, Russian, and Latin**.
-The only language-specific component of AcrosticScout is the unigram language model produced by [sentencepiece](https://github.com/google/sentencepiece).
-Support for new languages can, therefore, be easily added -- please [make an issue](https://github.com/acrostics/acrostic-scout/issues/new) here on GitHub if you wish to use AcrosticScout with another language.
+## What languages does AcrosticSleuth support?
+AcrosticSleuth currently support **English, French, Russian, and Latin**.
+The only language-specific component of AcrosticSleuth is the unigram language model produced by [sentencepiece](https://github.com/google/sentencepiece).
+Support for new languages can, therefore, be easily added -- please [make an issue](https://github.com/acrostics/acrostic-sleuth/issues/new) here on GitHub if you wish to use AcrosticSleuth with another language.

-## How to install and use AcrosticScout?
+## How to install and use AcrosticSleuth?

-To run AcrosticScout, you need Java SDK installed on your machine.
-We have tested AcrosticScout on Mac OS and Linux.
+To run AcrosticSleuth, you need Java SDK installed on your machine.
+We have tested AcrosticSleuth on Mac, Mac-Arm, Ubuntu, and Windows [as part of our CI](.github/workflows/main.yml).

 First, compile the code from the base directory using:
@@ -30,18 +30,18 @@ First, compile the code from the base directory using:
 javac -cp src -encoding UTF-8 src/acrostics/*.java
 ```

-Then run AcrosticScout using the command below, replacing `INPUT` and `LANG` with the name of the directory that contains the dataset you wish AcrosticScout to analyze and the language of that dataset, respectively:
+Then run AcrosticSleuth using the command below, replacing `INPUT` and `LANG` with the name of the directory that contains the dataset you wish AcrosticSleuth to analyze and the language of that dataset, respectively:

 ```bash
 java -cp src acrostics.Main -input INPUT -language LANG
 ```

-AcrosticScout accepts multiple optional command line arguments (thank you, [picocli](https://github.com/remkop/picocli/tree/v4.7.6)) -- run the tool with the `--help` flag to get the up-to-date list of all available options.
+AcrosticSleuth accepts multiple optional command line arguments (thank you, [picocli](https://github.com/remkop/picocli/tree/v4.7.6)) -- run the tool with the `--help` flag to get the up-to-date list of all available options.

 ## Hello World example

-This repository includes an example dataset comprising a subset of pages with acrostics from the English subdomain of WikiSource database (see [How was AcrosticScout evaluated?](#how-was-acrosticscout-evaluated)).
-You can test AcrosticScout on this small dataset using:
+This repository includes an example dataset comprising a subset of pages with acrostics from the English subdomain of WikiSource database (see [How was AcrosticSleuth evaluated?](#how-was-acrosticsleuth-evaluated)).
+You can test AcrosticSleuth on this small dataset using:

 ```bash
 java -cp src acrostics.Main -input data/example -language EN -mode LINE -charset utf-8 -outputSize 4000 --concise
@@ -52,7 +52,7 @@ Here is the meaning behind each of the options used:
 - `-language EN`: use the default English language model
 - `-mode LINE`: search for line acrostics (where an acrostic is formed by the initial letters of each line)
 - `-charset utf-8`: use the utf-8 encoding when opening the files
-- `-outputSize 4000`: return top 4000 instances (AcrosticScout clusters collocated instances, so the actual number of results it returns is much smaller -- 46)
+- `-outputSize 4000`: return top 4000 instances (AcrosticSleuth clusters collocated instances, so the actual number of results it returns is much smaller -- 46)
 - `--concise`: only report key information (file,acrostic,rank).

 Specifically, you should be getting the following output (highest ranked acrostics appear at the bottom of the list):
@@ -108,10 +108,10 @@ data/example/The PearlVolume 18Acrostic.txt cunt_is_sweet_when_young_and_ten
 data/example/The Confessions of William-Henry Ireland.txt warwick_at_dudley_at_southampton_at_rivers_at_shakspeare 7.6181055E+27
 ```

-## How was AcrosticScout evaluated?
+## How was AcrosticSleuth evaluated?

 We have created the [Acrostic Identification Task Dataset](https://github.com/acrostics/acrostic-identification-task-dataset) by manually identifying all poems explicitly referred to or formatted as acrostics on English, Russian, and French subdomains of [WikiSource](https://en.wikisource.org/wiki/Main_Page), an online library of source texts in the public domain.
-AcrosticScout reaches recall of over 50% within the first 100 results it returns for English and Russian, and recall rises to up to 80% when considering more results.
+AcrosticSleuth reaches recall of over 50% within the first 100 results it returns for English and Russian, and recall rises to up to 80% when considering more results.
 Read more in our [paper]():

 ![](RecallFigure.svg)
@@ -131,9 +131,9 @@ First, clone this directory with the `--recursive` flag, so that it also include
 Next, follow the directions for [downloading and setting up the Acrostic Identification Task Dataset](https://github.com/acrostics/acrostic-identification-task-dataset/blob/main/README.md), which is cloned as a submodule for this repository in the `data` directory.
 Make sure to run the [get_data.sh](https://github.com/acrostics/acrostic-identification-task-dataset/blob/main/get_data.sh) script as discussed in the README linked above.

-Finally, to run AcrosticScout on the dataset and measure its recall, run [data/evaluate_on_acrostics-identification-task-dataset.sh](data/evaluate_on_acrostics-identification-task-dataset.sh).
+Finally, to run AcrosticSleuth on the dataset and measure its recall, run [data/evaluate_on_acrostics-identification-task-dataset.sh](data/evaluate_on_acrostics-identification-task-dataset.sh).
 The script will save the output files in the `output` directory and produce `recall.png` figure that plots the recall graph you see above and in the paper.

 ## How to cite this?

-Fedchin, A., Cooperman, I., Chaudhuri, P., Dexter, J.P. 2024 "AcrosticScout: Differentiating True Acrostics from Random Noise in Multilingual Corpora Using Probabilistic Ranking". Forthcoming
+Fedchin, A., Cooperman, I., Chaudhuri, P., Dexter, J.P. 2024 "AcrosticSleuth: Differentiating True Acrostics from Random Noise in Multilingual Corpora Using Probabilistic Ranking". Forthcoming
diff --git a/data/acrostic-identification-task-dataset b/data/acrostic-identification-task-dataset
index 0f8d4a7..c9db21b 160000
--- a/data/acrostic-identification-task-dataset
+++ b/data/acrostic-identification-task-dataset
@@ -1 +1 @@
-Subproject commit 0f8d4a7915ff2523f76508ebce7dde59917d3934
+Subproject commit c9db21bd4a7bc43cafbdd914e90d6b748dc7cb7d