Adding sentiment to corpora #891

TomazErjavec · 2025-01-05T11:01:54Z

In ParlaCAP, sentiment scores and labels will be added to sentences and utterances; this issue serves to document the needed changes:

how to encode sentiment
changes to the schema and documentation
changes to the conversion programs
changes to the registry files

Note that I have made a new milestone ParlaCAP (and assigned this issue to it) which should be used for issues pertaining to the project.
The debate here should be relevant to @matyaskopp, @katjameden, @nljubesi.

TomazErjavec · 2025-01-05T11:15:57Z

For SI we have already added sentiment to <u> and <s>, modified the schema and the conversion to vertical files, and added a new sentiment taxonomy, currently local to SI (this is a draft and might well change, especially the description part).

In short:

the sentiment label is encoded in u/@ana and s/@ana; it refers to the sentiment taxonomy via the extended prefix senti.
the sentiment score is encoded in u/@n and s/@n; this is not a very good solution, as @n is a very general attribute, but I currently do not have a better idea of how to preserve the score in the encoding.

E.g.

ParlaMint/Samples/ParlaMint-SI/2007/ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.xml

Line 144 in 6dd236d

    
           <u who="#PečeSašo" xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.u1" ana="#chair senti:neupos" n="3.16">

ParlaMint/Samples/ParlaMint-SI/2007/ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.xml

Line 146 in 6dd236d

    
           <s xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.seg1.1" ana="senti:mixpos" n="3.68">
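To make the encoding concrete, here is a minimal sketch of how a consumer might read the label and score back out. It is not one of the ParlaMint conversion programs: the wrapping `<div>`, the `<seg>` nesting, and the function names `read_sentiment` and `iter_sentiment` are assumptions made for illustration, and the score in @n is only interpreted when a `senti:` label is present, since @n is a general attribute.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <u> and <s> start tags are copied from the SI sample; the <div> wrapper
# and the <seg> nesting are assumed here only to get a well-formed fragment.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <u who="#PečeSašo" xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.u1" ana="#chair senti:neupos" n="3.16">
    <seg>
      <s xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.seg1.1" ana="senti:mixpos" n="3.68"/>
    </seg>
  </u>
</div>"""


def read_sentiment(elem):
    """Return (label, score) for an element, or (None, None) if it has no senti: label."""
    label = None
    for token in elem.get("ana", "").split():
        if token.startswith("senti:"):
            label = token[len("senti:"):]
            break
    if label is None:
        # Without a sentiment label, @n is not assumed to be a sentiment score.
        return None, None
    n = elem.get("n")
    return label, float(n) if n else None


def iter_sentiment(root):
    """Yield (xml:id, element name, label, score) for all <u>, then all <s>."""
    for tag in ("u", "s"):
        for elem in root.iter(f"{{{TEI_NS}}}{tag}"):
            label, score = read_sentiment(elem)
            if label is not None:
                yield elem.get(XML_ID), tag, label, score


rows = list(iter_sentiment(ET.fromstring(SAMPLE)))
```

Note that other values in @ana (here `#chair`) are skipped rather than treated as errors, since the attribute carries several kinds of pointers.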

The new ParlaMint-SI is also available via the concordancer for testing.

The documentation and other conversions (in particular, to TSV) still need to be implemented.

The problem right now is that I've added the new taxonomy to all the relevant programs that deal with taxonomies, but now the CI validation complains that this taxonomy is missing from all the corpora (except SI). @matyaskopp, how best to solve this? Make the taxonomy (somehow) optional, or (manually) insert the taxonomy, its XInclude and prefixDef into all the samples? Or something else?
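As a rough illustration of the "optional" variant, a check could require the `senti` prefix declaration only in documents that actually use it. This is a sketch under assumptions, not how the CI validation works: the function names are hypothetical, the `matchPattern` and `replacementPattern` values in the sample are placeholders, and it assumes the prefixDef and the @ana usage are visible in the same parsed tree (in practice the declaration may sit in the corpus root file rather than in each component).

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def uses_prefix(root, prefix="senti"):
    """True if any @ana in the tree contains a pointer starting with 'prefix:'."""
    return any(
        token.startswith(prefix + ":")
        for elem in root.iter()
        for token in elem.get("ana", "").split()
    )


def declares_prefix(root, prefix="senti"):
    """True if the tree contains a <prefixDef> whose @ident equals the prefix."""
    return any(pd.get("ident") == prefix for pd in root.iter(TEI + "prefixDef"))


def check_optional_taxonomy(root, prefix="senti"):
    """Return an error message, or None if usage and declaration are consistent."""
    if uses_prefix(root, prefix) and not declares_prefix(root, prefix):
        return f"@ana uses '{prefix}:' but no prefixDef with ident='{prefix}' is declared"
    return None


# Hypothetical miniature documents; the prefixDef patterns are placeholders.
WITHOUT_SENTIMENT = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><u ana="#chair"/></TEI>'
DECLARED = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<prefixDef ident="senti" matchPattern="(.+)" replacementPattern="#$1"/>'
    '<u ana="#chair senti:neupos" n="3.16"/></TEI>'
)
UNDECLARED = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><u ana="senti:neupos" n="3.16"/></TEI>'

results = [
    check_optional_taxonomy(ET.fromstring(doc))
    for doc in (WITHOUT_SENTIMENT, DECLARED, UNDECLARED)
]
```

With a check of this kind, corpora without sentiment would pass unchanged, and only a document that uses `senti:` without declaring it would be reported.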

TomazErjavec added the enhancement New feature or request label Jan 5, 2025

TomazErjavec added this to the ParlaCAP milestone Jan 5, 2025

TomazErjavec self-assigned this Jan 5, 2025

TomazErjavec added a commit that referenced this issue Jan 5, 2025

SI samples with added sentiment (#891).

6dd236d
