Adding sentiment to corpora #891

Open
TomazErjavec opened this issue Jan 5, 2025 · 1 comment

@TomazErjavec

In ParlaCAP, sentiment scores and labels will be added to sentences and utterances. This issue documents the changes needed:

  • how to encode sentiment
  • changes to the schema and documentation
  • changes to the conversion programs
  • changes to the registry files

Note that I have created a new milestone, ParlaCAP, and assigned this issue to it; the milestone should be used for all issues pertaining to the project.
The discussion here should be relevant to @matyaskopp, @katjameden, @nljubesi.

@TomazErjavec TomazErjavec added the enhancement New feature or request label Jan 5, 2025
@TomazErjavec TomazErjavec added this to the ParlaCAP milestone Jan 5, 2025
@TomazErjavec TomazErjavec self-assigned this Jan 5, 2025
@TomazErjavec

For SI, we have already added sentiment to <u> and <s>, modified the schema and the conversion to vertical files, and added a new sentiment taxonomy, which is currently local to SI (this is a draft and might well change, especially the description part).

In short:

  • the sentiment label is encoded in u/@ana and s/@ana; it references the sentiment taxonomy using the extended prefix senti.
  • the sentiment score is encoded in u/@n and s/@n; this is not a very good solution, as @n is a very general attribute, but I currently do not have a better idea of how to preserve the score in the encoding.

E.g.

<u who="#PečeSašo" xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.u1" ana="#chair senti:neupos" n="3.16">

<s xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.seg1.1" ana="senti:mixpos" n="3.68">
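
To make the encoding above concrete, here is a minimal Python sketch of how a consumer might read the label and score back out. It is only an illustration, not part of the ParlaMint conversion programs: the function names `read_sentiment` and `collect` are hypothetical, the sample fragment is the two example tags with closing tags added so that it parses, the TEI namespace is left out for brevity, and it assumes that @n on these elements always holds a number.

```python
import xml.etree.ElementTree as ET

# Clark-notation key under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment built from the two example tags above. The wrapper <div>, the closing
# tags and the "..." content are added only to make it well-formed; the TEI
# namespace is omitted (an assumption made for brevity).
SAMPLE = """
<div>
  <u who="#PečeSašo" xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.u1" ana="#chair senti:neupos" n="3.16">
    <s xml:id="ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.seg1.1" ana="senti:mixpos" n="3.68">...</s>
  </u>
</div>
"""


def read_sentiment(elem):
    """Return (label, score) for a <u> or <s> element; either may be None."""
    label = None
    # @ana is a whitespace-separated list of pointers; pick the senti: one.
    for ref in elem.get("ana", "").split():
        if ref.startswith("senti:"):
            label = ref[len("senti:"):]
            break
    # Assumption: @n carries the numeric sentiment score on these elements.
    n = elem.get("n")
    score = float(n) if n is not None else None
    return label, score


def collect(xml_text):
    """Collect (tag, xml:id, label, score) for every <u> and <s> in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        if elem.tag in ("u", "s"):
            label, score = read_sentiment(elem)
            rows.append((elem.tag, elem.get(XML_ID), label, score))
    return rows


if __name__ == "__main__":
    for row in collect(SAMPLE):
        print(row)
```

Note that the label has to be picked out of @ana by its prefix, because @ana can hold other pointers as well (here `#chair`), while the score can only be recognised by convention, which is the weakness of using @n mentioned above.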

The new ParlaMint-SI is also available via the concordancer for testing.

The documentation and other conversions (in particular, to TSV) still need to be implemented.

The problem right now is that I've added the new taxonomy to all the relevant programs that deal with taxonomies, but now the CI validation complains that this taxonomy is missing from all the corpora (except SI). @matyaskopp, how best to solve this? Make the taxonomy (somehow) optional, or (manually) insert the taxonomy, its XInclude and prefixDef into all the samples? Or something else?
