CJK features for ZH,JA,KO wiki #232

5uperpalo · 2020-12-24T17:20:34Z

initial commit, not ready for merge, missing tests

halfak · 2020-12-31T16:34:19Z

editquality/feature_lists/tests/test_zhwiki.py

+
+def test_zhwiki():
+    assert ([round(i) for i in solve(zhwiki.cjk, cache=cache)] ==
+            [4, 2, 7, 0, 4, 2, 1])


Maybe test each feature explicitly.

Add test_jawiki & test_kowiki too.

5uperpalo · 2021-01-04T15:29:14Z

@halfak
I am trying to add tests and I am stuck at the debugging. Why is this happening? Do you have any ideas/suggestions? "た" is a single token and also character. If you uncomment the lines you will get the correct result [2,2,2] (2 cjk chars changed, 2 tokens inc.cjk changed, 2 cjk tokens changed). If you leave it uncommented the jawiki.wikitext.diff_cjk will use datasources.cjks and datasources.tokens and not datasources.cjk.cjks and datasources.cjk.tokens so the result is incorrect [2,0,0]. I am lost why this sis happening.

from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented
from editquality.feature_lists import jawiki
from revscoring.features import wikitext
from revscoring.features.modifiers import sub

r_text = revision_oriented.revision.text
p_text = revision_oriented.revision.parent.text

p_text_text = """
敗戦後は桑原武夫の『第二芸術－現代俳句について』
（1946年）によって、短詩型である俳句の限界が指摘された。
"""
r_text_text = """
敗戦後は桑原武夫の『第二芸術－現代俳句について』
（1946年）によって、短詩型である俳句の限界が指摘されたたた。
"""
cache = {p_text: p_text_text,
         r_text: r_text_text}

# cjkwordthings_change = solve(sub(wikitext.revision.cjk.cjks, wikitext.revision.parent.cjk.cjks, name="revision.diff.cjk.cjkwordthings_change"), cache=cache)
# parent_cjks = len(solve(wikitext.revision.datasources.cjk.cjks, cache=cache))
# cjks = len(solve(wikitext.revision.datasources.parent.cjk.cjks, cache=cache))
# print("Revscoring results:\n parent_cjks = {}\n cjks = {}\n cjkwordthings_change = {}".format(parent_cjks, cjks, cjkwordthings_change))

cjkwordthings_change = list(solve(jawiki.wikitext.diff_cjk, cache=cache))
print("Editquality result:\n cjkwordthings_change = {}".format(cjkwordthings_change))

halfak · 2021-01-04T22:03:14Z

The inconsistency could be related to the cjk feature group naming issue I pointed out. See https://github.com/wikimedia/revscoring/pull/501/files#diff-499cc46dd0c97d4e81f2d23e15725821610c10e7be1f5a563846c84865c57069R21

5uperpalo · 2021-01-04T22:25:34Z

@halfak I thought about it and I fixed it in both datasource and features(this is being published to pip right now) but it didn't help, see the fixed names in the master revscoring branch:
https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/features/tokenized.py
https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py

halfak · 2021-01-04T22:34:00Z

Try enabling debug mode and running the code. You should get logging every time a dependency is evaluated.

You might be able to get away with just setting __debug__ = True in your terminal.

Or you might have to configure the logger and set level to logging.DEBUG. Consult https://github.com/wikimedia/editquality/blob/master/editquality/utilities/get_revisions.py#L43 for an example.

5uperpalo · 2021-01-07T17:59:16Z

[UPDATE - SOLVED!!! kinda...?] @halfak
I found what causes the issue, but before I make the changes I want your feedback as I am not 100% sure about this:

I found out that the cjk_chars feature alters the behavior and CJK tokenization never happens. If I put it in the last place in diff_cjk it works as it should
I added cjk (marked with double * in the code block below) to revscoring/features/wikitext/features/chars.py; the idea was to force python to do cjk tokenization as it seems it somehow went around it.

        self.cjk_chars = aggregators.sum(
            mappers.map(len, self.datasources**.cjk**.cjks),
            name=self._name + ".cjk_chars", returns=int
        )

this worked and the edit quality diff_cjk is giving correct results, BUT pytest on revscoring resulted in 1 error. It seems that the previous test was giving incorrect result(now correct), but somebody left it like that, see:

tests/features/wikitext/tests/test_chars.py
failing on

================================================= FAILURES ==================================================
______________________________________________ test_cjk_chars _______________________________________________

    def test_cjk_chars():
        cache = {p_text: "This is 55 {{るは 壌}} a string.",
                 r_text: "This is 56 [[るは 壌のは の]] a string."}

        assert solve(revision.cjk_chars, cache=cache) == 6
        assert solve(revision.parent.cjk_chars, cache=cache) == 3
>       assert solve(revision.diff.cjk_chars_added, cache=cache) == 4
E       AssertionError: assert 3 == 4
E        +  where 3 = solve(<feature.wikitext.revision.diff.cjk_chars_added>, cache={<datasource.revision.parent.text>: 'This is 55 {{るは 壌}} a string.', <datasource.revision.text>: 'This is 56 [[るは 壌のは ...ken('壌', type='cjk_word'), Token('の', type='cjk_word'), Token('は', type='cjk_word'), Token('の', type='cjk_word')], ...})
E        +    where <feature.wikitext.revision.diff.cjk_chars_added> = {wikitext.revision.diff}.cjk_chars_added
E        +      where {wikitext.revision.diff} = {wikitext.revision}.diff

tests/features/wikitext/tests/test_chars.py:92: AssertionError

Does this all make sense? may I proceed with revscoring update/merge and add tests to edit quality?

NOTE:
Minimal example that does not work as it should in current revscoring, but works fine in proposed micro update:

from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented
from revscoring.features import wikitext

r_text = revision_oriented.revision.text
r_text_text = 'れた'

cache = {r_text: r_text_text}

# comment the list you are not testing!!!
# this order gives incorrect results (in current revscoring)
list(solve([wikitext.revision.cjk_chars, wikitext.revision.cjk.tokens], cache=cache))

# this order gives correct results (in current revscoring)
#list(solve([wikitext.revision.cjk.tokens, wikitext.revision.cjk_chars], cache=cache))

halfak · 2021-01-08T16:37:47Z

OK I think I figured it out. If you look at this line: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py#L21

You'll see that we provide tokenized(revision_datasources.text, tok_strategy="CJK") as a dependency for the CJK style tokens. This datasource has the exact same name as the regular tokenization strategy that results from tokenized(revision_datasources.text). In order to make sure it doesn't have the same name, we'll need to provide one. The tokenized() meta datasource has a name field we can use.

E.g., tokenized(revision_datasources.text, tok_strategy="CJK", name=self._name + ".cjk.tokens") seems to be a good name to me.

halfak · 2021-01-08T16:41:22Z

An alternative solution would be to modify the default name generation for the tokenized() meta datasource so that it includes the tokenization strategy when generating a name.

See https://github.com/wikimedia/revscoring/blob/a4f887b773f7bfe0b28c5c28d9dfb0c2698c9256/revscoring/features/wikitext/datasources/tokenized.py#L458

You could modify this line to be:

name = "{0}({1!r}, {2!r})".format("tokenized", text_datasource, tok_strategy)

That would ensure that the tokens datasource has a unique name and it is generally a good practice for including any argument that could change the output in the "name".

halfak · 2021-01-08T16:41:46Z

FWIW, I like the second option better but both are good.

5uperpalo · 2021-01-12T21:42:30Z

@halfak well.. bill murray can only summarize my admiration towards your debugging skills :)

I like the second solution better also .. I updated revscoring, I am releasing version 2.9.3 so it can be downloaded from pip, after it's published I will push new editquality update with new tests

halfak · 2021-01-14T16:59:37Z

Let's add model builds to this before merging. Otherwise looks good.

5uperpalo · 2021-01-26T21:06:54Z

@halfak

Could you please look at the /dev/CJK_tokenization_testing/editquality/nohup.out on ores-misc-01? I could not create translatewiki models - "make" was giving nonstop WARNING messages related to mwapi and revscoring

Other notes:

I noticed that models from Makefile.manual are not creating output in model_info/ , is it ok? (fiwiki, plwiki, translatewiki, wikidatawiki)
at the end, I used conda to build models as, by default, ores-misc-01 has Python version 2.7.13 (I created conda env with Python 3.5.3 per discussion from the last Thursdays call)

pavol86@ores-misc-01:~$ which python
/usr/bin/python
pavol86@ores-misc-01:~$ python --version
Python 2.7.13
(base) pavol86@ores-misc-01:~$ conda activate editquality_test
(editquality_test) pavol86@ores-misc-01:~$ python --version
Python 3.5.3

I did not create tunning_reports, at least not yet as they take ages (we discussed it on Thursday)
it seems there is only a tiny improvement in some models, but at least we can sure that editquality with CJK features works with new revscoring, that supports CJK tokenization

initial_commit_with_cjk_features

7778eaf

5uperpalo self-assigned this Dec 24, 2020

added cjk features

4222d46

5uperpalo changed the title ~~initial_commit_with_cjk_features~~ CJK features for ZH,JA,KO wiki Dec 26, 2020

halfak reviewed Dec 31, 2020

View reviewed changes

update

811202c

updated tests

f4ffad4

5uperpalo mentioned this pull request Jan 12, 2021

fix related to issue found in editquality repository tests wikimedia/revscoring#503

Merged

5uperpalo added 2 commits January 14, 2021 10:54

update

0c788da

updated test-requirements

aa62890

model_info added

c7ec75e

Pavol86 added 3 commits February 4, 2021 15:57

new models

e839148

update

2eb9164

update

7892c8b

mScott224 approved these changes Dec 7, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CJK features for ZH,JA,KO wiki #232

CJK features for ZH,JA,KO wiki #232

5uperpalo commented Dec 24, 2020

halfak Dec 31, 2020

halfak Dec 31, 2020

5uperpalo commented Jan 4, 2021 •

edited

Loading

halfak commented Jan 4, 2021

5uperpalo commented Jan 4, 2021

halfak commented Jan 4, 2021

5uperpalo commented Jan 7, 2021 •

edited

Loading

halfak commented Jan 8, 2021

halfak commented Jan 8, 2021

halfak commented Jan 8, 2021

5uperpalo commented Jan 12, 2021 •

edited

Loading

halfak commented Jan 14, 2021

5uperpalo commented Jan 26, 2021 •

edited

Loading

CJK features for ZH,JA,KO wiki #232

Are you sure you want to change the base?

CJK features for ZH,JA,KO wiki #232

Conversation

5uperpalo commented Dec 24, 2020

halfak Dec 31, 2020

Choose a reason for hiding this comment

halfak Dec 31, 2020

Choose a reason for hiding this comment

5uperpalo commented Jan 4, 2021 • edited Loading

halfak commented Jan 4, 2021

5uperpalo commented Jan 4, 2021

halfak commented Jan 4, 2021

5uperpalo commented Jan 7, 2021 • edited Loading

halfak commented Jan 8, 2021

halfak commented Jan 8, 2021

halfak commented Jan 8, 2021

5uperpalo commented Jan 12, 2021 • edited Loading

halfak commented Jan 14, 2021

5uperpalo commented Jan 26, 2021 • edited Loading

5uperpalo commented Jan 4, 2021 •

edited

Loading

5uperpalo commented Jan 7, 2021 •

edited

Loading

5uperpalo commented Jan 12, 2021 •

edited

Loading

5uperpalo commented Jan 26, 2021 •

edited

Loading