Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CJK features for ZH,JA,KO wiki #232

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

CJK features for ZH,JA,KO wiki #232

wants to merge 10 commits into from

Conversation

5uperpalo
Copy link
Collaborator

initial commit, not ready for merge, missing tests

@5uperpalo 5uperpalo self-assigned this Dec 24, 2020
@5uperpalo 5uperpalo changed the title initial_commit_with_cjk_features CJK features for ZH,JA,KO wiki Dec 26, 2020

def test_zhwiki():
assert ([round(i) for i in solve(zhwiki.cjk, cache=cache)] ==
[4, 2, 7, 0, 4, 2, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test each feature explicitly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add test_jawiki & test_kowiki too.

@5uperpalo
Copy link
Collaborator Author

5uperpalo commented Jan 4, 2021

@halfak
I am trying to add tests and I am stuck at the debugging. Why is this happening? Do you have any ideas/suggestions? "た" is a single token and also character. If you uncomment the lines you will get the correct result [2,2,2] (2 cjk chars changed, 2 tokens inc.cjk changed, 2 cjk tokens changed). If you leave it uncommented the jawiki.wikitext.diff_cjk will use datasources.cjks and datasources.tokens and not datasources.cjk.cjks and datasources.cjk.tokens so the result is incorrect [2,0,0]. I am lost why this sis happening.

from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented
from editquality.feature_lists import jawiki
from revscoring.features import wikitext
from revscoring.features.modifiers import sub

r_text = revision_oriented.revision.text
p_text = revision_oriented.revision.parent.text

p_text_text = """
敗戦後は桑原武夫の『第二芸術-現代俳句について』
(1946年)によって、短詩型である俳句の限界が指摘された。
"""
r_text_text = """
敗戦後は桑原武夫の『第二芸術-現代俳句について』
(1946年)によって、短詩型である俳句の限界が指摘されたたた。
"""
cache = {p_text: p_text_text,
         r_text: r_text_text}

# cjkwordthings_change = solve(sub(wikitext.revision.cjk.cjks, wikitext.revision.parent.cjk.cjks, name="revision.diff.cjk.cjkwordthings_change"), cache=cache)
# parent_cjks = len(solve(wikitext.revision.datasources.cjk.cjks, cache=cache))
# cjks = len(solve(wikitext.revision.datasources.parent.cjk.cjks, cache=cache))
# print("Revscoring results:\n parent_cjks = {}\n cjks = {}\n cjkwordthings_change = {}".format(parent_cjks, cjks, cjkwordthings_change))

cjkwordthings_change = list(solve(jawiki.wikitext.diff_cjk, cache=cache))
print("Editquality result:\n cjkwordthings_change = {}".format(cjkwordthings_change))

@halfak
Copy link
Member

halfak commented Jan 4, 2021

The inconsistency could be related to the cjk feature group naming issue I pointed out. See https://github.com/wikimedia/revscoring/pull/501/files#diff-499cc46dd0c97d4e81f2d23e15725821610c10e7be1f5a563846c84865c57069R21

@5uperpalo
Copy link
Collaborator Author

@halfak I thought about it and I fixed it in both datasource and features(this is being published to pip right now) but it didn't help, see the fixed names in the master revscoring branch:
https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/features/tokenized.py
https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py

@halfak
Copy link
Member

halfak commented Jan 4, 2021

Try enabling debug mode and running the code. You should get logging every time a dependency is evaluated.

You might be able to get away with just setting __debug__ = True in your terminal.

Or you might have to configure the logger and set level to logging.DEBUG. Consult https://github.com/wikimedia/editquality/blob/master/editquality/utilities/get_revisions.py#L43 for an example.

@5uperpalo
Copy link
Collaborator Author

5uperpalo commented Jan 7, 2021

[UPDATE - SOLVED!!! kinda...?] @halfak
I found what causes the issue, but before I make the changes I want your feedback as I am not 100% sure about this:

  1. I found out that the cjk_chars feature alters the behavior and CJK tokenization never happens. If I put it in the last place in diff_cjk it works as it should
  2. I added cjk (marked with double * in the code block below) to revscoring/features/wikitext/features/chars.py; the idea was to force python to do cjk tokenization as it seems it somehow went around it.
        self.cjk_chars = aggregators.sum(
            mappers.map(len, self.datasources**.cjk**.cjks),
            name=self._name + ".cjk_chars", returns=int
        )
  1. this worked and the edit quality diff_cjk is giving correct results, BUT pytest on revscoring resulted in 1 error. It seems that the previous test was giving incorrect result(now correct), but somebody left it like that, see:
tests/features/wikitext/tests/test_chars.py
failing on

================================================= FAILURES ==================================================
______________________________________________ test_cjk_chars _______________________________________________

    def test_cjk_chars():
        cache = {p_text: "This is 55 {{るは 壌}} a string.",
                 r_text: "This is 56 [[るは 壌のは の]] a string."}

        assert solve(revision.cjk_chars, cache=cache) == 6
        assert solve(revision.parent.cjk_chars, cache=cache) == 3
>       assert solve(revision.diff.cjk_chars_added, cache=cache) == 4
E       AssertionError: assert 3 == 4
E        +  where 3 = solve(<feature.wikitext.revision.diff.cjk_chars_added>, cache={<datasource.revision.parent.text>: 'This is 55 {{るは 壌}} a string.', <datasource.revision.text>: 'This is 56 [[るは 壌のは ...ken('壌', type='cjk_word'), Token('の', type='cjk_word'), Token('は', type='cjk_word'), Token('の', type='cjk_word')], ...})
E        +    where <feature.wikitext.revision.diff.cjk_chars_added> = {wikitext.revision.diff}.cjk_chars_added
E        +      where {wikitext.revision.diff} = {wikitext.revision}.diff

tests/features/wikitext/tests/test_chars.py:92: AssertionError

Does this all make sense? may I proceed with revscoring update/merge and add tests to edit quality?

NOTE:
Minimal example that does not work as it should in current revscoring, but works fine in proposed micro update:

from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented
from revscoring.features import wikitext

r_text = revision_oriented.revision.text
r_text_text = 'れた'

cache = {r_text: r_text_text}

# comment the list you are not testing!!!
# this order gives incorrect results (in current revscoring)
list(solve([wikitext.revision.cjk_chars, wikitext.revision.cjk.tokens], cache=cache))

# this order gives correct results (in current revscoring)
#list(solve([wikitext.revision.cjk.tokens, wikitext.revision.cjk_chars], cache=cache))

@halfak
Copy link
Member

halfak commented Jan 8, 2021

OK I think I figured it out. If you look at this line: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py#L21

You'll see that we provide tokenized(revision_datasources.text, tok_strategy="CJK") as a dependency for the CJK style tokens. This datasource has the exact same name as the regular tokenization strategy that results from tokenized(revision_datasources.text). In order to make sure it doesn't have the same name, we'll need to provide one. The tokenized() meta datasource has a name field we can use.

E.g., tokenized(revision_datasources.text, tok_strategy="CJK", name=self._name + ".cjk.tokens") seems to be a good name to me.

@halfak
Copy link
Member

halfak commented Jan 8, 2021

An alternative solution would be to modify the default name generation for the tokenized() meta datasource so that it includes the tokenization strategy when generating a name.

See https://github.com/wikimedia/revscoring/blob/a4f887b773f7bfe0b28c5c28d9dfb0c2698c9256/revscoring/features/wikitext/datasources/tokenized.py#L458

You could modify this line to be:

name = "{0}({1!r}, {2!r})".format("tokenized", text_datasource, tok_strategy)

That would ensure that the tokens datasource has a unique name and it is generally a good practice for including any argument that could change the output in the "name".

@halfak
Copy link
Member

halfak commented Jan 8, 2021

FWIW, I like the second option better but both are good.

@5uperpalo
Copy link
Collaborator Author

5uperpalo commented Jan 12, 2021

@halfak well.. bill murray can only summarize my admiration towards your debugging skills :)
image

I like the second solution better also .. I updated revscoring, I am releasing version 2.9.3 so it can be downloaded from pip, after it's published I will push new editquality update with new tests

@halfak
Copy link
Member

halfak commented Jan 14, 2021

Let's add model builds to this before merging. Otherwise looks good.

@5uperpalo
Copy link
Collaborator Author

5uperpalo commented Jan 26, 2021

@halfak

  • Could you please look at the /dev/CJK_tokenization_testing/editquality/nohup.out on ores-misc-01? I could not create translatewiki models - "make" was giving nonstop WARNING messages related to mwapi and revscoring

Other notes:

  • I noticed that models from Makefile.manual are not creating output in model_info/ , is it ok? (fiwiki, plwiki, translatewiki, wikidatawiki)
  • at the end, I used conda to build models as, by default, ores-misc-01 has Python version 2.7.13 (I created conda env with Python 3.5.3 per discussion from the last Thursdays call)
pavol86@ores-misc-01:~$ which python
/usr/bin/python
pavol86@ores-misc-01:~$ python --version
Python 2.7.13
(base) pavol86@ores-misc-01:~$ conda activate editquality_test
(editquality_test) pavol86@ores-misc-01:~$ python --version
Python 3.5.3
  • I did not create tunning_reports, at least not yet as they take ages (we discussed it on Thursday)
  • it seems there is only a tiny improvement in some models, but at least we can sure that editquality with CJK features works with new revscoring, that supports CJK tokenization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants