CJK features for ZH,JA,KO wiki #232
def test_zhwiki():
    assert ([round(i) for i in solve(zhwiki.cjk, cache=cache)] ==
            [4, 2, 7, 0, 4, 2, 1])
Maybe test each feature explicitly.
Add test_jawiki & test_kowiki too.
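A minimal sketch of what "test each feature explicitly" could look like: one small test per feature, so a failure names the feature that regressed instead of failing a whole list comparison. Everything here (`FEATURES`, `solve_feature`, the two counting functions) is a hypothetical stand-in, not revscoring or editquality API.

```python
# Hypothetical stand-ins for illustration only; these are NOT revscoring APIs.
def count_chars(text):
    return len(text)


def count_cjk_chars(text):
    # Counts code points in the CJK Unified Ideographs block (U+4E00..U+9FFF) only.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


FEATURES = {"chars": count_chars, "cjk_chars": count_cjk_chars}


def solve_feature(name, text):
    # Look up one feature by name and evaluate it on the given text.
    return FEATURES[name](text)


TEXT = "俳句 haiku"


def test_chars():
    assert solve_feature("chars", TEXT) == 8


def test_cjk_chars():
    assert solve_feature("cjk_chars", TEXT) == 2
```

With a runner such as pytest, each function is reported separately, which is the main benefit over a single combined assertion.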
@halfak
from revscoring.dependencies import solve
from revscoring.datasources import revision_oriented
from editquality.feature_lists import jawiki
from revscoring.features import wikitext
from revscoring.features.modifiers import sub
r_text = revision_oriented.revision.text
p_text = revision_oriented.revision.parent.text
p_text_text = """
敗戦後は桑原武夫の『第二芸術-現代俳句について』
(1946年)によって、短詩型である俳句の限界が指摘された。
"""
r_text_text = """
敗戦後は桑原武夫の『第二芸術-現代俳句について』
(1946年)によって、短詩型である俳句の限界が指摘されたたた。
"""
cache = {p_text: p_text_text,
r_text: r_text_text}
# cjkwordthings_change = solve(sub(wikitext.revision.cjk.cjks, wikitext.revision.parent.cjk.cjks, name="revision.diff.cjk.cjkwordthings_change"), cache=cache)
# parent_cjks = len(solve(wikitext.revision.datasources.cjk.cjks, cache=cache))
# cjks = len(solve(wikitext.revision.datasources.parent.cjk.cjks, cache=cache))
# print("Revscoring results:\n parent_cjks = {}\n cjks = {}\n cjkwordthings_change = {}".format(parent_cjks, cjks, cjkwordthings_change))
cjkwordthings_change = list(solve(jawiki.wikitext.diff_cjk, cache=cache))
print("Editquality result:\n cjkwordthings_change = {}".format(cjkwordthings_change))
The inconsistency could be related to the cjk feature group naming issue I pointed out. See https://github.com/wikimedia/revscoring/pull/501/files#diff-499cc46dd0c97d4e81f2d23e15725821610c10e7be1f5a563846c84865c57069R21
@halfak I thought about that and fixed it in both the datasources and the features (this is being published to pip right now), but it didn't help. See the fixed names in the revscoring master branch:
Try enabling debug mode and running the code. You should get logging every time a dependency is evaluated. You might be able to get away with just setting Or you might have to configure the logger and set level to
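The exact setting mentioned in that comment is not preserved in this thread. As a rough sketch using only Python's standard `logging` module, debug output can usually be turned on as below. The logger name `"revscoring"` is an assumption based on the package name.

```python
import logging

# Send DEBUG-and-above records from all loggers to stderr.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s:%(name)s -- %(message)s",
)

# Assumption: the library's loggers are named after its modules, so they
# all sit under the "revscoring" parent logger.
logging.getLogger("revscoring").setLevel(logging.DEBUG)
```

Run this before calling `solve(...)` so that any debug messages emitted while dependencies are evaluated are shown.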
[UPDATE - SOLVED!!! kinda...?] @halfak
Does this all make sense? May I proceed with the revscoring update/merge and add tests to editquality? NOTE:
OK I think I figured it out. If you look at this line: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py#L21 You'll see that we provide E.g.,
An alternative solution would be to modify the default name generation for the You could modify this line to be:
That would ensure that the tokens datasource has a unique name, and it is generally good practice to include any argument that could change the output in the "name".
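To illustrate why the name matters, here is a toy model (not revscoring's implementation) in which solved values are cached by name. The `Datasource` class, `solve`, `make_tokens`, and the `tok_strategy` parameter are all hypothetical. It assumes the bug was two differently configured datasources sharing one name.

```python
class Datasource:
    """Toy stand-in for a named dependency; NOT revscoring's Datasource."""

    def __init__(self, name, process):
        self.name = name
        self.process = process


def solve(datasource, text, cache):
    # Assumption: solved values are looked up by name, so two datasources
    # that share a name also share one cached value.
    if datasource.name not in cache:
        cache[datasource.name] = datasource.process(text)
    return cache[datasource.name]


def split_on_whitespace(text):
    return text.split()


def split_per_character(text):
    return [ch for ch in text if not ch.isspace()]


def make_tokens(prefix, tok_strategy=None, name_includes_args=False):
    # Hypothetical factory; tok_strategy is an illustrative parameter name.
    name = prefix + ".tokens"
    if name_includes_args:
        name = "{0}({1!r})".format(name, tok_strategy)
    if tok_strategy == "cjk":
        process = split_per_character
    else:
        process = split_on_whitespace
    return Datasource(name, process)


text = "俳句の 限界"

# Same name for both variants: the second solve() returns the first one's value.
cache = {}
plain = solve(make_tokens("revision"), text, cache)
shadowed = solve(make_tokens("revision", "cjk"), text, cache)

# Argument included in the name: each variant gets its own cache entry.
cache = {}
plain_named = solve(make_tokens("revision", name_includes_args=True), text, cache)
cjk_named = solve(make_tokens("revision", "cjk", name_includes_args=True), text, cache)

print(shadowed == plain)                  # True: the cjk variant was shadowed
print(len(plain_named), len(cjk_named))   # 2 5
```

In the first run the CJK variant silently returns whitespace-split tokens because its name collides with the default one. Once the argument is part of the name, the two variants no longer interfere.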
FWIW, I like the second option better but both are good.
@halfak well... only Bill Murray can summarize my admiration for your debugging skills :) I like the second solution better too. I updated revscoring and am releasing version 2.9.3 so it can be downloaded from pip. After it's published, I will push a new editquality update with new tests.
Let's add model builds to this before merging. Otherwise looks good.
Other notes:
pavol86@ores-misc-01:~$ which python
/usr/bin/python
pavol86@ores-misc-01:~$ python --version
Python 2.7.13
(base) pavol86@ores-misc-01:~$ conda activate editquality_test
(editquality_test) pavol86@ores-misc-01:~$ python --version
Python 3.5.3
initial commit, not ready for merge, missing tests