Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce generic Rewriter dispatching between specific rewriters. #158

Merged
merged 7 commits into from
Jan 26, 2024

Conversation

mgautierfr
Copy link
Contributor

Some js coming from specific website need to be patched to work "offline".

This patching is made using a set of specific regex to "activate" depending of path.

Sadly, there is no test because there is none in wabac.js (the source of the rules set) and
I don't know want/how to test.

We probably have more to import from wabac.js, but this PR is a fix to #149

@mgautierfr mgautierfr requested a review from benoit74 January 25, 2024 15:47
Copy link

codecov bot commented Jan 25, 2024

Codecov Report

Attention: 80 lines in your changes are missing coverage. Please review.

Comparison is base (a597f3a) 93.65% compared to head (c807f4b) 86.52%.
Report is 22 commits behind head on warc2zim2.

Files Patch % Lines
src/warc2zim/content_rewriting/ds.py 26.08% 68 Missing ⚠️
src/warc2zim/content_rewriting/rx_replacer.py 90.90% 2 Missing and 3 partials ⚠️
src/warc2zim/content_rewriting/js.py 89.47% 2 Missing and 2 partials ⚠️
src/warc2zim/converter.py 92.68% 3 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #158      +/-   ##
=============================================
- Coverage      93.65%   86.52%   -7.13%     
=============================================
  Files             10       13       +3     
  Lines            709      846     +137     
  Branches         122      147      +25     
=============================================
+ Hits             664      732      +68     
- Misses            29       97      +68     
- Partials          16       17       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mgautierfr mgautierfr force-pushed the ds_rewriter branch 2 times, most recently from 55b3ffe to 16181fe Compare January 25, 2024 16:08
@rgaudin
Copy link
Member

rgaudin commented Jan 25, 2024

Is this a reimplementation of the JS version in wabac.js? I suppose it was not possible to source it directly.
What's the update process? I think we at least need a comment referencing the source and how to convert from it

@mgautierfr
Copy link
Contributor Author

Is this a reimplementation of the JS version in wabac.js? I suppose it was not possible to source it directly.

Yes, it is.

What's the update process? I think we at least need a comment referencing the source and how to convert from it

There is no update process. The simple things have been rewritten by chat-gpt 🙈 (and reviewed by a human). The other have been made by a human directly.

We could create a script which update the rules from the wabac source but it would not be straightforward as the "action" part of the rule is a bit different.
And if the code change, we will have to adapt manual anyway.


Last 4 commits are structural improvement but no functional changes.

@benoit74
Copy link
Collaborator

Is it correct to add a comment in ds.py saying that all rules present in this file comes from https://github.com/webrecorder/wabac.js/commits/main/src/rewrite/dsruleset.jsand have been manually converted from Javascript to Python (I would tell details about ChatGPT, who cares ...)? I would add this at least, as suggest by Renaud. I don't believe we can have a more precise conversion tool / procedure (unfortunately).

I'm not very comfortable with the rest of the code, but "that's life" 😄

@rgaudin
Copy link
Member

rgaudin commented Jan 25, 2024

Is it correct to add a comment in ds.py saying that all rules present in this file comes from https://github.com/webrecorder/wabac.js/commits/main/src/rewrite/dsruleset.jsand have been manually converted from Javascript to Python (I would tell details about ChatGPT, who cares ...)? I would add this at least, as suggest by Renaud. I don't believe we can have a more precise conversion tool / procedure (unfortunately).

That's what I have in mind ; just a starting point for another maintainer in the future: knowing where it came from, being able to check if there's a conversion bug and being able to convert new ones

@benoit74
Copy link
Collaborator

Maybe also specify the last commit (from wabac.js) which has been taken into account, so that it is clear what is left to be back-ported

Base automatically changed from adopt_bootstrap to warc2zim2 January 26, 2024 12:42
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please change your PR to target warc2zim2, rebase your branch and resolve merge conflict (I thought it would be easier this way, also in order to close the other PR) and add the comment we discussed explaining where this code comes from?

@benoit74
Copy link
Collaborator

Edit: github already decided to target warc2zim2 automatically (nice ^^)

@mgautierfr
Copy link
Contributor Author

add a comment in ds.py saying that all rules present in this file comes from https://github.com/webrecorder/wabac.js/commits/main/src/rewrite/dsruleset.jsand have been manually converted from Javascript to Python (I would tell details about ChatGPT, who cares ...)?

There was already a comment in the commit message, but now it is also in the code.
(Directly in modifying the commit, no fixup)

@mgautierfr mgautierfr requested a review from benoit74 January 26, 2024 14:22
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost OK, two important questions (could be tackled in a separate issue if needed) and a suggestion to fix typos.

Some js need specific patching. This is done with a set of extra rules.

`ds.py` is almost a plain copy of
https://github.com/webrecorder/wabac.js/blob/main/src/rewrite/dsruleset.js
It contains a base regex replacer, type declaration and helper function.
The function add a suffix only if it is non a property.
Let's move this information in the function name.
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with open issues

@benoit74 benoit74 merged commit ae18aed into warc2zim2 Jan 26, 2024
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants