Upgrade browsertrix crawler and remove redirect handling #285
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           zimit2     #285      +/-   ##
==========================================
+ Coverage   14.88%   14.91%   +0.03%
==========================================
  Files           1        1
  Lines         262      248      -14
  Branches       38       35       -3
==========================================
- Hits           39       37       -2
+ Misses        223      211      -12

☔ View full report in Codecov by Sentry.
"Luckily", tests are failing due to openzim/warc2zim#198 (but even once this is merged, we still need to wait for openzim/warc2zim#196).
@mgautierfr I did not ask you for a formal review of this since, as far as I understand, you are less experienced with zimit, but do not hesitate to have a look and comment as well.
Review welcome again: changing a test "to make it work" probably needs to be confirmed as OK 🤣
LGTM, but the commit (296b104) must include the relevant information (for future blame's sake): why we had and expected 8 entries before, and why we have and expect 7 now.
7 entries are expected:
- https://isago.rskg.org/
- https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css
- https://isago.rskg.org/static/favicon256.png
- https://isago.rskg.org/conseils
- https://isago.rskg.org/faq
- https://isago.rskg.org/a-propos
- https://isago.rskg.org/static/tarifs-isago.pdf

1 unexpected entry is not produced anymore by Browsertrix crawler:
- https://dict.brave.com/edgedl/chrome/dict/en-us-10-1.bdic

This was a technical artifact.
Done, commit updated.
👍
Fix #256
Fix #284
Fix #166
This PR adopts browsertrix crawler 1.0.0-beta.6 (previously 1.0.0-beta5). Among other things, this release now handles redirects nicely (webrecorder/browsertrix-crawler#476).
We therefore have to remove the redirect handling we previously did on our side, which caused issues (#256). We only keep the cleaning of the URL (removal of default ports 443 and 80).
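For illustration only, the URL cleaning that remains could look roughly like the sketch below. This is not the actual zimit code: the function name `strip_default_port` is hypothetical, and it assumes that a port is only dropped when it is the default for the URL's scheme (80 for http, 443 for https).

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: a port is "default" only for its matching scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without an explicit default port (hypothetical helper)."""
    parts = urlsplit(url)
    # No explicit port, or a non-default one: leave the URL untouched.
    if parts.port is None or DEFAULT_PORTS.get(parts.scheme) != parts.port:
        return url
    # An explicit port is present, so the last ":" in netloc precedes it.
    # Cutting there keeps any userinfo and IPv6 brackets intact.
    netloc = parts.netloc.rpartition(":")[0]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


print(strip_default_port("https://example.com:443/path?q=1"))
print(strip_default_port("http://example.com:8080/"))
```

No redirect is followed here; the URL is only normalized and the crawler is left to deal with redirects.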
As a side effect, this will also solve #166, since browsertrix crawler is already permissive regarding SSL certificate issues. The only SSL issues which will continue to block are those where the browser cannot establish the connection at all, like https://panzer-war.com/, where the browser has no cipher in common with the server.
Redirect handling has been tested with https://metafilter.com:
Handling of insecure connections has been tested with https://www.moneyinstructor.com (which still fails without the simplification of check_url):
This PR should not be merged before openzim/warc2zim#196