Enhance support of Browsertrix Crawler arguments #471

Merged: 5 commits, Feb 14, 2025

benoit74 (Collaborator) commented Feb 13, 2025

Fix #416
Fix #433

  • Add many missing Browsertrix Crawler arguments
  • Drop default overrides by zimit
  • Drop --noMobileDevice setting (not needed anymore)
  • Document default values for all Browsertrix Crawler arguments
  • Switch to preferred Browsertrix Crawler argument names
  • Fix confusion between the Browsertrix stats filename and the zimit stats filename, and add support for a special location for warc2zim stats
  • Fix support for seeds sourced from a URL (they were not properly passed to warc2zim)
  • Add support for --seedFile being a URL instead of a local file
  • Fix confusing variable / function names around arguments
    • zimit_args are not args for zimit but for the crawler (or shared by both), hence better named known_args
    • get_node_cmd_line => get_crawler_cmd_line (the crawler does run in node, but who cares?)
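
To illustrate the --seedFile item above, here is a minimal sketch of how a seed file argument could be accepted either as a URL or as a local path. It is not the code from this PR: `is_http_url`, `fetch_text` and `load_seeds` are hypothetical names, and the "one seed per line, blank lines ignored" format is an assumption.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen


def is_http_url(value: str) -> bool:
    """Return True when value looks like an http(s) URL rather than a local path."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def fetch_text(url: str) -> str:
    """Default fetcher (assumption): download the seed list as UTF-8 text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_seeds(seed_file: str, fetch: Callable[[str], str] = fetch_text) -> list[str]:
    """Read seeds from a URL or a local file (hypothetical helper).

    Assumes one seed per line; blank lines are skipped.
    """
    if is_http_url(seed_file):
        content = fetch(seed_file)
    else:
        content = Path(seed_file).read_text(encoding="utf-8")
    return [line.strip() for line in content.splitlines() if line.strip()]
```

Resolving the seeds once, up front, means the same list can be handed to both the crawler and warc2zim, which is the kind of mismatch the "seeds sourced from a URL" fix addresses. The `fetch` parameter is only there so the download can be replaced in tests.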

Due to multiple breaking changes, this will require a major version bump.
It is probably better to review commit by commit.

benoit74 self-assigned this Feb 13, 2025
benoit74 force-pushed the fix_browsertrix_args branch 3 times, most recently from f04ad98 to bcf3509, February 13, 2025 17:21
benoit74 marked this pull request as ready for review February 13, 2025 19:24
benoit74 requested a review from rgaudin February 13, 2025 19:24
rgaudin (Member) left a comment

LGTM but please check my comments first

benoit74 force-pushed the fix_browsertrix_args branch from bcf3509 to 2f7a83e, February 14, 2025 14:29
benoit74 merged commit a9efec4 into main Feb 14, 2025
5 checks passed
benoit74 deleted the fix_browsertrix_args branch February 14, 2025 14:35
Successfully merging this pull request may close these issues.

Consider "new" crawler CLI arguments --help: clarify default values