Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Context class, pass User-Agent to all web requests, set async job timeout, and retry header retrieval #92

Merged
merged 8 commits into from
Dec 6, 2024

Conversation

benoit74
Copy link
Contributor

@benoit74 benoit74 commented Nov 25, 2024

Fix #79

Changes:

  • pass a User-Agent compliant with upload.wikimedia.org Policy in all web requests
  • set a job timeout for all async jobs, serving as a "fallback" should something go terribly wrong (avoid scraper being stuck forever and never failing)
  • move code fetching online header in the try...catch for proper operation of logic avoiding to retry some known bad assets
  • add a new Context class to hold contextual / configuration information about current scraper run
    • remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
    • remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
    • if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant

@benoit74 benoit74 self-assigned this Nov 25, 2024
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 63.10160% with 69 lines in your changes missing coverage. Please review.

Project coverage is 49.42%. Comparing base (d63320e) to head (34b596f).
Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
scraper/src/mindtouch2zim/processor.py 23.07% 30 Missing ⚠️
scraper/src/mindtouch2zim/__main__.py 0.00% 15 Missing ⚠️
scraper/src/mindtouch2zim/asset.py 23.52% 13 Missing ⚠️
scraper/src/mindtouch2zim/client.py 36.36% 7 Missing ⚠️
scraper/src/mindtouch2zim/download.py 66.66% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
+ Coverage   43.66%   49.42%   +5.76%     
==========================================
  Files          15       17       +2     
  Lines         987     1050      +63     
  Branches      133      147      +14     
==========================================
+ Hits          431      519      +88     
+ Misses        545      520      -25     
  Partials       11       11              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 marked this pull request as ready for review November 25, 2024 08:29
@benoit74 benoit74 changed the title Pass User-Agent to all web requests Pass User-Agent to all web requests, set async job timeout, and retry header retrieval Nov 25, 2024
@benoit74 benoit74 requested a review from rgaudin November 25, 2024 08:50
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rethink the contact thing. It's a new scraper, it will surely serve as example for future ones. Current behavior is sketchy.

@benoit74 benoit74 force-pushed the user_agent branch 2 times, most recently from fe12aab to e51c9cd Compare November 26, 2024 13:03
@benoit74
Copy link
Contributor Author

I made significant code changes to address your concern as discussed:

  • add a new Context class to hold contextual / configuration information about current scraper run
    • simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"
    • remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
    • remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
    • if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself (and I think it is important and correct to have two distinct context to really know where configuration comes from)

@benoit74 benoit74 requested a review from rgaudin November 26, 2024 13:06
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this will be a great addition!

simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"

This justification is erroneous. We can (and do in other scrapers) initialize and modify on the fly.

remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate

👍

remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations

It's not entirely correct. You can rely on argparse's auto naming or be formal to control it but you always need the name to retrieve data. It is thus written twice but differently.

if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself
(and I think it is important and correct to have two distinct context to really know where configuration comes from)

I agree


Here's a general comment as this is a new approach and it makes more sense to me this way. I know this is a mindtouch PR but I think we're aligned in having this replicated ; hence the care in designing it.

I think the current API is not conducting enough meaning/structure.
I like the concept of initializing the context. It helps understanding what's default, what needs initialization and what changes frequently.
In current form, we have a class with class variables that is immediately initialized into a pseudo-singleton and then it's dynamic. tmp_folder for instance is not assigned.
Also, the fact that we manipulate argparse's object in the context module is not appealing.

I'd suggest the following:

  • class def as it is but as a kw-only dataclass
  • argparse options dest only specified when not matching context variables or we want a distinction.
  • A singleton impl to ease importing/usage
  • Inited with cli values so once inited all the base context is present. Can(and will) change though
  • logger should be part of the context IMO

This provides the intent/meaning a lot better I find. It also allows clear Path and other initialization in post_init.

@dataclass(kw_only=True)
class Context:
    _instance: Self | None = None
    ...
    name: str

    # not necessary but it conveys intent
    @classmethod
    def setup(cls, **kwargs):
        cls(**kwargs)

    def __post_init__(self):
        self.__class__._instance = self

    @classmethod
    def get(cls) -> "Context":
        if not getattr(cls, "_instance"):
            raise OSError("Uninitialized context")
        return cls._instance

# entrypoint
from xxx.context import Context
Context.setup(**dict(args._get_kwargs()))

# anywhere
from xxx.context import Context
context = Context.get()  # as we do for logger

@benoit74 benoit74 force-pushed the user_agent branch 3 times, most recently from 262297f to 0278568 Compare December 2, 2024 19:14
@benoit74 benoit74 requested a review from rgaudin December 3, 2024 08:10
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it very much ; please see some minor suggestions and questions inline

@benoit74 benoit74 changed the title Pass User-Agent to all web requests, set async job timeout, and retry header retrieval Add Context class, pass User-Agent to all web requests, set async job timeout, and retry header retrieval Dec 6, 2024
@benoit74 benoit74 merged commit 1399a4a into main Dec 6, 2024
10 checks passed
@benoit74 benoit74 deleted the user_agent branch December 6, 2024 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass proper user-agent
2 participants