-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Context class, pass User-Agent to all web requests, set async job timeout, and retry header retrieval #92
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #92 +/- ##
==========================================
+ Coverage 43.66% 49.42% +5.76%
==========================================
Files 15 17 +2
Lines 987 1050 +63
Branches 133 147 +14
==========================================
+ Hits 431 519 +88
+ Misses 545 520 -25
Partials 11 11 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rethink the contact thing. It's a new scraper, it will surely serve as example for future ones. Current behavior is sketchy.
fe12aab
to
e51c9cd
Compare
I made significant code changes to address your concern as discussed:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this will be a great addition!
simple class, not a dataclass since we need to be able to initialize / modify "on-the-fly"
This justification is erroneous. We can (and do in other scrapers) initialize and modify on the fly.
remove all default values from argparse configuration and move them to this Context class, including sourcing from environment variables when appropriate
👍
remove all dest properties from argparse since we now have a custom parsing function in context module, no need to do the same job at two different locations
It's not entirely correct. You can rely on argparse's auto naming or be formal to control it but you always need the name to retrieve data. It is thus written twice but differently.
if this proves to work well, idea is to do the same in zimscraperlib for everything one might want to customize and/or is not really a constant => we will have two contexts, one from scraperlib and one from the scraper itself
(and I think it is important and correct to have two distinct context to really know where configuration comes from)
I agree
Here's a general comment as this is a new approach and it makes more sense to me this way. I know this is a mindtouch PR but I think we're aligned in having this replicated ; hence the care in designing it.
I think the current API is not conducting enough meaning/structure.
I like the concept of initializing the context. It helps understanding what's default, what needs initialization and what changes frequently.
In current form, we have a class with class variables that is immediately initialized into a pseudo-singleton and then it's dynamic. tmp_folder
for instance is not assigned.
Also, the fact that we manipulate argparse's object in the context module is not appealing.
I'd suggest the following:
- class def as it is but as a kw-only dataclass
- argparse options dest only specified when not matching context variables or we want a distinction.
- A singleton impl to ease importing/usage
- Inited with cli values so once inited all the base context is present. Can(and will) change though
- logger should be part of the context IMO
This provides the intent/meaning a lot better I find. It also allows clear Path and other initialization in post_init.
@dataclass(kw_only=True)
class Context:
_instance: Self | None = None
...
name: str
# not necessary but it conveys intent
@classmethod
def setup(cls, **kwargs):
cls(**kwargs)
def __post_init__(self):
self.__class__._instance = self
@classmethod
def get(cls) -> "Context":
if not getattr(cls, "_instance"):
raise OSError("Uninitialized context")
return cls._instance
# entrypoint
from xxx.context import Context
Context.setup(**dict(args._get_kwargs()))
# anywhere
from xxx.context import Context
context = Context.get() # as we do for logger
262297f
to
0278568
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it very much ; please see some minor suggestions and questions inline
Fix #79
Changes:
Context
class to hold contextual / configuration information about current scraper runContext
class, including sourcing from environment variables when appropriatedest
properties from argparse since we now have a custom parsing function incontext
module, no need to do the same job at two different locations