feat: add caching for timezone offsets, significantly speeds up import #1250
base: master
Conversation
Thanks. I will have a look after we merge #1248 so tests can pass. In the meantime, please install
@Gallaecio done
Closing and reopening to make sure the CI runs with the latest changes from the main branch, which fix tests.
I’m thinking that caching might not be the right approach here, at least not exactly this way. Caching makes sense to me for scenarios where there is user or system input that conditions the data to cache. Here, it seems the data is based on dateparser data only (specifically on the contents of this variable), in which case we could include the pickled data in dateparser itself, with no need to generate it on the first run on each computer, and in each folder (by default).

And if this data is indeed independent of the user’s system and input, as it seems to me, then we could instead generate it at development time and include the pickled data in the dateparser package, or as Python code if that’s viable (so that the data is readable).

We can also add a test to make sure that it does not go out of sync when the underlying data changes, i.e. have CI generate the data from scratch and compare it to the pickled data, complaining if they differ.
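The “as Python code if that’s viable” option above could look something like the following sketch: the precomputed data is emitted as a readable, auto-generated Python module that ships inside the package. `write_data_module` and `TIMEZONE_OFFSETS` are illustrative names, not part of dateparser.

```python
import pprint

def write_data_module(offsets, path):
    # offsets: the precomputed timezone-offset mapping (assumed structure).
    # Emitting it via pprint keeps the generated file deterministic and
    # human-readable, so diffs in code review stay meaningful.
    with open(path, "w") as f:
        f.write("# Auto-generated; do not edit by hand.\n")
        f.write("TIMEZONE_OFFSETS = ")
        f.write(pprint.pformat(offsets, sort_dicts=True))
        f.write("\n")
```

A generated module like this avoids the pickle cross-version question entirely, at the cost of a slightly larger and slower-to-import data file.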
@Gallaecio do you need me to do this? I'm not sure of the best way to hook into your build system. Could there also be problems with pickling across Python versions? I don't think this is safe to do: the pickle implementation is not safe across Python versions, and having to compile and distribute per version seems complicated. This approach is nice because it doesn't change the default behavior, and folks can opt into it if they want.
I don’t have time to work on a PR myself. But there is no rush, we can wait until someone else volunteers.
It should not be an issue provided that, when pickling, you specify a protocol version supported by the lowest supported Python version (3.9). The problem I see with this approach is that implementing it properly will get complicated: we should use an XDG-based path for the cache, and implement a system to refresh the cache as new versions of dateparser are used. It feels like unnecessary effort when we could just include the cache file in dateparser itself, making dateparser faster even on the first use, and not only on later uses.
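Pinning the protocol as suggested above could look like this minimal sketch (function names are illustrative, not dateparser APIs). Protocol 5 has been available since Python 3.8, so it covers all currently supported versions (3.9+):

```python
import pickle

# Pin the protocol instead of using pickle.HIGHEST_PROTOCOL, which can
# differ between the Python that writes the cache and the one that reads it.
PICKLE_PROTOCOL = 5

def dump_cache(data, path):
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=PICKLE_PROTOCOL)

def load_cache(path):
    # Unpickling auto-detects the protocol from the stream, so no version
    # needs to be specified on the reading side.
    with open(path, "rb") as f:
        return pickle.load(f)
```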
dateparser right now supports 3.7 on PyPI, is that not correct? Just to confirm: you won't merge this PR as is, and will only accept the more complicated approach? This approach, although not perfect, will speed up many use cases.
The main branch no longer supports 3.8 and earlier (which are end-of-life). That is, the next release will not support 3.7 and 3.8.
Definitely not as is. But I am not entirely against a user-space caching approach if other maintainers are OK with it (cc @serhii73 @wRAR), provided we handle its pending issues (…).

My personal preference goes: (…)
Just to be clear, the current approach handles refreshing the cache on new releases: it stores the dateparser version and invalidates the cache if it doesn’t match.
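The version-stamped invalidation described above can be sketched roughly as follows. The payload layout (`{"version": ..., "data": ...}`) and function names are assumptions for illustration, not the PR's actual implementation:

```python
import pickle

def save_cache(path, version, data):
    # Store the producing dateparser version alongside the cached data.
    with open(path, "wb") as f:
        pickle.dump({"version": version, "data": data}, f)

def load_cache(path, current_version):
    try:
        with open(path, "rb") as f:
            stored = pickle.load(f)
    except (OSError, pickle.UnpicklingError):
        return None  # missing or corrupt cache: fall back to regeneration
    if stored.get("version") != current_version:
        return None  # written by a different release: invalidate
    return stored["data"]
```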
@Gallaecio OK, I now do this at install time; this should get distributed with the wheels. Let me know what you think.
This is different from PR scrapinghub#1181: it builds a cache at install time, which can be distributed. Closes scrapinghub#533.
Codecov Report

Attention: Patch coverage is (…)

Additional details and impacted files:

@@           Coverage Diff            @@
##           master    #1250    +/-  ##
=========================================
- Coverage   98.23%    0.00%   -98.24%
=========================================
  Files         232      233        +1
  Lines        2604     2735      +131
=========================================
- Hits         2558        0     -2558
- Misses         46     2735     +2689
Install time is tricky. It makes things harder during development, and it makes it harder to switch to
@Gallaecio so would you be OK if I made it lazy in development, and then had CI/CD build the file on publish, to be included as part of the distribution? I want to align on an approach so I don't waste time iterating on something that doesn't work for you.
Do you have a reason not to want to add the pickled file to Git? |
@Gallaecio are you absolutely sure pickle files can be safely reused across Python minor versions? Additionally, how would I automate committing changes to this file? I think it's much safer for this pickle file to be built when a user installs the package, to avoid any issues with platform compatibility.
If you want me to just check in the pickle file, though, and have a script to auto-generate it that the user then needs to commit, I'm fine with that. I'm just uncertain whether a pickle file will work across Python 3.9 to 3.13. It may be OK since we're only pickling regex objects, but I haven't tested it.
From https://docs.python.org/3/library/pickle.html#comparison-with-marshal:
Exactly. I would automate the file generation, but not the triggering of the generation; e.g. you include in the repo a script to generate the file, along with the file itself. What I would do, however, is add a test that checks that the generated file matches the source data, e.g. a test that generates a new file from scratch and verifies that its contents are the same as the pre-generated file's.

We could look into generating the file automatically with pre-commit when the source file changes, but I am not familiar with writing pre-commit hooks, so I am OK with not doing that. But if you want to look into it, I think it would be cool. It should be possible to trigger the generation only when the corresponding source file changes.
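The staleness test suggested above could be sketched like this: regenerate the cache bytes from the source data and compare them with the committed file. `generate_cache` stands in for the real generation routine; names are hypothetical.

```python
import pickle

def generate_cache(source_data):
    # A pinned protocol keeps regeneration deterministic across CI runs,
    # so byte-for-byte comparison with the committed file is meaningful.
    return pickle.dumps(source_data, protocol=5)

def cache_is_up_to_date(committed_bytes, source_data):
    # A CI test would assert this and fail with a message telling the
    # contributor to rerun the generation script.
    return committed_bytes == generate_cache(source_data)
```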
@Gallaecio sounds good. I will do that.
This is different from PR #1181: that PR only makes import faster, but still incurs the cost on first usage. This one leverages an optional cache.

Closes #533.