
feat: add caching for timezone offsets, significantly speeds up import #1250

Open · wants to merge 1 commit into base: master

Conversation

@tobymao commented Jan 30, 2025

This is different from PR #1181: that PR only makes the import faster but still incurs the cost on first usage. This one leverages an optional cache.

closes #533

@tobymao (Author) commented Jan 30, 2025

@Gallaecio

@Gallaecio (Member):

Thanks. I will have a look after we merge #1248 so tests can pass. In the meantime, please install pre-commit and run `pre-commit run --all-files`.

@tobymao (Author) commented Jan 31, 2025

@Gallaecio done

@Gallaecio (Member):

Closing and reopening to make sure the CI runs with the latest changes from the main branch, which fix tests.

Gallaecio closed this Feb 3, 2025
Gallaecio reopened this Feb 3, 2025
@Gallaecio (Member):

I’m thinking that caching might not be the right approach here, at least not exactly this way.

Caching makes sense to me for scenarios where user or system input conditions the data to cache. Here, the data seems to be based on dateparser data only (specifically on the contents of this variable), in which case we could include the pickled data in dateparser itself, so there is no need to generate it on the first run on each computer and, by default, in each folder.

And if that’s the case, if this data is independent of the user system or input as it seems to me, then we could instead run this at development time, and include the pickled data into the dateparser package, or as Python code if that’s viable (so that the data is readable).

We can also add a test to make sure that it does not go out of sync when the underlying data changes, i.e. have CI generate the data from scratch and compare it to the pickled data, complaining if they become different.

@tobymao (Author) commented Feb 4, 2025

@Gallaecio Do you need me to do this? I'm not sure of the best way to hook into your build system.

Could there also be problems with pickling across Python versions? I don't think this is safe to do: the pickle implementation is not safe across Python versions, and having to compile and distribute per version seems complicated.

This approach is nice because it doesn't change the default behavior, and folks can opt into it if they want.

@Gallaecio (Member) commented Feb 4, 2025

> Do you need me to do this?

I don’t have time to work on a PR myself. But there is no rush, we can wait until someone else volunteers.

> Could there also be problems with pickling across Python versions?

It should not be an issue provided you specify a protocol version when pickling (and unpickling) that is supported by the lowest supported Python version (3.9).
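As an illustration of pinning the protocol (a generic sketch, not dateparser code; protocol 5 is available on every Python >= 3.8, so it covers the 3.9+ range discussed here):

```python
import pickle

data = {"UTC": 0, "CET": 3600}  # stand-in for the real timezone data

# Pin the protocol explicitly instead of using pickle.HIGHEST_PROTOCOL,
# which could in principle grow in a future interpreter. Protocol 5 is
# supported by every Python >= 3.8, so all supported versions can read it.
payload = pickle.dumps(data, protocol=5)

# Unpickling detects the protocol from the stream automatically,
# so only the writer needs to pin it.
restored = pickle.loads(payload)
print(restored == data)  # True
```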


The problem I see with this approach is that implementing it properly will get complicated. We should use an XDG-based path for the cache, and implement a system to refresh the cache as new versions of dateparser are used. It feels like unnecessary effort when we could just include the cache file in dateparser itself, making dateparser faster even on first use, not only on later uses.
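A user-space cache path following the XDG Base Directory convention could look like this (a sketch; the `dateparser` directory name and the file name are assumptions):

```python
import os

def xdg_cache_path(filename, app="dateparser"):
    """Return a cache file path under $XDG_CACHE_HOME (default ~/.cache),
    per the XDG Base Directory specification."""
    base = os.environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return os.path.join(base, app, filename)

print(xdg_cache_path("tz_offsets.pickle"))
# e.g. /home/user/.cache/dateparser/tz_offsets.pickle
```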

@tobymao (Author) commented Feb 4, 2025

dateparser right now supports 3.7 on PyPI, is that not correct?

Just to confirm: you won't merge this PR as is and will only accept the more complicated approach?

This approach, although not perfect, will speed up many use cases.

@Gallaecio (Member) commented Feb 5, 2025

> dateparser right now supports 3.7 on PyPI, is that not correct?

The main branch no longer supports 3.8 and earlier (which are end-of-life). That is, the next release will not support 3.7 and 3.8.

> you won't merge this PR

Definitely not as is. But I am not entirely against a user-space caching approach if other maintainers are OK with this approach (cc @serhii73 @wRAR), provided we handle its pending issues (make sure the cache is updated on new dateparser releases, choose a better default location for the file).

> will only accept the more complicated approach?

My personal preference goes:

  1. An approach that will save time even in the first run. You call it “the more complicated approach”, but to me it seems simpler than this one.
  2. An approach that saves time until actual usage of the corresponding data (the original, lazy-loading approach).
  3. An approach that will save time on the 2nd and later usages only, i.e. the approach you are suggesting.
  4. Changing nothing.
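Option 2 above (lazy loading) can be sketched as deferring the expensive build until first access; the class and function names below are hypothetical, not the actual dateparser code:

```python
class LazyTimezoneOffsets:
    """Defer the expensive build until the data is first needed,
    so importing the package stays cheap."""

    def __init__(self, builder):
        self._builder = builder
        self._data = None

    def get(self):
        if self._data is None:
            self._data = self._builder()  # built once, on first access
        return self._data

# Stand-in for the expensive import-time computation:
offsets = LazyTimezoneOffsets(lambda: {"UTC": 0, "CET": 3600})
print(offsets.get()["CET"])  # 3600
```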

@tobymao (Author) commented Feb 5, 2025

Just to be clear, the current approach handles refreshing the cache on new releases: it stores the dateparser version and invalidates the cache if it doesn't match.
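The invalidation scheme described here (stamping the cache with the dateparser version and discarding it on mismatch) can be sketched like this; the function names and version string are hypothetical:

```python
import pickle

DATEPARSER_VERSION = "1.2.1"  # hypothetical; would come from dateparser.__version__

def save_cache(path, data, version=DATEPARSER_VERSION):
    # Store the version next to the data so old caches self-invalidate.
    with open(path, "wb") as f:
        pickle.dump({"version": version, "data": data}, f, protocol=5)

def load_cache(path, version=DATEPARSER_VERSION):
    try:
        with open(path, "rb") as f:
            payload = pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        return None  # missing or corrupt cache: rebuild from scratch
    if payload.get("version") != version:
        return None  # written by a different release: rebuild
    return payload["data"]
```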

@tobymao (Author) commented Feb 5, 2025

@Gallaecio OK, I now do this at install time, so it should get distributed with the wheels. Let me know what you think.

tobymao force-pushed the master branch 2 times, most recently from f423124 to 1683155 on February 5, 2025
Commit message: this is different from PR scrapinghub#1181. It builds a cache at install time which can be distributed.

closes scrapinghub#533
codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (0f5d9c5) to head (08ce4e2).
Report is 59 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dateparser/timezone_parser.py | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #1250       +/-   ##
==========================================
- Coverage   98.23%   0.00%   -98.24%
==========================================
  Files         232     233        +1
  Lines        2604    2735      +131
==========================================
- Hits         2558       0     -2558
- Misses         46    2735     +2689
```


@Gallaecio (Member):

Install time is tricky. It makes things harder during development, and it makes it harder to switch to pyproject.toml, for example. I think we need to add the file to Git, and build it during development whenever we change the underlying data. This is where a CI test to compare the generated file to a fresh one would be useful, to make sure the file does not go out of sync.

@tobymao (Author) commented Feb 6, 2025

@Gallaecio So would you be OK if I made it lazy in development, and then had CI/CD build the file on publish, which is then included as part of the distribution?

I want to align on an approach so I don't waste time iterating on something that doesn't work for you.

@Gallaecio (Member):

Do you have a reason not to want to add the pickled file to Git?

@tobymao (Author) commented Feb 6, 2025

@Gallaecio Are you absolutely sure pickle files can be safely reused across Python minor versions? Additionally, how would I automate the commit of changing this file?

I think it's much safer for this pickle file to be built when a user installs the package, to avoid any issues with platform compatibility.

@tobymao (Author) commented Feb 6, 2025

If you want me to just check in the pickle file, though, and have a script to auto-generate it that the user needs to commit, I'm fine with that. I'm just uncertain whether a pickle file will work across Python 3.9-3.13. It may be OK since we're only pickling regex objects, but I haven't tested it.

@Gallaecio (Member) commented Feb 6, 2025

> Are you absolutely sure pickle files can be safely reused across Python minor versions?

From https://docs.python.org/3/library/pickle.html#comparison-with-marshal:

> The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and [Python 2 stuff irrelevant for our Python 3+ scenario].


> how would I automate the commit of changing this file?

> If you want me to just check in the pickle file though, and then have a script to auto generate it and the user needs to commit it, I'm fine with that.

Exactly.

I would automate the file generation, but not the triggering of the generation. e.g. you include in the repo a script to generate the file, along with the file itself.

What I would do, however, is add a test that checks that the generated file matches the source data, e.g. have a test that generates a new file from scratch and verifies that the contents are the same as the pre-generated file.
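The suggested CI guard could look roughly like this (the generation function and the checked-in path are hypothetical stand-ins):

```python
import pickle

def generate_timezone_data():
    # Stand-in for the real generation step run by the repo's script.
    return {"UTC": 0, "CET": 3600}

def check_pickled_data_in_sync(checked_in_path):
    """Fail if the committed pickle no longer matches freshly generated data."""
    with open(checked_in_path, "rb") as f:
        committed = pickle.load(f)
    assert committed == generate_timezone_data(), (
        "Checked-in cache is stale; rerun the generation script and commit."
    )
```

A CI job would run this as a regular test, so any change to the source data without a regenerated file turns the build red.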

We could look into generating the file automatically with pre-commit when the source file changes, but I am not familiar with writing pre-commit hooks, so I am OK with not doing that. But if you want to look into that, I think it would be cool. It should be possible to trigger the generation only when the corresponding source file changes.

@tobymao (Author) commented Feb 6, 2025

@Gallaecio Sounds good, I will do that.

Successfully merging this pull request may close issue “Import very slow” (#533).