
feat: add caching for timezone offsets, significantly speeds up import #1250

Open · wants to merge 1 commit into base: master

Conversation

@tobymao commented Jan 30, 2025

This is different from PR #1181: that PR only makes the import faster but still incurs the cost on first usage. This one leverages an optional cache.

closes #533

@tobymao (Author) commented Jan 30, 2025

@Gallaecio

@Gallaecio (Member):

Thanks. I will have a look after we merge #1248 so tests can pass. In the meantime, please install pre-commit and run `pre-commit run --all-files`.

@tobymao (Author) commented Jan 31, 2025

@Gallaecio done

@Gallaecio (Member):

Closing and reopening to make sure the CI runs with the latest changes from the main branch, which fix tests.

Gallaecio closed this Feb 3, 2025
Gallaecio reopened this Feb 3, 2025
@Gallaecio (Member):

I’m thinking that caching might not be the right approach here, at least not exactly this way.

Caching makes sense to me for scenarios where user or system input conditions the data to cache. Here, the data seems to be based on dateparser data only (specifically on the contents of this variable), in which case we could include the pickled data in dateparser itself, so there is no need to generate it on the first run on each computer and, by default, in each folder.

And if that’s the case, if this data is independent of the user system or input as it seems to me, then we could instead run this at development time, and include the pickled data into the dateparser package, or as Python code if that’s viable (so that the data is readable).

We can also add a test to make sure that it does not go out of sync when the underlying data changes, i.e. have CI generate the data from scratch and compare it to the pickled data, complaining if they become different.

@tobymao (Author) commented Feb 4, 2025

@Gallaecio Do you need me to do this? I'm not sure of the best way to hook into your build system.

Could there also be problems with pickling across Python versions? I don't think this is safe to do: the pickle implementation is not safe across Python versions, and having to compile and distribute per version seems complicated.

This approach is nice because it doesn't change the default behavior, and folks can opt into it if they want.

@Gallaecio (Member) commented Feb 4, 2025

> Do you need me to do this?

I don’t have time to work on a PR myself. But there is no rush, we can wait until someone else volunteers.

> Could there also be problems with pickling across Python versions?

It should not be an issue provided you specify a protocol version when pickling (and unpickling) that is supported by the lowest supported Python version (3.9).
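As an illustration of pinning the protocol (a generic sketch, not dateparser code; protocol 5 is available on every Python >= 3.8, so it covers the 3.9+ range discussed here):

```python
import pickle

data = {"UTC": 0, "CET": 3600}  # stand-in for the real timezone data

# Pin the protocol explicitly instead of using pickle.HIGHEST_PROTOCOL,
# which could in principle grow in a future interpreter. Protocol 5 is
# supported by every Python >= 3.8, so all supported versions can read it.
payload = pickle.dumps(data, protocol=5)

# Unpickling detects the protocol from the stream automatically,
# so only the writer needs to pin it.
restored = pickle.loads(payload)
print(restored == data)  # True
```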


The problem I see with this approach is that implementing it properly will get complicated. We should use an XDG-based path for the cache, and implement a system to refresh the cache as new versions of dateparser are used. It feels like unnecessary effort when we could just include the cache file in dateparser itself, making dateparser faster even on first use, not only on later uses.
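A user-space cache path following the XDG Base Directory convention could look like this (a sketch; the `dateparser` directory name and the file name are assumptions):

```python
import os

def xdg_cache_path(filename, app="dateparser"):
    """Return a cache file path under $XDG_CACHE_HOME (default ~/.cache),
    per the XDG Base Directory specification."""
    base = os.environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return os.path.join(base, app, filename)

print(xdg_cache_path("tz_offsets.pickle"))
# e.g. /home/user/.cache/dateparser/tz_offsets.pickle
```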

@tobymao (Author) commented Feb 4, 2025

dateparser right now supports 3.7 on PyPI, is that not correct?

Just to confirm: you won't merge this PR as is and will only accept the more complicated approach?

This approach, although not perfect, will speed up many use cases.

@Gallaecio (Member) commented Feb 5, 2025

> dateparser right now supports 3.7 on PyPI, is that not correct?

The main branch no longer supports 3.8 and earlier (which are end-of-life). That is, the next release will not support 3.7 and 3.8.

> you won't merge this PR

Definitely not as is. But I am not entirely against a user-space caching approach if other maintainers are OK with this approach (cc @serhii73 @wRAR), provided we handle its pending issues (make sure the cache is updated on new dateparser releases, choose a better default location for the file).

> will only accept the more complicated approach?

My personal preference goes:

  1. An approach that will save time even in the first run. You call it “the more complicated approach”, but to me it seems simpler than this one.
  2. An approach that saves time until actual usage of the corresponding data (the original, lazy-loading approach).
  3. An approach that will save time on the 2nd and later usages only, i.e. the approach you are suggesting.
  4. Changing nothing.
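Option 2 above (lazy loading) can be sketched as deferring the expensive build until first access; the class and function names below are hypothetical, not the actual dateparser code:

```python
class LazyTimezoneOffsets:
    """Defer the expensive build until the data is first needed,
    so importing the package stays cheap."""

    def __init__(self, builder):
        self._builder = builder
        self._data = None

    def get(self):
        if self._data is None:
            self._data = self._builder()  # built once, on first access
        return self._data

# Stand-in for the expensive import-time computation:
offsets = LazyTimezoneOffsets(lambda: {"UTC": 0, "CET": 3600})
print(offsets.get()["CET"])  # 3600
```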

@tobymao (Author) commented Feb 5, 2025

Just to be clear, the current approach handles refreshing the cache on new releases: it stores the dateparser version and invalidates the cache if it doesn't match.
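The invalidation scheme described here (stamping the cache with the dateparser version and discarding it on mismatch) can be sketched like this; the function names and version string are hypothetical:

```python
import pickle

DATEPARSER_VERSION = "1.2.1"  # hypothetical; would come from dateparser.__version__

def save_cache(path, data, version=DATEPARSER_VERSION):
    # Store the version next to the data so old caches self-invalidate.
    with open(path, "wb") as f:
        pickle.dump({"version": version, "data": data}, f, protocol=5)

def load_cache(path, version=DATEPARSER_VERSION):
    try:
        with open(path, "rb") as f:
            payload = pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        return None  # missing or corrupt cache: rebuild from scratch
    if payload.get("version") != version:
        return None  # written by a different release: rebuild
    return payload["data"]
```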

@tobymao (Author) commented Feb 5, 2025

@Gallaecio OK, I now do this at install time, so it should get distributed with the wheels. Let me know what you think.

tobymao force-pushed the master branch 2 times, most recently from f423124 to 1683155 on February 5, 2025
Commit message: this is different from PR scrapinghub#1181. It builds a cache at install time which can be distributed.

closes scrapinghub#533
codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (0f5d9c5) to head (08ce4e2).
Report is 59 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dateparser/timezone_parser.py | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #1250       +/-   ##
==========================================
- Coverage   98.23%   0.00%   -98.24%
==========================================
  Files         232     233        +1
  Lines        2604    2735      +131
==========================================
- Hits         2558       0     -2558
- Misses         46    2735     +2689
```


@Gallaecio (Member):

Install time is tricky. It makes things harder during development, and it makes it harder to switch to pyproject.toml, for example. I think we need to add the file to Git, and build it during development whenever we change the underlying data. This is where a CI test to compare the generated file to a fresh one would be useful, to make sure the file does not go out of sync.

@tobymao (Author) commented Feb 6, 2025

@Gallaecio So would you be OK if I made it lazy in development, and then had CI/CD build the file on publish, which is then included as part of the distribution?

I want to align on an approach so I don't waste time iterating on something that doesn't work for you.

@Gallaecio (Member):

Do you have a reason not to want to add the pickled file to Git?

@tobymao (Author) commented Feb 6, 2025

@Gallaecio Are you absolutely sure pickle files can be safely reused across Python minor versions? Additionally, how would I automate the commit of changing this file?

I think it's much safer for this pickle file to be built when a user installs the package, to avoid any issues with platform compatibility.

@tobymao (Author) commented Feb 6, 2025

If you want me to just check in the pickle file, though, and have a script to auto-generate it that the user needs to commit, I'm fine with that. I'm just uncertain whether a pickle file will work across Python 3.9-3.13. It may be OK since we're only pickling regex objects, but I haven't tested it.

@Gallaecio (Member) commented Feb 6, 2025

> Are you absolutely sure pickle files can be safely reused across Python minor versions?

From https://docs.python.org/3/library/pickle.html#comparison-with-marshal:

> The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and [Python 2 stuff irrelevant for our Python 3+ scenario].


> how would I automate the commit of changing this file?

> If you want me to just check in the pickle file though, and then have a script to auto generate it and the user needs to commit it, I'm fine with that.

Exactly.

I would automate the file generation, but not the triggering of the generation. e.g. you include in the repo a script to generate the file, along with the file itself.

What I would do, however, is add a test that checks that the generated file matches the source data, e.g. have a test that generates a new file from scratch and verifies that the contents are the same as the pre-generated file.
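The suggested CI guard could look roughly like this (the generation function and the checked-in path are hypothetical stand-ins):

```python
import pickle

def generate_timezone_data():
    # Stand-in for the real generation step run by the repo's script.
    return {"UTC": 0, "CET": 3600}

def check_pickled_data_in_sync(checked_in_path):
    """Fail if the committed pickle no longer matches freshly generated data."""
    with open(checked_in_path, "rb") as f:
        committed = pickle.load(f)
    assert committed == generate_timezone_data(), (
        "Checked-in cache is stale; rerun the generation script and commit."
    )
```

A CI job would run this as a regular test, so any change to the source data without a regenerated file turns the build red.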

We could look into generating the file automatically with pre-commit when the source file changes, but I am not familiar with writing pre-commit hooks, so I am OK with not doing that. But if you want to look into that, I think it would be cool. It should be possible to trigger the generation only when the corresponding source file changes.

@tobymao (Author) commented Feb 6, 2025

@Gallaecio Sounds good, I will do that.

Successfully merging this pull request may close issue “Import very slow” (#533).