
persisted memoization (ie saving @mo.cache values) #3471

Closed
gabrielgrant opened this issue Jan 16, 2025 · 3 comments · Fixed by #3550
Labels
enhancement New feature or request

Comments

@gabrielgrant
Contributor

gabrielgrant commented Jan 16, 2025

Description

I'd like to have the control offered by the standard memoization interface of the mo.cache decorator, but have the values persist across kernel restarts (as with marimo.persistent_cache).
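For concreteness, a rough sketch of how the two interfaces look today (assuming the current split where `mo.cache` is a decorator and `mo.persistent_cache` is a context manager; the bodies are placeholders):

```python
import marimo as mo

# In-memory memoization: results are reused within a session,
# but lost on kernel restart.
@mo.cache
def expensive_computation(x):
    ...

# Disk-backed caching of a block: values survive kernel restarts.
with mo.persistent_cache("expensive_block"):
    result = expensive_computation(42)
```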

There are a few third-party options that do something similar (at least for pure functions), e.g. joblib's `Memory` and `diskcache`.

How does marimo's built-in caching compare to these options?

AFAICT the main difference would be that mo.cache takes closed-over dependent variables and/or source code changes into account, in the same way mo.persistent_cache does, right? Are there any differences between the caching rules of the two marimo interfaces, or do they differ only in whether values are persisted? What is the thinking behind having the style of cache interface (function memoization vs. block-based-with-context-manager) also dictate whether the cache is persisted to disk?

Suggested solution

Possibly just adding a "persisted" flag to mo.cache?
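A rough sketch of what that could look like (the `persisted` flag is hypothetical and not part of the current API):

```python
import marimo as mo

# Hypothetical API: same memoization interface as mo.cache,
# but with an opt-in flag to persist values to disk.
@mo.cache(persisted=True)
def expensive_computation(x):
    ...
```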

I'm not sure of all the details of how caching is implemented internally, but it might make sense to reuse one of these external libs to avoid fully reinventing the wheel?

Alternative

Recommend using one of the existing third-party libs

Additional context

No response

gabrielgrant added the enhancement (New feature or request) label on Jan 16, 2025
@gabrielgrant
Contributor Author

gabrielgrant commented Jan 17, 2025

One (potential? hopefully?) difference is that these other solutions seem to have issues with consistently/correctly caching Pandas DataFrames:

joblib/joblib#1611
grantjenks/python-diskcache#314

This is obviously a pretty common use-case in a notebook environment. Is this handled correctly in marimo's cache implementation(s)?
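For illustration (not marimo-specific, just pandas' own hashing utility), the tricky part is that numeric frames hash cleanly from their raw contents, while object-dtype columns force per-object hashing, which is where the linked issues come up:

```python
import pandas as pd
from pandas.util import hash_pandas_object

# Purely numeric data: a stable, content-based hash is straightforward.
numeric = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(hash_pandas_object(numeric).sum())

# An object-dtype column has to hash individual Python objects, which is
# where third-party cache-key schemes tend to get inconsistent.
mixed = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", None]})
print(hash_pandas_object(mixed).sum())
```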

@dmadisetti
Collaborator

I've been promising a cache update and here's the branch:

#3480

I think for the pandas case, this is something we can explicitly test against, and maybe squeeze into this branch. Some of the hashing mechanism is inspired by joblib. I think marimo should catch this, since the hash refers directly to the underlying data.

I think I'd like to group some of these changes together to prevent repeated cache misses on version updates.

Related to this conversation is this discussion:

#2653

where we decided @persistent_cache should be a drop-in for @cache.

@dmadisetti
Collaborator

Just looked at this code again, and the answer is better than joblib, but still not ideal.

marimo has two modes of hashing: ContentAddressed (which hashes the content of the variable) and ExecutionContext (which hashes the code run to produce the variable, used in cases where we cannot get a clean content hash).

If a dataframe is all numerical data, then the raw, contiguous memory behind the df is used for hashing. If it is not (the df contains objects), then the fallback is ExecutionContext.

More explicit handling could be done, and I don't think column names contribute to the hash in this case.
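A simplified sketch of that two-mode idea (not marimo's actual implementation; the function names and the fallback signalling are made up for illustration):

```python
import hashlib
from typing import Optional

import numpy as np
import pandas as pd


def content_hash(df: pd.DataFrame) -> Optional[str]:
    """ContentAddressed-style hash: only attempted when the frame is purely
    numeric, so the raw contiguous memory can be hashed directly.
    (As noted above, column names do not contribute to this hash.)"""
    if any(dtype == object for dtype in df.dtypes):
        return None  # object columns: no reliable content hash
    data = np.ascontiguousarray(df.to_numpy())
    return hashlib.sha256(data.tobytes()).hexdigest()


def cache_key(df: pd.DataFrame, producing_code: str) -> str:
    """Fall back to an ExecutionContext-style hash (the code that produced
    the value) when the content itself cannot be hashed cleanly."""
    return content_hash(df) or hashlib.sha256(producing_code.encode()).hexdigest()
```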

dmadisetti linked a pull request on Jan 23, 2025 that will close this issue
akshayka pushed a commit that referenced this issue on Jan 24, 2025
## 📝 Summary

fixes #2653 #3471

## 🔍 Description of Changes

Enables `persistent_cache` to be used as a decorator for functions, and
`cache` to be used as a context block, e.g.

```python
@mo.persistent_cache
def expensive_function_written_to_disk():
    ...

# or

with mo.cache("expensive_block_in_memory") as c:
    ...
```

`cache` is also used as the general entry point for custom "Loaders".

---

The breadth of the API makes the implementation a bit hairy, but I think
that if it's a smooth experience for the user then it's worth it.
