-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel Stream File Parsing with ray.io #260
Conversation
for more information, see https://pre-commit.ci
This all looks good to me, but I'd like to play with it a bit more before merging. I think I agree with making What are your thoughts on us adding an explicit |
I'm happy to defer to your preferences regarding requirements. I don't have strong feelings as long as we make it easy for users to figure out how to get parallelism. I am not yet familiar enough with |
@marinegor , would you be willing to test out this branch for us? |
@kmdalton sure, I can have a look! what kind of testing are you thinking, could you elaborate? I imagine you want to make sure that your parser produces same results as the previous one, right? |
Thank you! I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files. Mostly I want to make sure I'm not missing anything which will break the parser for some edge cases. Additionally, I would hope you could let us know
|
for more information, see https://pre-commit.ci
$ fails for me like this in a fresh conda environment in which I (tried and maybe failed) to use your careless install script:
|
This also fails:
|
@DHekstra , I think some of your packages (careless for sure) do not have python 3.12 support which is confusing the package solver. You should use python 3.11 for now. |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kmdalton , sorry for the long delay -- hope it'll go faster from here.
Great PR, thanks for implementing it! It certainly needs a re-work, and ti's really cool that you did it :)
Short changes summary (detailed below):
- allow running without
ray
even if it's installed - run
ray
in context manager to gracefully shutdown if interrupted r+
reading rights break when stream files are accessible readonly
And that's it! Really great work, thanks again for inviting me :)
Detailed answers to your questions:
- I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files.
If you want smoke tests, I'd ask Filippe di Maia from CXIDB (https://cxidb.org/contact.html) for contacts -- I'm sure he could find you few dozens of various streams for testing.
Or you can parse CXIDB for links to stream files -- I wrote a parser with BeautifulSoup, it's doable.
- Whether it is reasonably that it requires an additional dependency, ray, in order to run in parallel
A: it's perfectly reasonable, as long as it's possible to fall back to non-ray option even if ray is installed.
I left a comment about that, but basically, being able to run without ray
even if it's installed, is necessary.
- Whether the read_crystfel and the StreamLoader class are adequately documented
I'd add important ray keywords explicitly, and examples with how to limit CPU and RAM usage.
- Whether the parser runs in parallel as expected
It seems that yes, and even num_cpus
influences that as well.
-
Whether the parser has reasonable memory usage relative to the original implementation
It probably heavily depends on the stream size, and I'd add some rule-of-thumb rules based on that. For example, in case of CXIDB ID106's stream, previous implementation was much faster for me, since the ray cluster's overhead is too big.
I can easily imagine that ray can give performance boosts when dealing with e.g. >2Gb streams or something, depending on IO speed, but it's probably better to document this. -
Whether it generates the expected output
That I can't say, but it can easily be tested -- I'd take 3-4 stream files from CXIDB (e.g. check ID106, ID223 -- combined they have 5 stream files).
You can make up syntetic datasets (one crystal, two crystals, five crystals), and test if they give comparable results.
reciprocalspaceship/io/crystfel.py
Outdated
results = ray.get(result_ids) | ||
ray.shutdown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs a context manager -- now, in case I do KeyboardInterrupt
(in my case, seeing that my program almost ran out of memory on my laptop), re-running read_crystfel
would cause:
RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
so I'd do something like
def __init__(self, ..., **ray_kwargs):
self._ray_kwargs = ray_kwargs
def parallel_read_crystfel(...):
with start_ray(**ray_kwargs):
results = ray.get(result_ids)
where
from contextlib import contextmanager
@contextmanager
def start_ray(**ray_kwargs):
ray.init(**ray_kwargs)
yield
ray.shutdown()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added something like you requested in a03780a. I'm not sure if it ticks all your boxes, would you mind having a look? @marinegor
@marinegor , thanks so much for this feedback. It'll take me some time to work through it, but this is super helpful stuff. 🙏 |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmdalton thanks for the swift reply, I think it's good to go now! Nice changes, and on CXIDB's ID106 it indeed gives the same results:
which is amazing, given how many rounding errors could've been there :)
I added few nitpicking comments that popped up in my IDE, and that's it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me. Feel free to merge if/when you're happy with it.
This PR adds support for faster stream file parsing which is parallelized using the ray. I did not add ray as a dependency for users, so the code falls back to serial python when it is not available.