Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel Stream File Parsing with ray.io #260

Merged
merged 31 commits into from
Aug 5, 2024
Merged

Parallel Stream File Parsing with ray.io #260

merged 31 commits into from
Aug 5, 2024

Conversation

kmdalton
Copy link
Member

This PR adds support for faster stream file parsing which is parallelized using the ray. I did not add ray as a dependency for users, so the code falls back to serial python when it is not available.

@kmdalton kmdalton requested a review from JBGreisman June 19, 2024 16:29
@JBGreisman JBGreisman added the performance Issue or pull request related to performance label Jun 19, 2024
@JBGreisman
Copy link
Member

This all looks good to me, but I'd like to play with it a bit more before merging. I think I agree with making ray an optional dependency, but I don't think I like adding it to tests_require -- seems like a hacky solution.

What are your thoughts on us adding an explicit parallel_require=["ray"], that is added to [dev] and maybe a new [parallel] pip option that only adds on the ray extra requirement?

@kmdalton
Copy link
Member Author

I'm happy to defer to your preferences regarding requirements. I don't have strong feelings as long as we make it easy for users to figure out how to get parallelism. I am not yet familiar enough with ray to know how nicely it plays with other packages. So far it seems very promising.

@kmdalton
Copy link
Member Author

@marinegor , would you be willing to test out this branch for us?

@marinegor
Copy link
Contributor

@kmdalton sure, I can have a look! what kind of testing are you thinking, could you elaborate? I imagine you want to make sure that your parser produces same results as the previous one, right?

@kmdalton
Copy link
Member Author

Thank you! I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files. Mostly I want to make sure I'm not missing anything which will break the parser for some edge cases. Additionally, I would hope you could let us know

  • Whether it is reasonably that it requires an additional dependency, ray, in order to run in parallel
  • Whether the read_crystfel and the StreamLoader class are adequately documented
  • Whether the parser runs in parallel as expected
  • Whether the parser has reasonable memory usage relative to the original implementation
  • Whether it generates the expected output

@DHekstra
Copy link
Contributor

$ conda install -c conda-forge "ray-default"

fails for me like this in a fresh conda environment in which I (tried and maybe failed) to use your careless install script:

Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides _python_rc needed by python-3.12.0rc3-rc3_hab00c5b_1_cpython

Could not solve for environment specs
The following packages are incompatible
├─ python 3.12**  is installable with the potential options
│  ├─ python [3.12.0|3.12.1|3.12.2|3.12.3|3.12.4], which can be installed;
│  ├─ python [3.12.0|3.12.1|3.12.2|3.12.3|3.12.4] would require
│  │  └─ python_abi 3.12.* *_cp312, which can be installed;
│  └─ python 3.12.0rc3 would require
│     └─ _python_rc, which does not exist (perhaps a missing channel);
└─ ray-default is not installable because there are no viable options
   ├─ ray-default [1.10.0|1.11.0|...|2.0.0] would require
   │  ├─ python >=3.7,<3.8.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.7.* *_cp37m, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.10.0|1.11.0|...|2.9.3] would require
   │  ├─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.8.* *_cp38, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.10.0|1.11.0|...|2.9.3] would require
   │  ├─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.13.0|2.0.0|...|2.9.3] would require
   │  ├─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.10.* *_cp310, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.5.0|1.5.1|1.5.2|1.6.0] would require
   │  ├─ python >=3.6,<3.7.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.6.* *_cp36m, which conflicts with any installable versions previously reported;
   ├─ ray-default [2.10.0|2.11.0|...|2.9.3] would require
   │  ├─ python >=3.11,<3.12.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.11.* *_cp311, which conflicts with any installable versions previously reported;
   ├─ ray-default 2.8.0 would require
   │  └─ ray-core 2.8.0 py38h1702d6c_1, which does not exist (perhaps a missing channel);
   ├─ ray-default [1.6.0|1.9.2|2.0.1] would require
   │  └─ python >=3.7,<3.8.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [1.6.0|1.9.2|2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [1.6.0|1.9.2|2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
   └─ ray-default 2.6.3 would require
      └─ python >=3.11,<3.12.0a0 , which conflicts with any installable versions previously reported.

@DHekstra
Copy link
Contributor

This also fails:

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install -U "ray"
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install -U "ray[default]"
ERROR: Could not find a version that satisfies the requirement ray[default] (from versions: none)
ERROR: No matching distribution found for ray[default]

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install ray
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

@kmdalton
Copy link
Member Author

@DHekstra , I think some of your packages (careless for sure) do not have python 3.12 support which is confusing the package solver. You should use python 3.11 for now.

Copy link
Contributor

@marinegor marinegor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kmdalton , sorry for the long delay -- hope it'll go faster from here.
Great PR, thanks for implementing it! It certainly needs a re-work, and ti's really cool that you did it :)

Short changes summary (detailed below):

  • allow running without ray even if it's installed
  • run ray in context manager to gracefully shutdown if interrupted
  • r+ reading rights break when stream files are accessible readonly

And that's it! Really great work, thanks again for inviting me :)


Detailed answers to your questions:

  1. I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files.

If you want smoke tests, I'd ask Filippe di Maia from CXIDB (https://cxidb.org/contact.html) for contacts -- I'm sure he could find you few dozens of various streams for testing.
Or you can parse CXIDB for links to stream files -- I wrote a parser with BeautifulSoup, it's doable.

  1. Whether it is reasonably that it requires an additional dependency, ray, in order to run in parallel

A: it's perfectly reasonable, as long as it's possible to fall back to non-ray option even if ray is installed.
I left a comment about that, but basically, being able to run without ray even if it's installed, is necessary.

  1. Whether the read_crystfel and the StreamLoader class are adequately documented

I'd add important ray keywords explicitly, and examples with how to limit CPU and RAM usage.

  1. Whether the parser runs in parallel as expected

It seems that yes, and even num_cpus influences that as well.

  1. Whether the parser has reasonable memory usage relative to the original implementation
    It probably heavily depends on the stream size, and I'd add some rule-of-thumb rules based on that. For example, in case of CXIDB ID106's stream, previous implementation was much faster for me, since the ray cluster's overhead is too big.
    I can easily imagine that ray can give performance boosts when dealing with e.g. >2Gb streams or something, depending on IO speed, but it's probably better to document this.

  2. Whether it generates the expected output

That I can't say, but it can easily be tested -- I'd take 3-4 stream files from CXIDB (e.g. check ID106, ID223 -- combined they have 5 stream files).
You can make up syntetic datasets (one crystal, two crystals, five crystals), and test if they give comparable results.

Comment on lines 309 to 310
results = ray.get(result_ids)
ray.shutdown()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably needs a context manager -- now, in case I do KeyboardInterrupt (in my case, seeing that my program almost ran out of memory on my laptop), re-running read_crystfel would cause:

RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.

so I'd do something like

def __init__(self, ..., **ray_kwargs):
  self._ray_kwargs = ray_kwargs

def parallel_read_crystfel(...):
  with start_ray(**ray_kwargs):
    results = ray.get(result_ids)

where

from contextlib import contextmanager

@contextmanager
def start_ray(**ray_kwargs):
  ray.init(**ray_kwargs)
  yield
  ray.shutdown()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added something like you requested in a03780a. I'm not sure if it ticks all your boxes, would you mind having a look? @marinegor

@kmdalton
Copy link
Member Author

@marinegor , thanks so much for this feedback. It'll take me some time to work through it, but this is super helpful stuff. 🙏

Copy link
Contributor

@marinegor marinegor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmdalton thanks for the swift reply, I think it's good to go now! Nice changes, and on CXIDB's ID106 it indeed gives the same results:

image

which is amazing, given how many rounding errors could've been there :)

I added few nitpicking comments that popped up in my IDE, and that's it!

Copy link
Member

@JBGreisman JBGreisman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. Feel free to merge if/when you're happy with it.

@kmdalton kmdalton merged commit 870a019 into main Aug 5, 2024
8 checks passed
@kmdalton kmdalton deleted the parstream branch September 11, 2024 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issue or pull request related to performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants