Parallel Stream File Parsing with ray.io #260

kmdalton · 2024-06-19T15:34:57Z

This PR adds support for faster stream file parsing which is parallelized using the ray. I did not add ray as a dependency for users, so the code falls back to serial python when it is not available.

…p into parstream

for more information, see https://pre-commit.ci

JBGreisman · 2024-06-19T20:41:01Z

This all looks good to me, but I'd like to play with it a bit more before merging. I think I agree with making ray an optional dependency, but I don't think I like adding it to tests_require -- seems like a hacky solution.

What are your thoughts on us adding an explicit parallel_require=["ray"], that is added to [dev] and maybe a new [parallel] pip option that only adds on the ray extra requirement?

kmdalton · 2024-06-20T19:24:49Z

I'm happy to defer to your preferences regarding requirements. I don't have strong feelings as long as we make it easy for users to figure out how to get parallelism. I am not yet familiar enough with ray to know how nicely it plays with other packages. So far it seems very promising.

kmdalton · 2024-06-21T16:00:48Z

@marinegor , would you be willing to test out this branch for us?

marinegor · 2024-06-21T18:02:11Z

@kmdalton sure, I can have a look! what kind of testing are you thinking, could you elaborate? I imagine you want to make sure that your parser produces same results as the previous one, right?

kmdalton · 2024-06-21T18:26:21Z

Thank you! I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files. Mostly I want to make sure I'm not missing anything which will break the parser for some edge cases. Additionally, I would hope you could let us know

Whether it is reasonably that it requires an additional dependency, ray, in order to run in parallel
Whether the read_crystfel and the StreamLoader class are adequately documented
Whether the parser runs in parallel as expected
Whether the parser has reasonable memory usage relative to the original implementation
Whether it generates the expected output

for more information, see https://pre-commit.ci

…p into parstream

DHekstra · 2024-06-25T03:16:49Z

$ conda install -c conda-forge "ray-default"

fails for me like this in a fresh conda environment in which I (tried and maybe failed) to use your careless install script:

Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides _python_rc needed by python-3.12.0rc3-rc3_hab00c5b_1_cpython

Could not solve for environment specs
The following packages are incompatible
├─ python 3.12**  is installable with the potential options
│  ├─ python [3.12.0|3.12.1|3.12.2|3.12.3|3.12.4], which can be installed;
│  ├─ python [3.12.0|3.12.1|3.12.2|3.12.3|3.12.4] would require
│  │  └─ python_abi 3.12.* *_cp312, which can be installed;
│  └─ python 3.12.0rc3 would require
│     └─ _python_rc, which does not exist (perhaps a missing channel);
└─ ray-default is not installable because there are no viable options
   ├─ ray-default [1.10.0|1.11.0|...|2.0.0] would require
   │  ├─ python >=3.7,<3.8.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.7.* *_cp37m, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.10.0|1.11.0|...|2.9.3] would require
   │  ├─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.8.* *_cp38, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.10.0|1.11.0|...|2.9.3] would require
   │  ├─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.13.0|2.0.0|...|2.9.3] would require
   │  ├─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.10.* *_cp310, which conflicts with any installable versions previously reported;
   ├─ ray-default [1.5.0|1.5.1|1.5.2|1.6.0] would require
   │  ├─ python >=3.6,<3.7.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.6.* *_cp36m, which conflicts with any installable versions previously reported;
   ├─ ray-default [2.10.0|2.11.0|...|2.9.3] would require
   │  ├─ python >=3.11,<3.12.0a0 , which conflicts with any installable versions previously reported;
   │  └─ python_abi 3.11.* *_cp311, which conflicts with any installable versions previously reported;
   ├─ ray-default 2.8.0 would require
   │  └─ ray-core 2.8.0 py38h1702d6c_1, which does not exist (perhaps a missing channel);
   ├─ ray-default [1.6.0|1.9.2|2.0.1] would require
   │  └─ python >=3.7,<3.8.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [1.6.0|1.9.2|2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.8,<3.9.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [1.6.0|1.9.2|2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.9,<3.10.0a0 , which conflicts with any installable versions previously reported;
   ├─ ray-default [2.0.1|2.3.0|2.6.3] would require
   │  └─ python >=3.10,<3.11.0a0 , which conflicts with any installable versions previously reported;
   └─ ray-default 2.6.3 would require
      └─ python >=3.11,<3.12.0a0 , which conflicts with any installable versions previously reported.

DHekstra · 2024-06-25T03:18:28Z

This also fails:

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install -U "ray"
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install -U "ray[default]"
ERROR: Could not find a version that satisfies the requirement ray[default] (from versions: none)
ERROR: No matching distribution found for ray[default]

(careless-13)[dhekstra@holy8a24301 reciprocalspaceship]$ pip install ray
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

kmdalton · 2024-06-25T13:38:06Z

@DHekstra , I think some of your packages (careless for sure) do not have python 3.12 support which is confusing the package solver. You should use python 3.11 for now.

for more information, see https://pre-commit.ci

marinegor

Hey @kmdalton , sorry for the long delay -- hope it'll go faster from here.
Great PR, thanks for implementing it! It certainly needs a re-work, and ti's really cool that you did it :)

Short changes summary (detailed below):

allow running without ray even if it's installed
run ray in context manager to gracefully shutdown if interrupted
r+ reading rights break when stream files are accessible readonly

And that's it! Really great work, thanks again for inviting me :)

Detailed answers to your questions:

I don't have access to a lot of stream files, and I have noticed there can be some differences in the metadata between files.

If you want smoke tests, I'd ask Filippe di Maia from CXIDB (https://cxidb.org/contact.html) for contacts -- I'm sure he could find you few dozens of various streams for testing.
Or you can parse CXIDB for links to stream files -- I wrote a parser with BeautifulSoup, it's doable.

Whether it is reasonably that it requires an additional dependency, ray, in order to run in parallel

A: it's perfectly reasonable, as long as it's possible to fall back to non-ray option even if ray is installed.
I left a comment about that, but basically, being able to run without ray even if it's installed, is necessary.

Whether the read_crystfel and the StreamLoader class are adequately documented

I'd add important ray keywords explicitly, and examples with how to limit CPU and RAM usage.

Whether the parser runs in parallel as expected

It seems that yes, and even num_cpus influences that as well.

Whether the parser has reasonable memory usage relative to the original implementation
It probably heavily depends on the stream size, and I'd add some rule-of-thumb rules based on that. For example, in case of CXIDB ID106's stream, previous implementation was much faster for me, since the ray cluster's overhead is too big.
I can easily imagine that ray can give performance boosts when dealing with e.g. >2Gb streams or something, depending on IO speed, but it's probably better to document this.
Whether it generates the expected output

That I can't say, but it can easily be tested -- I'd take 3-4 stream files from CXIDB (e.g. check ID106, ID223 -- combined they have 5 stream files).
You can make up syntetic datasets (one crystal, two crystals, five crystals), and test if they give comparable results.

reciprocalspaceship/io/crystfel.py

marinegor · 2024-07-18T21:54:33Z

reciprocalspaceship/io/crystfel.py

+                results = ray.get(result_ids)
+                ray.shutdown()


This probably needs a context manager -- now, in case I do KeyboardInterrupt (in my case, seeing that my program almost ran out of memory on my laptop), re-running read_crystfel would cause:

RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.

so I'd do something like

def __init__(self, ..., **ray_kwargs): self._ray_kwargs = ray_kwargs def parallel_read_crystfel(...): with start_ray(**ray_kwargs): results = ray.get(result_ids)

where

from contextlib import contextmanager @contextmanager def start_ray(**ray_kwargs): ray.init(**ray_kwargs) yield ray.shutdown()

I added something like you requested in a03780a. I'm not sure if it ticks all your boxes, would you mind having a look? @marinegor

kmdalton · 2024-07-19T03:57:39Z

@marinegor , thanks so much for this feedback. It'll take me some time to work through it, but this is super helpful stuff. 🙏

for more information, see https://pre-commit.ci

marinegor

@kmdalton thanks for the swift reply, I think it's good to go now! Nice changes, and on CXIDB's ID106 it indeed gives the same results:

which is amazing, given how many rounding errors could've been there :)

I added few nitpicking comments that popped up in my IDE, and that's it!

reciprocalspaceship/io/crystfel.py

JBGreisman

This all looks good to me. Feel free to merge if/when you're happy with it.

kmdalton added 7 commits June 19, 2024 10:57

Merge remote-tracking branch 'origin/main' into parstream

ebb7c75

add ray dep for parallelization

4ebaea4

add ray as testing dep

2c29aa8

fix up docstrings

08ba5cd

add proper capitalization, fix dtypes in A / eV

315fccb

use Union instead of |

9418c3b

Merge branch 'main' of ssh://github.com/rs-station/reciprocalspaceshi…

db9bfaa

…p into parstream

kmdalton force-pushed the parstream branch from 594dbdc to db9bfaa Compare June 19, 2024 15:41

pre-commit-ci bot and others added 2 commits June 19, 2024 15:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

a2587c9

for more information, see https://pre-commit.ci

maybe fix build issue

7007266

kmdalton requested a review from JBGreisman June 19, 2024 16:29

JBGreisman added the performance Issue or pull request related to performance label Jun 19, 2024

JBGreisman mentioned this pull request Jun 19, 2024

Faster parser for CrystFEL stream files #216

Closed

kmdalton mentioned this pull request Jun 20, 2024

Careless tutorial lcls-users/lcls-users#50

Open

Kevin Dalton and others added 8 commits June 21, 2024 16:47

remove unused imports, attr

8dbeead

add average cell method

e928b9a

[pre-commit.ci] auto fixes from pre-commit.com hooks

6a37e86

for more information, see https://pre-commit.ci

fix wavelength for empty peak lists

c784115

fix numpy v1 support

88f1caa

add numpy v1 to matrix

df663cd

fix numpy versioning

69ed824

Merge branch 'npv1' of ssh://github.com/rs-station/reciprocalspaceshi…

3b4f4b2

…p into parstream

kmdalton and others added 2 commits June 27, 2024 17:25

pass through kwargs to ray.init

0f34c9d

[pre-commit.ci] auto fixes from pre-commit.com hooks

d5913b9

for more information, see https://pre-commit.ci

marinegor suggested changes Jul 18, 2024

View reviewed changes

Kevin Dalton and others added 9 commits July 19, 2024 11:17

enable read only stream parsing

c11e99f

[pre-commit.ci] auto fixes from pre-commit.com hooks

0f1f547

for more information, see https://pre-commit.ci

remove unused imports

b5729e6

fix column spec, add option to disable ray, make num_cpus explicit

84b83ad

test with and without ray

612ef0a

[pre-commit.ci] auto fixes from pre-commit.com hooks

dff3c28

for more information, see https://pre-commit.ci

add error message

a1e44a3

clean up dataset creation

a03780a

[pre-commit.ci] auto fixes from pre-commit.com hooks

1082a92

for more information, see https://pre-commit.ci

marinegor approved these changes Jul 21, 2024

View reviewed changes

reciprocalspaceship/io/crystfel.py Outdated Show resolved Hide resolved

reciprocalspaceship/io/crystfel.py Outdated Show resolved Hide resolved

reciprocalspaceship/io/crystfel.py Outdated Show resolved Hide resolved

Kevin Dalton added 3 commits August 2, 2024 11:49

remove unused variable

f1770ac

removed unused timing stuff

33b57be

removed unused DataSeries

ec6ad73

JBGreisman approved these changes Aug 3, 2024

View reviewed changes

kmdalton merged commit 870a019 into main Aug 5, 2024
8 checks passed

DHekstra mentioned this pull request Aug 24, 2024

Return keys of dictionary in crystfel.py #213

Open

kmdalton deleted the parstream branch September 11, 2024 15:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallel Stream File Parsing with ray.io #260

Parallel Stream File Parsing with ray.io #260

kmdalton commented Jun 19, 2024

JBGreisman commented Jun 19, 2024

kmdalton commented Jun 20, 2024

kmdalton commented Jun 21, 2024

marinegor commented Jun 21, 2024

kmdalton commented Jun 21, 2024

DHekstra commented Jun 25, 2024

DHekstra commented Jun 25, 2024

kmdalton commented Jun 25, 2024

marinegor left a comment

marinegor Jul 18, 2024

kmdalton Jul 19, 2024

kmdalton commented Jul 19, 2024

marinegor left a comment

JBGreisman left a comment

Parallel Stream File Parsing with ray.io #260

Parallel Stream File Parsing with ray.io #260

Conversation

kmdalton commented Jun 19, 2024

JBGreisman commented Jun 19, 2024

kmdalton commented Jun 20, 2024

kmdalton commented Jun 21, 2024

marinegor commented Jun 21, 2024

kmdalton commented Jun 21, 2024

DHekstra commented Jun 25, 2024

DHekstra commented Jun 25, 2024

kmdalton commented Jun 25, 2024

marinegor left a comment

Choose a reason for hiding this comment

marinegor Jul 18, 2024

Choose a reason for hiding this comment

kmdalton Jul 19, 2024

Choose a reason for hiding this comment

kmdalton commented Jul 19, 2024

marinegor left a comment

Choose a reason for hiding this comment

JBGreisman left a comment

Choose a reason for hiding this comment