numcpus does not work #2
Comments
@lskatz Would you mind sharing the method you're planning to use for multi-threading here? |
I had ideas of making threads but never got around to it. I am open to getting help on it though! If/when we submit this to JOSS, I will be open to giving coauthorship to contributions like this! |
And re: proteins, I think there shouldn't be any problem using it with protein sequences. Let me know! |
Oh, and on the extent to which I was thinking of multithreading: I didn't think any particular Rust implementation would matter. I was just thinking of picking some smart number of reads per chunk (100k?) and letting each thread run its thing. The "thing" would depend on what the executable actually does. |
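For illustration, the "fixed number of reads per chunk" idea could be sketched roughly as below. This is not fasten's actual code: `Record` and `ChunkReader` are hypothetical names, and the sketch assumes plain four-line FASTQ records arriving on any buffered stream (so it works with pipes and never needs the total read count).

```rust
use std::io::{BufRead, Lines};

/// One FASTQ record stored as its four raw lines (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: String,
    pub seq: String,
    pub plus: String,
    pub qual: String,
}

/// Iterator that yields chunks of up to `reads_per_chunk` records from a stream.
pub struct ChunkReader<R: BufRead> {
    lines: Lines<R>,
    reads_per_chunk: usize,
}

impl<R: BufRead> ChunkReader<R> {
    pub fn new(reader: R, reads_per_chunk: usize) -> Self {
        assert!(reads_per_chunk > 0, "chunk size must be positive");
        ChunkReader { lines: reader.lines(), reads_per_chunk }
    }
}

impl<R: BufRead> Iterator for ChunkReader<R> {
    type Item = Vec<Record>;

    fn next(&mut self) -> Option<Vec<Record>> {
        let mut chunk = Vec::new();
        while chunk.len() < self.reads_per_chunk {
            // Stop at end of input, on a read error, or at a truncated record.
            let id = match self.lines.next() { Some(Ok(l)) => l, _ => break };
            let seq = match self.lines.next() { Some(Ok(l)) => l, _ => break };
            let plus = match self.lines.next() { Some(Ok(l)) => l, _ => break };
            let qual = match self.lines.next() { Some(Ok(l)) => l, _ => break };
            chunk.push(Record { id, seq, plus, qual });
        }
        // The last chunk may be shorter than `reads_per_chunk`.
        if chunk.is_empty() { None } else { Some(chunk) }
    }
}
```

Something like `ChunkReader::new(std::io::stdin().lock(), 100_000)` would then hand out chunks one at a time, and each chunk could be given to whichever thread is free.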
Super! I'm a total novice in Rust but I'd be happy to help as much as I can. Re: re: proteins, much of the core functionality should work as is. I started forking around with fasten yesterday and I think the main TODOs are adding amino acid alphabet validation, disabling paired-end mode for the protein mode, and avoiding attempts to reverse complement protein sequences. I could try to get to these if you wouldn't mind. Re: threading, that sounds good and straightforward; I think rayon would fit the bill. As for the number of reads per chunk: interesting. If the number of records/reads to be processed is known beforehand (not very fitting to Unix piping, but still), what do you think about splitting the chunks more or less equally between (NUMCPUs - 1) threads? I'm assuming a bit here, like no overhead for picking the actual chunks, and as I noted, I'm really new to Rust, so sorry in advance if I misunderstood the actual inner workings of fasten. Unrelated: I think you forgot the 'L' in Chandler 😄 but at least it's a consistent typo (both in the pics dir and the fasten_kmer help arg). |
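A rough sketch of what the protein TODOs above might look like, purely as an illustration. The `Alphabet` enum and both function names are hypothetical and not part of fasten. The accepted alphabet here is an assumption too (the 20 standard amino acids plus `X` for unknown and `*` for stop); a real implementation might also accept the ambiguity codes B, Z, J, U and O.

```rust
/// Which kind of sequence is being processed (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Alphabet {
    Nucleotide,
    Protein,
}

/// Assumed protein alphabet: 20 standard amino acids, X (unknown), * (stop).
const PROTEIN_CHARS: &str = "ACDEFGHIKLMNPQRSTVWYX*";

/// Returns true if every character of a non-empty sequence is in the assumed alphabet.
pub fn is_valid_protein(seq: &str) -> bool {
    !seq.is_empty()
        && seq
            .chars()
            .all(|c| PROTEIN_CHARS.contains(c.to_ascii_uppercase()))
}

/// Reverse complements nucleotides; returns None for proteins, where the
/// operation has no meaning. Unrecognized bases are mapped to 'N'.
pub fn reverse_complement(seq: &str, alphabet: Alphabet) -> Option<String> {
    if alphabet == Alphabet::Protein {
        return None;
    }
    let revcomp = seq
        .chars()
        .rev()
        .map(|c| match c.to_ascii_uppercase() {
            'A' => 'T',
            'T' => 'A',
            'C' => 'G',
            'G' => 'C',
            _ => 'N',
        })
        .collect();
    Some(revcomp)
}
```

Returning `Option` makes the caller decide what to do in protein mode instead of silently producing a meaningless sequence.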
Thank you for catching those typos! I want to remove all the Friends references but keep finding more :) Any PRs you have, I would appreciate and review! Your plan for multithreading is as good as any and I would love to see it :) I think that reading a file twice however (once to see how many reads there are and then again to process) might be too much overhead and maybe even impossible with pipes, and so I would prefer chunks but I have an open mind if you can benchmark it. |
Thanks for being so open! It's really refreshing :-) Yeah, knowing how many reads need to be processed only to set the chunk size doesn't really justify reading the file twice. I was thinking maybe checking the file size and, if the input format is indicative enough, estimating the number of records from that (but that won't work when the input is piped). I think your initial suggestion of 100k reads per chunk would work great. I guess it's better to have more chunks than threads, and just queue the chunks in memory until a thread has finished working on a chunk and is ready to process the next one. So, assuming most people (like me) would usually process a total of >1 million reads using 6-8 threads, the chunks-to-threads ratio will be fine. It might be a while (a month or more) but I'll let you know if I get to actually making changes (I also want to read some more on multithreading in Rust first). Thanks again and have a happy new year! |
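The "queue chunks in memory until a thread is free" idea might look roughly like the sketch below. It uses only the standard library (a bounded `sync_channel` shared by worker threads) rather than rayon, which was the crate suggested earlier in the thread. `process_chunks` and its parameters are hypothetical names, and the per-chunk `work` closure stands in for whatever a given executable actually does.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Feeds chunks through a bounded queue to `num_threads` workers and
/// returns the sum of the per-chunk results.
pub fn process_chunks<I, T, F>(chunks: I, num_threads: usize, queue_len: usize, work: F) -> usize
where
    I: Iterator<Item = Vec<T>>,
    T: Send + 'static,
    F: Fn(&[T]) -> usize + Send + Sync + 'static,
{
    // Bounded queue: the reader blocks once `queue_len` chunks are waiting,
    // so memory use stays limited even for very large inputs.
    let (tx, rx) = sync_channel::<Vec<T>>(queue_len);
    let rx = Arc::new(Mutex::new(rx));
    let work = Arc::new(work);

    let handles: Vec<_> = (0..num_threads.max(1))
        .map(|_| {
            let rx = Arc::clone(&rx);
            let work = Arc::clone(&work);
            thread::spawn(move || {
                let mut total = 0usize;
                loop {
                    // The lock is held only while taking the next chunk.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(chunk) => total += (*work)(chunk.as_slice()),
                        Err(_) => break, // sender dropped and queue drained
                    }
                }
                total
            })
        })
        .collect();

    // The calling thread acts as the reader and fills the queue.
    for chunk in chunks {
        tx.send(chunk).expect("all workers exited early");
    }
    drop(tx); // signal that no more chunks are coming

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .sum()
}
```

With more chunks than threads, a worker that finishes early simply takes the next chunk off the queue, so uneven chunks do not leave threads idle for long.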
I added a method to @UriNeri were you able to try adding multithreading on your end at all? |
Local test on random but large fastq file:
|
Hi @lskatz ! |
Need to add functionality for numcpus