Passing large numbers of files to createsetdb #5
Comments
Hi! Did you reach a solution for your issue with the large number of files?
Hi Fazel, We do not yet have a full solution, but we have identified a core problem. On our Linux system (and many others, unless compiled with special parameters) there is a byte limit on the size of a terminal command (2 MB of total text). When the spacedust command is passed a file glob (*), the in-memory size of the full command text exceeds 2 MB. We have implemented the following workarounds at the moment:
However, as we are using plasmid sequences that are much shorter than genomes, and we have the resources, we are planning to scale this up further to hundreds of thousands of sequences. One potential solution (if possible in your code framework) would be an option to provide a single input file that contains the complete path to every file you want to include in the analysis. Do you think this would be possible to implement?
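For anyone who wants to check the limit described above on their own machine, here is a minimal shell sketch. `getconf ARG_MAX` is a standard utility call; the helper names `arg_limit` and `glob_arg_bytes` are made up for illustration. The byte count is only a rough estimate (path length plus one terminating byte per file) and ignores environment variables and any other per-argument overhead.

```shell
# Print the system limit (in bytes) on the combined size of the
# arguments and environment passed to a new process.
arg_limit() {
    getconf ARG_MAX
}

# Rough estimate of the bytes needed to pass every regular file in
# directory "$1" whose name ends in "$2" as command-line arguments.
# Each path is counted as its length plus one terminating NUL byte.
# (Hypothetical helper, not part of spacedust.)
glob_arg_bytes() {
    dir=$1
    suffix=$2
    find "$dir" -maxdepth 1 -type f -name "*$suffix" -print |
        LC_ALL=C awk '{ total += length($0) + 1 } END { print total + 0 }'
}
```

Comparing the output of `glob_arg_bytes /individual_faa .faa` with `arg_limit` gives a quick indication of whether a glob over that directory is likely to fail with "Argument list too long".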
Hi Spencer, Thank you for the helpful tips. I checked the command size limit in my Linux system by Fazel
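Hi, I have updated the workflow. Now it is also possible to pass a directory or a .tsv file with the list of paths to the desired files, for example:
$spacedust createsetdb /individual_faa SpacedustDB tmp --file-include ".faa$" --threads 18
or
$spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18
You can download the new pre-compiled repository.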
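A list file of the kind mentioned above can be produced without ever expanding a glob on the command line. The sketch below assumes the .tsv simply holds one path per line; the thread does not spell out the exact layout, so treat that as an assumption. The helper name `write_path_list` is hypothetical and not part of spacedust.

```shell
# Write the absolute path of every regular file in directory "$1" whose
# name ends in "$2" to the list file "$3", one path per line.
# ASSUMPTION: one path per line is the layout the list file should have.
write_path_list() {
    dir=$1
    suffix=$2
    out=$3
    # find walks the directory itself, so the file names never have to
    # fit on a single command line.
    find "$(cd "$dir" && pwd)" -maxdepth 1 -type f -name "*$suffix" -print |
        sort > "$out"
}
```

For example, `write_path_list /individual_faa .faa path_to_faa.tsv` would write the list, which could then be given to `spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18` as in the comment above.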
Fantastic, thank you all. I will give it a test in the coming days.
- Spencer
-----------------------------------------------
Spencer Diamond, Ph.D.
Principal Investigator
Innovative Genomics Institute
University of California, Berkeley
2151 Berkeley Way
Berkeley, CA 94720
Diamond Lab <https://diamondlab.bio/> | IGI <https://innovativegenomics.org/> | BIOME <https://innovativegenomics.org/microbiome-editing/>
X: @Dr__Diamond <https://x.com/Dr__Diamond>
… On Dec 17, 2024, at 8:38 AM, RuoshiZ ***@***.***> wrote:
Hi, I have updated the workflow. Now it is also possible to pass a directory or a .tsv file with the list of paths to the desired files, for example:
$spacedust createsetdb /individual_faa SpacedustDB tmp --file-include ".faa$" --threads 18
or
$spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18
You can download the new pre-compiled repository.
Hi Ruoshi, I also ran into the same problem, and the provided fix still doesn't work because
still uses a system call to the generated tmp bash script, which in turn uses a system call to the mmseqs command. So it fails with the same error, "E2BIG (Argument list too long)", in the subsequent steps. Thanks,
I would like to run spacedust on a plasmid database. This database has ~60k individual files that represent separate plasmid "genomes". However, when I pass the following command to spacedust:
I receive a bash error that the argument list is too long. I have tried a number of workarounds, such as passing an environment variable that contains all the file names, but to no avail.
It would be useful if, instead of taking a file glob (*), spacedust createsetdb could take a single input file with paths to each of the .faa files needed for db creation. Alternatively, if I could create databases in batches and combine them, that could be another approach; I am just not sure whether that is supported. Finally, if you have any other suggestions I would be forever grateful.
In terms of the total number of proteins, these plasmid "genomes" would be quite similar to the 9000 genomes you ran in the spacedust paper, since plasmids are much smaller in size. So I think it should be computationally manageable; the only trouble is getting all the files in :-)
My Environment