Passing large numbers of files to createsetdb #5

Open
SDmetagenomics opened this issue Oct 30, 2024 · 6 comments

@SDmetagenomics

I would like to run spacedust on a plasmid database. This database has ~60k individual files, each representing a separate plasmid "genome". However, when I pass the following command to spacedust:

$spacedust createsetdb /individual_faa/*.faa SpacedustDB tmp --threads 18

bash: /shared/software/bin/spacedust: Argument list too long

I receive a bash error that the argument list is too long. I have tried a number of workarounds, such as passing an environment variable that contains all the file names, but to no avail.

It would be useful if, instead of a file glob (*), spacedust createsetdb could take a single input file with paths to each of the .faa files needed for db creation. Alternatively, I could create databases in batches and combine them, but I am not sure whether that is supported. Finally, if you have any other suggestions I would be forever grateful.

In terms of the total number of proteins, these plasmid "genomes" would be quite similar to the 9000 genomes you ran in the spacedust paper, since plasmids are much smaller in size. So I think it should be computationally manageable; the only trouble is getting all the files in :-)

My Environment

  • Linux
  • Using the statically compiled spacedust executable for the AVX2 instruction set
@Fazel-AVB

Hi! Did you reach a solution for your issue with the large number of files?

@SDmetagenomics
Author

Hi Fazel,

So we do not yet have a full solution, but we have identified a core problem. On our Linux system (and many others, unless compiled with special parameters) there is a byte limit on the size of a terminal command (2 MB of total text). So when the command for spacedust is passed using a file glob (*), the in-memory size of the full command text becomes larger than 2 MB. We have implemented the following workarounds for now:

  1. Running spacedust within the data folder so that the relative path to the data does not contain any extra directories (e.g. *.faa rather than /spacedust_input/*.faa). This significantly reduces the number of bytes the command takes up, as the path no longer repeats a directory entry for every file we want to analyze (e.g. /spacedust_input/1.faa, /spacedust_input/2.faa).

  2. We have started renaming input files to the shortest names possible (e.g. 1.faa, 2.faa) and linking these new names to the originals in a lookup table. This also seems to help mitigate the problem.
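
A minimal sketch of workaround 2, for illustration only. The helper name link_short_names, its three arguments, and the use of symlinks rather than real renames are all assumptions; it also presumes that spacedust reads through symlinks (copy the files instead if it does not).

```shell
# link_short_names SRC_DIR DEST_DIR LOOKUP_TSV   (hypothetical helper)
# Creates DEST_DIR/1.faa, DEST_DIR/2.faa, ... as symlinks to the originals
# and records "short name <TAB> original path" in LOOKUP_TSV.
link_short_names() (
  # Absolute source path, so the symlinks resolve from any working directory.
  src_dir=$(cd "$1" && pwd) || exit 1
  dest_dir=$2
  lookup=$3
  mkdir -p "$dest_dir"
  : > "$lookup"
  n=0
  # find lists the files itself, so no glob is expanded onto a command line.
  find "$src_dir" -maxdepth 1 -type f -name '*.faa' | sort |
  while IFS= read -r f; do
    n=$((n + 1))
    ln -sf "$f" "$dest_dir/$n.faa"
    printf '%s\t%s\n' "$n.faa" "$f" >> "$lookup"
  done
)
```

Called as, for example, link_short_names /spacedust_input short_names lookup.tsv, this leaves the original files untouched, and the lookup table maps each short name back to its source.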

However, as we are using plasmid sequences that are much shorter than genomes, and have the resources, we are planning to scale this up further into the hundreds of thousands of sequences. One potential solution (if possible in your code framework) would be an option to provide a single input file that contains the complete paths to every file you want to include in the analysis. Do you think this would be possible to implement?

Spencer

@Fazel-AVB

Fazel-AVB commented Dec 17, 2024

Hi Spencer,

Thank you for the helpful tips. I checked the command size limit on my Linux system with ulimit -a, and it turned out to be stack size (kbytes, -s) 8192 (i.e., 8 MB). I then increased it to 16 MB with ulimit -s 16384 and ran $spacedust createsetdb ./*.faa SpacedustDB tmp inside the input directory.
This worked for me when run from the command line; however, I haven't yet checked it when submitted as a job file.
I hope it helps.

Fazel
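
For reference, the limits discussed above can be inspected directly. On Linux with glibc, the value reported by getconf ARG_MAX is normally derived from the soft stack limit (about one quarter of it), which would explain why an 8192 KiB stack corresponds to the roughly 2 MB limit mentioned earlier and why raising ulimit -s helps. The helper arg_bytes is hypothetical and gives only a rough estimate, since the environment also counts toward the limit.

```shell
# Upper bound, in bytes, on arguments plus environment for a new process.
getconf ARG_MAX

# Current soft stack limit (KiB, or "unlimited").
ulimit -s

# arg_bytes DIR   (hypothetical helper)
# Bytes that the names matching DIR/*.faa would occupy as arguments,
# counting one terminating NUL per name.
arg_bytes() {
  # printf is a shell builtin, so the expanded glob never goes through exec
  # and is not itself subject to the limit being measured.
  printf '%s\0' "$1"/*.faa | wc -c
}
```

Comparing the output of arg_bytes . (or arg_bytes /individual_faa) with getconf ARG_MAX shows how close a given glob is to the limit before spacedust is launched.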

@RuoshiZhang
Member

RuoshiZhang commented Dec 17, 2024

Hi, I have updated the workflow. Now it is also possible to pass a directory or a .tsv file with the list of paths to the desired files, for example:
$spacedust createsetdb /individual_faa SpacedustDB tmp --file-include ".faa$" --threads 18
or
$spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18

You can download the new pre-compiled repository.
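
One way to produce such a list without expanding a glob on a command line is find, which walks the directory itself. The one-path-per-line layout is an assumption (the exact format spacedust expects for the .tsv is not spelled out in this thread), and write_faa_list is a hypothetical helper name.

```shell
# write_faa_list FAA_DIR LIST_FILE   (hypothetical helper)
# Writes the path of every .faa file directly inside FAA_DIR to LIST_FILE,
# one per line, sorted for reproducibility.
write_faa_list() {
  # Paths are printed relative to FAA_DIR as given; pass an absolute
  # directory to get absolute paths in the list.
  find "$1" -maxdepth 1 -type f -name '*.faa' | sort > "$2"
}
```

For example, write_faa_list /individual_faa path_to_faa.tsv would create the list used in the second createsetdb command above.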

@SDmetagenomics
Author

SDmetagenomics commented Dec 17, 2024 via email

@canerbagci

Hi Ruoshi,

I also ran into the same problem, and the provided fix still doesn't work because

cmd.execProgram(program.c_str(), par.filenames);

still uses a system call to the generated tmp bash script, which also uses a system call to the mmseqs command. So it fails with the same error

" E2BIG (Argument list too long)"

in the subsequent steps.

Thanks,
Caner
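
The failure mode described here can be reproduced without spacedust. The limit is enforced by the kernel every time an external program is started, so a wrapper that receives a short command line but then expands the full file list onto the command line of a child process hits E2BIG all the same. In this sketch try_exec is a hypothetical demo helper, and env true merely stands in for the inner mmseqs call, whose actual invocation is not shown in this thread.

```shell
# try_exec COUNT   (hypothetical demo helper)
# Builds COUNT dummy arguments of 100 characters each, then passes them
# first to a shell builtin and then to an external program.
try_exec() {
  # 100 zeros, standing in for a long file path.
  pad=$(printf '%0100d' 0)
  # Unquoted on purpose: word splitting turns each line into one argument.
  set -- $(yes "$pad" | head -n "$1")
  # Handled inside the shell, so no exec and no size limit applies.
  echo "arguments built: $#"
  # env is an external program, so the kernel checks the total argument size.
  if env true "$@"; then
    echo "exec ok"
  else
    echo "exec failed (argument list too long?)"
  fi
}
```

With a default 8 MB stack limit, a call such as try_exec 50000 (about 5 MB of arguments) would be expected to build the list successfully and then fail at the exec step, while small counts pass both steps.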

4 participants