A SEDA pipeline created in Compi that implements the "Obtaining protein family members" SEDA-based protocol. Created using the SEDA-Compi pipelines framework.
This protocol shows how to retrieve all members of a given protein family, such as mucins. The main feature of mucin proteins is their extended region of tandemly repeated sequences (PTS repeats), which contain prolines (P) together with serines (S) and/or threonines (T). These repeats generally occupy between 30% and 90% of the protein length and cannot be detected in homology searches because of their poor sequence conservation (https://doi.org/10.1371/journal.pone.0003041). Mucins also show signal peptides and other associated domains.
Download this ZIP and decompress it. The path where it is extracted will be referred to as the "working directory" (/path/to/working_dir).
Move to the working directory and edit the params/pfamscan.sedaParams file to set your e-mail address in the third line (eMail); otherwise PfamScan will not run. Then, simply run ./run.sh "$(pwd)" to execute the entire pipeline with two input files.
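Before launching the pipeline, it can help to confirm that the edit was saved. The sketch below only prints the third line of a file so it can be checked by eye. The helper name show_third_line is hypothetical and not part of the pipeline, and the sketch assumes nothing about the syntax of the parameters file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): print line 3 of a file.
show_third_line() {
    sed -n '3p' "$1"
}

# Usage, from the working directory:
#   show_third_line params/pfamscan.sedaParams
```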
The two input FASTA files for Homo sapiens (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39) and Drosophila melanogaster (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001215.4) were downloaded from the NCBI assembly RefSeq database by selecting the Download assembly / Protein FASTA (.faa) option.
To run specific tasks, an additional parameter can be passed to the run.sh script: ./run.sh "$(pwd)" "--single-task extract-headers" or ./run.sh "$(pwd)" "--until pfamscan".
Applying the protocol to other case studies is easy; you only need to:
- Put the protein FASTA files at input/pattern-filtering/.
- Edit the params/pattern-filtering.sedaParams file to set pattern filtering parameters appropriate to your case study. This file can be created and exported using the SEDA GUI, which is handy for advanced pattern filtering cases.
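The first of these steps can be scripted. The following is a minimal sketch, assuming the new protein FASTA files use the .faa extension (as the NCBI downloads mentioned earlier do) and that the working directory has the layout described in this README. The function name stage_fasta is hypothetical and not part of the pipeline.

```shell
#!/bin/sh
# Hypothetical helper: copy every .faa file from a source directory into
# <working_dir>/input/pattern-filtering/. Assumes the .faa extension.
stage_fasta() {
    src_dir=$1
    working_dir=$2
    dest="$working_dir/input/pattern-filtering"
    mkdir -p "$dest" || return 1
    found=0
    for f in "$src_dir"/*.faa; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -f "$f" ] || continue
        cp "$f" "$dest/" || return 1
        found=$((found + 1))
    done
    if [ "$found" -eq 0 ]; then
        echo "no .faa files found in $src_dir" >&2
        return 1
    fi
    echo "copied $found file(s) to $dest"
}

# Usage:
#   stage_fasta /path/to/new_fasta /path/to/working_dir
```

The sketch only adds files. Anything already present in input/pattern-filtering/ stays there, so remove inputs from a previous case study first if they should not be included.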