Mapper Template
To write a new mapper template, you need at least a minimal YAML configuration similar to the following.
Minimal Template Example
# mapper type DNA/RNA
type: DNA
# dependency for auto install if it is not installed
dep: bioconda/bwa
# mapper name, this name will be used to name output directories [optional] default: mapper
mapper_name : bwa
template:
  # index template
  index : "bwa index -p {ref_index} {ref_fasta}"
  # mapping command template in single-end layout
  se :
    - "bwa mem {ref_index} {1} > {outputfile}"
  # mapping command template in paired-end layout
  pe :
    - "bwa mem {ref_index} {1} {2} > {outputfile}"
This section describes the minimum configuration controls that must be present in the template.
## RNA/DNA
type: DNA
Inside the template section there should be an index subsection for the index command and the subsections se and pe for the mapping commands. If both mapping commands are the same, use a single both section instead of se and pe. Each of these sections accepts either one command template or a list of command templates that are executed in order.
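To illustrate the "one command template or a list" rule, the sketch below normalizes a section value to a list so that both forms can be handled the same way. The helper name as_command_list is hypothetical and is not taken from Crossmapper's code.

```python
def as_command_list(section_value):
    """Return the command templates of a section as a list, in execution order.

    Hypothetical helper: a section may hold one string or a list of strings.
    """
    if isinstance(section_value, str):
        return [section_value]
    return list(section_value)


# A single command template and a list of templates end up in the same shape.
single = as_command_list("bwa index -p {ref_index} {ref_fasta}")
several = as_command_list([
    "echo 'example of multi command template'",
    "bwa index -p {ref_index} {ref_fasta}",
])
print(len(single), len(several))
```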
You can specify the index command of the mapper in this configuration section. It can be a single command:
template:
  # index template
  index : "STAR --runMode genomeGenerate --runThreadN 10 --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"
or a list of command templates (normally there is just one command):
template:
  # index template
  index :
    - "echo 'example of multi command template' "
    - "STAR --runMode genomeGenerate --runThreadN 10 --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"
The se template is used to run commands for the single-end read layout.
template :
  se :
    - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "
The pe template is used to run commands for the paired-end read layout.
template :
  pe :
    - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "
The both template is used to run commands for both the single-end and the paired-end read layout. If the mapping template is the same in both cases apart from variable substitution, you can specify one command template under the both section. For example, in the STAR command the substitution in --readFilesIn {1} {2} works for se and pe alike, since no additional switch is needed in either case. However, for a mapper whose pe command needs an additional argument switch such as '-2 {2}' (bowtie2, for example), that switch cannot be added in se without breaking the final command, so se and pe must be given separately.
template :
  both :
    - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "
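The choice between se, pe and both can be pictured with the following sketch. It assumes that a both section, when present, serves either layout and that the layout-specific section is used otherwise; select_commands is a hypothetical helper, and the STAR command is abbreviated for readability.

```python
def select_commands(template, layout):
    """Pick the mapping command templates for a layout ("SE" or "PE").

    Assumption: "both" takes the place of separate "se" and "pe" sections.
    """
    section = template["both"] if "both" in template else template[layout.lower()]
    return [section] if isinstance(section, str) else list(section)


# One shared template versus two layout-specific templates.
star_template = {"both": ["STAR --genomeDir {ref_dir} --readFilesIn {1} {2}"]}
bwa_template = {
    "se": ["bwa mem {ref_index} {1} > {outputfile}"],
    "pe": ["bwa mem {ref_index} {1} {2} > {outputfile}"],
}

print(select_commands(star_template, "SE") == select_commands(star_template, "PE"))
print(select_commands(bwa_template, "PE")[0])
```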
This section describes optional configuration controls. If a control is not present in the template, its default behavior applies.
If you want Crossmapper to install the mapper package before running the template commands, specify the package name with the following configuration:
dep: bioconda/star
where you specify the channel first and then the package name. This is useful if you want to share the template with others and let it install the package for them if they do not already have it.
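As a small illustration of the channel/package format, the sketch below splits a dep value into its two parts and rejects values that lack either one. The function split_dependency is hypothetical; how Crossmapper itself parses this value is not described here.

```python
def split_dependency(dep):
    """Split a "channel/package" dep value into (channel, package)."""
    channel, sep, package = dep.partition("/")
    if not sep or not channel or not package:
        raise ValueError(f"expected 'channel/package', got {dep!r}")
    return channel, package


channel, package = split_dependency("bioconda/star")
print(channel, package)
```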
You can specify a custom name for the mapper, which is used in output messages, logs, and the names of output folders. If you do not specify it, the default name is mapper.
mapper_name : mymapper
By default, Crossmapper treats output files as unsorted and runs samtools to sort them and convert them to bam format. If the mapper already outputs a sorted file, you can disable Crossmapper's sorting by specifying this in the template configuration.
## yes/no, default: no
sorted : yes
Crossmapper names output files using the pattern concat_{rlen}_{layout}.{output_type}, where rlen is the read length of the simulated reads, layout is the mapping layout (SE/PE), and output_type is the output file format (sam/bam). If the template command changes this pattern in any way, you need to specify the new pattern here, because Crossmapper must know the exact file name to pass to the next step in the pipeline.
For example, the STAR mapper adds an extra suffix to the file name. To adjust for this, we use the configuration below, where {outputfile_prefix} is a variable that is substituted with the actual prefix of the file being processed, which follows the pattern concat_{rlen}_{layout} above (see the Variable Section for more details):
outputfile_pattern: "{outputfile_prefix}_Aligned.sortedByCoord.out.bam"
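The naming rule can be sketched as follows. The function expected_output_name is hypothetical; it assumes the layout appears in upper case (as in concat_50_PE) and that a custom pattern only needs {outputfile_prefix} substituted.

```python
def expected_output_name(read_len, layout, output_type, pattern=None):
    """Build the file name the next pipeline step would look for."""
    prefix = f"concat_{read_len}_{layout}"
    if pattern is None:
        # Default pattern: concat_{rlen}_{layout}.{output_type}
        return f"{prefix}.{output_type}"
    # Custom pattern from outputfile_pattern in the template configuration.
    return pattern.replace("{outputfile_prefix}", prefix)


default_name = expected_output_name(50, "PE", "sam")
star_name = expected_output_name(
    50, "PE", "bam", "{outputfile_prefix}_Aligned.sortedByCoord.out.bam"
)
print(default_name)
print(star_name)
```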
Templates use variables that are substituted with values according to the files currently being processed.
Here is a list of variables that can be used inside command templates. Crossmapper treats any string surrounded by curly brackets, e.g. {var_name}, as a variable name that is substituted with the corresponding value before the command is executed.
- {base_dir}: the -o option of Crossmapper
- {output_dir}: output directory; this is the full path to the mapping output folder, which is {base_dir}/{mapper_name}_output
- {1} and {2}: fastq files
- {layout}: PE or SE
- {read_len}: read length
- {outputfile}: the output file name with its full path, {base_dir}/{outputdir_name}_output/{outputfile_pattern}
- {outputfile_pattern}: same as given in the template configuration, or the default value
- {outputfile_prefix}: concat_{read_len}_{layout}
- {output_type}: output file format, sam/bam
- {gtf}: gtf file
- {ref_fasta}: reference input fasta file
- {ref_prefix}: prefix name of the genome index
- {ref_dir}: reference index directory; this is {base_dir}/{mapper_name}_index
- {ref_index}: {ref_dir}/{ref_prefix}
- {genome_len}: concatenated genome length calculated from ref_fasta
- basename qualifier: using it as in {out:basename} excludes the path and extension and returns only the internal pattern with read_len and layout, e.g. concat_50_PE
- {n_threads}: number of cores to be used for all multicore-supporting steps; the -t/--threads Crossmapper option
- {tmp_dir}: local temporary directory; the -star_tmp Crossmapper option
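The relationships between the path variables above, and the substitution step itself, can be sketched as follows. This is not Crossmapper's implementation: build_variables and render are hypothetical helpers, and the paths are made-up values. A regular expression is used rather than str.format because names such as {1} would be read by str.format as positional arguments.

```python
import re


def build_variables(base_dir, mapper_name, ref_prefix):
    """Derive the directory variables from base_dir and mapper_name."""
    ref_dir = f"{base_dir}/{mapper_name}_index"
    return {
        "base_dir": base_dir,
        "output_dir": f"{base_dir}/{mapper_name}_output",
        "ref_dir": ref_dir,
        "ref_prefix": ref_prefix,
        "ref_index": f"{ref_dir}/{ref_prefix}",
    }


def render(command, variables):
    """Replace every {name} in command with variables[name]."""
    return re.sub(r"\{(\w+)\}", lambda m: str(variables[m.group(1)]), command)


# Hypothetical values, for illustration only.
variables = build_variables("/data/run1", "bwa", "concat")
variables["ref_fasta"] = "/data/run1/concat.fasta"
rendered = render("bwa index -p {ref_index} {ref_fasta}", variables)
print(rendered)
```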
The configuration file is based on YAML (YAML Ain’t Markup Language) syntax, a human-readable format that is easy to understand and work with.
Basically, YAML files hold records of key:value pairs. For example, in mapper_name : mymapper, mapper_name is the key of a configuration parameter in the file and mymapper is its value. YAML files can also have sections, which are key:value records whose value is itself a complex structure that can hold more key:value parameters. For example, the following part:
template:
  index : "some/cmd/to/index"
  both :
    - "cmd1"
    - "cmd2"
Also note that, to indicate that index is part of the template section, you must indent it with spaces; tab indentation is not allowed (see YAML Indentation).
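Because tab indentation is an easy mistake to make, a quick check such as the one below can find offending lines before the template is used. This is only a sketch; tab_indented_lines is a hypothetical helper and not part of Crossmapper.

```python
def tab_indented_lines(yaml_text):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


spaces_only = 'template:\n  index : "some/cmd/to/index"\n'
with_tab = 'template:\n\tindex : "some/cmd/to/index"\n'
print(tab_indented_lines(spaces_only), tab_indented_lines(with_tab))
```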
YAML References
- Bash Redirection : you can redirect stdout to a file in any template command, for example:
---
template:
  se : "bwa mem {ref_index} {1} > {outputfile}"
This will redirect bwa stdout to {outputfile}.
However, the current implementation of Crossmapper supports redirecting stdout only.
- Bash Piping : the current implementation of Crossmapper does not support piping. If you need such behavior, you can use multiple commands in the template and pass data between them through an intermediate file. For Crossmapper's current requirements, we did not find it important to add support for this feature.
- Bash variables : the current implementation of Crossmapper does not support these. Command templates should not use bash variables: anything between curly brackets, such as {var_name}, is interpreted as an internal Crossmapper variable as described in this documentation and, if Crossmapper does not find it, causes an internal error.
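A template can be checked for this problem before it is run. The sketch below lists every {name} in a command that is not one of the documented variables; the set KNOWN and the function unknown_variables are hypothetical and only mirror the variable list in this page. Note how a bash variable written as ${HOME} is picked up as an unknown variable HOME.

```python
import re

# Variable names documented on this page (an assumption of this sketch).
KNOWN = {
    "base_dir", "output_dir", "1", "2", "layout", "read_len", "outputfile",
    "outputfile_pattern", "outputfile_prefix", "output_type", "gtf",
    "ref_fasta", "ref_prefix", "ref_dir", "ref_index", "genome_len",
    "n_threads", "tmp_dir",
}


def unknown_variables(command):
    """Return the sorted names found in {braces} that are not documented."""
    return sorted(set(re.findall(r"\{(\w+)\}", command)) - KNOWN)


ok = unknown_variables("bwa mem {ref_index} {1} > {outputfile}")
clash = unknown_variables("cp {outputfile} ${HOME}/results")
print(ok, clash)
```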