Mapper Template

Hrant Hovhannisyan edited this page Jul 9, 2019 · 5 revisions

Writing Mapper Template

To write a new mapper template, you need at least a minimal YAML configuration similar to the following minimal template example:

# mapper type DNA/RNA
type: DNA
# dependency for auto install if it is not installed
dep: bioconda/bwa
# mapper name, this name will be used to name output directories [optional] default: mapper
mapper_name : bwa
template:
  # index template
  index : "bwa index -p {ref_index} {ref_fasta}"
  # mapping command template for the single-end layout
  se :
    - "bwa mem {ref_index} {1} > {outputfile}"
  # mapping command template for the paired-end layout
  pe :
    - "bwa mem {ref_index} {1} {2} > {outputfile}"

Configuration controls

This section describes the minimal configuration controls that must be present in the template.

Requirements

Mapper type

## RNA/DNA
type: DNA

Templates

Inside the template section there should be two kinds of subsections: the index command and the mapping commands, se and pe. If the se and pe commands are the same, use a single both section instead. Each of these sections accepts either one command template or a list of commands that are executed in order.
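As a rough illustration of these rules, the following Python sketch checks a parsed template section and normalizes every subsection to a list of commands. The helper name normalize_template and the plain dict input are assumptions made for this example; they are not taken from Crossmapper's code. The sketch also assumes that se and pe must both be present when both is not used.

```python
def normalize_template(template):
    """Return {'index': [...], 'se': [...], 'pe': [...]} for a template section.

    Hypothetical helper: `template` is assumed to be the dict that a YAML
    parser would produce for the `template:` section.
    """
    def as_list(value):
        # A subsection may hold one command string or a list of commands.
        return [value] if isinstance(value, str) else list(value)

    if "index" not in template:
        raise ValueError("template needs an 'index' section")

    if "both" in template:
        # One command list shared by the single-end and paired-end layouts.
        se = pe = as_list(template["both"])
    elif "se" in template and "pe" in template:
        se, pe = as_list(template["se"]), as_list(template["pe"])
    else:
        raise ValueError("template needs 'both', or both 'se' and 'pe'")

    return {"index": as_list(template["index"]), "se": se, "pe": pe}


# The template section of the minimal example, written as a Python dict.
minimal = {
    "index": "bwa index -p {ref_index} {ref_fasta}",
    "se": ["bwa mem {ref_index} {1} > {outputfile}"],
    "pe": ["bwa mem {ref_index} {1} {2} > {outputfile}"],
}
print(normalize_template(minimal)["index"])
```

Normalizing to lists up front means later steps can always loop over commands in order, whether the template author wrote one command or several.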

index

You can specify the index command of the mapper in this configuration section. It could be just one command

template:
  # index template
  index : "STAR --runMode genomeGenerate --runThreadN 10 --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"

or it could take a list of command templates (normally it would be just one command).

template:
  # index template
  index : 
    - "echo 'example of multi command template' "
    - "STAR --runMode genomeGenerate --runThreadN 10 --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"

se (single reads mapping command)

The se template is used to run commands for the single-end read layout.

template : 
   se : 
     - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "

pe (paired reads mapping command)

The pe template is used to run commands for the paired-end read layout.

template : 
   pe : 
     - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2}   --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "

both

The both template is used to run commands for both the single-end and paired-end read layouts. If the mapping command is the same in both cases apart from variable substitution, you can specify a single command template under the both section. For example, in the STAR command the substitution in --readFilesIn {1} {2} works for both se and pe, since neither layout needs an additional switch. However, when the pe command needs an extra switch such as '-2 {2}' (as Bowtie2 does, for example), that switch cannot be added to se without breaking the final command, so se and pe must be specified separately.

template : 
   both : 
     - "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2}   --outFileNamePrefix {output_dir}/{outputfile_prefix}_ "

Optionals

This section describes optional configuration controls. If they are not present in the template, the default behavior applies.

Dependency

If you want Crossmapper to install the mapper package before running the template command, you need to specify the package name with the following configuration

dep: bioconda/star

where you need to specify the channel first, then the package name. This is useful if you want to share the template with others and let it install the package for them if they do not already have it.

Mapper Name

You can specify a custom name for the mapper, which is used in output messages, logs, and output folder names. If you do not specify it, the default name is mapper.

mapper_name : mymapper

sorted

By default, Crossmapper treats output files as unsorted and runs samtools to sort them and convert them to BAM format. If the mapper already outputs a sorted file, you can disable Crossmapper's sorting by specifying this in the template configuration.

## yes/no, default: no
sorted : yes

outputfile_pattern

Crossmapper names output files using the pattern concat_{rlen}_{layout}.{output_type}, where rlen is the read length of the simulated reads, layout is the mapping layout (SE/PE), and output_type is the output file format (sam/bam). If the template command changes this pattern in any way, you need to specify the new pattern here, because Crossmapper must know the exact file name to pass to the next step in the pipeline. For example, STAR appends an extra suffix to the file name. To adjust for this, we use the following configuration, where {outputfile_prefix} is a variable that is substituted with the actual prefix of the file being processed, which follows the pattern concat_{rlen}_{layout}. See the Variables section for more details.

outputfile_pattern: "{outputfile_prefix}_Aligned.sortedByCoord.out.bam"
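To make the naming rule concrete, here is a minimal Python sketch of how the expected file name could be derived from a pattern. The function expected_output_name and the constant DEFAULT_PATTERN are hypothetical names for this example, and writing the default as "{outputfile_prefix}.{output_type}" is an assumption based on the pattern described above, not Crossmapper's actual code.

```python
# Assumed spelling of the default pattern concat_{rlen}_{layout}.{output_type}
DEFAULT_PATTERN = "{outputfile_prefix}.{output_type}"


def expected_output_name(read_len, layout, output_type="sam",
                         pattern=DEFAULT_PATTERN):
    """Build the file name that the next pipeline step would look for."""
    prefix = f"concat_{read_len}_{layout}"
    # str.format ignores keyword arguments that the pattern does not use.
    return pattern.format(outputfile_prefix=prefix, output_type=output_type)


# Default naming for 50 bp paired-end reads.
print(expected_output_name(50, "PE"))

# Naming with the STAR pattern from the configuration above.
star_pattern = "{outputfile_prefix}_Aligned.sortedByCoord.out.bam"
print(expected_output_name(50, "PE", pattern=star_pattern))
```

With these inputs the two calls yield concat_50_PE.sam and concat_50_PE_Aligned.sortedByCoord.out.bam respectively.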

Variables

Templates use variables that are substituted with values according to the files currently being processed. Here is a list of variables that can be used inside command templates. Crossmapper treats any string surrounded by curly brackets, e.g. {var_name}, as a variable name that is substituted with the corresponding value before the command is executed.

Variables related to output directories

  • {base_dir} : the output directory given by the -o option of Crossmapper
  • {output_dir} : full path to the mapping output folder, which is {base_dir}/{mapper_name}_output

Variables related to input files

  • {1} and {2} : fastq files
  • {layout} : PE or SE
  • {read_len} : read length

Variables related to output files

  • {outputfile} : the output file name with its full path, {base_dir}/{mapper_name}_output/{outputfile_pattern}
  • {outputfile_pattern} : same as given in the template configuration or the default value
  • {outputfile_prefix} : concat_{read_len}_{layout}
  • {output_type} : output file format sam/bam

Variables related to index command

  • {gtf} : gtf file
  • {ref_fasta} : reference input fasta file
  • {ref_prefix} : prefix name to the genome index
  • {ref_dir} : reference index directory, this will be {base_dir}/{mapper_name}_index
  • {ref_index} : {ref_dir}/{ref_prefix}
  • {genome_len} : concatenated genome length calculated from ref_fasta

Using the basename qualifier, as in {out:basename}, excludes the path and extension and returns only the internal pattern with read_len and layout, e.g. concat_50_PE.

Variables related to other Crossmapper options

  • {n_threads} : Number of cores to be used for all multicore-supporting steps, -t/--threads Crossmapper option
  • {tmp_dir} : local temporary directory, -star_tmp Crossmapper option
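The substitution described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function render, the regular expression, and the sample paths are hypothetical and do not come from Crossmapper's source. A regular expression is used rather than str.format because names such as {1} and {2} would be read by str.format as positional indexes. Qualifiers such as {out:basename} are not handled by this sketch.

```python
import re

# Matches {name}, where name consists of letters, digits, or underscores.
_VAR = re.compile(r"\{(\w+)\}")


def render(command, values):
    """Substitute every {name} in `command` with values[name]."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            # Unknown names are reported instead of being left in the command.
            raise KeyError(f"unknown template variable: {{{name}}}")
        return str(values[name])

    return _VAR.sub(replace, command)


# Hypothetical values for a paired-end run with 50 bp reads.
values = {
    "ref_index": "out/bwa_index/concat",
    "1": "reads_1.fastq",
    "2": "reads_2.fastq",
    "outputfile": "out/bwa_output/concat_50_PE.sam",
}
print(render("bwa mem {ref_index} {1} {2} > {outputfile}", values))
```

Raising an error for unknown names mirrors the documented behavior that an unrecognized {var_name} causes an error rather than being passed through to the shell.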

Notes

YAML : brief introduction

The configuration file uses YAML (YAML Ain't Markup Language) syntax, a human-readable format that is easy to understand and work with. Basically, YAML files contain records of key:value pairs, for example mapper_name : mymapper, where mapper_name is the key of a configuration parameter and mymapper is its value. YAML files can also have sections, which are key:value records whose value is itself a more complex structure that can contain further key:value parameters. For example, the following part defines a template section that contains an index command and a both list of commands:

template:
  index : "some/cmd/to/index"
  both : 
     - "cmd1"
     - "cmd2"

Also note that to indicate that index is part of the template section, you must indent it with spaces; YAML does not allow tab indentation (see YAML Indentation).

YAML References

bash command in template

  • Bash Redirection : you can redirect stdout to a file in any template command, for example:
---
template:
   se : "bwa mem {ref_index} {1} > {outputfile}"

This redirects the stdout of bwa to {outputfile}. However, the current implementation of Crossmapper supports redirection of stdout only.

  • Bash Piping : the current implementation of Crossmapper does not support piping. If you need such behavior, you can use multiple commands in the template with an intermediate file. For the current requirements of Crossmapper, we did not find it important to add support for this feature.
  • Bash variables : the current implementation of Crossmapper does not support bash variables, so command templates should not use them. Anything between curly brackets, {var_name}, is interpreted as an internal Crossmapper variable as described in this documentation, and causes an internal error if Crossmapper does not recognize it.
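The restrictions above can be checked before running a template. The following Python sketch is a simple, hypothetical checker (lint_command is not part of Crossmapper) that flags pipes, bash variables, and redirections other than plain stdout. It relies on basic text matching only, so it may report false positives, for example for a pipe character inside a quoted argument.

```python
import re


def lint_command(command):
    """Return a list of unsupported bash features found in a command template."""
    problems = []
    if "|" in command:
        problems.append("pipe")
    # $NAME or ${NAME}: bash variables are not supported in templates.
    if re.search(r"\$\{?\w", command):
        problems.append("bash variable")
    # 2> and &> redirect stderr; only stdout redirection is supported.
    if re.search(r"(?:^|\s)(?:2>|&>)", command):
        problems.append("non-stdout redirection")
    return problems


print(lint_command("bwa mem {ref_index} {1} > {outputfile}"))
print(lint_command("bwa mem {ref_index} {1} | samtools sort"))
```

A command that is flagged for piping can usually be split into two template commands that pass data through an intermediate file.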