diff --git a/README.md b/README.md
index ecea1cf..cfc257c 100644
--- a/README.md
+++ b/README.md
@@ -24,437 +24,6 @@ module purge
 module load cufflinks samtools python/3.5
 ```
 
-### 3. Build Resources
+### 3. Usage
 
-METRO has a `build` sub command to create any required reference files for the `run` sub command. The `build` sub command will generate a reference file containing the transcriptome (i.e. CDS region of each transcript) in FASTA format from a genomic FASTA file and an annotation in GTF format. The sequence of each transcript annotated in the GTF file will be reported in this transcripts FASTA file. This file can then be provided to the `--transcripts` option of the `run` sub command. When the `build` sub command is executed, the transcripts FASTA file (named _transcripts.fa_) will be generated in the defined output directory.
-
-It is important to note that when building reference files for METRO, you should used the same genomic FASTA file and GTF file that was used to call and annotate your variants. If a transcript is reported in the MAF file but cannot be found in the provided GTF file, a warning message will be produced to standard error. This warning message may indicate that the genomic FASTA and/or the GTF file you provided is not correct.
-
-#### 3.1 Build Synposis
-
-The `./metro` executable is composed of several inter-related sub commands. Please see `./metro -h` for all available options. The synopsis for the `build` sub command shows its parameters and their usage. Optional parameters are shown in square brackets.
-
-```
-$ ./metro build [-h] --ref-fa REF_FA \
- --ref-gtf REF_GTF \
- --output OUTPUT
-```
-
-This part of the documentation describes options and concepts for the `./metro build` sub command in more detail. With minimal configuration, the build sub command enables you to build reference file for the `./metro run` sub command. Buidling refernce file for the run sub command is fast and easy! In its most basic form, `./metro build` only has _three required inputs_.
-
-#### 3.2 Required Build Arguments
-
-Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
-
-
- `--ref-fa REF_FA`
-> **Genomic FASTA file of the reference genome.**
-> *type: file*
->
-> This file represents the genome sequence of the reference assembly in FASTA format. This input file should not be compressed. Sequence identifers in this file must match with sequence identifers in the GTF file provided to `--ref-gtf`.
->
-> ***Example:***
-> `--ref-fa GRCh38.primary_assembly.genome.fa`
-
----
- `--ref-gtf REF_GTF`
-> **Gene annotation or GTF file for the reference genome.**
-> *type: file*
->
-> This file represents the reference genome's gene annotation in GTF format. This input file should not be compressed. Sequence identifers (column 1) in this file must match with sequence identifers in the FASTA file provided to `--ref-fa`.
->
-> ***Example:***
-> `--ref-gtf gencode.v36.primary_assembly.annotation.gtf`
-
----
- `--output OUTPUT`
-> **Output directory where reference files will be generated.**
-> *type: path*
->
-> This location is where the build pipeline will create all of its output files. If the user-provided path does not exist, it will be created automatically.
->
-> ***Example:***
-> `--output /scratch/$USER/refs/hg38_v36/`
-
-#### 3.3 Build Options
-
-Each of the following arguments are optional and do not need to be provided.
-
- `-h, --help`
-> **Display Help.**
-> *type: boolean*
->
-> Shows command's synopsis, help message, and an example command
->
-> ***Example:***
-> `--help`
-
-#### 3.4 Build Example
-
-Build reference files for the run sub comamnd.
-
-```bash
-# Step 0.) Grab an interactive node
-# Do not run on head node!
-srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
-module purge
-module load cufflinks samtools
-
-# Step 1.) Build METRO reference files
-metro build --ref-fa GRCm39.primary_assembly.genome.fa \
- --ref-gtf gencode.vM26.annotation.gtf \
- --output /scratch/$USER/METRO/refs/
-```
-
-### 4. Input METRO
-
-METRO has a `input` sub command to generate a MAF input file from one or more MAF-like files. Input files will be filtered and merged dependent on user parameters provided. These options include: `--vafFilter` which represents the minimum value for average VAF (calculated as t_alt_count/t_depth), `passFilter` which represents the minimum number of input files with a filter rating of "PASS", `--impactFilter` which represents minimum number of input files with an impact rating of either "MODERATE" or "HIGH".
-
-#### 4.1 Input Synopsis
-
-The `./metro` executable is composed of several inter-related sub commands. Please see `./metro -h` for all available options. The synopsis for the input sub command shows its parameters and their usage. Optional parameters are shown in square brackets.
-
-```
-$ ./metro input [-h] --mafFiles MAFFILES \
- --outputdir OUTPUTdir \
- --outputprefix OUTPUTprefix \
- [--vafFilter VAFFILTER] \
- [--passFilter PASSFILTER] \
- [--impactFilter IMPACTFILTER]
-```
-
-This part of the documentation describes options and concepts for `./metro input` sub command in more detail. With minimal configuration, the `input` sub command enables you to create filtered MAF files for the metro `run` pipeline.
-
-#### 4.2 Required Input Arguments
-
-Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
-
-
- `--mafFiles MAFFILES [MAFFILES ...]`
-> **Input MAF-like file(s) to process.**
-> *type: file*
->
-> Input VCF file(s) in MAF format. Provide a minimum of two files, separated by a comma.
->
-> ***Example:***
-> `--mafFiles data/test1.maf,data/test2.maf`
-
----
- `--outputdir OUTPUTDIR`
-> **Path to an output directory.**
-> *type: path*
->
-> This location is where the metro will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be created automatically.
->
-> ***Example:***
-> `--outputdir /scratch/$USER/RNA_hg38`
-
----
- `--outputprefix OUTPUTPREFIX`
-> **Output file prefix.**
-> *type: prefix*
->
-> Prefix for sample output files.
->
-> ***Example:***
-> ` --outputprefix test`
-
-#### 4.3 Input Options
-
-Each of the following arguments are optional and do not need to be provided. Default values listed in each example will be used, if value not provided.
-
- `-h, --help`
-> **Display Help.**
-> *type: boolean*
->
-> Shows command's synopsis, help message, and an example command
->
-> ***Example:***
-> `--help`
-
----
- `--vafFilter VAFFilter`
-> **Filter for VAF values.**
-> *type: numeric*
->
-> Minimum value for average VAF, calculated as t_alt_count/t_depth, to be included.
->
-> ***Example:***
-> `--vafFilter 0.2`
-
----
- `--passFilter PASSFILTER`
-> **Filter for PASS values.**
-> *type: numeric*
->
-> Minimum number of input files with a filter rating of "PASS", to be included.
->
-> ***Example:***
-> `--passFilter 2`
-
----
- `--impactFilter IMPACTFILTER`
-> **Filter for IMPACT values.**
-> *type: numeric*
->
-> Minimum number of input files with an IMPACT rating of "MODERATE" or "HIGH", to be included.
->
-> ***Example:***
-> `--impactFilter 2`
-
-#### 4.4 Input Example
-
-Filter MAF files in preparation of metro run.
-
-```bash
-# Step 0.) Grab an interactive node
-# Do not run on head node!
-srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
-module purge
-module load python/3.5
-
- # Step 1.) Run METRO to find mutated protein products
- ./metro input --mafFiles /data/*.maf \
- --outputdir /scratch/$USER/METRO \
- --outputprefix test \
- --vafFilter 0.2 \
- --passFilter 2 \
- --impactFilter 2
-```
-
-### 5. Run METRO
-
-METRO has a `run` sub command to generate a mutated amino acid sequence described by an HGVS term. METRO takes a MAF-like file containing HGVS terms describing a given mutation and a FASTA file containing transcript sequences to determine the mutated amino acid sequence. The build sub command can be used to generate a FASTA file containing CDS sequence of each transcript.
-
-METRO supports each major class of HGVS terms encoding for mutations in coding DNA sequences: substitution, deletion, insertions, duplications, and INDELS. METRO does not support HGVS tokenization of terms describing mutations in non-exonic (or non-CDS) regions like intronic or UTR regions.
-
-METRO will also truncate a given amino acid sequence +/- N amino acids relativve to a given mutation start site. This feature can be controlled via the '--subset' option.
-
-#### 5.1 Run Synopsis
-
-The `./metro` executable is composed of several inter-related sub commands. Please see `./metro -h` for all available options. The synopsis for the run sub command shows its parameters and their usage. Optional parameters are shown in square brackets.
-
-```
-$ ./metro run [-h] [--subset SUBSET] \
- --input INPUT [INPUT ...] \
- --transcripts TRANSCRIPTS \
- --output OUTPUT
-```
-
-This part of the documentation describes options and concepts for `./metro run` sub command in more detail. With minimal configuration, the `run` sub command enables you to start running metro pipeline.
-
-Setting up the metro is fast and easy! In its most basic form, `./metro run` only has _three required inputs_.
-
-#### 5.2 Required Run Arguments
-
-Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
-
-
- `--input INPUT [INPUT ...]`
-> **Input MAF-like file(s) to process.**
-> *type: file*
->
-> One or more MAF-like files can be provided. From the command-line, each input file should seperated by a space. Globbing is also supported! This makes selecting input files easier. Input MAF-like input files should be in an excel-like, CSV, or TSV format. For each input file a new output file will be generated in the specified output directory. Each file will end with the following extension: `.metro.tsv`.
->
-> ***Example:***
-> `--input data/*.xls*`
-
----
- `--output OUTPUT`
-> **Path to an output directory.**
-> *type: path*
->
-> This location is where the metro will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be created automatically.
->
-> ***Example:***
-> `--output /scratch/$USER/RNA_hg38`
-
----
- `--transcripts TRANSCRIPTS`
-> **Transcriptomic FASTA file.**
-> *type: file*
->
-> This reference file contains the sequence of each transcript in the reference genome. The file can be generated by running the build sub command, (i.e. /path/to/build/output/transcripts.fa). When creating this reference file, it is very important to use the same genomic FASTA and annotation file to call and annotate variants. Failure to use the correct reference file may result in multiple warnings and/or errors.
->
-> ***Example:***
-> ` --transcripts transcripts.fa`
-
-#### 5.3 Run Options
-
-Each of the following arguments are optional and do not need to be provided.
-
- `-h, --help`
-> **Display Help.**
-> *type: boolean*
->
-> Shows command's synopsis, help message, and an example command
->
-> ***Example:***
-> `--help`
-
----
- `--subset SUBSET`
-> **Subset resulting mutated amino acid sequences.**
-> *type: int*
->
-> If defined, this option will obtain the mutated amino acid sequence (AAS) +/- N amino acids of the mutation start site. By default, the first 30 upstream and downstream amino acids from the mutation site are recorded for non-frame shift mutations. Amino acids downstream of a frame shit mutation will be reported until the end of the amino acids sequence for the variants transcript or until the first reported terminating stop codon is found.
->
-> ***Example:***
-> `--subset 30`
-
-#### 5.4 Run Example
-
-Run metro with the references files generated in the build example.
-
-```bash
-# Step 0.) Grab an interactive node
-# Do not run on head node!
-srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
-module purge
-module load python/3.5
-
- # Step 1.) Run METRO to find mutated protein products
- ./metro run --input /data/*.xlsx \
- --output /scratch/$USER/METRO \
- --transcripts /scratch/$USER/METRO/refs/transcripts.fa \
- --subset 30
-```
-
-
-
-### 6. Predict METRO
-
-METRO has a `predict` sub command which utilizes the tool netMHCpan to predict the binding of peptides to any MHC molecule of known sequence using artificial neural networks (ANNs), then perform filtering of output based on user-provided parameters. This sub command uses the output of the `run` sub command for processing, submitting filtered file to netMHCpan, and then filtering the resulting files with user-provided parameters. Users select alleles of interest (`--alleleList`), length of kmers (`kmerLength`), length of peptide sequence (`--peptideLength`), binding affinity threshold ranges (`--highBind` and `--lowBind`).
-
-#### 6.1 Predict Synopsis
-
-The `./metro` executable is composed of several inter-related sub commands. Please see `./metro -h` for all available options. The synopsis for the predict sub command shows its parameters and their usage. Optional parameters are shown in square brackets.
-
-```
-$ ./ metro predict [-h] --mutationFile MUTATIONFILE \
- --alleleList ALLELELIST \
- --outputdir OUTPUTDIR \
- --outprefix OUTPREFIX \
- [--kmerLength KMERLENGTH] \
- [--peptideLength PEPTIDELENGTH] \
- [--highbind HIGHBIND] \
- [--lowbind LOWBIND]
-```
-
-This part of the documentation describes options and concepts for `./metro input` sub command in more detail. With minimal configuration, the `predict` sub command enables you to generate prediction files for each mutated sequence identified in the metro `run` sub command.
-
-#### 6.2 Required Predict Arguments
-
-Each of the following arguments are required. Failure to provide a required argument will result in a non-zero exit-code.
-
-
- `--mutationFile MUTATIONFILE [MUTATIONFILE ...]`
-> **Input TSV mutation file to process.**
-> *type: file*
->
-> Input file in tsv format. This can be the output of the METRO run command
->
-> ***Example:***
-> `--mutationFile data/test_Variant.metro.tsv`
->
-> ***Required headers:***
-> Required header (in any order):
-> - Variant_Classification
-> - Hugo_Symbol
-> - Transcript_ID
-> - WT_Subset_AA_Sequence
-> - Mutated_Subset_AA_Sequence
-
----
- `--outputdir OUTPUTDIR`
-> **Path to an output directory.**
-> *type: path*
->
-> This location is where the metro will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be created automatically.
->
-> ***Example:***
-> `--outputdir /scratch/$USER/RNA_hg38`
-
----
- `--outputprefix OUTPUTPREFIX`
-> **Output file prefix.**
-> *type: prefix*
->
-> Prefix for sample output files.
->
-> ***Example:***
-> `--outputprefix test`
-
----
- `--alleleList ALLELELIST`
-> **List of Alleles for netMHCpan input.**
-> *type: list*
->
-> Allele name(s) to input into netMHCpan. If this is a list, each allele is separated by commas and without spaces (max 20 per submission). For full list of alleles is available on netMHC's [website](https://services.healthtech.dtu.dk/services/NetMHCpan-4.1/MHC_allele_names.txt)
->
-> ***Example:***
-> `--alleleList H-2-Ld,H-2-Dd,H-2-Kb`
-
-#### 6.3 Predict Options
-
-Each of the following arguments are optional and do not need to be provided. Default values listed in each example will be used, if value not provided.
-
- `-h, --help`
-> **Display Help.**
-> *type: boolean*
->
-> Shows command's synopsis, help message, and an example command
->
-> ***Example:***
-> `--help`
-
----
- `--kmerLength KMERLENGTH`
-> **Length of kmer for netMHC input.**
-> *type: numeric*
->
-> Single value, or list, of the length of peptide sequence used for prediction analysis. If this is a list, each length is separated by a comma and without spaces.
->
-> ***Example:***
-> `--peptideLength 8,9,10,11`
-
----
- `--highbind HIGHBIND`
-> **Threshold for identifying STRONG affinity.**
-> *type: numeric*
->
-> Threshold to define binding affinity as "STRONG" for netHMC output. Must be an integer that is lower than `--lowbind`.
->
-> ***Example:***
-> `--highbind 0.5`
-
----
- `--lowbind LOWBIND`
-> **Filter for IMPACT values.**
-> *type: numeric*
->
-> Threshold to define binding affinity as "WEAK" for netHMC output. Must be an integer that is higher than `--highbind`.
->
-> ***Example:***
-> `--lowbind 2`
-
-#### 6.4 Predict Example
-
-Predict the binding of peptides to any MHC molecule of known sequence using artificial neural networks (ANNs) and perform filtering of output based on user-provided parameters.
-
-```bash
-# Step 0.) Grab an interactive node
-# Do not run on head node!
-srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb --cpus-per-task=4 --pty bash
-module purge
-module load python/3.5
-
- # Step 1.) Run METRO predict to find the binding of peptides to any MHC molecule
- ./metro predict \
- --mutationFile /scratch/$USER/METRO/test_Variant.asap.tsv \
- --allelList H-2-Ld,H-2-Dd,H-2-Kb \
- --peptideLength 8,9,10,11 \
- --kmerLength 21 \
- --outputdir /scratch/$USER/METRO/ \
- --outprefix test
-```
+For usage, and example code with test data, please visit the [METRO docs](https://ccbr.github.io/METRO/METRO/predict/) page.
\ No newline at end of file