An ensemble approach to accurately detect somatic mutations using SomaticSeq

SomaticSeq: An ensemble approach to accurately detect somatic mutations

The following is an example SomaticSeq command, to be run after the mutation caller jobs are complete:

$somaticseq/SomaticSeq.Wrapper.sh \
--output-dir       /PATH/TO/RESULTS/SomaticSeq_MVSDULPK \
--genome-reference /PATH/TO/GRCh38.fa \
--tumor-bam        /PATH/TO/HCC1395.bam \
--normal-bam       /PATH/TO/HCC1395BL.bam \
--dbsnp            /PATH/TO/dbSNP.GRCh38.vcf \
--cosmic           /PATH/TO/COSMIC.v77.vcf \
--mutect2          /PATH/TO/RESULTS/MuTect2.vcf \
--varscan-snv      /PATH/TO/RESULTS/VarScan2.snp.vcf \
--varscan-indel    /PATH/TO/RESULTS/VarScan2.indel.vcf \
--sniper           /PATH/TO/RESULTS/SomaticSniper.vcf \
--vardict          /PATH/TO/RESULTS/VarDict.vcf \
--muse             /PATH/TO/RESULTS/MuSE.vcf \
--lofreq-snv       /PATH/TO/RESULTS/LoFreq.somatic_final.snvs.vcf.gz \
--lofreq-indel     /PATH/TO/RESULTS/LoFreq.somatic_final.indels.vcf.gz \
--scalpel          /PATH/TO/RESULTS/Scalpel.vcf \
--strelka-snv      /PATH/TO/RESULTS/Strelka/results/variants/somatic.snvs.vcf.gz \
--strelka-indel    /PATH/TO/RESULTS/Strelka/results/variants/somatic.indels.vcf.gz \
--inclusion-region /PATH/TO/RESULTS/captureRegion.bed \
--exclusion-region /PATH/TO/RESULTS/blackList.bed
  • All input VCF files may be either .vcf or .vcf.gz.
  • Make sure all input files (VCF, BAM, FASTA, etc.) are sorted identically. The program does not check for proper ordering, so results from inconsistently sorted inputs will not be valid.
  • Additional parameters for training/prediction:
    • --truth-snv: ground truth VCF file for SNVs, if you have one
    • --truth-indel: ground truth VCF file for INDELs, if you have one
    • --ada-r-script: set to $somaticseq/r_scripts/ada_model_builder_ntChange.R to build classifiers (.RData files) when ground truths are supplied
    • --classifier-snv: classifier (.RData file) previously built for SNVs
    • --classifier-indel: classifier (.RData file) previously built for INDELs
    • --ada-r-script: set to $somaticseq/r_scripts/ada_model_predictor.R to make predictions with the classifiers specified above
  • Do not worry if Python throws the following warning. It occurs when SciPy attempts a statistical test on empty data; e.g., the z-score between reference- and variant-supporting reads will be NaN if there are no reference reads at a position.
      RuntimeWarning: invalid value encountered in double_scalars
      z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
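
To see why that warning is harmless, here is a minimal sketch of a rank-sum z-score in plain Python. The formula in the second line of the warning has this normal-approximation form. The function name `rank_sum_z` and the example values are hypothetical and are not taken from SomaticSeq's code; the sketch only illustrates that an empty group turns the expression into 0/0, which is reported as NaN.

```python
import math


def rank_sum_z(ref_values, alt_values):
    """Rank-sum z-score (normal approximation); NaN if either group is empty."""
    n1, n2 = len(ref_values), len(alt_values)

    # Assign 1-based ranks over the pooled values, averaging ranks across ties.
    positions = {}
    pooled = sorted(list(ref_values) + list(alt_values))
    for rank, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(rank)
    mean_rank = {value: sum(r) / len(r) for value, r in positions.items()}

    s = sum(mean_rank[value] for value in ref_values)
    expected = n1 * (n1 + n2 + 1) / 2.0
    variance = n1 * n2 * (n1 + n2 + 1) / 12.0

    if variance == 0:
        # An empty group makes numerator and denominator both zero (0/0).
        return math.nan
    return (s - expected) / math.sqrt(variance)


# Hypothetical per-read values: no reference reads at this position.
z_empty = rank_sum_z([], [30, 32, 35])
# Both groups populated, so the z-score is a finite number.
z_both = rank_sum_z([1, 2, 3], [4, 5, 6])
```

A NaN here simply means the test could not be computed for that position; it does not indicate a failed run.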
    
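
Because the program does not check input ordering itself, a pre-flight check can be worth running. The following is a rough sketch, assuming the reference has a FASTA index (.fai) whose first column lists contig names in reference order, and that VCF records are tab-separated with CHROM and POS in the first two columns. Both helper names are hypothetical and are not part of SomaticSeq.

```python
def contig_order_from_fai(fai_lines):
    """Return contig names in the order they appear in a .fai index."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def vcf_follows_reference_order(vcf_lines, contig_order):
    """True if records are sorted by reference contig order, then by position."""
    rank = {name: i for i, name in enumerate(contig_order)}
    previous = (-1, 0)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos = line.split("\t")[:2]
        if chrom not in rank:
            return False  # contig missing from the reference index
        current = (rank[chrom], int(pos))
        if current < previous:
            return False  # record appears before its predecessor
        previous = current
    return True


# Tiny in-memory example; real use would iterate over opened files instead.
fai = ["chr1\t248956422\t112\t70\t71", "chr2\t242193529\t252513167\t70\t71"]
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tT",
    "chr1\t250\t.\tG\tC",
    "chr2\t50\t.\tC\tG",
]
is_sorted = vcf_follows_reference_order(vcf, contig_order_from_fai(fai))
```

The same comparison against the reference index can be repeated for each caller's VCF. Compressed .vcf.gz inputs would need to be opened with `gzip.open(path, "rt")` first.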

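The training and prediction parameters listed above use the same --ada-r-script flag with different R scripts, so it can help to assemble them programmatically. Below is a small sketch that builds the extra arguments for one mode or the other. The function name is hypothetical, and treating the two modes as mutually exclusive is an assumption of this sketch, not something the wrapper is documented to enforce. Only flags and script paths that appear in the list above are used.

```python
def training_or_prediction_args(somaticseq_dir, truth_snv=None, truth_indel=None,
                                classifier_snv=None, classifier_indel=None):
    """Build the extra wrapper arguments for training or for prediction."""
    have_truth = bool(truth_snv or truth_indel)
    have_classifier = bool(classifier_snv or classifier_indel)
    if have_truth and have_classifier:
        # Assumption of this sketch: one --ada-r-script value, so one mode per run.
        raise ValueError("supply ground truths or classifiers, not both")

    args = []
    if have_truth:
        if truth_snv:
            args += ["--truth-snv", truth_snv]
        if truth_indel:
            args += ["--truth-indel", truth_indel]
        script = somaticseq_dir + "/r_scripts/ada_model_builder_ntChange.R"
        args += ["--ada-r-script", script]
    elif have_classifier:
        if classifier_snv:
            args += ["--classifier-snv", classifier_snv]
        if classifier_indel:
            args += ["--classifier-indel", classifier_indel]
        script = somaticseq_dir + "/r_scripts/ada_model_predictor.R"
        args += ["--ada-r-script", script]
    return args


# Hypothetical paths, for illustration only.
train_args = training_or_prediction_args("/opt/somaticseq", truth_snv="truth.snv.vcf")
```

The returned list can be appended to the argument list of the example command shown earlier.
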
Pipelines and Workflows

For a quick description of SomaticSeq, you may watch this 8-minute video: SomaticSeq Video

Project page: http://bioinform.github.io/somaticseq/
