updated README and help documentation.

Tested with sample data.
stschiff · Sep 9, 2019 · f908aad · f908aad
1 parent 54921d7
commit f908aad
Show file tree

Hide file tree

Showing 2 changed files with 31 additions and 25 deletions.
diff --git a/README.md b/README.md
@@ -16,15 +16,15 @@ The main tool in this repository is the program `pileupCaller` to sample alleles
 
 Important Note: You should definitely use the `-B` flag, which disables base alignment quality recalibration. This mechanism is turned on by default and causes huge reference bias with low coverage ancient DNA data. This flag disables the mechanism.
 
-To input the SNP positions you can either use a Text file with all positions or a bed file (see samtools manual). The output is a simple text file with all positions that could be genotyped in the three samples.
+In the above command line, the file "list_of_positions.txt" should either contain positions (0-based) or a bed file (see samtools manual for details). The output is a simple text file with all positions that could be genotyped in the three samples.
 
 Next, you need to run my tool pileupCaller, which you run like this:
 
-    pileupCaller --sampleNames Sample1,Sample2,Sample3 \
+    pileupCaller --randomHaploid --sampleNames Sample1,Sample2,Sample3 \
         --samplePopName MyPop -f <Eigenstrat.snp> \
-        -o EigenStrat -e <My_output_prefix> < pileup.txt
+        -e <My_output_prefix> < pileup.txt
 
-Here, options –sampleNames gives the names of the samples that is output in the Eigenstrat `*.ind` file, and and `–samplePopName` is optional to also give the population names in that file (defaults to `Unknown`, you can also change it later in the output). Then, option `-f` gives an Eigenstrat positions file. This is important because the pileup file only contains sites which could be called in at least one of your samples. In order to later merge your dataset with another Eigenstrat file, pileupCaller will check every position in the other Eigenstrat file to make sure every position is output with the correct alleles and missing genotypes if appropriate. Finally, the -o option gives the output format (here Eigenstrat) and the `-e` option gives the prefix for the `*.ind`, `*.pos` and `*.geno` files. Note that this is specific for Eigenstrat output. For FreqSum output, the results will simply be written out to standard out.
+Here, options `--sampleNames` gives the names of the samples that is output in the Eigenstrat `*.ind` file, and and `-–samplePopName` is optional to also give the population names in that file (defaults to `Unknown`, you can also change it later in the output). Then, option `-f` gives an Eigenstrat positions file. This is important because the pileup file only contains sites which could be called in at least one of your samples. In order to later merge your dataset with another Eigenstrat file, pileupCaller will check every position in the other Eigenstrat file to make sure every position is output with the correct alleles and missing genotypes if appropriate. Finally, the -o option gives the output format (here Eigenstrat) and the `-e` option gives the prefix for the `*.ind`, `*.pos` and `*.geno` files. Note that this is specific for Eigenstrat output. For FreqSum output, the results will simply be written out to standard out. Note that without specifying `-e`, the output format will be freqSum format, described [here](https://rarecoal-docs.readthedocs.io/en/latest/rarecoal-tools.html#vcf2freqsum), which is useful for debugging your pipeline, since it's just a single file that is output into the terminal and can therefore easily be inspected.
 
 You can also get some help by typing `pileupCaller -h`, which shows a lot more option, for example the sampling method, minimal coverage and other important options.
 
@@ -33,9 +33,9 @@ Note that you can also fuse the two steps above into one unix pipe:
     samtools mpileup -R -B -q30 -Q30 -l <list_of_positions.txt> \
         -f <reference_genome.fasta> \
         Sample1.bam Sample2.bam Sample3.bam | \
-    pileupCaller --sampleNames Sample1,Sample2,Sample3 \
+    pileupCaller --randomHaploid --sampleNames Sample1,Sample2,Sample3 \
         --samplePopName MyPop -f <Eigenstrat.snp> \
-        -o EigenStrat -e <My_output_prefix>
+        -e <My_output_prefix>
 
 There is however an issue here: If you have aligned your read data to a version of the reference genome that uses `chr1`, `chr2` and so on as chromosome names, the resulting Eigenstrat file will be valid, but won't merge with other Eigenstrat datasets that use chromosome names `1`, `2` and so on. I would therefore recommend to strip the `chr` from your chromosome names if necessary. You can do that easily using a little UNIX filter using the `sed` tool. In the full pipeline, it looks like this:
 

diff --git a/src-executables/pileupCaller.hs b/src-executables/pileupCaller.hs
@@ -78,30 +78,30 @@ argParser = ProgOpt <$> parseCallingMode <*> parseKeepIncongruentReads <*> parse
         OP.help "Sample two reads at random (without replacement) at each site and represent the \
         \individual by a diploid genotype constructed from those two random \
         \picks. This will always assign missing data to positions where only \
-        \one read is present, even if minDepth=1.")
+        \one read is present, even if minDepth=1. The main use case for this \
+        \option is for estimating mean heterozygosity across sites.")
     parseMajorityCalling = MajorityCalling <$> (parseMajorityCallingFlag *> parseDownsamplingFlag)
     parseMajorityCallingFlag = OP.flag' True (OP.long "majorityCall" <> OP.help
         "Pick the allele supported by the \
         \most reads at a site. If an equal numbers of alleles fulfil this, pick one at \
-        \random. This results in a haploid call.")
+        \random. This results in a haploid call. See --downSampling for best practices \
+        \for calling rare variants")
     parseDownsamplingFlag = OP.switch (OP.long "downSampling" <> OP.help "When this switch is given, \
         \the MajorityCalling mode will downsample \
         \from the total number of reads a number of reads \
         \(without replacement) equal to the --minDepth given. This mitigates \
         \reference bias in the MajorityCalling model, which increases with higher coverage. \
         \The recommendation for rare-allele calling is --majorityCall --downsampling --minDepth 3")
     parseKeepIncongruentReads = OP.switch (OP.long "keepIncongruentReads" <> OP.help "By default, \
-        \pileupCaller removes reads with alleles that are neither of the two alleles specified in the SNP file. \
-        \Any sampling (depending on the calling mode) is therefore by default constrained to reads \
-        \supporting the reference an alternative allele, and any other alleles (for example at triallelic sites or \
-        \due to damage) are removed. With this flag given, sampling is performed unconstrained on all reads. If \
-        \a read is sampled with an allele that is neither of the two given alleles, a missing genotype is generated. \
-        \IMPORTANT NOTE: The default behaviour has changed in pileupCaller. If you want to emulate the previous \
+        \pileupCaller now removes reads with tri-allelic alleles that are neither of the two alleles specified in the SNP file. \
+        \To keep those reads for sampling, set this flag. With this option given, if \
+        \the sampled read has a tri-allelic allele that is neither of the two given alleles in the SNP file, a missing genotype is generated. \
+        \IMPORTANT NOTE: The default behaviour has changed in pileupCaller version 1.4.0. If you want to emulate the previous \
         \behaviour, use this flag. I recommend now to NOT set this flag and use the new behaviour.")
     parseSeed = OP.option (Just <$> OP.auto) (OP.long "seed" <>
         OP.value Nothing <> OP.metavar "<RANDOM_SEED>" <>
         OP.help "random seed used for the random number generator. If not given, use \
-        \system clock to seed the random number generator")
+        \system clock to seed the random number generator.")
     parseMinDepth = OP.option OP.auto (OP.long "minDepth" <> OP.short 'd' <>
         OP.value 1 <> OP.showDefault <> OP.metavar "<DEPTH>" <>
         OP.help "specify the minimum depth for a call. For sites with fewer \
@@ -114,7 +114,7 @@ argParser = ProgOpt <$> parseCallingMode <*> parseKeepIncongruentReads <*> parse
     parseSingleStrandMode = OP.flag' SingleStrandMode (OP.long "singleStrandMode" <>
         OP.help "At C/T polymorphisms, ignore reads aligning to the forward strand. \
         \At G/A polymorphisms, ignore reads aligning to the reverse strand. This should \
-        \remove ancient-DNA damage in ancient DNA libraries prepared with the single-stranded protocoll.")
+        \remove post-mortem damage in ancient DNA libraries prepared with the non-UDG single-stranded protocol.")
     parseSnpFile = OP.strOption (OP.long "snpFile" <> OP.short 'f' <>
         OP.metavar "<FILE>" <> OP.help "an Eigenstrat-formatted SNP list file for \
         \the positions and alleles to call. All \
@@ -128,14 +128,18 @@ argParser = ProgOpt <$> parseCallingMode <*> parseKeepIncongruentReads <*> parse
         \ordered by chromosome and position, and this is checked. The chromosome order in humans is 1-22,X,Y,MT. \
         \Chromosome can generally begin with \"chr\". In case of non-human data with different chromosome \
         \names, you should convert all names to numbers. They will always considered to \
-        \be numerically ordered, even beyond 22.")
+        \be numerically ordered, even beyond 22. Finally, I note that for internally, \
+        \X is converted to 23, Y to 24 and MT to 90. This is the most widely used encoding in Eigenstrat \
+        \databases for human data, so using a SNP file with that encoding will automatically be correctly aligned \
+        \to pileup data with actual chromosome names X, Y and MT (or chrX, chrY and chrMT, respectively).")
     parseFormat = (EigenstratFormat <$> parseEigenstratPrefix <*> parseSamplePopName) <|> pure FreqSumFormat
     parseEigenstratPrefix = OP.strOption (OP.long "eigenstratOut" <> OP.short 'e' <>
         OP.metavar "<FILE_PREFIX>" <>
         OP.help "Set Eigenstrat as output format. Specify the filenames for the EigenStrat \
         \SNP and IND file outputs: <FILE_PREFIX>.snp.txt and <FILE_PREFIX>.ind.txt \
-        \If not set, output will be FreqSum (Default). Note that freqSum format, described at\
-        \... is useful for testing your pipeline, since it's output to standard out")
+        \If not set, output will be FreqSum (Default). Note that freqSum format, described at \
+        \https://rarecoal-docs.readthedocs.io/en/latest/rarecoal-tools.html#vcf2freqsum, \
+        \is useful for testing your pipeline, since it's output to standard out")
     parseSampleNames = parseSampleNameList <|> parseSampleNameFile
     parseSampleNameList = OP.option (Left . splitOn "," <$> OP.str)
         (OP.long "sampleNames" <> OP.metavar "NAME1,NAME2,..." <>
@@ -153,19 +157,21 @@ programHelpDoc :: PP.Doc
 programHelpDoc =
     part1 PP.<$>
     PP.enclose PP.line PP.line (PP.indent 4 samtoolsExample) PP.<$>
-    part2
+    part2 PP.<$> PP.line PP.<$>
+    PP.string versionInfoText
   where
     part1 = PP.fillSep . map PP.text . words $
-        "PileupCaller is a simple tool to create genotype calls from bam files. \
-        \You need to convert bam files into the mpileup-format, specified at \
+        "PileupCaller is a tool to create genotype calls from bam files using read-sampling methods. \
+        \To use this tool, you need to convert bam files into the mpileup-format, specified at \
         \http://www.htslib.org/doc/samtools.html (under \"mpileup\"). The recommended command line \
         \to create a multi-sample mpileup file to be processed with pileupCaller is"
     samtoolsExample = PP.hang 4 . PP.fillSep . map PP.text . words $
-        "samtools mpileup -B -q30 -Q30 -l <BED_FILE> -f <FASTA_REFERENCE_FILE> \
+        "samtools mpileup -B -q30 -Q30 -l <BED_FILE> -R -f <FASTA_REFERENCE_FILE> \
         \Sample1.bam Sample2.bam Sample3.bam | pileupCaller ..."
     part2 = PP.fillSep . map PP.text . words $
-        "Note that flag -B in samtools is very important to reduce reference \
-        \bias in low coverage data. " ++ versionInfoText
+        "You can lookup what these options do in the samtools documentation. \
+        \Note that flag -B in samtools is very important to reduce reference \
+        \bias in low coverage data."
 
 initialiseEnvironment :: ProgOpt -> IO Env
 initialiseEnvironment args = do