This directory contains data from simulated hybrid genome with 5% divergence between parentals and 40% of loss of heterozygosity (LOH).
To run the test example, execute:
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Table of contents
Below, you can find detailed description for regeneration of test dataset with any input genome.
Test dataset is based on 100 Kb region from C. parapsilosis CDC317 [HE605202:100000-200000].
samtools faidx CANPA.fa HE605202:100000-200000 > ref.fa
Heterozygous regions, 5% divergence and 40% LOH, have been simulated with fasta2diverged.py.
fasta2diverged.py -v -i ref.fa --loh 0.40 --dna -d 0.05 > ref.d05.l40.fa
Two shotgun sequencing libraries including typical Illumina-related errors were simulated using GemSIM:
- 100 bp paired-end reads with 600 bp insert [200X coverage]:
- 600_1.fq.gz
- 600_2.fq.gz
- 50 bp mate-pair reads with 5 kb insert [20X coverage]:
- 5000_1.fq.gz
- 5000_2.fq.gz
# generate SNPs
~/src/GemSIM_v1.6/GemHaps.py -r ref.fa -g '.50,0 .50,6'
# 2x 50,000 reads
## 100 bp paired-end with 600 bp insert +/- 50 bp
~/src/GemSIM_v1.6/GemReads.py -r ref.fa -n 50000 -g ref.txt -l 100 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 600 -u 600 -s 50 -p
~/src/GemSIM_v1.6/GemReads.py -r ref.d05.l40.fa -n 50000 -g ref.txt -l 100 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 600.d05.l40 -u 600 -s 50 -p
# 2x 10,000 reads
## 50 bp mate-pairs with 5 kb insert +/- 500 bp
~/src/GemSIM_v1.6/GemReads.py -r ref.fa -n 10000 -g ref.txt -l 50 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 5000 -u 5000 -s 500 -p
~/src/GemSIM_v1.6/GemReads.py -r ref.d05.l40.fa -n 10000 -g ref.txt -l 50 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 5000.d05.l40 -u 5000 -s 500 -p
# combine both libraries in random order
## 600_
paste <(zcat 600_1.fastq.gz 600.d05.l40_1.fastq.gz) <(zcat 600_2.fastq.gz 600.d05.l40_2.fastq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random.600.d05.l40_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random.600.d05.l40_2.fq.gz"}'
## 5000_
paste <(zcat 5000_1.fastq.gz 600.d05.l40_1.fastq.gz) <(zcat 5000_2.fastq.gz 5000.d05.l40_2.fastq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random.5000.d05.l40_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random.5000.d05.l40_2.fq.gz"}'
# symlinks with handier names
for f in *.fq.gz; do ln -s $f `echo $f | sed 's/random.//;s/.d05.l40//'`; done
# mate-pairs contaminated with paired-end reads
#for i in 1 2; do zcat <(zcat test/5000_$i.fq.gz test/5000_$i.fq.gz | fastx2reverse_complement.py fastq | gzip) test/600_$i.fq.gz > test/mixed.$i.fq.gz; done
Simulated reads were assembled with SPAdes v.3.5.0.
dipspades.py --only-assembler -t 4 -o 600 -1 600_1.fq.gz -2 600_2.fq.gz
Finally, heterozygous genome assembly pipeline can be applied to simulated heterozygous contigs and paired/mate-pairs reads.
./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
This is optional step performed only if no contigs (-f
) are given.
Redundans uses Platanus for contigs assembly, scaffolding and gap closing.
In first order, Redundans uses libraries with reads between 70-150 bp. If no such libraries are available,
libraries with other read lenghts are used, but only libraries with cummulative sequence length of
at least 2/3 of the largest library will be used.
Note, assembly is be performed in single-end mode and only resulting contigs (not scaffolds) are going to be used.
Subsequently, library statistics are estimated and if FR paired-end libraries are detected,
Redundans will proceed with scaffolding and gap closing of de novo contigs using Platanus.
This shouldn't be mixed with Redundans scaffolding and gap closing modules - those are using different software and set of parameters.
At the beginning, Redundans estimates number of parameters:
- number of reads that it's going to align (based on
-l/--limit
parameter)
Aligning 32779 mates per library...
- library statistics:
- insert size: median, mean, stdev
- mates orientation: FF, FR, RF, RR
FastQ files | read length | median | mean | stdev | FF | FR | RF | RR |
---|---|---|---|---|---|---|---|---|
test/5000_1.fq.gz test/5000_2.fq.gz | 50 | 4998 | 4990.20 | 721.47 | 0 | 4,674 | 0 | 0 |
test/600_1.fq.gz test/600_2.fq.gz | 100 | 599 | 598.63 | 47.68 | 0 | 10,000 | 0 | 0 |
Above, you can see, there are two libraries:
- test/5000_1.fq.gz & test/5000_2.fq.gz:
- FR orientation
- mate-pairs with 5000 bp insert size +/- 600 bp
- test/600_1.fq.gz & test/600_2.fq.gz:
- FR orientation
- paired-end with 600 bp insert size +/- 39 bp
Based on these statistics, the program will perform scaffolding and gap closing starting from libraries with the smallest insert size.
Also at this stage, the user is notified about libraries with poor statistics i.e. large stdev or not consisted orientation. For these cases, the statistics are updated before every scaffolding iteration. This is done, because for libraries with large insert size, the estimation of insert size may be highly affected by the fragmentation of the assembly.
In the first step, Redundans will recognise and selectively remove heterozygous contigs from initial assembly.
The details about all removed heterozygous contigs are stored in tab-delimited file contigs.reduced.fa.hetero.tsv
.
Additionally, the histogram with identities of reduced contigs are saved as contigs.reduced.fa.hist.png
.
At the end of this step, the user will be notified about:
- initial assembly size and fragmentation
- what portion of the the initial assembly have been recognised as heterozygous
- identity between heterozygous regions
- size and fragmentation of reduced assembly
file name | genome size | contigs | heterozygous size | [%] | heterozygous contigs | [%] | identity [%] | possible joins | homozygous size | [%] | homozygous contigs | [%] |
---|---|---|---|---|---|---|---|---|---|---|---|---|
test/run6/contigs.fa | 163897 | 245 | 76363 | 46.59 | 222 | 90.61 | 94.837 | 0 | 97319 | 59.38 | 23 | 9.39 |
Above, you can see that:
- initial SPAdes assembly was fragmented (245 contigs) and larger (163.8 Kb) than reference sequence (100 Kb)
- Redundans:
- removed 220 heterozygous contigs, resulting in reduced assembly of 97.8 Kb (very similar to reference) in 28 contigs
- estimated 94.837% identity between heterozygous regions, which is very similar to the divergence that was used for simulations (5%)
Redundans repeat scaffolding on each library several times (--iters
). For each scaffolding iteration, the program reports number of:
- processed pairs
- pairs passing alignment quality
- pairs that aligned in different contigs
iteration 1.1 ...
19572 pairs. 18505 passed filtering [94.55%]. 1993 in different contigs [10.18%].
iteration 1.2 ...
19572 pairs. 18569 passed filtering [94.88%]. 340 in different contigs [1.74%].
iteration 2.1 ...
19572 pairs. 17009 passed filtering [86.90%]. 755 in different contigs [3.86%].
closing gaps ...
iteration 2.2 ...
19572 pairs. 17261 passed filtering [88.19%]. 0 in different contigs [0.00%].
closing gaps ...
Above, you can see that in iteration 1.1 above 10% of reads aligned in different contigs, while in subsequent iteration, less than 2% reads. This informs that in general it's enough to use 1-2 iterations of scaffolding per library.
For the time-being, only the iteration number is reported.
iteration 1.1 ...
iteration 1.2 ...
iteration 2.1 ...
iteration 2.2 ...
At the end, the program reports details of each step:
- file name
- fragmentation and size of the assembly at this stage
- %GC content
- fragmentation and size of assembly in contigs >1 Kb
- N50 & N90
- cumulative size of gaps
- the size of the longest contig
fname | contigs | bases | GC [%] | contigs >1kb | bases in contigs >1kb | N50 | N90 | Ns | longest |
---|---|---|---|---|---|---|---|---|---|
test/run6/contigs.fa | 245 | 163897 | 40.298 | 24 | 117391 | 3975 | 233 | 0 | 29603 |
test/run6/contigs.reduced.fa | 23 | 97319 | 39.333 | 17 | 94157 | 7321 | 2195 | 0 | 29603 |
test/run6/_sspace.1.1.fa | 4 | 97242 | 39.315 | 3 | 96924 | 87727 | 87727 | 473 | 87727 |
test/run6/_sspace.1.2.fa | 3 | 97321 | 39.315 | 2 | 97003 | 87727 | 87727 | 552 | 87727 |
test/run6/_sspace.2.1.fa | 1 | 100280 | 39.315 | 1 | 100280 | 100280 | 100280 | 3511 | 100280 |
test/run6/_sspace.2.1.filled.fa | 1 | 100251 | 39.363 | 1 | 100251 | 100251 | 100251 | 2782 | 100251 |
test/run6/_sspace.2.2.fa | 1 | 100251 | 39.363 | 1 | 100251 | 100251 | 100251 | 2782 | 100251 |
test/run6/_sspace.2.2.filled.fa | 1 | 100123 | 39.360 | 1 | 100123 | 100123 | 100123 | 2677 | 100123 |
test/run6/scaffolds.fa | 1 | 100123 | 39.360 | 1 | 100123 | 100123 | 100123 | 2677 | 100123 |
test/run6/_gapcloser.1.1.fa | 1 | 100495 | 39.542 | 1 | 100495 | 100495 | 100495 | 1265 | 100495 |
test/run6/_gapcloser.1.2.fa | 1 | 100498 | 39.579 | 1 | 100498 | 100498 | 100498 | 975 | 100498 |
test/run6/_gapcloser.2.1.fa | 1 | 100502 | 39.679 | 1 | 100502 | 100502 | 100502 | 4 | 100502 |
test/run6/_gapcloser.2.2.fa | 1 | 100506 | 39.679 | 1 | 100506 | 100506 | 100506 | 4 | 100506 |
test/run6/scaffolds.filled.fa | 1 | 100506 | 39.679 | 1 | 100506 | 100506 | 100506 | 4 | 100506 |
Accuracy of recovered contigs can be assessed by alignment of final scaffolds (run1/scaffolds.filled.fa
) back onto reference (ref.fa
).
This can be quickly achieved with nucmer (works for small genomes):
cd run1
# align
nucmer -p nucmer.ref ../ref.fa scaffolds.filled.fa
# generate pairwise plot
mummerplot --png --large nucmer.ref.delta -p nucmer.ref.plot
# open plot
eog nucmer.ref.plot.png
Figure 1: Pairwise alignment of reference sequence and redundans scaffolds
You may want to compare final redundans scaffolds (run1/scaffolds.filled.fa
) with initial SPAdes contigs (contigs.fa
).
cd run1
# align
nucmer -p nucmer.contigs scaffolds.filled.fa ../contigs.fa
# generate pairwise plot
mummerplot --png --large nucmer.contigs.delta -p nucmer.contigs.plot
# open plot
eog nucmer.contigs.plot.png
Figure 2: Pairwise alignment of redundans scaffolds and SPAdes contigs
Finally, comparison of consensus contigs from dipSPAdes (consensus_contigs.fa
) and reference sequnece (ref.fa
) informs about missing regions in dipSPAdes reconstruction.
# align
nucmer -p nucmer.ref_dipspades ref.fa consensus_contigs.fa
# generate pairwise plot
mummerplot --png --large nucmer.ref_dipspades.delta -p nucmer.ref_dipspades.plot
# open plot
eog nucmer.ref_dipspades.plot.png
Figure 3: Pairwise alignment of reference sequence and dipSPAdes consensus contigs
The same can be done using LAST (this will work also for large genomes):
# index reference genome (need to be done only once per each reference)
lastdb ../ref.fa ../ref.fa
# align & generate dotplot on the fly
lastal -f TAB ../ref.fa scaffolds.filled.fa | last-dotplot - scaffolds.filled.fa.png
# open plot
eog scaffolds.filled.fa.png
Figure 4: Pairwise alignment of reference sequence and Redundans contigs
All programs/scripts not disclosed in this repository (i.e. fasta2diverged.py), can be found in https://github.com/lpryszcz/bin.