This repository has been archived by the owner on Nov 17, 2021. It is now read-only.

Multiprocess get_quality_distribution by splitting FASTQ input #1

Open

wants to merge 5 commits into master

Conversation


@yihchii yihchii commented Apr 24, 2019

Highlights of this PR:

  1. Split the input FASTQ into smaller chunks so that error_prob can be calculated on multiple processors in parallel.
  2. Instead of holding the error probability of every read in memory, record those values on disk (in files) to reduce the memory load.
  3. Use AWK to calculate the mean and standard deviation of the error probability distribution, as well as the coverage of the sequencing data.
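The chunk-and-parallelize approach in items 1 and 2 can be sketched in pure Python (the PR itself uses AWK for the final mean/stdev step; the helper names, the per-read error-probability formula, and the in-memory chunking below are illustrative assumptions, not the PR's actual code):

```python
import math
from multiprocessing import Pool

def read_error_prob(quality_string, offset=33):
    # Mean per-base error probability from a Phred+33 quality string:
    # p = 10^(-Q/10), where Q = ord(char) - 33.
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return sum(probs) / len(probs)

def process_chunk(quality_strings):
    # Worker: compute the error probability of every read in one chunk.
    # (The PR writes these per-chunk results to files; here they are
    # returned in memory for brevity.)
    return [read_error_prob(q) for q in quality_strings]

def quality_distribution(quality_strings, n_workers=4, chunk_size=1000):
    # Split the reads into chunks and fan them out over a process pool,
    # then reduce the per-read probabilities to mean and stdev.
    chunks = [quality_strings[i:i + chunk_size]
              for i in range(0, len(quality_strings), chunk_size)]
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    probs = [p for chunk in results for p in chunk]
    mean = sum(probs) / len(probs)
    stdev = math.sqrt(sum((p - mean) ** 2 for p in probs) / len(probs))
    return mean, stdev
```

For example, a quality string of all `I` characters (Q = 40) yields a per-read error probability of 1e-4.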

yihchii added 5 commits April 24, 2019 11:45
…er_chunk), and variable to different sizes of genome for coverage calculation (captured_size_in_bp)
…king error prob per read. Use awk for calculating the mean and stdev. Split reads first to use multiple processors
@AndrewCarroll
Contributor

Hi Yih-Chii,

These are all valid methods to speed up getting the quality distribution for large numbers of reads. However, as an alternative, I would recommend first exploring the --maximum-reads parameter in the get_quality_distribution file:

parser.add_argument('--maximum-reads', help="Compute statistics on at most this many reads", action="store", dest="maximum_reads", required=False)

Empirically, I have found that the distribution of a few million reads is very representative of the entire file. It may be more worthwhile to use that parameter than to parallelize the calculation over the full file.
