This repository has been archived by the owner on Nov 17, 2021. It is now read-only.

Multiprocess get_quality_distribution by splitting FASTQ input #1

Open

wants to merge 5 commits into master

Conversation


@yihchii yihchii commented Apr 24, 2019

Highlights of this PR:

  1. Split the input FASTQ into smaller chunks so that error_prob can be calculated on multiple processors in parallel.
  2. Instead of holding the error probability of every read in memory, record those values on disk (in files) to reduce the memory load.
  3. Use AWK to calculate the mean and standard deviation of the error probability distribution, as well as the coverage of the sequencing data.
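The chunk-and-parallelize approach in items 1 and 2 can be sketched in pure Python (the PR itself uses AWK for the final mean/stdev step; the helper names, the per-read error-probability formula, and the in-memory chunking below are illustrative assumptions, not the PR's actual code):

```python
import math
from multiprocessing import Pool

def read_error_prob(quality_string, offset=33):
    # Mean per-base error probability from a Phred+33 quality string:
    # p = 10^(-Q/10), where Q = ord(char) - 33.
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return sum(probs) / len(probs)

def process_chunk(quality_strings):
    # Worker: compute the error probability of every read in one chunk.
    # (The PR writes these per-chunk results to files; here they are
    # returned in memory for brevity.)
    return [read_error_prob(q) for q in quality_strings]

def quality_distribution(quality_strings, n_workers=4, chunk_size=1000):
    # Split the reads into chunks and fan them out over a process pool,
    # then reduce the per-read probabilities to mean and stdev.
    chunks = [quality_strings[i:i + chunk_size]
              for i in range(0, len(quality_strings), chunk_size)]
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    probs = [p for chunk in results for p in chunk]
    mean = sum(probs) / len(probs)
    stdev = math.sqrt(sum((p - mean) ** 2 for p in probs) / len(probs))
    return mean, stdev
```

For example, a quality string of all `I` characters (Q = 40) yields a per-read error probability of 1e-4.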

yihchii added 5 commits April 24, 2019 11:45
…er_chunk), and variable to different sizes of genome for coverage calculation (captured_size_in_bp)
…king error prob per read. Use awk for calculating the mean and stdev. Split reads first to use multiple processors
@AndrewCarroll
Contributor

Hi Yih-Chii,

These are all valid methods to speed up getting the quality distribution for large numbers of reads. However, as an alternative, I would recommend first exploring the --maximum-reads parameter in the get_quality_distribution file:

parser.add_argument('--maximum-reads', help="Compute statistics on at most this many reads", action="store", dest="maximum_reads", required=False)

Empirically, I have found that the distribution of a few million reads is very representative of the entire file. It may be more worthwhile to use that parameter than to parallelize the calculation over the full file.
