
HPP Production Workflows

This repository holds WDL workflows and Docker build scripts for production workflows for data QC, assembly generation, and assembly QC used by the Human Pangenome Reference Consortium.

All WDLs and containers created in this repository are licensed under the MIT license. The underlying tools (that the WDLs and containers run) are likely covered under one or more Free and Open Source Software licenses, but we cannot guarantee that.


Repository Organization

Workflows are split across data_processing, assembly, and (assembly) QC folders, each with the following folder structure:

```
├── docker/
│   └── toolName/
│       ├── Dockerfile
│       ├── Makefile
│       └── scripts/
│           └── toolName/
│               └── scriptName.py
└── wdl/
    ├── tasks/
    │   └── taskName.wdl
    └── workflows/
        └── workflowName.wdl
```

The root of each of the data_processing, assembly, and (assembly) QC folders contains a readme with details about the workflows and how to use them. Summaries of the workflows in each area are below.


Workflow Types

Data Processing

The HPRC produces HiFi, ONT, and Illumina Hi-C data. Each data type has a workflow that checks data files to ensure they pass QC.

  • HiFi QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate coverage (Gbp) and insert (N50) metrics from fastqs/bams using in-house tooling
    • Check for methylation and kinetics tags (in progress)
  • ONT QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate coverage (Gbp) and insert (N50) metrics from summary files using in-house tooling
  • Hi-C QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate total bases for the data file
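
The in-house tooling is not shown here, but the two core metrics are simple to compute. A minimal Python sketch (the read lengths are toy values, not the actual implementation):

```python
def read_metrics(read_lengths):
    """Compute total coverage (Gbp) and read N50 from a list of read lengths."""
    total = sum(read_lengths)
    gbp = total / 1e9
    # N50: the largest length L such that reads of length >= L
    # together contain at least half of all sequenced bases.
    half = total / 2
    running = 0
    n50 = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= half:
            n50 = length
            break
    return gbp, n50

# Toy example with five reads (100 kb total):
gbp, n50 = read_metrics([10_000, 20_000, 30_000, 15_000, 25_000])
# gbp == 0.0001, n50 == 25_000
```

In practice coverage is reported against an assumed genome size (e.g. dividing total Gbp by ~3.1 Gbp for human), but the summation is the same.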

Assembly

Assemblies are produced with one of two Hifiasm workflows using HiFi and ONT ultralong reads with phasing by either Illumina Hi-C or parental Illumina data for the Hi-C and trio workflows, respectively. The major steps included in the assembly workflows are:

  • Yak to create k-mer databases for trio-phased assemblies
  • Cutadapt to filter adapters from HiFi reads
  • Hifiasm run with HiFi and ONT ultralong reads and either trio or Hi-C phasing
  • Yak for sex chromosome assignment in Hi-C-phased assemblies
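
The trio-phasing idea behind the Yak steps above can be sketched with haplotype-specific k-mers: a read is assigned to the haplotype whose parent-only k-mers it shares more of. This is a toy illustration (k=5 and the sequences are invented), not Yak's or Hifiasm's actual algorithm:

```python
def kmers(seq, k=5):
    """All k-mers in a sequence (toy version; real tools use canonical k-mers)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, pat_kmers, mat_kmers, k=5):
    """Assign a read to a haplotype by counting hap-specific k-mer hits."""
    pat_only = pat_kmers - mat_kmers   # k-mers seen only in paternal data
    mat_only = mat_kmers - pat_kmers   # k-mers seen only in maternal data
    rk = kmers(read, k)
    p, m = len(rk & pat_only), len(rk & mat_only)
    if p > m:
        return "paternal"
    if m > p:
        return "maternal"
    return "ambiguous"

pat = kmers("AAAAATTTTT")   # stand-in for a paternal k-mer database
mat = kmers("CCCCCGGGGG")   # stand-in for a maternal k-mer database
classify_read("AAAAATT", pat, mat)   # -> "paternal"
```

Hi-C phasing replaces the parental databases with Hi-C contact information, but the goal is the same: partitioning sequence into two haplotypes.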

In addition to the Hifiasm workflows, there is an assembly cleanup workflow that:

  • Removes contamination with NCBI's FCS
  • Removes mitochondrial contigs
  • Runs MitoHiFi to assemble mitochondrial contigs
  • Assigns chromosome labels to fasta headers of T2T contigs/scaffolds
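
The final cleanup step amounts to a header rewrite. A minimal sketch is below; the contig names and label format are hypothetical, and how a contig earns its chromosome assignment is not shown here:

```python
def relabel_headers(fasta_lines, chrom_assignments):
    """Append an assigned chromosome label to matching FASTA headers.

    chrom_assignments maps contig name -> chromosome label, e.g.
    {"h1tg000001l": "chr1"} (names and label format are hypothetical).
    """
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name in chrom_assignments:
                line = f">{name} {chrom_assignments[name]}"
        out.append(line)
    return out

relabel_headers([">h1tg000001l", "ACGT"], {"h1tg000001l": "chr1"})
# -> [">h1tg000001l chr1", "ACGT"]
```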

Polishing

Assemblies are polished using a custom pipeline based around DeepPolisher. The polishing pipeline workflow wdl can be found at polishing/wdl/workflows/hprc_DeepPolisher.wdl. The major steps in the HPRC assembly polishing pipeline are:

  • Alignment of all HiFi reads to the diploid assembly using minimap2
  • Alignment of all ONT UL reads > 100kb separately to each haplotype assembly using minimap2
  • The PHARAOH pipeline, which ensures optimal HiFi read phasing by leveraging ONT UL information to assign HiFi reads to the correct haplotype in stretches of homozygosity longer than 20kb
  • DeepPolisher, an encoder-only transformer model run on the PHARAOH-corrected HiFi alignments to predict polishing edits in the assemblies
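
Applying predicted edits back to the assembly can be sketched as below. The (position, ref, alt) tuple format is an illustrative assumption, not DeepPolisher's actual output format:

```python
def apply_edits(seq, edits):
    """Apply polishing edits to a contig sequence.

    edits: (pos, ref, alt) tuples, 0-based; `ref` is the assembly substring
    at pos and `alt` replaces it (substitution, insertion, or deletion).
    Edits are applied right-to-left so earlier coordinates stay valid.
    """
    for pos, ref, alt in sorted(edits, reverse=True):
        assert seq[pos:pos + len(ref)] == ref, "edit does not match assembly"
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq

apply_edits("ACGTACGT", [(1, "C", "G"), (5, "CG", "C")])   # -> "AGGTACT"
```

Sorting the edits and applying them from the highest coordinate down is the key detail: an insertion or deletion earlier in the contig would otherwise shift every downstream position.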

QC

Automated Assembly QC

Assembly QC is broken down into two types:

  • standard_qc: these tools are relatively fast to run and provide insight into the completeness, correctness, and contiguity of the assemblies.
  • alignment_based_qc: these tools rely on long-read alignment of a sample's reads to its own assembly. The alignments are then used to identify unexpected variation that indicates misassembly.

The specific tools included in the standard_qc and alignment_based_qc pipelines are listed in the QC folder's readme.
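
The intuition behind alignment_based_qc is that a sample's reads should not disagree with its own assembly beyond sequencing error and true heterozygosity, so windows with an excess of variant calls are candidate misassemblies. A toy sketch (the window size and threshold are illustrative assumptions, not any specific tool's defaults):

```python
from collections import Counter

def flag_windows(variant_positions, window=10_000, max_per_window=5):
    """Return start coordinates of windows with unexpectedly many variants
    called from self-read alignments, as candidate misassembly regions."""
    counts = Counter(pos // window for pos in variant_positions)
    return sorted(w * window for w, c in counts.items() if c > max_per_window)

flag_windows([100, 200, 300, 400, 500, 600, 15_000])   # -> [0]
```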


Running WDLs

If you haven't run a WDL before, there are good resources online to get started. You first need to choose a way to run WDLs. Below are a few options:

  • Terra: An online platform with a GUI that can run workflows in Google Cloud or Microsoft Azure.
  • Cromwell: A workflow runner from the Broad Institute with support for Slurm, local compute, and multiple cloud platforms.
  • Toil: A workflow runner from UCSC with support for Slurm, local compute, and multiple cloud platforms.

Running with Cromwell

Before starting, read the Cromwell 5 minute intro.

Once you've done that, download the latest version of Cromwell and make it executable (replace XY with the newest version number):

```shell
wget https://github.com/broadinstitute/cromwell/releases/download/XY/cromwell-XY.jar
chmod +x cromwell-XY.jar
```

And run your WDL:

```shell
java -jar cromwell-XY.jar run \
   /path/to/my_workflow.wdl \
   -i my_workflow_inputs.json \
   > run_log.txt
```

Input files

Each workflow requires an input json. You can create a template using womtool:

```shell
java -jar womtool-XY.jar \
    inputs \
    /path/to/my_workflow.wdl \
    > my_workflow_inputs.json
```
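
The template maps fully qualified input names to type placeholders, which you then replace with concrete values. A small sketch of filling one in (the workflow and input names are hypothetical):

```python
import json

# What a womtool template might look like (names are hypothetical):
template = {
    "myWorkflow.inputFastq": "File",
    "myWorkflow.threadCount": "Int (optional, default = 4)",
}

# Fill required inputs; drop optionals you want left at their defaults.
values = {"myWorkflow.inputFastq": "/data/sample.fastq.gz"}
filled = {k: values[k] for k in template if k in values}
print(json.dumps(filled, indent=2))
```

The resulting JSON is what you pass to Cromwell with `-i`.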