Support WGS filtering in the preprocessing pipeline #46

Closed
wants to merge 50 commits into main from feature/wgs-preprocessing
Commits (50):
ce4e026
Create preprocess_wgs.snakefile
endast Dec 12, 2023
2b45939
update
endast Dec 13, 2023
ea9b22d
Working pipeline
endast Dec 13, 2023
178b46e
update snakemake
endast Dec 14, 2023
fd97555
add gtf filtering
endast Dec 19, 2023
fc599e7
Fix paths and max_mem
endast Dec 19, 2023
574a8e5
Cache GTF file in git hub action
endast Dec 19, 2023
fc0ecf3
Fix cache
endast Dec 19, 2023
6b3f027
fix smoke test
endast Dec 19, 2023
d11213d
Fix path
endast Dec 19, 2023
6962f6f
cache gtf in all smoke tests
endast Dec 19, 2023
01abf1b
Fix path for expanded bed
endast Dec 19, 2023
e875ce1
add requierments
endast Dec 19, 2023
4b6e0f3
install biotoolbox
endast Dec 19, 2023
fce06ff
add force
endast Dec 19, 2023
d052c1c
add bedtools
endast Dec 19, 2023
2de9f60
add bedops
endast Dec 20, 2023
a16491b
update example files
endast Dec 20, 2023
1e6b358
Merge branch 'feature/wgs-preprocessing' of https://github.com/PMBio/…
endast Dec 20, 2023
4be54df
add tbi index
endast Dec 20, 2023
7e36abd
update example
endast Dec 20, 2023
e7147d1
Update files
endast Dec 20, 2023
64d436b
update test
endast Dec 20, 2023
43b0917
Update test data
endast Dec 20, 2023
bf9a066
fix chr21
endast Dec 20, 2023
e84e339
update chr1
endast Dec 20, 2023
59e8884
update chr21
endast Dec 20, 2023
a1a7af3
Update dag viz
endast Dec 20, 2023
9806571
Delete preprocess_wgs.snakefile
endast Dec 20, 2023
e9fe508
Update preprocess.snakefile
endast Dec 20, 2023
3c2ca77
update docs
endast Dec 20, 2023
66c8afd
Update preprocessing.md
endast Dec 20, 2023
9b2371c
Merge branch 'main' into feature/wgs-preprocessing
endast Jan 2, 2024
2791489
Merge branch 'main' into feature/wgs-preprocessing
endast Jan 30, 2024
f8c24b0
fixup! Format Python code with psf/black pull_request
Jan 30, 2024
cb01f49
Update preprocessing.md
endast Jan 30, 2024
74fad88
Merge branch 'feature/wgs-preprocessing' of https://github.com/PMBio/…
endast Jan 30, 2024
bb32bae
Merge branch 'main' into feature/wgs-preprocessing
endast Feb 26, 2024
d146935
Fix ruff errors
endast Feb 26, 2024
00cd41a
Merge branch 'main' into feature/wgs-preprocessing
endast Mar 22, 2024
ce759e3
fix typo
endast Mar 22, 2024
9a45d90
Merge branch 'main' into feature/wgs-preprocessing
endast Apr 4, 2024
82f962e
Squashed commit of the following:
endast Apr 10, 2024
34f2c1f
Merge branch 'main' into feature/wgs-preprocessing
endast Apr 15, 2024
d88a78e
Squashed commit of the following:
endast Apr 15, 2024
a2350b3
Merge branch 'main' into feature/wgs-preprocessing
endast Apr 16, 2024
0235084
Update figures in doc
endast Apr 16, 2024
781d973
Remove white space
endast Apr 16, 2024
14f264f
update graphs
endast Apr 16, 2024
4d15082
update cache version
endast Apr 16, 2024
41 changes: 37 additions & 4 deletions .github/workflows/github-actions.yml
@@ -1,6 +1,6 @@
name: DeepRVAT
run-name: DeepRVAT 🧬🧪💻🧑‍🔬
on: [ push ]
on: [push]

jobs:
DeepRVAT-Pipeline-Smoke-Tests:
@@ -77,7 +77,6 @@ jobs:
--snakefile ${{ github.workspace }}/pipelines/seed_gene_discovery.snakefile --show-failed-logs
shell: micromamba-shell {0}


DeepRVAT-Preprocessing-Pipeline-Smoke-Tests:
runs-on: ubuntu-latest
steps:
@@ -143,6 +142,10 @@ jobs:
cache-environment: true
cache-downloads: true

- name: Install biotoolbox
run: cpanm Bio::ToolBox@1.691 --force
shell: micromamba-shell {0}

- name: Install DeepRVAT
run: pip install -e ${{ github.workspace }}
shell: micromamba-shell {0}
@@ -162,19 +165,31 @@ jobs:
-O workdir/reference/GRCh38.primary_assembly.genome.fa.gz \
&& gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz

- name: Cache GTF file
id: cache-gtf
uses: actions/cache@v4
with:
path: example/preprocess/workdir/reference
key: ${{ runner.os }}-reference-gtf

- name: Download gtf data
if: steps.cache-gtf.outputs.cache-hit != 'true'
run: |
cd ${{ github.workspace }}/example/preprocess && \
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
-O workdir/reference/gencode.v44.annotation.gtf.gz

- name: Run preprocessing pipeline
run: |
python -m snakemake -j 2 --directory ${{ github.workspace }}/example/preprocess \
--snakefile ${{ github.workspace }}/pipelines/preprocess_no_qc.snakefile \
--configfile ${{ github.workspace }}/pipelines/config/deeprvat_preprocess_config.yaml --show-failed-logs
shell: micromamba-shell {0}


DeepRVAT-Preprocessing-Pipeline-Tests-With-QC:
runs-on: ubuntu-latest
needs: DeepRVAT-Preprocessing-Pipeline-Smoke-Tests
steps:

- name: Check out repository code
uses: actions/checkout@v4
- uses: mamba-org/setup-micromamba@v1.8.0
@@ -184,6 +199,10 @@ jobs:
cache-environment: true
cache-downloads: true

- name: Install biotoolbox
run: cpanm Bio::ToolBox@1.691 --force
shell: micromamba-shell {0}

- name: Install DeepRVAT
run: pip install -e ${{ github.workspace }}
shell: micromamba-shell {0}
@@ -203,6 +222,20 @@ jobs:
-O workdir/reference/GRCh38.primary_assembly.genome.fa.gz \
&& gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz

- name: Cache GTF file
id: cache-gtf
uses: actions/cache@v4
with:
path: example/preprocess/workdir/reference
key: ${{ runner.os }}-reference-gtf

- name: Download gtf data
if: steps.cache-gtf.outputs.cache-hit != 'true'
run: |
cd ${{ github.workspace }}/example/preprocess && \
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
-O workdir/reference/gencode.v44.annotation.gtf.gz

- name: Run preprocessing pipeline
run: |
python -m snakemake -j 2 --directory ${{ github.workspace }}/example/preprocess \
4 changes: 4 additions & 0 deletions deeprvat_preprocessing_env.yml
@@ -14,3 +14,7 @@ dependencies:
- snakemake=7.17.1
- bcftools=1.17
- samtools=1.17
- bedtools=2.31.1
- perl-app-cpanminus=1.7047
- bedops=2.4.41
- gcc=13.2.0
152 changes: 97 additions & 55 deletions docs/_static/preprocess_rulegraph_no_qc.svg
346 changes: 194 additions & 152 deletions docs/_static/preprocess_rulegraph_with_qc.svg
34 changes: 27 additions & 7 deletions docs/preprocessing.md
@@ -72,16 +72,23 @@ sparse_dir_name : sparse

# Expected to be found in working_dir/reference_dir
reference_fasta_file : GRCh38.primary_assembly.genome.fa
gtf_file : gencode.v44.annotation.gtf.gz

# Max memory used by convert2bed
convert2bed_max_mem: 64G

# Expand each BED entry by the same number of base pairs in each direction
region_expand: 3000

# You can specify a different zcat cmd for example gzcat here, default zcat
zcat_cmd:
```


The config above would use the following directory structure:

```shell
parent_directory
`-- workdir
-- workdir
|-- norm
| |-- bcf
| |-- sparse
@@ -151,13 +158,19 @@ wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38
gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
```

4. Run with the example config
4. Download the gtf file

```shell
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz -P workdir/reference
```

5. Run with the example config

```shell
snakemake -j 1 --snakefile ../../pipelines/preprocess_with_qc.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
```

5. Enjoy the preprocessed data 🎉
6. Enjoy the preprocessed data 🎉

```shell
ls -l workdir/preprocesed
@@ -195,13 +208,20 @@ wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38
gzip -d workdir/reference/GRCh38.primary_assembly.genome.fa.gz
```

4. Run with the example config
5. Download the gtf file

```shell
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz -P workdir/reference
```


6. Run with the example config

```shell
snakemake -j 1 --snakefile ../../pipelines/preprocess_no_qc.snakefile --configfile ../../pipelines/config/deeprvat_preprocess_config.yaml
```

5. Enjoy the preprocessed data 🎉
7. Enjoy the preprocessed data 🎉

```shell
ls -l workdir/preprocesed
Binary file modified example/preprocess/data/vcf/test_vcf_data_c21_b1.vcf.gz
Binary file not shown.
Binary file not shown.
Binary file modified example/preprocess/data/vcf/test_vcf_data_c22_b1.vcf.gz
Binary file not shown.
Binary file not shown.
Binary file not shown.
7 changes: 7 additions & 0 deletions pipelines/config/deeprvat_preprocess_config.yaml
@@ -24,6 +24,13 @@ sparse_dir_name : sparse

# Expected to be found in working_dir/reference_dir
reference_fasta_file : GRCh38.primary_assembly.genome.fa
gtf_file : gencode.v44.annotation.gtf.gz

# Max memory used by convert2bed
convert2bed_max_mem: 64G

# Expand each BED entry by the same number of base pairs in each direction
region_expand: 3000

# You can specify a different zcat cmd for example gzcat here, default zcat
zcat_cmd:
45 changes: 44 additions & 1 deletion pipelines/preprocessing/preprocess.snakefile
@@ -35,6 +35,13 @@ qc_allelic_imbalance_dir = qc_dir / "allelic_imbalance"
qc_duplicate_vars_dir = qc_dir / "duplicate_vars"
qc_filtered_samples_dir = qc_dir / "filtered_samples"

gtf_workdir = working_dir / "gtf"

gtf_file = reference_dir / config["gtf_file"]
gtf_filtered_file = gtf_workdir / f"{gtf_file.stem}_filtered_genes.gtf"
bed_file = gtf_workdir / f"{gtf_file.stem}_filtered_genes.bed"
expanded_bed = gtf_workdir / f"{gtf_file.stem}_filtered_expanded_regions.bed"
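The derived filenames above rely on `pathlib.Path.stem`, which strips only the final suffix — so for a `.gtf.gz` input the inner `.gtf` part survives in the generated names. A minimal sketch of what these expressions produce (the path is illustrative, not taken from the repo):

```python
from pathlib import Path

# Path.stem removes only the last suffix, so a double extension
# like .gtf.gz keeps its inner .gtf part in the stem.
gtf_file = Path("reference/gencode.v44.annotation.gtf.gz")

gtf_filtered_file = f"{gtf_file.stem}_filtered_genes.gtf"
expanded_bed = f"{gtf_file.stem}_filtered_expanded_regions.bed"

print(gtf_filtered_file)  # gencode.v44.annotation.gtf_filtered_genes.gtf
print(expanded_bed)       # gencode.v44.annotation.gtf_filtered_expanded_regions.bed
```

This is why the intermediate files keep `.gtf` embedded in their names rather than the bare `gencode.v44.annotation` prefix.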

vcf_stems, vcf_files, vcf_look_up = deeprvat_preprocess.parse_file_path_list(config["vcf_files_list"])
chromosomes = config["included_chromosomes"]

@@ -56,12 +63,48 @@ rule normalize:
samplefile=norm_dir / "samples_chr.csv",
fasta=fasta_file,
fastaindex=fasta_index_file,
expanded_bed=expanded_bed,
params:
vcf_file=lambda wildcards: vcf_look_up[wildcards.vcf_stem],
output:
bcf_file=bcf_dir / "{vcf_stem}.bcf",
shell:
f"""{load_bcftools} bcftools view --samples-file {{input.samplefile}} --output-type u {{params.vcf_file}} | bcftools view --include 'COUNT(GT="alt") > 0' --output-type u | bcftools norm -m-both -f {{input.fasta}} --output-type b --output {{output.bcf_file}}"""
f"""{load_bcftools} bcftools view -R "{{input.expanded_bed}}" "{{params.vcf_file}}" --output-type u \
| bcftools view --samples-file {{input.samplefile}} --output-type u \
| bcftools view --include 'COUNT(GT="alt") > 0' --output-type u \
| bcftools norm -m-both -f {{input.fasta}} --output-type b --output {{output.bcf_file}}"""

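The third stage of the pipe, `--include 'COUNT(GT="alt") > 0'`, drops variants that no remaining sample actually carries after subsetting. A rough Python sketch of that predicate, with genotypes simplified to `0/0`-style strings (the helper name and records are hypothetical):

```python
# Sketch of bcftools' COUNT(GT="alt") > 0 filter: keep a variant only
# if at least one sample genotype contains a non-reference allele.

def has_alt_carrier(genotypes):
    """Return True if any genotype carries an alternate allele."""
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        # "0" is the reference allele, "." is a missing call
        if any(a not in ("0", ".") for a in alleles):
            return True
    return False

records = {
    "chr21:5030088": ["0/0", "0/1", "0/0"],  # kept: one het carrier
    "chr21:5030105": ["0/0", "0/0", "./."],  # dropped: no alt allele
}
kept = {pos: gts for pos, gts in records.items() if has_alt_carrier(gts)}
print(list(kept))  # ['chr21:5030088']
```

Running the region and sample subsetting first means this count is evaluated only over the samples and regions that survive, which is what makes the filter order in the rule matter.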

rule filter_gtf:
input:
gtf_file,
output:
gtf_filtered_file,
shell:
'get_features.pl --in "{input}" --out "{output}" --include "gene_type=protein_coding" --feature "gene" --gtf'

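The `get_features.pl` call above keeps only `gene` features annotated as `protein_coding`. A hedged, pure-Python approximation of that selection on raw GENCODE GTF lines (the function name and example lines are illustrative, not from the repo):

```python
# Approximation of get_features.pl --feature gene --include gene_type=protein_coding:
# keep tab-separated GTF lines whose 3rd field is "gene" and whose
# attribute column contains gene_type "protein_coding".

def filter_protein_coding_genes(gtf_lines):
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):  # skip GTF header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        feature, attributes = fields[2], fields[8]
        if feature == "gene" and 'gene_type "protein_coding"' in attributes:
            kept.append(line)
    return kept

example = [
    "##provider: GENCODE",
    'chr21\tHAVANA\tgene\t5011799\t5017145\t.\t+\t.\tgene_type "lncRNA";',
    'chr21\tHAVANA\tgene\t5116343\t5124642\t.\t+\t.\tgene_type "protein_coding";',
    'chr21\tHAVANA\texon\t5116343\t5117231\t.\t+\t.\tgene_type "protein_coding";',
]
print(len(filter_protein_coding_genes(example)))  # 1
```

Note that exon lines of protein-coding genes are also excluded: only the enclosing `gene` records are wanted, since the next rule converts them to whole-gene BED intervals.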

rule create_bed:
input:
gtf_filtered_file,
output:
bed_file
params:
maxmem=config["convert2bed_max_mem"]
shell:
'convert2bed --max-mem={params.maxmem} --input=gtf --output=bed < "{input}" > "{output}"'

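The coordinate change `convert2bed` performs here is small but easy to get wrong: GTF intervals are 1-based and end-inclusive, BED intervals are 0-based and end-exclusive, so only the start shifts by one. A minimal sketch (function name is illustrative):

```python
# GTF -> BED coordinate conversion in miniature: GTF is 1-based,
# end-inclusive; BED is 0-based, end-exclusive. The end value is
# numerically unchanged; only the start decrements.

def gtf_to_bed(chrom, start_1based, end_1based, name="."):
    return (chrom, start_1based - 1, end_1based, name)

# A 100 bp gene at GTF positions 1..100 becomes the BED interval [0, 100).
print(gtf_to_bed("chr21", 1, 100))  # ('chr21', 0, 100, '.')
```

Getting this off by one would silently shift every region handed to `bcftools view -R`, which is why delegating it to `convert2bed` rather than hand-rolling it is the safer choice.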

rule expand_regions:
input:
bed=bed_file,
faidx=fasta_index_file,
params:
region_expand=config["region_expand"],
output:
expanded_bed
shell:
'bedtools slop -i "{input.bed}" -g "{input.faidx}" -b {params.region_expand} > "{output}"'

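`bedtools slop -b` widens each interval by `region_expand` base pairs on both sides, using the FASTA index (`.fai`) to clamp results to chromosome bounds. A sketch of that behavior under the assumption that the index is reduced to a chromosome-to-length mapping (chromosome size below is illustrative):

```python
# Sketch of bedtools slop -b: expand each BED interval by `expand` bp
# in both directions, clamped to [0, chromosome length]. Chromosome
# lengths come from the FASTA .fai index in the real pipeline.

def slop(intervals, chrom_sizes, expand):
    out = []
    for chrom, start, end in intervals:
        length = chrom_sizes[chrom]
        out.append((chrom, max(0, start - expand), min(length, end + expand)))
    return out

chrom_sizes = {"chr21": 46_709_983}  # assumed length, stands in for the .fai
bed = [("chr21", 1000, 2000), ("chr21", 46_709_000, 46_709_983)]
print(slop(bed, chrom_sizes, 3000))
# [('chr21', 0, 5000), ('chr21', 46706000, 46709983)]
```

The clamping is the reason the rule needs `fasta_index_file` as an input: without the chromosome lengths, expansion near contig ends would produce invalid coordinates.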

rule index_fasta: