Merge pull request #26 from aofarrel/clearer-inputs

Overhaul input variables and names
aofarrel · May 23, 2023 · c63ff26
2 parents 4794566 + c7ff0a7
commit c63ff26
Showing 7 changed files with 234 additions and 135 deletions.
diff --git a/doc/FAQs.md b/doc/FAQs.md
@@ -0,0 +1,44 @@
+# FAQs
+
+## General
+### Where do samples get dropped?
+|   | pipeline                         | task                     | situation                                                                                                                       | can this filter be disabled?            | can be made a fatal error instead of a silent filter? |
+|---|----------------------------------|--------------------------|---------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|-------------------------------------------------------|
+|   | myco_sra                         | get_sample_IDs           | BioSample accession appears to be invalid                                                                                       | no                                      | no                                                    |
+|   | myco_sra                         | pull                     | ALL of a BioSample's run accessions fail prefetch and/or fasterq-dump                                                           | no                                      | yes, via `pull.fail_on_invalid`                       |
+|   | myco_sra                         | pull                     | ≥1 of a BioSample's run accessions fail prefetch and/or fasterq-dump                                                            | yes, disabled by default                | yes, via `pull.fail_on_invalid`                       |
+|   | myco_sra                         | pull                     | ALL of a BioSample's run accessions have only one fastq                                                                         | no                                      | yes, via `pull.fail_on_invalid`                       |
+|   | myco_sra                         | pull                     | ≥1 of a BioSample's run accessions have only one fastq                                                                          | yes, disabled by default                | yes, via `pull.fail_on_invalid`                       |
+|   | myco_sra, myco_raw               | decontam_each_sample     | sample takes ≥ `timeout_decontam_part1` minutes to map to the decontamination reference via `clockwork map_reads`               | yes, via `timeout_decontam_part1` = 0   | yes, via `decontam_each_sample.crash_on_timeout`      |
+|   | myco_sra, myco_raw               | decontam_each_sample     | sample takes ≥ `timeout_decontam_part2` minutes to map to the decontamination reference via `clockwork remove_contam`           | yes, via `timeout_decontam_part2` = 0   | yes, via `decontam_each_sample.crash_on_timeout`      |
+|   | myco_sra, myco_raw, myco_cleaned | variant_call_each_sample | sample takes ≥ `timeout_variant_caller` minutes to call variants via `clockwork variant_call_one_sample`                        | yes, via `timeout_variant_caller` = 0   | yes, via `variant_call_each_sample.crash_on_timeout`  |
+|   | myco_sra, myco_raw, myco_cleaned | variant_call_each_sample | non-timeout error in `clockwork variant_call_one_sample`                                                                        | no                                      | yes, via `variant_call_each_sample.crash_on_error`    |
+|   | myco_sra, myco_raw, myco_cleaned | trees.cat_diff_files     | proportion of low coverage sites in a sample's diff file ≥ `max_low_coverage_sites`                                             | yes, via `max_low_coverage_sites` = 1.0 | no                                                    |
+
+
+Miscellaneous notes:
+* SRA data failing prefetch or fasterq-dump (from sra-tools) usually means that data is corrupt. If you find ENA data on SRA that looks like it ought not to be corrupt, but can't be downloaded, try ENABrowserTools or [my WDLization of it](https://github.com/aofarrel/enaBrowserTools-wdl).
+* The two decontamination timers apply to the same WDL task but cover different processes within that task. If `timeout_decontam_part1` is 20 and `timeout_decontam_part2` is 15, and a sample spends 19 minutes mapping plus another 14 minutes finishing the decontamination process, it will *not* be filtered out.
+* clockwork's variant caller failing could mean it ran out of memory or was passed bad data, so the pipeline doesn't crash when this happens unless `variant_call_each_sample.crash_on_error` is set to true. Erroring out in the decontamination step is more indicative of a serious issue, so a non-timeout error in that step will crash the pipeline.
+
+### Where does data get filtered?
+* If a FASTQ is above `subsample_cutoff` MB, it will get downsampled by seqtk
+* When diff files are created, two types of regions will be masked:
+  * regions that tend to be frequently masked when dealing with TB are masked (note: upstream outputs such as the VCF do not get masked)
+  * regions which have called a variant but that variant has less coverage than 
+
+### Why are there so many places where samples get dropped?
+Myco was originally designed with the goal of analyzing as much TB SRA data as we could relatively quickly and cheaply. About 94% of MTBC BioSamples which are tagged as containing paired Illumina data pass myco_sra's default filters. Roughly 3% of that data is either completely invalid (fails fasterq-dump due to a database migration error that happened years ago, length of quality score doesn't match length of nucleotide strings, etc.) or can't be used by clockwork (not actually paired Illumina reads, etc.). The remainder of what gets filtered out seems to be extremely heavily contaminated and/or to have low overall coverage. As such, we're reasonably confident that myco_sra's default filters are a good tradeoff for the realities of SRA's data. Of course, you might want to cast a wider net than we did, or you might not be using SRA data at all, so all of these filters except for the this-will-100%-break-the-pipeline-if-you-let-this-through ones can be configured.
+
+
+### What if I want to use a different reference genome?
+This isn't officially supported due to TBProfiler, UShER, and clockwork each needing specific reference genomes:
+* TBProfiler's reference genome must be *exactly* the same as the one you called variants upon, if you're running TBProfiler on bams
+* UShER has limitations on how long a chromosome's name can be
+* clockwork's decontamination is designed with their specific decontamination reference in mind
+That being said, old versions of myco used clockwork reference prepare to prepare the TB genome, and you could hack that passing-reference-genomes-around functionality to use your own custom reference genomes if you're confident. The latest version of myco that used clockwork reference prepare was [4.1.3](https://github.com/aofarrel/myco/releases/tag/4.1.3), so that's a good place to start.
+
+
+## Common warnings/errors
+### trees/tree_nine/cat_diff_files fails with `Disk strings should be of the format 'local-disk SIZE TYPE' or '/mount/point SIZE TYPE' but got: 'local-disk 0 SSD'`
+This means none of your samples made it to the phylogenetic tree building task. See "Where do samples get dropped?" for more information.
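
The timeout rows in the "Where do samples get dropped?" table, together with the note about the two decontamination timers, can be sketched roughly as follows. This is only an illustration of the rules as the FAQ states them, not the pipeline's actual code: the function names `timed_out` and `decontam_drops_sample` are hypothetical, and the sketch assumes each timer is compared against its own step only and that a value of 0 disables it.

```python
def timed_out(minutes_taken: float, timeout_minutes: int) -> bool:
    """Return True if one step reached its timer.

    Assumption taken from the FAQ table: a sample is dropped when it takes
    at least `timeout_minutes`, and a timeout of 0 disables the filter.
    """
    if timeout_minutes == 0:
        return False
    return minutes_taken >= timeout_minutes


def decontam_drops_sample(
    map_minutes: float,
    finish_minutes: float,
    timeout_decontam_part1: int,
    timeout_decontam_part2: int,
) -> bool:
    """Hypothetical helper: would decontamination drop this sample?

    Each timer is checked against its own step; the two durations are
    never added together.
    """
    return timed_out(map_minutes, timeout_decontam_part1) or timed_out(
        finish_minutes, timeout_decontam_part2
    )


# The FAQ's own example: 19 minutes of mapping against a 20-minute timer and
# 14 minutes of cleanup against a 15-minute timer, so the sample is kept.
print(decontam_drops_sample(19, 14, 20, 15))
```

Summing the two durations (33 minutes in the example) would give the wrong answer under either timer, which is why the sketch keeps the two checks separate.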
diff --git a/doc/inputs.md b/doc/inputs.md
@@ -43,12 +43,13 @@ myco_cleaned expects that the FASTQs you are putting into have already been clea
 
 | name | type | default | description |  
 |:---:|:---:|:---:|:---:|  
-| bad_data_threshold | Float  | 0.05 | If a diff file has higher than this percent (0.5 = 50%) bad data, do not include it in the tree |  
 | decorate_tree | Boolean  | false | Should usher, taxonium, and NextStrain trees be generated? Requires input_tree and ref_genome |  
 | fastqc_on_timeout | Boolean  | false | If true, fastqc one read from a sample when decontamination or variant calling times out |  
 | force_diff | Boolean  | false | If true and if decorate_tree is false, generate diff files. (Diff files will always be created if decorate_tree is true.) |  
 | input_tree | File? |  | Base tree to use if decorate_tree = true |  
-| min_coverage | Int  | 10 | Positions with coverage below this value will be masked in diff files |  
+| max_low_coverage_sites | Float  | 0.05 | If a diff file has higher than this proportion (0.5 = 50%) of low coverage sites, do not include it in the tree |  
+| min_coverage_per_site | Int  | 10 | Positions with coverage below this value will be masked in diff files |  
+| ref_genome_for_tree_building | File |  | Ref genome for building trees -- must have ONLY '>NC_000962.3' on its first line |  
 | ref_genome_for_tree_building | File? |  | Ref genome for building trees -- must have ONLY '>NC_000962.3' on its first line |  
 | subsample_cutoff | Int  | 450 | If a fastq file is larger than this size in MB, subsample it with seqtk (set to -1 to disable) |  
 | subsample_seed | Int  | 1965 | Seed used for subsampling with seqtk |  
@@ -65,53 +66,58 @@ If you are on a backend that does not support call cacheing, you can use the 'bl
 
 | task | name | type | default | description |  
 |:---:|:---:|:---:|:---:|:---:|  
-| ClockworkRefPrepTB | bluepeter__tar_indexd_H37Rv_ref | File? |  |  |  
 | ClockworkRefPrepTB | bluepeter__tar_indexd_dcontm_ref | File? |  |  |  
 | ClockworkRefPrepTB | bluepeter__tar_tb_ref_raw | File? |  |  |  
-| cat_reports | out | String  | \'pull_reports.txt\' | Override default output file name with this string |  
+| decontam_each_sample | contam_out_1 | String? |  | Override default output file name with this string |  
+| decontam_each_sample | contam_out_2 | String? |  | Override default output file name with this string |  
+| decontam_each_sample | counts_out | String? |  | Override default output file name with this string |  
+| decontam_each_sample | crash_on_timeout | Boolean  | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |  
+| decontam_each_sample | done_file | String? |  | Override default output file name with this string |  
+| decontam_each_sample | no_match_out_1 | String? |  | Override default output file name with this string |  
+| decontam_each_sample | no_match_out_2 | String? |  | Override default output file name with this string |  
+| decontam_each_sample | subsample_cutoff | Int  | -1 | If a FASTQ file is larger than this size in MB, subsample it with seqtk (set to -1 to disable) |  
+| decontam_each_sample | subsample_seed | Int  | 1965 | Seed used for subsampling with seqtk |  
+| decontam_each_sample | threads | Int? |  | Try to use this many threads for decontamination. Note that actual number of threads also relies on your hardware. |  
+| decontam_each_sample | verbose | Boolean  | true |  |  
 | make_mask_and_diff | histograms | Boolean  | false | Should coverage histograms be output? |  
-| per_sample_decontam | contam_out_1 | String? |  | Override default output file name with this string |  
-| per_sample_decontam | contam_out_2 | String? |  | Override default output file name with this string |  
-| per_sample_decontam | counts_out | String? |  | Override default output file name with this string |  
-| per_sample_decontam | crash_on_timeout | Boolean  | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |  
-| per_sample_decontam | done_file | String? |  | Override default output file name with this string |  
-| per_sample_decontam | no_match_out_1 | String? |  | Override default output file name with this string |  
-| per_sample_decontam | no_match_out_2 | String? |  | Override default output file name with this string |  
-| per_sample_decontam | subsample_cutoff | Int  | -1 | If a FASTQ file is larger than than size in MB, subsample it with seqtk (set to -1 to disable) |  
-| per_sample_decontam | subsample_seed | Int  | 1965 | Seed used for subsampling with seqtk |  
-| per_sample_decontam | threads | Int? |  | Try to use this many threads for decontamination. Note that actual number of threads also relies on your hardware. |  
-| per_sample_decontam | verbose | Boolean  | true |  |  
 | trees | make_nextstrain_subtrees | Boolean  | true |  |  
 | trees | outfile | String? |  | Override default output file name with this string |  
-| varcall_with_array | crash_on_error | Boolean  | false | If this task, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? Note that errors that crash the VM (such as running out of space on a GCP instance) will stop the whole pipeline regardless of this setting. |  
-| varcall_with_array | crash_on_timeout | Boolean  | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |  
-| varcall_with_array | debug | Boolean  | false | Do not clean up any files and be verbose |  
-| varcall_with_array | mem_height | Int? |  | cortex mem_height option. Must match what was used when reference_prepare was run (in other words do not set this variable unless you are also adjusting the reference preparation task) |  
+| variant_call_each_sample | crash_on_error | Boolean  | false | If this task errors, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? Note that errors that crash the VM (such as running out of space on a GCP instance) will stop the whole pipeline regardless of this setting. |  
+| variant_call_each_sample | crash_on_timeout | Boolean  | false | If this task times out, should it stop the whole pipeline (true), or should we just discard this sample and move on (false)? |  
+| variant_call_each_sample | debug | Boolean  | false | Do not clean up any files and be verbose |  
+| variant_call_each_sample | mem_height | Int? |  | cortex mem_height option. Must match what was used when reference_prepare was run (in other words do not set this variable unless you are also adjusting the reference preparation task) |  
 
 
 ### Runtime attributes  
 These variables adjust runtime attributes, which includes hardware settings. See https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/ for more information.  
 
 | task | name | type | default | description |  
 |:---:|:---:|:---:|:---:|:---:|  
-| cat_reports | disk_size | Int  | 10 | Disk size, in GB. Note that since cannot auto-scale as it cannot anticipate the size of reads from SRA. |  
+| cat_resistance | disk_size | Int  | 10 | Disk size, in GB. Note that since cannot auto-scale as it cannot anticipate the size of reads from SRA. |  
+| cat_strains | disk_size | Int  | 10 | Disk size, in GB. Note that since cannot auto-scale as it cannot anticipate the size of reads from SRA. |  
+| decontam_each_sample | addldisk | Int  | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |  
+| decontam_each_sample | cpu | Int  | 8 | Number of CPUs (cores) to request from GCP. |  
+| decontam_each_sample | memory | Int  | 16 | Amount of memory, in GB, to request from GCP. |  
+| decontam_each_sample | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
+| decontam_each_sample | ssd | Boolean  | true | If true, use SSDs for this task instead of HDDs |  
 | get_sample_IDs | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
 | make_mask_and_diff | addldisk | Int  | 10 | Additional disk size, in GB, on top of auto-scaling disk size. |  
 | make_mask_and_diff | cpu | Int  | 8 | Number of CPUs (cores) to request from GCP. |  
 | make_mask_and_diff | memory | Int  | 16 | Amount of memory, in GB, to request from GCP. |  
 | make_mask_and_diff | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
+| make_mask_and_diff | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
 | make_mask_and_diff | retries | Int  | 1 | How many times should we retry this task if it fails after it exhausts all uses of preemptibles? |  
-| per_sample_decontam | addldisk | Int  | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |  
-| per_sample_decontam | cpu | Int  | 8 | Number of CPUs (cores) to request from GCP. |  
-| per_sample_decontam | memory | Int  | 16 | Amount of memory, in GB, to request from GCP. |  
-| per_sample_decontam | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
-| per_sample_decontam | ssd | Boolean  | true | If true, use SSDs for this task instead of HDDs |  
+| merge_reports | disk_size | Int  | 10 | Disk size, in GB. Note that since cannot auto-scale as it cannot anticipate the size of reads from SRA. |  
+| profile | cpu | Int  | 2 | Number of CPUs (cores) to request from GCP. |  
+| profile | memory | Int  | 4 | Amount of memory, in GB, to request from GCP. |  
+| profile | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
+| profile | ssd | Boolean  | false | If true, use SSDs for this task instead of HDDs |  
 | pull | disk_size | Int  | 100 | Disk size, in GB. Note that since cannot auto-scale as it cannot anticipate the size of reads from SRA. |  
 | pull | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
-| varcall_with_array | addldisk | Int  | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |  
-| varcall_with_array | cpu | Int  | 16 | Number of CPUs (cores) to request from GCP. |  
-| varcall_with_array | memory | Int  | 32 | Amount of memory, in GB, to request from GCP. |  
-| varcall_with_array | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
-| varcall_with_array | retries | Int  | 1 | How many times should we retry this task if it fails after it exhausts all uses of preemptibles? |  
-| varcall_with_array | ssd | Boolean  | true | If true, use SSDs for this task instead of HDDs |  
+| variant_call_each_sample | addldisk | Int  | 100 | Additional disk size, in GB, on top of auto-scaling disk size. |  
+| variant_call_each_sample | cpu | Int  | 16 | Number of CPUs (cores) to request from GCP. |  
+| variant_call_each_sample | memory | Int  | 32 | Amount of memory, in GB, to request from GCP. |  
+| variant_call_each_sample | preempt | Int  | 1 | How many times should this task be attempted on a preemptible instance before running on a non-preemptible instance? |  
+| variant_call_each_sample | retries | Int  | 1 | How many times should we retry this task if it fails after it exhausts all uses of preemptibles? |  
+| variant_call_each_sample | ssd | Boolean  | true | If true, use SSDs for this task instead of HDDs |  
 
diff --git a/inputs/myco_sra_local.json b/inputs/myco_sra_local.json
@@ -2,7 +2,7 @@
   "myco.biosample_accessions": "inputs/multi_sample/3_samples.txt",
   "myco.decorate_tree": true,
   "myco.input_tree": "inputs/trees/alldiffs_mask2ref.L.fixed.pb",
-  "myco.min_coverage": 10,
+  "myco.min_coverage_per_site": 10,
   "myco.ref_genome_for_tree_building": "inputs/ref/tb_seq.fasta",
   "myco.typical_tb_masked_regions": "inputs/masks/R00000039_repregions.bed"
 }
diff --git a/inputs/myco_sra_terra.json b/inputs/myco_sra_terra.json
@@ -2,7 +2,7 @@
   "myco.biosample_accessions": "gs://topmed_workflow_testing/tb/sra/20_samples.txt",
   "myco.decorate_tree": true,
   "myco.ref_genome_for_tree_building": "gs://topmed_workflow_testing/tb/alldiffs_mask2ref.L.fixed.pb",
-  "myco.min_coverage": 10,
+  "myco.min_coverage_per_site": 10,
   "myco.ref_genome": "gs://topmed_workflow_testing/tb/tb_seq.fasta",
   "myco.typical_tb_masked_regions": "gs://topmed_workflow_testing/tb/R00000039_repregions.bed"
 }
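
Existing input JSONs written against the old names will need the same renames applied. Below is a minimal sketch of how that could be scripted. It is not part of this PR, and the helpers `migrate_key` and `migrate_inputs` are hypothetical. The `myco.min_coverage` rename is taken directly from the JSON diffs above; the `myco.bad_data_threshold` key and the dotted task-level keys are assumptions inferred from the renamed rows in `doc/inputs.md`.

```python
import json

# Workflow-level renames. The first appears in the JSON diffs above; the
# second is inferred from doc/inputs.md, and its "myco." prefix is an assumption.
WORKFLOW_RENAMES = {
    "myco.min_coverage": "myco.min_coverage_per_site",
    "myco.bad_data_threshold": "myco.max_low_coverage_sites",
}

# Task renames as shown in doc/inputs.md. Whether task names appear as dotted
# segments of your input keys depends on how your inputs are written (assumption).
TASK_RENAMES = {
    "per_sample_decontam": "decontam_each_sample",
    "varcall_with_array": "variant_call_each_sample",
}


def migrate_key(key: str) -> str:
    """Map one old-style input key to its new name; unknown keys pass through."""
    key = WORKFLOW_RENAMES.get(key, key)
    return ".".join(TASK_RENAMES.get(part, part) for part in key.split("."))


def migrate_inputs(inputs: dict) -> dict:
    """Return a copy of an inputs dictionary with every key migrated."""
    return {migrate_key(key): value for key, value in inputs.items()}


old_inputs = {
    "myco.decorate_tree": True,
    "myco.min_coverage": 10,
}
print(json.dumps(migrate_inputs(old_inputs), indent=2))
```

Keys that are not in either mapping are left untouched, so the script can be run on a file that has already been migrated without changing it.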