pnm27/snRNA_scRNA_Pipeline

snRNA_scRNA_Pipeline Introduction

TODO

  • Miscellaneous:
    • Simplify the shell scripts in the rules demux_samples and add_obs_to_final_count_matrix in demultiplex.smk.
    • Add PICARD option in new_config file.
    • Write down schemas.
    • Add tutorials.
      • pooled snRNA seq
        • single wildcard
        • multiple wildcards
      • scRNA seq
        • single wildcard
        • multiple wildcards
      • Double HTOs
    • Remove dependency on STARsolo as an aligner.
    • For rules that use genefull_matrices, write an input function that takes either Gene or GeneFull depending on the project.
    • Combine sub-workflows split_bams and split_bams_gt.
      • Search Ranking of readthedocs (using config file for this too).
    • Might incorporate git submodules for repos on git that I use.
    • Add new Picard metrics.
    • Add options in the config file to allow extra params for each program.
    • For reruns of vireo, provide a way to retain that information in the update logs file.
    • Fix demultiplex_helper_funcs.py for double HTOs in function parse_file.
    • Simplify structure of wildcards.
      • Folder structure should include
  • analyse_vireo:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • calico_solo_demux:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • demultiplex:
    • new_config params:
    • snakemake_rules:
    • Employ a strategy for final count matrix dir (file dir in config file) for the cases:
      • when both demultiplex software are run simultaneously.
      • When there's an order (try to name each run separately or at least keep the order somewhere mentioned).
    • scripts:
      • when adding calico_solo or vireo include the demultiplex file (file containing demux stats) as input and append to it.
      • For reruns of vireo, provide a way to retain that information in the demultiplex info file.
  • helper_functions:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • identify_swaps:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • input_processing:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • kite:
    • new_config params:
    • snakemake_rules:
      • Remove run directive
    • scripts:
  • pheno_demux3:
    • new_config params:
    • snakemake_rules:
      • Remove run directive
      • Beautify the function get_filt_barcodes.
    • scripts:
  • picard_metrics:
    • new_config params:
    • snakemake_rules:
    • scripts:
  • produce_targets:
    • new_config params:
    • snakemake_rules:
      • Simplify target functions.
    • scripts:
  • STARsolo:
    • new_config params:
    • snakemake_rules:
      • Remove run directive
      • WASP mode
    • Issues with using wildcard vcf_type in the rule demux_samples.
    • Issues with output dir selection in the demux_samples i.e. automatically pick output dir.
    • Revamp wildcards so that varying output dirs are corrected accordingly:
      • Demux output i.e. if only one demux method needs to be used or simultaneously both.
      • splitting bams is for finalizing or genotype purposes.
  • cellranger:
    • Support for cellranger.
      • Support for cellranger-arc count.
        • Add support for alignment.
        • Add support for ATAC-based vireo demultiplexing.

This pipeline aims not only to make complex preprocessing workflows easy (e.g. snRNA seq with pooled samples, double HTOs, etc.) but also to simplify common preprocessing workflows by providing ready-made combinations of software tools (see the selectable modules for more options).

It also supports various software and pipelines for scRNA seq preprocessing.

The highlights of the pipeline are:

  • Streamlined processes to modify parameters for each program through a single yaml file
  • Easily modifiable to accommodate more rules
  • Can be used for both individual samples and multiplexed pools
  • Preserve folder structures (mirroring fastqs' folder structures)
  • Organize outputs from each module
  • Select from multiple pre-set modules that simplify usage across multiple projects
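
As an illustration of the "mirroring fastqs' folder structures" highlight, here is a minimal sketch of how an output path could be derived from a fastq path. The function name `mirror_output_path`, the example directories, and the list of fastq extensions are assumptions made for this example; they are not taken from the pipeline's code.

```python
from pathlib import PurePosixPath

# Assumed set of fastq extensions; the pipeline may recognise others.
FASTQ_EXTS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def mirror_output_path(fastq, fastq_root, out_root, new_suffix):
    """Return an output path that keeps the fastq's sub-folder layout.

    Hypothetical helper: the part of `fastq` below `fastq_root` is
    re-created below `out_root`, and the fastq extension is replaced
    by `new_suffix`.
    """
    rel = PurePosixPath(fastq).relative_to(PurePosixPath(fastq_root))
    name = rel.name
    for ext in FASTQ_EXTS:
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return PurePosixPath(out_root) / rel.parent / (name + new_suffix)


if __name__ == "__main__":
    print(
        mirror_output_path(
            "/data/fastqs/round1/poolA/sample1_R1.fastq.gz",
            "/data/fastqs",
            "/results/STARsolo",
            ".bam",
        )
    )
```

With a layout like this, every module can write below its own output root while the sub-folders stay identical to those of the input fastqs.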

Changelog

  • Changed param name in demultiplex info from Unique genes to gene_ids with an associated gene_name.
  • Added a new param in the demultiplex info file to add more stats when removing gene IDs without an associated gene name.
  • Added an option to run cellSNP without any ref vcfs (1000 Genomes Project vcf is min requirement)
  • Now create_wet_lab_info scripts can:
    • Run without a converter file
    • Save donor file along with the wet lab compilation file
    • argparse documented
  • Fixed an issue with create_wet_lab_info.py file
  • create_wet_lab_info.py file now mirrors actions for donor and multiplex compilations.
  • Changed name of the rule demux_samples_calico_solo_STARsolo to demux_samples.
  • Changed the demux_info parameter to optional (from positional) in demultiplex_no_argp.snkmk's rule that handles adding new demux to a final count matrix.
  • Added working argparse to demul_samples_no_argp.py script.
  • Changed the name of sub-workflow demultiplex_no_argp.snkmk to demultiplex.snkmk
  • Changed the name of sub-workflow demul_samples_no_argp.py to demul_samples.py
  • Fix demultiplex_no_argp.snkmk's rule that handles adding new demux to a final count matrix.
  • Add an option (in config file) to create h5ads when demultiplexing (demultiplex_no_argp.snkmk) or not (can be used as switch when doing gt checks and finalizing donor assignment).
  • Add an option for the rule cellSNP when ref SNPs vcf need not be subsetted further.
  • Make the functions similar for demultiplexing with any method.
  • Fix issue with reading old wet_lab_info file to update (extension issues).
  • Known issue with the create_wet_lab_info.py file (it fails to add some lines from certain files - try AMP ones)
  • The single wildcard is now called pool (earlier, num, id1 and id2 were used interchangeably).
  • Retained use of double wildcards.
  • Revised split_bams script:
    • Consolidated gt and non-gt versions.
    • Now, mito file is in params (earlier was an output)
    • Now, bed file is in params (earlier was an input)
    • Streamlined
  • Major revisions to create_per_donor_bams.bash script
    • Consolidated gt and non-gt versions.
    • Handles saving mito_file much more elegantly.
    • Supports argument parsing (with support for older positional args)
    • No longer expects directories provided as inputs to follow the convention of ending with '/'.
  • Changed name of workflow from pheno_demux3.snkmk to genotype_demux.snkmk
  • The rule create_inp_splitBams now:
    • Uses a consolidated script to create the barcode files using both, h5ad and raw files.
    • Removed the option to overwrite the outputs as before (but still present in python script).
    • Rule supports single demux (while the script can handle multiple).
    • For calico_solo, only h5ad input is supported, while vireo output is supported as is.
  • Major revisions to run_update_logs.sh script
    • Builds the command from input using associative arrays.
    • To emulate missingness (picard and/or demultiplexing) just use empty values.
    • Now supports STARsolo 2.7.10 with Final_out_MAP_2_7_10a_latest.tsv.
  • Major revisions to update_logs.py script
    • All optional parameters (except map_file, output_file, and bam_dir) expect one value or become None when absent (with no argument value other values are used).
    • To emulate missingness (picard and/or demultiplexing) just use empty values.
    • Missingness of picard_dir implies not collecting GCBias and RNASeq Metrics.
    • Similarly, missingness of demul_dir implies not collecting demultiplexing info.
  • New file - Final_out_MAP_2_7_10a_latest_info.xlsx - contains more info related to Final_out_MAP_2_7_10a_latest.tsv.
  • Changed the section name from demux_pipeline to hashsolo_demux_pipeline in new_config.yaml file.
    • calico_solo_demux.smk: At lines numbered 3 and 9.
    • create_logs.smk: At lines numbered 11 and 52.
    • demultiplex.smk: At lines numbered 10, 12, 32, 34, 72, 74, 78, 80, 124, 125, 132, 135, 138, 328, 331, and 333.
    • genotype_demux.smk: At line numbered 45.
    • kite.smk: At lines numbered 417 and 437.
    • produce_targets.smk: At lines numbered 159-162.
    • split_bams.smk: At lines numbered 10, 16, and 18.
  • Consolidated the section name split_bams_pipeline_gt_demux into split_bams_pipeline_gt (earlier used for calico_solo based split bams) in new_config.yaml file.
    • split_bams.smk: At lines numbered 44,45,47 and 52.
    • produce_targets.smk: At line numbered 111.
    • identify_swaps.smk: At line numbered 3.
  • Added support for Snakemake transition to version > 8.
    • Added workflow_profile/config.yaml which gets reflected in run_snakemake.sh.
    • To emulate the previous profile behavior, manually edited the lsf_executor_plugin and added an ENV variable.
    • lsf.yaml is still present for Snakemake < v8.
    • Replaced the threads directive with resources: cpus_per_task.
  • Removed dependence on resources.smk. Instead all resource requirements are within each snakemake file.
  • Simplified the rule cellSNP in genotype_demux.smk.
    • A params function, cmd_str_csnp, now replaces the use of indexed arrays in shell (which worked with Snakemake < 8).
    • Simplified the command line for execution.
    • The rule now picks the larger of the n_proc given in new_config.yaml and twice the number of cpus provided.
  • Simplified the rule vireoSNP in genotype_demux.smk.
    • A params function, cmd_str_vireo, now replaces the use of indexed arrays in shell (which worked with Snakemake < 8).
    • The file specified in vcf_info (relates to the rule vireoSNP) in new_config.yaml is now expected to have headers.
    • Simplified the command line for execution.
  • Usage of pd.concat in update_logs.py now complies with the FutureWarning.
  • In new_config.yaml, changed
    • gt_conv to file in donorName_conv in gt_demux_pipeline.
    • mito to mito_prefix. Reflected in demultiplex.smk, split_bams.smk and calico_solo_demux.smk
  • Removed mode='w+' when creating outputs in create_Feat_Barc.py.
  • Added multiome_alignment as a new module. Created cellranger.smk, which currently supports cellranger arc count only.
  • Added multiome demultiplexing support for the following rules:
    • genotype_demux
      • Fixed UMItag selection in cellSNP for multiome-ATAC.
      • Fixed issues for multiome in create_inp_cellSNP (adds the "-1" suffix to cell barcodes, since the bam produced by cellranger carries a "-1" suffix)
    • demultiplex
    • Change 'vcf_type' wildcard to support both multi-vcf and multiome setup.
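
The cellSNP-related entries above (a params function that builds the command string, the choice of process count, and the "-1" barcode suffix for multiome) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the pipeline's code: the function signatures, argument names and example file names are hypothetical, and only a small subset of cellsnp-lite options (`-s`, `-b`, `-O`, `-R`, `-p`, `--UMItag`) is shown.

```python
import shlex


def pick_n_proc(config_n_proc, cpus_per_task):
    """Larger of the configured n_proc and twice the cpus provided."""
    return max(int(config_n_proc), 2 * int(cpus_per_task))


def add_barcode_suffix(barcodes, suffix="-1"):
    """Append `suffix` to barcodes that lack it (cellranger BAMs carry "-1")."""
    return [bc if bc.endswith(suffix) else bc + suffix for bc in barcodes]


def cmd_str_csnp(bam, barcodes_file, out_dir, ref_vcf, n_proc, umi_tag="Auto"):
    """Build a cellsnp-lite command line as one shell-safe string.

    Hypothetical signature. Returning a single string from a params
    function avoids indexing into arrays inside the shell block.
    For ATAC data without UMIs, pass umi_tag="None".
    """
    parts = [
        "cellsnp-lite",
        "-s", bam,
        "-b", barcodes_file,
        "-O", out_dir,
        "-R", ref_vcf,
        "-p", str(n_proc),
        "--UMItag", umi_tag,
    ]
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    n_proc = pick_n_proc(config_n_proc=4, cpus_per_task=8)
    print(cmd_str_csnp("pool1.bam", "bc.txt", "out/", "ref.vcf.gz", n_proc, "None"))
    print(add_barcode_suffix(["AAACCTG", "TTTGGTC-1"]))
```

In a Snakemake rule, a function like `cmd_str_csnp` would typically be wrapped in a `params:` callable that receives `wildcards` (and optionally `input`, `output`, `resources`), so the shell block only has to expand one string.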

Requirements

This pipeline depends on the following packages/programs:

About

Snakemake Pipeline with "selectable" modules
