Bin-based Analysis of Insertional Mutagenesis Screens

bhavenp/BAIMS-Pipeline


BAIMS-Pipeline

Bin-based Analysis of Insertional Mutagenesis Screens. A pipeline to search for regions of the genome enriched for retroviral gene trap insertions in a population of mutagenized cells selected for a phenotype of interest.

Overview

BAIMS is a computational pipeline developed to analyze retroviral gene trap insertion patterns in haploid genetic screens. It can also be used to map retroviral insertions in any mutagenized cell population. The pipeline is currently designed to map insertions to the NCBI GRCh38 release of the human genome. Insertions are aligned to the human genome using the Bowtie aligner and mapped to contiguous, non-overlapping sequence intervals of the genome, referred to as “bins.” The pipeline also tracks any genetic annotations (genes and their relevant features) that overlap with each bin. Bins can be analyzed individually to look for enrichment of specific insertion patterns, or they can be aggregated per gene to identify genes with statistically significant insertion enrichment.
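To make the idea of a bin concrete, here is a minimal sketch of how an insertion position could be assigned to a fixed-size bin and how per-bin, per-strand counts could be tallied. It is not the pipeline's actual code: the function names, the 0-based half-open coordinates, and the assumption that bins start at position 0 of each chromosome are illustrative choices only.

```python
from collections import Counter

# Default bin size taken from the README; coordinates here are assumed 0-based.
BIN_SIZE = 1000


def bin_for_position(chrom, pos, bin_size=BIN_SIZE):
    """Return (chrom, start, end) for the half-open bin [start, end) containing pos."""
    start = (pos // bin_size) * bin_size
    return (chrom, start, start + bin_size)


def count_insertions_per_bin(insertions, bin_size=BIN_SIZE):
    """Tally insertions per (bin, strand).

    insertions is an iterable of (chrom, pos, strand) tuples (a hypothetical
    input format chosen for this sketch).
    """
    counts = Counter()
    for chrom, pos, strand in insertions:
        counts[(bin_for_position(chrom, pos, bin_size), strand)] += 1
    return counts


# Made-up positions for illustration only.
insertions = [("chr1", 1500, "+"), ("chr1", 1999, "-"), ("chr1", 2000, "+")]
bin_counts = count_insertions_per_bin(insertions)
```

Because the bins are contiguous and non-overlapping, integer division is enough to find the bin for any position, and every insertion falls into exactly one bin.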

Following the user modifications described below, the current pipeline is designed to work on a personal machine, server, or server cluster that uses a SLURM scheduler. The code consists of a Bash shell script that links together a few Python scripts to perform the analysis.

Installation

Please refer to the INSTALL file for installation instructions.

Software Requirements

  • Bowtie 1.0.1 with index for human genome version GRCh38
  • Python (2.7.5)
    • Python packages getopt, re, argparse, xlwt, scipy, statsmodels

Script Parameters

  • -c <control_fastq_file> : “-c” indicates that the following argument (control_fastq_file) is a fastq file containing reads for the control (unselected) cell population. This parameter is required, even though it is passed as an option.
  • -s <selected_fastq_file> : “-s” indicates that the following argument (selected_fastq_file) is a fastq file containing reads for the selected cell population. This parameter is required, even though it is passed as an option.
  • -n <basename_for_output_files> : “-n” indicates that the following string (basename_for_output_files) should be used to name the output files. This option is not required. The default basename is “output”. Please do not use the words “control” or “selected” in the basename.
  • --binSize : “--binSize” indicates that the following number should be used as the size of the bins that encompass the insertions. This option is not required. The default bin size is 1000.
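The options and defaults above can be summarized in a short argparse sketch. This is only an illustration: the pipeline's entry point is a Bash script, and the names used here (build_parser, validate_basename, and the dest names) are hypothetical. The case-insensitive basename check is also an assumption; the README only asks users to avoid the words “control” and “selected”.

```python
import argparse


def build_parser():
    """Hypothetical argparse mirror of the documented BAIMS options."""
    parser = argparse.ArgumentParser(description="Sketch of the BAIMS options")
    parser.add_argument("-c", dest="control_fastq", required=True,
                        help="fastq file for the control (unselected) population")
    parser.add_argument("-s", dest="selected_fastq", required=True,
                        help="fastq file for the selected population")
    parser.add_argument("-n", dest="basename", default="output",
                        help="basename for output files")
    parser.add_argument("--binSize", dest="bin_size", type=int, default=1000,
                        help="size of the bins that encompass the insertions")
    return parser


def validate_basename(basename):
    """Reject basenames containing 'control' or 'selected' (assumed case-insensitive)."""
    lowered = basename.lower()
    if "control" in lowered or "selected" in lowered:
        raise ValueError("basename must not contain 'control' or 'selected'")
    return basename


# Example with placeholder file names; only the required options are given.
args = build_parser().parse_args(["-c", "control.fastq", "-s", "selected.fastq"])
validate_basename(args.basename)
```

With only the two required options supplied, the sketch falls back to the documented defaults: a basename of “output” and a bin size of 1000.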

Output

All output files should be found in the directory in which the “BAIMS_pipeline.sh” executable is called. No additional directories are added by the BAIMS pipeline.

  1. “basename”_control.sam: This file is generated by the Bowtie aligner and contains the reported alignments for the sequencing reads from the fastq file for the control population.
  2. “basename”_selected.sam: This file is generated by the Bowtie aligner and contains the reported alignments for the sequencing reads from the fastq file for the selected population.
  3. “basename”_control_alignment_stats: This is a text file that contains a brief summary of the alignments in the basename_control.sam file. It contains the number of sequencing reads processed from the fastq file for the control population, the number of reads that aligned to the genome, the number of reads that aligned to multiple places in the genome, the number of reads that aligned to only one place in the genome (called a “unique read/alignment”), a breakdown of the unique alignments in terms of base pair mismatches during alignment, and a breakdown of unique alignments per chromosome.
  4. “basename”_selected_alignment_stats: This is a text file (can be opened in Excel for better visualization) that contains a brief summary of the alignments in the basename_selected.sam file. It contains the number of sequencing reads processed from the fastq file for the selected population, the number of reads that aligned to the genome, the number of reads that aligned to multiple places in the genome, the number of reads that aligned to only one place in the genome (called a “unique read/alignment”), a breakdown of the unique alignments in terms of base pair mismatches during alignment, and a breakdown of unique alignments per chromosome.
  5. “basename”_control.bed: This is a file of only the uniquely aligned sequencing reads from basename_control.sam in the BED format. If two sequencing reads align to the same genomic position, only one is included in this file. This file can be uploaded to the UCSC Genome Browser to visualize the reads in different regions of the genome.
  6. “basename”_selected.bed: This is a file of only the uniquely aligned sequencing reads from basename_selected.sam in the BED format. If two sequencing reads align to the same genomic position, only one is included in this file. This file can be uploaded to the UCSC Genome Browser to visualize the reads in different regions of the genome.
  7. “basename”_control_binsAndAnnotations: This is a text file that contains the read counts for each bin for the control population. The information for each bin begins with the exact genomic location of the bin (chromosome and base pairs), followed by a breakdown of the number of total, unique, and non-unique (aligning to multiple genomic locations) reads in the bin. Each type of read is further broken down by orientation (“+” indicates the read aligned in the sense orientation with regard to the chromosome, “-” indicates the read aligned in the antisense orientation with regard to the chromosome) and the number of mismatches the read had during alignment. Genetic features that overlap with the bin are written below the table in the following format “Gene_name$Feature$Chromosome_Strand”.
  8. “basename”_selected_binsAndAnnotations: This is a text file that contains the read counts for each bin for the selected population. The information for each bin begins with the exact genomic location of the bin (chromosome and base pairs), followed by a breakdown of the number of total, unique, and non-unique (aligning to multiple genomic locations) reads in the bin. Each type of read is further broken down by orientation (“+” indicates the read aligned in the sense orientation with regard to the chromosome, “-” indicates the read aligned in the antisense orientation with regard to the chromosome) and the number of mismatches the read had during alignment. Genetic features that overlap with the bin are written below the table in the following format “Gene_name$Feature$Chromosome_Strand”.
  9. “basename”_Bin_Analysis.xls: This Excel file contains the results of the antisense intronic, upstream, and inactivating insertion enrichment analyses.
    1. Antisense intronic insertion enrichment analysis (sheet: “Antisense_intronic_bins”) columns:
      • “Bin” contains the chromosome and base pair location of the bin.
      • “Annotation” contains genetic feature annotations for the corresponding bin.
      • “p-value” contains the p-value for antisense intronic insertion enrichment for the corresponding bin.
      • “FDR-corrected p-value” contains the FDR-corrected p-value for antisense intronic insertion enrichment for the corresponding bin.
      • “Sample,” “Antisense insertions in the bin for the sample” and “Total insertions mapped in the sample” refer to the sample population the insertions were mapped for, the number of antisense insertions found in the corresponding bin for the sample and the total number of insertions mapped for the sample, respectively.
    2. Upstream insertion enrichment analysis (sheet: “Upstream_bins”) columns:
      • “Bin” contains the chromosome and base pair location of the bin.
      • “Annotation” contains genetic feature annotations for the corresponding bin.
      • “p-value” contains the p-value for promoter insertion enrichment for the corresponding bin.
      • “FDR-corrected p-value” contains the FDR-corrected p-value for promoter insertion enrichment for the corresponding bin.
      • “Sample,” “Insertions in the bin for the sample” and “Total insertions mapped in the sample” refer to the sample population the insertions were mapped for, the number of insertions found in the corresponding bin in that sample and the total number of insertions mapped for the sample, respectively.
    3. Inactivating insertion enrichment analysis (sheet: “Inactivating_bins”) columns:
      • “Bin” contains the chromosome and base pair location of the bin.
      • “Annotation” contains genetic feature annotations for the corresponding bin.
      • “p-value” contains the p-value for inactivating insertion enrichment for the corresponding bin.
      • “FDR-corrected p-value” contains the FDR-corrected p-value for inactivating insertion enrichment for the corresponding bin.
      • “Sample” refers to the sample population the insertions were mapped for.
      • “Sense insertions in the bin for the sample” and “Antisense insertions in the bin for the sample” refer to the number of sense or antisense insertions (relative to orientation of the gene in the Annotation column for the bin) found in the corresponding bin.
      • “Inactivating insertions in the bin for the sample” refers to the number of inactivating insertions (sense and antisense insertions in bins annotated with “5’UTR,” “CDS,” or “3’UTR,” and only sense insertions in bins annotated exclusively as “intron”) found in the corresponding bin.
      • “Total Insertions Mapped in Sample” refers to the total number of insertions mapped for the corresponding sample.
  10. “basename”_Gene_Comparison.xls: This Excel file contains the results of the gene-based insertion enrichment analysis.
    1. Gene-based insertion enrichment analysis (sheet: “Gene_Insert_Enrich”) columns:
      • “Gene” contains the RefSeq gene symbol.
      • “p-value” contains the p-value for insertion enrichment for the corresponding gene.
      • “FDR-corrected p-value” contains the FDR-corrected p-value for insertion enrichment for the corresponding gene.
      • “Sample,” “Insertions in the gene in the sample” and “Total insertions mapped in the sample” refer to the sample population the insertions were mapped for, the number of insertions found in the corresponding gene for the sample and the total number of insertions mapped for the sample, respectively.
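The annotation format and the inactivating-insertion rule described in the output list can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names are hypothetical, the exact feature labels (“5'UTR”, “CDS”, “3'UTR”, “intron”) are assumed spellings, the third annotation field is kept as an opaque string, and returning 0 for bins with none of these features is an assumption because the README does not say how such bins are handled.

```python
# Feature labels for which both orientations count as inactivating (assumed spellings).
CODING_OR_UTR = {"5'UTR", "CDS", "3'UTR"}


def parse_annotation(annotation):
    """Split a 'Gene_name$Feature$Chromosome_Strand' string into its three fields."""
    gene, feature, chrom_strand = annotation.split("$")
    return {"gene": gene, "feature": feature, "chrom_strand": chrom_strand}


def count_inactivating(features, sense, antisense):
    """Count inactivating insertions in one bin.

    features: feature labels annotated on the bin for one gene.
    sense / antisense: insertion counts relative to the gene's orientation.
    """
    features = set(features)
    if features & CODING_OR_UTR:
        # Bins touching a UTR or CDS: insertions in either orientation count.
        return sense + antisense
    if features == {"intron"}:
        # Bins annotated exclusively as intron: only sense insertions count.
        return sense
    # Other bins are not covered by the README; 0 is an assumption.
    return 0


# Placeholder annotations; the gene name and third field are made up.
annotations = ["GENE1$intron$chr1_+", "GENE1$CDS$chr1_+"]
features = [parse_annotation(a)["feature"] for a in annotations]
inactivating = count_inactivating(features, sense=3, antisense=5)
```

In this example the bin overlaps both an intron and a CDS of the same gene, so it is not exclusively intronic and insertions in both orientations are counted.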

Contact

This code was developed and is maintained by Bhaven Patel (bhavenp@stanford.edu), a member of the Rohatgi Lab in the Department of Biochemistry in the Stanford School of Medicine. If you have any questions or issues, or if you find any bugs, please email me and I will try to respond as soon as possible.
