updated MacOS installation, user manual and UniProt converter

uds-helms · Jan 19, 2021 · b8eeea6
1 parent f4733b8
Showing 9 changed files with 115 additions and 151 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,7 @@
 # Development files
 *.fastq
 results/
+user_projects/
 .idea
 pyinstaller.bat
 .spec

diff --git a/README.md b/README.md
@@ -1,4 +1,5 @@
 # MutaNET
+## Introduction
 Mutations in genomic key elements can influence gene expression and function in various ways, and hence greatly contribute to the phenotype. MutaNET comes with a next generation sequencing (NGS) pipeline that calls mutations based on paired-end NGS reads, an automated analysis tool and various file converters and mergers. The mutation analysis feature considers the coding region, protein domains, regulation and transcription factor binding site information, and can be used to analyse the potential impact of mutations on genes of interest.
 
 MutaNET was developed and implemented in 2017 and published in 2018:
@@ -29,7 +30,7 @@ When starting MutaNET for the first time, the file paths for small example data
 **NGS Pipeline:** You need to extract `S11_R1.fastq` and `S11_R2.fastq` from `S11_R1.zip` and `S11_R2.zip` in `example_data/NGS/reads` before running the NGS pipeline with the example data.
 
 ## Change Log
-### Version 1.1.0
+### Version 2.0
 - added support for eukaryotes
 - fixed file encoding issues
 - fixed minor frame shift bug
diff --git a/example_data/UniProt/s_aureus_domains_shortened.txt b/example_data/UniProt/s_aureus_domains_shortened.txt
@@ -1,83 +1,78 @@
-ID   FMT_STAA8               Reviewed;         311 AA.
-AC   Q2FZ68;
-DT   15-JAN-2008, integrated into UniProtKB/Swiss-Prot.
+ID   Q2FXK2_STAA8            Unreviewed;       386 AA.
+AC   Q2FXK2;
+DT   21-MAR-2006, integrated into UniProtKB/TrEMBL.
 DT   21-MAR-2006, sequence version 1.
-DT   15-MAR-2017, entry version 72.
-DE   RecName: Full=Methionyl-tRNA formyltransferase {ECO:0000255|HAMAP-Rule:MF_00182};
-DE            EC=2.1.2.9 {ECO:0000255|HAMAP-Rule:MF_00182};
-GN   Name=fmt {ECO:0000255|HAMAP-Rule:MF_00182};
-GN   OrderedLocusNames=SAOUHSC_01183;
-OS   Staphylococcus aureus (strain NCTC 8325).
+DT   07-OCT-2020, entry version 87.
+DE   RecName: Full=Aminotran_5 domain-containing protein {ECO:0000259|Pfam:PF00266};
+GN   OrderedLocusNames=SAOUHSC_01832 {ECO:0000313|EMBL:ABD30900.1};
+OS   Staphylococcus aureus (strain NCTC 8325 / PS 47).
 OC   Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae;
 OC   Staphylococcus.
-OX   NCBI_TaxID=93061;
-RN   [1]
+OX   NCBI_TaxID=93061 {ECO:0000313|EMBL:ABD30900.1, ECO:0000313|Proteomes:UP000008816};
+RN   [1] {ECO:0000313|Proteomes:UP000008816}
 RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
-RC   STRAIN=NCTC 8325;
-RA   Gillaspy A.F., Worrell V., Orvis J., Roe B.A., Dyer D.W.,
-RA   Iandolo J.J.;
+RC   STRAIN=NCTC 8325 / PS 47 {ECO:0000313|Proteomes:UP000008816};
+RA   Gillaspy A.F., Worrell V., Orvis J., Roe B.A., Dyer D.W., Iandolo J.J.;
 RT   "The Staphylococcus aureus NCTC 8325 genome.";
 RL   (In) Fischetti V., Novick R., Ferretti J., Portnoy D., Rood J. (eds.);
-RL   Gram positive pathogens, 2nd edition, pp.381-412, ASM Press,
-RL   Washington D.C. (2006).
-CC   -!- FUNCTION: Modifies the free amino group of the aminoacyl moiety of
-CC       methionyl-tRNA(fMet). The formyl group appears to play a dual role
-CC       in the initiator identity of N-formylmethionyl-tRNA by: (I)
-CC       promoting its recognition by IF2 and (II) impairing its binding to
-CC       EFTu-GTP. {ECO:0000255|HAMAP-Rule:MF_00182}.
-CC   -!- CATALYTIC ACTIVITY: 10-formyltetrahydrofolate + L-methionyl-
-CC       tRNA(fMet) = tetrahydrofolate + N-formylmethionyl-tRNA(fMet).
-CC       {ECO:0000255|HAMAP-Rule:MF_00182}.
-CC   -!- SIMILARITY: Belongs to the Fmt family. {ECO:0000255|HAMAP-
-CC       Rule:MF_00182}.
-CC   -----------------------------------------------------------------------
-CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
-CC   Distributed under the Creative Commons Attribution-NoDerivs License
-CC   -----------------------------------------------------------------------
-DR   EMBL; CP000253; ABD30290.1; -; Genomic_DNA.
-DR   RefSeq; WP_000161299.1; NC_007795.1.
-DR   RefSeq; YP_499722.1; NC_007795.1.
-DR   ProteinModelPortal; Q2FZ68; -.
-DR   SMR; Q2FZ68; -.
-DR   STRING; 93061.SAOUHSC_01183; -.
-DR   EnsemblBacteria; ABD30290; ABD30290; SAOUHSC_01183.
-DR   GeneID; 28381227; -.
-DR   GeneID; 3919316; -.
-DR   KEGG; sao:SAOUHSC_01183; -.
-DR   PATRIC; 19579855; VBIStaAur99865_1086.
-DR   eggNOG; ENOG4105CAE; Bacteria.
-DR   eggNOG; COG0223; LUCA.
-DR   HOGENOM; HOG000261177; -.
-DR   KO; K00604; -.
-DR   OMA; GCINSHA; -.
+RL   Gram positive pathogens, 2nd edition, pp.381-412, ASM Press, Washington D.C
+RL   (2006).
+CC   -!- COFACTOR:
+CC       Name=pyridoxal 5'-phosphate; Xref=ChEBI:CHEBI:597326;
+CC         Evidence={ECO:0000256|ARBA:ARBA00001933,
+CC         ECO:0000256|PIRSR:PIRSR000524-50, ECO:0000256|RuleBase:RU004504};
+CC   -!- SIMILARITY: Belongs to the class-V pyridoxal-phosphate-dependent
+CC       aminotransferase family. {ECO:0000256|ARBA:ARBA00009236,
+CC       ECO:0000256|RuleBase:RU004075}.
+CC   ---------------------------------------------------------------------------
+CC   Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms
+CC   Distributed under the Creative Commons Attribution (CC BY 4.0) License
+CC   ---------------------------------------------------------------------------
+DR   EMBL; CP000253; ABD30900.1; -; Genomic_DNA.
+DR   RefSeq; WP_000291415.1; NZ_LS483365.1.
+DR   RefSeq; YP_500338.1; NC_007795.1.
+DR   SMR; Q2FXK2; -.
+DR   STRING; 1280.SAXN108_1751; -.
+DR   EnsemblBacteria; ABD30900; ABD30900; SAOUHSC_01832.
+DR   GeneID; 3921782; -.
+DR   GeneID; 45574931; -.
+DR   KEGG; sao:SAOUHSC_01832; -.
+DR   PATRIC; fig|93061.5.peg.1671; -.
+DR   eggNOG; COG0075; Bacteria.
+DR   HOGENOM; CLU_027686_1_1_9; -.
+DR   OMA; GQTHSTP; -.
 DR   Proteomes; UP000008816; Chromosome.
-DR   GO; GO:0004479; F:methionyl-tRNA formyltransferase activity; IEA:UniProtKB-EC.
-DR   Gene3D; 3.10.25.10; -; 1.
-DR   Gene3D; 3.40.50.170; -; 1.
-DR   HAMAP; MF_00182; Formyl_trans; 1.
-DR   InterPro; IPR005794; Fmt.
-DR   InterPro; IPR005793; Formyl_trans_C.
-DR   InterPro; IPR002376; Formyl_transf_N.
-DR   InterPro; IPR011034; Formyl_transferase_C-like.
-DR   InterPro; IPR001555; GART_AS.
-DR   Pfam; PF02911; Formyl_trans_C; 1.
-DR   Pfam; PF00551; Formyl_trans_N; 1.
-DR   SUPFAM; SSF50486; SSF50486; 1.
-DR   SUPFAM; SSF53328; SSF53328; 1.
-DR   TIGRFAMs; TIGR00460; fmt; 1.
-DR   PROSITE; PS00373; GART; 1.
+DR   GO; GO:0005777; C:peroxisome; IBA:GO_Central.
+DR   GO; GO:0008453; F:alanine-glyoxylate transaminase activity; IBA:GO_Central.
+DR   GO; GO:0004760; F:serine-pyruvate transaminase activity; IBA:GO_Central.
+DR   GO; GO:0019265; P:glycine biosynthetic process, by transamination of glyoxylate; IBA:GO_Central.
+DR   Gene3D; 3.40.640.10; -; 1.
+DR   Gene3D; 3.90.1150.10; -; 1.
+DR   InterPro; IPR000192; Aminotrans_V_dom.
+DR   InterPro; IPR020578; Aminotrans_V_PyrdxlP_BS.
+DR   InterPro; IPR015424; PyrdxlP-dep_Trfase.
+DR   InterPro; IPR015422; PyrdxlP-dep_Trfase_dom1.
+DR   InterPro; IPR015421; PyrdxlP-dep_Trfase_major.
+DR   InterPro; IPR024169; SP_NH2Trfase/AEP_transaminase.
+DR   Pfam; PF00266; Aminotran_5; 1.
+DR   PIRSF; PIRSF000524; SPT; 1.
+DR   SUPFAM; SSF53383; SSF53383; 1.
+DR   PROSITE; PS00595; AA_TRANSFER_CLASS_5; 1.
 PE   3: Inferred from homology;
-KW   Complete proteome; Protein biosynthesis; Reference proteome;
-KW   Transferase.
-FT   CHAIN         1    311       Methionyl-tRNA formyltransferase.
-FT                                /FTId=PRO_1000020173.
-FT   REGION      109    112       Tetrahydrofolate (THF) binding.
-FT                                {ECO:0000255|HAMAP-Rule:MF_00182}.
-SQ   SEQUENCE   311 AA;  34211 MW;  FC45A768EA61D5CA CRC64;
-     MTKIIFMGTP DFSTTVLEML IAEHDVIAVV TQPDRPVGRK RVMTPPPVKK VAMKYDLPVY
-     QPEKLSGSEE LEQLLQLDVD LIVTAAFGQL LPESLLALPN LGAINVHASL LPKYRGGAPI
-     HQAIIDGEQE TGITIMYMVK KLDAGNIISQ QAIKIEENDN VGTMHDKLSV LGADLLKETL
-     PSIIEGTNES VPQDDTQATF ASNIRREDER ISWNKPGRQV FNQIRGLSPW PVAYTTMDDT
-     NLKIYDAELV ETNKINEPGT IIETTKKAII VATNDNEAVA IKDMQLAGKK RMLAANYLSG
-     AQNTLVGKKL I
-//
+KW   Pyridoxal phosphate {ECO:0000256|PIRSR:PIRSR000524-50};
+KW   Reference proteome {ECO:0000313|Proteomes:UP000008816}.
+FT   DOMAIN          8..333
+FT                   /note="Aminotran_5"
+FT                   /evidence="ECO:0000259|Pfam:PF00266"
+FT   MOD_RES         195
+FT                   /note="N6-(pyridoxal phosphate)lysine"
+FT                   /evidence="ECO:0000256|PIRSR:PIRSR000524-50"
+SQ   SEQUENCE   386 AA;  42850 MW;  567080407ACD0927 CRC64;
+     MYYHQPLLLT PGPTPVPDAI MREIQAPMVG HRSKDFEDIA QQAFQGLKPI FGSQNDVLIL
+     TSSGTSVLEA SMLNIVNPED HFVVIVSGAF GNRFKQIAQT YYKNVHIYDV TWGEAVDVKD
+     FINFLSTLNV EVKAVFSQYC ETSTTVLHPI HELGNAINQF NSNIYFVVDG VSCIGAVDVD
+     INKDKIDVLV SGSQKAIMLP PGLAFVAYSH RAKEHFKEVT TPKFYLDLNK YISSQADNST
+     PFTPNVSLFR GVNAYVETVK AEGFNHVIAR HYAIRNALRS ALKALDLTLL VNDKDASPTV
+     TAFKPNTNDE VKIIKDELKN RFKITIAGGQ GHLKGQILRI GHMGKISPFD ILSVVSALEI
+     ILTEHRKVNY IGKGISKYME VIHEAI
+//
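
The updated example entry above uses the newer UniProt feature-table layout: the feature type and its location (`8..333`, or a single position such as `195`) share one `FT` line, and qualifiers follow on continuation lines as `/<qualifier>="<value>"`. A minimal sketch of reading that layout is shown below. It assumes the column positions used in this commit (type in columns 6 to 21, everything else from column 22); `parse_features` and `sample` are hypothetical names for illustration and are not part of MutaNET.

```python
def parse_features(ft_lines):
    """Parse new-style UniProt 'FT' lines into (start, end, type, values) tuples.

    Hypothetical helper: assumes the feature type sits in columns 6-21 and the
    location or qualifier text starts at column 22.
    """
    features = []
    for line in ft_lines:
        key = line[5:21].strip()
        rest = line[21:].strip()
        if key:
            # a new feature: remember its type and raw location
            features.append({'type': key.lower(),
                             'loc': rest.strip('?<>'),
                             'values': []})
        elif features and rest.startswith('/') and '="' in rest:
            # a qualifier line such as /note="..."; keep only the value
            value = rest.split('="', 1)[1].strip('"')
            features[-1]['values'].append(value)
        elif features and features[-1]['values'] and rest:
            # a value that wraps onto the next line
            features[-1]['values'][-1] += ' ' + rest.strip('"')

    result = []
    for feature in features:
        # 'start..end' or a single position, in which case end equals start
        start, _, end = feature['loc'].partition('..')
        end = end or start
        if start.isdigit() and end.isdigit():
            result.append((int(start), int(end),
                           feature['type'], feature['values']))
    return result


# The two features from the example entry, rebuilt with explicit padding so
# the column positions do not depend on how whitespace is rendered.
sample = [
    'FT   ' + 'DOMAIN'.ljust(16) + '8..333',
    'FT' + ' ' * 19 + '/note="Aminotran_5"',
    'FT' + ' ' * 19 + '/evidence="ECO:0000259|Pfam:PF00266"',
    'FT   ' + 'MOD_RES'.ljust(16) + '195',
    'FT' + ' ' * 19 + '/note="N6-(pyridoxal phosphate)lysine"',
    'FT' + ' ' * 19 + '/evidence="ECO:0000256|PIRSR:PIRSR000524-50"',
]

for feature in parse_features(sample):
    print(feature)
```

Unlike the converter in this commit, the sketch skips a continuation line that arrives before any qualifier instead of appending it to the previous value, and it returns positions as integers rather than strings.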
diff --git a/install_mac_os.sh b/install_mac_os.sh
@@ -19,11 +19,11 @@ then
 	fi
 	brew cask install java
 	echo
-	brew install homebrew/science/bwa
+	brew install bwa
 	echo
-	brew install homebrew/science/samtools
+	brew install samtools
 	echo
-	brew install homebrew/science/varscan
+	brew install brewsci/bio/varscan
 	echo
 fi
 

diff --git a/installation_guide.pdf b/installation_guide.pdf
diff --git a/mutaNET32.exe b/mutaNET32.exe
diff --git a/mutaNET64.exe b/mutaNET64.exe
diff --git a/source/configuration.py b/source/configuration.py
@@ -443,7 +443,7 @@ class GeneDB:
     md_syn = 'synonymous_mutations_per_kbp'             # type: str
     md_prom = 'promoter_mutations_per_kbp'              # type: str
     md_tfbs = 'tfbs_mutations_per_kbp'                  # type: str
-    pd_type = 'prot_dom_tyoe'                           # type: str
+    pd_type = 'prot_dom_type'                           # type: str
     pd_desc = 'prot_dom_desc'                           # type: str
     pd_start = 'prot_dom_start'                         # type: str
     pd_end = 'prot_dom_end'                             # type: str

diff --git a/source/converter.py b/source/converter.py
@@ -303,82 +303,54 @@ def parse_gn(self):
                 for name in names:
                     self.names.add(('', name.strip()))
 
-    def process_domain(self, t, s, e, ds):
+    def process_domain(self, t, loc, ds):
         """
         Processes the information of a single protein domain.
         :param t: domain type
-        :param s: start
-        :param e: end
+        :param loc: location
         :param ds: list of description lines
         """
-        # don't add the domain if start, end or type are not given
-        if not t or not s or not e:
+        # don't add the domain if location or type are not given
+        if not t or not loc:
             return
+
+        # extract start and end position from the location
+        loc = loc.split('..')
+        if len(loc) == 2:
+            s, e = loc
+        # sometimes there is only the start position
+        elif len(loc) == 1:
+            s = loc[0]
+            e = s
+        else:
+            return
+
         # don't add the domain if start or end are not numbers
         if not s.isdigit() or not e.isdigit():
             return
 
-        # a single description lines can contain several descriptions that end with '.'
+        # a single description line can contain several descriptions as follows: /<qualifier>="<value>"
         # this creates an adjusted list where each element is a single description
         ds_adjusted = []
         for d in ds:
-            words = d.split('. ')
-
-            words2 = []
-            # add '.' back to the description that was removed when splitting the line
-            # (except for the last one, since a description can continue on the next line)
-            for w in words[:-1]:
-                if not w.endswith('.'):
-                    words2.append(w + '.')
-                else:
-                    words2.append(w)
-            words2.append(words[-1])
-
-            # fix some inconsistent formatting in the feature descriptions
-            words = [words2[0]]
-            for i in range(1, len(words2)):
-                try:
-                    if words[i - 1][-4:] == 'Ref.':
-                        words[i - 1] += ' ' + words2[i]
-                    else:
-                        words.append(words2[i])
-                except IndexError:
-                    words.append(words2[i])
-
-            ds_adjusted += words
+            if not d:
+                continue
+            if '="' in d:
+                ds_adjusted.append(d.split('="')[1].strip('"'))
+            elif d:
+                ds_adjusted[-1] += ' ' + d.strip('"')
 
         # final list of domain descriptions
-        desc = []           # type: list(str)
-        # true of the description continues on the next line
-        cont_desc = False
+        desc = []
 
         for d in ds_adjusted:
-            # ignore the feature ID
-            if d.startswith('/FTId'):
-                continue
-
-            # a description ends with '.'
-            cont = not d.endswith('.')
-            d = d.strip('.').replace(';', ',')
-
             # ignore empty descriptions
-            if not d:
-                continue
-
-            # sequence that continues on the next line
-            if cont_desc and desc[-1][-1].isupper() and d[0].isupper():
-                desc[-1] += d
-            # description that continues on the next line and needs a space in between
-            elif cont_desc:
-                desc[-1] += ' ' + d
-            # new description
-            else:
+            d = d.strip().replace('|', '-')
+            if d:
                 desc.append(d)
 
-            cont_desc = cont
-
         # add the protein domain
-        self.domains.add((s, e, t, cfg.misc.in_sep.join(desc)))
+        self.domains.add((s, e, t.lower(), cfg.misc.in_sep.join(desc)))
 
     def parse_ft(self):
         """
@@ -388,33 +360,28 @@ def parse_ft(self):
         if not self.ft:
             return
 
-        type = ''
-        start = ''
-        end = ''
+        domain_type = ''
+        location = ''
         desc = []
 
         for line in self.ft:
             # parse the columns
-            ttype = line[5:13].strip()
-            tstart = line[14:20].strip().strip('>').strip('<')
-            tend = line[21:27].strip().strip('>').strip('<')
-            tdesc = line[34:].strip()
+            t_type = line[5:21].strip()
 
             # new feature
-            if ttype:
+            if t_type:
                 # add the previous feature to the domain list
-                self.process_domain(type, start, end, desc)
+                self.process_domain(domain_type, location, desc)
                 # set the type, start and end of the new feature
-                type = ttype
-                start = tstart
-                end = tend
-                desc = [tdesc]
+                domain_type = t_type
+                location = line[21:].strip('?<>').strip()
+                desc = []
             # previous feature is continued
             else:
-                desc.append(tdesc)
+                desc.append(line[21:].strip())
 
         # add the last feature to the domain list
-        self.process_domain(type, start, end, desc)
+        self.process_domain(domain_type, location, desc)
 
     def validate(self):
         """
@@ -582,7 +549,7 @@ def tsv(self):
             file = open(self.out_path, 'w')
 
             # write the header
-            file.write('\t'.join([cfg.res.lt, cfg.res.name, cfg.res.desc]) + '\n')
+            file.write('\t'.join([cfg.int.lt, cfg.int.name, cfg.int.desc]) + '\n')
 
             # sort the genes by locus tag
             for key, val in sorted(self.entries.items(), key=lambda x: x[0]):