Commit

updated MacOS installation, user manual and UniProt converter
Markus-Hollander committed Jan 19, 2021
1 parent f4733b8 commit b8eeea6
Showing 9 changed files with 115 additions and 151 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Development files
*.fastq
results/
user_projects/
.idea
pyinstaller.bat
.spec
3 changes: 2 additions & 1 deletion README.md
@@ -1,4 +1,5 @@
# MutaNET
## Introduction
Mutations in genomic key elements can influence gene expression and function in various ways, and hence greatly contribute to the phenotype. MutaNET comes with a next generation sequencing (NGS) pipeline that calls mutations based on paired-end NGS reads, an automated analysis tool and various file converters and mergers. The mutation analysis feature considers the coding region, protein domains, regulation and transcription factor binding site information, and can be used to analyse the potential impact of mutations on genes of interest.

MutaNET was developed and implemented in 2017 and published in 2018:
@@ -29,7 +30,7 @@ When starting MutaNET for the first time, the file paths for small example data
**NGS Pipeline:** You need to extract `S11_R1.fastq` and `S11_R2.fastq` from `S11_R1.zip` and `S11_R2.zip` in `example_data/NGS/reads` before running the NGS pipeline with the example data.

## Change Log
### Version 1.1.0
### Version 2.0
- added support for eukaryotes
- fixed file encoding issues
- fixed minor frame shift bug
145 changes: 70 additions & 75 deletions example_data/UniProt/s_aureus_domains_shortened.txt
@@ -1,83 +1,78 @@
ID FMT_STAA8 Reviewed; 311 AA.
AC Q2FZ68;
DT 15-JAN-2008, integrated into UniProtKB/Swiss-Prot.
ID Q2FXK2_STAA8 Unreviewed; 386 AA.
AC Q2FXK2;
DT 21-MAR-2006, integrated into UniProtKB/TrEMBL.
DT 21-MAR-2006, sequence version 1.
DT 15-MAR-2017, entry version 72.
DE RecName: Full=Methionyl-tRNA formyltransferase {ECO:0000255|HAMAP-Rule:MF_00182};
DE EC=2.1.2.9 {ECO:0000255|HAMAP-Rule:MF_00182};
GN Name=fmt {ECO:0000255|HAMAP-Rule:MF_00182};
GN OrderedLocusNames=SAOUHSC_01183;
OS Staphylococcus aureus (strain NCTC 8325).
DT 07-OCT-2020, entry version 87.
DE RecName: Full=Aminotran_5 domain-containing protein {ECO:0000259|Pfam:PF00266};
GN OrderedLocusNames=SAOUHSC_01832 {ECO:0000313|EMBL:ABD30900.1};
OS Staphylococcus aureus (strain NCTC 8325 / PS 47).
OC Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae;
OC Staphylococcus.
OX NCBI_TaxID=93061;
RN [1]
OX NCBI_TaxID=93061 {ECO:0000313|EMBL:ABD30900.1, ECO:0000313|Proteomes:UP000008816};
RN [1] {ECO:0000313|Proteomes:UP000008816}
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=NCTC 8325;
RA Gillaspy A.F., Worrell V., Orvis J., Roe B.A., Dyer D.W.,
RA Iandolo J.J.;
RC STRAIN=NCTC 8325 / PS 47 {ECO:0000313|Proteomes:UP000008816};
RA Gillaspy A.F., Worrell V., Orvis J., Roe B.A., Dyer D.W., Iandolo J.J.;
RT "The Staphylococcus aureus NCTC 8325 genome.";
RL (In) Fischetti V., Novick R., Ferretti J., Portnoy D., Rood J. (eds.);
RL Gram positive pathogens, 2nd edition, pp.381-412, ASM Press,
RL Washington D.C. (2006).
CC -!- FUNCTION: Modifies the free amino group of the aminoacyl moiety of
CC methionyl-tRNA(fMet). The formyl group appears to play a dual role
CC in the initiator identity of N-formylmethionyl-tRNA by: (I)
CC promoting its recognition by IF2 and (II) impairing its binding to
CC EFTu-GTP. {ECO:0000255|HAMAP-Rule:MF_00182}.
CC -!- CATALYTIC ACTIVITY: 10-formyltetrahydrofolate + L-methionyl-
CC tRNA(fMet) = tetrahydrofolate + N-formylmethionyl-tRNA(fMet).
CC {ECO:0000255|HAMAP-Rule:MF_00182}.
CC -!- SIMILARITY: Belongs to the Fmt family. {ECO:0000255|HAMAP-
CC Rule:MF_00182}.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; CP000253; ABD30290.1; -; Genomic_DNA.
DR RefSeq; WP_000161299.1; NC_007795.1.
DR RefSeq; YP_499722.1; NC_007795.1.
DR ProteinModelPortal; Q2FZ68; -.
DR SMR; Q2FZ68; -.
DR STRING; 93061.SAOUHSC_01183; -.
DR EnsemblBacteria; ABD30290; ABD30290; SAOUHSC_01183.
DR GeneID; 28381227; -.
DR GeneID; 3919316; -.
DR KEGG; sao:SAOUHSC_01183; -.
DR PATRIC; 19579855; VBIStaAur99865_1086.
DR eggNOG; ENOG4105CAE; Bacteria.
DR eggNOG; COG0223; LUCA.
DR HOGENOM; HOG000261177; -.
DR KO; K00604; -.
DR OMA; GCINSHA; -.
RL Gram positive pathogens, 2nd edition, pp.381-412, ASM Press, Washington D.C
RL (2006).
CC -!- COFACTOR:
CC Name=pyridoxal 5'-phosphate; Xref=ChEBI:CHEBI:597326;
CC Evidence={ECO:0000256|ARBA:ARBA00001933,
CC ECO:0000256|PIRSR:PIRSR000524-50, ECO:0000256|RuleBase:RU004504};
CC -!- SIMILARITY: Belongs to the class-V pyridoxal-phosphate-dependent
CC aminotransferase family. {ECO:0000256|ARBA:ARBA00009236,
CC ECO:0000256|RuleBase:RU004075}.
CC ---------------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution (CC BY 4.0) License
CC ---------------------------------------------------------------------------
DR EMBL; CP000253; ABD30900.1; -; Genomic_DNA.
DR RefSeq; WP_000291415.1; NZ_LS483365.1.
DR RefSeq; YP_500338.1; NC_007795.1.
DR SMR; Q2FXK2; -.
DR STRING; 1280.SAXN108_1751; -.
DR EnsemblBacteria; ABD30900; ABD30900; SAOUHSC_01832.
DR GeneID; 3921782; -.
DR GeneID; 45574931; -.
DR KEGG; sao:SAOUHSC_01832; -.
DR PATRIC; fig|93061.5.peg.1671; -.
DR eggNOG; COG0075; Bacteria.
DR HOGENOM; CLU_027686_1_1_9; -.
DR OMA; GQTHSTP; -.
DR Proteomes; UP000008816; Chromosome.
DR GO; GO:0004479; F:methionyl-tRNA formyltransferase activity; IEA:UniProtKB-EC.
DR Gene3D; 3.10.25.10; -; 1.
DR Gene3D; 3.40.50.170; -; 1.
DR HAMAP; MF_00182; Formyl_trans; 1.
DR InterPro; IPR005794; Fmt.
DR InterPro; IPR005793; Formyl_trans_C.
DR InterPro; IPR002376; Formyl_transf_N.
DR InterPro; IPR011034; Formyl_transferase_C-like.
DR InterPro; IPR001555; GART_AS.
DR Pfam; PF02911; Formyl_trans_C; 1.
DR Pfam; PF00551; Formyl_trans_N; 1.
DR SUPFAM; SSF50486; SSF50486; 1.
DR SUPFAM; SSF53328; SSF53328; 1.
DR TIGRFAMs; TIGR00460; fmt; 1.
DR PROSITE; PS00373; GART; 1.
DR GO; GO:0005777; C:peroxisome; IBA:GO_Central.
DR GO; GO:0008453; F:alanine-glyoxylate transaminase activity; IBA:GO_Central.
DR GO; GO:0004760; F:serine-pyruvate transaminase activity; IBA:GO_Central.
DR GO; GO:0019265; P:glycine biosynthetic process, by transamination of glyoxylate; IBA:GO_Central.
DR Gene3D; 3.40.640.10; -; 1.
DR Gene3D; 3.90.1150.10; -; 1.
DR InterPro; IPR000192; Aminotrans_V_dom.
DR InterPro; IPR020578; Aminotrans_V_PyrdxlP_BS.
DR InterPro; IPR015424; PyrdxlP-dep_Trfase.
DR InterPro; IPR015422; PyrdxlP-dep_Trfase_dom1.
DR InterPro; IPR015421; PyrdxlP-dep_Trfase_major.
DR InterPro; IPR024169; SP_NH2Trfase/AEP_transaminase.
DR Pfam; PF00266; Aminotran_5; 1.
DR PIRSF; PIRSF000524; SPT; 1.
DR SUPFAM; SSF53383; SSF53383; 1.
DR PROSITE; PS00595; AA_TRANSFER_CLASS_5; 1.
PE 3: Inferred from homology;
KW Complete proteome; Protein biosynthesis; Reference proteome;
KW Transferase.
FT CHAIN 1 311 Methionyl-tRNA formyltransferase.
FT /FTId=PRO_1000020173.
FT REGION 109 112 Tetrahydrofolate (THF) binding.
FT {ECO:0000255|HAMAP-Rule:MF_00182}.
SQ SEQUENCE 311 AA; 34211 MW; FC45A768EA61D5CA CRC64;
MTKIIFMGTP DFSTTVLEML IAEHDVIAVV TQPDRPVGRK RVMTPPPVKK VAMKYDLPVY
QPEKLSGSEE LEQLLQLDVD LIVTAAFGQL LPESLLALPN LGAINVHASL LPKYRGGAPI
HQAIIDGEQE TGITIMYMVK KLDAGNIISQ QAIKIEENDN VGTMHDKLSV LGADLLKETL
PSIIEGTNES VPQDDTQATF ASNIRREDER ISWNKPGRQV FNQIRGLSPW PVAYTTMDDT
NLKIYDAELV ETNKINEPGT IIETTKKAII VATNDNEAVA IKDMQLAGKK RMLAANYLSG
AQNTLVGKKL I
//
KW Pyridoxal phosphate {ECO:0000256|PIRSR:PIRSR000524-50};
KW Reference proteome {ECO:0000313|Proteomes:UP000008816}.
FT DOMAIN 8..333
FT /note="Aminotran_5"
FT /evidence="ECO:0000259|Pfam:PF00266"
FT MOD_RES 195
FT /note="N6-(pyridoxal phosphate)lysine"
FT /evidence="ECO:0000256|PIRSR:PIRSR000524-50"
SQ SEQUENCE 386 AA; 42850 MW; 567080407ACD0927 CRC64;
MYYHQPLLLT PGPTPVPDAI MREIQAPMVG HRSKDFEDIA QQAFQGLKPI FGSQNDVLIL
TSSGTSVLEA SMLNIVNPED HFVVIVSGAF GNRFKQIAQT YYKNVHIYDV TWGEAVDVKD
FINFLSTLNV EVKAVFSQYC ETSTTVLHPI HELGNAINQF NSNIYFVVDG VSCIGAVDVD
INKDKIDVLV SGSQKAIMLP PGLAFVAYSH RAKEHFKEVT TPKFYLDLNK YISSQADNST
PFTPNVSLFR GVNAYVETVK AEGFNHVIAR HYAIRNALRS ALKALDLTLL VNDKDASPTV
TAFKPNTNDE VKIIKDELKN RFKITIAGGQ GHLKGQILRI GHMGKISPFD ILSVVSALEI
ILTEHRKVNY IGKGISKYME VIHEAI
//
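
The updated example entry uses the newer UniProt feature-table layout: a feature key, then a location such as `8..333`, with `/note` and `/evidence` qualifiers on continuation lines. The following is a minimal sketch of how such lines could be grouped into features. It assumes the fixed-column layout that the converter change in this commit also relies on (key in `line[5:21]`, remainder from `line[21:]`). The names `ft_line` and `group_features` are hypothetical and do not come from MutaNET.

```python
def ft_line(key, text):
    """Build an FT line with the feature key padded to column 21 (hypothetical helper)."""
    return 'FT   {:<16}{}'.format(key, text)


def group_features(ft_lines):
    """Group FT lines into (key, location, qualifier_lines) tuples."""
    features = []
    for line in ft_lines:
        key = line[5:21].strip()
        text = line[21:].strip()
        if key:
            # a non-empty key column starts a new feature
            features.append((key, text, []))
        elif features:
            # an empty key column continues the previous feature
            features[-1][2].append(text)
    return features


sample = [
    ft_line('DOMAIN', '8..333'),
    ft_line('', '/note="Aminotran_5"'),
    ft_line('', '/evidence="ECO:0000259|Pfam:PF00266"'),
    ft_line('MOD_RES', '195'),
    ft_line('', '/note="N6-(pyridoxal phosphate)lysine"'),
]

features = group_features(sample)
for key, location, qualifiers in features:
    print(key, location, qualifiers)
```

Building the sample lines with a format string rather than typing the padding by hand keeps the column positions exact, which matters because the slicing is position-based.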
6 changes: 3 additions & 3 deletions install_mac_os.sh
@@ -19,11 +19,11 @@ then
fi
brew cask install java
echo
brew install homebrew/science/bwa
brew install bwa
echo
brew install homebrew/science/samtools
brew install samtools
echo
brew install homebrew/science/varscan
brew install brewsci/bio/varscan
echo
fi

Binary file modified installation_guide.pdf
Binary file not shown.
Binary file modified mutaNET32.exe
Binary file not shown.
Binary file modified mutaNET64.exe
Binary file not shown.
2 changes: 1 addition & 1 deletion source/configuration.py
@@ -443,7 +443,7 @@ class GeneDB:
md_syn = 'synonymous_mutations_per_kbp' # type: str
md_prom = 'promoter_mutations_per_kbp' # type: str
md_tfbs = 'tfbs_mutations_per_kbp' # type: str
pd_type = 'prot_dom_tyoe' # type: str
pd_type = 'prot_dom_type' # type: str
pd_desc = 'prot_dom_desc' # type: str
pd_start = 'prot_dom_start' # type: str
pd_end = 'prot_dom_end' # type: str
109 changes: 38 additions & 71 deletions source/converter.py
@@ -303,82 +303,54 @@ def parse_gn(self):
for name in names:
self.names.add(('', name.strip()))

def process_domain(self, t, s, e, ds):
def process_domain(self, t, loc, ds):
"""
Processes the information of a single protein domain.
:param t: domain type
:param s: start
:param e: end
:param loc: location
:param ds: list of description lines
"""
# don't add the domain if start, end or type are not given
if not t or not s or not e:
# don't add the domain if location or type are not given
if not t or not loc:
return

# extract start and end position from the location
loc = loc.split('..')
if len(loc) == 2:
s, e = loc
# sometimes there is only the start position
elif len(loc) == 1:
s = loc[0]
e = s
else:
return

# don't add the domain if start or end are not numbers
if not s.isdigit() or not e.isdigit():
return

# a single description lines can contain several descriptions that end with '.'
# a single description lines can contain several descriptions as follows: /<qualifier>="<value>"
# this creates an adjusted list where each element is a single description
ds_adjusted = []
for d in ds:
words = d.split('. ')

words2 = []
# add '.' back to the description that was removed when splitting the line
# (except for the last one, since a description can continue on the next line)
for w in words[:-1]:
if not w.endswith('.'):
words2.append(w + '.')
else:
words2.append(w)
words2.append(words[-1])

# fix some inconsistent formatting in the feature descriptions
words = [words2[0]]
for i in range(1, len(words2)):
try:
if words[i - 1][-4:] == 'Ref.':
words[i - 1] += ' ' + words2[i]
else:
words.append(words2[i])
except IndexError:
words.append(words2[i])

ds_adjusted += words
if not d:
continue
if '="' in d:
ds_adjusted.append(d.split('="')[1].strip('"'))
elif d:
ds_adjusted[-1] += ' ' + d.strip('"')

# final list of domain descriptions
desc = [] # type: list(str)
# true of the description continues on the next line
cont_desc = False
desc = []

for d in ds_adjusted:
# ignore the feature ID
if d.startswith('/FTId'):
continue

# a description ends with '.'
cont = not d.endswith('.')
d = d.strip('.').replace(';', ',')

# ignore empty descriptions
if not d:
continue

# sequence that continues on the next line
if cont_desc and desc[-1][-1].isupper() and d[0].isupper():
desc[-1] += d
# description that continues on the next line and needs a space in between
elif cont_desc:
desc[-1] += ' ' + d
# new description
else:
d = d.strip().replace('|', '-')
if d:
desc.append(d)

cont_desc = cont

# add the protein domain
self.domains.add((s, e, t, cfg.misc.in_sep.join(desc)))
self.domains.add((s, e, t.lower(), cfg.misc.in_sep.join(desc)))

def parse_ft(self):
"""
@@ -388,33 +360,28 @@ def parse_ft(self):
if not self.ft:
return

type = ''
start = ''
end = ''
domain_type = ''
location = ''
desc = []

for line in self.ft:
# parse the columns
ttype = line[5:13].strip()
tstart = line[14:20].strip().strip('>').strip('<')
tend = line[21:27].strip().strip('>').strip('<')
tdesc = line[34:].strip()
t_type = line[5:21].strip()

# new feature
if ttype:
if t_type:
# add the previous feature to the domain list
self.process_domain(type, start, end, desc)
self.process_domain(domain_type, location, desc)
# set the type, start and end of the new feature
type = ttype
start = tstart
end = tend
desc = [tdesc]
domain_type = t_type
location = line[21:].strip('?<>').strip()
desc = []
# previous feature is continued
else:
desc.append(tdesc)
desc.append(line[21:].strip())

# add the last feature to the domain list
self.process_domain(type, start, end, desc)
self.process_domain(domain_type, location, desc)

def validate(self):
"""
@@ -582,7 +549,7 @@ def tsv(self):
file = open(self.out_path, 'w')

# write the header
file.write('\t'.join([cfg.res.lt, cfg.res.name, cfg.res.desc]) + '\n')
file.write('\t'.join([cfg.int.lt, cfg.int.name, cfg.int.desc]) + '\n')

# sort the genes by locus tag
for key, val in sorted(self.entries.items(), key=lambda x: x[0]):
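
The reworked `process_domain` in `source/converter.py` replaces the separate start and end columns with a single location string, and reads descriptions from `/<qualifier>="<value>"` lines. The standalone sketch below illustrates both steps. It is not the code from the commit, and it deviates from it in two places, which are assumptions about what the intended behaviour is: the `?`, `<` and `>` markers are stripped from each endpoint rather than from the whole location string, and a continuation line that arrives before any qualifier is skipped instead of being appended to an empty list. The function names `parse_location` and `qualifier_values` are hypothetical.

```python
def parse_location(loc):
    """Return (start, end) as digit strings, or None if the location is unusable."""
    parts = [p.strip('?<> ') for p in loc.split('..')]
    if len(parts) == 2:
        start, end = parts
    elif len(parts) == 1:
        # a single position serves as both start and end
        start = end = parts[0]
    else:
        return None
    if not start.isdigit() or not end.isdigit():
        return None
    return start, end


def qualifier_values(lines):
    """Collect the quoted values of /qualifier="value" lines, joining wrapped values."""
    values = []
    for line in lines:
        if not line:
            continue
        if '="' in line:
            # new qualifier: keep only the value after ="
            values.append(line.split('="', 1)[1].strip('"'))
        elif values:
            # wrapped value: append to the previous qualifier
            values[-1] += ' ' + line.strip('"')
    return values


print(parse_location('8..333'))
print(parse_location('195'))
print(parse_location('<1..>50'))
print(qualifier_values(['/note="Long description that', 'wraps"', '/evidence="ECO:1"']))
```

Returning `None` for unusable locations mirrors the early `return` statements in the diff, so a caller can skip the feature without raising.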