core features
We plan to explore using Ragel to generate parsers for as many file formats as possible (see this thread). Writers will still need to be written by hand.
- FASTA
- FASTQ
- GenBank
- EMBL
Many more are listed at http://www.bioperl.org/wiki/HOWTO:SeqIO, but a lot of them are antiquated formats. I think we should prioritise by popularity. The sooner BioJulia is useful, the better for the community.
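To make the parser/writer split concrete, here is a minimal hand-written sketch of a FASTA reader and writer. It is written in Python purely for illustration, it is not Ragel-generated, and it is not a proposal for the BioJulia API. The names `parse_fasta` and `format_fasta` and the default line width of 60 are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence); hypothetical record shape


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
        else:
            raise ValueError("sequence data before the first '>' header")
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialise records as FASTA text, wrapping sequences at `width` columns."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "".join(line + "\n" for line in out)
```

Reading the writer's output back through the parser should return the original records, which is a cheap round-trip check to apply to every format on the list.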
- GFF & GTF (this is messy in most languages - it would be great if we could cleanly handle all the quirks)
- BED
- VCF
- BLAST (tabular/long form)
- MultiFASTA aligned
- CLUSTAL
- BAM/SAM
- Phylip
- PFAM
- Newick (can be ported from Phylogenetics.jl)
- Nexus
- PhyloXML
Also database connectors, e.g. for BioSQL.
We'll want to have representations of:
- DNA, RNA and amino acid sequences
- ranges and features of sequences (where the sequence may or may not be present)
- alignments - pairwise and multiple
- graph-derivative structures like phylogenetic trees, genetic networks and biochemical pathways
- probabilistic models of sequences (e.g. motifs - perhaps this isn't a high priority)
Having a solid interval tree implementation would enable a lot of common operations on genome annotations: counting, intersecting, extending, etc. We should also look at what parts of Diego's BioSeq.jl we can incorporate. We'll have to extend those sequence representations to attach metadata to sequences, but that shouldn't be too hard.
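As a rough illustration of the kind of query an interval tree supports, the sketch below builds a static, augmented tree over sorted annotations and reports those that overlap a query range. It is written in Python only to show the idea and is not a proposed BioJulia design. The class name `IntervalIndex`, the `(start, end, label)` tuple layout, and the closed-coordinate convention are all assumptions made for this example.

```python
from typing import Iterable, List, Optional, Tuple

# (start, end, label) with closed coordinates and start <= end (assumed layout).
Interval = Tuple[int, int, str]


class IntervalIndex:
    """Static augmented interval tree: an implicit balanced BST over the
    intervals sorted by start, where each node stores the maximum end
    coordinate found in its subtree."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self.items: List[Interval] = sorted(intervals)
        self.max_end: List[int] = [0] * len(self.items)
        self._build(0, len(self.items))

    def _build(self, lo: int, hi: int) -> Optional[int]:
        """Fill max_end for the subtree over items[lo:hi]; return its maximum end."""
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        best = self.items[mid][1]
        for child in (self._build(lo, mid), self._build(mid + 1, hi)):
            if child is not None and child > best:
                best = child
        self.max_end[mid] = best
        return best

    def overlapping(self, start: int, end: int) -> List[Interval]:
        """Return all intervals that share at least one position with [start, end]."""
        found: List[Interval] = []
        self._query(0, len(self.items), start, end, found)
        return found

    def _query(self, lo: int, hi: int, start: int, end: int,
               found: List[Interval]) -> None:
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self.max_end[mid] < start:
            return  # every interval in this subtree ends before the query begins
        self._query(lo, mid, start, end, found)
        s, e, _ = self.items[mid]
        if s <= end and e >= start:
            found.append(self.items[mid])
        if s <= end:
            # Intervals to the right start at or after s, so they can only
            # overlap if s itself does not lie beyond the query end.
            self._query(mid + 1, hi, start, end, found)

    def count(self, start: int, end: int) -> int:
        """Count the intervals overlapping [start, end]."""
        return len(self.overlapping(start, end))
```

For example, after `idx = IntervalIndex([(100, 200, "geneA"), (400, 500, "geneC")])`, calling `idx.count(150, 450)` counts both annotations. A real implementation would also need insertion and deletion, plus a decision on the coordinate convention, since formats such as BED and GFF disagree on it.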
- BLAST
- Blat
- bowtie/bowtie2
- bwa
- HMMER
- Primer3
- Phylogenetic tools (clustal, mafft, PAML, phylip)
- samtools (unless we can do something faster in our own SAM/BAM implementation)
- signalP/targetP
- assemblers: velvet/oases, trinity, soapdenovo
- BioMart
- Ensembl
- EMBL
- NCBI
- SRA
- genome sequences
- genome annotations
- gene ontologies