Core features

IO

We plan to explore using Ragel to generate parsers for as many file formats as possible (see this thread). Writers will still need to be written by hand.

Sequence formats

FASTA
FASTQ
GenBank
EMBL

Many more formats are listed at http://www.bioperl.org/wiki/HOWTO:SeqIO, but a lot of them are antiquated. I think we should prioritise by popularity: the sooner BioJulia is useful, the better for the community.
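To give a feel for what a sequence reader has to do, here is a minimal hand-written FASTA reader. It is only a sketch: it is written in Python for illustration rather than generated by Ragel or written in Julia, and the names `read_fasta` and `_make_record` are hypothetical, not part of any planned API. It assumes the header's first whitespace-free token is the identifier and the rest is a free-text description.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (identifier, description, sequence)


def _make_record(header: str, chunks: List[str]) -> Record:
    # Assumption: the identifier runs up to the first space in the header.
    ident, _, desc = header.partition(" ")
    return ident, desc.strip(), "".join(chunks)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one record per '>' header, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    example = [">seq1 first record", "ACGT", "TTGA", "", ">seq2", "MKV"]
    for ident, desc, seq in read_fasta(example):
        print(ident, desc, seq)
```

Yielding records lazily keeps memory use flat on large files, which is probably a property worth keeping whichever way the real parsers are generated.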

Annotation formats

GFF & GTF (these are messy in most languages; it would be great if we could handle all the quirks cleanly)
BED
VCF
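One of the GFF/GTF quirks mentioned above is the ninth (attributes) column: GFF3 writes `key=value` pairs with URL-escaped values and comma-separated multiple values, while GTF writes `key "value";` pairs. The sketch below shows one way a single function might accept both. It is in Python for illustration only, `parse_attributes` is a hypothetical name, and it deliberately ignores harder cases such as semicolons inside quoted GTF values.

```python
from typing import Dict, List
from urllib.parse import unquote


def parse_attributes(field: str) -> Dict[str, List[str]]:
    """Parse a GFF3 or GTF attributes column into {key: [values]}."""
    attrs: Dict[str, List[str]] = {}
    # Limitation: a plain split breaks on ';' inside quoted GTF values.
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        if "=" in part and '"' not in part.split("=", 1)[0]:
            # GFF3 style: key=value1,value2 with URL-escaped values.
            key, value = part.split("=", 1)
            values = [unquote(v) for v in value.split(",")]
        else:
            # GTF style: key "value"
            key, _, value = part.partition(" ")
            values = [value.strip().strip('"')]
        attrs.setdefault(key.strip(), []).extend(values)
    return attrs


if __name__ == "__main__":
    print(parse_attributes("ID=gene1;Name=abc%20def;Parent=t1,t2"))
    print(parse_attributes('gene_id "g1"; transcript_id "t1";'))
```

Storing every value as a list keeps repeated keys and multi-valued GFF3 attributes uniform, at the cost of a little extra indexing for the common single-value case.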

Alignment formats

BLAST (tabular/long form)
MultiFASTA aligned
CLUSTAL
BAM/SAM
Phylip
PFAM

Tree formats

Newick (can be ported from Phylogenetics.jl)
Nexus
PhyloXML

We will also want database connectors, e.g. for BioSQL.

Datastructures

We'll want to have representations of:

DNA, RNA and amino acid sequences
ranges and features of sequences (where the sequence may or may not be present)
alignments - pairwise and multiple
graph-derivative structures like phylogenetic trees, genetic networks and biochemical pathways
probabilistic models of sequences (e.g. motifs - perhaps this isn't a high priority)

Having a solid interval tree implementation would enable a lot of common operations on genome annotations: counting, intersecting, extending, etc. We should also look at what parts of Diego's BioSeq.jl we can incorporate. We'll have to extend those sequence representations to attach metadata to sequences, but that shouldn't be too hard.
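As a rough illustration of the interval queries described above, here is a small static interval tree: intervals are sorted by start, stored in an array, and each implicit node records the largest end coordinate in its subtree so whole subtrees can be skipped. This is a sketch under assumptions, not a proposed design: it is in Python rather than Julia, the class name `IntervalTree` and its `overlapping` and `count` methods are hypothetical, intervals are half-open `(start, end, label)` tuples, and the tree cannot be modified after construction.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int, str]  # half-open [start, end) plus a label


class IntervalTree:
    """Static augmented interval tree stored implicitly in a sorted list."""

    def __init__(self, intervals: Sequence[Interval]) -> None:
        self._items: List[Interval] = sorted(intervals)
        self._max_end: List[float] = [0] * len(self._items)
        self._build(0, len(self._items))

    def _build(self, lo: int, hi: int) -> float:
        # The node for the slice [lo, hi) is its middle element.
        if lo >= hi:
            return float("-inf")
        mid = (lo + hi) // 2
        best = max(self._items[mid][1], self._build(lo, mid), self._build(mid + 1, hi))
        self._max_end[mid] = best
        return best

    def overlapping(self, start: int, end: int) -> List[Interval]:
        """Return intervals overlapping [start, end), ordered by start."""
        out: List[Interval] = []
        self._query(0, len(self._items), start, end, out)
        return out

    def count(self, start: int, end: int) -> int:
        return len(self.overlapping(start, end))

    def _query(self, lo: int, hi: int, start: int, end: int, out: List[Interval]) -> None:
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self._max_end[mid] <= start:
            return  # nothing in this subtree ends after the query starts
        self._query(lo, mid, start, end, out)
        s, e = self._items[mid][:2]
        if s < end:
            if e > start:
                out.append(self._items[mid])
            # Right subtree starts are >= s, so only search it when s < end.
            self._query(mid + 1, hi, start, end, out)


if __name__ == "__main__":
    features = [(10, 20, "geneA"), (15, 30, "geneB"), (40, 50, "geneC"), (5, 8, "geneD")]
    tree = IntervalTree(features)
    print(tree.overlapping(18, 42))
    print(tree.count(8, 10))
```

Operations such as intersecting or extending annotations could be layered on top of `overlapping`; a mutable, self-balancing tree would be needed if features are inserted or removed after the tree is built.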

Tool wrappers

BLAST
Blat
bowtie/2
bwa
HMMER
Primer3
Phylogenetic tools (clustal, mafft, PAML, phylip)
samtools (unless we can do something faster in our own sam/bam implementation)
signalP/targetP
assemblers: velvet/oases, trinity, soapdenovo

Service APIs

BioMart
Ensembl
EMBL
NCBI
SRA

Datasets

genome sequences
genome annotations
gene ontologies
