tag: genome annotation analysis in Python!

tag is a free, open-source software package for analyzing genome annotation data. It is developed as a reusable library with a focus on ease of use. tag is implemented in pure Python (no compiling required) with minimal dependencies!

What problem does tag solve?

Computational biology is 90% text formatting and ID cross-referencing!
– discouraged graduate students everywhere

Most GFF parsers will load data into memory for you (the trivial bit) but will not group related features for you (the useful bit). tag represents related features as a feature graph (a directed acyclic graph) that can be easily traversed and inspected.

# Compute number of exons per gene
import tag
reader = tag.GFF3Reader(infilename='/data/genomes/mybug.gff3.gz')
for gene in tag.select.features(reader, type='gene'):
    # Iterating over a gene visits the features in its feature graph
    exons = [feat for feat in gene if feat.type == 'exon']
    print('num exons:', len(exons))

See the primer on annotation formats for more information.
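
To make the feature graph idea concrete, here is a minimal standard-library sketch of how related features can be grouped by their ID and Parent attributes. It is not how tag is implemented and it does not use the tag API: the helper names (parse_attributes, build_graph, descendants) and the sample records are hypothetical, and the sketch assumes every feature carries an ID attribute.

```python
from collections import defaultdict

# Hypothetical sample data: one gene, one mRNA, two exons (tab-separated GFF3).
GFF3_TEXT = """\
chr1\t.\tgene\t1000\t5000\t.\t+\t.\tID=gene1
chr1\t.\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\t.\texon\t1000\t2000\t.\t+\t.\tID=exon1;Parent=mrna1
chr1\t.\texon\t3000\t5000\t.\t+\t.\tID=exon2;Parent=mrna1
"""


def parse_attributes(column):
    """Split the ninth GFF3 column into a dict of key/value pairs."""
    return dict(pair.split('=', 1) for pair in column.split(';') if pair)


def build_graph(text):
    """Return (features by ID, child IDs by parent ID).

    Assumption: every feature has an ID attribute.
    """
    features = {}
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines, directives, and comments
        fields = line.split('\t')
        attrs = parse_attributes(fields[8])
        fid = attrs['ID']
        features[fid] = {'type': fields[2],
                         'start': int(fields[3]),
                         'end': int(fields[4])}
        # A feature may list several comma-separated parents,
        # which is why the structure is a DAG rather than a tree.
        for parent in attrs.get('Parent', '').split(','):
            if parent:
                children[parent].append(fid)
    return features, children


def descendants(fid, children):
    """Yield the IDs of all features below fid, depth first."""
    for child in children.get(fid, []):
        yield child
        yield from descendants(child, children)


features, children = build_graph(GFF3_TEXT)
exon_counts = {}
for fid, feat in features.items():
    if feat['type'] == 'gene':
        # A set avoids double counting exons reachable via several parents.
        exons = {d for d in descendants(fid, children)
                 if features[d]['type'] == 'exon'}
        exon_counts[fid] = len(exons)
print(exon_counts)  # {'gene1': 2}
```

The bookkeeping in build_graph is the "useful bit" that a plain line-by-line parser leaves to the caller.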

Summary

The tag library is built around the following features:

  • parsers and writers for reading and printing annotation data in GFF3 format (with intelligent gzip support)
  • data structures for convenient handling of various types of GFF3 entries: annotated sequence features, directives and other metadata, embedded sequences, and comments
  • generator functions for a variety of common and useful annotation processing tasks, which can be easily composed to create streaming pipelines
  • a unified command-line interface for executing common processing workflows
  • a stable, documented Python API for interactive data analysis and building custom workflows
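
The streaming pipeline idea from the list above can be sketched with plain Python generators. This is an illustration of the composition pattern only, not the tag API: parse_records, select_type, and min_length are hypothetical names, and the sample lines are made up for the example.

```python
def parse_records(lines):
    """Yield a small dict per feature line, skipping directives and comments."""
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue
        seqid, _source, ftype, start, end = line.split('\t')[:5]
        yield {'seqid': seqid, 'type': ftype,
               'start': int(start), 'end': int(end)}


def select_type(records, ftype):
    """Pass through only the records of the requested feature type."""
    return (rec for rec in records if rec['type'] == ftype)


def min_length(records, threshold):
    """Pass through only the records spanning at least threshold bases."""
    for rec in records:
        # GFF3 coordinates are 1-based and inclusive, hence the + 1.
        if rec['end'] - rec['start'] + 1 >= threshold:
            yield rec


# Hypothetical input; any iterable of lines works, including an open file.
LINES = [
    '##gff-version 3\n',
    'chr1\t.\tgene\t1000\t5000\t.\t+\t.\tID=gene1\n',
    'chr1\t.\texon\t1000\t2000\t.\t+\t.\tID=exon1;Parent=gene1\n',
    'chr1\t.\tgene\t7000\t7400\t.\t-\t.\tID=gene2\n',
]

# Each stage consumes the previous one lazily, one record at a time.
pipeline = min_length(select_type(parse_records(LINES), 'gene'), 1000)
long_genes = [(rec['seqid'], rec['start'], rec['end']) for rec in pipeline]
print(long_genes)  # [('chr1', 1000, 5000)]
```

Because every stage is a generator, nothing is read until the final list comprehension pulls records through, so memory use stays flat regardless of input size. The same stages would work unchanged on a file handle from gzip.open(path, 'rt'), which also yields text lines.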

Development

Development of the tag library is currently a one-man show, but I would heartily welcome contributions. The development repository is at https://github.com/standage/tag. Please feel free to submit comments, questions, and support requests to the GitHub issue tracker, or (even better) open a pull request!
