(Pan)-Genome analysis and metagenomics


Plummeting sequencing costs have made it possible to investigate data sets stemming from a mixture of different genomes as a whole (a so-called metagenome) and to sequence large populations of individuals from a species or clade (a so-called pangenome). For these applications, we have worked on basic algorithms, such as mapping a set of meta- or pangenomic reads to protein databases, and on identifying graph-based data structures that represent similar sequences succinctly while allowing fast computation on the representation.

Our tool Lambda [1] is an alternative to BLAST for sequence classification. Lambda often outperforms the current best tools at reproducing BLAST’s results and is the fastest among state-of-the-art tools at comparable levels of sensitivity. Apart from mapping NGS reads to protein or transcriptomic references, we also published SLIMM [1], a tool for metagenomic (sub)species determination and quantification.

For representing a set of similar strings (i.e., a pangenome), we provide a data type called the Journaled String Tree (JST) [1], which exploits the data parallelism inherent in the set by analyzing shared regions only once. In real-world experiments, we show that algorithms that would otherwise scan each reference sequentially can be sped up by a factor of 115.
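The idea of analyzing shared regions only once can be illustrated with a small Python sketch. It is not the JST implementation: it assumes that every sequence differs from a common reference by substitutions only (so coordinates stay aligned), and the names `apply_journal`, `find_all` and `search_journaled` are hypothetical. The reference is scanned once for a pattern, and for each sequence only the windows that overlap one of its variants are re-checked.

```python
def apply_journal(reference, journal):
    """Materialize one sequence from the reference and its substitutions."""
    seq = list(reference)
    for pos, base in journal.items():
        seq[pos] = base
    return "".join(seq)


def find_all(text, pattern):
    """Naive scan: all start positions at which pattern occurs in text."""
    m = len(pattern)
    return {i for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def search_journaled(reference, journals, pattern):
    """Return one set of hit positions per journaled sequence.

    Assumption: each journal maps a reference position to a substituted
    base, so all sequences share the reference coordinate system.
    """
    m, n = len(pattern), len(reference)
    ref_hits = find_all(reference, pattern)  # shared regions: scanned once
    results = []
    for journal in journals:
        # Window start positions whose window overlaps a variant.
        touched = set()
        for pos in journal:
            touched.update(range(max(0, pos - m + 1), min(pos, n - m) + 1))
        # Hits in untouched windows are inherited from the reference.
        hits = ref_hits - touched
        # Only the touched windows are re-examined for this sequence.
        for start in touched:
            window = "".join(journal.get(i, reference[i])
                             for i in range(start, start + m))
            if window == pattern:
                hits.add(start)
        results.append(hits)
    return results


reference = "ACGTACGTAC"
journals = [{}, {2: "T"}, {5: "G", 9: "T"}]
hits_per_sequence = search_journaled(reference, journals, "ACG")
```

The fewer variants a sequence carries relative to the reference, the smaller the share of windows that has to be re-examined, which is where the saving over scanning every sequence in full comes from in this toy setting.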

Concerning the representation of multiple genomes, we surveyed several graph-based data structures for genome-sized alignment and compared the structures of those graphs in terms of their ability to represent alignment information [1]. We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs (see Figure). Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.

We also investigated the structure of real and computed genome-sized alignments induced by gene gain, loss, duplication, chromosome fusion, fission, and rearrangement. When gene gain and loss occur in addition to other types of rearrangement, breakpoints of rearrangement can exist that are only detectable by comparison of three or more genomes. A very large number of these “hidden” breakpoints can exist among genomes that exhibit no rearrangements in pairwise comparisons. Developing an extension of the multi-chromosomal breakpoint median problem to genomes that have undergone gene gain and loss, we demonstrate that the median distance among three genomes can be used to calculate a lower bound on the number of hidden breakpoints.
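A toy Python sketch can make the phenomenon concrete. It is only an illustration, not the median-based lower bound described above: it assumes unsigned genes, linear genomes in a fixed orientation, and no duplicated genes, and the function names are hypothetical. Pairwise breakpoints are counted on the genes two genomes share, and a brute-force check then tests whether a single gene order is compatible with all genomes at once.

```python
from itertools import permutations


def pairwise_breakpoints(g1, g2):
    """Count adjacencies of g1 that are absent from g2, on shared genes only."""
    shared = set(g1) & set(g2)
    r1 = [g for g in g1 if g in shared]
    r2 = [g for g in g2 if g in shared]
    adjacencies2 = set(zip(r2, r2[1:]))
    return sum(1 for pair in zip(r1, r1[1:]) if pair not in adjacencies2)


def is_subsequence(genome, order):
    """True if the genes of genome appear in order, in the same relative order."""
    remaining = iter(order)
    return all(gene in remaining for gene in genome)


def has_common_order(genomes):
    """Brute force (toy sizes only): is one gene order compatible with all genomes?"""
    genes = sorted(set().union(*genomes))
    return any(all(is_subsequence(g, order) for g in genomes)
               for order in permutations(genes))


# Each pair shares a single gene, so no pairwise comparison shows a breakpoint.
genomes = [["x", "y"], ["y", "z"], ["z", "x"]]
pairwise = [pairwise_breakpoints(a, b)
            for i, a in enumerate(genomes) for b in genomes[i + 1:]]
consistent = has_common_order(genomes)
```

In this example every pairwise count is zero, yet the three genomes imply the cyclic order x before y, y before z, z before x, so no single linear arrangement explains all of them. Under the stated assumptions, the conflict only becomes visible when the three genomes are considered together.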

We applied our approach to measure the abundance of hidden breakpoints in simulated data sets under a wide range of evolutionary scenarios and demonstrated that hidden breakpoint counts depend strongly on the relative rates of inversion and gene gain/loss. Applying current multiple genome aligners to the simulated genomes, we show that all aligners introduce a high degree of error in hidden breakpoint counts and that this error grows with evolutionary distance in the simulation, which suggests that hidden breakpoint error may be pervasive in genome alignments [1].

Reconstructing the history of genome-wide evolutionary events among a set of related organisms is of great biological interest, since it can help to reveal the genomic basis of phenotypes. However, high sequence similarity often does not allow one to distinguish between orthologs and paralogs. We show how to infer the order of genes of (a set of) families for ancestral genomes by considering the order of these genes on sequenced genomes in the evolutionary model of duplications and loss. Our branch-and-cut algorithm solves the two-species small phylogeny problem about 200 times faster than the previously published method (see [1]).

People currently working mainly on this topic:

Hannes Hauswedell: The author of the Lambda tool. Read more about it here.

Temesgen Dadi: The author of SLIMM, currently working on a new YARA version. Read more about SLIMM here.

Marie Hoffmann: Computing species signatures de novo.

Kerstin Neubert: Working on the BMBF Essbar project. Read more about the Essbar project and our role here.

Femke van Geffen: Working on deciphering ancient DNA metagenomic samples in collaboration with the Alfred-Wegener-Institute (AWI).

Relevant publications

[1] Unknown bibtex entry with key []