(Pan)-Genome analysis and metagenomics

Overview

The plummeting costs of sequencing technologies have made it possible to investigate data sets stemming from a mixture of different genomes as a whole (a so called metagenome) and to sequence large populations of individuals from a species or clade (a so called pangenome). For those applications we worked on basic algorithms like mapping a set of meta- or pangenomic reads to protein databases, and on identifying graph-based data structures to represent similar sequences succinctly, while allowing fast computations on the representation.

Our tool Lambda  [1]  is an alternative for BLAST in the context of sequence classification. Lambda often outperforms the current best tools at reproducing BLAST’s results and is the fastest compared with the current state of the art at comparable levels of sensitivity. Apart from mapping NGS reads to protein or transriptomic references we also published SLIMM [2] a tool for metagenomic (sub) species determination and quantification.

For representing a set of similar strings (i.e. a pangenome), we provide a data type called Journaled String Tree (JST) [3]  that exploits data parallelism inherent in the set by analyzing shared regions only once. In real-world experiments, we can show that algorithms, that otherwise would scan each reference sequentially, can be sped up by a factor of 115.

Concerning the representation of multiple genomes we surveyed several graph-based data structures for genome-sized alignment and compared the structures of those graphs in terms of their abilities to represent alignment information [4] . We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs (see Figure). Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.

We also investigated the structure of real and computed genome-sized alignments induced by gene gain, loss, duplication, chromosome fusion, fission, and rearrangement. When gene gain and loss occurs in addition to other types of rearrangement, breakpoints of rearrangement can exist that are only detectable by comparison of three or more genomes. A very large number of these “hidden” breakpoints can exist among genomes that exhibit no rearrangements in pairwise comparisons. Developing an extension of the multi-chromosomal breakpoint median problem to genomes that have undergone gene gain and loss, we demonstrate that the median distance among three genomes can be used to calculate a lower bound on the number of hidden breakpoints.

We applied our approach to measure the abundance of hidden breakpoints in simulated data sets under a wide range of evolutionary scenarios and could demonstrate that hidden breakpoint counts depend strongly on relative rates of inversion and gene gain/loss. Applying current multiple genome aligners to the simulated genomes we show that all aligners introduce a high degree of error in hidden breakpoint counts, and that this error grows with evolutionary distance in the simulation, which suggests that hidden breakpoint error may be pervasive in genome alignments [5].

The reconstruction of the history of evolutionary genome-wide events among a set of related organisms is of great biological interest since it can help to reveal the genomic basis of phenotypes. However, a high sequence similarity often does not allow one to distinguish between orthologs and paralogs. We show how to infer the order of genes of (a set of) families for ancestral genomes by considering the order of these genes on sequenced genomes in the evolutionary model of duplications and loss. Our branch- and-cut algorithm solves the two species small phylogeny problem about 200 times faster than the previously published method (see [6]).

People currently working mainly on this topic:

Hannes Hauswedell: The author the Lambda tool. Read here more about it.

Temesgen Dadi: The author of SLIMM and working on a new YARA version. Read here more about SLIMM.

Marie Hoffmann: Computing species signatures de novo.

Kerstin Neubert: Working on the BMBF Essbar project. Read here more about the Essbar project and our role.

Femke van Geffen: Working on deciphering ancient DNA metagenomics samples in collaboration with the Alfred-Wegener-Institute (AWI).

Relevant publications

[1] H. Hauswedell, J. Singer, and K. Reinert, “Lambda: the local aligner for massive biological data,” Bioinformatics (Oxford, England), vol. 30, iss. 17, p. i349–i355, 2014.
[Bibtex]
@article{fu_mi_publications1453,
publisher = {Oxford University Press},
volume = {30},
author = {H. Hauswedell and J. Singer and K. Reinert},
journal = {Bioinformatics (Oxford, England)},
title = {Lambda: the local aligner for massive biological data},
number = {17},
month = {September},
pages = {i349--i355},
year = {2014},
abstract = {MOTIVATION:Next-generation sequencing technologies produce unprecedented amounts of data, leading to completely new research fields. One of these is metagenomics, the study of large-size DNA samples containing a multitude of diverse organisms. A key problem in metagenomics is to functionally and taxonomically classify the sequenced DNA, to which end the well-known BLAST program is usually used. But BLAST has dramatic resource requirements at metagenomic scales of data, imposing a high financial or technical burden on the researcher. Multiple attempts have been made to overcome these limitations and present a viable alternative to BLAST.RESULTS:In this work we present Lambda, our own alternative for BLAST in the context of sequence classification. In our tests, Lambda often outperforms the best tools at reproducing BLAST's results and is the fastest compared with the current state of the art at comparable levels of sensitivity.
AVAILABILITY AND IMPLEMENTATION:Lambda was implemented in the SeqAn open-source C++ library for sequence analysis and is publicly available for download at http://www.seqan.de/projects/lambda.
CONTACT:hannes.hauswedell@fu-berlin.de
SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.},
url = {http://publications.imp.fu-berlin.de/1453/}
}
[2] T. H. Dadi, B. Y. Renard, L. H. Wieler, T. Semmler, and K. Reinert, “SLIMM: species level identification of microorganisms from metagenomes,” PeerJ, vol. 5, p. e3138, 2017.
[Bibtex]
@article{fu_mi_publications2119,
month = {March},
author = {Temesgen Hailemariam Dadi and Bernhard Y. Renard and Lothar H. Wieler and Torsten Semmler and Knut Reinert},
pages = {e3138},
journal = {PeerJ},
year = {2017},
title = {SLIMM: species level identification of microorganisms from metagenomes},
volume = {5},
abstract = {Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than when using 16S-rDNA sequences. However, shared regions of DNA among reference genomes and taxonomic units pose a significant challenge in assigning reads correctly to their true origins. The existing microbial community profiling tools commonly deal with this problem by either preparing signature-based unique references or assigning an ambiguous read to its least common ancestor in a taxonomic tree. The former method is limited to making use of the reads which can be mapped to the curated regions, while the latter suffer from the lack of uniquely mapped reads at lower (more specific) taxonomic ranks. Moreover, even if the tools exhibited good performance in calling the organisms present in a sample, there is still room for improvement in determining the correct relative abundance of the organisms. We present a new method Species Level Identification of Microorganisms from Metagenomes (SLIMM) which addresses the above issues by using coverage information of reference genomes to remove unlikely genomes from the analysis and subsequently gain more uniquely mapped reads to assign at lower ranks of a taxonomic tree. SLIMM is based on a few, seemingly easy steps which when combined create a tool that outperforms state-of-the-art tools in run-time and memory usage while being on par or better in computing quantitative and qualitative information at species-level.},
url = {http://publications.imp.fu-berlin.de/2119/}
}
[3] R. Rahn, D. Weese, and K. Reinert, “Journaled string tree–a scalable data structure for analyzing thousands of similar genomes on your laptop,” Bioinformatics, 2014.
[Bibtex]
@article{fu_mi_publications1448,
month = {July},
author = {R. Rahn and D. Weese and K. Reinert},
journal = {Bioinformatics},
year = {2014},
title = {Journaled string tree--a scalable data structure for analyzing thousands of similar genomes on your laptop},
abstract = {Motivation: Next-generation sequencing (NGS) has revolutionized biomedical research in the past decade and led to a continuous stream of developments in bioinformatics, addressing the need for fast and space-efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences that stem from closely related species or are indeed individuals of the same species. Hence, the analyzed sequences are similar. For analyses where local changes in the examined sequence induce only local changes in the results, it is obviously desirable to examine identical or similar regions not repeatedly.
Results: In this work, we provide a datatype that exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments, we show that algorithms that otherwise would scan each reference sequentially can be speeded up by a factor of 115.
Availability: The data structure and associated tools are publicly available at http://www.seqan.de/projects/jst and are part of SeqAn, the C++ template library for sequence analysis.
Contact: rene.rahn@fu-berlin.de},
url = {http://publications.imp.fu-berlin.de/1448/}
}
[4] B. Kehr, K. Trappe, M. Holtgrewe, and K. Reinert, “Genome alignment with graph data structures: a comparison,” BMC Bioinformatics, vol. 15, 2014.
[Bibtex]
@article{fu_mi_publications1437,
author = {B. Kehr and K. Trappe and M. Holtgrewe and K. Reinert},
month = {April},
journal = {BMC Bioinformatics},
year = {2014},
publisher = {BioMed Central},
title = {Genome alignment with graph data structures: a comparison},
volume = {15},
url = {http://publications.imp.fu-berlin.de/1437/},
abstract = {Background
Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference.
Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment.
Results
We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures.
Conclusion
We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools. }
}
[5] B. Kehr, K. Reinert, and A. E. Darling, “Hidden Breakpoints in Genome Alignments,” in Lecture Notes in Computer Science: Algorithms in Bioinformatics, B. Raphael and J. Tang, Eds., Berlin: Springer-Verlag, 2012, vol. 7534, p. 391–403.
[Bibtex]
@incollection{fu_mi_publications1168,
pages = {391--403},
year = {2012},
title = {Hidden Breakpoints in Genome Alignments },
author = {B. Kehr and K. Reinert and A. E. Darling},
address = {Berlin},
booktitle = {Lecture Notes in Computer Science: Algorithms in Bioinformatics},
publisher = {Springer-Verlag},
editor = {B. Raphael and J. Tang},
volume = {7534},
url = {http://publications.imp.fu-berlin.de/1168/},
abstract = {During the course of evolution, an organism?s genome can undergo changes that affect the large-scale structure of the genome. These changes include gene gain, loss, duplication, chromosome fusion, fission, and rearrangement. When gene gain and loss occurs in addition to other types of rearrangement, breakpoints of rearrangement can exist that are only detectable by comparison of three or more genomes. An arbitrarily large number of these ?hidden? breakpoints can exist among genomes that exhibit no rearrangements in pairwise comparisons.
We present an extension of the multichromosomal breakpoint median problem to genomes that have undergone gene gain and loss. We then demonstrate that the median distance among three genomes can be used to calculate a lower bound on the number of hidden breakpoints present. We provide an implementation of this calculation including the median distance, along with some practical improvements on the time complexity of the underlying algorithm.
We apply our approach to measure the abundance of hidden breakpoints in simulated data sets under a wide range of evolutionary scenarios. We demonstrate that in simulations the hidden breakpoint counts depend strongly on relative rates of inversion and gene gain/loss. Finally we apply current multiple genome aligners to the simulated genomes, and show that all aligners introduce a high degree of error in hidden breakpoint counts, and that this error grows with evolutionary distance in the simulation. Our results suggest that hidden breakpoint error may be pervasive in genome alignments. }
}
[6] S. Andreotti, K. Reinert, and S. Canzar, “The duplication-loss small phylogeny problem: from cherries to trees.,” Journal of Computational Biology, vol. 20, iss. 9, p. 643–59, 2013.
[Bibtex]
@article{fu_mi_publications1454,
author = {S. Andreotti and K. Reinert and S. Canzar},
journal = {Journal of Computational Biology},
volume = {20},
pages = {643--59},
month = {September},
year = {2013},
title = {The duplication-loss small phylogeny problem: from cherries to trees.},
number = {9},
url = {http://publications.imp.fu-berlin.de/1454/},
abstract = {Abstract The reconstruction of the history of evolutionary genome-wide events among a set of related organisms is of great biological interest since it can help to reveal the genomic basis of phenotypes. The sequencing of whole genomes faciliates the study of gene families that vary in size through duplication and loss events, like transfer RNA. However, a high sequence similarity often does not allow one to distinguish between orthologs and paralogs. Previous methods have addressed this difficulty by taking into account flanking regions of members of a family independently. We go one step further by inferring the order of genes of (a set of) families for ancestral genomes by considering the order of these genes on sequenced genomes. We present a novel branch-and-cut algorithm to solve the two species small phylogeny problem in the evolutionary model of duplications and losses. On average, our implementation, DupLoCut, improves the running time of a recently proposed method in the experiments on six Vibrionaceae lineages by a factor of {\~A}?\<88\>{\^A}?200. Besides the mere improvement in running time, the efficiency of our approach allows us to extend our model from cherries of a species tree, that is, subtrees with two leaves, to the median of three species setting. Being able to determine the median of three species is of key importance to one of the most common approaches to ancestral reconstruction, and our experiments show that its repeated computation considerably reduces the number of duplications and losses along the tree both on simulated instances comprising 128 leaves and a set of Bacillus genomes. Furthermore, in our simulations we show that a reduction in cost goes hand in hand with an improvement of the predicted ancestral genomes. Finally, we prove that the small phylogeny problem in the duplication-loss model is NP-complete already for two species.}
}