Read mapping and variant detection

Overview

Read mapping, a seemingly easy problem, was initially solved by heuristically accelerating standard q-gram-based methods due to the demands on speed for large NGS data sets. We addressed the problem of optimally solving read mapping in several publications. Our last two implementations, Masai  [1] and Yara, are 2–5 times faster and more accurate than Bowtie2 and BWA. The novelties of our read mappers are the use of filtration with approximate seeds and a method for multiple backtracking (Masai) as well a mapping by strata using an adaptive filtration scheme (Yara). In the future, we will address the problem of mapping reads to pan-genomes and work on even faster filtration and seeding schemes using the bidirectional FM-index with our recently published EPR dictionaries [2].

The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs. We were successful in implementing two algorithms in tools named Gustaf [3]  and Basil/Anise [4] . Gustaf (Generic mUlti-SpliT Alignment Finder) is a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of >30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. Recently we were able to improve significantly on the speed of Gustaf with our new tool Vaquita [5].

For large insertions, we developed approaches for detecting insertion breakpoints and targeted assembly of the insertions from HTS paired data (see Figure). After detecting breakpoints with the tool Basil, we conduct a targeted, local assembly with the tool Anise. One major point of Anise is that we employ a repeat resolution step on near identity repeats that are hard for assemblers. This results in far better reconstructions than obtained by the compared methods like MindTheGap.

In addition to our work in SV detection, we also considered the problem of correctly identifying different isoforms of expressed genes obtained from mixture data from RNA-seq experiments. The Cidane [6] framework for genome-based transcript reconstruction and quantification from RNA-seq reads assembles transcripts with significantly higher sensitivity and precision than existing tools, while competing in speed with the fastest methods. In addition to reconstructing transcripts ab initio, the algorithm also allows to make use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms.

The tool Imseq [7] implements a method to derive clonotype repertoires from next generation sequencing data with sophisticated routines for handling errors stemming from PCR and sequencing artefacts. The application can handle different kinds of input data originating from single- or paired-end sequencing in different configurations and is generic regarding the species and gene of interest.

People currently working mainly on this topic

Jongkyu Kim: He is the author of the Vaquita tool.

David Heller: PacBio based analysis.

Chenxu Pan: Fast k-mer indices and applications for read mapping and variant detection

Christopher Pockrandt: He is the author of EPR dictionaries and its application. Read here more about it.

Relevant publications

[1] E. Siragusa, D. Weese, and K. Reinert, “Fast and accurate read mapping with approximate seeds and multiple backtracking,” Oxford Journals, vol. 41, iss. 7, p. e78, 2013.
[Bibtex]
@article{fu_mi_publications1161,
title = {Fast and accurate read mapping with approximate seeds and multiple backtracking},
number = {7},
month = {January},
pages = {e78},
year = {2013},
publisher = {Oxford University Press},
volume = {41},
author = {E. Siragusa and D. Weese and K. Reinert},
journal = {Oxford Journals},
abstract = {We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2?4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined together, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai.},
url = {http://publications.imp.fu-berlin.de/1161/}
}
[2] C. Pockrandt, M. Ehrhardt, and K. Reinert, “EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices,” in Research in Computational Molecular Biology. RECOMB 2017, S. Sahinalp, Ed., Springer, Cham, 2017, vol. 10229, p. 190–206.
[Bibtex]
@incollection{fu_mi_publications2118,
title = {EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices},
year = {2017},
month = {April},
pages = {190--206},
volume = {10229},
editor = {S. Sahinalp},
publisher = {Springer, Cham},
booktitle = {Research in Computational Molecular Biology. RECOMB 2017},
author = {Christopher Pockrandt and Marcel Ehrhardt and Knut Reinert},
series = {Lecture Notes in Computer Science (LNCS)},
abstract = {The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows to search a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows to search for a pattern by extending an infix of the pattern arbitrarily to the left or right. If {\ensuremath{\sigma}} is the size of the alphabet then the method of Lam et al. can conduct one step in time O({\ensuremath{\sigma}}) while needing space O({\ensuremath{\sigma}}{$\cdot$}n) using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to O(log{\ensuremath{\sigma}}) while using O(log{\ensuremath{\sigma}}{$\cdot$}n) bits of space for both, the FM and 2FM index. This is achieved by the use of binary wavelet trees.
In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in O(1) time per step while using O(log{\ensuremath{\sigma}}{$\cdot$}n)+o(log{\ensuremath{\sigma}}{$\cdot$}{\ensuremath{\sigma}}{$\cdot$}n)
bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).
We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition we compared our implementation with other freely available implementations of bidirectional indices and show that we are between {$\approx$}2.2?4.2 times faster. This will have a large impact for many bioinformatics applications that rely on practical implementations of (2)FM indices e.g. for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.},
url = {http://publications.imp.fu-berlin.de/2118/}
}
[3] K. Trappe, A. -K. Emde, H. -C. Ehrlich, and K. Reinert, “Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone,” Bioinformatics, vol. 30, iss. 24, p. 3484–3490, 2014.
[Bibtex]
@article{fu_mi_publications1455,
publisher = {Oxford University Press},
volume = {30},
author = {K. Trappe and A.-K. Emde and H.-C. Ehrlich and K. Reinert},
journal = {Bioinformatics},
title = {Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone},
number = {24},
pages = {3484--3490},
month = {July},
year = {2014},
abstract = {MOTIVATION:
The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs.
RESULTS:
We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of {\ensuremath{<}}span class='mathrm'{\ensuremath{>}}ge{\ensuremath{<}}/span{\ensuremath{>}}30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs \>500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify size and location of dispersed duplications and translocations, which otherwise might be wrongly classified, for example, as large deletions. Availability and implementation: Project information, paper benchmark and source code are available via http://www.seqan.de/projects/gustaf/.
CONTACT:kathrin.trappe@fu-berlin.de.},
url = {http://publications.imp.fu-berlin.de/1455/}
}
[4] M. Holtgrewe, L. Kuchenbecker, and K. Reinert, “Methods for the Detection and Assembly of Novel Sequence in High-Throughput Sequencing Data,” Bioinformatics, vol. 31, iss. 12, p. 1904–1912, 2015.
[Bibtex]
@article{fu_mi_publications1506,
journal = {Bioinformatics},
year = {2015},
author = {M. Holtgrewe and L. Kuchenbecker and K. Reinert},
pages = {1904--1912},
volume = {31},
number = {12},
title = {Methods for the Detection and Assembly of Novel Sequence in High-Throughput Sequencing Data},
url = {http://publications.imp.fu-berlin.de/1506/},
abstract = {Motivation:
Large insertions of novel sequence are an important type of structural variants. Previous studies used traditional de novo assemblers for assembling non-mapping high-throughput sequencing (HTS) or capillary reads and then tried to anchor them in the reference using paired read information.
Results:
We present approaches for detecting insertion breakpoints and targeted assembly of large insertions from HTS paired data: BASIL and ANISE. On near identity repeats that are hard for assemblers, ANISE employs a repeat resolution step. This results in far better reconstructions than obtained by the compared methods. On simulated data, we found our insert assembler to be competitive with the de novo assemblers ABYSS and SGA while yielding already anchored inserted sequence as opposed to unanchored contigs as from ABYSS/SGA. On real-world data, we detected novel sequence in a human individual and thoroughly validated the assembled sequence. ANISE was found to be superior to the competing tool MindTheGap on both simulated and real-world data.
Availability and implementation: ANISE and BASIL are available for download at http://www.seqan.de/projects/herbarium under a permissive open source license.
Contact: manuel.holtgrewe@fu-berlin.de or knut.reinert@fu-berlin.de}
}
[5] J. Kim and K. Reinert, “Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence,” LIPIcs – Leibniz International Proceedings in Informatics, vol. 88, p. 14, 2017.
[Bibtex]
@article{Kim_2017cx,
author = {Kim, Jongkyu and Reinert, Knut},
title = {{Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence}},
journal = {LIPIcs - Leibniz International Proceedings in Informatics},
year = {2017},
volume = {88},
pages = {14}
}
[6] S. Canzar, S. Andreotti, D. Weese, K. Reinert, and G. W. Klau, “CIDANE: comprehensive isoform discovery and abundance estimation,” Genome Biology, vol. 17, iss. 1, 2016.
[Bibtex]
@article{fu_mi_publications1830,
month = {January},
year = {2016},
title = {CIDANE: comprehensive isoform discovery and abundance estimation},
number = {1},
author = {S. Canzar and S. Andreotti and D. Weese and K. Reinert and G. W. Klau},
journal = {Genome Biology},
publisher = {BioMed Central, Springer Science+Business Media},
volume = {17},
abstract = {We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts efficiently with significantly higher sensitivity and precision than existing tools. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. CIDANE supports the integrated analysis of RNA-seq and additional gene-boundary data and recovers splice junctions that are invisible to other methods. CIDANE is available at http://?ccb.?jhu.?edu/?software/?cidane/?.},
url = {http://publications.imp.fu-berlin.de/1830/}
}
[7] L. Kuchenbecker, M. Nienen, J. Hecht, A. U. Neumann, N. Babel, K. Reinert, and P. N. Robinson, “IMSEQ – a fast and error aware approach to immunogenetic sequence analysis,” Bioinformatics, vol. 31, iss. 18, p. 2963–2971, 2015.
[Bibtex]
@article{fu_mi_publications1551,
number = {18},
title = {IMSEQ - a fast and error aware approach to immunogenetic sequence analysis},
year = {2015},
pages = {2963--2971},
volume = {31},
publisher = {Oxford University Press (online advanced access)},
journal = {Bioinformatics},
author = {L. Kuchenbecker and M. Nienen and J. Hecht and A. U. Neumann and N. Babel and K. Reinert and P. N. Robinson},
url = {http://publications.imp.fu-berlin.de/1551/},
abstract = {Motivation: Recombined T and B cell receptor repertoires are increasingly being studied using next generation sequencing (NGS) in order to interrogate the repertoire composition as well as changes in the distribution of receptor clones under different physiological and disease states. This type of analysis requires efficient and unambiguous clonotype assignment to a large number of NGS read sequences, including the identification of the incorporated V and J gene segments and the CDR3 sequence. Current tools have deficits with respect to performance, accuracy and documentation of their underlying algorithms and usage.
Results: We present IMSEQ, a method to derive clonotype repertoires from next generation sequencing data with sophisticated routines for handling errors stemming from PCR and sequencing artefacts. The application can handle different kinds of input data originating from single- or paired-end sequencing in different configurations and is generic regarding the species and gene of interest. We have carefully evaluated our method with simulated and real world data and show that IMSEQ is superior to other tools with respect to its clonotyping as well as standalone error correction and runtime performance.
Availability: IMSEQ was implemented in C++ using the SeqAn library for efficient sequence analysis. It is freely available under the GPLv2 open source license and can be downloaded at www.imtools.org. }
}