Read mapping and variant detection


Read mapping, a seemingly easy problem, was initially solved by heuristically accelerating standard q-gram-based methods because large NGS data sets demand high speed. We addressed the problem of solving read mapping optimally in several publications. Our two most recent implementations, Masai [1] and Yara, are 2–5 times faster and more accurate than Bowtie2 and BWA. The novelties of our read mappers are filtration with approximate seeds and a method for multiple backtracking (Masai), as well as mapping by strata using an adaptive filtration scheme (Yara). In the future, we will address the problem of mapping reads to pan-genomes and work on even faster filtration and seeding schemes using the bidirectional FM-index with our recently published EPR dictionaries [1].
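To illustrate the general idea of filtration, here is a minimal Python sketch of the classic pigeonhole scheme with exact seeds. It is not the Masai or Yara algorithm: those tools use approximate seeds, adaptive filtration and index structures, whereas this sketch uses exact seeds, mismatches only (Hamming distance) and a plain string search in place of an index. The function names map_read, find_all and hamming are hypothetical and chosen for this example only.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_all(text, pattern):
    """Yield every start position of pattern in text (overlaps included)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def map_read(reference, read, max_errors):
    """Return all positions where read matches reference with at most
    max_errors mismatches, using pigeonhole filtration.

    The read is cut into max_errors + 1 non-overlapping seeds. If the read
    occurs with at most max_errors mismatches, at least one seed must occur
    without any mismatch, so exact seed hits yield all candidate positions.
    """
    n = len(read)
    parts = max_errors + 1
    if n < parts:
        raise ValueError("read is too short for the requested error bound")

    # Seed boundaries: seed i covers read[bounds[i]:bounds[i + 1]].
    bounds = [i * n // parts for i in range(parts + 1)]

    checked = set()  # candidate positions that were already verified
    hits = []
    for i in range(parts):
        start, end = bounds[i], bounds[i + 1]
        for pos in find_all(reference, read[start:end]):
            candidate = pos - start  # implied start of the whole read
            if candidate < 0 or candidate + n > len(reference):
                continue
            if candidate in checked:
                continue
            checked.add(candidate)
            # Verification step: compare the full read at the candidate.
            if hamming(reference[candidate:candidate + n], read) <= max_errors:
                hits.append(candidate)
    return sorted(hits)
```

In a real mapper the string search would be replaced by an index lookup (for example in an FM-index), and verification would typically allow insertions and deletions as well.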

The landscape of structural variation (SV), including complex duplication and translocation patterns, is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation, and struggle to correctly classify the type and exact size of SVs. We implemented two algorithms in the tools Gustaf [1] and Basil/Anise [1]. Gustaf (Generic mUlti-SpliT Alignment Finder) is a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of >30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. Recently, we were able to improve significantly on the speed of Gustaf with our new tool Vaquita [1].
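The following Python sketch shows, in strongly simplified form, how two adjacent split alignments of a single read can hint at an SV type. It is an illustration under stated assumptions, not Gustaf's actual classification logic: the Segment type, the classify_split function and the rules themselves are hypothetical, coordinates are assumed to be 0-based on the forward strand, and the default min_size of 30 merely mirrors the >30 bp threshold mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Local alignment of one part of a read (hypothetical helper type)."""
    chrom: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # exclusive
    strand: str     # '+' or '-'


def classify_split(left, right, min_size=30):
    """Classify the junction between two adjacent read parts.

    left and right are the alignments of consecutive parts of one read,
    given in read order. Returns a (label, size) pair; size is None when
    this simplified model does not define one.
    """
    if left.chrom != right.chrom:
        return ("translocation", None)
    if left.strand != right.strand:
        return ("inversion", None)

    # Same chromosome and strand: look at the jump on the reference.
    gap = right.ref_start - left.ref_end
    if gap > min_size:
        # The read skips reference sequence between the two parts.
        return ("deletion", gap)
    if -gap > min_size:
        # The second part maps upstream of where the first part ended,
        # as seen for a read crossing a duplication junction.
        return ("duplication", -gap)
    return ("none", 0)
```

For example, a read whose first part aligns to chr1:100-150 and whose second part aligns to chr1:250-300 on the same strand would be reported as a 100 bp deletion candidate. A real tool additionally has to handle overlapping alignments, reverse-strand coordinates and more than two read parts.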

For large insertions, we developed approaches for detecting insertion breakpoints and for targeted assembly of the insertions from paired HTS data (see Figure). After detecting breakpoints with the tool Basil, we conduct a targeted, local assembly with the tool Anise. A key feature of Anise is a repeat resolution step for near-identical repeats, which are hard for assemblers. This results in far better reconstructions than those obtained by the compared methods, such as MindTheGap.

In addition to our work on SV detection, we also considered the problem of correctly identifying different isoforms of expressed genes from the mixture data produced by RNA-seq experiments. The Cidane [1] framework for genome-based transcript reconstruction and quantification from RNA-seq reads assembles transcripts with significantly higher sensitivity and precision than existing tools, while competing in speed with the fastest methods. In addition to reconstructing transcripts ab initio, the algorithm can also make use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms.

The tool Imseq [1] implements a method to derive clonotype repertoires from next-generation sequencing data, with sophisticated routines for handling errors stemming from PCR and sequencing artefacts. The application can handle different kinds of input data originating from single- or paired-end sequencing in different configurations and is generic regarding the species and gene of interest.

People currently working mainly on this topic

Jongkyu Kim: He is the author of the Vaquita tool.

David Heller: PacBio-based analysis.

Chenxu Pan: Fast k-mer indices and applications for read mapping and variant detection.

Christopher Pockrandt: He is the author of EPR dictionaries and their applications. Read more about it here.

Relevant publications

[1] Unknown bibtex entry with key []