Data structures and parallelisation

Within the last ten years, modern sequencing technologies have brought a super-exponential growth of sequencing capacity. This has made it cheap to sequence the genomic content of pangenomes, metagenomes, or many individuals of the same species (e.g. in the 100,000 Genomes Project) that differ only slightly from each other. Yet it is precisely these small individual differences (e.g. SNPs or small structural polymorphisms) that are of interest for elucidating the causes of diseases or reconstructing evolutionary events.

These datasets exhibit interesting characteristics. They are large, yet large fractions of them are highly redundant (e.g. the 100,000 Genomes Project data, or collections of different strains of bacteria) and hence amenable to compression techniques. On the other hand, compression usually makes it costly to implement the main operation on the data, namely finding approximate matches of (many) queries, where approximate means within a bounded edit distance.
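To make "approximate" concrete: a query occurs approximately in a text if some substring of the text lies within a chosen edit-distance threshold of the query. The following minimal C++ sketch (the function name and the threshold are ours, purely for illustration, and not part of any of the tools discussed here) computes plain edit distance with the standard dynamic programming recurrence:

#include <algorithm>
#include <string>
#include <vector>

// Classic Wagner-Fischer dynamic programming: the edit distance between
// a and b is the minimum number of substitutions, insertions and
// deletions needed to turn one string into the other.
int editDistance(std::string const & a, std::string const & b)
{
    std::vector<int> row(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j)
        row[j] = j;                        // distance from the empty prefix of a

    for (size_t i = 1; i <= a.size(); ++i)
    {
        int diag = row[0];                 // holds D[i-1][j-1]
        row[0] = i;
        for (size_t j = 1; j <= b.size(); ++j)
        {
            int tmp = row[j];              // D[i-1][j], needed as next diagonal
            row[j] = std::min({row[j] + 1,                      // deletion
                               row[j - 1] + 1,                  // insertion
                               diag + (a[i - 1] != b[j - 1])}); // (mis)match
            diag = tmp;
        }
    }
    return row[b.size()];
}

int main()
{
    // A query matches approximately if the distance stays below a threshold.
    return editDistance("ACGT", "AGGT") <= 1 ? 0 : 1; // distance is 1
}

The indexed approaches discussed below aim to answer many such queries without scanning the whole text.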

In this research area, we therefore focus on developing parallelised and vectorised algorithms for the analysis of highly redundant (e.g. genomic) data.

The DREAM framework

In the work of Dadi et al. (ECCB 2018, to be published) we introduced a distributed read mapper in the context of a larger framework that we call the DREAM (Dynamic seaRchablE pArallel coMpressed) framework.

While a lot of research has focused on indexing genomic datasets, the resulting solutions generally lack the ability to easily change the underlying datasets. That means it is costly (a) to change small parts of a sequence, or (b) to add or delete complete sequences, while maintaining the ability to support fast approximate string searches. In metagenomics, for example, this problem is becoming increasingly common. Many metagenomics search tools and read mappers use Burrows-Wheeler-Transform (BWT) based FM-indices (also often referred to as compressed suffix arrays (CSA)), which have to index about 50 to 200 gigabases. Because the underlying databases are updated constantly, changes occur on a daily or weekly basis, and each update requires a newly constructed index. Recomputing a single index of this size is quite costly in terms of space and time, even if BWT-merging approaches are used; for example, it takes about one day to compute the index for the dataset used in our experiments. On the other hand, fast approximate search in such an index is crucial: it is used either directly, to find all approximate occurrences of a (short) string, or to anchor parts of it in seed-and-extend approaches. In the proposed DREAM framework we will offer various solutions for the above areas, depending on key parameters of the input set (size of the input, amount of redundancy, and the importance of rebuilding time vs. search time).
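To make the construction cost tangible, here is a toy, illustrative sketch (plain C++, not SeqAn or DREAM code) of the Burrows-Wheeler transform computed naively by sorting all rotations of the text. Production FM-index construction replaces the rotation sort with suffix-array algorithms, and it is exactly this construction step that dominates the rebuild cost discussed above:

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Toy Burrows-Wheeler transform: sort all rotations of text + sentinel
// and emit the last column. Real FM-index construction never materialises
// the rotations, but the resulting transform is the same.
std::string bwt(std::string text)
{
    text += '$';                           // unique sentinel, smallest character
    size_t const n = text.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);  // rotation start offsets 0..n-1
    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t k = 0; k < n; ++k)     // compare rotations lexicographically
        {
            char ca = text[(a + k) % n], cb = text[(b + k) % n];
            if (ca != cb)
                return ca < cb;
        }
        return false;
    });
    std::string last;
    for (size_t r : rot)
        last += text[(r + n - 1) % n];     // character preceding each rotation
    return last;
}

int main()
{
    return bwt("banana") == "annb$aa" ? 0 : 1; // standard textbook example
}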

With regard to parallelisation, we are working on adding offloading and vectorisation layers to SeqAn. SeqAn's new pairwise alignment module is up to 1600 times faster than our previous serial code on a single CPU [1].
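The key idea behind this speed-up, as described in [1], is an inter-sequence layout: instead of vectorising within one dynamic programming matrix, many independent alignments occupy the lanes of each SIMD register and advance in lock-step. The following self-contained sketch (plain C++ with a hypothetical batchEditDistance helper; not the actual SeqAn module) illustrates the layout. The innermost per-lane loop performs identical operations in every lane, which is what allows a compiler or explicit intrinsics to map it onto SIMD instructions:

#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <vector>

constexpr int LANES = 16; // e.g. 16 x int16_t in one 256-bit register

// Compute LANES edit distances in lock-step. Every lane executes the same
// DP update per step (inter-sequence vectorization). For simplicity we
// assume all sequences in a batch share the same lengths; real
// implementations instead mask out padded lanes.
std::array<int16_t, LANES>
batchEditDistance(std::array<std::string, LANES> const & as,
                  std::array<std::string, LANES> const & bs,
                  size_t lenA, size_t lenB)
{
    std::vector<std::array<int16_t, LANES>> row(lenB + 1);
    for (size_t j = 0; j <= lenB; ++j)
        for (int l = 0; l < LANES; ++l)
            row[j][l] = static_cast<int16_t>(j);

    for (size_t i = 1; i <= lenA; ++i)
    {
        std::array<int16_t, LANES> diag = row[0]; // D[i-1][j-1] per lane
        for (int l = 0; l < LANES; ++l)
            row[0][l] = static_cast<int16_t>(i);
        for (size_t j = 1; j <= lenB; ++j)
        {
            std::array<int16_t, LANES> tmp = row[j]; // D[i-1][j] per lane
            for (int l = 0; l < LANES; ++l)          // one "SIMD" update, all lanes
            {
                int del  = row[j][l] + 1;
                int ins  = row[j - 1][l] + 1;
                int sub  = diag[l] + (as[l][i - 1] != bs[l][j - 1]);
                row[j][l] = static_cast<int16_t>(std::min({del, ins, sub}));
            }
            diag = tmp;
        }
    }
    return row[lenB]; // one distance per lane
}

In the actual module, this lane-parallel kernel is combined with thread-level parallelism: many such batches are distributed over threads, and single large alignments are parallelised along the minor diagonal with a work-stealing wavefront (see the abstract of [1] below).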

People currently working mainly on this topic:

Enrico Seiler and Temesgen Dadi: Both worked on the published distributed read mapper DREAM-Yara. Enrico Seiler will continue to work on the partitioning step of the DREAM framework and on the approximate search distributor.

Svenja Mehringer: She will work on compressed indices for the DREAM framework.

René Rahn: He is working on compressed indices and is the main author of the vectorised and parallelised alignment module.

Marcel Ehrhardt: He works on vectorised and parallelised alignment as well as on SeqAn's offloading layer.

Relevant publications:

[1] R. Rahn, S. Budach, P. Costanza, M. Ehrhardt, J. Hancox, and K. Reinert, “Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading,” Bioinformatics, 2018.
[Bibtex]
@article{fu_mi_publications2253,
year = {2018},
journal = {Bioinformatics},
author = {Ren{\'e} Rahn and Stefan Budach and Pascal Costanza and Marcel Ehrhardt and Jonny Hancox and Knut Reinert},
month = {May},
publisher = {Oxford Academic (OUP)},
title = {Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading},
url = {http://publications.imp.fu-berlin.de/2253/},
abstract = {Motivation
Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal.
Results
We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel Xeon (Skylake) and Intel Xeon Phi (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi and 1400 times faster on the Xeon than executing them with our previous sequential alignment module.
Availability
The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4 under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms.}
}