Within the last ten years, modern sequencing technologies have brought a super- exponential growth of sequencing capacities. This has enabled the cheap sequencing of the genomic content of pangenomes, metagenomes, or many individuals of the same species (e.g. the 100,000 genome project) that differ only slightly from each other. Yet, the small individual differences are of interest (i.e. SNPs, or small structural polymorphisms) to elucidate the cause of diseases or reconstruct evolutionary events.
These datasets expose interesting characteristics. They are large, while some large fractions are highly redundant (e.g. 100,000 genome project, or storing different strains of bacteria) and hence amenable to compression techniques. On the other hand, compression usually makes it costly to implement the main operations on the data, namely finding approximate matches of (many) queries (approximate in the sense of edit distance).
In this research scheme we hence focus on the development of parallelised and vectorised algorithms for the analysis of highly redundant (i.e. genomic) data.
In the work of Dadi (ECCB 2018 to be published) we introduced a distributed read mapper in the context of a larger framework, that we call the DREAM (Dynamic seaRchablE pArallel coMpressed) framework.
While a lot of research has focused on indexing genomic datasets, the resulting solutions lack, in general, the ability to easily change the underlying datasets. That means it is costly (a) to change small parts of a sequence, or (b) to add or delete complete sequences while maintaining the ability to support fast approximate string searches. For example, in metagenomics, this problem becomes more and more recurrent. Many metagenomics search tools and read mappers
use Burrows-Wheeler-Transform (BWT) based FM-indices (also often referred to as compressed suffix arrays (CSA)) which have to index about 50 to 200 gigabases. Due to constant database updates changes occur on a daily or weekly basis and thus require a newly constructed index. Recomputing a single index of this size is quite costly in terms of space and time, even if approaches of merging BWTs are used. For example, it takes about one day to compute the index for the dataset used in our experiments. On the other hand, the ability for fast approximate searches in such an index is crucial. It is used either directly to find all approximate occurrences of a (short) string or parts of it in seed-and-extend approaches. In the proposed DREAM framework we will offer various solutions for the above areas depending on some key parameters of the input set (size of the input, amount of redundancy, the importance of rebuilding time vs. search time).
With regard to parallelisation we work on adding offloading and vectorisation layers in SeqAn. SeqAn’s new pairwise alignment module is up to 1600 times faster than the serial code on a single CPU .
People currently working mainly on this topic:
Enrico Seiler and Temesgen Dadi: Temesgen Dadi and Enrico Seiler worked on the published distributed read mapper DREAM-YARA. Enrico Seiler will continue to work on the partitioning step of the DREAM framework and the approximate search distributor.
Svenja Mehringer: Svenja Mehringer will work on compressed indices for the DREAM framework.
Rene Rahn: He is working on compressed indices and is the main author of vectorized and parallelised alignment module.
Marcel Ehrhardt: He works on vectorized and parallelised alignment as well as SeqAn’s offloading layer.