Parallel and Memory-Efficient Preprocessing for Metagenome Assembly

Abstract

The analysis of high-throughput metagenomic sequencing data poses significant computational challenges. Most current de novo assembly tools follow a de Bruijn graph-based methodology. In prior work, decomposing the de Bruijn graph into connected components and partitioning the sequence reads accordingly was shown to be an effective memory-reducing preprocessing step for de novo assembly of large metagenomic datasets. In this paper, we present METAPREP, a new end-to-end parallel implementation of a similar preprocessing step. METAPREP includes efficient implementations of several computational subroutines (e.g., k-mer enumeration and counting, parallel sorting, graph connectivity) that also arise in other genomic data analysis problems, and we show that these implementations are comparable to the state of the art. METAPREP is primarily designed to execute on large shared-memory multicore servers, but it scales gracefully to multiple compute nodes on clusters with parallel I/O capabilities. With METAPREP, we can process the Iowa Continuous Corn soil metagenomics dataset, comprising 1.13 billion reads totaling 223 billion base pairs, in around 14 minutes, using just 16 nodes of the NERSC Edison supercomputer. We also evaluate the performance impact of METAPREP on MEGAHIT, a parallel metagenome assembler.
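
To make the preprocessing idea concrete, the toy C++ sketch below enumerates canonical k-mers from each read, treats reads that share a k-mer as connected, and partitions reads by connected component (here via union-find, one simple way to compute connectivity). Everything in it, the k value, the sample reads, the helper names, is an illustrative assumption; it is a minimal single-threaded sketch of the concept, not the METAPREP implementation, which uses parallel sorting and scalable connectivity as described in the abstract.

    // Sketch: partition reads by connected components of shared canonical
    // k-mers. Illustrative only; not the METAPREP code.
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <unordered_map>
    #include <vector>

    constexpr size_t K = 5; // toy k-mer length; assemblers typically use larger k

    // 2-bit encode a nucleotide; -1 for non-ACGT characters.
    int encode(char c) {
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; case 'T': return 3; default: return -1; }
    }

    // Pack the canonical form (min of the k-mer and its reverse complement)
    // into a 64-bit word; valid for K <= 32.
    bool canonical(const std::string& s, size_t pos, uint64_t& out) {
        uint64_t fwd = 0, rev = 0;
        for (size_t i = 0; i < K; ++i) {
            int b = encode(s[pos + i]);
            if (b < 0) return false;
            fwd = (fwd << 2) | uint64_t(b);    // first base in high bits
            rev |= uint64_t(3 - b) << (2 * i); // complemented, reversed
        }
        out = std::min(fwd, rev);
        return true;
    }

    // Union-find over read indices, with path compression.
    struct DSU {
        std::vector<int> p;
        explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
        int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
        void unite(int a, int b) { p[find(a)] = find(b); }
    };

    int main() {
        // Toy reads; in practice these stream from FASTQ files.
        std::vector<std::string> reads = {
            "ACGTACGTAC", "CGTACGTACG",  // share k-mers -> one component
            "TTTTGGGGCC", "GGGGCCTTAA"   // share k-mers -> another component
        };
        DSU dsu(int(reads.size()));
        std::unordered_map<uint64_t, int> seen; // k-mer -> first read holding it
        for (int r = 0; r < int(reads.size()); ++r)
            for (size_t pos = 0; pos + K <= reads[r].size(); ++pos) {
                uint64_t km;
                if (!canonical(reads[r], pos, km)) continue;
                auto [it, fresh] = seen.emplace(km, r);
                if (!fresh) dsu.unite(r, it->second); // shared k-mer links reads
            }
        for (int r = 0; r < int(reads.size()); ++r)
            std::cout << "read " << r << " -> partition " << dsu.find(r) << "\n";
    }

The payoff of grouping reads this way is that each partition can be assembled independently, so an assembler's peak memory is bounded by the largest partition rather than the whole dataset, which is the memory-reduction effect the paper evaluates.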

Publication
16th IEEE International Workshop on High Performance Computational Biology
Date