CECS professor receives NSF grant to resolve bioinformatics sequencing issues

November 15, 2013

Qiang Zhu, professor of computer and information science, recently received a $222,277 National Science Foundation (NSF) grant to support research that will provide new data storage, indexing and retrieval techniques to solve relevant issues in bioinformatics sequence analysis. Zhu is the principal investigator on the project. Michigan State University is a collaborative research partner and received a separate NSF award.

Zhu’s research delves into the problems created by the vast increase in DNA sequencing that has turned the biology field into a data-intensive science.

“Over the past decade, DNA and RNA sequencing has become quick, easy and inexpensive,” Zhu said. “Sequencing has become indispensable for basic biological research and is increasingly serving biomedical diagnostics, trait association studies, gene expression analysis, drug resistance and other areas. All of these fields use sequences in different ways and are drowning in sequence data.”

Zhu said database overloads make it difficult to compute and analyze genome sequence data efficiently.

“As the sizes of the genome sequence databases grow, their computational demands are outpacing existing computing capacity,” he said. “This makes it even more difficult to complete an analysis. Primary data analysis now costs significantly more than generating the data in the first place.”

The research project focuses on a variety of approaches that use fixed-length strings, or subsequences (called “k-mers”), taken from genome sequences. Although researchers have given considerable attention to efficient indexing, storage and retrieval for large-scale k-mer sets over the past decade, most existing techniques require a computer with a very large main memory, which is not readily available to many biology labs. In addition, most techniques are optimized for exact matches, which limits the range of sequence analysis applications they can serve efficiently.
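To make the idea of a k-mer concrete, the following is a minimal Python sketch, not taken from Zhu's project. It slides a window of length k along a sequence and collects every substring it sees. The function names `kmers` and `kmer_counts` and the sample sequence are assumptions chosen for illustration, and the whole structure is held in main memory, which is exactly the limitation the project aims to move past.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring of `sequence`, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers;
    # if k exceeds n, the range is empty and no k-mers are returned.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how often each k-mer occurs (a simple in-memory index)."""
    return Counter(kmers(sequence, k))


if __name__ == "__main__":
    # Hypothetical toy sequence; real genome data is vastly larger.
    print(kmers("GATTACA", 3))
    print(kmer_counts("GATTACA", 2))
```

For a sequence of length n there are n - k + 1 overlapping k-mers, so a dictionary-based index like this one grows quickly with genome-scale input.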

“Most existing methods for storing k-mers do not support multiple word lengths,” Zhu said. “For many sequence analysis problems, including assembly, variant detection and error correction, the use of multiple word lengths would allow better sensitivity and provide for more accurate sequence analysis.”

To overcome these issues, Zhu and his colleagues are investigating techniques for storing and querying large k-mer data sets. They will develop new data structures, building strategies, search algorithms and performance models.

“We expect to produce efficient on-disk approaches for storing and querying large-scale genome sequence databases,” Zhu said. “The research results will also impact other popular application areas such as biometrics, image processing, social network, E-commerce: any field where non-ordered, discrete, multi-dimensional data is crucial.”

Zhu has many years of research experience in the database field, including developing centralized/distributed database systems. Visit his website for additional information about his research experience and interests.