Research

Our overarching goal is to develop novel algorithms to address important and fundamental problems in computational genomics. Below are some of the key research themes we are currently exploring:

High-performance genome assembly algorithms

De novo assembly, especially haplotype-resolved de novo assembly, has been a central problem in bioinformatics for four decades and remains one of its most challenging tasks. It draws on multiple advanced techniques, including sketching, sequence alignment, and several branches of graph theory, and it demands programming skills of the highest level. We have developed a series of de novo assembly algorithms, including hifiasm, hifiasm (Hi-C) and hifiasm (UL), which are designed to produce optimal genome assemblies by combining different data types. These algorithms are widely used and have become the dominant long-read genome assemblers. We are currently particularly interested in developing de novo assembly algorithms for complex genomes with polyploid alterations, such as cancer genomes and polyploid plant genomes.

Comprehensive variant calling and interpretation

For the human genome, variant calling is typically performed through read alignment, in which fragmented reads are aligned back to the human reference genome. However, the generic reference genome often lacks sequence that is specific to an individual, which can lead to inaccuracies and biases, especially in highly repetitive and structurally divergent regions. Consequently, there is a rapidly growing demand for de novo genome assembly, which reconstructs the genome without relying on a reference. Leveraging our computational expertise, we aim to develop innovative variant calling and interpretation methods based on de novo genome assembly.

Resolving challenging medically relevant genes

Many critical medically relevant genes, such as HLA, SMN1, SMN2, C3, C4, and NOTCH2NLC, are difficult to resolve because of their high repetitiveness and structural variation. Most computational approaches rely on read-to-reference alignments and are therefore constrained by inaccuracies in reference genomes. Our goal is to develop assembly-based, reference-free computational methods that accurately reconstruct these challenging genomic regions.

Software