The field of genomics is entering an exciting era with unprecedented opportunities for new medical insights, enabled by an enormous and ever-growing amount of genomic data. These data are characterized by highly distributed acquisition, huge storage requirements, and involved analyses that integrate heterogeneous information. My research is dedicated to identifying and addressing the challenges arising in the context of such data. This undertaking includes the design and development of new algorithms for coping with the distribution and storage of the data, for facilitating access to it, and for improving the analysis and inference performed on it.

I follow a multidisciplinary approach that combines tools from machine learning, information theory, and statistics to create a sound technical framework for tackling the challenges of modern genomic data.

Mayo Grand Challenge

If you are a cancer patient who has exhausted all standard treatment options, a one-day turnaround of your DNA testing results could make a big difference in your search for new, individualized treatment options. Mayo Clinic envisions just that: supercomputers fast and efficient enough to interpret your DNA sequence in hours, so that providers can quickly apply the results to individualized patient care. Right now, it can take days to weeks for computer systems to analyze DNA sequences in search of new treatment options. This is time that many patients with advanced cancer do not have. We are collaborating with researchers and physicians from the Mayo Clinic Center for Individualized Medicine on a research effort known as the Grand Challenge project, employing supercomputers many times faster than typical home or business computers so that the analysis and interpretation of a complete human genome sequence can be completed in a single day.

Chan-Zuckerberg Initiative

Advances in biological data acquisition technologies have spurred the generation of heterogeneous omics data at high speed and volume. Efforts to link various forms of omics data are currently underway, with new platforms such as MIMOmics and the Omics Discovery Index emerging at a fast pace. Consequently, data storage, transmission, visualization, and scalable processing have become major challenges in the advancement of biological and medical science research. This sentiment is reflected in the National Human Genome Research Institute (NHGRI) roadmap, which asserts that “The major bottleneck in genome sequencing is no longer data generation - the computational challenges around data analysis, display and integration are now rate-limiting.” To address these problems, the proposal aims to develop an integrated information storage, visualization, and shared machine learning pipeline that operates on various forms of losslessly and lossily (quantized) compressed omics data while producing results matching those obtained from uncompressed data.
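To make the lossy (quantized) route concrete: one common approach in this space is to coarsen the quality scores attached to sequencing reads before entropy coding, since a smaller alphabet compresses far better. The sketch below is a hypothetical standalone illustration (not code from the proposal); it bins Phred+33 quality scores using the boundaries of the Illumina 8-level binning scheme and compares the compressed size of raw versus quantized scores on synthetic data.

```python
# Illustrative lossy (quantized) compression of sequencing quality scores.
# NOTE: hypothetical standalone sketch, not code from the described pipeline.
# Each Phred score is replaced by one representative value per bin, shrinking
# the alphabet from ~40 symbols to 8 so a general-purpose compressor does
# markedly better; bin boundaries follow the Illumina 8-level binning scheme.
import random
import zlib

# (low, high, representative) Phred-score bins, Illumina 8-level scheme.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

def quantize(quality_string):
    """Map each Phred+33 quality character to its bin representative."""
    out = []
    for ch in quality_string:
        q = ord(ch) - 33  # decode the Phred+33 ASCII offset
        for low, high, rep in BINS:
            if low <= q <= high:
                out.append(chr(rep + 33))
                break
    return "".join(out)

# Synthetic quality string: 100,000 random Phred scores in [0, 40].
random.seed(0)
raw = "".join(chr(33 + random.randint(0, 40)) for _ in range(100_000))
binned = quantize(raw)

raw_size = len(zlib.compress(raw.encode()))
binned_size = len(zlib.compress(binned.encode()))
print(f"raw: {raw_size} bytes, quantized: {binned_size} bytes")
```

In a real pipeline the quantized scores would feed a dedicated entropy coder rather than zlib, and the key research question, as stated above, is ensuring that downstream analyses on such lossy representations match the results obtained from uncompressed data.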

Bringing digital era formats to omics and health

We are developing a prototype to showcase the applicability of the developed formats for genomic information representation. The goal is to demonstrate the benefits of the technology in terms of reduced storage, faster access, and easier visualization of the considered omics data, which will significantly advance research in the clinical setting. We are also creating an interface that makes the new formats work seamlessly with existing tools and applications, enabling immediate adoption, and making all the resources available in the cloud. Early adopters of the technology that we will target include the High-Throughput Sequencing and Genotyping Unit and the Mining Microbial Genomes theme at UIUC, and the OSF Saint Francis Medical Center in Peoria.