Research projects

This laboratory focuses on the use of bioinformatics and systems biology approaches to tackle longstanding problems in basic and translational medicine. All projects in this laboratory involve integrative analysis of diverse genome-wide datasets, especially next-generation sequencing (NGS) data such as RNA-seq, ChIP-seq, DNase-seq, and whole genome sequencing data. Here is a sampling of general research themes. Multiple projects are available under each theme.

  • Automated analysis of 3D biomedical imaging data. Three-dimensional imaging technologies such as micro computed tomography (Micro-CT), optical projection tomography (OPT) and magnetic resonance imaging (MRI) are very important in studying the 3D structure of organs such as the heart and the vasculature in patients and animal models. These images require manual inspection and interpretation, which limits the scalability of this technology when a large number of images must be phenotyped. There is a strong need for computer-assisted automated phenotyping, especially for identifying patients with subtle anatomical abnormalities that may be easily missed by humans, or for rapid phenotyping of genetic animal models in research (e.g., in forward genetic screens). The main challenge is to be able to perform these classification tasks accurately (i.e., with high prediction accuracy) and quickly (i.e., to be able to train such a classifier with a large number of big image files). In this project, we will use advanced computational technologies such as cloud computing and deep learning to address the automated phenotyping challenge.
  • Wearable device, heart rate dynamics and heart failure risk. Cardiac function is often impeded in patients who have experienced a myocardial infarction (heart attack), have an existing cardiac condition such as atrial fibrillation or cardiomyopathy, or have strong risk factors such as high blood pressure. These patients are at high risk of developing heart failure (HF). Being able to track the change in cardiac function in real time under a patient's realistic activity profile is very important in the clinical management of patients who are at high risk of developing HF. To address this clinical challenge, our team recently began to explore the use of a popular wrist-based wearable device as a non-invasive means to track the heart rate dynamics of patients under various levels of physical activity. By combining second-by-second heart rate measurements with Global Positioning System (GPS) signals, we have developed an algorithm to reconstruct a person's heart rate dynamics profile before, during and after various physical exertion events, which we can use to track this person's cardiac function. In this project, we will develop a new computer program (mobile app) that can process heart rate and GPS data recorded by a wearable device, and use them to describe a patient's cardiac function based on their heart rate dynamics profiles. This computer program can be deployed on the patient's mobile device with an option to upload the data wirelessly to the cloud. In particular, we will develop new big data machine-learning algorithms to predict a patient's risk of HF based on their heart rate dynamics profiles.
  • Bioinformatics algorithms for single-cell RNA-seq analysis. Single-cell RNA sequencing (scRNA-seq) enables researchers to study heterogeneity among tens of thousands of individual cells and define cell types from a transcriptomic perspective. scRNA-seq offers a means to precisely quantify the state of individual cells, enabling high-resolution mapping of cell cycle progression, cell differentiation and other trajectories. However, fast and reliable analysis of these large and noisy data requires new statistical and computational considerations. In this project, we will develop cutting-edge bioinformatics methods to analyse a range of scRNA-seq data to answer important biological questions.
  • Integrative metabolomic data analysis. Analysis of high-throughput mass spectrometry-based metabolomic data is challenging because metabolites are difficult to identify accurately and quickly. Integrating other omic data, such as genomic, transcriptomic and proteomic data, has been found to help metabolite identification in a metabolomics analysis. In this project, we will develop a fast integrative bioinformatics pipeline for metabolomic data analysis.
  • Scalable 3D virtual reality visualisation of biological data. Visualisation is critical in the analysis and interpretation of large biological datasets, such as single-cell RNA-seq data that profile the gene expression patterns of tens of thousands of cells. In this project, we will use state-of-the-art virtual reality (VR) technology to construct effective and scalable 3D visualisations of various biological data. We will make use of modern web-based VR JavaScript frameworks to develop a web-based VR visualisation engine. This project is ideally suited for students who have an interest in large-scale data visualisation and virtual reality.
  • Systems developmental biology of mammalian organ formation. Many organs form via intercellular exchange of signaling molecules and an intracellular network of transcription factors. These interactions can be summarised as a gene regulatory network (GRN). We recently reconstructed a GRN from more than 1,000 pieces of gene perturbation evidence and identified a feedback circuit associated with epithelial-mesenchymal signaling interactions during embryonic development of the mouse molar tooth (O'Connell et al., 2012). In silico simulation suggests that the observed reciprocal tissue signaling interactions could be an intrinsic property of the circuit structure. This finding has significant implications for our understanding of this important class of signaling interactions in organ formation/malformation. We are now extending this approach to study the development of other organs, such as the salivary gland, pancreatic islet, ocular lens and heart valve.
  • Decoding the language of life. It is curious that stimulation of the same signalling pathway (e.g., the Wnt pathway) can often lead to expression of different genes in different cell types (e.g., embryonic stem cells vs. differentiated intestinal cells). Recent findings based on genome-wide chromatin analysis (e.g., ChIP-chip/ChIP-seq) suggest that both the chromatin environment and DNA sequence composition play an important role in genomic targeting of transcription factors, opening up the possibility that we could learn a grammar to describe and predict cell-type-specific signalling responses. To test this hypothesis, this project will compile and perform meta-analysis on published genome-wide datasets (for example, ChIP-chip, ChIP-seq, DNase-seq, and RNA-seq from the ENCODE/modENCODE consortia) as well as in-house data generated by local and international collaborators. We will adapt advanced methods from computational linguistics and machine learning to build biologically meaningful models for signalling-responsive transcription factor binding in mammalian cells.
  • Genome-wide chromatin landscape analysis of fungal epigenomes. Fungi have wide medical, agricultural and biotechnological relevance because of their ability to cause diseases in humans and plants. Moreover, many fungi have long been used in the biotechnology (e.g., industrial enzyme production) and food (e.g., wine, cheese, soy sauce fermentation) industries, and some fungi (e.g., mushrooms) are a valuable food source with high nutritional and medicinal value. To better understand fungal potential and diseases, many representative fungal genomes have already been sequenced, and an ongoing joint initiative aims to have 1000 fungal genomes sequenced over the next few years. Despite these efforts at the genomic level, the epigenomes of most fungal species remain largely uncharted. Together with an international collaborator, we will study the genome-wide chromatin landscape of several closely related medically, agriculturally and industrially important species by ChIP-seq. We will identify and systematically analyse the chromatin states in these species using advanced data mining and machine learning techniques, and use this information to gain insight into key physiologies of the species.
  • Causal disease mutation identification in whole genome sequencing data. Whole genome sequencing is now highly cost-effective. It is possible to identify sequence or structural variants in the genome of an individual within weeks. This has opened up enormous possibilities for personalized genomic medicine and the identification of causal genes of both rare and common diseases. Nonetheless, while a large number of sequence or structural variants can be identified in each individual, it is often difficult to pinpoint the disease-causing genetic mutation. In this project, we will develop a bioinformatic pipeline to integrate diverse functional genomic data to prioritise likely causal mutations that underlie a disease.
  • Bioinformatics software testing. Many bioinformatics programs have large input data (e.g., gigabyte-sized sequence data) and often implement sophisticated computational procedures (e.g., network simulation, string matching, machine learning, and combinatorial optimization). As a result, it is difficult to systematically test the correctness of these programs beyond the use of a few trivial test cases. Most faults in these programs are very difficult to detect, but once they occur, they may lead to incorrect biological conclusions or the design of a misguided follow-up experiment. This project will develop tools to help bioinformaticians perform systematic software testing. We have observed that many practicing bioinformaticians lack proper software testing training, and their programs are often not subjected to sufficient testing. One immediate goal of this project is to develop a software package that will help bioinformatics program developers to design, execute and report test cases. This project will fill an important need in bioinformatics that has not been fully addressed previously.
  • A cloud-based approach for incorporating scalability in genome informatics. The advent of Next Generation Sequencing (NGS) is transforming the landscape of biomedical research, ranging from disease gene discovery to clinical application of genomic medicine. NGS enables low-cost, high-throughput sequencing for a wide variety of genome-wide analyses of the genome, epigenome and transcriptome. However, with this vast quantity of data, we are faced with unprecedented technical challenges in terms of computational analysis and storage of these data. The goal of this research project is to investigate the use of cloud-based technology to deal with these challenges. In particular, we plan to utilise the unique strengths of cloud technology, namely adaptive scalability and remote distributed data storage, to overcome the technical challenges. We will develop new bioinformatics pipelines and apply them to two cutting-edge applications: (i) single-cell transcriptomic analysis, and (ii) disease gene discovery using whole genome sequencing data.

For postdocs, students and RAs who want to join this laboratory: All projects require proficiency in at least one programming/scripting language (R, Perl, Python, Java, C++, C, Matlab). Familiarity with the Unix operating system is desirable but not required. Individual projects can be tailored to fit each student's personal interests and skill set. Most projects involve close interactions with local and international collaborators. This is a highly interdisciplinary laboratory. We welcome prospective group members from diverse backgrounds, such as medicine, biology, physics, computer science, mathematics, statistics, and engineering. Expressions of interest, along with your CV, can be sent to Dr. Ho.

More information about research and student opportunities at VCCRI can be found here.