Amazon research grant proposal
Scalable and collaborative single-cell genomic analysis on the cloud

Two innovative techniques are transforming the landscape of biomedical research: Next Generation Sequencing (NGS) and single-cell analysis. By combining them, we can now generate NGS-based genome-wide expression profiles of thousands of single cells in a single experiment. This makes it possible to study systematically how cell-to-cell variation affects biological processes that impact human health. Indeed, single-cell genomic technology was named Method of the Year 2013 by Nature Methods.

Nonetheless, the pace of data generation far outstrips our ability to analyse and share the resulting data effectively. We need not only to address the computational scalability of processing huge datasets, but also to design a more efficient means of sharing these datasets to enable better collaboration within the international scientific community. In this project, we will utilise the unique strengths of cloud technology (adaptive scalability and remote data storage) to:

  1. Perform parallel processing of thousands of single-cell datasets in a time- and cost-effective manner
  2. Develop a collaborative web-based genome browser that allows large amounts of single-cell genomic data to be shared, viewed and manipulated in a Google Docs-like manner

Aim 1: A cloud-based pipeline for processing single-cell RNA-seq data

To illustrate the scalability issue, consider the alignment of 1000 single-cell RNA-seq datasets. At roughly 30 minutes per dataset, this takes about 21 days if the alignments are performed sequentially, compared with about 30 minutes if they are all run simultaneously. The key to solving this embarrassingly parallel problem is to leverage distributed cloud-based computing to perform many analyses at once. We will develop publicly available tools to deploy and control a cloud-based system in which computational resources are adaptively requested or released, maximising their utility at minimal cost. We have already developed several integrated computational pipelines that mine Sequence Read Archive (SRA) data in Amazon Elastic Compute Cloud (EC2). For this project specifically, we will develop a statistically rigorous yet computationally efficient pipeline for single-cell RNA-seq data analysis that takes advantage of Amazon’s highly adaptive cloud computing infrastructure.
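The scaling argument can be checked with a short calculation. The sketch below is only an illustration: the function name `wall_clock_minutes` is hypothetical, and the 30 minutes per dataset is an assumption implied by the 21-day and 30-minute figures rather than a measured value. It also assumes identical workers and ignores start-up and data-transfer overhead.

```python
import math


def wall_clock_minutes(n_datasets: int, minutes_per_dataset: float, n_workers: int) -> float:
    """Wall-clock time when datasets are spread evenly over identical workers."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Each worker aligns its datasets one after another, so the total time is
    # set by the busiest worker, which handles ceil(n_datasets / n_workers).
    rounds = math.ceil(n_datasets / n_workers)
    return rounds * minutes_per_dataset


N_DATASETS = 1000
MINUTES_PER_DATASET = 30.0  # assumption, implied by the figures in the text

sequential = wall_clock_minutes(N_DATASETS, MINUTES_PER_DATASET, n_workers=1)
parallel = wall_clock_minutes(N_DATASETS, MINUTES_PER_DATASET, n_workers=N_DATASETS)

print(f"sequential: {sequential / (60 * 24):.1f} days")  # sequential: 20.8 days
print(f"fully parallel: {parallel:.0f} minutes")         # fully parallel: 30 minutes
```

Intermediate worker counts show the trade-off that adaptive scaling is meant to exploit: under the same assumptions, 100 workers would finish in 300 minutes.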

Figure 1. Illustration showing how the cloud-based pipeline works.

As shown in Figure 1, the user first uploads the output of the sequencing machine, which contains reads from multiple samples, to an S3 bucket. To obtain the reads for each individual sample, Amazon Elastic MapReduce (EMR) is used to separate the reads according to the unique barcodes they contain. Each sample's reads are then submitted to a queue, implemented using Amazon Simple Queue Service together with Amazon SimpleDB, which assigns the reads to one of the EC2 instances for RNA-seq analysis (Figure 2). The results of the RNA-seq analysis then undergo statistical analysis and normalisation.
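The barcode-splitting and queueing steps can be sketched in plain Python. This is an in-memory stand-in for illustration only, not the proposed EMR and Simple Queue Service implementation: it makes no AWS calls, the names `demultiplex`, `dispatch` and `BARCODE_LENGTH` are hypothetical, and it assumes the barcode is a fixed-length prefix of each read, which depends on the sequencing protocol.

```python
from collections import defaultdict
from queue import Queue

BARCODE_LENGTH = 4  # hypothetical; the real length depends on the protocol


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using the barcode at the start of each read."""
    per_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Unknown barcode: set the whole read aside rather than guess.
            per_sample["unassigned"].append(read)
        else:
            per_sample[sample].append(insert)
    return dict(per_sample)


def dispatch(per_sample, n_workers):
    """Queue one job per sample and hand jobs to workers in turn."""
    jobs = Queue()
    for sample in sorted(per_sample):
        if sample != "unassigned":
            jobs.put(sample)
    assignment = defaultdict(list)
    worker = 0
    while not jobs.empty():
        # Simulates workers taking the next job from the queue one at a time.
        assignment[worker].append(jobs.get())
        worker = (worker + 1) % n_workers
    return dict(assignment)


reads = ["ACGTTTGGA", "TTAGCCATG", "ACGTGGGAC", "NNNNACGTA"]
barcodes = {"ACGT": "cell_1", "TTAG": "cell_2"}

per_sample = demultiplex(reads, barcodes)
print(per_sample["cell_1"])     # ['TTGGA', 'GGGAC']
print(dispatch(per_sample, 2))  # {0: ['cell_1'], 1: ['cell_2']}
```

Because each sample becomes an independent job, the number of workers can change without altering the demultiplexing step, which is what makes the problem embarrassingly parallel.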

Figure 2. RNA-seq analysis pipeline.

Aim 2: A collaborative web-based genome browser for large-scale genomic analysis

My laboratory is involved in several international collaborations that generate and analyse hundreds of genome-wide datasets, including many single-cell genomic datasets. To facilitate collaboration among biologists and bioinformaticians in different geographical locations, we propose to build a web-based genome browser that can handle large amounts of genome-scale data. We will extend existing state-of-the-art genome browsers (e.g. Biodalliance) to support comparative analysis and collaborative research.

Figure 3. Illustration showing the pipeline for the collaborative web-based genome browser.

Because of the large volume of data and the need for preprocessing, we will use Amazon Simple Storage Service (S3) to store the data, and EC2 both to process the data and to host the web-based interface (Figure 3).

Research Team

Dr Joshua W. K. Ho

Head of Bioinformatics and Systems Medicine Laboratory

Victor Chang Cardiac Research Institute, Australia


A/Prof. Catherine Suter

Head of Epigenetics Laboratory

Victor Chang Cardiac Research Institute, Australia


International Collaborators

Dr Koon Ho Wong

University of Macau, China


Dr Richard Sherwood

Harvard University, USA