RNA-seq Data Analysis Capability in KBase

The RNA-seq analysis suite is still in development. Please report any issues or suggestions to us.

All users, beginners or advanced, are highly encouraged to read these instructions before trying out the RNA-seq pipeline in KBase.

The KBase RNA-seq Service provides a number of data analysis tools (Apps) based on the original and new Tuxedo suites of RNA-seq tools. The original Tuxedo suite consists of Bowtie2 and TopHat2 to align the reads, Cufflinks to assemble the transcripts, Cuffdiff to identify differentially expressed genes, and CummeRbund to visualize them as 2D plots and heatmaps [1,2,3]. The new Tuxedo suite uses HISAT2 instead of Bowtie2/TopHat2 to align reads and StringTie instead of Cufflinks to assemble transcripts [4,5]. Both HISAT2 and StringTie are a significant improvement over their predecessors: they are fast and efficient, with low memory footprints [4,5]. Additionally, the RNA-seq Apps in KBase are modular, so users can choose which read aligner and assembler to use for differential gene expression analysis.

We have provided two Narrative tutorials, one based on the original Tuxedo suite and the other based on the new Tuxedo suite, to demonstrate how to use the KBase RNA-seq pipeline end-to-end. It is also possible to mix and match the analytic tools from both suites in a single Narrative and to compare and contrast the resulting RNA-seq analyses.

You can copy these tutorials and re-run any of the steps (perhaps changing parameters or using your own data) in your KBase account. You will need a KBase account in order to view and copy the Narratives.

The figure below shows how the RNA-seq tools from the Original and New Tuxedo suites can be chained together in a workflow in KBase to study differential gene expression. Each step in this workflow is then described in detail.

The Differential Gene Expression Workflow using the RNA-seq Apps in KBase

Step 1: Import RNA-seq Data into the Narrative

The RNA-seq pipeline starts with importing high-throughput reads obtained from Illumina or SOLiD sequencing platforms, together with the corresponding reference genome, into the Narrative interface.

1.1    Short reads: The reads must be a set of single-end or paired-end reads in FASTA or FASTQ format. Use the Bulk Uploader to import the short reads into your KBase account. If you don’t have your own reads, you can try out the RNA-seq tools by selecting a set of example reads (trimmed down in size) from the Public tab in the Data Browser and adding them to your Narrative (see http://kbase.us/narrative-guide/add-data-to-your-narrative/).
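As a quick sanity check before uploading, the sketch below guesses whether a reads file looks like FASTA or FASTQ from its first non-blank line. It relies only on the general convention that FASTA records start with ">" and FASTQ records start with "@"; the function name guess_reads_format is hypothetical and is not part of KBase or the Bulk Uploader.

```python
def guess_reads_format(text: str) -> str:
    """Guess the reads format from the first non-blank line.

    Hypothetical helper for illustration only; it does not validate
    the whole file, just the leading record marker.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "FASTA"
        if line.startswith("@"):
            return "FASTQ"
        break  # first non-blank line has neither marker
    return "unknown"


fasta_example = ">read1\nACGTACGT\n"
fastq_example = "@read1\nACGTACGT\n+\nIIIIIIII\n"
print(guess_reads_format(fasta_example))
print(guess_reads_format(fastq_example))
```

A result of "unknown" suggests the file may be compressed, empty, or in another format and is worth inspecting before import.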

1.2    Genomes: Import the appropriate reference genome from the Public tab in the Data Browser. You must run the Build Bowtie2 Index App to index the genome if you intend to use the Bowtie2 or TopHat2 aligners. The HISAT2 aligner, however, only needs the relevant reference genome (no genome indexing needed).
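The indexing rule above can be summarized as a small lookup. This is a minimal sketch with hypothetical names (needs_bowtie2_index and the two constants); it only encodes the rule stated in this guide and does not call any KBase API.

```python
# Aligners that, per this guide, require running Build Bowtie2 Index first.
ALIGNERS_NEEDING_BOWTIE2_INDEX = {"Bowtie2", "TopHat2"}
# All aligner Apps mentioned in this guide.
KNOWN_ALIGNERS = ALIGNERS_NEEDING_BOWTIE2_INDEX | {"HISAT2"}


def needs_bowtie2_index(aligner: str) -> bool:
    """Return True if the chosen aligner needs a Bowtie2-indexed genome."""
    if aligner not in KNOWN_ALIGNERS:
        raise ValueError(f"Unknown aligner: {aligner!r}")
    return aligner in ALIGNERS_NEEDING_BOWTIE2_INDEX


for name in ("Bowtie2", "TopHat2", "HISAT2"):
    print(name, needs_bowtie2_index(name))
```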

Step 2: Create RNA-seq Sample Set

This App allows you to associate experiment metadata with the input sequence reads and generate the RNA-seq Sample Set object, which the next step requires for a set of samples.
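Conceptually, a sample set pairs each reads object with a condition label so that replicates can be grouped later. The sketch below illustrates that idea with plain Python data; the sample names, condition labels, and the group_by_condition helper are all made up for illustration and do not reflect the actual structure of the KBase RNA-seq Sample Set object.

```python
from collections import defaultdict

# (reads name, condition label) pairs; all values are illustrative.
samples = [
    ("wt_rep1", "wild_type"),
    ("wt_rep2", "wild_type"),
    ("mut_rep1", "mutant"),
    ("mut_rep2", "mutant"),
]


def group_by_condition(pairs):
    """Group reads names by their condition label, keeping input order."""
    groups = defaultdict(list)
    for reads_name, condition in pairs:
        groups[condition].append(reads_name)
    return dict(groups)


groups = group_by_condition(samples)
for condition, reads in groups.items():
    print(condition, reads)
```

Grouping like this makes it easy to confirm that every condition has the replicates you expect before running the alignment step.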

Step 3: Align Reads to the Reference Genome using Bowtie2/TopHat2/HISAT2

KBase provides three different Apps that can be used to align the reads in the RNA-seq Sample Set to a prokaryotic or eukaryotic genome. You can use one or more of these Apps (Bowtie2/TopHat2/HISAT2) to align the reads to the reference genome, depending on your experiment, and compare the alignment results. The Bowtie2 and TopHat2 Apps need a Bowtie2-indexed genome to generate the read alignments, whereas HISAT2 uses only the reference genome. HISAT2 is faster and more sensitive than Bowtie2/TopHat2 and also uses less memory.

NOTE: Even though alignment is one of the sequential steps in the KBase RNA-seq Pipeline, each of these alignment Apps can also be run as a standalone analysis tool for one or more RNA-seq samples.

Step 4: Assemble Transcripts with Cufflinks/StringTie

KBase provides two different Apps that can be used to assemble the alignments into a parsimonious set of transcripts. The RNASeqAlignmentSet obtained from any one of the Bowtie2/TopHat2/HISAT2 Apps can be used as input to either the Cufflinks or StringTie App to generate GTF and FPKM files. These files are subsequently wrapped as an RNAseqExpression object in KBase for each individual sample and an RNASeqExpressionSet object for the whole SampleSet. These Apps also generate fully normalized FPKM/TPM ExpressionMatrix objects that can be downloaded or used as input to downstream analysis tools.
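For readers unfamiliar with the FPKM and TPM units mentioned above, the sketch below shows their textbook definitions. It is not how Cufflinks or StringTie compute abundances (both use statistical models to assign fragments to transcripts); the function names and example numbers are illustrative only.

```python
def fpkm(fragments: float, length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def fpkm_to_tpm(fpkm_values):
    """Rescale FPKM values within one sample so they sum to one million (TPM)."""
    total = sum(fpkm_values)
    if total == 0:
        return [0.0 for _ in fpkm_values]
    return [value / total * 1e6 for value in fpkm_values]


# 100 fragments on a 1,000 bp transcript in a sample with 1,000,000 mapped fragments.
print(fpkm(100, 1000, 1_000_000))
# Two transcripts with FPKM 1.0 and 3.0 in the same sample.
print(fpkm_to_tpm([1.0, 3.0]))
```

Because TPM values always sum to the same total within a sample, they are often easier to compare across samples than FPKM values.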

NOTE: Due to the modular nature of these Apps, KBase provides four different options to run this step. Based on your interest, you can choose any one of the following options:

  • TopHat2 -> Cufflinks
  • TopHat2 -> StringTie
  • HISAT2 -> Cufflinks
  • HISAT2 -> StringTie
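The four options listed above are simply every pairing of the two aligners with the two assemblers. A minimal sketch that enumerates them (the variable names are illustrative, and the Bowtie2 App from Step 3 is left out because the list above does not include it):

```python
from itertools import product

ALIGNERS = ["TopHat2", "HISAT2"]
ASSEMBLERS = ["Cufflinks", "StringTie"]

# Every aligner paired with every assembler, in the order listed above.
PIPELINES = [f"{aligner} -> {assembler}" for aligner, assembler in product(ALIGNERS, ASSEMBLERS)]

for pipeline in PIPELINES:
    print(pipeline)
```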

Step 5: Identify Differential Expression using Cuffdiff

This App uses the RNASeqExpressionSet data object obtained from either the Cufflinks or StringTie App to calculate gene and transcript expression levels in more than one condition and to identify significant changes in those expression levels. Cuffdiff calculates the FPKM value of each transcript, primary transcript and gene in each sample and produces a number of output files, which are zipped into the Cuffdiff output as an RNASeqDifferentialExpression data object.

NOTE: Steps 6-9 below take Cuffdiff output as input and generate plots and/or expression matrices.

Step 6: View CummeRbund Plots

This App takes Cuffdiff output as input and generates a number of plots for the exploration, analysis and visualization of high-throughput RNA-seq data.

Step 7: View Interactive Volcano Plot

This App generates an interactive Volcano Plot (2D scatter plot) to show the list of differentially expressed genes based on the fold change and p-value.
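A volcano plot conventionally places each gene at (log2 fold change, -log10 p-value), so strongly changed and highly significant genes appear in the upper left and upper right corners. The sketch below computes those coordinates for one gene. The volcano_point function and the pseudocount (added to avoid dividing by zero when a gene has zero expression) are assumptions for illustration, not necessarily what the KBase App does.

```python
import math


def volcano_point(fpkm_a: float, fpkm_b: float, p_value: float, pseudocount: float = 1.0):
    """Return (log2 fold change of B over A, -log10 p-value) for one gene.

    The pseudocount is an illustrative assumption to keep the ratio finite.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in the interval (0, 1]")
    log2_fc = math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


x, y = volcano_point(fpkm_a=1.0, fpkm_b=3.0, p_value=0.01)
print(x, y)
```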

Step 8: Create Expression Matrix from Cuffdiff

This App creates an expression matrix based on the data obtained from the Cuffdiff App. The advanced options let you select the matrix transformation used for normalization and filter the matrix by alpha cutoff, fold change, and number of genes.

Step 9: View differentially expressed genes from Cuffdiff in HeatMap

This App compares a pair of conditions in RNA-seq expression data to identify differentially expressed genes and view them in an interactive heatmap. It uses the data produced by the Cuffdiff RNA-seq differential expression analysis as input and creates a heatmap of differentially expressed genes that can be filtered by alpha cutoff, fold change, and number of genes.
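To make the three filters concrete, here is a minimal sketch of how an alpha cutoff, a fold change threshold, and a gene count limit could be combined. The field names (gene, log2_fc, q_value), the default thresholds, the choice to apply alpha to a q-value, and the ranking by absolute fold change are all assumptions for illustration; the KBase Apps may apply these filters differently.

```python
def filter_genes(rows, alpha=0.05, min_abs_log2_fc=1.0, max_genes=None):
    """Keep significant, strongly changed genes, ranked by |log2 fold change|."""
    kept = [
        row for row in rows
        if row["q_value"] <= alpha and abs(row["log2_fc"]) >= min_abs_log2_fc
    ]
    kept.sort(key=lambda row: abs(row["log2_fc"]), reverse=True)
    return kept if max_genes is None else kept[:max_genes]


# Made-up example rows, not real Cuffdiff output.
example_rows = [
    {"gene": "g1", "log2_fc": 2.5, "q_value": 0.001},
    {"gene": "g2", "log2_fc": -3.0, "q_value": 0.02},
    {"gene": "g3", "log2_fc": 0.4, "q_value": 0.001},  # fold change too small
    {"gene": "g4", "log2_fc": 4.0, "q_value": 0.2},    # not significant
]

top = filter_genes(example_rows, alpha=0.05, min_abs_log2_fc=1.0, max_genes=1)
print([row["gene"] for row in top])
```

Tightening alpha or raising the fold change threshold shrinks the gene list, while the gene count limit caps how many rows reach the heatmap or matrix.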

Next Steps

The expression matrix generated by the RNA-seq workflow can be used in downstream analysis by other Apps in KBase. For example, you can analyze patterns of gene expression by grouping expression data via different clustering algorithms such as Hierarchical, K-means and WGCNA. It can also be used in metabolic modeling Apps in KBase to compare reaction flux with gene expression to identify the pathways where expression and flux agree or conflict.


[1] Trapnell C, Pachter L, Salzberg SL. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. Vol 25, 9:1105-1111. http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract

[2] Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562–578. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334321/

[3] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 14:R36 http://www.genomebiology.com/2013/14/4/R36/abstract

[4] Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology 33, 290–295. http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html

[5] Pertea M, Kim D, Pertea G, Leek JT and Salzberg SL (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie, and Ballgown. Nature Protocols 11, 1650–1667. http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html
