RNASeq Data Analysis

Introduction to RNA sequencing

High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing levels of gene expression in a wide variety of species and conditions. The RNA-seq method typically consists of identification of suitable biological samples (and replicates), isolation of total RNA, enrichment of non-ribosomal RNAs, conversion of RNA to cDNA, construction of a fragment library, sequencing on a high-throughput sequencing platform, generation of single- or paired-end reads of 30–300 base pairs in length, alignment or assembly of these reads, and downstream analysis.

RNA-seq experiments must be analyzed with accurate, efficient software that is carefully designed to handle the very large data volumes generated by most experiments. The data analysis pipeline can be conceptually divided into six steps.

Step 1: Evaluate and manipulate raw data

Step 2: Map to reference, assess results

Step 3: Assemble transcripts

Step 4: Quantify transcripts

Step 5: Test for differential expression

Step 6: Visualize and perform other downstream analysis

I provide several RNA-seq pipelines with example scripts using recent software tools. All software has been installed on the HTC cluster. If you need a personalized consultation on RNA-seq data analysis workflows and selection of a suitable pipeline, please contact me (fangping@pitt.edu).

Reference Genome

Reference genome - The nucleotide sequence of the chromosomes of a species. Genes are the functional units of a reference genome and gene annotations describe the structure of transcripts expressed from those gene loci.

Obtain a reference genome from Ensembl (http://www.ensembl.org/info/data/ftp/index.html), UCSC (http://hgdownload.cse.ucsc.edu/downloads.html), NCBI or iGenomes (https://support.illumina.com/sequencing/sequencing_software/igenome.html).

I will show you how to download the human GRCh38 version of the genome from Ensembl.

The complete data from which these files were obtained can be found at: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/. You will find several Homo sapiens genome assembly files there. Which one should you use? The options available can be a bit overwhelming. For NGS data pipelines, you can choose "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" from Ensembl. The files with "sm" in the name are "soft masked", which means that instead of the repeat regions being given Ns, they are converted to lower case. These are fine to use because most major aligners recognise lower-case letters as valid bases. The primary assembly contains all top-level sequence regions excluding haplotypes and patches. This file is best used for sequence similarity searches, where patch and haplotype sequences would confuse the analysis.

You can use wget to download the Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz file, then gunzip it.

mkdir ref
cd ref
wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens...
gunzip Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Instead of the above, you might consider getting reference genomes and associated annotations from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/)


Gene annotations - Descriptions of gene/transcript models for a genome. A transcript model consists of the coordinates of the exons of a transcript on a reference genome. Additional information, such as the strand the transcript is generated from, the gene name, the coding portion of the transcript, and alternate transcript start sites, may also be provided. The GTF (Gene Transfer Format, .gtf) file is a common format used to store gene and transcript annotation information.

To obtain known gene/transcript annotations, you can use Ensembl (http://useast.ensembl.org/info/data/ftp/index.html) or UCSC.


Use the following commands to download the Ensembl release 86 Homo sapiens GRCh38 gene annotation (release 86 matches the genome and annotation files used throughout this guide):

wget ftp://ftp.ensembl.org/pub/release-86/gtf/homo_sapiens/Homo_sapiens.GRCh38.86.gtf.gz
gunzip Homo_sapiens.GRCh38.86.gtf.gz

Alternatively, the UCSC Table Browser gives you access to UCSC's own annotations as well as several other annotation sources collected by UCSC. You might choose this option if you want a lot of flexibility in the annotations you obtain.

  • Open the following in your browser: http://genome.ucsc.edu/
  • Select 'Tools' and then 'Table Browser' at the top of the page.
  • Select 'Mammal', 'Human', and 'Dec. 2013 (GRCh38/hg38)' from the first row of drop down menus.
  • Select 'Genes and Gene Predictions' and 'GENCODE v24' from the second row of drop down menus.
  • Select 'knownGene' from the third row of drop down menus. Select 'genome' from the fourth row.
  • Select 'GTF - gene transfer format' for output format and enter 'hg38_UCSC_Genes.gtf' for output file.
  • Hit the 'get output' button and save the file. Make note of its location.

Important note on chromosome naming conventions

In order for your RNA-seq analysis to work, the chromosome names in your .gtf file must match those in your reference genome (i.e., your reference genome fasta file). Unfortunately, Ensembl, NCBI, and UCSC do not agree on how to name the chromosomes in many species, so this problem comes up often. You can avoid it by getting the complete reference genome and gene annotation package from the same source (e.g., Ensembl) to maintain consistency.
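
One quick consistency check is to compare the sequence names in the two files directly. The helper below is a minimal sketch (the function name and file names are placeholders): it lists any chromosome named in the GTF that is absent from the FASTA, so any output indicates a naming mismatch (e.g. "chr1" vs "1").

```shell
# List chromosome names that appear in a GTF but not in the reference FASTA.
# Usage: check_chrom_names genome.fa genes.gtf
check_chrom_names() {
    fasta="$1"; gtf="$2"
    fa_list=$(mktemp); gtf_list=$(mktemp)
    # FASTA sequence names: the first word after '>'
    grep '^>' "$fasta" | sed 's/^>//; s/ .*//' | sort -u > "$fa_list"
    # GTF sequence names: the first column, skipping comment lines
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$gtf_list"
    # Print GTF chromosomes that are absent from the FASTA
    comm -13 "$fa_list" "$gtf_list"
    rm -f "$fa_list" "$gtf_list"
}
```

If this prints nothing, the naming conventions agree.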

Important note on reference genome builds

Your annotations must correspond to the same reference genome build as your reference genome fasta file, e.g., both correspond to UCSC human build 'hg38', NCBI human build 'GRCh38', etc. Even if both your reference genome and annotations are from UCSC or Ensembl, they could still correspond to different versions of that genome, which would cause problems in any RNA-seq pipeline.


You should now have two files within your 'ref' directory: 'Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa' and 'Homo_sapiens.GRCh38.86.gtf'.

All aligners require the reference sequence to be indexed -- this is a process that creates additional files that describe the reference file, such as where each chromosome is located in the file. Knowing this information speeds up alignment.

In this step, the user supplies the reference genome sequence (FASTA file) and annotations (GTF file), from which the aligner generates genome indexes that are used in the mapping step. The genome indexes are saved to disk and need to be generated only once for each genome/annotation combination.

At each aligner’s website, a limited collection of prebuilt genome indexes is available, for example for STAR (http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgeno...) and HISAT2 (https://ccb.jhu.edu/software/hisat2/index.shtml). However, it is strongly recommended that users generate their own genome indexes with the most up-to-date assemblies and annotations.

Consult your aligner's manual for how to generate genome indexes. Below is an example of generating HISAT2 indexes for the Homo_sapiens.GRCh38.86 genome and annotation.

Extract splice sites and exons from GRCh38.86 annotation file. Run the following command on HTC login node.

module load HISAT2/2.0.5
module load compiler/python/2.7.10-Anaconda-2.3.0
extract_splice_sites.py Homo_sapiens.GRCh38.86.gtf >hg38.ss
extract_exons.py Homo_sapiens.GRCh38.86.gtf >hg38.exon

To build HISAT2 index, submit the following slurm job to HTC cluster.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
module load HISAT2/2.0.5
hisat2-build -p 16 --ss hg38.ss --exon hg38.exon Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa hg38_tran

WARNING: In order to index the entire human genome, HISAT2 requires at least 160GB of RAM, and several hours.

Now, you should find eight files named hg38_tran.1.ht2 to hg38_tran.8.ht2 within the ref directory. These are the HISAT2 index files.

STAR is a splice-aware aligner that produces gapped alignments to handle reads spanning exon-intron junctions, so it is appropriate whenever an annotated reference genome is available. To build the STAR index, create the output directory first (mkdir STAR_Index) and submit the following slurm job to the HTC cluster.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
module load STAR/2.5.2b
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir ./STAR_Index/ --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.86.gtf

RNA-seq Data

The test data consists of RNA-seq samples from primary human airway smooth muscle cell lines. In the experiment, four primary human airway smooth muscle cell lines were treated with dexamethasone, albuterol, dexamethasone+albuterol, or left untreated. The Illumina TruSeq assay was used to prepare the samples, and the libraries were sequenced as paired-end reads on an Illumina HiSeq 2000 instrument. For a description of the experiment, see PubMed entry 24926665; for the raw data, see GEO entry GSE52778. The test data consists of 4 untreated samples (SRR1039508, SRR1039512, SRR1039516, SRR1039520) and 4 samples treated with dexamethasone (SRR1039509, SRR1039513, SRR1039517, SRR1039521).

You can download the FASTQ files from the Sequence Read Archive using the SRA run IDs, e.g., SRR1039508. For a paired-end experiment you should have two files for each ID, SRR1039508_1.fastq.gz and SRR1039508_2.fastq.gz, which contain the first and second reads of the paired-end fragments.

To extract fastq files from the GEO data, use NCBI's SRA Toolkit. It is already installed on the HTC cluster and can be loaded with module load SRAToolkit/2.8.2. The fastq-dump command produces two compressed fastq files for each dataset, one for the forward reads and one for the reverse reads. In this example, this is done with a SLURM job.

#SBATCH --job-name=fastq-dump
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=fastq_dump.out
module load SRAToolkit/2.8.2
fastq-dump --split-3 --gzip SRR1039508
fastq-dump --split-3 --gzip SRR1039509
fastq-dump --split-3 --gzip SRR1039512
fastq-dump --split-3 --gzip SRR1039513
fastq-dump --split-3 --gzip SRR1039516
fastq-dump --split-3 --gzip SRR1039517
fastq-dump --split-3 --gzip SRR1039520
fastq-dump --split-3 --gzip SRR1039521

Check the fastq-dump options at https://ncbi.github.io/sra-tools/fastq-dump.html. A commonly recommended option set for extracting paired-end data is fastq-dump --split-files -I --gzip SRR1039513. If you run this command, you will receive two files, SRR1039513_1.fastq.gz and SRR1039513_2.fastq.gz, with significantly different sizes. Those paired-end files are not synchronized, while many downstream programs expect synchronized pairs. The commands above instead use fastq-dump with the legacy --split-3 option, which outputs 1, 2, or 3 files: 1 file means the data is not paired; 2 files mean paired data with no low-quality reads or reads shorter than 20 bp; 3 files mean paired data with asymmetric quality or trimming. In the 3-file case, most people ignore the unpaired <file>.fastq (e.g., SRR1039513.fastq.gz).
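
Whichever options you use, it is worth verifying that the extracted mates are synchronized before alignment. The helper below is a minimal sketch (the function names are mine; substitute your own file names):

```shell
# Sanity check: paired FASTQ files must contain the same number of reads.
# Usage: check_pairs sample_1.fastq.gz sample_2.fastq.gz
count_reads() {
    # A FASTQ record is exactly four lines
    echo $(( $(zcat "$1" | wc -l) / 4 ))
}
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $1 has $r1 reads but $2 has $r2" >&2
        return 1
    fi
}
```

For example, check_pairs SRR1039513_1.fastq.gz SRR1039513_2.fastq.gz should report the same read count for both mates.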

Your own data

If you receive your own data, contact the data provider to find out whether the data consists of paired-end or single-end sequencing reads, and whether the RNA-seq library is strand-specific or non-strand-specific. Common preprocessing tasks are listed below.

Unzip fastq.gz to fastq files.

gunzip filename.fastq.gz

Merge split fastq files.

cat 1_S1_L001_R1_001.fastq 1_S1_L002_R1_001.fastq 1_S1_L003_R1_001.fastq 1_S1_L004_R1_001.fastq > 1_R1.fastq

Convert fastq file to fastq.gz file.

gzip 1_R1.fastq

PreAlignment QC

To get a sense of your data quality before alignment, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. Video tutorial of FastQC is here: http://www.youtube.com/watch?v=bz93ReOv87Y

First, set up some directories for output:

mkdir fastqc_pretrim 

Submit the following batch job to run FastQC on your fastq files.

#SBATCH --job-name=fastqc
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=fastqc.out
module load FastQC/0.11.4
fastqc -o ./fastqc_pretrim/ *_1.fastq.gz
fastqc -o ./fastqc_pretrim/ *_2.fastq.gz

Open the *_fastqc.html files to view the FastQC reports.

Trimming Adaptor

FastQC issues a warning when overrepresented sequences are found in the first 200,000 reads. These overrepresented sequences can tell you which adapters were used. For example, the "Overrepresented sequences" section of SRR1039508_1_fastqc.html shows that "TruSeq Adapter, Index 2" is overrepresented.

Here is the documentation of Illumina Adaptor Sequences: https://support.illumina.com/content/dam/illumina-support/documents/docu...


Cutadapt is a tool specifically designed to remove adapters from NGS data. To see the full list of options, read this document: http://cutadapt.readthedocs.io/en/stable/guide.html

For paired-end data containing Illumina TruSeq adapters, the command line looks as follows:

cutadapt \
            -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
            -o trimmed.1.fastq.gz -p trimmed.2.fastq.gz \
            reads.1.fastq.gz reads.2.fastq.gz

To run on HTC cluster, submit the following job.

#SBATCH --job-name=cutadapt
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=cutadapt.out
module load cutadapt/1.12
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o SRR1039508_1.trimmed.fastq.gz -p SRR1039508_2.trimmed.fastq.gz SRR1039508_1.fastq.gz SRR1039508_2.fastq.gz

The output of this step will be trimmed FASTQ files for each data set.



Adapter contamination will lead to alignment errors and an increased number of unaligned reads, since the adapter sequences are synthetic and do not occur in the genomic sequence. There are applications (e.g., small RNA sequencing) where adapter trimming is strictly necessary: with a fragment size of around 24 nucleotides, one will definitely sequence into the 3' adapter. But there are also applications (transcriptome sequencing, whole genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider skipping the adapter removal and thereby save time and effort.
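
To inform that decision, you can estimate the contamination level directly from the raw reads. The helper below is a minimal sketch (the function name is mine): it reports the fraction of reads, among the first 100,000 records, whose sequence contains the TruSeq adapter prefix AGATCGGAAGAGC.

```shell
# Estimate adapter contamination: fraction of reads (in the first N records)
# whose sequence line contains the TruSeq adapter prefix AGATCGGAAGAGC.
# Usage: adapter_fraction reads.fastq.gz [n_records]
adapter_fraction() {
    n=${2:-100000}
    # Sequence lines are every 2nd line of each 4-line FASTQ record
    zcat "$1" | head -n $((n * 4)) | awk '
        NR % 4 == 2 { total++; if (index($0, "AGATCGGAAGAGC")) hit++ }
        END { printf "%.4f\n", (total ? hit / total : 0) }'
}
```

A fraction near zero suggests trimming can be skipped; a substantial fraction argues for running cutadapt.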


Alignment is the process of taking the short reads from the FASTQ files and placing them at the correct locations in the genome. A variety of aligners can accomplish this goal. The two major types are non-splice-aware aligners (e.g., BWA, Bowtie) and splice-aware aligners (e.g., STAR, HISAT2). To align human RNA-seq data to the human genome, a splice-aware aligner is required. All of these aligners can take advantage of multi-threading, splitting up the alignment work across multiple cores. The jobs below request 16 cores, so the aligner is instructed to split the alignment across 16 threads.

HISAT2 alignment

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome). HISAT2 uses graph-based alignment and has succeeded HISAT and TopHat2. The output of this step will be a SAM/BAM file for each data set.

You can perform alignments with HISAT2 to the genome and transcriptome. Refer to HISAT2 manual for a more detailed explanation: https://ccb.jhu.edu/software/hisat2/manual.shtml

First, begin by making the appropriate output directory for our alignment results.

mkdir alignment

Submit the following job to HTC cluster to align the RNA-seq data to the human genome and transcriptome.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J hisat2_human_samples
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hisat2.out
module load HISAT2/2.0.5
echo SRR1039508
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039508_1.fastq.gz -2 SRR1039508_2.fastq.gz -S ./alignment/SRR1039508.sam
echo SRR1039509
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039509_1.fastq.gz -2 SRR1039509_2.fastq.gz -S ./alignment/SRR1039509.sam
echo SRR1039512
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039512_1.fastq.gz -2 SRR1039512_2.fastq.gz -S ./alignment/SRR1039512.sam
echo SRR1039513
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039513_1.fastq.gz -2 SRR1039513_2.fastq.gz -S ./alignment/SRR1039513.sam
echo SRR1039516
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039516_1.fastq.gz -2 SRR1039516_2.fastq.gz -S ./alignment/SRR1039516.sam
echo SRR1039517
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039517_1.fastq.gz -2 SRR1039517_2.fastq.gz -S ./alignment/SRR1039517.sam
echo SRR1039520
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039520_1.fastq.gz -2 SRR1039520_2.fastq.gz -S ./alignment/SRR1039520.sam
echo SRR1039521
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039521_1.fastq.gz -2 SRR1039521_2.fastq.gz -S ./alignment/SRR1039521.sam

Options specified as follows:

  • '-p 16' tells HISAT2 to use 16 CPUs for alignment.
  • '--rna-strandness RF' specifies the strandedness of the RNA-seq library. You can specify RF if a TruSeq strand-specific library was used to make the sequences. See here for options: https://github.com/griffithlab/rnaseq_tutorial/blob/master/manuscript/su...
  • '--dta' Reports alignments tailored for transcript assemblers.
  • '-x /path/to/hisat2/index' The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
  • '-1 /path/to/read1.fastq.gz' The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
  • '-2 /path/to/read2.fastq.gz' The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
  • '-S /path/to/output.sam' The output SAM format text file of alignments.

HISAT2 generates a summary of the alignments, which is in file hisat2.out (specified as #SBATCH --output=hisat2.out). Notice the number of total reads, reads aligned and various metrics regarding how the reads aligned to the reference.

STAR alignment

To use STAR to map reads to the genome, create a directory (mkdir STAR_alignment) and submit the following job.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J STAR_human_samples
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=STAR.out
module load STAR/2.5.2b
echo SRR1039508
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039508_1.fastq.gz SRR1039508_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039508
echo SRR1039509
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039509_1.fastq.gz SRR1039509_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039509
echo SRR1039512
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039512_1.fastq.gz SRR1039512_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039512
echo SRR1039513
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039513_1.fastq.gz SRR1039513_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039513
echo SRR1039516
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039516_1.fastq.gz SRR1039516_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039516
echo SRR1039517
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039517_1.fastq.gz SRR1039517_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039517
echo SRR1039520
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039520_1.fastq.gz SRR1039520_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039520
echo SRR1039521
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039521_1.fastq.gz SRR1039521_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039521

Note: If you are aligning several experiments in a row, add the option "--genomeLoad LoadAndKeep" and STAR will load the genome index into shared memory so that it can use it for the next time you run the program.

SAM to BAM Conversion

This step converts the HISAT2 SAM files to BAM files and sorts them by aligned position. Submit the following batch job from within the alignment directory.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J samtools_sort_human_samples
#SBATCH --output=samtools.out
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
module load samtools/1.3.1-gcc5.2.0
samtools sort -@ 16 -o SRR1039508.bam SRR1039508.sam
samtools sort -@ 16 -o SRR1039509.bam SRR1039509.sam
samtools sort -@ 16 -o SRR1039512.bam SRR1039512.sam
samtools sort -@ 16 -o SRR1039513.bam SRR1039513.sam
samtools sort -@ 16 -o SRR1039516.bam SRR1039516.sam
samtools sort -@ 16 -o SRR1039517.bam SRR1039517.sam
samtools sort -@ 16 -o SRR1039520.bam SRR1039520.sam
samtools sort -@ 16 -o SRR1039521.bam SRR1039521.sam

You can now remove these .sam files to save storage space.
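
If you want to be cautious, remove each SAM file only after confirming that a non-empty sorted BAM with the same sample name exists. A minimal sketch (the function name is mine):

```shell
# Remove each .sam only if a corresponding non-empty .bam exists.
clean_sams() {
    for sam in *.sam; do
        [ -e "$sam" ] || continue       # no SAM files at all
        bam="${sam%.sam}.bam"
        if [ -s "$bam" ]; then
            rm "$sam"
        else
            echo "Keeping $sam: $bam is missing or empty" >&2
        fi
    done
}
```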

Post-Alignment Quality check

You can use FastQC to perform basic QC of your BAM file. This will give you output very similar to when you ran FastQC on your fastq files.

Some tools, such as IGV, require BAM files to be indexed. We can use samtools index for this purpose. Submit the following job to index all BAM files within the current folder.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J samtools_index_human_samples
#SBATCH --output=samtools_index.out
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
module load samtools/1.3.1-gcc5.2.0
for f in *.bam; do samtools index "$f"; done

The output is a .bam.bai file for each .bam file.


The RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data.

infer_experiment.py can "guess" how an RNA-seq experiment was configured, particularly how reads were stranded for strand-specific RNA-seq data, by comparing the "strandness of reads" with the "strandness of transcripts".

module load RSeQC/2.6.4
[fangping@login0b ~]$ infer_experiment.py -r ./ref/Homo_sapiens.GRCh38.86.bed12 -i ./STAR_alignment/Aligned.sortedByCoord.out.bam -s 400000
Reading reference gene model ./ref/Homo_sapiens.GRCh38.86.bed12 ... Done
Loading SAM/BAM file ...  Total 400000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0282
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0039
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9679

This is an example of reverse strand-specific, paired-end RNA-seq data. This strandedness information is required for the next steps of RNA-seq data analysis.
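
The decision rule can be automated by parsing the fractions above. The helper below is a sketch (the function name is mine, and the 0.8 thresholds are my own rule of thumb, not from RSeQC):

```shell
# Decide library strandedness from infer_experiment.py output.
# Usage: infer_experiment.py ... | guess_strandedness
guess_strandedness() {
    awk -F': ' '
        /1\+\+,1--,2\+-,2-\+/ { fr = $2 }   # fraction explained by FR orientation
        /1\+-,1-\+,2\+\+,2--/ { rf = $2 }   # fraction explained by RF orientation
        END {
            if (rf > 0.8)      print "reverse (RF): use --rna-strandness RF / --stranded reverse"
            else if (fr > 0.8) print "forward (FR): use --rna-strandness FR / --stranded yes"
            else               print "unstranded: use --stranded no"
        }'
}
```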

Assemble and quantify expressed genes and transcripts

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. Refer to the Stringtie manual for a more detailed explanation: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

StringTie assembles and quantifies expressed genes and transcripts in three steps. Refer to the HISAT2 and StringTie publication (PubMed: 27560171) for an explanation of the three steps. “The alignments are passed to StringTie for transcript assembly. StringTie assembles the genes for each data set separately, estimating the expression levels of each gene and each isoform as it assembles them. After assembling each sample, the full set of assemblies is passed to StringTie's merge function, which merges together all the gene structures found in any of the samples. This step is required because transcripts in some of the samples might be only partially covered by reads, and as a consequence only partial versions of them will be assembled in the initial StringTie run. The merge step creates a set of transcripts that is consistent across all samples, so that the transcripts can be compared in subsequent steps. The merged transcripts are then fed back to StringTie one more time so that it can re-estimate the transcript abundances using the merged structures. The re-estimation uses the same algorithm as the original assembly, but reads may need to be re-allocated for transcripts whose structures were altered by the merging step.”

Submit the following job to assemble transcripts for each sample.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_human_samples
#SBATCH --output=stringtie.out
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
module load StringTie/1.3.2c
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039508.gtf -l SRR1039508 SRR1039508.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039509.gtf -l SRR1039509 SRR1039509.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039512.gtf -l SRR1039512 SRR1039512.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039513.gtf -l SRR1039513 SRR1039513.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039516.gtf -l SRR1039516 SRR1039516.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039517.gtf -l SRR1039517 SRR1039517.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039520.gtf -l SRR1039520 SRR1039520.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039521.gtf -l SRR1039521 SRR1039521.bam

Generate a text file named mergelist.txt listing the per-sample GTF files created above, one per line:

SRR1039508.gtf
SRR1039509.gtf
SRR1039512.gtf
SRR1039513.gtf
SRR1039516.gtf
SRR1039517.gtf
SRR1039520.gtf
SRR1039521.gtf

Submit the following job to merge transcripts from all samples.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_merge_human_samples
#SBATCH --output=stringtie_merge.out
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
module load StringTie/1.3.2c
stringtie --merge -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o stringtie_merged.gtf mergelist.txt

In the third step, submit the following job to estimate transcript abundances and create table counts for post analysis.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_abundance_human_samples
#SBATCH --output=stringtie_abundance.out
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
module load StringTie/1.3.2c
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039508/SRR1039508.gtf -A ballgown/SRR1039508/gene_abundances.tsv SRR1039508.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039509/SRR1039509.gtf -A ballgown/SRR1039509/gene_abundances.tsv SRR1039509.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039512/SRR1039512.gtf -A ballgown/SRR1039512/gene_abundances.tsv SRR1039512.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039513/SRR1039513.gtf -A ballgown/SRR1039513/gene_abundances.tsv SRR1039513.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039516/SRR1039516.gtf -A ballgown/SRR1039516/gene_abundances.tsv SRR1039516.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039517/SRR1039517.gtf -A ballgown/SRR1039517/gene_abundances.tsv SRR1039517.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039520/SRR1039520.gtf -A ballgown/SRR1039520/gene_abundances.tsv SRR1039520.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039521/SRR1039521.gtf -A ballgown/SRR1039521/gene_abundances.tsv SRR1039521.bam

Extra options with the above commands are specified below:

  • '-p 4' tells StringTie to use 4 CPUs. StringTie jobs are not computationally intensive; requesting fewer CPUs and less walltime can allow the job to run through backfill.
  • '-G ' reference annotation to use for guiding the assembly process (GTF/GFF3)
  • '-e' only estimate the abundance of given reference transcripts (requires -G)
  • '-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
  • '-o' output path/file name for the assembled transcripts GTF (default: stdout)
  • '-A' output path/file name for gene abundance estimates

The primary output of StringTie is a Gene Transfer Format (GTF) file that contains details of the transcripts that StringTie assembles from the RNA-seq data. Transcript abundance is reported in FPKM and TPM.

  • FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of read pairs aligning to the feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
  • TPM: Transcripts per million. This is the number of transcripts from this particular gene, normalized first by gene length and then by sequencing depth (in millions) in the sample.
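
The two definitions can be checked with a toy calculation. The numbers below are made up; note that TPM values always sum to one million across features, while FPKM values do not:

```shell
# Toy FPKM/TPM calculation from fragment counts and transcript lengths.
# Input lines: "<name> <fragments> <length_bp>" (made-up example data).
printf 'geneA 100 2000\ngeneB 300 1000\n' | awk '
    { name[NR]=$1; cnt[NR]=$2; kb[NR]=$3/1000; total+=$2 }
    END {
        # rate = fragments per kilobase of transcript
        for (i=1; i<=NR; i++) { rate[i]=cnt[i]/kb[i]; rsum+=rate[i] }
        for (i=1; i<=NR; i++)
            # FPKM divides the rate by total fragments (in millions);
            # TPM divides the rate by the sum of all rates, times 1e6.
            printf "%s FPKM=%.2f TPM=%.0f\n", name[i], rate[i]/(total/1e6), rate[i]/rsum*1e6
    }'
```

With these toy numbers, geneB gets three times the count of geneA on half the length, so its per-kilobase rate (and hence both measures) is six times higher.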

If StringTie is run with the -A <gene_abund.tab> option, it returns a file containing gene abundances. For details on the StringTie output files, refer to the StringTie manual.

We can use these files to perform various comparisons of expression estimation tools (e.g. stringtie, kallisto, raw counts) and metrics (e.g. FPKM vs TPM).

Counting reads in features

Differential expression tools, such as edgeR, work on a table of integer read counts, with rows corresponding to genes and columns to independent libraries. The counts represent the total number of reads aligning to each gene (or other genomic locus). htseq-count is a tool that works on alignments to produce raw counts, instead of FPKM/TPM values, for differential expression analysis. Refer to the HTSeq documentation for a more detailed explanation: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Run htseq-count and calculate gene-level counts. Submit the following job to HTC cluster:

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J htseq_human_samples
#SBATCH --output=htseq.out
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
module load HTSeq/0.6.1p1
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039508.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039508_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039509.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039509_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039512.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039512_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039513.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039513_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039516.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039516_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039517.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039517_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039520.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039520_gene.tsv
htseq-count --format bam --order pos --mode intersection-strict --stranded reverse --minaqual 1 --type exon --idattr gene_id SRR1039521.bam ../ref/Homo_sapiens.GRCh38.86.gtf > htseq/SRR1039521_gene.tsv

Extra options specified below:

  • '--format' specifies the input file format, one of BAM or SAM. Since we have BAM format files, select 'bam' for this option.
  • '--order' provide the expected sort order of the input file. Previously we generated position sorted BAM files so use 'pos'.
  • '--mode' determines how to deal with reads that overlap more than one feature. We believe the 'intersection-strict' mode is best.
  • '--stranded' specifies whether data is stranded or not. The TruSeq strand-specific RNA libraries suggest the 'reverse' option for this parameter.
  • '--minaqual' will skip all reads with alignment quality lower than the given minimum value
  • '--type' specifies the feature type (3rd column in GFF file) to be used. (default, suitable for RNA-Seq and Ensembl GTF files: exon)
  • '--idattr' The feature ID used to identity the counts in the output table. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.
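The difference between the overlap modes can be sketched in a few lines of Python. This is an illustration of the logic only, not HTSeq's actual implementation, and the read and gene names are made up:

```python
# Toy illustration of htseq-count overlap modes (not HTSeq's actual code).
# Each read is represented by the set of gene IDs overlapping each of its
# aligned positions; an empty set means the position hits no feature.

def assign_read(position_features, mode="intersection-strict"):
    """Return the gene a read is counted for, or a special label."""
    if mode == "union":
        result = set().union(*position_features)
    elif mode == "intersection-strict":
        result = set.intersection(*(set(f) for f in position_features))
    elif mode == "intersection-nonempty":
        nonempty = [set(f) for f in position_features if f]
        result = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(mode)
    if not result:
        return "__no_feature"
    if len(result) > 1:
        return "__ambiguous"
    return result.pop()

# A read whose first half overlaps gene A only and whose second half
# overlaps no annotated feature (e.g. it hangs off the end of an exon):
read = [{"geneA"}, {"geneA"}, set(), set()]
print(assign_read(read, "union"))                 # geneA
print(assign_read(read, "intersection-strict"))   # __no_feature
```

As the example shows, 'intersection-strict' discards reads that extend past annotated exons, which is why it is the more conservative choice.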

Merge the results files into a single matrix for use in edgeR. The following joins the results for each replicate, adds a header, and reformats the result as a tab-delimited file:

join SRR1039508_gene.tsv SRR1039509_gene.tsv | join - SRR1039512_gene.tsv | join - SRR1039513_gene.tsv | join - SRR1039516_gene.tsv | join - SRR1039517_gene.tsv | join - SRR1039520_gene.tsv | join - SRR1039521_gene.tsv | tr ' ' '\t' > gene_read_counts_table_all.tsv
echo -e "GeneID\tSRR1039508\tSRR1039509\tSRR1039512\tSRR1039513\tSRR1039516\tSRR1039517\tSRR1039520\tSRR1039521" > header.txt
cat header.txt gene_read_counts_table_all.tsv > gene_read_counts_table_all_final.tsv
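If you prefer to do this merge in Python, the following sketch builds the same kind of matrix. The sample names and file paths are hypothetical stand-ins for the htseq-count outputs generated earlier:

```python
# Pure-Python equivalent of the join pipeline above (a sketch; the
# sample names and file paths are hypothetical stand-ins for the
# htseq-count outputs generated earlier).
import csv

def read_counts(path):
    """Read a two-column htseq-count output into {gene_id: count},
    skipping the __no_feature/__ambiguous/etc. summary lines."""
    counts = {}
    with open(path) as fh:
        for gene, n in csv.reader(fh, delimiter="\t"):
            if not gene.startswith("__"):
                counts[gene] = int(n)
    return counts

def merge_counts(sample_files, out_path):
    """Join per-sample counts on gene ID into one tab-delimited matrix."""
    tables = {s: read_counts(p) for s, p in sample_files}
    genes = sorted(set().union(*(t.keys() for t in tables.values())))
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["GeneID"] + [s for s, _ in sample_files])
        for g in genes:
            writer.writerow([g] + [tables[s].get(g, 0) for s, _ in sample_files])

# Example:
# merge_counts([("SRR1039508", "htseq/SRR1039508_gene.tsv"),
#               ("SRR1039509", "htseq/SRR1039509_gene.tsv")],
#              "gene_read_counts_table_all_final.tsv")
```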

htseq-count outputs identify genes by Ensembl gene ID. Create a mapping file to go from gene IDs to gene names:

perl -ne 'if ($_ =~ /gene_id\s\"(ENSG\S+)\"\;/) { $id = $1; $name = undef; if ($_ =~ /gene_name\s\"(\S+)"\;/) { $name = $1; }; }; if ($id && $name) {print "$id\t$name\n";} if ($_=~/gene_id\s\"(ERCC\S+)\"/){print "$1\t$1\n";}' ../../ref/Homo_sapiens.GRCh38.86.gtf | sort | uniq > ENSG_ID2Name.txt
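The same mapping can be extracted in Python, which may be easier to modify. This is a sketch that assumes the Ensembl GTF attribute layout used in this tutorial:

```python
# A Python version of the Perl one-liner above (a sketch; it assumes
# the Ensembl GTF attribute layout, e.g. gene_id "ENSG..."; gene_name "...";).
import re

def id_to_name(gtf_lines):
    """Return sorted, unique (gene_id, gene_name) pairs from GTF lines."""
    pairs = set()
    for line in gtf_lines:
        gid = re.search(r'gene_id "(ENSG\S+?)"', line)
        name = re.search(r'gene_name "(\S+?)"', line)
        if gid and name:
            pairs.add((gid.group(1), name.group(1)))
        # ERCC spike-in records have no gene_name; reuse the ID itself
        ercc = re.search(r'gene_id "(ERCC\S+?)"', line)
        if ercc:
            pairs.add((ercc.group(1), ercc.group(1)))
    return sorted(pairs)

# Example:
# with open("../../ref/Homo_sapiens.GRCh38.86.gtf") as fh:
#     for gid, gname in id_to_name(fh):
#         print(gid + "\t" + gname)
```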

Counting reads from StringTie outputs

A Python script (prepDE.py) is provided by StringTie to extract the read count information directly from the files generated by StringTie (when StringTie is run with the -e parameter). https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

Use the following commands to extract the read count information from the ballgown directory (download prepDE.py from the StringTie site linked above if it is not already available):

module load StringTie/1.3.2c
python prepDE.py -i ballgown

Two files gene_count_matrix.csv and transcript_count_matrix.csv are generated from the StringTie output files.

Comments about normalization:

The most common application of RNA-seq is to estimate gene and transcript expression, based primarily on the number of reads that map to each transcript sequence. The simplest approach to quantification is to aggregate raw counts of mapped reads using programs such as htseq-count. This gene-level (rather than transcript-level) quantification approach uses a gene transfer format (GTF) file containing the genome coordinates of exons and genes, and often discards multireads. Raw read counts alone are not sufficient to compare expression levels among samples, as these values are affected by factors such as transcript length, total number of reads, and sequencing biases. The measure RPKM (reads per kilobase of exon model per million reads) is a within-sample normalization method that removes the feature-length and library-size effects. This measure and its derivatives FPKM (fragments per kilobase of exon model per million mapped reads), a within-sample normalized transcript expression measure analogous to RPKM, and TPM (transcripts per million) are the most frequently reported RNA-seq gene expression values. Correcting for gene length is not necessary when comparing changes in expression of the same gene across samples, but it is necessary for correctly ranking gene expression levels within a sample, to account for the fact that longer genes accumulate more reads. Keep the distinction between within-sample and between-sample comparisons in mind; it is a common source of confusion.
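The within-sample measures are simple to write down. The following sketch computes RPKM and TPM from synthetic counts and gene lengths (illustration only; the numbers are made up):

```python
# Within-sample normalization formulas (synthetic example numbers).

def rpkm(counts, lengths_bp):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    total = sum(counts)
    return [c / (l / 1e3) / (total / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then rescale so
    the values sum to exactly one million within the sample."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

counts = [100, 200, 300]      # raw read counts for three genes
lengths = [1000, 2000, 1000]  # exon-model lengths in base pairs
print(tpm(counts, lengths))   # [200000.0, 200000.0, 600000.0]
```

Unlike RPKM/FPKM, TPM always sums to one million within a sample, which is why per-sample TPM values can be read directly as proportions of the sequenced transcript pool.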

Differential expression analysis requires comparing gene expression values among samples. RPKM, FPKM, and TPM normalize away the most important factor for comparing samples, sequencing depth, whether directly or by accounting for the number of transcripts, which can differ significantly between samples. However, these measures rely on normalization by total or effective counts, and thus tend to perform poorly when samples have heterogeneous transcript distributions, that is, when highly and differentially expressed features skew the count distribution. Normalization methods such as TMM, DESeq, PoissonSeq, and upper quartile take this into account by ignoring highly variable and/or highly expressed features.
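The idea behind these between-sample methods can be illustrated with upper-quartile scaling. This is a deliberately simplified sketch; edgeR's TMM and DESeq's median-of-ratios estimators are more robust than this:

```python
# Simplified upper-quartile scaling factors (illustration only; edgeR's
# TMM and DESeq's median-of-ratios methods are more robust than this).

def upper_quartile(values):
    """75th percentile of the nonzero counts (linear interpolation)."""
    v = sorted(x for x in values if x > 0)
    pos = 0.75 * (len(v) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])

def uq_factors(samples):
    """Per-sample scaling factors relative to the mean upper quartile."""
    uqs = [upper_quartile(s) for s in samples]
    mean_uq = sum(uqs) / len(uqs)
    return [u / mean_uq for u in uqs]

# Two samples where the second was sequenced twice as deeply: its
# scaling factor comes out twice as large, so dividing counts by the
# factors makes the samples comparable.
print(uq_factors([[10, 20, 30, 40], [20, 40, 60, 80]]))
```

Using a quantile of the count distribution rather than the total count means a handful of extremely highly expressed genes cannot dominate the scaling factor, which is the weakness of total-count (and hence RPKM/FPKM/TPM-style) normalization.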

Differential Expression

DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression; both take as input a matrix of read counts mapped to particular genomic features (e.g., genes). Use the raw counts you generated above with htseq-count or with prepDE.py on StringTie outputs.

Differential expression using DESeq2

We will use gene_count_matrix.csv generated from prepDE.py to perform differential expression analysis.

Create a plain text file named pheno_data with the following content:

name    type
SRR1039508      UN
SRR1039509      DE
SRR1039512      UN
SRR1039513      DE
SRR1039516      UN
SRR1039517      DE
SRR1039520      UN
SRR1039521      DE

Launch R 3.2.2 on the HTC cluster:

Start an interactive job (1 core, 2 hours) and R:

srun -n1 -t02:00:00 --pty bash
module load R/3.2.2-gcc5.2.0

Run the following R commands.

#Import DESeq2 library in R
library("DESeq2")
#Load gene(/transcript) count matrix and labels
countData <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
colData <- read.csv("pheno_data", sep="\t", row.names=1)
all(rownames(colData) %in% colnames(countData))
countData <- countData[, rownames(colData)]
all(rownames(colData) == colnames(countData))
#Create a DESeqDataSet from count matrix and labels
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ type)
#Run the default analysis for DESeq2 and generate results table
dds <- DESeq(dds)
res <- results(dds)
#Sort by adjusted p-value and display
(resOrdered <- res[order(res$padj), ]) 
write.table(resOrdered, file="genes.txt", sep="\t")

Differential expression using edgeR

We will use htseq-count outputs to perform differential expression analysis using edgeR.

To submit a batch R job to the HTC cluster, create an R script, such as script.R, and submit the following job.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J edgeR_human_samples
#SBATCH --output=edgeR.out
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
module load R/3.2.2-gcc5.2.0
R CMD BATCH script.R
# or
#Rscript script.R > $SLURM_JOBID.out

Run the following R commands to perform differential analysis using edgeR.

#Read in gene mapping
mapping=read.table("ENSG_ID2Name.txt", header=FALSE, stringsAsFactors=FALSE, row.names=1)
# Read in count matrix
rawdata=read.table("gene_read_counts_table_all_final.tsv", header=TRUE, stringsAsFactors=FALSE, row.names=1)
# Check dimensions
dim(rawdata)
# Require at least 25% of samples to have count > 25
quant <- apply(rawdata,1,quantile,0.75)
keep <- which(quant >= 25)
rawdata <- rawdata[keep,]
# load edgeR
library(edgeR)
# make class labels, UN for untreated, DE for treated with dexamethasone
class <- factor( c("UN","DE","UN","DE","UN","DE","UN","DE"))
# Get common gene names (column 1 of the mapping file holds the name)
genes <- mapping[rownames(rawdata), 1]
# Make DGEList object
y <- DGEList(counts=rawdata, genes=genes, group=class)
# TMM Normalization
y <- calcNormFactors(y)
# Estimate dispersion
y <- estimateCommonDisp(y, verbose=TRUE)
y <- estimateTagwiseDisp(y)
# Differential expression test
et <- exactTest(y)
# Print top genes
topTags(et)
# Print number of up/down significant genes at FDR = 0.05  significance level
summary(de <- decideTestsDGE(et, p=.05))
detags <- rownames(y)[as.logical(de)]
# Output DE genes
# Matrix of significantly DE genes
mat <- cbind(
  detags,
  mapping[detags, 1],
  sprintf("%0.3f", log10(et$table[detags, "PValue"])),
  sprintf("%0.3f", et$table[detags, "logFC"])
)
colnames(mat) <- c("Gene", "Gene_Name", "Log10_Pvalue", "Log_fold_change")
# Order by log fold change
o <- order(et$table$logFC[as.logical(de)],decreasing=TRUE)
mat <- mat[o,]
# Save table
write.table(mat, file="DE_genes.txt", quote=FALSE, row.names=FALSE, sep="\t")
#To exit R type the following
q(save="no")

Differential Analysis using Ballgown

Ballgown is a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. The standard unit used by ballgown is FPKM. Use the folder ballgown generated by StringTie for differential expression analysis. Refer to the Ballgown manual for a more detailed explanation: https://www.bioconductor.org/packages/release/bioc/html/ballgown.html

Load this R module on the HTC cluster to use ballgown.

module load R/3.2.2-gcc5.2.0

Run the R commands detailed in the following R script.

# Load the ballgown library (and genefilter, used for filtering below)
library(ballgown)
library(genefilter)
# Load phenotype data from a file we saved in the current working directory
pheno_data = read.csv("pheno_data",sep = "\t")
# Load ballgown data structure and save it to a variable "bg"
bg = ballgown(dataDir = "ballgown",samplePattern = "", pData=pheno_data)
# Display a description of this object
bg
# Filter low-abundance transcripts. Here we remove all transcripts with a variance across the samples of less than one
bg_filt = subset(bg, "rowVars(texpr(bg)) > 1", genomesubset=TRUE)
# Load all attributes including gene name
bg_filt_table = texpr(bg_filt , 'all')
bg_filt_gene_names = unique(bg_filt_table[, 9:10])
# Perform DE analysis using the filtered data for transcripts
results_transcripts = stattest(bg_filt, feature="transcript", covariate="type", getFC=TRUE, meas="FPKM")
# Perform DE analysis using the filtered data for genes
results_genes = stattest(bg_filt, feature="gene", covariate="type", getFC=TRUE, meas="FPKM")
# Add gene names
results_genes = merge(results_genes, bg_filt_gene_names, by.x=c("id"), by.y=c("gene_id"))
# Output the filtered list of genes and transcripts and save to tab delimited files
write.table(results_transcripts, "transcript_results_filtered.tsv", sep="\t", quote=FALSE)
write.table(results_genes, "gene_results_filtered.tsv", sep="\t", quote=FALSE)
# Identify the significant genes with p-value < 0.05
sig_transcripts = subset(results_transcripts, results_transcripts$pval<0.05)
sig_genes = subset(results_genes, results_genes$pval<0.05)
# Output the significant gene results to a pair of tab delimited files
write.table(sig_transcripts, "transcript_results_sig.tsv", sep="\t", quote=FALSE)
write.table(sig_genes, "gene_results_sig.tsv", sep="\t", quote=FALSE)
# Exit the R session
q(save="no")

Pseudoalignment approach

Two alternative approaches are available for reference-based RNA-Seq analysis. Traditionally, an annotated genome is available and reads are mapped to the genome with a gapped mapper. Next, (novel) transcript discovery and quantification can proceed with or without an annotation file, and novel transcripts are then functionally annotated.

In recent years, pseudoalignment approaches such as Sailfish and kallisto have been proposed. If no novel transcript discovery is needed, reads can be mapped to the reference transcriptome using an ungapped aligner, and transcript identification and quantification occur simultaneously. Assigning reads to the correct isoform remains challenging for genes with multiple alternatively spliced transcripts.

Using Kallisto for transcript expression abundance estimation from RNA-seq data

For more information on Kallisto, refer to the Kallisto project page and Kallisto manual page.

Obtain transcript sequences in fasta format

Note that we already have fasta sequences for the reference genome. However, Kallisto works directly on target cDNA/transcript sequences. Remember also that we have transcript models for genes, downloaded from Ensembl in GTF format. This GTF describes the coordinates of the exons that make up each transcript, but it does not contain the transcript sequences themselves. So we do not currently have the transcript sequences that Kallisto needs. There are many places we could obtain these transcript sequences; for example, we can download them directly in fasta format from the Ensembl FTP site.

wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/cdna/Homo_sapien...
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz

Build a Kallisto transcriptome index

Remember that Kallisto does not perform alignment or use a reference genome sequence. Instead it performs pseudoalignment to determine the compatibility of reads with targets (transcript sequences in this case). However, similar to alignment algorithms like HISAT2 or STAR, Kallisto requires an index to assess this compatibility efficiently and quickly. First, process the downloaded transcriptome fasta file with the kallisto "index" function to build the transcriptome index that kallisto needs; we name the index file "human_trans_GRCh38.idx":

#SBATCH --job-name=kallisto_index
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
module load kallisto/0.43.0
kallisto index -i human_trans_GRCh38.idx Homo_sapiens.GRCh38.cdna.all.fa

Generate transcript abundance estimates for all samples using Kallisto.

#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH -J kallisto_human_samples
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
#SBATCH --output=kallisto.out
module load kallisto/0.43.0
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039508 --threads=4 --plaintext ../SRR1039508_1.fastq.gz ../SRR1039508_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039509 --threads=4 --plaintext ../SRR1039509_1.fastq.gz ../SRR1039509_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039512 --threads=4 --plaintext ../SRR1039512_1.fastq.gz ../SRR1039512_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039513 --threads=4 --plaintext ../SRR1039513_1.fastq.gz ../SRR1039513_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039516 --threads=4 --plaintext ../SRR1039516_1.fastq.gz ../SRR1039516_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039517 --threads=4 --plaintext ../SRR1039517_1.fastq.gz ../SRR1039517_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039520 --threads=4 --plaintext ../SRR1039520_1.fastq.gz ../SRR1039520_2.fastq.gz
kallisto quant --index=../ref/human_trans_GRCh38.idx --output-dir=SRR1039521 --threads=4 --plaintext ../SRR1039521_1.fastq.gz ../SRR1039521_2.fastq.gz

The main output of the kallisto quantification is the set of abundance estimates in the "abundance.tsv" file in each output directory. For each transcript ("target_id"), abundance is reported in estimated counts ("est_counts") and in transcripts per million ("tpm").
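Because TPM values are normalized within each sample, the tpm column of a well-formed abundance.tsv should sum to roughly one million, which makes a handy sanity check. A sketch; the column names are the standard kallisto output headers:

```python
# Quick sanity check on a kallisto abundance.tsv: the tpm column should
# sum to roughly one million (a sketch; the column names follow the
# standard kallisto output headers).
import csv

def tpm_total(path):
    """Sum the tpm column of a kallisto abundance.tsv file."""
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return sum(float(row["tpm"]) for row in reader)

# Example:
# assert abs(tpm_total("SRR1039508/abundance.tsv") - 1e6) < 1.0
```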

Differential Analysis

sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments. To use sleuth, the RNA-Seq data must first be quantified with kallisto. Refer to the sleuth documentation for usage details.