RNASeq Data Analysis

Introduction to RNA sequencing

Reference Genome

Reference genome - The nucleotide sequence of the chromosomes of a species. Genes are the functional units of a reference genome and gene annotations describe the structure of transcripts expressed from those gene loci.

Obtain a reference genome from Ensembl (http://www.ensembl.org/info/data/ftp/index.html), UCSC (http://hgdownload.cse.ucsc.edu/downloads.html), NCBI or iGenomes (https://support.illumina.com/sequencing/sequencing_software/igenome.html).

I will show you how to download the human GRCh38 version of the genome from Ensembl.

The complete data from which these files were obtained can be found at: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/. You can find various homo sapiens genome assembly. Which one should I use? The options available can be a bit overwhelming. For NGS data pipelines, you can choose "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz" for Ensembl. The files with "sm" in the name are "soft masked" which means instead of the repeat regions being given Ns, they are converted to lower-case. These are OK to use because most major aligners recognise lower-case letters as valid bases. Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. This file is best used for performing sequence similarity searches where patch and haplotype sequences would confuse analysis.

You could use wget to download the Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz file, then unzip/untar.

mkdir ref
cd ref
wget ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/Homo_sapiens...
gunzip Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Instead of the above, you might consider getting reference genomes and associated annotations from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/)

Annotation

Gene annotations - Descriptions of gene/transcript models for a genome. A transcript model consists of the coordinates of the exons of a transcript on a reference genome. Additional information such as the strand the transcript is generated from, gene name, coding portion of the transcript, alternate transcript start sites, and other information may be provided. GTF (.gtf) file is a common file format referred to as Gene Transfer Format used to store gene and transcript annotation information.

To obtain known gene/transcript annotations, you can use Ensembl (http://useast.ensembl.org/info/data/ftp/index.html) or UCSC.

Use ENSEMBL FTP SITE

Use the following commands to download Ensembl release 89 homo sapiens GRCh38 gene annotation.

Use UCSC TABLE BROWSER

Based on UCSC annotations or several other possible annotation sources collected by UCSC. You might choose this option if you want to have a lot of flexibility in the annotations you obtain.

  • Open the following in your browser: http://genome.ucsc.edu/
  • Select 'Tools' and then 'Table Browser' at the top of the page.
  • Select 'Mammal', 'Human', and 'Dec. 2013 (GRCh38/hg38)' from the first row of drop down menus.
  • Select 'Genes and Gene Predictions' and 'GENCODE v24' from the second row of drop down menus.
  • Select ‘knownGene’ from the third row of drop down menus. Select ‘genome’ from the fourth row.
  • Select 'GTF - gene transfer format' for output format and enter 'hg38_UCSC_Genes.gtf' for output file.
  • Hit the 'get output' button and save the file. Make note of its location.

Important note on chromosome naming conventions

In order for your RNA-seq analysis to work, the chromosome names in your .gtf file must match those in your reference genome (i.e. your reference genome fasta file). Unfortunately, Ensembl, NCBI, and UCSC can not agree on how to name the chromosomes in many species, so this problem may come up often. You can avoid this by getting a complete reference genome and gene annotation package from the same source (e.g., Ensembl) to maintain consistency.

Important note on reference genome builds

Your annotations must correspond to the same reference genome build as your reference genome fasta file. e.g., both correspond to UCSC human build 'hg38', NCBI human build 'GRCh38', etc. Even if both your reference genome and annotations are from UCSC or Ensembl they could still correspond to different versions of that genome. This would cause problems in any RNA-seq pipeline.

Indexing

Within your ‘ref’ directory, there are two files ‘Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa’ and ‘Homo_sapiens.GRCh38.86.gtf’.

All aligners require the reference sequence to be indexed -- this is a process that creates additional files that describe the reference file, such as where each chromosome is located in the file. Knowing this information speeds up alignment.

In this step, user supplied the reference genome sequences (FASTA file) and annotations (GTF file), from which aligner generate genome indexes that are utilized in the mapping step. The genome indexes are saved to disk and need only be generated once for each genome/annotation combination.

At each aligner’s website, a limited collection of genome indexes is available. For example, STAR (http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgeno...), HISAT2 (https://ccb.jhu.edu/software/hisat2/index.shtml), however, it is strongly recommended that users generate their own genome indexes with most up-to-date assemblies and annotations.

Consult aligner manual to generate genome indexes. I provides an example to generate HISAT2 indexes for Homo_sapiens.GRCh38.86 genome and annotation.

Extract splice sites and exons from GRCh38.86 annotation file. Run the following command on HTC login node.

module load HISAT2/2.0.5
module load compiler/python/2.7.10-Anaconda-2.3.0
 
extract_splice_sites.py Homo_sapiens.GRCh38.86.gtf >hg38.ss
extract_exons.py Homo_sapiens.GRCh38.86.gtf >hg38.exon

To build HISAT2 index, submit the following slurm job to HTC cluster.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
 
module load HISAT2/2.0.5
 
hisat2-build -p 16 --ss hg38.ss --exon hg38.exon Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa hg38_tran

WARNING: In order to index the entire human genome, HISAT2 requires at least 160GB of RAM, and several hours.

Now, you should find eight files named hg38_tran.1.ht2 to hg38_tran.8.ht2 with the ref directory. These are the HISAT2 index files.

STAR is a splice-aware aligner that will produce gapped alignments to handle reads that span exon-intron junctions. Thus it is only appropriate when an annotated reference genome is available. To build STAR index, submit the following slurm job to HTC cluster.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
 
module load STAR/2.5.2b
 
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir ./STAR_Index/ --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.86.gtf

RNA-seq Data

The test data consists of RNA-seq samples for primary human airway smooth muscle cell lines. In the experiment, four primary human airway smooth muscle cell lines that were treated with dexamethasone, albuterol, dexamethasone+albuterol or were left untreated. The Illumina TruSeq assay was used to prepare the samples. The libraries were sequenced with an Illumina Hi-Seq 2000 instrument. The sequencing experiment was paired-end. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778. The test data consists of 4 untreated samples (SRR1039508, SRR1039512, SRR1039516, SRR1039520), and 4 samples treated with dexamethasone (SRR1039509, SRR1039513, SRR1039517, SRR1039521).

You can download the FASTQ files from the Sequence Read Archive, the identifiers would be SRA run IDs, e.g. SRR1039508. You should have two files for a paired-end experiment for each ID, SRR1039508_1.fastq.gz and SRR1039508_2.fastq.gz, which give the first and second read for the paired-end fragments.

To extract fastq files from the GEO data, use NCBI's program toolkit. It's already installed on HTC cluster and can be loaded with module load SRAToolkit/2.8.2. The fastq-dump command produces two compressed fastq files for each dataset, for the forward and reverese reads. In this example, this is done with a SLURM job.

#!/bin/bash
#
#SBATCH --job-name=fastq-dump
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=fastq_dump.out
 
module load SRAToolkit/2.8.2
 
fastq-dump --split-3 --gzip SRR1039508
fastq-dump --split-3 --gzip SRR1039509
fastq-dump --split-3 --gzip SRR1039512
fastq-dump --split-3 --gzip SRR1039513
fastq-dump --split-3 --gzip SRR1039516
fastq-dump --split-3 --gzip SRR1039517
fastq-dump --split-3 --gzip SRR1039520
fastq-dump --split-3 --gzip SRR1039521

Check fastq-dump options at https://ncbi.github.io/sra-tools/fastq-dump.html. The recommended options to extract paired-end data is fastq-dump --split-files –I --gzip SRR1039513. If you run this command, you would receive two files SRR1039513_1.fastq.gz and SRR1039513_2.fastq.gz with significant different sizes. The paired-end files are not synchronized, which many processing programs are expecting. The above commands use fastq-dump with the legacy --split-3 command, which will output 1,2, or 3 files: 1 file means the data is not paired. 2 files mean paired data with no low quality reads or reads shorter than 20bp. 3 files means paired data, but asymmetric quality or trimming. In the case of 3 file output, most people ignore <file>.fastq (e.g., SRR1039513.fastq.gz).

PreAlignment QC

To get a sense of your data quality before alignment, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. Video tutorial of FastQC is here: http://www.youtube.com/watch?v=bz93ReOv87Y

First, set up some directories for output:

mkdir fastqc_pretrim 

Submit the following batch job to run FastQC on your fastq files

#!/bin/bash
#
#SBATCH --job-name=fastqc
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=fastqc.out
 
module load FastQC/0.11.4
 
fastqc -o ./fastqc_pretrim/ *_1.fastq.gz
fastqc -o ./fastqc_pretrim/ *_2.fastq.gz

Check on the *_fastqc.html files to view the FastQC report.

Trimming Adaptor

FastQC gave you warning when overrepresented sequences were in first 200,000 sequences. The overrepresented sequences can provide you the information which adaptors are used. The “Overrepresented sequences” section of SRR1039508_1_fastqc.html shows that “TruSeq Adapter, Index 2” are overrepresented.

Here is the documentation of Illumina Adaptor Sequences: https://support.illumina.com/content/dam/illumina-support/documents/docu...

Here is a documentation to De-Mystify Illumina TruSeq DNA Adapters:

http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Il...

Cutadapt is tool specifically designed to remove adapters from NGS data.  To see the full list of help and options, read this document: http://cutadapt.readthedocs.io/en/stable/guide.html

For paired-end data containing Illumina TruSeq adapters, the command line looks as follows:

cutadapt \
            -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
            -o trimmed.1.fastq.gz -p trimmed.2.fastq.gz \
            reads.1.fastq.gz reads.2.fastq.gz

To run on HTC cluster, submit the following job.

#!/bin/bash
#
#SBATCH --job-name=cutadapt
#SBATCH -n 1
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=cutadapt.out
 
module load cutadapt/1.12
 
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o SRR1039508_1.trimmed.fastq.gz -p SRR1039508_2.trimmed.fastq.gz SRR1039508_1.fastq.gz SRR1039508_2.fastq.gz

The output of this step will be trimmed FASTQ files for each data set.

Comments

http://www.ecseq.com/support/ngs/trimming-adapter-sequences-is-it-necessary

Adapter contamination will lead to NGS alignment errors and an increased number of unaligned reads, since the adapter sequences are synthetic and do not occur in the genomic sequence. There are applications (e.g. small RNA sequencing) where adapter trimming is highly necessary. With a fragment size of around 24 nucleotides, one will definitely sequence into the 3' adapter. But there are also applications (transcriptome sequencing, whole genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider to skip the adapter removal and thereby save time and efforts.

Alignment

Alignment is the process of taking the short reads from the FASTQ and places them in the correct location in the genome. However, there exist a variety of aligners that will accomplish this goal. Two major types of aligners are non-splice aware aligner (e.g., BWA, Bowtie, etc) and splice aware aligner (e.g., STAR, HISAT2, etc). To align human RNA-seq data to human genome, splice aware aligner is required. All of the aligners are capable of taking advantage of multi-threading: splitting up the work of alignment across multiple cores. Your instance has 16 available cores, so you will instruct the aligner to split up the alignment into 16 pieces.

HISAT2 alignment

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome). HISAT2 uses a graph-based alignment and has succeeded HISAT and TOPHAT2. The output of this step will be a SAM/BAM file for each data set.

You can perform alignments with HISAT2 to the genome and transcriptome. Refer to HISAT2 manual for a more detailed explanation: https://ccb.jhu.edu/software/hisat2/manual.shtml

First, begin by making the appropriate output directory for our alignment results.

mkdir alignment

Submit the following job to HTC cluster to align the RNA-seq data to the human genome and transcriptome.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J hisat2_human_samples
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=hisat2.out
 
module load HISAT2/2.0.5
 
echo SRR1039508
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039508_1.fastq.gz -2 SRR1039508_2.fastq.gz -S ./alignment/SRR1039508.sam
 
echo SRR1039509
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039509_1.fastq.gz -2 SRR1039509_2.fastq.gz -S ./alignment/SRR1039509.sam
 
echo SRR1039512
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039512_1.fastq.gz -2 SRR1039512_2.fastq.gz -S ./alignment/SRR1039512.sam
 
echo SRR1039513
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039513_1.fastq.gz -2 SRR1039513_2.fastq.gz -S ./alignment/SRR1039513.sam
 
echo SRR1039516
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039516_1.fastq.gz -2 SRR1039516_2.fastq.gz -S ./alignment/SRR1039516.sam
 
echo SRR1039517
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039517_1.fastq.gz -2 SRR1039517_2.fastq.gz -S ./alignment/SRR1039517.sam
 
echo SRR1039520
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039520_1.fastq.gz -2 SRR1039520_2.fastq.gz -S ./alignment/SRR1039520.sam
 
echo SRR1039521
hisat2 -p 16 --dta -x ref/hg38_tran -1 SRR1039521_1.fastq.gz -2 SRR1039521_2.fastq.gz -S ./alignment/SRR1039521.sam

Options specified as follows:

  • '-p 16' tells HISAT2 to use 16 CPUs for bowtie alignments.
  • '--rna-strandedness RF' specifies strandedness of RNAseq library. You can specify RF if the TruSeq strand-specific library was used to make the sequences. See here for options. https://github.com/griffithlab/rnaseq_tutorial/blob/master/manuscript/su...
  • '--dta' Reports alignments tailored for transcript assemblers.
  • '-x /path/to/hisat2/index' The HISAT2 index filename prefix (minus the trailing .X.ht2) built earlier including splice sites and exons.
  • '-1 /path/to/read1.fastq.gz' The read 1 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
  • '-2 /path/to/read2.fastq.gz' The read 2 FASTQ file, optionally gzip(.gz) or bzip2(.bz2) compressed.
  • '-S /path/to/output.sam' The output SAM format text file of alignments.

HISAT2 generates a summary of the alignments, which is in file hisat2.out (specified as #SBATCH --output=hisat2.out). Notice the number of total reads, reads aligned and various metrics regarding how the reads aligned to the reference.

STAR alignment

To use STAR to map reads to the genome, create a directory mkdir STAR_alignment and submit the following jobs.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J STAR_human_samples
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH --mem=230g # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH --output=STAR.out
 
module load STAR/2.5.2b
 
echo SRR1039508
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039508_1.fastq.gz SRR1039508_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039508
 
echo SRR1039509
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039509_1.fastq.gz SRR1039509_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039509
 
echo SRR1039512
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039512_1.fastq.gz SRR1039512_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039512
 
echo SRR1039513
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039513_1.fastq.gz SRR1039513_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039513
 
echo SRR1039516
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039516_1.fastq.gz SRR1039516_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039516
 
echo SRR1039517
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039517_1.fastq.gz SRR1039517_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039517
 
echo SRR1039520
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039520_1.fastq.gz SRR1039520_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039520
 
echo SRR1039521
STAR --genomeDir ref/STAR_Index/  --runThreadN 16 --readFilesIn SRR1039521_1.fastq.gz SRR1039521_2.fastq.gz --readFilesCommand zcat --outFileNamePrefix ./STAR_alignment/SRR1039521

Note: If you are aligning several experiments in a row, add the option "--genomeLoad LoadAndKeep" and STAR will load the genome index into shared memory so that it can use it for the next time you run the program.

SAM to BAM Conversion

This step is to convert HISAT2 sam files to bam files and sort by aligned position. Submit the following batch job.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J samtools_sort_human_samples
#SBATCH --output=samtools.out
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
 
module load samtools/1.3.1-gcc5.2.0
 
samtools sort -@ 16 -o SRR1039508.bam SRR1039508.sam
samtools sort -@ 16 -o SRR1039509.bam SRR1039509.sam
samtools sort -@ 16 -o SRR1039512.bam SRR1039512.sam
samtools sort -@ 16 -o SRR1039513.bam SRR1039513.sam
samtools sort -@ 16 -o SRR1039516.bam SRR1039516.sam
samtools sort -@ 16 -o SRR1039517.bam SRR1039517.sam
samtools sort -@ 16 -o SRR1039520.bam SRR1039520.sam
samtools sort -@ 16 -o SRR1039521.bam SRR1039521.sam

You can now remove these .sam files to save storage space.

Post-Alignment Quality check

You can use FastQC to perform basic QC of your BAM file. This will give you output very similar to when you ran FastQC on your fastq files.

Some tools, such as IGV, require bam file to be indexed. We can use samtools index for this purpose. Submit the following job to index all bam files with the current folder.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J samtools_index_human_samples
#SBATCH --output=samtools_index.out
 
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
 
module load samtools/1.3.1-gcc5.2.0
 
find *.bam -exec echo samtools index {} \; | bash

The output is a .bam.bai file for each .bam file.

Assemble and quantify expressed genes and transcripts.

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. Refer to the Stringtie manual for a more detailed explanation: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual

Stringtie assemble and quantify expressed genes and transcripts in three steps. Refer to HISAT2 and Stringtie publication (pubmed: 27560171) for the explanation of the three steps. “The alignments are passed to StringTie for transcript assembly. StringTie assembles the genes for each data set separately, estimating the expression levels of each gene and each isoform as it assembles them. After assembling each sample, the full set of assemblies is passed to StringTie's merge function, which merges together all the gene structures found in any of the samples. This step is required because transcripts in some of the samples might be only partially covered by reads, and as a consequence only partial versions of them will be assembled in the initial StringTie run. The merge step creates a set of transcripts that is consistent across all samples, so that the transcripts can be compared in subsequent steps. The merged transcripts are then fed back to StringTie one more time so that it can re-estimate the transcript abundances using the merged structures. The re-estimation uses the same algorithm as the original assembly, but reads may need to be re-allocated for transcripts whose structures were altered by the merging step.”

Submit the following job to assemble transcripts for each sample.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_human_samples
#SBATCH --output=stringtie.out
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load StringTie/1.3.2c
 
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039508.gtf -l SRR1039508 SRR1039508.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039509.gtf -l SRR1039509 SRR1039509.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039512.gtf -l SRR1039512 SRR1039512.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039513.gtf -l SRR1039513 SRR1039513.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039516.gtf -l SRR1039516 SRR1039516.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039517.gtf -l SRR1039517 SRR1039517.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039520.gtf -l SRR1039520 SRR1039520.bam
stringtie -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o SRR1039521.gtf -l SRR1039521 SRR1039521.bam

Generate a text file named mergelist.txt with the following contents.

SRR1039508.gtf
SRR1039509.gtf
SRR1039512.gtf
SRR1039513.gtf
SRR1039516.gtf
SRR1039517.gtf
SRR1039520.gtf
SRR1039521.gtf

Submit the following job to merge transcripts from all samples.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_merge_human_samples
#SBATCH --output=stringtie_merge.out
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load StringTie/1.3.2c
 
stringtie --merge -p 4 -G ../ref/Homo_sapiens.GRCh38.86.gtf -o stringtie_merged.gtf mergelist.txt

In the third step, submit the following job to estimate transcript abundances and create table counts for post analysis.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J stringtie_abundance_human_samples
#SBATCH --output=stringtie_abundance.out
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load StringTie/1.3.2c
 
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039508/SRR1039508.gtf -A ballgown/SRR1039508/gene_abundances.tsv SRR1039508.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039509/SRR1039509.gtf -A ballgown/SRR1039509/gene_abundances.tsv SRR1039509.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039512/SRR1039512.gtf -A ballgown/SRR1039512/gene_abundances.tsv SRR1039512.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039513/SRR1039513.gtf -A ballgown/SRR1039513/gene_abundances.tsv SRR1039513.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039516/SRR1039516.gtf -A ballgown/SRR1039516/gene_abundances.tsv SRR1039516.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039517/SRR1039517.gtf -A ballgown/SRR1039517/gene_abundances.tsv SRR1039517.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039520/SRR1039520.gtf -A ballgown/SRR1039520/gene_abundances.tsv SRR1039520.bam
stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/SRR1039521/SRR1039521.gtf -A ballgown/SRR1039521/gene_abundances.tsv SRR1039521.bam

Extra options with the above commands are specified below:

  • '-p 4' tells Stringtie to use 4 CPUs. Stringtie jobs are not computational intensive. Requesting less CPUs and walltime can allow the job to be run through backfill
  • '-G ' reference annotation to use for guiding the assembly process (GTF/GFF3)
  • '-e' only estimate the abundance of given reference transcripts (requires -G)
  • '-B' enable output of Ballgown table files which will be created in the same directory as the output GTF (requires -G, -o recommended)
  • '-o' output path/file name for the assembled transcripts GTF (default: stdout)
  • '-A' output path/file name for gene abundance estimates

The primary output of StringTie is a Gene Transfer Format (GTF) file that contains details of the transcripts that StringTie assembles from RNA-Seq data. Transcript abundance is reported in FPKM and TPM. FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases). TPM: Transcripts per million. This is the number of transcripts from this particular gene normalized first by gene length, and then by sequencing depth (in millions) in the sample.

If StringTie is run with the -A <gene_abund.tab> option, it returns a file containing gene abundances. For details on the Stringtie output files refer to Stringtie manual

We can use these files to perform various comparisons of expression estimation tools (e.g. stringtie, kallisto, raw counts) and metrics (e.g. FPKM vs TPM).