RNA-Seq | Transcriptomics

RNA-Seq Analysis: From Raw Reads to Differential Expression

A practical walkthrough of the RNA-Seq analysis pipeline, from quality control and alignment to differential gene expression using DESeq2.

Somenath Dutta | February 10, 2025 | 12 min read

RNA sequencing (RNA-Seq) has become the gold standard for measuring gene expression. Unlike older microarray technologies, RNA-Seq does not depend on predefined probes, so it provides a genome-wide view of the transcriptome at single-nucleotide resolution. This guide walks through the complete analysis pipeline.

Quality Control with FastQC

Every RNA-Seq analysis begins with quality control. FastQC generates a comprehensive report on your raw FASTQ files, checking for adapter contamination, per-base quality scores, GC content distribution, and sequence duplication levels. If quality issues are detected, tools like Trimmomatic or fastp can trim low-quality bases and remove adapter sequences.

Never skip this step. Poor-quality reads introduce noise into downstream analyses and can lead to false conclusions about differential expression.
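
To make "per-base quality scores" concrete, here is a minimal Python sketch that decodes the quality line of each FASTQ record and flags reads with a low mean Phred score. It assumes Phred+33 encoding and the usual four-lines-per-record layout; the function names and the cutoff of 20 are illustrative choices, and this is in no way a substitute for FastQC or fastp.

```python
from statistics import mean

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def low_quality_reads(lines, min_mean_q=20):
    """Return IDs of reads whose mean Phred score is below min_mean_q."""
    return [rid for rid, _, qual in parse_fastq(lines)
            if mean(phred_scores(qual)) < min_mean_q]

# Tiny in-memory example; a real file would be read with open(...).
example = [
    "@read1", "ACGT", "+", "IIII",  # 'I' encodes Phred 40
    "@read2", "ACGT", "+", "####",  # '#' encodes Phred 2
]
flagged = low_quality_reads(example)
```

In this toy input only read2 falls below the cutoff. Dedicated trimmers work per base with sliding windows rather than discarding whole reads by their mean.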

Read Alignment

Once reads are cleaned, they need to be aligned to a reference genome. HISAT2 and STAR are the two most popular splice-aware aligners for RNA-Seq data. They handle the challenge of reads spanning exon-exon junctions, which is unique to transcriptomic data.

The output is a SAM or BAM file recording where each read maps on the genome. Tools like SAMtools sort, index, and run quality checks on these alignment files.
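
In SAM/BAM records, a read that spans an exon-exon junction carries an `N` operation in its CIGAR string, marking the skipped intron on the reference. The sketch below extracts those intron intervals from a mapping position and CIGAR string. The helper names are made up for illustration, and real pipelines would use SAMtools or a library such as pysam instead.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive, one per N op.

    pos is the 1-based leftmost mapping position (the SAM POS field).
    """
    junctions = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions

# A 100 bp read: 50 bases in one exon, a 1000 bp intron, 50 bases in the next exon.
junctions = splice_junctions(100, "50M1000N50M")
```

An unspliced read such as `100M` yields an empty list, which is why aligners built for genomic DNA have no need for this logic.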

Quantification

After alignment, we need to count how many reads map to each gene. featureCounts from the Subread package is one of the most widely used tools for this purpose. It takes the alignment (BAM) files and a gene annotation file (GTF) and produces a count matrix: a table with genes as rows and samples as columns.
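
The idea behind a count matrix can be sketched in a few lines of Python. This toy version assigns each aligned read to the single gene interval it overlaps and tallies counts per sample. The gene coordinates, sample names, and function names are all hypothetical, and the sketch ignores strand, exon structure, and multi-mapping reads, which real counters handle.

```python
# Hypothetical gene models: gene -> (chrom, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def assign_gene(chrom, start, end, genes=GENES):
    """Return the one gene a read overlaps, or None if it hits zero or several."""
    hits = [g for g, (c, s, e) in genes.items()
            if c == chrom and start <= e and end >= s]
    return hits[0] if len(hits) == 1 else None

def count_matrix(samples, genes=GENES):
    """Build {gene: [count per sample]} from {sample: [(chrom, start, end), ...]}."""
    names = list(samples)
    counts = {g: [0] * len(names) for g in genes}
    for j, name in enumerate(names):
        for chrom, start, end in samples[name]:
            gene = assign_gene(chrom, start, end, genes)
            if gene is not None:  # unassigned or ambiguous reads are skipped
                counts[gene][j] += 1
    return names, counts

# Made-up read alignments for two samples.
samples = {
    "ctrl": [("chr1", 120, 170), ("chr1", 450, 520),
             ("chr1", 900, 950), ("chr1", 600, 650)],
    "treat": [("chr1", 850, 900), ("chr1", 1000, 1050), ("chr1", 130, 180)],
}
names, counts = count_matrix(samples)
```

Each key of `counts` is a row of the matrix and each list position is a sample column, in the order given by `names`.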

Alternatively, alignment-free tools such as Salmon and kallisto skip full genome alignment and estimate transcript abundance directly from FASTQ files, using pseudo-alignment or lightweight mapping against the transcriptome. They are substantially faster than alignment-based workflows and increasingly popular for large-scale studies.
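
Transcript abundance from these tools is usually reported in transcripts per million (TPM). As a rough sketch of how that unit is defined, the function below converts estimated counts and effective transcript lengths into TPM. The transcript names and numbers are invented for illustration; the actual tools estimate the counts with their own statistical models.

```python
def tpm(est_counts, eff_lengths):
    """Convert estimated counts to TPM.

    est_counts and eff_lengths are dicts keyed by transcript ID.
    Each count is divided by its effective length, then the rates are
    rescaled so that they sum to one million.
    """
    rates = {t: est_counts[t] / eff_lengths[t] for t in est_counts}
    total = sum(rates.values())
    return {t: rate / total * 1e6 for t, rate in rates.items()}

# Two hypothetical transcripts with equal counts but different lengths.
abundance = tpm({"tx1": 100, "tx2": 100}, {"tx1": 1000, "tx2": 2000})
```

With equal counts, the shorter transcript gets twice the TPM of the longer one, because the same number of reads over fewer bases implies more molecules.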

Differential Expression with DESeq2

DESeq2 is an R/Bioconductor package that identifies genes with statistically significant changes in expression between conditions. It fits a negative binomial model to account for the overdispersion typical of count data (variance larger than the mean) and uses shrinkage estimation to stabilize per-gene dispersion estimates; log2 fold changes can also be shrunk with lfcShrink().
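
Overdispersion is easy to see numerically. Under the negative binomial model the variance is mu + alpha * mu^2, where alpha is the dispersion, whereas a Poisson model would have variance equal to mu. The sketch below estimates alpha for one gene by the method of moments. This is a simplification for illustration only: DESeq2 itself does not use this raw estimator, and the replicate counts are made up and assumed to be already normalized for library size.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion alpha from Var = mu + alpha * mu^2.

    Returns 0.0 when the data are not overdispersed or the mean is zero.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max((var - mu) / mu ** 2, 0.0)

# Hypothetical normalized counts for one gene across four replicates.
replicates = [90, 110, 150, 50]
alpha = moment_dispersion(replicates)
```

Here the mean is 100 but the sample variance is well above 100, so alpha comes out positive. With only a handful of replicates such per-gene estimates are noisy, which is the motivation for sharing information across genes.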

The standard workflow involves creating a DESeqDataSet from the raw (unnormalized) count matrix and a sample table, running the DESeq() function, and extracting results with the results() function. Genes with an adjusted p-value below 0.05 and an absolute log2 fold change above 1 (at least a two-fold change in either direction) are typically considered differentially expressed.
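
The filtering step can be sketched in Python, outside of R. The code below applies the Benjamini-Hochberg adjustment (the default method behind DESeq2's adjusted p-values) to raw p-values and then keeps genes passing both cutoffs, using the absolute log2 fold change so that down-regulated genes are retained. The gene names and numbers are invented, and the sketch leaves out DESeq2's independent filtering.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def call_de(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Select genes from {gene: (log2_fold_change, raw_pvalue)}."""
    genes = list(results)
    padj = benjamini_hochberg([results[g][1] for g in genes])
    return [g for g, q in zip(genes, padj)
            if q < padj_cutoff and abs(results[g][0]) > lfc_cutoff]

# Hypothetical per-gene results: (log2 fold change, raw p-value).
results = {
    "g1": (2.5, 0.001),
    "g2": (-1.8, 0.01),
    "g3": (0.4, 0.02),
    "g4": (1.2, 0.5),
}
de_genes = call_de(results)
```

In this example g1 and g2 are selected, g3 fails the fold-change cutoff despite a small p-value, and g4 fails the significance cutoff despite a large fold change.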

Visualization and Interpretation

Volcano plots, MA plots, and heatmaps are the standard visualizations for differential expression results. Gene Ontology (GO) enrichment analysis and pathway analysis using tools like clusterProfiler help interpret the biological meaning behind the list of differentially expressed genes.
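
Over-representation analysis of GO terms typically rests on a hypergeometric test: given N background genes of which K carry a GO term, how surprising is it to see at least k annotated genes among n differentially expressed ones? The following is a minimal sketch of that calculation. The function name and the example numbers are illustrative, and tools like clusterProfiler add multiple-testing correction and annotation handling on top.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: background genes, K: background genes annotated with the term,
    n: differentially expressed genes, k: DE genes annotated with the term.
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy numbers: 10 background genes, 5 annotated, 2 DE genes, both annotated.
p = enrichment_pvalue(10, 5, 2, 2)
```

A small p-value suggests the term appears among the differentially expressed genes more often than chance alone would explain. With many terms tested, the p-values need adjusting just like the per-gene ones.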

