# Load the required packages
library(GEOquery)
library(dplyr)
library(ggplot2)
Class 2
RNA-Seq mini project
1. Background
The data for this hands-on session comes from a recently published RNA-seq experiment profiling multiple immune cell types from healthy (control) and lupus (disease) patient blood (Panwar et al. 2021).
Lupus (more formally known as systemic lupus erythematosus, or SLE for short) is a chronic autoimmune disease in which the immune system attacks the body’s own tissues and organs (Figure 1).
The exact cause of lupus is unknown, but is believed to be a combination of genetic, environmental, and hormonal factors leading to a heterogeneous aberrant autoimmune response (Kaul et al. 2016).
As the authors of the current study state “A major obstacle in finding targeted therapies for SLE is its remarkable heterogeneity in clinical manifestations as well as the involvement of distinct cell types” (Panwar et al. 2021). Their work aims to discover molecular signatures of SLE to help inform future therapies.
Outline
In this class session we will:
- Open a new RStudio Project and Quarto document for today’s class;
- Explore the gene expression data available for this study on GEO;
- Use base R, GEOquery, dplyr and ggplot2 packages to answer specific questions about the experiments;
- Perform a detailed differential gene expression analysis with the DESeq2 package.
- Render a reproducible PDF report of your work with answers to all questions below.
2. Working with GEO
The Gene Expression Omnibus (GEO) from NCBI serves as the main public repository for a wide range of high-throughput gene expression data.
The Gene Expression Omnibus (or GEO for short) is NCBIs main repository for high-throughput gene expression data, such as micro array and next-generation sequencing (NGS) data. The data in GEO is organized into three types of records: Sample, Series, and Platform records.
- Sample records contain information about a single biological sample, including the experimental conditions, the type of platform used to generate the data, and the raw data files. Note their GSM accession codes in GEO.
- Series records group together related samples that were generated using similar experimental conditions or belong to the same study. Series records also contain additional metadata about the study, such as the experimental design, the hypothesis being tested, and the publications related to the study. Note their GSE accession codes in GEO.
- Platform records describe the platform or technology used to generate the data, including information about the probes, arrays, or sequencing methods used. Note their GPL accession codes in GEO.
All records in GEO are assigned a unique accession number that can be used to access and download the data. The data in GEO can be searched and accessed through the GEO website, or programmatically using the NCBI Entrez system or R packages such as GEOquery that we will use further below.
From publication to count data
Examining the PubMed entry (or full text) of the Panwar et al. (2021) paper leads us to the GEO DataSets with SeriesAccession: GSE149050 under the Related information section.
Click through from PubMed to visit this GEO entry: GSE149050 and scroll down to find the RawCounts data.
Download the GSE149050_Bulk_Human_RawCounts.txt.gz file and move it to your RStudio project directory.
While this is downloading determine how many samples (GSM entries) are associates with this GEO series entry (GSE). Click on one of these entries to see how it describe the actual experimental data we will examine below.
While our count data downloads we can use the GEOquery package form BioConductor to help organize and make sense of all these samples.
Using the GEOquery package
If necessary use the BiocManager::install("GEOquery")
function call to install the package. We will also use dplyr and ggplot2 packages to help explore this dataset further.
The main getGEO()
function from GEOquery will return available metadata from this study and parse it into an R list object.
# Complete the code with the GEO ID
<- getGEO(___)
gse gse
$GSE149050_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 288 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM4489145 GSM4489146 ... GSM4489432 (288 total)
varLabels: title geo_accession ... visit number:ch1 (64 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
pubMedIds: 33674349
Annotation: GPL16791
If you are on a particularly slow WiFi connection and the getGEO()
function gives a connection timeout error then we can download the matrix file from NCBI and read it directly:
# Download "Series Matrix File" from the web browser or in the Terminal
#wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE149nnn/GSE149050/matrix/GSE149050_series_matrix.txt.gz
# then in R
<- getGEO(filename="GSE149050_series_matrix.txt.gz") gse
As essentially all information in the GEO record is reflected in the resulting list data structure we will need to do a little work to focus on just the experimental phenotypes data we want.
We will extract information on experimental phenotypes with the phenoData()
function and use pData()
function to return this as a more conventional data.frame
with samples as rows and variables as columns.
# Our gse list has only one entry
<- pData(phenoData(gse[[1]]))
metadata dim(metadata)
[1] 288 64
See how many “characteristics” are in this dataset
grep("characteristics", colnames(metadata), value=TRUE)
[1] "characteristics_ch1" "characteristics_ch1.1" "characteristics_ch1.2"
[4] "characteristics_ch1.3" "characteristics_ch1.4" "characteristics_ch1.5"
[7] "characteristics_ch1.6" "characteristics_ch1.7" "characteristics_ch1.8"
[10] "characteristics_ch1.9" "characteristics_ch1.10"
Let’s look at a few of these. First note that the underscore “ch1” and “ch1.2” tell us the disease state (healthy or lupus) and cell type analyzed. For example, to find out how many “healthy control” samples are in the dataset we can use:
table(metadata$characteristics_ch1)[1]
disease state: healthy control
85
Or to get an “ethnicity” by “gender” breakdown:
table(metadata$characteristics_ch1.10, metadata$characteristics_ch1.7)
gender: Female gender: Male
ethnicity: Hispanic 24 0
ethnicity: Hispanic/Latino 53 6
ethnicity: Non Hispanic 37 0
ethnicity: Non-Hispanic/Latino 137 1
ethnicity: Not available 23 0
ethnicity: not listed 1 0
ethnicity: Pacific Islander 6 0
Point for discussion: Why do you think this study features more Female than Male samples? What do you know about the prevalence of lupus? ChatGPT can help you here ;-)
3. Select your favorate immune cell type
You have by now found that this study targeted multiple major immune cell types. You should now select YOUR favorite immune cell type and use it for future sections of this lab.
For demonstration purposes I will select my favorite classical monocytes (cMo). They are the most abundant sub-type of monocytes in the blood and changes in their numbers and/or function has been observed in various inflammatory and autoimmune diseases, including lupus, rheumatoid arthritis, and atherosclerosis. Therefore, the study of classical monocytes and their role in immune regulation is an important area of research.
Here I will select all the "cell type: cMo"
samples for ""visit number: 1"
with the help of dplyr filter()
function:
<- filter(metadata, characteristics_ch1.2 == "cell type: cMo" &
metadata.cmo .6 == "visit number: 1")
characteristics_ch1head(metadata.cmo[,1:3])
title geo_accession status
GSM4489314 170_S3004a_IFNpos_cMo GSM4489314 Public on Feb 01 2021
GSM4489318 174_S1055a_IFNpos_cMo GSM4489318 Public on Feb 01 2021
GSM4489321 177_S1082a_IFNpos_cMo GSM4489321 Public on Feb 01 2021
GSM4489322 178_S3008a_IFNpos_cMo GSM4489322 Public on Feb 01 2021
GSM4489323 179_S3011a_IFNpos_cMo GSM4489323 Public on Feb 01 2021
GSM4489325 181_S1065a_IFNpos_cMo GSM4489325 Public on Feb 01 2021
Book-keeping: Focus on columns of interest
Now that we have made some initial explorations of the dataset we have some book-keeping and renaming to get things better organized for further analysis.
We will use the dplyr select()
and mutate()
functions extensively here:
Select the columns of interest.
# TO DO: Build up this example by piping to head() after each line
<- metadata.cmo %>%
metadata.subset select(title,
disease_state = characteristics_ch1,
ifn_status = characteristics_ch1.3,
patient_id = characteristics_ch1.4) %>%
mutate(disease_state = gsub("disease state: ","", disease_state)) %>%
mutate(ifn_status = gsub("ifn status: ","", ifn_status)) %>%
mutate(patient_id = gsub("patientuid: ","", patient_id)) %>%
mutate(state = recode(disease_state,
"healthy control" = "control",
"systemic lupus erythematosus (SLE)" = "lupus") )
table(metadata.subset$state)
control lupus
24 64
4. Read count data
Finally, we are ready to read in the count data for all samples that we downloaded from GEO in Section 2 above. Note that the file is compressed with gzip
and you may need to “unzip” it if your browser did not already do this. You can use the bash/UNIX command gunzip
or gzip -d
if you have a mac or PC with git bash setup. Ask Barry if you are not sure about this.
<- read.delim("GSE149050_Bulk_Human_RawCounts.txt",
counts.all check.names=FALSE, row.names = 1)
head(counts.all[,1:3])
001_L0038_HC_T 002_L0088fresh_HC_T 003_L0140_HC_T
5S_rRNA 2 0 2
5_8S_rRNA 0 0 0
7SK 1 1 2
A1BG 53 56 105
A1BG-AS1 7 24 46
A1CF 0 2 2
Note that we are setting the row.names
as Gene IDs here and making sure the column names are not prepended with an X
via check.names
. Try running without these arguments to see the difference in output format.
Select only the samples we want
Select only the columns (samples) that correspond to the cell type we want to examine. We have these in metadata.subset$title
<- counts.all %>% select(metadata.subset$title) counts.subset
5. Setting up for DESeq
For DESeq we need to do some book-keeping and get the data in the correct format. First double check on column correspondences between counts and metadata
all(colnames(counts.subset) == metadata.subset$title)
[1] TRUE
Finally, DESeq needs our sample names as rownames of the metadata.
# Remove the title column then use it as rownames
<- metadata.subset[,-1]
colData rownames(colData) <- metadata.subset[,1]
head(colData)
disease_state ifn_status patient_id
170_S3004a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S3004
174_S1055a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S1055
177_S1082a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S1082
178_S3008a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S3008
179_S3011a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S3011
181_S1065a_IFNpos_cMo systemic lupus erythematosus (SLE) IFNpos S1065
state
170_S3004a_IFNpos_cMo lupus
174_S1055a_IFNpos_cMo lupus
177_S1082a_IFNpos_cMo lupus
178_S3008a_IFNpos_cMo lupus
179_S3011a_IFNpos_cMo lupus
181_S1065a_IFNpos_cMo lupus
Now we can setup our DESeq object with the long but appropriately named DESeqDataSetFromMatrix()
function:
library(DESeq2)
<- DESeqDataSetFromMatrix(countData = counts.subset,
dds colData = colData,
design = ~state)
Before we go any further we should remove any genes that have very low read counts. Here we will follow general practice of keeping rows that have at least 10 reads total. This will speed up our future calculations and avoid diluting our statistical power by doing needless tests on genes that we can not say much about anyway.
<- rowSums(counts(dds)) >= 10
keep.inds <- dds[keep.inds,] dds
6. Principal Component Analysis (PCA)
Before running DESeq analysis we can look how the count data samples are related to one another via our old friend Principal Component Analysis (PCA). We will follow the DESeq recommended procedure and associated functions for PCA. First calling vst()
to apply a variance stabilizing transformation (read more about this in the expandable section below) and then plotPCA()
to calculate our PCs and plot the results.
The purpose of the variance stabilizing transformation in DESeq2 is to normalize count data and stabilize the variance across the mean expression level. This is important because count data often have different variances at different levels of expression, and this can lead to false positive or false negative results in downstream analysis. The variance stabilizing transformation is a type of power transformation that aims to stabilize the variance across the mean expression level of genes.
The vst()
function in DESeq2 transforms raw count data into a log2-counts per million (logCPM) space, where the variance is approximately independent of the mean. This transformation improves the performance of statistical tests and reduces the impact of outliers on the analysis. The variance-stabilized data can then be used for exploratory data analysis, visualization, and downstream statistical analysis, such as differential expression analysis, clustering, and dimension reduction. For motivated readers this vignette has additional details.
<- vst(dds, blind = FALSE)
vsd plotPCA(vsd, intgroup = c("state", "ifn_status"))
The plotPCA()
function comes with DESeq2 and the two terms specified by intgroup
are the interesting groups for labeling the samples; they tell the function to use them to choose colors.
We can also build the PCA plot from scratch using the ggplot2 package. This is done by asking the plotPCA function to return the data used for plotting rather than building the plot.
<- plotPCA(vsd, intgroup=c("state", "ifn_status"), returnData=TRUE)
pcaData head(pcaData)
PC1 PC2 group state ifn_status
170_S3004a_IFNpos_cMo -7.799237 0.6012251 lupus:IFNpos lupus IFNpos
174_S1055a_IFNpos_cMo -15.788003 -6.1432683 lupus:IFNpos lupus IFNpos
177_S1082a_IFNpos_cMo -23.889066 -5.6467291 lupus:IFNpos lupus IFNpos
178_S3008a_IFNpos_cMo -15.407654 1.5003440 lupus:IFNpos lupus IFNpos
179_S3011a_IFNpos_cMo -9.263411 -4.6312344 lupus:IFNpos lupus IFNpos
181_S1065a_IFNpos_cMo -21.794772 -5.1795433 lupus:IFNpos lupus IFNpos
name
170_S3004a_IFNpos_cMo 170_S3004a_IFNpos_cMo
174_S1055a_IFNpos_cMo 174_S1055a_IFNpos_cMo
177_S1082a_IFNpos_cMo 177_S1082a_IFNpos_cMo
178_S3008a_IFNpos_cMo 178_S3008a_IFNpos_cMo
179_S3011a_IFNpos_cMo 179_S3011a_IFNpos_cMo
181_S1065a_IFNpos_cMo 181_S1065a_IFNpos_cMo
# Calculate percent variance per PC for the plot axis labels
<- round(100 * attr(pcaData, "percentVar")) percentVar
ggplot(pcaData) +
#aes(x = PC1, y = PC2, color = state, shape = ifn_status) +
aes(x = PC1, y = PC2, color = ifn_status, shape=state) +
geom_point(size =3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
ggtitle("PCA with VST data") +
theme_bw()
TO DO: - Q10: Question about PCA result interpenetration. Comment on the low variance captured. Lack of lupus vs control split. But there is this interesting IFN split to examine. More on this next day! We should run DESeq of lupus:IFNpos vs healthy etc…
7. Running DESeq analysis
Now let’s run the DESeq analysis pipeline on the dataset, and reassign the resulting object back to the same dds
object:
<- DESeq(dds) dds
Here, we’re running the DESeq pipeline on the dds
object and it can take some time. The DESeq()
function calls a number of other functions within the package to essentially run the entire pipeline (normalizing by library size by estimating the “size factors,” estimating dispersion for the negative binomial model, and fitting models and getting statistics for each gene for the design specified when we imported the data).
Getting results
Since we’ve got a fairly simple design (single factor, two groups, lupus vs control), we can get results out of the object simply by calling the results()
function on the DESeqDataSet that has been run through the pipeline.
The help page for ?results
and the vignette both have extensive documentation about how to pull out the results for more complicated models (multi-factor experiments, specific contrasts, interaction terms, time courses, etc.).
<- results(dds)
res head(res)
log2 fold change (MLE): state lupus vs control
Wald test p-value: state lupus vs control
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
5S_rRNA 9.614794 0.276762 0.167223 1.655050 0.0979143 0.693798
7SK 0.776761 -0.568677 0.525878 -1.081386 0.2795253 NA
A1BG 94.473196 -0.241340 0.137931 -1.749716 0.0801673 0.659390
A1BG-AS1 26.219981 0.237438 0.203450 1.167060 0.2431860 0.839246
A1CF 1.071617 -0.302191 0.371924 -0.812507 0.4165009 NA
A2M 8.057817 -0.117888 0.621198 -0.189775 0.8494851 0.984135
We can summarize some basic tallies using the summary()
function.
summary(res)
out of 33496 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 154, 0.46%
LFC < 0 (down) : 33, 0.099%
outliers [1] : 0, 0%
low counts [2] : 9863, 29%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
And with a different alpha
(adjusted p-value) threshold of 0.05:
<- results(dds, alpha=0.05)
res_p05 summary(res_p05)
out of 33496 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 93, 0.28%
LFC < 0 (down) : 19, 0.057%
outliers [1] : 0, 0%
low counts [2] : 10513, 31%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
The low numbers here are consistent with the (Panwar et al. 2021) paper. This set of “differentially expressed genes” will be interrogated further in Class 3 and Class 4. We will finish up this class session by visualizing and saving our results.
8. Visualizating and saving results
First volcano plot with no color or annotation.
plot(res$log2FoldChange, -log(res$padj))
Let’s add some color with a color vector, mycols
, based on p-value and fold change:
<- rep("gray", nrow(res))
mycols abs(res$log2FoldChange) > 2] = "blue"
mycols[ $padj > 0.05 ] = "gray"
mycols[ res
plot(res$log2FoldChange, -log(res$padj), ylim=c(0,15), col=mycols)
abline(v=c(-2,2), lty=2)
abline(h=-log(0.05), lty=2)
and make a ggplot version
<- as.data.frame(res)
results
library(ggplot2)
ggplot(results) +
aes(log2FoldChange, -log(padj)) +
geom_point(col=mycols) +
geom_vline(xintercept = c(-2,+2), linetype=2) +
geom_hline(yintercept = -log(0.05), linetype=2)
Warning: Removed 9863 rows containing missing values (`geom_point()`).
Let’s finally extract our top genes of interest for the NEXT CLASS!
<- results %>% filter(padj <= 0.05 & abs(log2FoldChange) >= 2)
top.genes head(top.genes)
baseMean log2FoldChange lfcSE stat pvalue
ADAMTS2 88.614987 3.809186 0.5933012 6.420325 1.359841e-10
CCL2 182.234755 2.295830 0.4362095 5.263136 1.416188e-07
CDS1 4.712707 4.157445 1.1161940 3.724661 1.955775e-04
IFI27 1588.569384 6.923943 0.6893704 10.043865 9.776701e-24
IFI44L 3095.596958 2.200965 0.3878805 5.674338 1.392262e-08
IFIT1B 20.503298 2.017055 0.3928991 5.133773 2.839906e-07
padj
ADAMTS2 1.076858e-06
CCL2 2.803699e-04
CDS1 4.410551e-02
IFI27 2.322651e-19
IFI44L 8.268989e-05
IFIT1B 5.189819e-04
We can ask ChatGPT and Google Bard about these genes and or do some pathway analysis. We will also save full DESeq results as a CSV file:
save(top.genes, file = "top_genes.RData")
write.csv(res, file="deseq_results.csv")
Optional
TO DO: We can compare IFNneg vs. IFNpos (lupus:IFNpos vs healthy control (or control and lupus:IFNneg)).
Setup a new metadata/colData object for IFNneg vs. IFNpos
table(colData$ifn_status)
HC IFNneg IFNpos
24 34 30
<- filter(colData, ifn_status %in% c("IFNneg", "IFNpos"))
ifn_colData #ifn_colData <- filter(colData, ifn_status %in% c("HC", "IFNpos"))
table(ifn_colData$ifn_status)
IFNneg IFNpos
34 30
# Select only IFNneg and IFNpos count columns
<- select(counts.subset, rownames(ifn_colData))
ifn_countData dim(ifn_countData)
[1] 56269 64
dim(ifn_colData)
[1] 64 4
Now we can setup and run DESeq as before:
<- DESeqDataSetFromMatrix(countData = ifn_countData,
ifn_dds colData = ifn_colData,
design = ~ifn_status)
<- DESeq(ifn_dds)
ifn_dds <- results(ifn_dds) ifn_res
and a plot
<- as.data.frame(ifn_res)
ifn_results
<- rep("gray", nrow(ifn_res))
ifn_cols abs(ifn_res$log2FoldChange) > 2] = "blue"
ifn_cols[ $padj > 0.05 ] = "gray"
ifn_cols[ ifn_res
ggplot(ifn_results) +
aes(log2FoldChange, -log(padj)) +
geom_point(col=ifn_cols) +
geom_vline(xintercept = c(-2,+2), linetype=2) +
geom_hline(yintercept = -log(0.05), linetype=2)
Warning: Removed 34189 rows containing missing values (`geom_point()`).
Let’s finally extract our top genes of interest for the NEXT CLASS!
<- ifn_results %>% filter(padj <= 0.05 & abs(log2FoldChange) >= 2)
ifn_genes head(ifn_genes)
baseMean log2FoldChange lfcSE stat pvalue
ABCA6 11.359801 -2.508709 0.7199788 -3.484420 4.932048e-04
ADAMTS5 63.814100 -2.459209 0.4663791 -5.272982 1.342244e-07
AGRN 14.421869 2.475145 0.4115442 6.014288 1.806792e-09
ANKRD22 203.431431 2.301576 0.3202911 7.185887 6.677221e-13
AP001610.5 3.764462 2.288457 0.5708389 4.008936 6.099291e-05
APOBEC3A 5954.424665 2.068756 0.1975059 10.474398 1.132563e-25
padj
ABCA6 1.196396e-02
ADAMTS5 9.008314e-06
AGRN 1.727011e-07
ANKRD22 9.100805e-11
AP001610.5 2.148619e-03
APOBEC3A 3.473194e-23
dim(ifn_genes)
[1] 88 6
Look up TNFSF13B and IL1RN for the original lupus vs control run:
<- grep("IL1RN", rownames(rowData(dds)))
i1 <- grep("TNFSF13B", rownames(rowData(dds)))
i2 c(i1,i2),] results[
baseMean log2FoldChange lfcSE stat pvalue padj
IL1RN 1092.730 0.5164452 0.13117721 3.937004 8.250512e-05 0.02450093
TNFSF13B 4717.083 0.2082546 0.08390382 2.482064 1.306240e-02 0.38936428
And for IFN-pos vs IFN-neg:
<- grep("IL1RN", rownames(rowData(ifn_dds)))
i1 <- grep("TNFSF13B", rownames(rowData(ifn_dds)))
i2 c(i1,i2),] ifn_results[
baseMean log2FoldChange lfcSE stat pvalue padj
IL1RN 1189.982 0.9300144 0.12696031 7.325237 2.384762e-13 3.397132e-11
TNFSF13B 4938.118 0.6524722 0.07520505 8.675910 4.102581e-18 8.465886e-16
Ferhat can comment and take it further in Class 3 from here.