This is a modified version of the analysis performed for “Genetic and microstructural differences in the cortical plate of gyri and sulci during gyrification in fetal sheep”, (PMID: 32609332 DOI: 10.1093/cercor/bhaa171). This analysis is for display purposes only.
Raw RNA Seq FASTQ files can be downloaded from NCBI’s GEO repository, accession GSE141233. Sheep genome (v3.1) and annotation files were obtained from the International Sheep Genomics Consortium. Sheep annotation files can also be found at Ensembl
Due to the present inability to reproduce some computationally
expensive steps and files (Steps 1 to 3), the analysis will be fed a
csv
file that was produced as output from
featureCounts
as part of the original analysis. See Step 4
for more details. Notwithstanding this limitation, the code for Steps 1
to 3 is still presented.
Additionally, the original study is here extended (for the sake of scientific curiosity) via the inclusion of simulated samples. See step 7 for more details. These samples, understandably, are not part of the original study. If you know a better way to perform these simulations (and this pipeline in general), please feel free to contact me so we can discuss this!
The pipeline uses STAR
, featureCounts
,
edgeR
and edgeRun
After loading the raw files, genome file and annotation file to the
HPC (this can be done using rsync
), the first step is to
build the genome index using STAR
.
module load STAR
STAR --runThreadN 16 \
--runMode genomeGenerate \
--genomeDir ~/RNASeqSQ/GenomeDir/ \
--genomeFastaFiles ~/RNASeqSQ/ref_genome/Sheep_genome.fna \
--sjdbGTFfile ~/RNASeqSQ/ref_genome/Sheep_genome_annotations.gff \
--sjdbOverhang 49 \
--sjdbGTFtagExonParentTranscript Parent
The second step is mapping of the FASTQ files, which can be done
using a loop command again using STAR
.
#!/bin/bash
module load STAR
fqFiles=`find $1 -name '*.gz' -type f`
for fqFile in $fqFiles;do
STAR --runThreadN 16 \
--genomeDir ~/RNASeqSQ/GenomeDir/ \
--readFilesIn $fqFile \
--readFilesCommand zcat \
--outFileNamePrefix ~/RNASeqSQ/mapped_data/$(basename $fqFile .fastq.gz)_ \
--outSAMstrandField intronMotif \
--outSAMtype BAM Unsorted \
--outSAMunmapped Within \
--outFilterIntronMotifs RemoveNoncanonical
done
After mapping, read summarisation can be performed locally in a
decent-enough laptop in R environment. This step can be done using
RSubread’s featureCounts
. All the next steps can be
performed locally in R, which saves resources on the HPC.
library(Rsubread)
aligned_files <- c("mapped_data_ensembl/1251.1G_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1251.1S_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1251.2G_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1251.2S_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1452.1G_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1452.1S_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1452.2G_ensembl_Aligned.out.bam",
"mapped_data_ensembl/1452.2S_ensembl_Aligned.out.bam")
data_new<-featureCounts(aligned_files, annot.ext="ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf", isGTFAnnotationFile=TRUE, nthreads=16)
Normally, the data is extracted from the object output by featureCounts as shown below:
data <- as.data.frame(data_new$counts)
However and due to the present inability to reproduce these
.bam
files, the raw data will be input from a
csv
file produced as output in the original analysis.
data_final <- read.csv("counts_new_raw.csv",
header = TRUE)
colnames(data_final) <- c("Ensembl_ID", "1251.1G", "1251.1S", "1251.2G", "1251.2S", "658.1G", "658.1S", "1452.1G", "1452.1S", "1452.2G", "1452.2S")
rownames(data_final) <- data_final[,1]
data_final <- data_final[,c(2:11)]
The data should give a table with the counts for each sample ID as columns and the Ensembl IDs as rownames (if not, it can be easily transformed anyway). These Ensembl IDs will be used later to pair the IDs with the gene symbols, which are more informative than the IDs.
Next, we will import the experimental design matrix prepared and create the factors required.
experiment_design <- read.table("experiment_design.txt",header = T, sep = "\t")
experiment_design
## SampleID Source_animal Sample_type
## 1 1251.1G Sheep_1 Gyrus
## 2 1251.1S Sheep_1 Sulcus
## 3 1251.2G Sheep_2 Gyrus
## 4 1251.2S Sheep_2 Sulcus
## 5 658.1G Sheep_3 Gyrus
## 6 658.1S Sheep_3 Sulcus
## 7 1452.1G Sheep_4 Gyrus
## 8 1452.1S Sheep_4 Sulcus
## 9 1452.2G Sheep_5 Gyrus
## 10 1452.2S Sheep_5 Sulcus
rownames(experiment_design) <- experiment_design$SampleID
samples_names <- as.character(experiment_design$SampleID)
sample_type <- factor(experiment_design$Sample_type)
sheep_number <- factor(experiment_design$Source_animal)
Then, we will create the DGEList object to calculate dispersion and create a MDS (PCA) plot to have an overlook of the data points.
library(edgeR)
DGEList.edgeR <- DGEList(counts=data_final, group = sample_type)
DGEList.edgeR <- calcNormFactors(DGEList.edgeR)
DGEList.edgeR <- estimateCommonDisp(DGEList.edgeR,verbose = T)
## Disp = 0.25183 , BCV = 0.5018
This outputs a Common Dispersion of ~0.25 and a Biological Coefficient of Variation (BCV) of ~0.5, which is unexpectedly high. Let’s observe the MDS for the samples.
data_norm <- calcNormFactors(DGEList.edgeR)
colours <- sheep_number
points <- rep(16,5)
plotMDS(data_norm, col=colours, labels=NULL, pch=points)
legend("bottomleft", legend=levels(sheep_number), pch=points, col=c(1,2,3,4,5), ncol=2)
After normalisation, it appears as if one of the samples from Sheep_3 (Sheep number 658.1 is an outlier. This means we will have to drop both samples from the same sheep, since we cannot have a sheep with only one sample).
We will thus repeat the above steps but trimming our data and experimental design to include only Sheep 1, 2, 4 and 5.
#Trimming data and matrix
data_final2 <- data_final[,c(1:4,7:10)]
experiment_design2 <- experiment_design[c(1:4,7:10),]
#Defining factors
samples_names2 <- as.character(experiment_design2$SampleID)
sample_type2 <- factor(experiment_design2$Sample_type)
sheep_number2 <- factor(experiment_design2$Source_animal)
DGEList.edgeR2 <- DGEList(counts=data_final2, group = sample_type2)
DGEList.edgeR2 <- calcNormFactors(DGEList.edgeR2)
DGEList.edgeR2 <- estimateCommonDisp(DGEList.edgeR2,verbose = T)
## Disp = 0.17989 , BCV = 0.4241
data_norm2 <- calcNormFactors(DGEList.edgeR2)
colours2 <- c(1,1,2,2,4,4,5,5)
points2 <- rep(16,4)
plotMDS(data_norm2, col=colours2, labels=NULL, pch=points2)
legend("topleft", legend=levels(sheep_number2), pch=points2, col=c(1,2,4,5), ncol=2)
The samples now show a better clustering pattern with samples coming
from the same sheep clustering closer to each other, as expected.
Removing the outlier improved the Disp
and BCV
parameters, but there is still a lot of variability, which probably
comes from the fact that we still haven’t filtered out the noise from
unexpressed and low-expressed genes. We will do this in the next
step.
Next, low count genes need to be cleaned out to reduce the noise and their impact on the stats. Genes that have less than 5 counts will be removed.
DGEList.edgeR3 <- DGEList(counts=data_final2, group = sample_type2)
keep <- filterByExpr(DGEList.edgeR3)
summary(keep)
## Mode FALSE TRUE
## logical 13833 13221
DGEList.edgeR3 <- DGEList.edgeR3[keep, , keep.lib.sizes=FALSE]
DGEList.edgeR3 <- calcNormFactors(DGEList.edgeR3)
Which reduces the original data frame with 27054 entries to 13221.
Over half of the table was purged! Which is understandable since the
table produced by featureCounts
represents the gene
expression counts in the samples mapped against all the
genes in the genome, whereas the samples are derived from (mainly)
neurons in the cortical plate (see details in study), which will only
express a subset of genes. Additionally, some low level counts can be
the product of sequencing error, rather than being true counts.
We’ll repeat the calculations for Dispersion and BCV, as well as looking at how the samples cluster after removing the noise.
DGEList.edgeR3 <- estimateCommonDisp(DGEList.edgeR3,verbose = T)
## Disp = 0.12955 , BCV = 0.3599
data_norm3 <- calcNormFactors(DGEList.edgeR3)
colours2 <- c(1,1,2,2,4,4,5,5)
points2 <- rep(c(16,15),4)
plotMDS(data_norm3, col=colours2, labels=NULL, pch=points2)
legend("topright", legend=levels(sheep_number2), pch=16, col=c(1,2,4,5), ncol=2)
##Plot BCV trends
DGEList.edgeR3 <- estimateDisp(DGEList.edgeR3)
## Using classic mode.
plotBCV(DGEList.edgeR3)
for (column in c(1:8)){
plotMD(cpm(DGEList.edgeR3, log=TRUE), column = column)
abline(h=0,col="red",lty=2,lwd=2)
}
The clustering is now much more like we expected it in which samples coming from the same animal are closer genetically to each other, and both the Common Dispersion and the Biological Coefficient of Variability are within expected values for the sample types we are using here (genetically unrelated animals). However, the MDS here also spells some bad news for our analysis, since there appears to be more variability between animals than between samples types from the same animal (Gyrus vs Sulcus), which is what we’re interested in. Finally, the MD plots show the samples clustering around log-fold change of 0, meaning the composition bias between libraries has been successfully removed.
First, the design matrix is input
design_matrix <- model.matrix(~ 0 + sample_type2)
colnames(design_matrix) <- levels(sample_type2)
design_matrix
## Gyrus Sulcus
## 1 1 0
## 2 0 1
## 3 1 0
## 4 0 1
## 5 1 0
## 6 0 1
## 7 1 0
## 8 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$sample_type2
## [1] "contr.treatment"
Then, we fit the model using the quasi-likelihood pipeline using
gkmQLFit
.
fit <- glmQLFit(DGEList.edgeR3, design_matrix, robust = TRUE)
Then we make the contrasts we’re interested in.
contrast = makeContrasts(Gyrus - Sulcus, levels = design_matrix)
And we apply the QL Test to determine differential expression.
GvS <- glmQLFTest(fit, contrast = contrast)
topTags(GvS)
## Coefficient: 1*Gyrus -1*Sulcus
## logFC logCPM F PValue FDR
## ENSOARG00000017745 2.2932048 0.5202402 13.262868 0.002516191 0.9999222
## ENSOARG00000020222 1.3600373 1.5786336 13.025800 0.002688116 0.9999222
## ENSOARG00000015199 0.6519343 5.4252684 12.677510 0.002965656 0.9999222
## ENSOARG00000009322 0.5833102 7.2972359 12.039736 0.003563410 0.9999222
## ENSOARG00000010290 -1.8558338 0.9006594 11.858640 0.003757520 0.9999222
## ENSOARG00000012748 -1.8345322 1.5587280 11.523343 0.004149779 0.9999222
## ENSOARG00000009537 -2.3714088 0.4407101 11.091470 0.004726114 0.9999222
## ENSOARG00000016386 2.8504759 -0.8524899 10.297994 0.006041694 0.9999222
## ENSOARG00000017663 1.0385993 2.4124548 9.503951 0.007796455 0.9999222
## ENSOARG00000011570 1.6926856 0.7041272 9.484659 0.007845858 0.9999222
summary(decideTests(GvS))
## 1*Gyrus -1*Sulcus
## Down 0
## NotSig 13221
## Up 0
However, as expected and predicted from the MDS, edgeR
was unable to find differentially expressed genes between
Gyrus
and Sulcus
with the data provided.
An alternative to edgeR
is edgeRun
, that
offers an unconditional exact test that is better suited for conditions
when we have low number of biological replicates (which is the case for
this study). The test will be run with 50,000 iterations which is the
minimum recommended by the documentation, although this is
computationally expensive. Please note that this package is no
longer available at Bioconductor and needs to be installed from CRAN’s
archive.
library(edgeRun)
GvS_edgeRun <-UCexactTest(DGEList.edgeR3, upper=50000)
topTags(GvS_edgeRun)
## Comparison of groups: Sulcus-Gyrus
## logFC logCPM PValue FDR
## ENSOARG00000017745 -2.291866 0.5202402 0.0001431588 0.2444737
## ENSOARG00000016386 -2.847723 -0.8524899 0.0001741128 0.2444737
## ENSOARG00000004431 -2.905343 -0.4757439 0.0003431994 0.2444737
## ENSOARG00000020222 -1.358654 1.5786336 0.0003670460 0.2444737
## ENSOARG00000009537 2.369241 0.4407101 0.0003938497 0.2444737
## ENSOARG00000010290 1.854501 0.9006594 0.0003969009 0.2444737
## ENSOARG00000012670 -2.261659 -0.2142238 0.0005592318 0.2444737
## ENSOARG00000012748 1.834845 1.5587280 0.0005656822 0.2444737
## ENSOARG00000000669 2.556560 -0.3686911 0.0012533417 0.2444737
## ENSOARG00000011570 -1.691119 0.7041272 0.0012711513 0.2444737
summary(decideTests(GvS_edgeRun))
## Sulcus-Gyrus
## Down 0
## NotSig 13221
## Up 0
Although the FDR has decreased importantly, this test was still
unsuccessful in extracting differentially expressed genes from our
dataset, most likely due to the low number of biological replicates and
the small amount of variation for the factor we wanted to detect. The
MDS shown earlier shows that the main variability factors do
not separate the samples based on the factor we want, and
samples from Gyrus
and Sulcus
cluster
together. Plotting the different dimensions can help further make sense
of the issue:
plotMDS(data_norm3, dim.plot = c(3,4), col=rep(c(1,2),4), pch=rep(c(15,16),4))#, col=colours2, labels=NULL, pch=points2)
legend("topright", legend=c("Gyrus samples", "Sulcus samples"), pch=c(15,16), col=c(1,2), ncol=1)
As can be seen in the MDS, dimension 3 (which accounts for only 14%
of the variability in the dataset) is the factor that best separates
Gyrus
from Sulcus
, and this is done only
partially since one of the Sulcus
samples behaves very
similarly to the Gyrus
samples. This is something that the
original study also confirmed via immunohistochemical approaches for
some genes such as BDNF and NeuroD6 – it was only after extensive
sampling of Gyrus
and Sulcus
of the cortical
plate throughout the cortex at different rostrocaudal levels that
differential genetic expression became evident enough to be detected via
two-way ANOVA. This supports the idea that more biological replicates
would have been a desirable approach.
The little genetic difference between the Gyrus
and
Sulcus
samples is understandable since the samples come
from the the same cortical structure (cortical plate), the
same cortical region (around the central (Ansate) sulcus), and
from normal, healthy sheep without any treatments. Since these specimens
are not genetically related, the genetic variability between them is
maximal as well.
The previous step is where the RNA Seq component of the original study ends (see DOI: 10.1093/cercor/bhaa171 for more details). However, I’ve always wondered: What if instead of 4 sheep I had, say, 9-10 sheep that behaved similarly enough in the gyral and sulcal expression to the ones in this dataset (as variable as they were)? Would I have had the statistical power to uncover some differential genetic expression between gyri and sulci?
Since the study is concluded and published, and that there is no
budget to do this (or Animal Ethics Committee that would approve of it),
I decided to simulate some data based using my 4 sheep
(1251.1
, 1251.2
, 1451.1
and
1451.2
) as guidance.
The data is modeled by using a Negative Binomial distribution centred around the mean counts for each gene for the 4 original sheep, with a random probability that with a range of 0.4 and centred at 0.5 (thus between between 0.3 and 0.7 – narrower ranges deliver too much clustering of the simulated samples therefore not a good “real world” simulation, and wider ranges deliver data that is so variable that makes the original samples cluster and is thus too variable):
#Separate the gyral and the sulcal data.
gyri_data <- data_final2[, c(1,3,5,7)]
sulci_data <- data_final2[, c(2,4,6,8)]
#Simulate the data
set.seed(10)
sheep_sim_data <- data_final2
n <- 1
min <- 0.3
max <- 0.7
suppressWarnings(sheep_sim_data['sim_sheep1.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep1.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep2.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep2.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep3.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep3.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep4.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep4.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep5.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep5.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
sheep_sim_data[is.na(sheep_sim_data)] <- 0
head(sheep_sim_data, 20)
## 1251.1G 1251.1S 1251.2G 1251.2S 1452.1G 1452.1S 1452.2G
## ENSOARG00000017577 3 9 14 3 23 3 23
## ENSOARG00000020430 0 0 0 0 0 0 0
## ENSOARG00000017587 0 0 0 0 0 0 0
## ENSOARG00000025485 1 2 2 10 0 0 5
## ENSOARG00000017595 0 0 0 0 0 0 0
## ENSOARG00000017608 71 72 154 164 41 38 100
## ENSOARG00000017620 36 7 65 57 9 47 28
## ENSOARG00000017628 5 12 19 5 3 20 34
## ENSOARG00000017642 70 89 107 37 34 22 101
## ENSOARG00000017652 1 4 3 3 2 9 4
## ENSOARG00000017668 130 132 247 270 141 100 159
## ENSOARG00000017737 329 185 241 159 94 76 149
## ENSOARG00000017810 1992 2094 2911 2199 1754 1572 2474
## ENSOARG00000017870 1544 1537 2210 1913 1112 725 1885
## ENSOARG00000025486 1 1 0 0 1 0 1
## ENSOARG00000017936 0 0 3 0 0 1 1
## ENSOARG00000025487 0 0 0 0 9 0 2
## ENSOARG00000017998 1738 1309 1943 1654 1339 1498 2058
## ENSOARG00000018016 21 15 5 16 7 3 35
## ENSOARG00000018079 20 49 119 61 56 32 116
## 1452.2S sim_sheep1.G sim_sheep1.S sim_sheep2.G sim_sheep2.S
## ENSOARG00000017577 24 11 7 9 5
## ENSOARG00000020430 0 0 0 0 0
## ENSOARG00000017587 0 0 0 0 0
## ENSOARG00000025485 1 1 4 1 4
## ENSOARG00000017595 0 0 0 0 0
## ENSOARG00000017608 103 52 70 61 148
## ENSOARG00000017620 36 22 21 17 30
## ENSOARG00000017628 23 32 10 8 25
## ENSOARG00000017642 85 65 50 26 22
## ENSOARG00000017652 2 6 5 2 2
## ENSOARG00000017668 244 236 160 80 163
## ENSOARG00000017737 142 197 304 235 115
## ENSOARG00000017810 2204 1974 947 2087 1490
## ENSOARG00000017870 1386 1694 640 1094 3279
## ENSOARG00000025486 0 0 0 0 0
## ENSOARG00000017936 2 0 3 2 2
## ENSOARG00000025487 3 2 5 2 0
## ENSOARG00000017998 2148 1172 2274 1240 732
## ENSOARG00000018016 20 10 33 29 9
## ENSOARG00000018079 89 53 85 63 90
## sim_sheep3.G sim_sheep3.S sim_sheep4.G sim_sheep4.S
## ENSOARG00000017577 22 15 25 8
## ENSOARG00000020430 0 0 0 0
## ENSOARG00000017587 0 0 0 0
## ENSOARG00000025485 1 6 1 1
## ENSOARG00000017595 0 0 0 0
## ENSOARG00000017608 64 116 100 185
## ENSOARG00000017620 44 51 18 17
## ENSOARG00000017628 16 16 12 30
## ENSOARG00000017642 141 20 73 75
## ENSOARG00000017652 2 3 0 3
## ENSOARG00000017668 121 139 321 322
## ENSOARG00000017737 93 138 104 362
## ENSOARG00000017810 1478 1382 2809 2410
## ENSOARG00000017870 1641 1960 1134 1024
## ENSOARG00000025486 1 1 1 0
## ENSOARG00000017936 1 1 0 0
## ENSOARG00000025487 1 0 1 0
## ENSOARG00000017998 793 2134 3659 1726
## ENSOARG00000018016 36 5 15 7
## ENSOARG00000018079 115 32 51 54
## sim_sheep5.G sim_sheep5.S
## ENSOARG00000017577 35 19
## ENSOARG00000020430 0 0
## ENSOARG00000017587 0 0
## ENSOARG00000025485 8 3
## ENSOARG00000017595 0 0
## ENSOARG00000017608 71 75
## ENSOARG00000017620 27 16
## ENSOARG00000017628 31 14
## ENSOARG00000017642 74 86
## ENSOARG00000017652 3 1
## ENSOARG00000017668 89 236
## ENSOARG00000017737 156 47
## ENSOARG00000017810 1786 1639
## ENSOARG00000017870 1364 1301
## ENSOARG00000025486 0 0
## ENSOARG00000017936 0 0
## ENSOARG00000025487 4 1
## ENSOARG00000017998 2395 847
## ENSOARG00000018016 25 1
## ENSOARG00000018079 65 21
And now we can repeat the edgeR
and edgeRun
approaches using this data.
#Experimental design setup
experiment_design_sim <- read.table("experiment_design_sim.txt",header = T, sep = "\t")
experiment_design_sim
## SampleID Source_animal Sample_type
## 1 1251.1G Sheep_1 Gyrus
## 2 1251.1S Sheep_1 Sulcus
## 3 1251.2G Sheep_2 Gyrus
## 4 1251.2S Sheep_2 Sulcus
## 5 1452.1G Sheep_4 Gyrus
## 6 1452.1S Sheep_4 Sulcus
## 7 1452.2G Sheep_5 Gyrus
## 8 1452.2S Sheep_5 Sulcus
## 9 sim_sheep1.G sim_sheep1 Gyrus
## 10 sim_sheep1.S sim_sheep1 Sulcus
## 11 sim_sheep2.G sim_sheep2 Gyrus
## 12 sim_sheep2.S sim_sheep2 Sulcus
## 13 sim_sheep3.G sim_sheep3 Gyrus
## 14 sim_sheep3.S sim_sheep3 Sulcus
## 15 sim_sheep4.G sim_sheep4 Gyrus
## 16 sim_sheep4.S sim_sheep4 Sulcus
## 17 sim_sheep5.G sim_sheep5 Gyrus
## 18 sim_sheep5.S sim_sheep5 Sulcus
rownames(experiment_design_sim) <- experiment_design_sim$SampleID
samples_names_sim <- as.character(experiment_design_sim$SampleID)
sample_type_sim <- factor(experiment_design_sim$Sample_type)
sheep_number_sim <- factor(experiment_design_sim$Source_animal)
DGEList.edgeR_sim <- DGEList(counts=sheep_sim_data, group = sample_type_sim)
keep_sim <- filterByExpr(DGEList.edgeR_sim)
summary(keep_sim)
## Mode FALSE TRUE
## logical 13870 13184
DGEList.edgeR_sim <- DGEList.edgeR_sim[keep_sim, , keep.lib.sizes=FALSE]
DGEList.edgeR_sim <- calcNormFactors(DGEList.edgeR_sim)
DGEList.edgeR_sim <- estimateCommonDisp(DGEList.edgeR_sim,verbose = T)
## Disp = 0.17812 , BCV = 0.422
data_norm_sim <- calcNormFactors(DGEList.edgeR_sim)
colours_sim <- rep(c(1,2),9)
points_sim <- rep(c(15,16),9)
plotMDS(data_norm_sim, col=colours_sim, labels=NULL, pch=points_sim)
legend("topright", legend=levels(sample_type_sim), pch=c(15,16), col=c(1,2), ncol=1)
design_matrix_sim <- model.matrix(~ 0 + sample_type_sim)
colnames(design_matrix_sim) <- levels(sample_type_sim)
design_matrix_sim
## Gyrus Sulcus
## 1 1 0
## 2 0 1
## 3 1 0
## 4 0 1
## 5 1 0
## 6 0 1
## 7 1 0
## 8 0 1
## 9 1 0
## 10 0 1
## 11 1 0
## 12 0 1
## 13 1 0
## 14 0 1
## 15 1 0
## 16 0 1
## 17 1 0
## 18 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$sample_type_sim
## [1] "contr.treatment"
fit_sim <- glmQLFit(DGEList.edgeR_sim, design_matrix_sim, robust = TRUE)
contrast_sim = makeContrasts(Gyrus - Sulcus, levels = design_matrix_sim)
GvS_sim <- glmQLFTest(fit_sim, contrast = contrast_sim)
topTags(GvS_sim)
## Coefficient: 1*Gyrus -1*Sulcus
## logFC logCPM F PValue FDR
## ENSOARG00000012748 -2.265645 1.71982086 44.42897 2.644003e-11 3.365868e-07
## ENSOARG00000000776 3.035031 0.08656688 43.14067 5.105989e-11 3.365868e-07
## ENSOARG00000009537 -2.744161 0.34469859 41.68114 1.076723e-10 4.731838e-07
## ENSOARG00000017745 2.504387 0.54151354 37.68212 8.341183e-10 2.749254e-06
## ENSOARG00000011263 1.992096 1.60629255 33.97380 5.594003e-09 1.475027e-05
## ENSOARG00000019063 2.531804 0.04134232 32.16711 1.416493e-08 3.112506e-05
## ENSOARG00000008842 2.302205 0.35312880 30.26215 3.790534e-08 6.201197e-05
## ENSOARG00000015875 2.446860 0.02572159 30.11616 4.073915e-08 6.201197e-05
## ENSOARG00000000669 -2.634856 -0.37974314 29.81500 4.758333e-08 6.201197e-05
## ENSOARG00000014800 2.528492 -0.13994649 29.69803 5.054240e-08 6.201197e-05
summary(decideTests(GvS_sim))
## 1*Gyrus -1*Sulcus
## Down 116
## NotSig 12930
## Up 138
plotMD(GvS_sim, main = "Differentially Expressed between Gyrus and Sulcus")
So, using this approach we find that we have 116 genes downregulated
in Gyrus
compared to Sulcus
, and 138 genes
that are upregulated in Gyrus
compared to
Sulcus
.
But the information we currently have as row names (the Ensembl IDs)
do not tell us anything that we humans can read, and querying the IDs
one by one with the Ensembl database is impractical. Luckily we have
some tools like rtracklayer
that we can use to import and
handle the annotation file and use it to match the IDs with the gene
symbols, which are more informative.
library(rtracklayer)
annots <- import("Ovis_aries_annot.gtf")
annots <- subset(annots, type == "gene")
annots_df <- as.data.frame(annots)[,c("gene_id", "gene_name")]
So now we can use the annots_df
object to merge IDs and
symbols (names) after subsetting the results to include only DE
genes.
#Select only the genes that are up- or down-regulated based on decideTests results
significant_GvS_sim <- as.data.frame(decideTests(GvS_sim))
significant_GvS_sim <- subset(significant_GvS_sim, significant_GvS_sim$`1*Gyrus -1*Sulcus` == -1 | significant_GvS_sim$`1*Gyrus -1*Sulcus` == 1 )
significant_GvS_sim["gene_id"] <- rownames(significant_GvS_sim)
#Subset the DGELRT table to include only the genes that are significant based on decideTests results
significant_table <- GvS_sim$table
significant_table["gene_id"] <- rownames(significant_table)
significant_table <- merge(significant_table, significant_GvS_sim, by.x = "gene_id", by.y = "gene_id")
#Merge with symbols
significant_final <- merge(significant_table, annots_df, by.x = "gene_id", by.y = "gene_id")
There are too many NAs! Which is what happens often with genomes that are not as extensively annotated and validated such as model organisms.
When we filter by genes with names, we see that
significant_gene_names <- significant_final["gene_name"][is.na(significant_final["gene_name"]) == FALSE]
significant_gene_names
## [1] "TLR10" "CDK5" "EDNRA" "ZDHHC7" "AGTR1" "PMP22" "POLR2G" "7SK"
We have 8 genes differentially expressed, amongst which is CDK5, which is one of the genes that the original study assumed as differentially expressed (based on edgeRun results combined with other statistical approaches) and was validated as such using immunohistological methods. However, the other validated genes (BDNF, NeuroD6, HDAC5, MeCP2 and MAP2) were not picked up by this method, highlighting the complexities of the pipeline at hand when dealing with small samples of genetically diverse individuals.
The work presented here (for display and curiosity’s sake) is a
replication and an extension of the RNA Seq component of an already
published study that sought to uncover the differential genetic
expression of the gyral and sulcal cortical plate during gyrification in
fetal sheep (see PMID: 32609332 DOI: 10.1093/cercor/bhaa171 for more
details). The extension component was drawn using simulated data that
was created using negative binomial modelling that intended to “mimic”
the Gyrus
and Sulcus
samples observed in the
real-world RNA Seq data. However, this is understandably just a
simulation, and in order to uncover small differences from genetically
diverse individuals we would need many more samples (see for example the
“Profiles of Yoruba HapMap individuals” section in the edgeR user’s
manual).
Interestingly, the simulation did identify a gene that was hypothesised to be differentially expressed: CDK5, thus strongly suggesting that the modelled approach has some merits, however it nonetheless missed genes such as BDNF, NeuroD6, HDAC5, MeCP2 and MAP2. We might need to sample many, many more fetal sheep cortices to obtain enough replicates to uncover these genes, which is something that is beyond the constraints of the animal ethics and the budget we had.