Overview

This is a modified version of the analysis performed for “Genetic and microstructural differences in the cortical plate of gyri and sulci during gyrification in fetal sheep”, (PMID: 32609332 DOI: 10.1093/cercor/bhaa171). This analysis is for display purposes only.

Raw RNA Seq FASTQ files can be downloaded from NCBI’s GEO repository, accession GSE141233. Sheep genome (v3.1) and annotation files were obtained from the International Sheep Genomics Consortium. Sheep annotation files can also be found at Ensembl

Due to the present inability to reproduce some computationally expensive steps and files (Steps 1 to 3), the analysis will be fed a csv file that was produced as output from featureCounts as part of the original analysis. See Step 4 for more details. Notwithstanding this limitation, the code for Steps 1 to 3 is still presented.

Additionally, the original study is here extended (for the sake of scientific curiosity) via the inclusion of simulated samples. See step 7 for more details. These samples, understandably, are not part of the original study. If you know a better way to perform these simulations (and this pipeline in general), please feel free to contact me so we can discuss this!

The pipeline uses STAR, featureCounts, edgeR and edgeRun

Step 1 - Building the Genome Index

After loading the raw files, genome file and annotation file to the HPC (this can be done using rsync), the first step is to build the genome index using STAR.

    module load STAR
    STAR --runThreadN 16 \
         --runMode genomeGenerate \
         --genomeDir ~/RNASeqSQ/GenomeDir/ \
         --genomeFastaFiles ~/RNASeqSQ/ref_genome/Sheep_genome.fna \
         --sjdbGTFfile ~/RNASeqSQ/ref_genome/Sheep_genome_annotations.gff \
         --sjdbOverhang 49 \
         --sjdbGTFtagExonParentTranscript Parent

Step 2 - Mapping FASTQ files

The second step is mapping of the FASTQ files, which can be done using a loop command again using STAR.

    #!/bin/bash

    module load STAR

    fqFiles=`find $1 -name '*.gz' -type f`

    for fqFile in $fqFiles;do
      STAR --runThreadN 16 \
           --genomeDir ~/RNASeqSQ/GenomeDir/ \
           --readFilesIn $fqFile \
           --readFilesCommand zcat \
           --outFileNamePrefix ~/RNASeqSQ/mapped_data/$(basename $fqFile .fastq.gz)_ \
           --outSAMstrandField intronMotif \
           --outSAMtype BAM Unsorted \
           --outSAMunmapped Within \
           --outFilterIntronMotifs RemoveNoncanonical
    done

Step 3 - Read Summarisation

After mapping, read summarisation can be performed locally in a decent-enough laptop in R environment. This step can be done using RSubread’s featureCounts. All the next steps can be performed locally in R, which saves resources on the HPC.

    library(Rsubread)
    aligned_files <- c("mapped_data_ensembl/1251.1G_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1251.1S_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1251.2G_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1251.2S_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1452.1G_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1452.1S_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1452.2G_ensembl_Aligned.out.bam",
                           "mapped_data_ensembl/1452.2S_ensembl_Aligned.out.bam")
    data_new<-featureCounts(aligned_files, annot.ext="ref_genome_ensembl/Sheep_genome_annotations_ensembl.gtf", isGTFAnnotationFile=TRUE, nthreads=16)

Step 4 - Exploring and Cleaning the Data

Normally, the data is extracted from the object output by featureCounts as shown below:

data <- as.data.frame(data_new$counts)

However and due to the present inability to reproduce these .bam files, the raw data will be input from a csv file produced as output in the original analysis.

data_final <- read.csv("counts_new_raw.csv",
                       header = TRUE)
colnames(data_final) <- c("Ensembl_ID", "1251.1G", "1251.1S", "1251.2G", "1251.2S", "658.1G", "658.1S", "1452.1G", "1452.1S", "1452.2G", "1452.2S")
rownames(data_final) <- data_final[,1]
data_final <- data_final[,c(2:11)]

The data should give a table with the counts for each sample ID as columns and the Ensembl IDs as rownames (if not, it can be easily transformed anyway). These Ensembl IDs will be used later to pair the IDs with the gene symbols, which are more informative than the IDs.

Next, we will import the experimental design matrix prepared and create the factors required.

experiment_design <- read.table("experiment_design.txt",header = T, sep = "\t")
experiment_design
##    SampleID Source_animal Sample_type
## 1   1251.1G       Sheep_1       Gyrus
## 2   1251.1S       Sheep_1      Sulcus
## 3   1251.2G       Sheep_2       Gyrus
## 4   1251.2S       Sheep_2      Sulcus
## 5    658.1G       Sheep_3       Gyrus
## 6    658.1S       Sheep_3      Sulcus
## 7   1452.1G       Sheep_4       Gyrus
## 8   1452.1S       Sheep_4      Sulcus
## 9   1452.2G       Sheep_5       Gyrus
## 10  1452.2S       Sheep_5      Sulcus
rownames(experiment_design) <- experiment_design$SampleID
samples_names <- as.character(experiment_design$SampleID)
sample_type <- factor(experiment_design$Sample_type)
sheep_number <- factor(experiment_design$Source_animal)

Then, we will create the DGEList object to calculate dispersion and create a MDS (PCA) plot to have an overlook of the data points.

library(edgeR)
DGEList.edgeR <- DGEList(counts=data_final, group = sample_type)
DGEList.edgeR <- calcNormFactors(DGEList.edgeR)
DGEList.edgeR <- estimateCommonDisp(DGEList.edgeR,verbose = T)
## Disp = 0.25183 , BCV = 0.5018

This outputs a Common Dispersion of ~0.25 and a Biological Coefficient of Variation (BCV) of ~0.5, which is unexpectedly high. Let’s observe the MDS for the samples.

data_norm <- calcNormFactors(DGEList.edgeR)
colours <- sheep_number
points <- rep(16,5)
plotMDS(data_norm, col=colours, labels=NULL, pch=points)
legend("bottomleft", legend=levels(sheep_number), pch=points, col=c(1,2,3,4,5), ncol=2)

After normalisation, it appears as if one of the samples from Sheep_3 (Sheep number 658.1 is an outlier. This means we will have to drop both samples from the same sheep, since we cannot have a sheep with only one sample).

We will thus repeat the above steps but trimming our data and experimental design to include only Sheep 1, 2, 4 and 5.

#Trimming data and matrix
data_final2 <- data_final[,c(1:4,7:10)]
experiment_design2 <- experiment_design[c(1:4,7:10),]

#Defining factors
samples_names2 <- as.character(experiment_design2$SampleID)
sample_type2 <- factor(experiment_design2$Sample_type)
sheep_number2 <- factor(experiment_design2$Source_animal)
DGEList.edgeR2 <- DGEList(counts=data_final2, group = sample_type2)
DGEList.edgeR2 <- calcNormFactors(DGEList.edgeR2)
DGEList.edgeR2 <- estimateCommonDisp(DGEList.edgeR2,verbose = T)
## Disp = 0.17989 , BCV = 0.4241
data_norm2 <- calcNormFactors(DGEList.edgeR2)
colours2 <- c(1,1,2,2,4,4,5,5)
points2 <- rep(16,4)
plotMDS(data_norm2, col=colours2, labels=NULL, pch=points2)
legend("topleft", legend=levels(sheep_number2), pch=points2, col=c(1,2,4,5), ncol=2)

The samples now show a better clustering pattern with samples coming from the same sheep clustering closer to each other, as expected. Removing the outlier improved the Disp and BCV parameters, but there is still a lot of variability, which probably comes from the fact that we still haven’t filtered out the noise from unexpressed and low-expressed genes. We will do this in the next step.

Step 5 - Filtering Out The Noise and QC’ing Before Stats.

Next, low count genes need to be cleaned out to reduce the noise and their impact on the stats. Genes that have less than 5 counts will be removed.

DGEList.edgeR3 <- DGEList(counts=data_final2, group = sample_type2)
keep <- filterByExpr(DGEList.edgeR3)
summary(keep)
##    Mode   FALSE    TRUE 
## logical   13833   13221
DGEList.edgeR3 <- DGEList.edgeR3[keep, , keep.lib.sizes=FALSE]
DGEList.edgeR3 <- calcNormFactors(DGEList.edgeR3)

Which reduces the original data frame with 27054 entries to 13221. Over half of the table was purged! Which is understandable since the table produced by featureCounts represents the gene expression counts in the samples mapped against all the genes in the genome, whereas the samples are derived from (mainly) neurons in the cortical plate (see details in study), which will only express a subset of genes. Additionally, some low level counts can be the product of sequencing error, rather than being true counts.

We’ll repeat the calculations for Dispersion and BCV, as well as looking at how the samples cluster after removing the noise.

DGEList.edgeR3 <- estimateCommonDisp(DGEList.edgeR3,verbose = T)
## Disp = 0.12955 , BCV = 0.3599
data_norm3 <- calcNormFactors(DGEList.edgeR3)
colours2 <- c(1,1,2,2,4,4,5,5)
points2 <- rep(c(16,15),4)
plotMDS(data_norm3, col=colours2, labels=NULL, pch=points2)
legend("topright", legend=levels(sheep_number2), pch=16, col=c(1,2,4,5), ncol=2)

##Plot BCV trends
DGEList.edgeR3 <- estimateDisp(DGEList.edgeR3)
## Using classic mode.
plotBCV(DGEList.edgeR3)

for (column in c(1:8)){
  plotMD(cpm(DGEList.edgeR3, log=TRUE), column = column)
  abline(h=0,col="red",lty=2,lwd=2)
}

The clustering is now much more like we expected it in which samples coming from the same animal are closer genetically to each other, and both the Common Dispersion and the Biological Coefficient of Variability are within expected values for the sample types we are using here (genetically unrelated animals). However, the MDS here also spells some bad news for our analysis, since there appears to be more variability between animals than between samples types from the same animal (Gyrus vs Sulcus), which is what we’re interested in. Finally, the MD plots show the samples clustering around log-fold change of 0, meaning the composition bias between libraries has been successfully removed.

Step 6 - Differential Expression

First, the design matrix is input

design_matrix <- model.matrix(~ 0 + sample_type2)
colnames(design_matrix) <- levels(sample_type2)
design_matrix
##   Gyrus Sulcus
## 1     1      0
## 2     0      1
## 3     1      0
## 4     0      1
## 5     1      0
## 6     0      1
## 7     1      0
## 8     0      1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$sample_type2
## [1] "contr.treatment"

Then, we fit the model using the quasi-likelihood pipeline using gkmQLFit.

fit <- glmQLFit(DGEList.edgeR3, design_matrix, robust = TRUE)

Then we make the contrasts we’re interested in.

contrast = makeContrasts(Gyrus - Sulcus, levels = design_matrix)

And we apply the QL Test to determine differential expression.

GvS <- glmQLFTest(fit, contrast = contrast)
topTags(GvS)
## Coefficient:  1*Gyrus -1*Sulcus 
##                         logFC     logCPM         F      PValue       FDR
## ENSOARG00000017745  2.2932048  0.5202402 13.262868 0.002516191 0.9999222
## ENSOARG00000020222  1.3600373  1.5786336 13.025800 0.002688116 0.9999222
## ENSOARG00000015199  0.6519343  5.4252684 12.677510 0.002965656 0.9999222
## ENSOARG00000009322  0.5833102  7.2972359 12.039736 0.003563410 0.9999222
## ENSOARG00000010290 -1.8558338  0.9006594 11.858640 0.003757520 0.9999222
## ENSOARG00000012748 -1.8345322  1.5587280 11.523343 0.004149779 0.9999222
## ENSOARG00000009537 -2.3714088  0.4407101 11.091470 0.004726114 0.9999222
## ENSOARG00000016386  2.8504759 -0.8524899 10.297994 0.006041694 0.9999222
## ENSOARG00000017663  1.0385993  2.4124548  9.503951 0.007796455 0.9999222
## ENSOARG00000011570  1.6926856  0.7041272  9.484659 0.007845858 0.9999222
summary(decideTests(GvS))
##        1*Gyrus -1*Sulcus
## Down                   0
## NotSig             13221
## Up                     0

However, as expected and predicted from the MDS, edgeR was unable to find differentially expressed genes between Gyrus and Sulcus with the data provided.

An alternative to edgeR is edgeRun, that offers an unconditional exact test that is better suited for conditions when we have low number of biological replicates (which is the case for this study). The test will be run with 50,000 iterations which is the minimum recommended by the documentation, although this is computationally expensive. Please note that this package is no longer available at Bioconductor and needs to be installed from CRAN’s archive.

library(edgeRun)
GvS_edgeRun <-UCexactTest(DGEList.edgeR3, upper=50000)
topTags(GvS_edgeRun)
## Comparison of groups:  Sulcus-Gyrus 
##                        logFC     logCPM       PValue       FDR
## ENSOARG00000017745 -2.291866  0.5202402 0.0001431588 0.2444737
## ENSOARG00000016386 -2.847723 -0.8524899 0.0001741128 0.2444737
## ENSOARG00000004431 -2.905343 -0.4757439 0.0003431994 0.2444737
## ENSOARG00000020222 -1.358654  1.5786336 0.0003670460 0.2444737
## ENSOARG00000009537  2.369241  0.4407101 0.0003938497 0.2444737
## ENSOARG00000010290  1.854501  0.9006594 0.0003969009 0.2444737
## ENSOARG00000012670 -2.261659 -0.2142238 0.0005592318 0.2444737
## ENSOARG00000012748  1.834845  1.5587280 0.0005656822 0.2444737
## ENSOARG00000000669  2.556560 -0.3686911 0.0012533417 0.2444737
## ENSOARG00000011570 -1.691119  0.7041272 0.0012711513 0.2444737
summary(decideTests(GvS_edgeRun))
##        Sulcus-Gyrus
## Down              0
## NotSig        13221
## Up                0

Although the FDR has decreased importantly, this test was still unsuccessful in extracting differentially expressed genes from our dataset, most likely due to the low number of biological replicates and the small amount of variation for the factor we wanted to detect. The MDS shown earlier shows that the main variability factors do not separate the samples based on the factor we want, and samples from Gyrus and Sulcus cluster together. Plotting the different dimensions can help further make sense of the issue:

plotMDS(data_norm3, dim.plot = c(3,4), col=rep(c(1,2),4), pch=rep(c(15,16),4))#, col=colours2, labels=NULL, pch=points2)
legend("topright", legend=c("Gyrus samples", "Sulcus samples"), pch=c(15,16), col=c(1,2), ncol=1)

As can be seen in the MDS, dimension 3 (which accounts for only 14% of the variability in the dataset) is the factor that best separates Gyrus from Sulcus, and this is done only partially since one of the Sulcus samples behaves very similarly to the Gyrus samples. This is something that the original study also confirmed via immunohistochemical approaches for some genes such as BDNF and NeuroD6 – it was only after extensive sampling of Gyrus and Sulcus of the cortical plate throughout the cortex at different rostrocaudal levels that differential genetic expression became evident enough to be detected via two-way ANOVA. This supports the idea that more biological replicates would have been a desirable approach.

The little genetic difference between the Gyrus and Sulcus samples is understandable since the samples come from the the same cortical structure (cortical plate), the same cortical region (around the central (Ansate) sulcus), and from normal, healthy sheep without any treatments. Since these specimens are not genetically related, the genetic variability between them is maximal as well.

Step 7 - Extending the Original Study via Simulated Biological Replicates.

The previous step is where the RNA Seq component of the original study ends (see DOI: 10.1093/cercor/bhaa171 for more details). However, I’ve always wondered: What if instead of 4 sheep I had, say, 9-10 sheep that behaved similarly enough in the gyral and sulcal expression to the ones in this dataset (as variable as they were)? Would I have had the statistical power to uncover some differential genetic expression between gyri and sulci?

Since the study is concluded and published, and that there is no budget to do this (or Animal Ethics Committee that would approve of it), I decided to simulate some data based using my 4 sheep (1251.1, 1251.2, 1451.1 and 1451.2) as guidance.

The data is modeled by using a Negative Binomial distribution centred around the mean counts for each gene for the 4 original sheep, with a random probability that with a range of 0.4 and centred at 0.5 (thus between between 0.3 and 0.7 – narrower ranges deliver too much clustering of the simulated samples therefore not a good “real world” simulation, and wider ranges deliver data that is so variable that makes the original samples cluster and is thus too variable):

#Separate the gyral and the sulcal data.
gyri_data <- data_final2[, c(1,3,5,7)]
sulci_data <- data_final2[, c(2,4,6,8)]

#Simulate the data
set.seed(10)
sheep_sim_data <- data_final2
n <- 1
min <- 0.3
max <- 0.7
suppressWarnings(sheep_sim_data['sim_sheep1.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep1.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep2.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep2.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep3.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep3.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep4.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep4.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep5.G'] <- sapply(rowMeans(gyri_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))
suppressWarnings(sheep_sim_data['sim_sheep5.S'] <- sapply(rowMeans(sulci_data), function(x) rnbinom(1, x, prob = sample(runif(n=n, min=min, max=max),1))))

sheep_sim_data[is.na(sheep_sim_data)] <- 0
head(sheep_sim_data, 20)
##                    1251.1G 1251.1S 1251.2G 1251.2S 1452.1G 1452.1S 1452.2G
## ENSOARG00000017577       3       9      14       3      23       3      23
## ENSOARG00000020430       0       0       0       0       0       0       0
## ENSOARG00000017587       0       0       0       0       0       0       0
## ENSOARG00000025485       1       2       2      10       0       0       5
## ENSOARG00000017595       0       0       0       0       0       0       0
## ENSOARG00000017608      71      72     154     164      41      38     100
## ENSOARG00000017620      36       7      65      57       9      47      28
## ENSOARG00000017628       5      12      19       5       3      20      34
## ENSOARG00000017642      70      89     107      37      34      22     101
## ENSOARG00000017652       1       4       3       3       2       9       4
## ENSOARG00000017668     130     132     247     270     141     100     159
## ENSOARG00000017737     329     185     241     159      94      76     149
## ENSOARG00000017810    1992    2094    2911    2199    1754    1572    2474
## ENSOARG00000017870    1544    1537    2210    1913    1112     725    1885
## ENSOARG00000025486       1       1       0       0       1       0       1
## ENSOARG00000017936       0       0       3       0       0       1       1
## ENSOARG00000025487       0       0       0       0       9       0       2
## ENSOARG00000017998    1738    1309    1943    1654    1339    1498    2058
## ENSOARG00000018016      21      15       5      16       7       3      35
## ENSOARG00000018079      20      49     119      61      56      32     116
##                    1452.2S sim_sheep1.G sim_sheep1.S sim_sheep2.G sim_sheep2.S
## ENSOARG00000017577      24           11            7            9            5
## ENSOARG00000020430       0            0            0            0            0
## ENSOARG00000017587       0            0            0            0            0
## ENSOARG00000025485       1            1            4            1            4
## ENSOARG00000017595       0            0            0            0            0
## ENSOARG00000017608     103           52           70           61          148
## ENSOARG00000017620      36           22           21           17           30
## ENSOARG00000017628      23           32           10            8           25
## ENSOARG00000017642      85           65           50           26           22
## ENSOARG00000017652       2            6            5            2            2
## ENSOARG00000017668     244          236          160           80          163
## ENSOARG00000017737     142          197          304          235          115
## ENSOARG00000017810    2204         1974          947         2087         1490
## ENSOARG00000017870    1386         1694          640         1094         3279
## ENSOARG00000025486       0            0            0            0            0
## ENSOARG00000017936       2            0            3            2            2
## ENSOARG00000025487       3            2            5            2            0
## ENSOARG00000017998    2148         1172         2274         1240          732
## ENSOARG00000018016      20           10           33           29            9
## ENSOARG00000018079      89           53           85           63           90
##                    sim_sheep3.G sim_sheep3.S sim_sheep4.G sim_sheep4.S
## ENSOARG00000017577           22           15           25            8
## ENSOARG00000020430            0            0            0            0
## ENSOARG00000017587            0            0            0            0
## ENSOARG00000025485            1            6            1            1
## ENSOARG00000017595            0            0            0            0
## ENSOARG00000017608           64          116          100          185
## ENSOARG00000017620           44           51           18           17
## ENSOARG00000017628           16           16           12           30
## ENSOARG00000017642          141           20           73           75
## ENSOARG00000017652            2            3            0            3
## ENSOARG00000017668          121          139          321          322
## ENSOARG00000017737           93          138          104          362
## ENSOARG00000017810         1478         1382         2809         2410
## ENSOARG00000017870         1641         1960         1134         1024
## ENSOARG00000025486            1            1            1            0
## ENSOARG00000017936            1            1            0            0
## ENSOARG00000025487            1            0            1            0
## ENSOARG00000017998          793         2134         3659         1726
## ENSOARG00000018016           36            5           15            7
## ENSOARG00000018079          115           32           51           54
##                    sim_sheep5.G sim_sheep5.S
## ENSOARG00000017577           35           19
## ENSOARG00000020430            0            0
## ENSOARG00000017587            0            0
## ENSOARG00000025485            8            3
## ENSOARG00000017595            0            0
## ENSOARG00000017608           71           75
## ENSOARG00000017620           27           16
## ENSOARG00000017628           31           14
## ENSOARG00000017642           74           86
## ENSOARG00000017652            3            1
## ENSOARG00000017668           89          236
## ENSOARG00000017737          156           47
## ENSOARG00000017810         1786         1639
## ENSOARG00000017870         1364         1301
## ENSOARG00000025486            0            0
## ENSOARG00000017936            0            0
## ENSOARG00000025487            4            1
## ENSOARG00000017998         2395          847
## ENSOARG00000018016           25            1
## ENSOARG00000018079           65           21

And now we can repeat the edgeR and edgeRun approaches using this data.

#Experimental design setup
experiment_design_sim <- read.table("experiment_design_sim.txt",header = T, sep = "\t")
experiment_design_sim
##        SampleID Source_animal Sample_type
## 1       1251.1G       Sheep_1       Gyrus
## 2       1251.1S       Sheep_1      Sulcus
## 3       1251.2G       Sheep_2       Gyrus
## 4       1251.2S       Sheep_2      Sulcus
## 5       1452.1G       Sheep_4       Gyrus
## 6       1452.1S       Sheep_4      Sulcus
## 7       1452.2G       Sheep_5       Gyrus
## 8       1452.2S       Sheep_5      Sulcus
## 9  sim_sheep1.G    sim_sheep1       Gyrus
## 10 sim_sheep1.S    sim_sheep1      Sulcus
## 11 sim_sheep2.G    sim_sheep2       Gyrus
## 12 sim_sheep2.S    sim_sheep2      Sulcus
## 13 sim_sheep3.G    sim_sheep3       Gyrus
## 14 sim_sheep3.S    sim_sheep3      Sulcus
## 15 sim_sheep4.G    sim_sheep4       Gyrus
## 16 sim_sheep4.S    sim_sheep4      Sulcus
## 17 sim_sheep5.G    sim_sheep5       Gyrus
## 18 sim_sheep5.S    sim_sheep5      Sulcus
rownames(experiment_design_sim) <- experiment_design_sim$SampleID
samples_names_sim <- as.character(experiment_design_sim$SampleID)
sample_type_sim <- factor(experiment_design_sim$Sample_type)
sheep_number_sim <- factor(experiment_design_sim$Source_animal)
DGEList.edgeR_sim <- DGEList(counts=sheep_sim_data, group = sample_type_sim)
keep_sim <- filterByExpr(DGEList.edgeR_sim)
summary(keep_sim)
##    Mode   FALSE    TRUE 
## logical   13870   13184
DGEList.edgeR_sim <- DGEList.edgeR_sim[keep_sim, , keep.lib.sizes=FALSE]
DGEList.edgeR_sim <- calcNormFactors(DGEList.edgeR_sim)
DGEList.edgeR_sim <- estimateCommonDisp(DGEList.edgeR_sim,verbose = T)
## Disp = 0.17812 , BCV = 0.422
data_norm_sim <- calcNormFactors(DGEList.edgeR_sim)
colours_sim <- rep(c(1,2),9)
points_sim <- rep(c(15,16),9)
plotMDS(data_norm_sim, col=colours_sim, labels=NULL, pch=points_sim)
legend("topright", legend=levels(sample_type_sim), pch=c(15,16), col=c(1,2), ncol=1)

design_matrix_sim <- model.matrix(~ 0 + sample_type_sim)
colnames(design_matrix_sim) <- levels(sample_type_sim)
design_matrix_sim
##    Gyrus Sulcus
## 1      1      0
## 2      0      1
## 3      1      0
## 4      0      1
## 5      1      0
## 6      0      1
## 7      1      0
## 8      0      1
## 9      1      0
## 10     0      1
## 11     1      0
## 12     0      1
## 13     1      0
## 14     0      1
## 15     1      0
## 16     0      1
## 17     1      0
## 18     0      1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$sample_type_sim
## [1] "contr.treatment"
fit_sim <- glmQLFit(DGEList.edgeR_sim, design_matrix_sim, robust = TRUE)
contrast_sim = makeContrasts(Gyrus - Sulcus, levels = design_matrix_sim)
GvS_sim <- glmQLFTest(fit_sim, contrast = contrast_sim)
topTags(GvS_sim)
## Coefficient:  1*Gyrus -1*Sulcus 
##                        logFC      logCPM        F       PValue          FDR
## ENSOARG00000012748 -2.265645  1.71982086 44.42897 2.644003e-11 3.365868e-07
## ENSOARG00000000776  3.035031  0.08656688 43.14067 5.105989e-11 3.365868e-07
## ENSOARG00000009537 -2.744161  0.34469859 41.68114 1.076723e-10 4.731838e-07
## ENSOARG00000017745  2.504387  0.54151354 37.68212 8.341183e-10 2.749254e-06
## ENSOARG00000011263  1.992096  1.60629255 33.97380 5.594003e-09 1.475027e-05
## ENSOARG00000019063  2.531804  0.04134232 32.16711 1.416493e-08 3.112506e-05
## ENSOARG00000008842  2.302205  0.35312880 30.26215 3.790534e-08 6.201197e-05
## ENSOARG00000015875  2.446860  0.02572159 30.11616 4.073915e-08 6.201197e-05
## ENSOARG00000000669 -2.634856 -0.37974314 29.81500 4.758333e-08 6.201197e-05
## ENSOARG00000014800  2.528492 -0.13994649 29.69803 5.054240e-08 6.201197e-05
summary(decideTests(GvS_sim))
##        1*Gyrus -1*Sulcus
## Down                 116
## NotSig             12930
## Up                   138
plotMD(GvS_sim, main = "Differentially Expressed between Gyrus and Sulcus")

So, using this approach we find that we have 116 genes downregulated in Gyrus compared to Sulcus, and 138 genes that are upregulated in Gyrus compared to Sulcus.

But the information we currently have as row names (the Ensembl IDs) do not tell us anything that we humans can read, and querying the IDs one by one with the Ensembl database is impractical. Luckily we have some tools like rtracklayer that we can use to import and handle the annotation file and use it to match the IDs with the gene symbols, which are more informative.

library(rtracklayer)
annots <- import("Ovis_aries_annot.gtf")
annots <- subset(annots, type == "gene")
annots_df <- as.data.frame(annots)[,c("gene_id", "gene_name")]

So now we can use the annots_df object to merge IDs and symbols (names) after subsetting the results to include only DE genes.

#Select only the genes that are up- or down-regulated based on decideTests results
significant_GvS_sim <- as.data.frame(decideTests(GvS_sim))
significant_GvS_sim <- subset(significant_GvS_sim, significant_GvS_sim$`1*Gyrus -1*Sulcus` == -1 | significant_GvS_sim$`1*Gyrus -1*Sulcus` == 1 )
significant_GvS_sim["gene_id"] <- rownames(significant_GvS_sim)

#Subset the DGELRT table to include only the genes that are significant based on decideTests results
significant_table <- GvS_sim$table
significant_table["gene_id"] <- rownames(significant_table)
significant_table <- merge(significant_table, significant_GvS_sim, by.x = "gene_id", by.y = "gene_id")

#Merge with symbols
significant_final <- merge(significant_table, annots_df, by.x = "gene_id", by.y = "gene_id")

There are too many NAs! Which is what happens often with genomes that are not as extensively annotated and validated such as model organisms.

When we filter by genes with names, we see that

significant_gene_names <- significant_final["gene_name"][is.na(significant_final["gene_name"]) == FALSE]
significant_gene_names
## [1] "TLR10"  "CDK5"   "EDNRA"  "ZDHHC7" "AGTR1"  "PMP22"  "POLR2G" "7SK"

We have 8 genes differentially expressed, amongst which is CDK5, which is one of the genes that the original study assumed as differentially expressed (based on edgeRun results combined with other statistical approaches) and was validated as such using immunohistological methods. However, the other validated genes (BDNF, NeuroD6, HDAC5, MeCP2 and MAP2) were not picked up by this method, highlighting the complexities of the pipeline at hand when dealing with small samples of genetically diverse individuals.

Conclusions

The work presented here (for display and curiosity’s sake) is a replication and an extension of the RNA Seq component of an already published study that sought to uncover the differential genetic expression of the gyral and sulcal cortical plate during gyrification in fetal sheep (see PMID: 32609332 DOI: 10.1093/cercor/bhaa171 for more details). The extension component was drawn using simulated data that was created using negative binomial modelling that intended to “mimic” the Gyrus and Sulcus samples observed in the real-world RNA Seq data. However, this is understandably just a simulation, and in order to uncover small differences from genetically diverse individuals we would need many more samples (see for example the “Profiles of Yoruba HapMap individuals” section in the edgeR user’s manual).

Interestingly, the simulation did identify a gene that was hypothesised to be differentially expressed: CDK5, thus strongly suggesting that the modelled approach has some merits, however it nonetheless missed genes such as BDNF, NeuroD6, HDAC5, MeCP2 and MAP2. We might need to sample many, many more fetal sheep cortices to obtain enough replicates to uncover these genes, which is something that is beyond the constraints of the animal ethics and the budget we had.