How to perform single cell RNA sequencing: exploratory analysis
Last updated
Is anyone out there still performing bulk RNA-Seq? I am sure that it is far from dying out but judging by what scientists talk about and report on, single cell RNA sequencing (scRNA-Seq) has become the new norm.
Although, when you think about it, it can hardly be called a novel approach. The first manuscript on scRNA-Seq was published in 2009 (Tang F, et al. 2009), but you may be surprised to hear that the basic principles were described in the 1990s (seminal works by Paul Coleman and Norman Iscove). That decade (or three) is quite negligible for a geologist, but in terms of molecular biology, scRNA-Seq is getting gray hair. In comparison, bioinformatics analysis of scRNA-Seq data has yet to mature and there is no consensus in the community.
Single cell analysis is a larger topic than a single blog post can cover. To break it down into bite-size reading, we’ll talk about each step in a separate blog post.
As with any other analysis, the golden rule of garbage in, garbage out still applies, so once you have the quantified matrix, you should do some exploratory analysis at both the cell and the gene level.
Let’s start with cells. You may want to look at the fraction of mitochondrial counts per cell (i.e., the ratio of reads mapping to mitochondrial genes to the total number of reads; Figure 1). Cells with an elevated proportion of mitochondrial reads may be apoptotic and, hence, should be excluded from the analysis. The cut-off should be chosen with the experiment and the samples in mind, but values of 5% or 10% are quite common in single cell studies.
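To make the mitochondrial filter concrete, here is a minimal NumPy sketch. It assumes a cells-by-genes count matrix and assumes that mitochondrial genes can be recognized by an "MT-" prefix in the gene symbol (the usual convention for human symbols, but do check your own annotation). The function name, the toy data and the 10% cut-off are purely illustrative.

```python
import numpy as np

def mito_fraction(counts, gene_names, prefix="MT-"):
    """Per cell, the share of counts that map to mitochondrial genes.

    counts: 2-D array with cells in rows and genes in columns.
    prefix: assumed naming convention for mitochondrial genes.
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([g.upper().startswith(prefix) for g in gene_names])
    total = counts.sum(axis=1)
    mito = counts[:, is_mito].sum(axis=1)
    # Cells with zero counts get a fraction of 0 instead of a division error.
    return np.divide(mito, total, out=np.zeros_like(total), where=total > 0)

# Toy data: 3 cells x 3 genes (made-up numbers).
genes = ["MT-CO1", "ACTB", "GAPDH"]
counts = np.array([
    [2, 50, 48],
    [30, 40, 30],
    [0, 0, 0],
])

frac = mito_fraction(counts, genes)
keep_cells = frac <= 0.10  # illustrative cut-off, adjust to your experiment
```

In this toy example the second cell has 30% mitochondrial counts and would be dropped, while the other two pass the filter.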
Next, some cells may show unusually high count numbers. As an example, I plotted the results of a Drop-Seq experiment (Figure 2). The cell in the top right corner has ~1.3 million reads, which is some 1,000× more than most of the cells. That event is most likely not a single cell, but a doublet (or a triplet) and, again, should be excluded from any downstream steps.
Tip! Use a violin plot to show the data distribution of your single cell analysis experiment. It makes it easier to set the cutoff.
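If you would like a number to go with the plot, one possible heuristic (my assumption here, not a community standard) is to flag cells whose log-transformed total count sits far above the median, measured in median absolute deviations (MADs). The function name and the default of 5 MADs are hypothetical; treat this as a sketch to be checked against your own distribution.

```python
import numpy as np

def flag_high_count_cells(total_counts, n_mads=5.0):
    """Flag cells whose log10 total count is far above the median.

    total_counts: 1-D array with the total number of reads per cell.
    n_mads: how many MADs above the median a cell must be to get flagged.
    """
    totals = np.asarray(total_counts, dtype=float)
    log_totals = np.log10(totals + 1.0)  # +1 avoids log10(0)
    median = np.median(log_totals)
    mad = np.median(np.abs(log_totals - median))
    return log_totals > median + n_mads * mad

# Toy data: four ordinary cells and one event with ~1.3 million reads.
totals = np.array([1000, 1200, 900, 1100, 1_300_000])
suspect = flag_high_count_cells(totals)
```

Only the last event is flagged in this example. A rule like this catches extreme outliers only; it is no substitute for looking at the distribution.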
There are also several gene-level metrics that you should take into consideration. Going over a distribution plot like the one in Figure 3, based on a 10x Genomics data set, will reveal the genes with the highest number of reads.
One way to interpret a distribution plot is to see where you are investing your reads. For example, out of the ten top genes in Figure 3, six are ribosomal genes. Unless you are specifically interested in ribosomal biology or translation machinery, you may want to remove them from the downstream steps, since their presence only introduces noise in the single cell analysis.
Tip! A list of ribosomal genes can be downloaded from genenames.org.
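Here is a small sketch of how you could check where your reads go and then drop a set of genes. It assumes a cells-by-genes count matrix and a set of ribosomal gene symbols that you supply yourself (for instance, from the downloaded list). The function names and the toy numbers are made up for illustration.

```python
import numpy as np

def top_genes_by_reads(counts, gene_names, n=10):
    """Return the n genes with the largest share of all reads."""
    counts = np.asarray(counts, dtype=float)
    per_gene = counts.sum(axis=0)
    share = per_gene / per_gene.sum()
    order = np.argsort(share)[::-1][:n]  # largest share first
    return [(gene_names[i], float(share[i])) for i in order]

def drop_genes(counts, gene_names, to_remove):
    """Remove the columns whose gene symbol is in to_remove."""
    counts = np.asarray(counts)
    keep = np.array([g not in to_remove for g in gene_names])
    kept_names = [g for g, k in zip(gene_names, keep) if k]
    return counts[:, keep], kept_names

# Toy data: 2 cells x 4 genes (made-up numbers).
genes = ["RPS27", "RPL13", "ACTB", "CD3E"]
counts = np.array([
    [40, 30, 20, 10],
    [60, 20, 10, 10],
])

top2 = top_genes_by_reads(counts, genes, n=2)
ribosomal = {"RPS27", "RPL13"}  # stand-in for a downloaded gene list
filtered, filtered_genes = drop_genes(counts, genes, ribosomal)
```

In the toy data, the two ribosomal genes soak up 75% of all reads, which is exactly the kind of pattern the distribution plot is meant to reveal.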
There is an additional filter strategy to consider: what genes to focus on? One approach is to prune the genes that are not detected (for example, have 0 counts across the cells) or that are detected in a few cells only (“background”; for example, have 0 counts in at least 99% of the cells).
We advise you to carefully interpret the non-detectable genes: are those genes not expressed in your cells (=biology) or are you not picking them up due to the experimental setup, e.g., insufficient sequencing depth (=technology)?
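The pruning rule described above can be sketched in a few lines. The example assumes a cells-by-genes count matrix and uses the 99% threshold from the text as the default. The function name is hypothetical and the data are simulated.

```python
import numpy as np

def detected_gene_mask(counts, max_zero_fraction=0.99):
    """Mark genes to keep: those with zero counts in fewer than
    max_zero_fraction of the cells. Genes that are never detected
    (zero fraction of 1.0) are dropped as well.
    """
    counts = np.asarray(counts)
    zero_fraction = (counts == 0).mean(axis=0)
    return zero_fraction < max_zero_fraction

# Simulated data: 200 cells x 4 genes.
counts = np.zeros((200, 4), dtype=int)
counts[:, 0] = 3     # gene 0: detected in every cell
counts[0, 1] = 1     # gene 1: detected in a single cell (99.5% zeros)
                     # gene 2: never detected
counts[:10, 3] = 2   # gene 3: detected in 10 cells (95% zeros)

keep_genes = detected_gene_mask(counts)
pruned = counts[:, keep_genes]
```

Genes 1 and 2 are removed here, while gene 3 survives even though it is seen in only 5% of the cells. If you expect rare cell populations, a looser threshold may be the safer choice.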
Another approach is to filter only the most variable genes, with the rationale that those genes are also the most informative ones. This is quite appealing and can help to identify the main cell groups. On the flip side, key biological information may be lost.
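As a rough illustration of the idea, the sketch below ranks genes by the variance of their log-transformed counts and keeps the top n. This is a deliberately simple stand-in of my own choosing: dedicated single cell tools typically correct for the relationship between mean expression and variance, which this example does not do. The function name and the toy data are hypothetical.

```python
import numpy as np

def most_variable_genes(counts, n=100):
    """Return the column indices of the n genes with the highest
    variance of log1p-transformed counts, most variable first.
    """
    log_counts = np.log1p(np.asarray(counts, dtype=float))
    variance = log_counts.var(axis=0)
    return np.argsort(-variance)[:n]

# Toy data: 4 cells x 3 genes (made-up numbers).
counts = np.array([
    [5, 0, 4],
    [5, 10, 5],
    [5, 0, 4],
    [5, 10, 5],
])

top = most_variable_genes(counts, n=2)
subset = counts[:, top]
```

Gene 0 is constant across the cells and is therefore dropped, while the on/off gene 1 ranks first.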
Irrespective of your decision, the filter strategy needs to be considered when interpreting the results. To illustrate the impact of filtering, let us compare two t-SNE charts, based on the same 10x Genomics data (Figure 4): all the detected genes were used for the t-SNE on the left (in this case: 6,178), while only the 100 most variable genes were used for the one on the right. As shown by overlaying the output of graph-based clustering, the cells in the left panel form five clusters, while the cells in the right panel form six.
Having performed cell-based and gene-based filtering, you are one (big) step closer to the analytical data set, but there is still work to be done. For example, you may want to eliminate technical nuisance factors by using batch removal or scaling.