# Unscaling scaled counts in scRNA-seq data

It is always good to compare your measurements with measurements others have done before you. This way you get a feeling for expected outcomes, expected variation in measurement, etc.

In single cell RNA-seq, I tend to apply this principle by looking at a lot of published data, to see how the properties of data in general correspond to data that I’m analyzing for the sake of scientific investigation.

Single cell RNA-seq data consists of observed counts of molecules generated from different genes in different cells. Sometimes however, authors only publish counts that have been scaled. One very popular unit is CPM, counts per million, where the counts in each cell is divided by the sum of counts, then multiplied by a million. Recently I have also noticed data with similar scaling to 10,000 or 100,000, or to a median across the data set.

To compare data from different sources, it is easier to interpret differences if you know the per-cell total counts. It allows you to think of the counts in the data in the context of exposure. I recently wanted to look at some 10X data, but it turned out the expression matrix did not consist of counts.

Looking at the mean-variance relation for all the genes in the data it was clear that they had been log-transformed.

Even after exponentiating the expression levels, they were not integers. However, looking at the sum of expression levels in each cell it could easily be seen that the counts had been scaled to 10,000 counts per cell.

One would think that this is a lossy transformation: we wouldn’t know how large the magnitudes in a cell were before division. However, the very problem with scaling of counts means we can recover the value used to scale the cell.

It’s not clear if this is a biological property or a property of the measurement process, but out of the 20,000 or so genes, in scRNA-seq data of any given cell, the vast majority of genes will have 0 counts. Follows by a substantial number of genes with 1 counts, followed by a slightly smaller number of genes with 2 counts, etc.

Since the scaling by division just changes the magnitude of the counts, but not the discrete nature, the most common value will correspond to 0, the second most common value to 1, etc. This plot of a few random cells make this relation clear.

Just looking at the second most common value in the histogram of a cell will give us the factor used to scale the cell! (For the sake of some robustness, I used several ranks in the histogram and the corresponding scaled counts and fitted the scaling factor with least squares, but the difference from the second most common value was all in all negligible).

By simply multiplying the cells by the identified scaling factors, the true counts are recovered. After un-scaling, the typical relation between total count and detected genes is clear, and the negative binomial mean-variance relation for genes is apparent.

I wanted to write about this, because the fact that it is possible to recover the original counts after scaling indicates that just scaling the counts is not sufficient for normalization. The properties if discrete counts are still present in the data, and will be a factor in every analysis where these scaled counts are used. It will make interpretation hard, since the relation between magnitude and the probability of generating values magnitude will not be tied to each other as clearly as with counts.

Notebook with this analysis is available here.

# Resolving immune cell fates in malaria

Malaria is one of the world's deadliest diseases with 200 million cases in a year, leading to half a million deaths. It is caused by mosquito-borne Plasmodium parasites. During parasitic infections our adaptive immune system responds through a complex series of cellular interactions, precipitated by events at the molecular level. The dynamic ability to suppress and activate genes allow immune cells to maintain surprising plasticity. Switching the states of the cells in response to infection require a network of gene regulatory events to occur.

At the heart of the adaptive immune system are T helper cells. Under healthy conditions, naive T helper cells are quiescent and roam through the body awaiting signals. Once T helper cells receive activation signals they differentiate into different roles depending on the type of infection. In malaria, naive cells differentiate into two distinct populations with different roles: Th1 cells, which signal macrophages and killer T cells to attack infected cells. And Tfh cells, which activate survival programs and promote antibody affinity in B cells.

Differentiation into a role is orchestrated by a small number of transcription factors regulating expression of large sets of genes. Identifying transcription factors that direct the differentiation fate is of key importance to understand the immune system, and to what degree in vivo cell-extrinsic factors bias the cell fate has not been studied in detail.

Immunology tightly links cellular biology and molecular genomics. Recently single cell RNA sequencing has gained massive popularity to investigate cells with this joint perspective (1). This technology allows researchers to measure expression of all genes in thousands of single cells. Immune cells however have particularly low amounts of RNA and many important genes, such as transcription factors, are present at low levels. To ensure scRNA-seq would be useful for us we evaluated many different technologies with ground truth molecular input levels (2). To our surprise we found that expression could be quantified even from tens of copies of RNA molecules!

To chart the immune response to malaria we designed a large scRNA-seq time course experiment using a mouse model. We collected T helper cells, monocytes, and dendritic cells from mice infected by the malaria parasite Plasmodium chabaudi at five points over seven days (3).

We used Gaussian process methods to identify and investigate genes with expression dynamics over time. These are nonparametric machine learning models that can represent and identify any smooth dynamic function. (Later we extended this method to identify spatially variable genes in spatial tissue gene expression systems (4).)

Unfortunately five time points are not sufficient to make conclusions about the dynamics of gene expression. However, we could exploit the fact that the immune cells are not synchronized in their response. An extension of the Gaussian process model let us estimate latent per-cell “time points” which could be earlier or later than cell collection times. This Gaussian process latent variable model simultaneously model many genes as different smooth functions, and correct time values so genes are consistent between each other. Now we could investigate the temporally dynamic genes and looked at when they started, peaked, and stopped expression. This way we learn the order transcriptional programs like activation or proliferation.

The most crucial aspect of our experiment was that the T helper cells had two different fates. Meaning any gene’s dynamic expression was the result of either of two functions: a Th1 fate function, or a Tfh fate function. We needed a way to analyse gradual commitment to either fate, without knowing which cells were on which path.

We solved this problem using a mixture of overlapping Gaussian processes. A mixture model is a statistical way to formalize the cell fate assignment problem described above. Machine learning algorithms can be used to infer the most likely groupings of cells into the components of the mixture. In our case cells would be assigned to either a “Tfh component”, or a “Th1 component”, with ambiguous assignments for uncommitted cells.

After fitting the mixture model to assign cells to fates, we could investigate how each individual gene related to the fates. This was the final key to uncovering transcriptional programs for the fate choice. This recapitulated the known Tfh markers Bcl6 and Cxcr5, but in particular let us associate the maintenance of Tcf7 with the Tfh fate. The mixture model identified the Tcf7 antagonist Id2 as the main transcription factor associated with the Th1 fate, along with many known Th1 genes.

The peak times of dynamic genes could be linked to when cells started committing to fates. This identified receptors which were compatible with signaling molecules expressed by the monocytes we had collected in parallel. We confirmed that mice with reduced numbers of monocytes produced smaller fractions of Th1 cells. (Mice with reduced number of B cells on the other hand produce smaller fractions of Tfh cells).

It was possible that T helper cells driven to either fate by the communicating cells had already been primed to respond to only these cells. Using the T cell receptor sequence in the T helper cells as a natural clonal barcode, we found that both Th1 and Tfh cells shared sequences, and must have originated from common naive progenitors.

In summary our data indicate that in malaria activated T helper cells start expressing Tcf7. After proliferation rate acceleration the cells express receptors which receive signals from monocytes or B cells. Cells receiving B cell signals maintain Tcf7 expression and become Tfh cells, while cells receiving monocyte signals replace Tcf7 expression with Id2 expression and become Th1 cells.

Biology is entering a new era with the rich measurements from single cell RNA sequencing. It allows us to use machine learning to ask explicit questions about the data. The power is particularly clear from the joint cellular and molecular perspective. Going forward I hope for more developments that allow us to ask and answer questions from this perspective, and more opportunities for methods like the overlapping mixture of Gaussian processes that can link the two.

### References and notes

1. V. Svensson, R. Vento-Tormo, S. A. Teichmann, Exponential scaling of single-cell RNA-seq in the past decade. Nat. Protoc. 13, 599–604 (2018).
2. V. Svensson et al., Power analysis of single-cell RNA-sequencing experiments. Nat. Methods (2017), doi:10.1038/nmeth.4220.
3. T. Lönnberg et al., Single-cell RNA-seq and computational analysis using temporal mixture modeling resolves TH1/TFH fate bifurcation in malaria. Science Immunology. 2, eaal2192 (2017).
4. V. Svensson, S. A. Teichmann, O. Stegle, SpatialDE: identification of spatially variable genes. Nat. Methods. 15, 343–346 (2018).
5. I. C. Macaulay et al., Single-Cell RNA-Sequencing Reveals a Continuous Spectrum of Differentiation in Hematopoietic Cells. Cell Rep. 14, 966–977 (2016).

This post was originally an essay I submitted for a competition on work done in the field of genomics during PhD. Thanks to Sarah Teichmann, Lynn Yi, and Eduardo Beltrame for feedback on the text.

# Integrating scRNA-seq and spatial STARmap data from mouse frontal cortex with scVI

In the last few months a number of interesting brain single cell datasets have been published. The new promising STARmap method was pusblished by Wang et al in Science using mouse brain data as example applications. Even more recently, two back to back mouse brain atlases were published in Cell. Zeisel et al characeterized the nervous system of adolescent mice through sequencing over 500,000 cells from 19 different regions using the 10X Genomics Chromium technology. Saunders et al studied adult mouse brain using almost 700,000 cells from 9 regions of the brain using the Drop-seq method.

All three papers contain some data from the frontal cortex region. An interesting exercise is to see how these datasets relate to one another, and if they can be analysed as a single entity.

The Wang et al STARmap data measures a panel of 166 genes in the medial prefrontal cortex (mPFC), across 3,700 cells from three separate biological samples. The STARmap data does not have any clusters annotated, but they do have spatial locations in the tissue preserved! The data from Saunders et al and Zeisel et al are transcriptome wide, so much larger. But it would be interesting to see how well the small panel of STARmap genes would work to integrate and analyse these datasets.

The anterior cortex data from Zeisel et al consists of 14,000 cells with annotations for 53 clusters (though some of these seem to be enteric neurons which is a bit odd?). The frontal cortex dataset from Saunders et al has over 70,000 cells annotated with 62 clusters. For the sake of simplicity, here I randomly sampled 15,000 cells from the Saunders data.

When the datasets are combined we get an expression matrix with 32,000 cells with counts in 158 genes. To analyze the data together I use scVI an autoencoder based method that have been working well in my current research work. Technology specific effects can be accounted for by providing them as batches.

Because we are looking at very few genes here, running inference on the scVI model takes about 10 minutes for 100 epochs. The scVI model then gives you access to a low-dimensional (10 dimensional in this case) latent representation space which can be used to generate counts for all genes consistent with the structure in the original data. The main point with this is that many machine learning and statistical methods that have problems with count data

Once the scVI model is fitted, it is a good idea to create a tSNE of the latent space, and color the cells by various known factors to try to figure out what might contribute to variation in the data.

We see that similar patterns of densities overlap for all the datasets. Of course we need to remind ourselves not to read too much into the tSNE representation; the important part is the 10-dimensional latent representation of the data.

We can note that on the larger annotated scale annotation from the Zeisel and Saunders we can see overlap between related terms in the tSNE! Immune cells close to immune cells, neurons close to neurons, etc.

We can also color the cells by the complete cluster annoation. But with so many clusters it becomes hard to tell the colors apart! (And here I even collapsed some of the Saunders et al cluster names so they would fit in the plot.)

To learn about how the different annotations in the different datasets relate to each other we can use Ward linkage on the cell group centers to create a common dendrogram for all the groups. Because of the design of the latent space, we can simply use Euclidean distances. First we look at the broad cell classes from both the annotated datasets.

From the dendrogram it is clear that the class annotations are consistent between the datasets, as we observed in tSNE above.

We turn then to looking at all 110 annotated clusters in the frontal cortex data from the two publications. This dendrogram is on the larger side, bu the text is stil legible. While it seems pretty consistant within a given dataset, the mixing does not seem completely right. For example neurons in the Saunders et al data are mixed with oligodendrocytes from the Zeisel et l data. This is not necessarily terrible, we are only working with 158 genes from the STARmap panel, the original studies used thousands of genes to define the cell types.

The STARmap cells do not come with cell type annoations. They do however come with spatial locations for every cell! We can try to link the spatial locations in the STARmap data with the cell type annotion in the othe data sets.

scVI has specialized functionality for these tasks, both for semi-supervised learning to annotate cells, and for working with combined spatial and scRNA-seq datasets. Here however we will just do something simple to see how it works out.

Since Euclidean distances are meaningful in the latent 10-d space, we can use a kNN classifier using clusters or classes as targets.

Now we can plot the spatial locations of the STARmap cells, but annotate the cells with predicted clusters from the scRNA-seq datasets. This way we can qualitatively evaluate whether the clusters are consistent with spatial structure in the brain.

In these plots above, each "stripe" consists of different biological samples, and it wouldn't make sense to plot the cells from different biological samples in a single plane. I think they are even from different mice.

From this it seems that prediction to the annotation from Saunders et al seem more spatially organized in the STARmap data.

The cell types in the Zeisel et al data are predicted more spread out in the STARmap data. I suspect there is an issue with the cell type annotation here, since excitatory neurons (TEGLU*) are mapped to the region at the right which is supposed to be non-neuronal. I might have misunderstood how the cell clusters are indexed in the Loom files for cortex. (There is also an odd thing where the website says there are 7,000 anterior cortex cells, but the Loom file contains 15,000).

A notebook with all analysis, including fitting the scVI model, is available here.

# Identify tissue structure with Automatic Expression Histology

A while after the invention of the microscope, scientists started applying various chemical dyes when looking at biological samples in the form of tissues from diseased organs. This revealed anatomical structures on the microscopic scale in many plant and animal tissues, birthing the field of histology. Microscopy of stained tissue is still an enormously important field, and histopathologists have devised metrics for diagnosing many types of conditions and diseases based on patterns of cells in stained tissues.

The original classification of the six neuronal layers in the cerebral cortex for example was made on the basis of morphology observed using “Nissl staining” which highlights the endoplasmic reticulum.

Since the original studies of brain microanatomy in the late 19th century we have figured out that the functions and phenotypes of cells (including neurons) are determined by regulation of gene expression programs. Many histological assays are now defined by examining particular gene products examined using immunohistochemical assays.

Researchers are now making extremely rich molecular measurements with preserved spatial coordinates in tissues. At the same time as genome wide measurement are being miniaturized into technologies that preserve spatial location in tissues, technologies for counting individual mRNA molecules in cells using imaging are being increasingly parallelized. We are reaching a point where these two approaches to spatial gene expression analysis are converging to similarly rich assays, with the ability to look at thousands of gene products in thousands of individual cells in tissues.

Similar to how pathologists have defined visual morphological “markers” for cell types in tissues, it is possible to look directly at gene expression or protein abundance throughout the tissue to learn about its structure.

The function of regions of tissues will in many cases be determined by sets of co-expressed genes. This means there is a coupling between spatial locality of function and particular sets of genes.

The automatic expression histology (AEH) model which we published in Nature Methods earlier this year attempts to explicitly model this coupling through a probabilistic model:

\begin{align*} P(Y, \mu, Z, \sigma^2, \Sigma) &= P(Y | \mu, Z, \sigma^2) \cdot P(\mu | \Sigma) \cdot P(Z), \\ P(Y | \mu, Z, \sigma^2) &= \prod_{k=1}^K \prod_{g=1}^G \mathcal{N}(y_g | \mu_k, \sigma^2)^{(z_{g, k})}, \\ P(\mu | \Sigma) &= \prod_{k=1}^K \mathcal{N}(\mu_k | 0, \Sigma), \\ P(Z) &= \prod_{k=1}^K \prod_{g=1}^G \left( \frac{1}{K} \right)^{(z_{g, k})}. \end{align*}

Here $$Y$$ is a set of vectors of $$G$$ genes observed in $$N$$ locations in the tissue. The first couple of equations just describe a probabilistic model of clusters: a Gaussian mixture model with $$K$$ components. Here the matrix $$Z$$ is binary, and only has one non-zero entry per row which assigns an observation to a component. The last two lines though specify that means in the GMM should be generated from smooth spatial functions, thus giving the form of a mixture of Gaussian processes. The covariance matrix $$\Sigma$$ encodes spatial locality.

Modeling the spatial functions with Gaussian processes means you do not need to know what they look like, we just assume they are somewhat smooth.

In this model the expression levels of genes are governed by underlying shared spatial functions. The information we know from the experiment is $$Y$$ and $$\Sigma$$. We don’t know what the functions look like ($$\mu$$ and $$\sigma$$) and we don’t know which genes share which function ($$Z$$).

This figure gives a visual representation of the idea of how the binary matrix $$Z$$ assigns genes $$g$$ to underlying patterns $$k$$:

Implemented in the SpatialDE package there is a Bayesian inference algorithm for learning posterior probabilities of the assignments $$z$$ between genes and hidden spatial patterns that, together, give rise to spatial co-expression. In our paper we applied it to breast cancer tissue and found a set of immune genes spatially co-expressed in a region of tumor infiltration. We also applied it to mouse olfactory bulb data and got a clear picture of the layer structure of the organ. As well as spatial organization of cell types in a small region of mouse hippocampus.

As a new example here, we can apply the method to an extremely interesting recent dataset that was produced using the novel STARmap technology. This protocol uses in situ DNA sequencing to decode barcodes bound to mRNA molecules within cells of a given tissue, integrating the concepts of CLARITY tissue clearing, in situ FISSEQ, and seqFISH/MERFISH combinatorial gene tagging.

The dataset we use here as an example probed 1,020 genes in in 1,549 single cells over a region of size of about 700 µm by 400 µm in the primary visual cortex of a mouse (corresponding to Figure 4 in the STARmap paper).

First we run the SpatialDE significance test, and find that 101 of the genes significantly depend on spatial locations of the cells in the tissue. To perform AEH, we use these genes and set the characteristic length scale in the tissue. Setting these parameters is not trivial, and here we are a bit helped by domain knowledge. We happen to know that layers are on the order of a couple of 100 µm, and we expect about eight of them. We then set the length scale to be 100 µm and select eight underlying spatial patterns to assign the genes to.

After running AEH we are given assignments of the genes, as well as the underlying spatial function for a pattern, which is evaluated for all the cells in the dataset/tissue.

HPC: hippocampus, cc: corpus callosum, L1-L6: Layers 1 to 6.

Approximately annotating the image by matching it with the figure in the STARmap paper, we see that the patterns learned by AEH match up with the classical layers of the cortex. While we used the number of layers when running AEH, no information about the structure of them were provided to the model. The layer structure that was originally observed by the morphology of stained cells was now automatically identified by the molecular measurements. One way of viewing this is that we now have access to 101 “colors” from the genes, and the “hues” we get by mixing combinations of them in the different layers form patterns that are easy to see computationally.

Compared to other spatial data this dataset is on the larger side, measuring 1500 locations, but even so AEH converges in a reasonable time. Of course the data sets will get larger, but seeing these technologies spread to more research labs is going to be very interesting. The same way many cell types have been discovered by single cell RNA sequencing, many new microanatomical structures will be discovered in many organs using these spatial gene expression methods. I hope tools such as AEH in SpatialDE will be helpful to the fields of tissue and cell biology.

The analysis of the STARmap data can be seen in a Jupyter notebook here.

Thanks to Carlos Talavera-Lópes for feedback on the post.

# Count based autoencoders and the future for scRNA-seq analysis

Two recent computational methods, scVI by Lopez et al and DCA by Eraslan & Simon et al, are the most promising methods for large scale scRNA-seq analysis in a while. This post will describe what they do and why it is exciting, and demonstrate the results of running them on three recent datasets.

In almost all cases, to do anything useful in scRNA-seq the tens of thousands of genes measured need to be summarised and simplified. An extremely effective way to summarise many variables through their covariance structure is principal component analysis (PCA). However, scRNA-seq data consists of counts, which have particular behaviours that cause issues with the interpretation and of PCA.

Potential ways of dealing with this is either figuring out how to transform count data to emulate the characteristics of continuous Gaussian data, or to reformulate PCA for the count setting. While data transformations have historically had some success, they don’t perform so well for low count numbers, which is the case when massive scRNA-seq experiments are economically feasible.

A couple of years ago Risso et al successfully created a generalized linear factor analysis model for scRNA-seq counts called ZINB-WaVE, based on a zero-inflated negative binomial (ZINB) count distribution. Omitting some further features, in this model underlying rates of observations of mRNA molecules from specific genes are modelled by a low-dimensional collection of continuous factors. Every cell has a particular hidden value for these factors, and to fit the model all cells are investigated and assigned the most likely factor values. (In these equations, red color indicates parameters that need to be inferred.)

\begin{align*} Y &\sim \text{ZINB}(\mu, \phi, \pi) \\ \log \mu &= {\color{red}{W}}^{T} {\color{red}{X}} \\ \cdots \end{align*}

With these ZINB based methods, the data does not need to be scaled, nor normalised, and in principle the common step of selecting highly variable genes is not necessary. The data just plugs in.

Inherently, learning the $$X$$ means that each cell is compared to all other cells. The factors ($$W$$) can be investigated in attempts to deduce meaning, and in that way we gain knowledge. But if you show the model a new cell $$y$$, it doesn’t know what to do with it. The inference will need to be rerun with the entire dataset including this new cell.

Two new methods, called scVI (single cell variational inference) and DCA (deep count autoencoder) rethinks this model, by moving from factor analysis to an autoencoder framework using the same ZINB count distribution. (Their titles and abstracts phrase them as imputation methods, which is a bit odd and substantially undersell them!) The two methods have slightly different parameterizations, but conceptually (abusing notation a bit), this represents with they both do:

\begin{align*} Y &\sim \text{ZINB}(\mu, \phi, \pi) \\ \log \mu &= f_{\mu}(X; {\color{red}{\theta_f}}) \\ \cdots \\ \hat{X} &= g(Y; {\color{red}{\theta_g}}) \end{align*}

A parametric function from the observed space (gene counts) to a low-dimensional space is fitted ($$g$$), at the same time as a function that maps the low dimensional space to ZINB parameters ($$f$$). For any given cell, you can apply $$g$$ to get its $$X$$-representation, then apply $$f$$ on that to get parameters, and evaluate the likelihood of the cell. The functions $$g$$ and $$f$$ are parameterised as neural networks because these are flexible and efficient.

This unlocks a lot of benefits. For inference, this setup makes it easier to use stochastic optimization, and is directly compatible with inference on mini-batches: you only need to look at a few cells at a time, so no matter how many cells you have, you will not run out of memory. Scientifically, this allows you to generalize. You can apply $$g$$ to any new cell and see where it ends up in the $$X$$-space.

By analysing how different regions of the $$X$$-space map to gene expression, markers and differential expression can be investigated. And on the converse, if you perform clustering in the $$X$$-space, you can take new cells and evaluate which clusters the $$g$$ function maps them to.

To illustrate what the methods do, we will apply them to three recent datasets. I picked smaller datasets on the order of ~2,500 cells because I wanted to quickly run them on desktop just for testing. One from Rosenberg et al 2018, where I randomly sampled 3,000 out of 150,000 developing mouse brain cells. The second one is a taxon of Peripheral sensory neurons from mousebrain.org, which is part of Zeisel et al 2018. And finally a dataset of male mouse germ cells that I found on GEO, but I couldn’t find an associated paper for (Lukassen 2018). I ran both DCA and scVI with default parameters: DCA produces a 32-dimensional representation, and scVI a 10-dimensional. To qualitatively inspect the results I ran tSNE on the representations, and colored the cells based on labels provided from the data sources.

The DCA method is implemented around the anndata Python package, and is very easy to run on any data you have. The scVI implementation requires you to manually wrangle your data into TensorFlow tensors with correct data types, which can be frustrating if you are not used to it. This does however imply that if you want to scale the inference using out-of-core strategies, scVI directly supports that.

In terms of run time, DCA was much faster than scVI, finishing in a few minutes. scVI took about an hour for each dataset. A large component is probably that DCA implements automatic early stopping, while scVI will run for as many epochs as you tell it, even if the fit doesn’t improve.

The scVI method has the option to account for discrete nuisance variables (batch effects), but I did not try this. And even without it, it seems to align the two mice quite well in the Lukassen 2018 data!

I am curious if there is a way to also account for continuous nuisance parameters (e.g. amplification cycles). In ZINB-WaVE this is straightforward, because it is a GLM, but it is not so clear here. I mentioned clustering in the X-space, it might be possible to formulate these models as structured autoencoders (SVAE), and encourage the $$g$$ function to learn a representation that favours cell type segmentation. Notebooks where I ran the methods on the different datasets are available here.

# Illustrating spatially variable genes with a 1-dimensional zebrafish

The other day SpatialDE, our statistical test for spatially variable genes in spatial transcriptomics data was published in Nature Methods. The goal of the method is to find genes which in some way depend on the spatial organization of tissues, and for our paper we analysed generated with Spatial Transcriptomics, SeqFISH, and MERFISH. As a demonstration of how to use SpatialDE I also analysed some recent heart tissue. The underlying principle in SpatialDE can be hard to get to grips with, and the aim of this post is to provide an example that hopefully explains it.

One very practical reason it can be hard to see what is going on is that it is generally difficult to make plots to visualize data in more than 2 dimensions so that noise is easily interpreted. In terms of spatial gene expression, it is then easiest to make an example with a 1-dimensional tissue, and illustrate things that way.

There is actually a spatial transcriptomics dataset which is in fact 1-dimensional! In a study by Junker et al, the authors studied zebrafish embryos at the 15 somites stage by straightening them, then segmenting them into about 100 slices, each roughly 18 µm thick. For each slice digital RNA sequencing was performed providing expression levels for the entire transcriptome, annotated with the head-to-tail coordinates of the slice: a protocol they call “tomo-seq”. This provides an excellent example to illustrate what SpatialDE mean by “spatially variable genes”.

## Defining spatial gene expression

Without pre-specifying the shapes of gene expression we need a way to express properties of how genes could behave across the coordinates. An effective way to look at this is to consider spatial coordinates as locally informative. If you know expression in a coordinate, then nearby coordinates should be pretty similar, while coordinates further away are still somewhat similar. In statistical terms, we say that coordinates $$x_i$$ and $$x_j$$ covary in a way that decreases with distance between $$x_i$$ and $$x_j$$. A common way to express this is through the Gaussian covariance function: $cov(x_i, x_j) = \exp \left(-\frac{|x_i − x_j|^2}{2 \cdot l^2}\right)$

The parameter $$l$$ describes the scale at which expression levels are locally similar. We illustrate the meaning of this by considering a length scale of 126 µm in the zebrafish embryo.

In this example, we don’t know what the expression values are in any coordinate, but we expect that expression levels 5 slices apart from each other will have a correlation of about 0.8, while expression levels 10 slices apart will have a correlation of about 0.4. In other words, genes may be expressed in or close to the head while not expressed at all in the tail, or vice versa. The right side of the plot shows a couple of expression curves drawn from a multivariate normal distribution with this covariance matrix.

\begin{align*} y_g &\sim \mathcal{N}(0, \Sigma), \\ \Sigma_{i, j} &= cov(x_i, x_j) \end{align*}

Models like this where the covariance between points are parameterized are known as Gaussian processes.

## Spatial and non-spatial variance

The curves above are random draws, but have you ever seen data that looks so clean? More likely there is additional noise which we cannot explain by only considering the spatial covariance between zebrafish slices. This can be expressed as $y_g \sim \mathcal{N}(0, \sigma^2 \cdot (\Sigma + \delta \cdot I))$

The $$\sigma^2$$ parameter is the overall magnitude of the variance of $$y_g$$, while the $$\delta$$ parameter scales how much of the variance not determined by spatial covariance. With these quantities (together with a correction for the structure of $$\Sigma$$ called the Gower factor $$G(\Sigma)$$) we can move to a measure we refer to as “fraction spatial variance” (FSV), FSV $$= G(\Sigma) / (G(\Sigma) + \delta)$$. This is a number between 0 and 1 which indicate how much of the variance in data is due to spatial covariance. To illustrate, we can randomly simulate data with different levels of FSV.

The simulated data in the panels have the same variance. But we can see that the spatial contribution to the variance decreases as the FSV decreases.

The SpatialDE test fits the $$\sigma^2$$ and $$\delta$$ parameters to each genes observed expression levels, and also compares the likelihood with a model that has no spatial variance (FSV = 0) to obtain a significance level (p-value). This way it is possible to distinguish between genes with high variance and genes with high spatial variance. The $$l$$ hyper-parameter for the $$\Sigma$$ matrix is also fitted through grid search, so for each gene we also get the optimal length scale. (This dataset had 95 samples of 20,317 genes, and the SpatialDE test took about 5 minutes to run).

As an illustration, let us look at the expression patterns of a number of genes identified with different FSVs in the zebrafish embryo.

For the higher FSVs it is clear the expression depends on spatial location, while this is less clear for small FSVs. We can look at the relation between total variance and spatial variance, and how the FSVs are distributed compared to these.

On the right side the standard plot from SpatialDE illustrates the relation between the FSV and the statistical significance for all genes. I put in references to the genes I used as examples, as well as some genes that were highlighted by Junker et al in the original paper. In total SpatialDE identifies 1,010 genes as significantly spatially variable, with most of those genes having length scales between 65 µm and 244 µm.

Hopefully this one-dimensional example helps clear up what we mean by “spatially variable genes”, and also illustrate that there are many forms of data that can use spatial analysis. If you have spatial gene expression data, I recommend trying SpatialDE to identify genes of interest. The analysis associated with this post is available in a notebook here. If you want to use SpatialDE for your spatial data, you can install it by pip install spatialde. Thanks to Lynn Yi for editorial feedback on this post.

# Actionable scRNA-seq clusters

Recently Jean Fan published a blog post about "real" scRNA-seq clusters. The general idea is that if there is not differential expression between clusters, they should be merged. This is a good idea, and putting criteria like this highlights expectations of what we mean by clusters, and may in the future direct explicit clustering models that incorporate these.

The workflow Jean presents is similar to how I have been looking at these things recently, and the post inspired me to 1) write down my typical workflow for cell types or clusters, and 2) put my code in Python modules rather than copy-pasting it between Notebooks whenever there's new data to look at. Of course in this post I'll be making commands and images much more stylized than they would typically be, but the concept is representative.

## Simulated data

To assess whether an approach is reasonable, it is good to make some simulations of the ideal case. To simulate cells from different cell types, I make use of 1) a theory that cell types are defined by "pathways" of genes, and 2) observations from interpreting principal component analysis from many datasets. These indicate that to produce a transcriptional cell type, a small number of genes are upregulated and covary among each other.

With this in mind, simulation is done by, for each cell type, creating a multivarate normal distribution for each cell type, which have increased mean and covariance for a defining "module" of genes. From this distribution a number of cells are sampled, and expression values are pushed through a softplus function to immitate the nonnegative scale and properties of log(counts + 1). The process is repeated for each cell type, leaving a number of genes as inactive background.

In this particular case I simulate 10 cluster spread over 1,000 cells, with 20 active marker gene per cluster, and finally add 300 "unused" genes.

from NaiveDE import simulate data = simulate.simulate_cell_types(num_clusters=10, num_cells=1000, num_markers_per_cluster=20, num_bg_genes=300, marker_expr=1.8, marker_covar=0.6)

Here a tSNE plot is used as a handy way to look at all the cells at once. The most unrealistic part of this simulation is probably the uniform distribution of cell numbers per cell type. In real data this is very uncommon.

To cluster cells it is pretty effective to work in the space of a number or principal components. I like to use Bayesian Gaussian mixture models to group cells into clusters in this space. First I will cluster with an overly large number of clusters.

from sklearn.decomposition import PCA pca = PCA(15) Y = pca.fit_transform(data) from sklearn.mixture import BayesianGaussianMixture def get_clusters(K): gmm = BayesianGaussianMixture(K, max_iter=1000) gmm.fit(Y) c = gmm.predict(Y) return c c = get_clusters(25)

In this tSNE plot each color correspond to a cluster. As Jean points out, for a cluster to be useful in followup experiments we must be able to define it with a small number of genes. That is, there should be some genes which will allow us to predict whether a particular cell belongs to the cell type or not. For many this is a bit reductionist, and t is not impossible functional cell types are defined by hugely complex nonlinear interactions of hundreds of genes. But in practice, we wouldn't know what to do with such cell types.

The definition of an actionable cell type being one we can predict leads to predictive models. In particular regularised logistic regression is good for this. By controlling the regularisation) so that each cluster has a "marker budget" of only a handful genes, we can ensure that a few markers can predict the types. It is very rare to look at more than ~5 top genes when assessing clusters in practice.

ROC curves from the predicted assignment probabilites in logistic regression is a practical way to assess whether we are able to predict the cell types correctly. Here I split the data into 50% training and testing, train logistic regression, then create the ROC curve for the test data, for each cell type.

The printed numbers in the training command are the number of positive markers for each cell type. I interactively change the sparsity parameter to keep these numbers generally low. In this regard this is all quite supervised and subjective, and far from automated.

from NaiveDE import cell_types test_prob, test_truth, lr_res, lr = cell_types.logistic_model(data, c, sparsity=0.5) [26 24 34 31 8 31 10 10 42 24 17 8 10 39 17 39 5 7 12 7 21 19 10 9 14] cell_types.plot_roc(test_prob, test_truth, lr)

The colors of the curves here correspond to clusters. A number of the clusters are relatively close to the unit line, indicating that we have a hard time predicting these, and they will not end up being actionable. So we decrease the K and try again. We iterate this procedure until we are happy with the predictability of the clusters.

Here I have also plotted cases of "under clustering" the data. The larger clusters are still pretty predictice, but we would want to maximize the number of clusters which would be experimentally actionable.

In the case of K=10 (which is what we simulated), we can also look at which genes the predictive model is using for each cell type. We can use the get_top_markers command to extract the N largest weights for each class in a handy table.

top_markers = cell_types.get_top_markers(lr_res, 5)

To visualise how these weight relate the expression of cells with different cluster annotations, we can plot a "marker map", which sorts cells by cluster, and plots the top marker genes in corresponding order on the Y axis. This is a very common plot in scRNA-seq cluster studies.

cell_types.plot_marker_map(data, c, top_markers)

We see that structure we simulated is largely recovered! The diagonal blocks indicate genes which predict the cell types.

## Application to bone marrow data

Now we try this strategy on real data. In particular, we are using one of the batches of bone marrow data from the recently published Mouse Cell Atlas. This batch has 5,189 cells and expression values for 16,827 genes. For the sake of speeding up the analysis a little I randomly sample 3,000 of the cells.

Now we perform the same procedure of training GMM's and attempting to predict held out data with logistic regression.

At 7 clusters I stop, here the clusters are very easy predict. Again we can create the marker map.

Obviosuly this is a lot noisier than the simulated data. Another thing we notice is that the number of cells per cluster is much less even than for the simulated data. This will cause some issues with interpreting the ROC curves, but in practice we want to try to have some minimal size for clusters in order to keep them reliable.

In order to quickly interpret the clusters, and read out what the predictive genes are, the plot_marker_table command will lay out predictive weights and names of top marker genes for each cluster, the colors relate to the colors in the plots.

cell_types.plot_marker_table(top_markers, lr)

I find this workflow fairly straightforward and quick to work around. There are some clear drawbacks of course: tt is quite manual, and we are not quantifying the uncertainty of these predictive weights, so we can't do proper statistics.

Notebooks for this post are available here.

# The effect of Poisson zeros on OLS regression results

In a previous post I wrote about the Poisson distribution seeming like a good error model for scRNA-seq counts. This suggests using GLM with Poisson likelihood to analyse your data, as long as the offset due to count depth variation is taken into consideration.

An alternative strategy could be to transform the counts to roughly normal, and perform analysis in that setting. This is effectively what the vast majority of studies do for unsupervised analysis: counts are transformed, then PCA is used to find a low-dimensional representation for further analysis such as clustering.

What if we try to adjust for the count depth variation in a supervised setting assuming Gaussian noise?

A huge benefit of assuming Gaussian noise is that linear regression has an extremely efficient solution, usually referred to as OLS regression. A couple of years ago I made a simple Python package NaiveDE to perform OLS regression on gene expression matrices. I don't recommend anyone use it for final analysis, indeed I called it "Naive DE" because it is a baseline. Literally every other DE test will be better than it by design, in particular with regards to false positive P-values. (Well maybe not according to a recent study, the test in NaiveDE should be equivalent to the t-test.) It is nevertheless convenient during exploratory analysis to iterate through models

Alternative and null models are specified by Patsy formulas, and significance is calculated with a likelihood ratio test. A Bonferroni corrected version of the P-value is also reported.

For every gene $$g$$ where we have a design matrix $$X$$ and observed counts $$y_g$$ we look at

$$y_g = \mathcal{N}\left( \alpha^T_g X, \sigma^2_g \right).$$

The weights $$\alpha$$ are calculated by OLS, and $$\sigma^2$$ is reflected in the residual errors. For flexibility, intercept is optionally part of the design matrix.

### Negative control data

In the negative control 10X dataset from Svensson et al 2017, the only variation in observed expression should (in theory) be due technical effects, in particular the count depth variation. Here we are using 2,000 cells with 24,000 genes. The most common variance stabilizing transformation of scRNA-seq data is $$\log(Y + 1)$$, so we will investigate how this affects regression.

If the gene counts are scaled per cells, we would want

$$\log\left( \frac{y_g}{\text{counts}} \right) = \log(y_g) - 1.0 \cdot \log(\text{counts}) = \mathcal{N}(0, \sigma^2)$$

We set up a model where the design matrix $$X$$ have the log total counts, and an intercept. Ideally the weights for the log counts should be found to be 1, and the intercept 0. Note that in practice we are always using $$\log(y_g + 1)$$.

%%time lr_results = NaiveDE.lr_tests(sample_info, np.log1p(counts.T), alt_model='~ np.log(total_count) + 1', null_model='~ 1')

CPU times: user 2.98 s, sys: 702 ms, total: 3.68 s Wall time: 3.35 s 

The test produces a table with weights from the alternative model, hypothesis test restults.

print(lr_results.sort_values('np.log(total_count)', ascending=False).head(25))

 Intercept np.log(total_count) pval qval gene ENSG00000198938 -5.501025 1.080920 3.073563e-294 6.147125e-291 ENSG00000198727 -5.600405 1.041435 3.073563e-294 6.147125e-291 ERCC-00002 -2.999032 1.034389 3.073563e-294 6.147125e-291 ERCC-00136 -4.155633 1.017020 3.073563e-294 6.147125e-291 ERCC-00113 -4.297474 1.010625 3.073563e-294 6.147125e-291 ENSG00000198886 -5.615521 1.010178 2.134557e-266 4.269114e-263 ENSG00000198712 -5.144020 1.005586 3.341643e-168 6.683285e-165 ERCC-00096 -2.740023 0.989442 3.073563e-294 6.147125e-291 ENSG00000210082 -4.357098 0.988333 3.073563e-294 6.147125e-291 ERCC-00046 -3.727992 0.979269 3.073563e-294 6.147125e-291 

In this plot np.log(total_count) does not refer to the value, but weight for this variable. Each dot is a gene rather than droplet. The P-value comes from comparing the model with one that does not consider the depth.

The marjority of genes are found to have gene count weights much smaller than 1. It turns out that lowly abundant genes will have delfated total count slopes.

We can look at a few examples of genes with different count depth weights.

From this, it is clear that increased observations on the low count values, in particular 0, are responsible for decrease in the total count weight.

### Differential expression

Now let us investigate how this count depth effect plays in to a differential expression analysis. With all published large scale experiments cataloging cell types, it is getting increasingly easy to simply fetch some data and do quick comparisons. We will use data from the recent single cell Mouse Cell Atlas30116-8). To get something easy to compare, we use the samples called "Brain" and focus on the cells annotated as "Microglia" and "Astrocyte". Out of the ~400,000 cells in the study, these two cell types have 338 and 199 representative cells. On average they have about 700 total UMI counts each, so while the entire study is a pretty large scale, the individual cell types and cells are on a relatively small scale. The final table has 537 cells and 21,979 genes.

print(sub_samples.head())

 ClusterID Tissue Batch Cell Barcode \ Cell name Brain_1.AAAACGCGAGTAGAATTA Brain_3 Brain Brain_1 AAAACGCGAGTAGAATTA Brain_1.AAAACGGAGGAGATTTGC Brain_3 Brain Brain_1 AAAACGGAGGAGATTTGC Brain_1.AAAACGGGCTGCGACACT Brain_2 Brain Brain_1 AAAACGGGCTGCGACACT Brain_1.AAAACGGTGGTAGCTCAA Brain_3 Brain Brain_1 AAAACGGTGGTAGCTCAA Brain_1.AAAACGGTTGCCATACAG Brain_3 Brain Brain_1 AAAACGGTTGCCATACAG cell_type super_cell_type is_astrocyte \ Cell name Brain_1.AAAACGCGAGTAGAATTA Astrocyte_Mfe8 high Astrocyte True Brain_1.AAAACGGAGGAGATTTGC Astrocyte_Mfe8 high Astrocyte True Brain_1.AAAACGGGCTGCGACACT Microglia Microglia False Brain_1.AAAACGGTGGTAGCTCAA Astrocyte_Mfe8 high Astrocyte True Brain_1.AAAACGGTTGCCATACAG Astrocyte_Mfe8 high Astrocyte True total_count gene Cell name Brain_1.AAAACGCGAGTAGAATTA 1088.0 0 Brain_1.AAAACGGAGGAGATTTGC 967.0 0 Brain_1.AAAACGGGCTGCGACACT 543.0 0 Brain_1.AAAACGGTGGTAGCTCAA 679.0 0 Brain_1.AAAACGGTTGCCATACAG 957.0 0 

In a differential expression test you simply include a covariate in the design matrix that informs the linear model about the different conditions you want to compare. Here we are comparing microglia and astrocytes.

%%time lr_results = NaiveDE.lr_tests(sub_samples, np.log1p(sub_counts.T), alt_model='C(is_astrocyte) + np.log(total_count) + 1', null_model='np.log(total_count) + 1')

CPU times: user 705 ms, sys: 136 ms, total: 841 ms Wall time: 707 ms 

print(lr_results.sort_values('pval').head())

 Intercept C(is_astrocyte)[T.True] np.log(total_count) \ Atp1a2 -1.925596 1.840452 0.318532 Sparcl1 -1.008002 1.742278 0.179123 Tmsb4x -3.680027 -2.044908 0.948016 Hexb -2.165802 -2.032087 0.646263 Ctss -1.665139 -1.937761 0.553429 pval qval Atp1a2 3.058918e-162 1.642639e-159 Sparcl1 3.548817e-158 1.905715e-155 Tmsb4x 2.742131e-153 1.472524e-150 Hexb 3.671724e-145 1.971716e-142 Ctss 8.167943e-144 4.386185e-141 

Also in this case we can see that the count depth weights are deflated for lowly abundant genes.

Similar to above, we can look at the relation between count depth and observed counts for a few genes, but we can also make sure to plot the stratifiction into the two cell types and how the regression models are predicting the counts.

Again we can see the overall abundance is related to the slope of the lines. Another thing which seem to pop out in these plots is an interaction between cell type and slope. For example looking at C1qa the slope for the microglia seem underestimated. This makes sense, if this is an effect of count noise at low abundances.

My takeaway from this is that OLS regression might be OK if counts are large, but at lower levels model parameters are not estimated correctly due to the count nature of the data.

Notebooks of the analysis in this post are available here.

# Count depth variation makes Poisson scRNA-seq data Negative Binomial

In the scRNA-seq community the observation of more zero values than expected (called the "dropout problem") is still a concern. The source seems to be an intuition that at such small scales of biological material as RNA from individual cells, molecular reactions lose efficiency due to conceptual stochastic events. The trendiest computational research directions in the field at the moment are probably tied between "how do we do this for a million cells?" and "how do we deal with the dropouts?". In particular droplet based scRNA-seq methods are considered to have more dropouts, often leading investigators that opt for more expensive plate based methods even for exploratory pilot experiments.

In negative control data there is no evidence for zero inflation on top of negative binomial noise, counter to what is commonly suggested (in particular for droplet based methods). A notion that has inspired significant research efforts. A recent interesting report by Wagner, Yan, & Yanai goes even further and illustrates that the Poisson distribution is sufficient to represent technical noise in scRNA-seq data. The authors write that additional variation in gene counts is due to efficiency noise (an observatin from Grün, Kester, & van Oudenaarden that different tubes of reagents appear to have different success rates), and can be accounted for by an averaging approach.

This can be explored by simulating data! Say droplets contain transcripts from 300 genes, whose relative abundance levels are fixed because they come from the same RNA solution. Then a droplet with $$d$$ transcripts can be seen as a draw from a multinomial distribution,

$c_i \sim \text{Multinom} (d, (p_1, \ldots, p_{300})).$

Now each gene will independently conform to a Poisson distribution.

The constant mean-variance relation for Poisson holds (as expected) for this simulation. In actual data, genes with higher abundance are over dispersed, which can be modeled using a negative binomial distribution.

The negative binomial distribution is constructed as a mixture of Poisson distributions, where the rate parameter follows a Gamma distribution. Other Poisson mixtures have also been suggested for scRNA-seq data.

An aspect of real data which our mutlinomial simulation does not account for is that the total counts observed in each droplet is variable. Indeed, usually a cutoff at some low number of total counts per droplet is used to decide which droplets captured cells and which only contain background material that is not of interest.

Thinking about the abundance levels of the different genes as rates in Poisson distributions require each observation to come from a constant count depth. If the count depth varies in each observation but the model is not informed of this, it will appear as if the rate for each gene is variable, and this will be more consistent with a negative binomial distribution.

As an illustration, in the simulation, variation in count depths can be included. For simplicity, a uniform distribution is used,

\begin{align} d_i & \sim \text{Uniform}(5.000, 100.000), \\ c_i & \sim \text{Multinom} (d_i, (p_1, \ldots, p_{300})). \end{align}

These new values clearly have the quadratic polynomial mean-variance relation that is typical for scRNA-seq counts.

This indicates we need to handle the differences in count depth. The easiest solution is to simply divide the expression counts in each cell with the total depth, turning each expression value into a fraction.

In the RNA-seq field it is also common to also multiply these fractions by 1 million to form the "CPM" unit.

It is clear that after creating either fractions or CPM will follow a linear relation between mean and variance. However, in both cases there is an offset from the unit relation, and in particular for the CPM unit the variance gets inflated compared to the mean.

The thrid panel shows the result after manually scaling the fractions (through multiplication by 3.5e4) to achieve the Poisson mean = variance relation. (There is probably a closed form expression for the scaling factor that achieves this, and the 1e6 is above this, explaining the variance inflation.)

It is entirely possible that the this type of scaling to create CPM from fractions is one reason people have noticed higher than expected numbers of zeros. For Poisson data, the expected number of zeros at a given mean expression level is given by the function $$e^{-\mu}$$.

The counts themselves follow the theoretical curve quite close, but with an increase of zeros at high expression levels, consistent with negative binomial zeros. 'Fractions' see a large offset of much fewer zeros than is expected given the mean, while CPM see an offset for more zeros than expected. The manually scaled values follow the theoretical curve decently, though far from exactly.

For interpretable analysis, counts should be scaled for total count depth, but this also need to be taken under consideration when looking at the results (e.g. dropout rate). The best solution might be to take inspiration from the field of generalised linear models. In that field offsets are included in models when there is a clear explanation for variation in counts, to convert counts to rates. Clustering or pseudotime methods could be reformulated to the Poisson setting with offsets.

There are some additional aspects to keep in mind. For negative control data where each droplet contains RNA from the same solution, the count depth variability must be technical, but in real samples this could also be due to cells having variable amounts of RNA. For droplet based data one simple reason for the heterogeneity could be due to variation in coverage for DNA oligos on barcoded beads. It is not clear what an explanation for plate based methods would be, and no proper negative control data exist for plate based methods to investigate these properties. On a similar note, the latest single cell sequencing methods based on stochastic schemes for in situ barcoding of cells are impossible to assess with negative control samples.

An R notebook for this analysis is available here. Thanks to Lior Pachter for editorial feedback on this post.

# Low mapping rate 6 - Converting sorted BAM to FASTQ

Some sequencing centers have moved to only work on BAM or CRAM files rather than "raw" FASTQ files. The motivation for this is that CRAM files can be heavily compressed and require less memory for the sequencing service provider. The CRAM files in particular make use of the alignment to a reference genome to achieve better compression performance, especially when the files are sorted by alignment coordinate.

In many forms of sequencing analysis, in particular genetics, coordinate sorted alignments to a standard reference is so standardized it is considered a "raw data" format. Unfortunately, this is not the case for RNA sequencing, in single cells or other. As demonstrated by the previous posts in the series, we are often not even sure what material we have been sequencing. The choice of what cDNA's are in the reference is also an informed choice depending on the experiment. For these reasons, software for working with RNA-seq data require FASTQ's as input.

To convert BAM's/CRAM's to FASTQ, a handy command from the samtools suite is samtools fastq. It is however not mentioned in the documentation or in the help for the command itself that is assumes name sorted BAM/CRAM input. It will not even stop or warn you if your file is coordinate sorted. Instead, it will silently create paired FASTQ files with incorrect read orders.

This issue was recently raised by Davis on twitter.

In particular, the incorrectly generated FASTQ files will have worse performance in terms of mapping statistics, caused by read pairs not originating from the same cDNA fragment. We can compare the result on the same data and reference as we used before, after converting files with name sorted and coordinated sorted CRAM files.

We see that cells which would have a mapping rate of > 75% only have ~40% mapping rate with the incorrect coordinate sorting.

Thomas Keane recommends using the samtools collate command before converting to FASTQ to quickly ensure reads are correctly ordered.

If you have data with lower mapping rates and you were provided BAM or CRAM files by your facility, it might be worth to check the sort order. This can be seen in the first line of a BAM/CRAM file.

$samtools view -H 20003_4#70.cram | head | cut -c1-50 @HD VN:1.4 SO:coordinate @RG ID:20003_4#70 DT:2016-06-07T00:00:00+0100 PU:1 @PG ID:SCS VN:2.2.68 PN:HiSeq Control Software DS: @PG ID:basecalling PP:SCS VN:1.18.66.3 PN:RTA DS:B @PG ID:Illumina2bam PP:basecalling VN:V1.19 CL:uk. @PG ID:bamadapterfind PP:Illumina2bam VN:2.0.44 CL @PG ID:BamIndexDecoder PP:bamadapterfind VN:V1.19 @PG ID:spf PP:BamIndexDecoder VN:v10.26-dirty CL:/ @PG ID:bwa PP:spf VN:0.5.10-tpx PN:bwa @PG ID:BamMerger PP:bwa VN:V1.19 CL:uk.ac.sanger.n Similar to the case of unsorted FASTQ files, you can also notice this by read headers not matching up in the FASTQ pairs. $ head 20003_4#70_*.fastq | cut -c1-50
==> 20003_4#70_1.fastq <==
@HS36_20003:4:2214:5946:43267
CTATTAAGACTCGCTTTCGCAACGCCTCCCCTATTCGGATAAGCTCCCCA
+
BBBBBBBFFFFFF/FFFFFF/<F</FFF/B/<<F/FFF/<<F<B/F/FFF
@HS36_20003:4:2111:19490:55868
GTATCAACGCAGAGTACTTATTTTTTTTTTTTTTTTTTTTTTTGGGTGGT
+
BBBBBFFFBFFFFFFFFFF/<<<FBBBFFFFFFFFFFFFFFFB///////
@HS36_20003:4:2109:19213:11514
TATCAACGCAGAGTACATGGGTGGCGCCAGCTTCAGCAAAAGCCTTTGCT

==> 20003_4#70_2.fastq <==
@HS36_20003:4:2214:5946:43267
GCCCTTAAGTTGATTTGAGAGTAGCGGAACGCTCTGGAAAGTGCGGCCAT
+
BBBBBBFFBFFFF/FFBFFFB/<FFFFF<FFFBFFFFFBFFFFFFFFF<F
@HS36_20003:4:1111:7773:69672
GAGACATAATAAACAAAAAAAGAATAAGATAAAGGAGAGAGAAAAAAACG
+
/<//////<//BF/F/FFFFF/<F/B//</<FF//F/F/F/FFFFFFF</
@HS36_20003:4:2109:19213:11514
CATTACAGGCGCATCTTACGGAATTGGATATGAAATAGCAAAGGCTTTTG

This post was suggested by Raghd Rostom and Davis McCarthy.

# Droplet scRNA-seq is not zero inflated

As scRNA-seq (singel cell RNA sequencing) started to gain popularity users expressed concern about an unexpected number of zero values among gene expressions. That is, for any given gene many cells had not detected the expression, even if it was relatively high in other cells.

It is unclear when this was originally stated, but it has been named the "dropout" problem. A search on Google Scholar will give hundreds of publications discussing the problem of "dropouts" in scRNA-seq, and there are several methods papers explicitly aimed at investigating and dealing with the "dropouts". Typically by imputing zeros to positive values or by stating models which includes a zero-inflation component. These observed zeros ("dropouts") in the data have typically been explained by inefficiencies of molecular reactions, due to the very small volumes of mRNA in individual cells.

In high throughput variants of scRNA-seq assays cells are isolated in (reverse) droplets, within which several molecular reactions occur to eventually give rise to labeled cDNA from expressed genes from each cell. Part of what makes this possible is limiting the sequenced fragments to just single tags from the 3' or 5' end of each transcript. It has recently been observed in statistical analysis that RNA tag counting versions of scRNA-seq data is better explained without additional zero inflation.

Nevertheless, it is common to hear weariness from potential usesrs of droplet based scRNA-seq assays because they are percieved to have a higher amount of "dropouts" than alternative more expensive and manual methods with less throughput.

These observed zeros are consistent with count statistics, and droplet scRNA-seq protocols are not producing higher numbers of "dropouts" than expected because of technical artifacts.

To see this, consider four experiments were solutions of RNA were evenly distributed into droplets, guaranteeing a complete lack of biological variation. One was performed with inDrop (Klein et al), one with 10X Genomics GemCode (Zheng et al), and two with 10X Genomics Chromium (Svensson et al. All datasets have on the order of ~1,000 droplets with RNA, facilitating accurate estimation of e.g. mean or variance for each gene.

It has been observed that expression counts from these technologies follow the negative binomial distribution, in which there is a quadratic mean-variance relationship.

Compared with experiments involving single cells the mean-variance relation is extremely clear in these homogenous datasets.

In negative binomial data, the probability of observing a count of $$k$$ given the mean $$\mu$$ and dispersion $$\phi$$ is

$P( k \,|\, \mu, \phi) = {k + \phi^{-1} - 1 \choose k} \left( \frac{\mu}{\mu + \phi^{-1}} \right)^k \left( \frac{\phi^{-1}}{\mu + \phi^{-1}} \right)^{\phi^{-1}}.$

So the probability of $$k = 0$$ is simply

$P( k=0 \,|\, \mu, \phi) = \left( \frac{\phi^{-1}}{\mu + \phi^{-1}} \right)^{\phi^{-1}}.$

With this function we can visualize theoretical "dropout" rates for various means and dispersion values.

These values can be compared with the empirical "dropout" rate, simply calculated as

$\frac{\text{# cells with gene = 0}}{\text{# cells total}}$

for each gene.

The "dropout" rates for the data without biological variation follows the theoretical prediction. In all datasets the Pearson correlation between theoretical and empirical dropout rates is 99.9%.

Here the $$\phi$$ parameter is different for each dataset, and it is possible that this overdispersion is affected by technical factors. There does however not seem to be any technical contribution to zero-inflation, if it is observed it is instead more likely caused by biological heterogeneity.

# Variance stabilizing scRNA-seq counts

Quantitative sequencing assays in general yield counts. The generative models for different levels of counts are in many ways fundamentally different from continuous distributions such as the more common Gaussian (normal) distribution. The problem is not that the data consists of integers; rounded normal data such as e.g. user ratings of products wouldn't have any particular problems being analysed with normal methods. Counts however are generated by several, cumulative, singular events. With each of the events having some probability of occurring in a given "time interval" or other relevant unit.

For example, if we think about the RNA sequencing process in the abstract we have a large collection of cDNA molecules which are randomly sampled and identified as originating from genes. How often moleculas are identifed from any particular gene tells us something about the abundance or expression level of the gene. The process of counting however implies that variation will propagate as the number of events increase. The effect of this is that there will be an inherent relation between mean (expected value) and variance of counts.

As an example, let us look at two recent datasets. One from Karaiskos et al where the authors mapped out fruit fly development on the single cell level, and another from Stoeckius et al where the authors developed a new method to study RNA and protein expression from cells in tandem.

The mean-variance relation typical for negative binomial distributed count data is quite clear. Negative binomial, or other similar distributions, have been used to study RNA-seq data for a long time. Almost all statistical tests for comparing control vs condition style experiments (differential expression) use generalized linear models assuming count data with these kinds of distributions.

Single cell RNA-seq data is different. Not necessarily because the data wouldn't be suitible for these tools, but rather because differential expression is a minor question of limited interest in single cell studies. By far the most popular use of scRNA-sequencing is to identify groups of cells which are similar to eachother and might correspond to functionally distinct cell types. In addition to such clustering analysis, inference of developmental trajectories are popular, as well as quantifying the degree of variation between conditions.

So unlike bulk RNA-sequencing, the key analysis modality is in terms of multivariate analysis such as clustering or "dimensionality reduction" like PCA. In the coming years I believe figuring out effective ways to think about these issues for count data will be important, especially for sparse counts from low depth!

In the meantime, it is useful to be able to use available methods. Existing methods for clustering or dimensionality reduction are almost always either explicitly or implicitly designed with normal data in mind. (A notable exception being ZINB-WaVE by Risso et al). Any method using Euclidean distances implies normally distributed data.

One way to deal with these problems is to transform the data) in some way which makes it more similar to normal data. The logarithm is a very practical transformation for positive data. In particular ratios are very useful to log transform. With counts though, it is common to observe values of zero, for which the logarithm is not defined. Instead, it is common to perform $$\log(x + 1)$$ for counts $$x$$.

If we apply this transformation to each gene $$i$$, we can investigate a couple of things. First of all, we can see what happens with the mean-variance relation. Secondly, we can display the first few principal components for both of the data sets using the transformed unit. This will represent a form of multivariate analysis.

From the plots we can observe a few things. For higher mean counts the relation between mean and variance (or standard deviation) is gone. However, for lower counts (mean $$< 1$$) there is still some correlation. From the PCA below we can see some subpopulations for each data set.

The 1 here was added due to the observed zero counts. But why 1? What if we used something else? Is this the best we can do? In the end, adding the one was quite ad hoc wasn't it? Or in the words of Arjun Raj the other week:

There is actually some theory we can use here. Our goal was to transform the data in way that removes the mean-variance relation as effectively as possible. This is known as a variance-stabilizing transformation. For example, in bulk RNA-sequencing the DESeq2 package has a function vst() for this (based on the underlying parametric Poisson-Gamma model).

If there is a functional form for the relation between the mean and the variance, e.g.

$Var(x) = g(\mu),$

then the variance can be stabilized by applying the function

$f(x) = \int^x \frac{1}{\sqrt{g(v)}}dv.$

As illustrated in the first plot of the post, for negative binomial data, which generally suits scRNA-seq counts well, we have that

$Var(x) = \mu + \phi \cdot \mu^2,$

where $$\phi$$ is the dispersion for the data. If we plug this into the integral above, and use Wolfram Alpha to solve the integral because I'm not in school anymore, we get

$VST(x, \phi) = 2 \cdot \frac{\sinh^{-1} \left( \sqrt{ \phi \cdot x } \right)}{\sqrt{\phi}}$

It is very easy to find a $$\phi$$ for the data by fitting a polynomial to the observed mean-variance relation. Let's transform our data in this way, and redo the plots for the two data sets.

We can note that this transformation behaves very similarly to $$\log(x + 1)$$. One difference is that the standard deviation is scaled around 1 rather than around 0.5. Any effect on the PCA seems minimal though.

In a 1948 paper Anscombe explored these sorts of tranformations for Poisson and negative binomial data. In addition to the $$\sinh^{-1}$$ form of the solution to the integral, Anscombe also considers an approximation which works for certain ranges of mean and $$\phi$$. The approximation has the form $\log\left( x_i + \frac{1}{2 \cdot \phi} \right).$

This has the same form as the heuristic $$\log$$ transform, but instead of just picking 1 the "pseudocount" is motivated by the data distribution and statistical theory. We also create the same plots as above for data tansformed in this way.

Again the data looks similar to before, and we're back to the situation of having standard deviation around 0.5 for highly expressed genes.

In both of these cases we have assumed a global $$\phi$$ parameter, and found it by polynomial curve fitting of mean vs variance. This is handy because we can use information from all genes and learn something global about the data. Then each gene is transformed assuming a fixed dispersion level.

For these data sets, there's actually no real need to assume a global $$\phi$$. They have thousands of cells providing observations for each gene, we can easily learn individual $$\phi_i$$ for each gene by maximum likelihood. That means we can perform VST of each gene independent of the other genes.

(It should be mentioned that finding all the $$\phi_i$$'s took about 40 minutes for the larger of the data sets, so it's not extremely practical).

When plotting these independent VST values we first see that the relation between mean and standard deviation is much "tighter" for both data sets. We still get an interesting bump for lower expression values, but after a mean of 2.0 the standard deviations are stable at 1.0. (A problem of course is that it's hard to know what "2.0 expression" means here, but it seems somewhat comparable between the two data sets).

Here we notice that the low-dimensional representation in the PCA is different from the previous data transformations. For the left data, we don't see clear clusters anymore, while for the right data some within-cluster covariance seem to follow PC1 better.

Finally, we can try to perform the approximate Anscombe transformation for gene specific $$\phi_i$$ values.

It's hard to say what we are seeing here. There's definitaly no apparant correlation between mean and variance after the transformation. Though the standard deviations are not particularly stable around a value. For very high mean expression, values are transformed to have 0 standard deviation, meaning they are probably transformed to a constant value.

The multivariate PC analysis shows similar results as the first few transformations.

In the end I don't have any particularly good conclusions. The results are in the end somewhat different, especially when considering per gene $$\phi_i$$ values. I have no idea which would "correct" in any meaningful way.

One note though, is that in all cases (except the last) there still is a dependency between mean and variance for genes with very low means. Considering that scRNA-seq as a field is moving towards more cells rather than more counts per cell, this might mean that variance stabilizing transforms are the wrong way to go in modern studies. Instead working directly with count distributions might be a more stable strategy for low counts. There of course is very limited prior work on this, and that is good to keep in mind when working with and planning to make shallow scRNA-seq data.

# Low mapping rate 5 - Human DNA contamination

This is (most likely) the final post in the series investigating the low mapping rate of our Smart-seq2 data from our study on the malaria immune response. If you have read the previous posts, you might have notied a population of cells which have been stuck at extremely low mapping rate, no matter how much things improved for the other cells. It turns out this population of cells are contaminated with human material.

In our study we invsetigated CD4+ T cells sorted from spleens of infected mice. There are a number of potential entry points for human material to contaminate the samples.

1. Human cells can be sorted with mouse cells.
2. Human material enter the plate of cell lysate during cDNA generation.
3. Human material can end up in the plate when creating the DNA sequencing library.

These different potential vectors of human contamination will lead to data with different characteristics. If human cells are sorted together mouse cells, the data will have heterogenous expression patterns as single cell data does, only with human genes rather than mouse genes. On the other hand, if human material enter the plate of sorted cells while cDNA is being generated, human mRNA will convert to cDNA but with consistent bulk-like expression patterns in the different wells. Finally, if human material enter the plate during library preperation no reverse transcription will be performed, and instead DNA from the human material will end up being sequenced.

To analyse this, I added all human Gencode transcripts to the Salmon reference from the previous post, along with a human 18S rRNA sequence. This will account for the first two possibilities. For the case of human DNA contamination, I extracted the unmapped reads from Salmon and aligned them to the human genome with HISAT2. With the remaining unmapped/unaligned reads I calculated the final mapping rate.

It is clear that we hace solved the mystery of the extremely low-mapping population! By breaking up the mapped reads into the sources of contribution like we have done before for each sample, we can see which have the contamination cases have happened.

In the entirety of the plate 20003_6, as well as stretches of the plate 20003_8 we see that by for the greatest contribution of material is from human intergenic DNA, suggesting that the contamination happened during library preperation.

At this point I want to illustrate how much of the reads we have no found an explenation for compared to the original reference.

The mapping rate have moved from a heterogenous stretch to a clearer distribution common for the plates. In the end the mapping rate is only 75% on average, but this is a great improvement from before, and I haven't managed to see something systematic about the remaining reads.

To summerise, the Salmon reference now contains Mouse Gencode genes, ERCC spike-in sequences, mouse ribosomal RNA, a TSO concatemer "bait" sequence, Pseudomonas 16S and 23S sequences, human Gencode genes, and human 18S. Additionally, the unmapped reads are aligned to the human genome.

I hope the series of posts have been helpful, and in particular illustrative of the many failure modes of scRNA-seq experiments. This was all within a single experiment!

# Low mapping rate 4 - Bacterial contamination in reagents

The large amount of amplification needed to get usable material from single cells cause us to get detectable signal from minor effects such as ribosomal RNA or TSO concatemers. This also means reagents need to be particularly pure. Several studies have inviestigated the bacterial contents in kits used for sequencing studies, investigating the 'kitome'.

Reagents might have a small amount of contamination which will not be detectable in the "bulk" assays with more input material it was designed for, or even for relatively large single cells.

In particular, some commercial reverse transcriptases have been reported to contain contaminating bacteria. This thread on Seqansweres is generally recommended for anyone using scRNA-seq. A few years ago multiple people reported electropherograms which showed different sample purity depending on the reagent lot. This is an illustrateive plot reported by user bplevi:

The thread contains more examples good vs bad lots of reagents. They also reported which lots they had had success or failure with. For the data analyzed in the previous posts of the series, SuperScript II has also been used, though I don't know the lot number.

The company producing SuperScript II later acknowledged a couple of lots where contamination of E. coli had beed detected.

Analysing the reads from my data, I didn't see contribution from E. coli, however, I did find contamination from Pseudomonas. This bacterial family have also been reported in papers describing reagent contamination.

I added sequences for Pseudomonas 16S and 23S to the Salmon reference, reran the samples, and compared the results to the previous iteration.

With this addition the mapping rate for the majority of the cells jump up to ~75%!

From the scatter plot above we can see that before adding the Pseudomonas sequences, the mapping rate was more heterogenous. As we did before, we can see if this relates to the individual plates.

It is quite interesting that some plates have a larger contribution of TSO concatemers while others have larger contributions of bacterial sequences. All these plates are suppowed to have the same lots of reagents. However, the plates are sorted on different days then stored in a freezer until the experiment is done. It might be the case that TSO products and bacterial products degrade differently due to this.

(I should also acknowledge Luisa for pointing me to various bacterial resources.)

# Low mapping rate 3 - TSO concatemers

Compared to the two previous posts in the series, this post deals with something more technology specific.

Many biochemical reactions require a criticial amount of material before they work at all. This is the main challange with single cell RNA-sequecing: to create sufficient material for the next step in a protocol. The Smart-seq2 protocol makes use of Nextera, a kit for fragmenting and adding adapters for amplification, and finally Illumina sequencing adapters. But in order for Nextera to work a minimal input of DNA must be provided.

Once cDNA have been reverse transcribed from mRNA from a cell, it can be pre-amplified if it has PCR adaptors at both ends of the cDNA. A particularly convenient way to add these adapterors is though template switching PCR.

## Template switching oligos

Here when reverse transcription reaches the 5' end of the RNA, a CCC sequence is added. This allows a DNA oligo with GGG at the to bind to the end if the cDNA. This oligo allows the second strand of the cDNA to be generated, and at the same time provides as an adapter for PCR primers.

In the standard implementation of the Smart-seq2 protocol the template switching oligo (TSO) is AAGCAGTGGTATCAACGCAGAGTACATGGG.

Sometimes these TSOs concatenate to longer DNA sequences, and get amplified along with the cDNA. If you investigate reads not mapping to the transcriptome or rRNA you will find a number of reads whic have the TSO repeated after itself multiple times.

The TSO concatemers can be accounted for during quantification by including a FASTA record of a TSO concatemer in the reference, like this one:

>TSOconcatamer
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAGAGTA
CACGGGAAGCAGTGGTATCAACGCAGAGTACATGG

Rerunning Salmon with the new reference, we can compare the mapping rates to those in the previous post:

As we can see, the majority of cells get increased mapping rate when including the TSO concatemer. And many cells go from single digit percentages to over 50%! These samples are likely wells with almost no cellular mRNA in.

As before we can visualize the relative contribution of fragments from the different sources (here I merged the rRNA genes expression to one unit).

We see that several of the plates have large amounts of TSO contamination, and compared to rRNA it seems more variable between samples. It also seems to generally have a larger contribution than rRNA except for in one of the plates.

To quickly investigate different concatemers in data I created a little tool to our readquant collection which counts the number of occurances in reads from FASTQ files.

$concatamer_filter.py fastq/20003_3#57_1.fastq fastq/20003_3#57_2.fastq AAGCAGTGGTATCAACGCAGAGTACATGGG Copies,Fragments 0,667147 1,32750 2,33590 3,84431 4,8723 5,8 A typical strategy when investigating low mapping rates is to BLAST unmapped reads. Many times this will give results from scaffolds of the common carp genome (Cyprinus carpio). Actually, if you simply BLAST a 3x concatemer of TSO, it will map all over the carp genome with 100% similarity Finally, I should mention that Smart-seq2 isn't the only protocol making use of template switching. It is also used in STRT-seq, the different flavours of Drop-seq (e.g. SeqWell, DroNc-seq etc) as well as in the very popular 10X Genomics Chromium single cell solution. # Low mapping rate 2 - Ribosomal RNA In the first post of the low mapping rate series I started off by describing a problem at the data processing level in a dataset. In the coming few posts I will focus on a particular dataset and iteratively increase the mapping rate due to different factors. The data is from our study of CD4+ immune response to Malaria infection. In the study we first performed the experiment using the Fluidigm C1 system with the SMARTer kit, then we replicated the results using Smart-seq2 in microwell plates. Here I will use the Smart-seq2 data. This data have a particularly large range of mapping rates for the individual cells, evenly distributed between 1% and 70%. ## Ribosomal RNA Ribosomal RNAs are highly abundant in cells, though unlike mRNA these are not polyadenylated. Since (almost) all scRNA-seq protols make use of oligo-dT sequences to reverse transcribe RNA to cDNA this is not a big issue. The RNA component consists of a number of rRNA genes, repeated in chunks in various locations of the genome. These genes are 5S, 5.8S, 28S (all parts of the large subunit), and 18S (small subunit). In particular, 18S have a couple of (relatively short) stretches of poly-A in its sequence. My theory is that when the amount of mRNA is very limited in a sample the olig-dT binds these small stretches and the 18S gets reverse transcribed. To investigate this, I added the sequences of Rn5S, Rn5.8S, Rn18S, and Rn28S from mouse together with the GENCODE transcripts and ERCC spike-in sequences in a new reference, and reran all the samples through Salmon. On average this had the effect of increasing the mapping rate, with a number of samples having almost twice the mapping rate as before. The data here consists of cells from many individual mice, from different time points in the infection with a couple of replicates. By necessity of the technology cells from each mouse and time point need to be sorted into individual microwell plates. From the quantified gene expression/abundance values we can compare the controbuting sources in each individual cell, stratified by plate to see if there are any trends. Here ENSMUS corresponds to contribution from the mouse transcriptome. The different rRNA genes are indicated. Here we see that Rn18S is contributing far more than the other genes. It is also clear that different samples (plates) have different contributions of rRNA. If you haven't included ribosomal RNA in you mapping reference and are working on mouse, a red flag for rRNA contamination is particularly high expression of genes called CT010467.1, AY036118, Erf1, or Gm42418. These genes overlap a region on Chromosome 17 which have particularly similar sequnce to 18S. I have seen many datasets where any of these genes are the top 3-4 most highly expressed genes in a cell. # Low mapping rate 1 - Unsorted FASTQ pairs Occasionally when working with scRNA-seq data, you notice that there is a large degree of heterogeneity in terms of the percent of mapped reads per cell. Typically this is one of the criterions for exculding cells from analysis. Usually we are pushed to get things done, and as long as we have enough cells with enough mapped reads to perform proper analysis we let it be and move on. In our review of the history of scRNA-seq experiments we point out that the sequencing itself is one of the main current bottlenecks for large scale experiments. In light of this, I thought it would be useful to actually note what is causing us to sequence reads which we are not using in the actual analysis. I will write a series of posts with a number of contributing factors for low mapping rates I have noticed recently. # Unsorted FASTQ pairs Typically we quantify gene expression with Salmon, and have some simple tools to extract QC data from the result files. One good combination of variables to look at in your cells is the number of mapped reads compared to the % of mapped reads. In a recent case, we had a plot like this What sticks out here is the large gap between high-mapping and low-mapping cells. It usually shows more of a continuous trend, with a more clear cluster of "proper" cells. It turned out that the at some point in the data processing steps prior to quantificiation, the order in the reverse and forward FASTQ files had not been kept consistent. Example: $ head -n 8 JE1704_C27_R{1,2}.fastq
==> JE1704_C27_R1.fastq <==
@NS500239:235:HL7JLAFXX:1:11101:21276:1083 1:N:0:ATGCGCAG+NTTAATAG
CCTCTACCCAGAGGCCCAGTGGCAGAGGCCTGGACAAGTATTGAACACAAGAACTGTAGTGGTCAGAGGGACTTAA
+
AAAAAAEEEEEEEEEEEEEEEEEEEEEEAEEEE/EEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEAEEA
@NS500239:235:HL7JLAFXX:1:11101:9041:1085 1:N:0:ATGCGCAG+NTTAATAG
TGTTTTTATTGATTTAGTCTGTTTCAGAGTCAAGGTGTCAACGAGGAAGGATGGATATCCATGGAGGAAGAAGAGA
+
AAAAAEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEAEEEEEEEAEEE/EEA/EA<E

==> JE1704_C27_R2.fastq <==
@NS500239:235:HL7JLAFXX:2:11101:17928:1074 2:N:0:ATGCGCAG+NTTAATAG
ACCTAATATAGCAGGTGGCCAGGACTGGGATCCAGCTGCCTGGATCAGGTCAGGCTTGAGGAAGACTGCTTAAGAG
+
AAAAAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAEEEEAEEEEEEAEEAEEEAE6/A
@NS500239:235:HL7JLAFXX:2:11101:5780:1075 2:N:0:ATGCGCAG+NTTAATAG
TATCAACGCAGAGTACGGGATGAGGTGTGGGACGCACTGCCAGCGTTGCCCTTTAGTATAGCTCTTGGTAAACTAA
+
AAAAAEAEEEEA6AEE//EEEA///EE//E/EEAEEA6A6EEEEE//6E/<EAE<EEE6E/EEEAE/EEEEEEEE6

If everything is correct, the headers of the FASTQ records should be identical up to the first space. The number after the first space indicated if the FASTQ is a forward read or reverse read. Here we specifically see that the reads come from different sequencing lanes (the number after HL7JLAFXX:).

When Salmon is mapping with umatched forward and reverse reads, the majority of these will map to different transcripts from eachother. This will cause the read to be considered unmapped as it is an event which is not consistent with typical RNA-seq libraries.

The solution to this problem is pretty simple: just sort all your FASTQ files by the header header. The quickest solution I stumbled upon for this is from the EdwardsLab blog which suggests a Bash oneline to do this.

After sorting, the beginning of the FASTQ pair above looks like this

\$ head -n 8 JE1704_C27_R{1,2}_sorted.fastq
==> JE1704_C27_R1_sorted.fastq <==
@NS500239:235:HL7JLAFXX:1:11101:10008:20335 1:N:0:ATGCGCAG+CTTAATAG
CCCACAGCCTCTGCCGCGGGTACCATGAAGATCTCTGCAGCTGCCCTCACCATCATCCTCACTGCAGCCGCCCT
+
AAAAAEEAEEEEEEEEEEEEEE//EEEEEEEEEEEEEEAEEEAEEEEAA66EEE/E/EEEEEEEEEEEEAAEEE
@NS500239:235:HL7JLAFXX:1:11101:10009:10878 1:N:0:ATGCGCAG+CTTAATAG
TACCAACACATGATCTAGGAGGCTGCTGACCTCCAACAGGAATTTCACCACTTAACCCTCTAGAAGTCCCACTA
+
AAAAAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEAEEEEEEEAEEEEEE/EEEEEEEEAEEE

==> JE1704_C27_R2_sorted.fastq <==
@NS500239:235:HL7JLAFXX:1:11101:10008:20335 2:N:0:ATGCGCAG+CTTAATAG
TGGTATCAACGCAGAGTACGGGGGTGCAGAGGGCGGCTGCAGTGAGGATGATGGTGAGGGCAGCTGCAGAGATCTT
+
A/AAAEEEAEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEAAAEEEEEAEEEEEE
@NS500239:235:HL7JLAFXX:1:11101:10009:10878 2:N:0:ATGCGCAG+CTTAATAG
ATTACATGGAGTCCATGGAATCCAGTAGCCATGAAGAATGTAGAACCATAGATACCATCTGAAATGGAGAATGATG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE/EEEAAEEEEEEAAEEEEEEEEEEE

What caused the files to not be correctly sorted to begin with? it could have been something about how the files from different sequencing lanes were merged. I have also had issues with this happening befoer when using the samtools fastq command on CRAM files which have been sorted by alignment coordinate.

It seems a small number of the cells had succesfully kept the FASTQ order, explaining the "outlier" population at 80%.

After sorting all files and rerunning Salmon, we get much more reasonable mapping rates.

There are still some cells with lower mapping rates, but not nearly as many as before. And we see a more consistant cluster of highly mapping cells at ~80%.

We can also visualize how much the mapping improved for each cell when properly ordering the FASTQ files.

This was probably the cause of low mappting rate which is easiest to deal with, and it was introduced at the data processing side. But a good thing to keep in mind, if you have consistently very low mapping rate, have a quick look to see that reads are sorted correctly.

# Negative Binomial Factor Analysis by SGD

Principal component analysis works on the assumption that residual error from the linear model is Gaussian. To satisfy this in e.g. the case of scRNA-seq gene expression counts, it is common to log transform the counts with a "pseudocount" added to keep expression positive and deal with 0's.

Count models are fundamentally different from normally distributed models in that there is a relation between empirical mean and variance.

It is well known that a negative binomial noise model is appropriate for RNA-Seq sequencing counts. Previously I wrote about ZINB-WaVE by Risso et al, a factor analysis method which has a zero-inflated negative binomial noise model. The negative binomial distribution has two parameters, $$\mu$$ - the mean of the disitribution - and $$\phi$$, the overdispersion. If $$y \sim NB(\mu, \phi)$$ then $$\mathbb{E}(Y) = \mu$$ and $$\text{Var}(Y) = \mu + \frac{1}{\phi} \cdot \mu^2$$. The likelihood of this model is $$\mathcal{L}_{NB}(y | \mu, \phi) = {{y + \phi - 1} \choose {y}} \cdot \left( \frac{\mu}{\mu + \phi} \right)^y \cdot \left( \frac{\phi}{\mu + \phi} \right)^\phi.$$

If we make the simplifying assumption $$\phi = 1$$ then the log likelihood simplifies to $$\log \mathcal{L}_{NB}(y | \mu, 1) = y \cdot \log(\mu) - (y + 1) \cdot \log(\mu + 1)$$

From available datasets, it looks like this assumption might be a sensible thing. Let's look at the empirical mean variance relation for four representative datasets.

The mean variance relation seem to hold for a large number of genes, but not all. I think one way to deal with this is to consider a factor model similar to PCA for the $$\mu$$ parameter, which should explain additional variance on top of the expected technical variance. Say that each gene $$g$$ and cell $$c$$ has its own mean $$\mu_{g, c}$$. In matrix form, $$\mu = \exp ( W x + E + \log(T) + S),$$ where $$W$$ is a $$G \times N$$ matrix of gene weights, $$x$$ is an $$N \times C$$ matrix of latent factors, $$E$$ is a $$1 \times C$$ vector of cell specific scaling "efficencies", $$T$$ is a $$1 \times C$$ vector of known cell specific scale factors, in this case the total number of counts in a cell, and $$S$$ is a global scaling factor. (Here we pretend matrix-vector addition "broadcasts" like in NumPy / TensorFlow code.)

This can be fitted with stochastic gradient descent using TensorFlow as I wrote about in the case of PCA before. The full implementation is available here, but besides the data reading and mini-batching code, the key snippet of the TensorFlow model is the following:

...

## Model ##

W = tf.Variable(np.random.randn(G, N), name='weights')
x = tf.Variable(np.random.randn(N, S), name='PCs')
E = tf.Variable(np.random.randn(S), name='Efficiency')
S = tf.Variable(np.array([0.]), name='Scaling')

sample_idx = tf.placeholder(tf.int32, shape=[None])
variable_idx = tf.placeholder(tf.int32, shape=[None])
T_ = tf.placeholder(tf.float64, shape=[None])
y_ = tf.placeholder(tf.float64, shape=[None])

W_ = tf.gather(W, variable_idx)
x_ = tf.gather(tf.matrix_transpose(x), sample_idx)
eta_ = tf.reduce_sum(W_ * x_, 1)
E_ = tf.gather(E, sample_idx)

mu_ = tf.exp(eta_ + tf.log(T_) + E_ + S)

LL = tf.reduce_sum(y_ * tf.log(mu_) - (y_ + 1) * tf.log(mu_ + 1))

...

Performing the SGD model fitting takes about 20 seconds for datasets with several thousands of cells, using the top 3,000 expressed genes. Applying it to the data presented in the plot above using 2 hidden factors per cell, we get these results:

I like that in this model you can just provide UMI counts without any need to log transform or in other way Gaussianize the data. Though in practice, the results from performing regular PCA on log transformed counts give pretty similar results in a fraction of the time.

Different runs of the model also give slightly different results, though large scale patterns are pretty conserved between runs.

Here we are not enforcing any independence between the hidden factors, which should be a next step. Additionally, some way of selecting the number of factors like variance explained in PCA would be useful.

# Simple and interpretable supervised machine learning of scRNA-seq cell types

The scRNA-seq field has reached a second wave, were the first initial systems under investigation are getting repeated. Either to ask more specific questions, or to get better data with the newer technologies available. This is highlighted in particular in a recent paper by Kiselev & Hemberg. They point out that we need to start thinking about cell type references similar to how there are genome references, and we need a way to map data to this reference.

I was wondering how a stereotypical machine learning multi-class classification model would perform for this task. Since the online scmap tool from the K&H paper comes with a couple of well annotated example data sets of pancreatic cells, this ended up being quite straightforward.

What we will do is train a machine learning model to predict cell types using one of the data sets, and predict cell types of cells from the other dataset with it.

The most basic multi-class classification model is Logistic regression, and we will use the implementation in scikit-learn. The entire analysis is in a notebook on Github, but let's walk through the key parts here.

To train the model, we will use the data from Segerstolpe et al, consisting of 3,500 cells annotated with 15 cell types. We want to predict the cell types of the samples using the gene expression values. First we split up the data so we can evaluate the model afterwards.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
train_test_split(s_exprs, s_sample_info['cell_type1'], test_size=.2)

Next we initiate the model.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=0.05, penalty='l1', n_jobs=-1)

First of all, we use L1 penalty in the model. This means we are favoring sparsity. That is, we believe only a small number of the genes determine the cell types, and we favor many genes having 0 weights. The C parameter determines how strongly we enforce sparsity. I picked 0.05 after trying a couple of different values.

Next we train and investigate the model, this takes about 5 seconds.

lr.fit(X_train, y_train)

LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

lr.classes_

array(['MHC class II', 'PSC', 'acinar', 'alpha', 'beta', 'co-expression',
'delta', 'ductal', 'endothelial', 'epsilon', 'gamma', 'mast',
'not applicable', 'unclassified', 'unclassified endocrine'], dtype=object)

lr.coef_.shape

(15, 23171)

The cell types we want to be able to predict gets stored in the lr.classes_ field. Logistic regression works by predicting a probability of a sample coming from a given class. In the standard version in sklearn, this is done by making oen binary logistic regression for each class, where logistic regression depends on a linear combination of weights times gene expression values. The class with highest probability gets assigned as the predicted class when evaluating the model on a new observation. The weights for each gene for each cell type is stored in lr.coeff_.

First let's have a look at the performance of the model

lr.score(X_train, y_train)

0.98256848096762717

lr.score(X_test, y_test)

0.92887624466571839

I think this is pretty good. For the data used for training, the model is 98% accurate, while it is 92% accurate for the held out testing data. It should be noted that this might not be the best metric here, because the cell types are very different in number of representatives.

To predict using our model, we just use the lr.predict method.

y_hat = lr.predict(X_train)
y_hat

array(['ductal', 'alpha', 'not applicable', ..., 'not applicable', 'beta',
'not applicable'], dtype=object)

The most straightforward way to investigate how the model is doing is by making a matrix of how different cell types get predicted.

from sklearn import metrics
pd.DataFrame.from_records(metrics.confusion_matrix(y_train, y_hat),
index=lr.classes_)

In particular we notice that some of the not applicable and unclassified cells get predicted as other cell types.

A particularly nice thing with linear model such as logistic regression is how interpretable they are. The weights of the genes directly relate to how the the cell types are predicted. Let's assign each gene as a marker for the cell type it's the strongest predictor of.

marker_genes = pd.DataFrame({
'cell_type': lr.classes_[lr.coef_.argmax(0)],
'gene': X_train.columns,
'weight': lr.coef_.max(0)
})

marker_genes.query('weight > 0.').shape
(628, 3)

The final row tells us that of the ~23,000 genes we used as input, only 628 are used in predicting the cell types. Let's print out the top predictive genes for each cell type.

top_markers = \
marker_genes \
.query('weight > 0.') \
.sort_values('weight', ascending=False) \
.groupby('cell_type') \
.sort_values(['cell_type', 'weight'], ascending=[True, False])

figsize(10, 20)
for i, m in enumerate(top_markers.cell_type.unique()):
plt.subplot(10, 3, i + 1)
g = top_markers.query('cell_type == @m')
plt.title(m, size=12, weight='bold')
for j, gn in enumerate(g.gene):
plt.annotate(gn, (0, 0.2 * j), )

plt.axis('off')
plt.ylim(6 * 0.2, -0.2)

plt.tight_layout()

We wrote before that logistic regression predicts the probability of each cell type. This can also be used as a visualization. After sorting the cells according to the known cell type, we can predict the probability, then plot the probability of each cell type for each cell.

shift_idx = y_train.argsort()
sorted_idx = y_train.sort_values().index
y_prob = lr.predict_log_proba(X_train.loc[sorted_idx])

Now let's finally get to the task at hand: treat this model as a reference, and predict cells from another dataset. The second dataset is from Muraro et al. This is 2,100 cells annotated with 10 cell types, the interesting point is to see if these cell types gets predicted in a reasonable way by our model.

Something we need to make sure of is that the genes in the new dataset are in the same order as in the previous. If a gene is not present in the new dataset, we set those values to 0.

X_new = m_exprs.T.loc[X_train.columns].T.fillna(0)
m_sample_info['predicted_cell_type'] = lr.predict(X_new)

m_sample_info \
.groupby(['cell_type1', 'pred_cell_type']) \
.count().iloc[:, [0]] \
.unstack().T \
.fillna(0)

This is pretty nice I think! We didn't do any normalisation or batch correction etcetera, but the results still seems consistant. Based on this I think it's pretty easy to envision servers with models for cell types based on huge amounts of data that can be used by researchers to query new samples against.

I think clustering and cell type annotation will be considered similarly to transcriptome assembly and annotation in the future. An application which is certainly feasible, but a level more advanced than most users will need.

Again, this sort of analysis is pretty straight forward, and the notebook is available here

# Approximate PCA by mini-batch SGD using TensorFlow

In machine learning you usually define a model which has a cost function which you minimize to learn parameters from the data. A very powerful way to do this with large amounts of data is mini-batch stochastic gradient descent (SGD). This means iteratively looking at small random subset of your data, then update parameters using that subset (mini-batch).

I think it's pretty intuitive why this works well; you both need less memory to evaluate the cost function on the mini-batch; and by constantly changing the data we should reach less overfitted results.

This strategy is very well used in supervised classification and regression. Unfortunately in our field of single cell gene expression analysis, these are not the sorts of problems we have. A problem we do have is to learn low-dimensional representations of the data, for example through principal component analysis (PCA).

There are a couple of reasons why mini-batch SGD doesn't make sense for this. Firstly, just making batches over the observations will not help much, because we usually have rather few (hundreds) observations (cells) of many (tens of thousands) variables (genes). Secondly, we need to learn parameters for every observation, so no information would be shared between batches! We would just end up solving many independent problems.

Usually data is represented as a table with observations vs variables. Another way to represent the data is by "long" or "database-style" encoding. (Also known as "tidy" in the R world). Here we store records of values, and indexes for each record indicating which observation and variable the value belongs to. In this formatting it actually makes some sense batching the data!

Recall that in PCA, we want to represent our data $$Y$$ by $Y = W \cdot X,$ where W contains a weight for each variable, and $$X$$ has a representative value for each observation. Say that we learn the $$W$$ and $$X$$ by batching the long form of the data $$Y$$.

From the animation, we notice that the weights for each variable will be learned after each other. So in the beginning of optimization the model will fit the first variable alone. A solution to this is to shuffle the long form of the data.

Now we see there isn't any bias which variables are trained.

I made an implementation of this strategy in TensorFlow. It's not strictly PCA, because the cost function is simply $|| y_b - w_b x_b || \cdot \frac{1}{B},$ where the $$b$$ subscript indicates that it's from within a batch, and $$B$$ is the size of the batch. The complete implementation is available here, but the main functional TensorFlow part is the following

N = 2  # Latent space dimensionality

W = tf.Variable(np.random.randn(G, N), name='weights')
x = tf.Variable(np.random.randn(N, S), name='PCs')

sample_idx = tf.placeholder(tf.int32, shape=[None])
variable_idx = tf.placeholder(tf.int32, shape=[None])
y_ = tf.placeholder(tf.float64, shape=[None])

W_ = tf.gather(W, variable_idx)
x_ = tf.gather(tf.matrix_transpose(x), sample_idx)
y_hat = tf.reduce_sum(W_ * x_, 1)

cost = tf.nn.l2_loss(y_ - y_hat) / batch_size

The main point is to use the tf.gather functions to get the sub-tensors for the current batch.

For startars, we apply this to the Iris data:

We see that the cost is going down, and we get a 2-dimensional representation. If we compare to the normal solution to PCA, we see that the our solution finds roughly the same features.

Can we use this for real and interesting data? We evaluate this by considering a dataset by Zeisel et al, consisting of 3,005 single cells from mouse brain. We look at the 3,000 top variable genes, so the long form representation has about nine million rows. Using a batch size of 10,000, we get fairly good results in about 10 seconds.

Again, comparing to the typical solution, here using scikit-learn, we see the same general features.

It should be noted that the scikit-learn PCA is instant for this dataset, it really doesn't make sense to use this mini-batch SGD version in practice. But I think it is interesting because it does show we can use the mini-batch SGD concept for tasks like these. The model we use here could be extended to include known covariates, or it could be used for clustering.