# Mapping a malaria infection response by GPLVM

## Introduction

If you have ever looked at how cell types are defined in flow cytometry plots, you might be used to seeing relatively faint signals amid a large amount of noise. In flow cytometry, the abundance of a small number of proteins is measured in hundreds of thousands of cells. A representative example can be seen here.

Even so, it is known that when a subpopulation of cells is sorted out from the global population, those cells have distinct functions and potentials.

Each cell type or 'cluster' will, however, show a lot of observed variability. This could be due either to technical measurement factors or to intrinsic biological properties. The takeaway, though, is that not all variability is interesting. Cells do, however, need to end up in the state that defines them as a cell type distinct from other cells. There is a starting state, something happens, and cells in an end state are produced. It is reasonable to argue that if we measure gene expression of cells representing the entire process of going from one state to another, we should see a continuum of cells.

Imagine we do an experiment where we sample a population of cells at a number of time points and measure two marker genes in each cell.

While there is a lot of noise, there is a little bit of structure in each time point. We would attribute this to some cells being "ahead" of others in differentiation. If we had a magical flow cytometer that could track the levels in the cells in real time, we might see something like this.

What do we mean by this? We are essentially assuming that for both gene A and gene B, there is a pattern of expression change playing out over time as the population of cells differentiates.
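The thought experiment above can be sketched in a few lines of Python. This is only an illustration: the function `simulate_snapshots`, the shapes of the two expression programs, and all parameter values (`spread`, `noise`, the sampled days) are made up for the example and do not come from real data.

```python
import math
import random


def simulate_snapshots(time_points, cells_per_point=100, spread=0.8, noise=0.3, seed=0):
    """Simulate snapshot measurements of two marker genes (all settings hypothetical)."""
    rng = random.Random(seed)
    cells = []
    for day in time_points:
        for _ in range(cells_per_point):
            # Each cell's true progress scatters around its sampling time:
            # some cells are "ahead" of others in differentiation.
            progress = rng.gauss(day, spread)
            # Made-up smooth expression programs:
            # gene A switches on, gene B is expressed transiently.
            gene_a = 1.0 / (1.0 + math.exp(-(progress - 2.0))) + rng.gauss(0, noise)
            gene_b = math.exp(-0.5 * (progress - 2.0) ** 2) + rng.gauss(0, noise)
            cells.append(
                {"day": day, "progress": progress, "gene_a": gene_a, "gene_b": gene_b}
            )
    return cells


cells = simulate_snapshots(time_points=[0, 1, 2, 3, 4])
print(len(cells), "cells; first cell:", cells[0])
```

In a real experiment we only observe `day`, `gene_a` and `gene_b`; the `progress` column is exactly the thing we would like to recover.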

## Learning from snapshots

In single-cell RNA-sequencing experiments, we usually sample ~100 cells from each time point, and then want to figure out the underlying trajectory the cells are going through.

Here, we are arguing that there is an underlying process, representing differentiation, and that genes change expression levels over the course of this process. If we knew the fine-grained differentiation state of each cell, and made only the physical assumption that changes in expression level are smooth, we could model the expression patterns using Gaussian Processes.

$y_g = f_g(t) + \varepsilon$

The function $$f_g$$ is distributed according to a Gaussian Process, an infinite-dimensional generalization of the multivariate normal distribution, and $$\varepsilon$$ corresponds to observational noise.
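To get a feel for what "distributed according to a Gaussian Process" means, here is a minimal sketch that draws one smooth function from a GP prior at a finite set of time points. It assumes a zero mean and a squared-exponential (RBF) kernel, which is a common choice but an assumption here; the helper names (`rbf_kernel`, `cholesky`, `sample_gp_prior`) and the parameter values are hypothetical and written from scratch with the standard library only.

```python
import math
import random


def rbf_kernel(t1, t2, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between two time points."""
    return variance * math.exp(-0.5 * ((t1 - t2) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sample_gp_prior(ts, rng, variance=1.0, lengthscale=1.0, jitter=1e-6):
    """Draw f(ts) ~ N(0, K), where K is the kernel matrix over the time points."""
    n = len(ts)
    # A small jitter on the diagonal keeps the factorization numerically stable.
    K = [
        [rbf_kernel(ts[i], ts[j], variance, lengthscale) + (jitter if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    L = cholesky(K)
    z = [rng.gauss(0, 1) for _ in range(n)]
    # If z ~ N(0, I), then L z ~ N(0, L L^T) = N(0, K).
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


rng = random.Random(1)
ts = [0.5 * i for i in range(15)]
f = sample_gp_prior(ts, rng, lengthscale=2.0)
print([round(v, 2) for v in f])
```

Neighbouring values in `f` are close to each other because the kernel makes nearby time points strongly correlated; a longer `lengthscale` gives slower-changing functions.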

If we have multiple genes, say $$G$$ of them, that we want to model in this way, we can actually learn the differentiation trajectory values $$t$$ from the data! This is done by using the Gaussian Process Latent Variable Model (GPLVM). I wrote a bit about this before.

$\begin{pmatrix} y_1 \\ \vdots \\ y_G \end{pmatrix} = \begin{pmatrix} f_1(t) \\ \vdots \\ f_G(t) \end{pmatrix} + \varepsilon$
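
To make "learning the trajectory values" a bit more concrete, here is one common way to write the GPLVM objective. It is a sketch under assumptions made for illustration: all genes share one kernel $$k$$, the noise is Gaussian with variance $$\sigma^2$$, and each gene is centred to zero mean. With $$N$$ cells, $$\mathbf{y}_g$$ the vector of expression values of gene $$g$$ across the cells, and $$K_t$$ the $$N \times N$$ matrix with entries $$k(t_i, t_j)$$, the log marginal likelihood of the data given the latent values is

$\log p(Y \mid t) = \sum_{g} \left[ -\tfrac{1}{2} \mathbf{y}_g^\top (K_t + \sigma^2 I)^{-1} \mathbf{y}_g - \tfrac{1}{2} \log \left| K_t + \sigma^2 I \right| - \tfrac{N}{2} \log 2\pi \right]$

A point estimate of the latent values is then $$\hat{t} = \arg\max_t \log p(Y \mid t)$$. Since $$t$$ enters only through $$K_t$$, assignments of $$t$$ under which every gene looks like a smooth function tend to score higher. Bayesian variants of the model approximate a posterior distribution over $$t$$ rather than a single best value.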

We used this method in our thrombocyte development paper, Macaulay, Svensson, Labalette, et al., Cell Reports 2016. This way we could order the cells according to the most likely transcriptional trajectory, and then analyze, for example, how genes behave over the course of development. We also used it to study the transition of mouse embryonic stem cells to a specific cell state of interest in Eckersley-Maslin et al., Cell Reports 2016.

Normally, we used the implementation in GPy to fit the latent time values, but there are a number of other GPLVM implementations, some of which are explicitly aimed at scRNA-seq data.

## Malaria immune response

In our recent paper, Lönnberg, Svensson, James, et al., Science Immunology 2017, we applied Bayesian GPLVM to a time course of immune cells from mice reacting to malaria infection.

When animals mount an immune response, the natural course is to return to the healthy state once the infection has been fought off. The expression profiles of the cells therefore exhibit cyclic behavior. This causes a problem when inferring a single pseudotime: not a practical one, but one of visual interpretation. To deal with this we place informed priors on the $$t$$ values, $$p(t_i) = \mathcal{N}(\text{day}_i, \sigma_p^2)$$, inspired by the DeLorean implementation. This lets us make full use of the time course, and of the seven mice we sacrificed for this purpose.
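The effect of such a prior can be illustrated with a small sketch. The functions `log_time_prior` and `total_log_prior`, the capture days, the candidate pseudotimes and the value of `sigma_p` are all hypothetical and chosen only for the example; this is not the code used in the paper or in DeLorean.

```python
import math


def log_time_prior(t, day, sigma_p):
    """Log density of N(day, sigma_p^2) evaluated at pseudotime t."""
    return -0.5 * math.log(2.0 * math.pi * sigma_p ** 2) - 0.5 * ((t - day) / sigma_p) ** 2


def total_log_prior(ts, days, sigma_p=1.0):
    """Sum of per-cell log priors, assuming cells are independent a priori."""
    return sum(log_time_prior(t, d, sigma_p) for t, d in zip(ts, days))


# Hypothetical capture days for five cells.
days = [0, 0, 2, 2, 7]
# Candidate pseudotimes that stay close to the capture days.
plausible = [0.3, -0.2, 1.6, 2.5, 6.8]
# Same, except that the day-7 cell is placed next to the day-0 cells, which is
# what expression alone might suggest if late cells resemble early ones.
implausible = [0.3, -0.2, 1.6, 2.5, 0.5]

print("plausible:  ", total_log_prior(plausible, days))
print("implausible:", total_log_prior(implausible, days))
```

Adding this term to the model's objective penalizes pseudotimes that stray far from the day a cell was collected, so cells from the end of a cyclic response are not folded back onto the start. A smaller `sigma_p` ties pseudotime more tightly to the capture day, while a larger one leaves more room for cells to be "ahead" or "behind".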

The inferred pseudotime can be visualized as in the example above, but now for real data.

This way we could obtain a high-resolution time course of the immune response to malaria infection, which we could use in downstream analysis to create a timeline of the events that happen after infection. See the paper for our findings!