Explaining variance by technical factors in scRNA-seq data using ARD-MLR in Stan

I was recently rereading the ADVI paper by Kucukelbir et al. and noted a couple of things I didn't know. First of all, their Stan implementation of Probabilistic PCA (PPCA) in the paper is far better than the implementation I made. Secondly, they implement a version of PPCA with Automatic Relevance Determination (ARD). This makes it possible to extract the "fraction of variance explained" by the principal components, similar to Singular Value Decomposition based implementations.

In PPCA we seek matrices \( W \) and \( Z \) so that

\[ \begin{align} X_n & \sim \mathcal{N}(W \cdot Z_n + \mu, \sigma^2) \\ Z_{i,j} & \sim \mathcal{N}(0, 1) \\ W_{i,j} & \sim \mathcal{N}(0, \sigma^2) \end{align} \]

The modification that allows the ARD is to introduce a hyper-prior \( \alpha \) for the prior on the weights \( W \).

\[ W_{i,j} \sim \mathcal{N}(0, \alpha_j \cdot \sigma^2) \]

Now the posterior of \( \alpha_j \) will indicate how much of the variance in \( X \) is explained by the \( j \)-th column of \( Z \).
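To get a feel for what the hyper-prior does, here is a small simulation sketch of the generative model above in NumPy (made-up dimensions and values, with \( \alpha_j \) scaling the standard deviation of column \( j \) of \( W \), matching the parameterisation used in the Stan code further down):

import numpy as np

rng = np.random.default_rng(1)

N, D, K = 500, 20, 5                          # samples, observed dims, latent dims
sigma = 0.5
alpha = np.array([2.0, 1.0, 0.5, 0.1, 0.01])  # relevance of each latent column

Z = rng.normal(0.0, 1.0, size=(K, N))               # Z_ij ~ N(0, 1)
W = rng.normal(0.0, sigma * alpha, size=(D, K))     # column j of W has scale sigma * alpha_j
X = (W @ Z).T + rng.normal(0.0, sigma, size=(N, D)) # X_n ~ N(W Z_n, sigma^2), with mu = 0

# The variance each latent column contributes to X roughly tracks alpha
contrib = np.array([np.var(np.outer(W[:, j], Z[j])) for j in range(K)])
print(contrib / contrib.sum())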

This seems to work really nicely, and it applies directly to the Residual Component Analysis model I described in an earlier post.

This idea of putting a hyper-prior on the variance also solves another problem I've been trying to deal with, which I'll describe below.

When I get a new single-cell RNA-seq dataset, I usually try to figure out which known factors contribute to the variation in the data. We usually have a table of technical and experimental information about the samples, as well as a matrix of expression values for each gene in each sample.

For now the RCA model is really too slow to be applicable to scRNA-seq data. My general workflow goes like this (a rough code sketch follows the list):

  1. Perform PCA
  2. Correlate PCs with technical factors
  3. Regress out correlating technical factors
  4. Perform PCA on the residuals
  5. Repeat steps 2-4 until you understand the data well enough for proper analysis
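As a rough sketch, steps 1-4 could look something like this with scikit-learn and NumPy, where expr and factor are placeholder names for a cells-by-genes expression matrix and one technical covariate (this is not my actual analysis code):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def correlate_pcs(expr, factor, n_pcs=10):
    # Steps 1-2: run PCA, then correlate each PC with the technical factor
    pcs = PCA(n_components=n_pcs).fit_transform(expr)
    return np.array([np.corrcoef(pcs[:, i], factor)[0, 1] for i in range(n_pcs)])

def regress_out(expr, factor):
    # Steps 3-4: regress the factor out of every gene and keep the residuals
    design = np.asarray(factor, dtype=float).reshape(-1, 1)
    fit = LinearRegression().fit(design, expr)
    return expr - fit.predict(design)

# Step 5: repeat with the residuals and other covariates until the remaining
# principal components no longer correlate with known factors.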

This gives me a handle on which factors are responsible for a lot of the variation, along with average effect sizes and groupings. It does not, however, give me quantitative information about how much of the variation in the data is explained by the different factors! I've been a bit frustrated by this, since the PCs do come with this information, so I've felt it should be possible to get it in a supervised way. I know it's something that can be done, but I haven't found the correct Google terms, which I guess should be something like "variance explained in multivariate multiple linear regression".

In the ARD-PPCA model above, though, I saw a clear strategy for getting the values I want: perform multiple linear regression with an ARD hyper-prior!

I made a Stan implementation which takes multivariate data and a design matrix assumed to include an intercept.

data {
  int<lower=0> N; // number of data points in dataset
  int<lower=0> D; // dimension

  int<lower=0> P; // number of known covariates
  vector[D] x[N]; // data
  matrix[P, N] y; // Known covariates

}
parameters {
  real<lower=0> sigma;
  matrix[D, P] w_y;
  vector<lower=0>[P] beta;
}
model {
  // priors
  for (d in 1:D){
    w_y[d] ~ normal(0, sigma * beta);
  }
  sigma ~ lognormal(0, 1);
  beta ~ inv_gamma(1, 1);

  // likelihood
  for (n in 1:N){
    x[n] ~ normal(w_y * col(y, n), sigma);
  }
}
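To use the model from Python I compile it with PyStan; a minimal sketch, assuming PyStan 2 and that the Stan program above is stored in a string I'll call ard_mlr_code:

import pystan

# Compile the ARD multiple linear regression model above.
# ard_mlr_code is a placeholder name for a string holding the Stan program.
model = pystan.StanModel(model_code=ard_mlr_code)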

Then I grabbed a dataset I had lying around (with 96 samples). Below is a snippet of the kind of sample information available.

21681_1#18 21681_1#32 21681_1#48 21681_1#58 21681_1#12
detection_limit inf inf inf inf inf
accuracy -inf -inf -inf -inf -inf
ERCC_content 0 0 0 0 0
num_genes 7092 6990 469 6056 1025
MT_content 72193.3 82527.8 77319.3 97045.6 99507.8
rRNA_content 68.1274 41.7641 1905.97 41.2784 0
num_processed 680970 7287104 975356 3726116 27173
num_mapped 501237 6106642 670644 3081850 2018
percent_mapped 73.6063 83.8007 68.7589 82.7094 7.42649
global_fl_mode 321 1000 309 276 283
robust_fl_mode 321 280 309 276 283
Supplier Sample Name SCGC--0894_B06 SCGC--0894_C08 SCGC--0894_D12 SCGC--0894_E10 SCGC--0894_A12
sample_type sc sc sc sc sc
LB_type A B B B B
merge sc_A sc_B sc_B sc_B sc_B
well B06 C08 D12 E10 A12
row 2 3 4 5 1
column 6 8 12 10 12

Using the patsy Python package, we can generate a design matrix which we use to model the values in the expression matrix.

Y = patsy.dmatrix('np.log(num_genes) + np.log(num_mapped) + LB_type + sample_type + percent_mapped', sample_info, return_type='dataframe')
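Since patsy treatment-codes the categorical covariates, each level of LB_type and sample_type gets its own indicator column in \( Y \). A quick (hypothetical) way to check what the model will actually be fed:

# List the design-matrix columns, including the indicator columns patsy
# created for the categorical covariates.
print(Y.columns.tolist())
print(Y.shape)  # (number of samples, number of design-matrix columns P)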

While the ADVI implementation in Stan is fast, I didn't have the patience to run the model on the full expression matrix. Instead I sampled 1,000 genes and ran the model on those, just as a proof of concept.

partial_logexp = logexp.sample(1000, axis=0)

N, D = partial_logexp.T.shape  # N samples, D sampled genes
N, P = Y.shape                 # P columns in the design matrix
data = {
    'N': N,
    'D': D,
    'x': partial_logexp.T,
    'P': P,
    'y': Y.T
}

v = model.vb(data=data)

As you might see from the Stan code, the ARD parameter is \( \beta \), so we extract these for the different columns of the design matrix \( Y \).

Note that the one-hot encoding of the categorical variables spreads their variance across multiple columns. To get a final fraction per factor, we can sum the variance over all the columns belonging to a given categorical variable.
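As a rough sketch of that calculation (assuming PyStan 2, where vb() reports the CSV file the ADVI draws were written to under v['args']['sample_file'], and that the ARD parameters show up as columns named beta.1, beta.2, ... in that file):

import pandas as pd

# Read the ADVI draws Stan wrote to disk; sample_file may be bytes, hence decode().
samples = pd.read_csv(v['args']['sample_file'].decode(), comment='#')

# Posterior means of the ARD parameters, one per design-matrix column.
beta_cols = [c for c in samples.columns if c.startswith('beta')]
beta_mean = samples[beta_cols].mean().values

# Normalise to rough per-column fractions, then sum the columns belonging
# to each term in the patsy formula (e.g. all levels of LB_type).
fractions = beta_mean / beta_mean.sum()
per_term = {term: fractions[sl].sum()
            for term, sl in Y.design_info.term_name_slices.items()}
print(per_term)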

We see that the majority of the variance in the data is due to sample_type, which indicates whether a sample is a proper sample, a positive control, or a negative control. After this, the LB_type covariate explains the second largest amount of variance. (This is a sample condition in this particular dataset; exactly what it means is not important for the proof of concept.)

The result seems pretty stable across sub-samples of genes in my experimentation. I think this might be a neat way to quickly assess your data. I'm not sure how fast it will run with more samples though, even when only 1,000 genes are used.

A notebook of this can be found here.

I really like how quickly problems can be solved, at least to the prototype stage, with probabilistic programming languages like Stan. This short model solves a problem in an intuitive way, and the conciseness of the language meant I could write the entire model on the bus home from work.