PCA with batch effects in Stan

In Principal Component Analysis (PCA), we wish to find a simple linear model that explain multidimensional data. If we have \( G \) variables:

\[ \require{color} \begin{align} y^g &= {\color{red} w_1^g} \cdot {\color{red} x_1} + {\color{red} w_2^g} \cdot {\color{red} x_2} + {\color{red} \mu^g} + \varepsilon \\ \varepsilon &\sim \mathcal{N}(0, {\color{red} \sigma^2}) \\ g &\in \{1, \ldots, G\} \\ (x_1, x_2) &\sim \mathcal{N}(0, I) \end{align} \]

The red parts of the formula indicate quantities we need to infer from the data.

(In a previous version of this post I hadn't specified the multivariate normal prior on \( X \). Mike Love pointed out that without it, the components will not be orthogonal.)

Let us look at an example and a simple implementation.

As an illustratory data set, let us use the classical Iris data set. This consists of 150 observations of four measurements (sepal length, sepal width, petal length, and petal width) of three species of Iris flowers (I. setosa, I. versicolor, and I. virginica).

To implement PCA, we use Stan, a probabilistic programming language, where you just write out the model, and the inference is handled automatically. In the C++ based notation of Stan, the PCA model described above is written in the following way:

data {
    int<lower = 1> N;  // Number of samples
    int<lower = 1> G;  // Number of measured features

    vector[G] Y[N];    // Data
}
transformed data{
    vector[2] O;
    matrix[2, 2] I;

    O[1] = 0.;
    O[2] = 0.;

    I[1, 1] = 1.;
    I[1, 2] = 0.;
    I[2, 1] = 0.;
    I[2, 2] = 1.;
}
parameters {
    vector[2] X[N];
    vector[G] mu;
    matrix[G, 2] W;

    real<lower = 0> s2_model;
}
model {
    // "For every sample ..."
    for (n in 1:N){
        X[n] ~ multi_normal(O, I);
    }

    for (n in 1:N){
        Y[n] ~ normal(W * X[n] + mu, s2_model);
    }
}

The typical way to use Stan is Bayesian analysis, where you define your model in Stan along with your priors (which by default, like here, will be uniform) and use Stan to draw samples from the posterior. We will do this, then plot the mean of the posterior \( X \) values.

From this we can see that I. setosa is quite different from the other two species, which are harder to separate from each other.

Now imagine that the iris data was collected by two different researchers. One of of them has a ruler which is off by a bit compared to the other. This would cause a so called batch effect. This means a global bias due to some technical variation which we are not interested in.

Let us simulate this by randomly adding a 2 cm bias to some samples:

batch = np.random.binomial(1, 0.5, (Y.shape[0], 1))
effect = np.random.normal(2.0, 0.5, size=Y.shape)
Y_b = Y + batch * effect

Now we apply PCA to this data set Y_b the same way we did for the original data Y.

We see now that our PCA model identifies the differences between the batches. But this is something we don't care about. Since we know which researcher measured which plants, we can include this information in model. Formally, we can write this out in the following way:

$$ \begin{align} y^g &= {\color{red} v^g} \cdot {z} + {\color{red} w_1^g} \cdot {\color{red} x_1} + {\color{red} w_2^g} \cdot {\color{red} x_2} + {\color{red} \mu^g} + \varepsilon \\ \varepsilon &\sim \mathcal{N}(0, {\color{red} \sigma^2}) \\ g &\in \{1, \ldots, G\} \\ (x_1, x_2) &\sim \mathcal{N}(0, I) \end{align} $$

In our case, we let \( z \) be either 0 or 1 depending on which batch a sample belongs to.

We can call the new model Residual Component Analysis (RCA), because in essence the residuals of the linear model of the batch is being further explained by the principal components. These concepts were explored much more in depth than here by Kalaitzis & Lawrence, 2011.

Writing this out in Stan is straightforward from the PCA implementation.

data {
    int<lower = 1> N;
    int<lower = 1> G;
    int<lower = 0> P;  // Number of known covariates

    vector[G] Y[N];
    vector[P] Z[N];    // Known covariates
}
transformed data{
    vector[2] O;
    matrix[2, 2] I;

    O[1] = 0.;
    O[2] = 0.;

    I[1, 1] = 1.;
    I[1, 2] = 0.;
    I[2, 1] = 0.;
    I[2, 2] = 1.;
}
parameters {
    vector[2] X[N];
    vector[G] mu;
    matrix[G, 2] W;
    matrix[G, P] V;

    real<lower = 0> s2_model;
}
model {
    for (n in 1:N){
        X[n] ~ multi_normal(O, I);
    }

    for (n in 1:N){
        Y[n] ~ normal(W * X[n] + V * Z[n] + mu, s2_model);
    }

}

We apply this to our data with batch effects, and plot the posterior \( X \) values again.

Now we reconstitute what we found in the data that lacked batch effect, I. setosa separates more from the other two species. The residual components \( X_1 \) and \( X_2 \) ignores the differences due to batch.

Discussion

Note that the batch effect size \( v^g \) here is different for each feature (variable). So this would equally well apply if e.g. the second researcher had misunderstood how to measure petal widths, causing a bias in only this feature. There is also nothing keeping us from including continuous values as known covariates.

Typically when batch effects are observed, at least in my field, a regression model is first applied to the data to "remove" this effect, then further analysis is done on the residuals from that model.

I think this kind of strategy where the known information is added to a single model is a better way to do these things. It makes sure that your model assumptions are accounted for together.

A weird thing I see a lot is people trying different methods to "regress out" batch effects, and then perform a PCA of the result to confirm that their regression worked. But if your assumption is that PCA, i.e. linear models, should be able to represent the data you can include all your knowledge of the data in the model. The same goes for clustering.

In a previous version of this post, I estimated the parameters with the penalized likelihood maximization available in Stan. But estimation of the principal components in this way is not very good for finding the optimial fit. There are lots of parameters (2 * 150 + 4 * 3) and it's very easy to end up in a local optimum. Principal component analysis is very powerful because it has a very well known optimal solution (eigendecomposition of covariance matrix).

However, writing the models in Stan like this allows you to experiment with different variations of a model, and the next step would then be to try to find a good fast and stable way of inferring the values you want to infer.

The code for producing these restults is available at https://github.com/Teichlab/RCA.

The first steps in RNA-seq expression analysis (single-cell and other)

Recently a colleague asked me if I know of any good online tutorials on analysing single-cell RNA-seq data. There are a number of great resources for this. However, they all start from having obtained your expression matrix already (See this https://github.com/davismcc/workshop-scRNAseq-oxford-sep2016 or this https://f1000research.com/articles/5-2122/v1 ). If you get your data delivered from a facility, you still need to know what to do. Charlotte Sonesson recently published a set of slides with an overview of modern RNA-Seq workflows. But I think it skims through the practical parts a bit briefly. So here I will focus on those practical bits, and hopefully this will be informative to anyone who received some sequence data and want to analyse it.

Reference - What did you measure?

This a step you can do before you get your data. When you are performing an RNA-seq experiment, you are measuring cDNA of RNA in your sample (sometimes poly-A RNA, sometimes all the RNA). A sequence reference in this case will be a list of cDNA sequences you expect to measure. For your biological sample, I think the simplest resource for this is Ensembl Biomart. The reason to use Biomart rather than transcriptome reference files is that it is updated more often, and offers filtering of genes (which we will ignore here). It also seems more things are considered “genes” there. In your index you want everything which make generate cDNA in your samples.

When you go to Biomart, you want to use the Ensembl Genes X database, where X is the current version (at writing 85) (1-A). The Ensembl gene annotation version updates about four times per year, so you want to do this procedure whenever you get new data.

Next you need to pick your organism. For this example we’ll use Mus musculus (1-B).

As discussed above, since we are measuring cDNA in the samples, we want to get this data from the Ensembl annotation. Click Attributes (2-A), then change from ‘Features’ to ‘Sequences’ (2-B). The sequences you want are ‘cDNA sequences’, so select these (2-C).

By default the headers of these sequences have the gene ID in them, but we just want the transcript ID. Scroll down to “Header Information” and unclick “Ensembl Gene ID” (3-A).

To download the reference file, go to Results (4-A), export results as FASTA (4-B) and click ‘Go’ (4-C).

This will download a file called ‘mart_export.txt’, which is about 200 MB large.

We rename this file to something informative so we don’t mix it up with other mart_export.txt files. For me, the informative bits are the genome assembly version and the Ensembl version. As well as an indicator that this is cDNA.

$ cd reference
$ mv mart_export.txt Mus_musculus_GRCm38.p4_E85_cdna.fasta

In many implementation of single cell rna-sequencing spike-ins are added. In this case I am making this example from, we are expecting reads from the ERCC spike ins. These must be added to the reference. Beyond a way to assess success of the experiment, it will also inflate some % reads mapped which is usually used as a quality control metric.

These sequences can be downloaded from https://tools.thermofisher.com/content/sfs/manuals/ERCC92.zip. This is a zip containing ERCC92.fa and ERCC92.gtf, and for this workflow we will only need ERCC92.fa.

The two sequence references files needs to be combined, and this is really simple for FASTA files, you simply catenate them.

$ cat Mus_musculus_GRCm38.p4_E85_cdna.fasta ERCC92.fa > Mus_musculus_GRCm38.p4_E85_cdna_ERCC.fasta

At this point I just want to mention that this might not be enough to cover the cDNA you should expect in a sample. In many cases repeat regions are expressed and gets converted to cDNA. We looked at RNA of repeats in our recent paper in Cell Reports.

There might be other sources of cDNA you might want to add to the reference, like potential contaminants. This can help explain low mapping rates.

In this example I am using Salmon. I would also recommend Kallisto, but at the moment Salmon has the ability to model more biases.

Make sure you have Salmon installed. You can download the binary from github, or install it with Homebrew/Linuxbrew.

To keep track of indices, I like make a directory for the kind of index I am making, then create the index in there.

$ mkdir salmon

Keeping the name the same makes it easier to relate back to what you ran the samples against.

$ salmon index -t Mus_musculus_GRCm38.p4_E85_cdna_ERCC.fasta -i salmon/Mus_musculus_GRCm38.p4_E85_cdna_ERCC.fasta

This takes about 5 minutes.

Salmon can take a ‘genemap’ to summarise expression to the gene level (by summing the expression of transcripts arising from a given gene). The ‘genemap’ is a simple TSV table with gene and transcript name. We get this from biomart as well, to match the cDNA index.

The simplest way to generate this is to go to ‘Features’ (5-A) in the Attributes (5-B) of Biomart. The table need to be ordered (transcript, gene). So unclick ‘Ensembl Gene ID’ (5-C), then click it again, so the order is correct (5-D).

Go to Results (6-A) and download the file as TSV (6-B) by clicking ‘Go’ (6-C).

Rename this text file to something memorable. Again, I like to match the name of this file with the FASTA file to keep track of these belonging together.

$ mv mart_export.txt Mus_musculus_GRCm38.p4_E85_cdna.genemap.tsv

We need to add the ERCC names to this list though. To make a “genemap” from the ERCC FASTA file you can run the following command:

$ grep '>' ERCC92.fa | tr -d '>' | sed 'p' | paste -d '\t' - - > ERCC92_genemap.tsv

Now we can simply merge the mouse cDNA genemap and the ERCC genemap

$ cat Mus_musculus_GRCm38.p4_E85_cdna.genemap.tsv ERCC92_genemap.tsv > Mus_musculus_GRCm38.p4_E85_cdna_ERCC.genemap.tsv

Expression quantification - processing the data

Now let us move to the actual expression quantification. I am going to assume you have a directory with one pair of FASTQ files per sample (one forward and one reverse file per sample). You probably don’t, but the way data is delivered from sequencing facilities is so heterogeneous, you will just have to figure out how to reach this stage. For example, at the Wellcome Trust Sanger Institute, sequencing data is delivered in CRAM files, which need to be converted.

I find it easiest to organise things so that in a data directory of a project, I have one directory with FASTQ files, and make another directory with Salmon outputs, called ‘salmon’. (Or similar for any other program that solves a problem.)

$ cd ..
$ ls fastq/ | head
sc1_cell10_1.merged.fastq
sc1_cell10_2.merged.fastq
sc1_cell11_1.merged.fastq
sc1_cell11_2.merged.fastq
sc1_cell1_1.merged.fastq
sc1_cell12_1.merged.fastq
sc1_cell12_2.merged.fastq
sc1_cell1_2.merged.fastq
sc1_cell13_1.merged.fastq
$ mkdir salmon
$ cd salmon

Now we have everything you need to run Salmon on any given sample. However, this will be tedious, so you should write some script which will do this for you. A very portable way to do this is by using GNU Make. A more intuitive alternative for this sort of processing is to use Snakemake. Below I’m pasting in an example Snakemake file.

$ cat Snakefile
import glob

FASTQS = glob.glob('../fastq/*_1.merged.fastq')

rule all:
   input: [os.path.basename(fq).replace('_1.merged.fastq', '_salmon_out') for fq in FASTQS]

rule salmon:
   input:
       index='../reference/salmon/Mus_musculus_GRCm38.p4_E85_cdna_ERCC.fasta',
       genemap='../reference/salmon/Mus_musculus_GRCm38.p4_E85_cdna_ERCC.genemap.tsv',
       fwd='../fastq/{sample}_1.merged.fastq',
       rev='../fastq/{sample}_2.merged.fastq'
   output:
       '{sample}_salmon_out'
   shell: '''salmon quant -i {input.index} \
                          -l IU \
                          -g {input.genemap} \
                          -1 {input.fwd} \
                          -2 {input.rev} \
                          -o {output} \
                          --posBias \
                          --gcBias
          '''

This Snakefile just runs one command on each sample, which is ‘salmon quant’. Here we provide the reference and the genemape with the -i and -g flags. For Salmon you also need to specify the library type, regarding strand specificity, which is one factor used to judge the mapping locations of read pairs. For ‘normal’ samples this ‘-l’ flag will be IU, but check the documentation to be sure this corresponds to your samples. We also provide some flags to calculate bias parameters. By running ‘snakemake’ in the ‘salmon’ directory you can execute ‘salmon quant’ for all samples. If you have access to a cluster you can run hundreds of these at once by doing e.g. (if you have LSF)

$ snakemake --cluster "bsub -M 10000 -R 'rusage[mem=10000]'" --jobs 200

Once this finishes, you will have expression values for all your samples.

Bringing the data together

What we described above will get you one resulting Salmon directory per sample. To compare these samples, you need to combine the results to a table. As I work mostly in Python, I made a little helper package to combine these to useful tables.

%pylab inline

import readquant

sample_info = readquant.read_qcs('salmon/*_salmon_out', version='0.7.2')

This command parses technical information from the Salmon results which are useful for quality control of the samples. Next read in the expression values per gene in the samples.

tpm = readquant.read_quants('salmon/*_salmon_out')
tpm.iloc[:5, :3]

salmon/sc1_cell35_salmon_out

salmon/sc1_cell24_salmon_out

salmon/sc2_cell55_salmon_out

Name

ERCC-00158

90.6392

0.00000

0.0

ERCC-00154

35.6046

21.19100

0.0

ERCC-00150

43.4815

0.00000

0.0

ERCC-00143

12.7487

6.95937

0.0

ERCC-00142

0.0000

0.00000

0.0

Now, when we are working on gene level, it is good to generally have an annotation of the genes we are interested in. We have been going back to Biomart a lot but this is the last one! Go Attributes (7-A), then Features (7-B), unclick Ensembl Transcript ID (7-C), then click everything you might be interested in on gene level.

A must is the Associated Gene Name which will allow you to relate the gene id to something recognizable. Chromosome Name is useful for e.g. filtering MT genes. I like the ability to check the rough location of a gene using the Gene Start (bp), for example, if a number of genes are highly correlated, I can quickly check if they are at they share a locus. If there is anything you think you might be interested in relating to the genes you might find, just add it. When you’re done, go to Results (8-A), and download the table as a CSV (8-B) by clicking ‘Go’ (8-C).

Again rename this annotation file so you can relate it to the rest of the files.

$ cd ..
$ mv /Users/vale/Downloads/mart_export\ \(1\).txt reference/Mus_musculus_GRCm38.p4_E85_gene_annotation.csv

Read in this annotation file in the Python you are running

import pandas as pd
gene_annotation = pd.read_csv('reference/mus-musculus/Mus_musculus_GRCm38.p4_E85_gene_annotation.csv', index_col=0)

Beyond the purely technical information, some QC information can be generated based on the abundance estimates. Firstly, the ERCC spike-ins gives some relative information about the mRNA amount captured from a cell. A note is to also remove ERCCs from expression abundances, since this will mask differences in relative mRNA abundance in each cell. Finally, two common metrics for QC are the number of detected genes, as well as the relative abundance of mitochondrially encoded genes.

qc_info = pd.DataFrame(tpm.loc[tpm.index.str.startswith('ERCC-')].sum(), columns=['ERCC_content'])
etpm = tpm.loc[~tpm.index.str.startswith('ERCC-')]
etpm = etpm / etpm.sum(0) * 1e6
qc_info['num_genes'] = (etpm > 1).sum()
qc_info['MT_content'] = etpm.loc[gene_annotation['Chromosome Name'] == 'MT'].sum()
sample_info = sample_info.join(qc_info)
sample_info.head().T

salmon/sc1_cell35_salmon_out

salmon/sc1_cell24_salmon_out

salmon/sc2_cell55_salmon_out

num_processed

6454761

3848929

2753898

num_mapped

4915485

2605536

1380493

percent_mapped

76.1529

67.6951

50.1287

global_fl_mode

68

57

80

robust_fl_mode

100

100

175

ERCC_content

612303

339921

514.417

num_genes

9575

12073

9005

MT_content

116475

67066.2

3395.65

Now everything is processed, and we can save the files and move on to downstream analysis.

etpm.to_csv('salmon_etpm.csv')
sample_info.to_csv('salmon_sample_info.csv')

From this point, you have expression values and sample information which can be used in the tutorials mentioned above.

As an example, and to end with an actual plot, we could investigate things like mapping rate and number of detected genes:

plt.xscale('log')
plt.scatter(sample_info.num_processed, sample_info.num_genes, c='k');

Swedish School Fires by Gaussian Process Regression

My former colleague Mikael Huss recently published an interesting data set on Kaggle which indicates the school fires reported in Swedish municipalities over last few years. This is coupled with some predictors for the towns over the same years, with aim to see if some predictors associate with an increase in school fires.

The data set is published along with some initial analysis. I was surprised at the poor performance when normalising the fire counts by population, to obtain a frequency.

Partially I got curious about why that was, but also was wondering if I could formulate a nice non-linar model for the data using Gaussian Process Regression.

I registered and downloaded the data from Kaggle. First I read in the fire cases.

%pylab inline
import pandas as pd
import seaborn as sns
Populating the interactive namespace from numpy and matplotlib
fires = pd.read_csv('school_fire_cases_1998_2014.csv')
sns.set_style('ticks')
sns.set_color_codes()
sns.set_context('talk')

Now let's see how these counts relates to population in municipalities

figsize(6, 4)
plt.loglog()
plt.scatter(fires.Population, fires.Cases, c='k', s=30);
sns.despine()
plt.xlabel('Population'), plt.ylabel('Cases');

There seem to be a relation, though it looks a bit binary, more clear at the high end where there are around 10 cases. These seem like a bit too few cases per town to beat out stochastic noise. To picture the underlying generative process a bit better, we ignore the time-dimension we have, and just consider the total number of cases in each town in this region.

sum_fires = pd.DataFrame(fires.groupby('Municipality').sum()['Cases']) \
.join(fires.groupby('Municipality').max()['Population'])
plt.loglog()
plt.scatter(sum_fires.Population, sum_fires.Cases, c='k', s=30);
sns.despine()
plt.xlabel('Population'), plt.ylabel('Cases');

Now the relation between the cases of school fires and the population of a town is much clearer!

This indicates that we should be able to normalise the cases succesfully in to a frequency. However, I would prefer putting the population as part of the model, which we will get into later.

First, let us look at our predictors per town. These are referred to as Key Performance Indicators in the data set.

indicators = pd.read_csv('simplified_municipality_indicators.csv')
indicators.columns
Index(['code', 'name', 'medianIncome', 'youthUnemployment2010',
       'youthUnemployment2013', 'unemployment2010', 'unemployment2013',
       'unemploymentChange', 'reportedCrime', 'populationChange',
       'hasEducation', 'asylumCosts', 'urbanDegree', 'satisfactionInfluence',
       'satisfactionGeneral', 'satisfactionElderlyCare', 'foreignBorn',
       'reportedCrimeVandalism', 'youngUnskilled', 'latitude', 'longitude',
       'population', 'populationShare65plus', 'municipalityType',
       'municipalityTypeBroad', 'refugees', 'rentalApartments', 'governing',
       'fokusRanking', 'foretagsklimatRanking', 'cars', 'motorcycles',
       'tractors', 'snowmobiles'],
      dtype='object')

Now, I have been a bit lazy, and I haven't actually read the documentation for these. I don't know the units for these, and which ones are normalised for population. Or which ones it would make sense to do that with.

To quickly get a hang of this, I like to use a trick from computational biology, which is to look at the correlation matrix of these variables, ordered by linkages based on similarity between the correlations.

cmp = sns.clustermap(indicators.corr(method='spearman'), yticklabels=False);

Generally, I don't like putting too much weight in to these linkage clusterings. But we do see two fairly distinct groups. One contains the population variable, so likely the other members of this group are counts rather than frequencies.

(This also shows some entertaining things, like the snowmobiles strongly correlating with latitude.)

Now, we want to use some of these to predict the number of cases of school fire in a given town. Say that \( C \) is the cases of school fire, and that \( P \) is the population. Following this, we let the remaining predictors be \( x_1, x_2, \) etc. The clear relation between \( C \) and \( P \) we noted above, and it would seem we could invsetigate \( \frac{C}{P} \), assuming a linear relation. But we could also behave a bit more agnoistically about this relation.

$$
\log C = f(\log P) + g(\log x_1, \log x_2, \ldots) + \varepsilon
$$

I don't have a particularly good argument for doing all this on a log scale. It is just that I've found it to be generally useful when there are several orders of magnitude of dynamic range.

Ideally here, the effect of population size should be captured by the function \( f \).

An issue though, is the predictors which correlate with population. Since we want all the effects of Population size to be captured in \( f \), we would need to remove this effect from the other predictors. Since we know that Population will predict the Cases quite well, finding one of these to be predictive will most likely just illustrate confounding with the Popultion effect. I can't really think of a clever way of dealing with this (except "normalising" them, but I want to avoid that strategy here, as the point is to try to be non-parametric). For now, we deal with this by not picking predictors from the cluster that correlates with population.

What we actually want to learn is, which of the predictors in \( g \) is more important?

By performing Gaussian Process Regression, we can investigate the functions \( f \) and \( g \). This is a Bayesian technique, where we first provide a prior distribution for each of the functions. If the prior distribution is such that the covariance of function values are parametrised by a covariance function between the predictor values, these distributions will be Gaussian Processes. By selecting your covariance function, you state what kind of functions you would expect if you hadn't observed any data. Once you have observations of data, you can find the posterior distribution of function values for \( f \) and \( g \).

For my research, I use Gaussian Process Regression and other related models to investigate the dynamics of gene expression.

In our case, we will use the Squared Exponential covariance function for \( f \), and an Automatic Relevence Determination version of the SE covariance function for \( g \). The ARD will allow us to find which predictor variables affect predictions from the model, which should relate to their importance.

To model the data as described above, we will use the new Gaussian Process package GPflow. This package implements several Gaussian Process models, and components for these, using the TensorFlow library.

import GPflow

For simplicity, let's combine everything in to one DataFrame.

data = sum_fires.join(indicators.set_index('name'))
data.columns
Index(['Cases', 'Population', 'code', 'medianIncome', 'youthUnemployment2010',
       'youthUnemployment2013', 'unemployment2010', 'unemployment2013',
       'unemploymentChange', 'reportedCrime', 'populationChange',
       'hasEducation', 'asylumCosts', 'urbanDegree', 'satisfactionInfluence',
       'satisfactionGeneral', 'satisfactionElderlyCare', 'foreignBorn',
       'reportedCrimeVandalism', 'youngUnskilled', 'latitude', 'longitude',
       'population', 'populationShare65plus', 'municipalityType',
       'municipalityTypeBroad', 'refugees', 'rentalApartments', 'governing',
       'fokusRanking', 'foretagsklimatRanking', 'cars', 'motorcycles',
       'tractors', 'snowmobiles'],
      dtype='object')

Now we prepare the data for Gaussian Process Regression. We log-scale both the dependent variable, and the predictors (taking care to deal with infinities.)

We also pick some predictors to work with. There are a large number of variables, and with the limited data we have, using all of them will probably just model very small effects which we wouldn't notice very well. Also recall that we don't want to use any of the ones correlating with population!

Y = data[['Cases']] \
        .replace(0, 1) \
        .pipe(np.log10) \
        .as_matrix()

model_kpis = ['Population', 'refugees', 'unemployment2013',
              'longitude', 'asylumCosts', 'satisfactionGeneral',
              'cars', 'fokusRanking', 'youngUnskilled', 'tractors']

X = data[model_kpis] \
        .replace(0, 1) \
        .apply(pd.to_numeric, errors='coerce') \
        .pipe(np.log10) \
        .replace(np.nan, 0) \
        .as_matrix()

A thing which makes these models so flexible, is that the way you express the type of expected functions and the relation between predictors in terms of the covariance function. Here we will formulate the relation by saying that

$$
F(X) = f(\log P) + g(\log x_1, \log x_2, \ldots),
$$
$$
k_F(\cdot, \cdot) = k_f(\cdot, \cdot) + k_g(\cdot, \cdot).
$$

(Here \( k_f(\cdot, \cdot) \) is a covariance function for \( f \) which only acts on the first column of \( X \). And \( k_g(\cdot, \cdot) \) acts on the remaining columns.)

Above, we put the Population in the first column of X, and the remaining columns contain the predicors which we want to use with the ARD SE covariance function.

N_m = len(model_kpis) - 1
kernel = GPflow.kernels.RBF(1, active_dims=[0]) + \
         GPflow.kernels.RBF(N_m , active_dims=list(range(1, N_m + 1)), ARD=True)

Now we are ready to define the model and provide the observed data.

m = GPflow.gpr.GPR(X, Y, kern=kernel)

The way investigate this model, is by selecting hyperparameters for the priors. Hyperparameters are parameters of the covariance functions which dictate features of the functions we are expected to see if we have not observations, which in turn affect the kind of posterior functions we would expect to see.

A nice feature with Gaussian Process models is that in many cases these can be optimized by finding the optimal likelihood of the model. This is also where the power of TensorFlow comes in. Internally in GPflow, you only need to define your objective function, and gradients will be calculated for you.

m.optimize()
compiling tensorflow function...
done
optimization terminated, setting model state





      fun: 47.661120041075691
 hess_inv: <13x13 LbfgsInvHessProduct with dtype=float64>
      jac: array([  2.77119459e-05,   3.10312238e-11,  -5.20177179e-11,
        -1.53390706e-11,  -2.40598480e-10,   7.86837771e-10,
         3.27595726e-04,  -3.28255760e-04,  -3.12000506e-10,
        -1.50492355e-04,   2.20556318e-04,   5.18355739e-05,
        -2.45822687e-03])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 154
      nit: 121
   status: 0
  success: True
        x: array([  7.83550523e+00,   4.15028062e+03,   3.49511056e+03,
         8.85917870e+03,   1.25124606e+04,   1.52042456e+03,
         1.27664713e+00,   8.35964994e-02,   5.61225444e+03,
        -1.24241599e+00,   1.31391936e+00,   2.30057289e+00,
        -2.57666517e+00])

After this optimization, we can investigate the model.

m
Namevaluespriorconstraint
model.kern.rbf_2.lengthscales[ 7.8359016 inf inf inf inf inf
1.52270404 0.73581972 inf]
None+ve
model.kern.rbf_2.variance[ 0.25362403]None+ve
model.kern.rbf_1.lengthscales[ 1.55196403]None+ve
model.kern.rbf_1.variance[ 2.39606716]None+ve
model.likelihood.variance[ 0.07327667]None+ve

The lengthscale parameters indicate how "dynamic" the function is with respect to the predictor variable. A long lengthscale means that longer distances in the predictor are needed before the function values changes. While a short one means the function values chanes a lot, and thus are sensitive with resposect to the predictor.

The trick of the ARD, is to allow a separate lengthscale for each predictor, but with shares noise. A predictor with a short lengthscale (in the optimal likelihood) will cause many changes as you change the predictor.

This means we can use the inverse of the ARD lengthscales as a measure of variable importance.

pdict = m.kern.get_parameter_dict()
pdict
{'model.kern.rbf_1.lengthscales': array([ 1.55196403]),
 'model.kern.rbf_1.variance': array([ 2.39606716]),
 'model.kern.rbf_2.lengthscales': array([ 7.8359016 ,         inf,         inf,         inf,         inf,
                inf,  1.52270404,  0.73581972,         inf]),
 'model.kern.rbf_2.variance': array([ 0.25362403])}
sensitivity = pdict['model.kern.rbf_2.variance'] / pdict['model.kern.rbf_2.lengthscales'] ** 2
figsize(6, 4)
plt.barh(np.arange(1, N_m + 1) - 0.4, sensitivity, color='k');
plt.yticks(np.arange(1, N_m + 1), model_kpis[1:], rotation=0);
plt.xlabel('Sensitivity');

It is possible illustrate the population effect by itself, by making a Gaussian Process Regression model using only this part of the covariance function, which was optimized for the full model.

i = 8

plt.loglog()

m_pop = GPflow.gpr.GPR(X, Y, kern=m.kern.rbf_1)
XX = np.linspace(3.5, 6.)[:, None]
YY_m, YY_v = m_pop.predict_f(XX)

plt.scatter(10 ** X[:, 0], 10 ** Y, c=10 ** X[:, i], s=40);
sns.despine()
plt.colorbar(label=model_kpis[i])

plt.plot(10 ** XX, 10 ** YY_m, c='r', lw=3, label='Population effect')
plt.plot(10 ** XX, 10 ** (YY_m + 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3)
plt.plot(10 ** XX, 10 ** (YY_m - 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3);

plt.legend(loc='upper left')

plt.xlabel('Population'); plt.ylabel('Cases');

It is still not perfectly visible how a predictor explains school fire cases when correcting for population. As a visualization, we can remove the mean population effect from the cases, and thus look at how the residuals relate to the predictor of choice.

Y_p_m, Y_p_v = m_pop.predict_f(X)
plt.loglog()
plt.scatter(10 ** X[:, 0], 10 ** (Y - Y_p_m), c=10 ** X[:, i], s=40);
sns.despine()
plt.colorbar(label=model_kpis[i])

plt.plot(10 ** XX, 10 ** (0 * YY_m), c='r', lw=3)
plt.plot(10 ** XX, 10 ** (2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3)
plt.plot(10 ** XX, 10 ** (- 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3);

plt.xlabel('Population'); plt.ylabel('Residual of Cases');
plt.loglog()
plt.scatter(10 ** X[:, i], 10 ** (Y - Y_p_m), c='k', s=40);
sns.despine()

plt.xlabel(model_kpis[i]); plt.ylabel('Residual of Cases');
data.sort_values(model_kpis[i], ascending=False).head()[['Cases', model_kpis[i]]]
Cases youngUnskilled
Municipality
Perstorp 4 19.1
Haparanda 2 17.1
Södertälje 44 16.9
Skinnskatteberg 1 16.6
Gnesta 5 16.0

The variability of a spoonful of coffee

I like freshly ground coffee, especially made with a French press. The common advice regarding coffee is to dose by weight, due to measurement variability regarding beans in spoons. I was curious how large this variability is.



The different sizes and shapes of the coffee beans makes the volume of the packed beans not correspond to the mass of the coffee, which will be ground to powder.

Different coffee beans are also differently dense, here I’m looking at Brazil Santos coffee.

I made 10 measurements, where I took a level spoon, weighed the contents, then replaced them in my jar of beans before I repeated the measurement.



To quantify the variability, I made a model in Stan. Here a spoonful is modeled normally distributed with a standard deviation parameter which correspond to the between-spoonful variability.

import pystan

code = '''
data {
    int<lower> N;
    real<lower> y[N];
}
parameters {
    real<lower> mu;
    real<lower> sigma;
}
model {
    // Priors
    mu ~ uniform(0, 25);
    sigma ~ uniform(0, 25);

    // Likelihood
    y ~ normal(mu, sigma);
}
'''

data = {
    'N': 10,
    'y': [13.54, 14.22, 13.63, 13.20, 13.32,
          13.23, 13.98, 13.66, 13.51, 14.18]
}

fit = pystan.stan(model_code=code, data=data)

fit

Inference for Stan model: anon_model_393a8d8deb17adb090b969fb1275a378.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu     13.64  2.3e-3   0.14  13.34  13.55  13.64  13.73  13.91   4000    1.0
sigma   0.43  2.1e-3   0.13   0.26   0.35   0.41   0.49   0.76   4000    1.0
lp__    5.14    0.02   1.13   2.27   4.74   5.47   5.91   6.22   4000    1.0

Samples were drawn using NUTS(diag_e) at Mon Jul 11 20:54:38 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).


The y values in the data dictionaries are the weight measurements.



In the end this means that the confidence interval of the spoonful of coffee is between 12.8g and 14.5g.

Writing and Citing with Google Docs and Paperpile

Earlier this day Mick Watson posted a brief post about his scientific writing workflow using Mendeley and Word. I’ve been happily using Paperpile and Google Docs for similar goals, so though I’d post that as a comparison.

First, you just start a new doc for your paper. You write until you need to cite something. Then you hit ctrl+alt+shift+p and search for papers. Initially in your curated collection, but also in many databases on the web such as PubMed.

As you add references, they both get added to your own Paperpile collection, and stored in the Google Doc itself (so all collaborators can see them and manage citations).

Once you have a lot of citations, you pick a citation style, click Format citations in the Paperpile menu.

And you end up with a reference list styled to your liking.

You can then click each citation and edit it in terms of changing and adding/removing references.

In Mick’s post he makes a group in Mendeley to keep track of papers related to a given manuscript. You can also do this in Paperpile, but in practice i think it’s easier just to look at the documents individual manager (on the right in picture above).

Streaming RNA-seq data from ENA

For many of the projects I’m working on for my PhD I use published data. Up until now my strategy has been to download all read files of an experiment from ENA, then process them all with e.g. Salmon to get expression values. This feels a bit silly because sequencing read files are on the order of gigabytes in size, while a csv file of expression values is a few megabytes. In fact, currently my data directory has almost 50 terabytes of public data in it.

The other day I saw this gist from Mike Love. Supposedly it gives you the URL of a ftp hosted fastq from the ENA/SRA accession number of it. This is great, because you can just use curl on the URL, which by defualt streams a file in chunks, to fetch the contents of a sequencing read data set. We only need to know the name of it.

I made a small Bash script for streaming a given accession id.

#/bin/bash

fastq="$1"

prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1)

dir1=${accession:0:6}

a_len=${#accession}
if (( $a_len == 9 )); then
    dir2="";
elif (( $a_len == 10 )); then
    dir2=00${accession:9:1};
elif (( $a_len == 11)); then
    dir2=0${accession:9:2};
else
    dir2=${accession:9:3};
fi

url=$prefix/$dir1/$dir2/$accession/$fastq.gz

curl --keepalive-time 4 -s $url | zcat


We call this file stream_ena. You need to know the id of accession, and whether the file you want to look at is part of a pair. But then you have instant access to the contents of any published sequencing data-set!

If we want to look at some single-end data, we can just do

$ ./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG


If we want to quantify this data set with salmon, we can now simply run

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782


This will stream the entire contents of accession in to salmon directly from ENA without storing anything on disk, and quantified expression will be saved in the directory SRR3185782.

For a dataset with paired reads we would do for example

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127


Many sequencing tools supports streaming out box don care whether you are your disk or server online.

Say for example we want to look at 5 random reads:

$ seqtk sample -s $(date +%s) <(./stream_ena SRR1274127_1.fastq) 0.001 | head -n 20
@SRR1274127.753 753/1
TTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACGAGGGTTTATTTTTTTGGGTAGAACTGGAATAAAAGCT
+
BCCFFFFFHHHHHJJJJJJJJHIJIJJJIICGHIJGIJJIJJJJJJJJJJJJJJJJJIJJJGHHFFFD;ADDDDEEDDDD>&2>?ACDDDCDDDDDDDDDC
@SRR1274127.1464 1464/1
AATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCT
+
CCCFFFFFHHHHHJJIIHIIJJIJJIGJGJJJJJJJJJJJJJJJJJJJJJJJJJJFIIJJJJJJIJJIIJJIIIJJHHHHFFDFFFFEEDDDDDDB@BBB9
@SRR1274127.1672 1672/1
CCCTACTACTATCTCGCACCTGAAACACCCTAACATGACTAACACCCTTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTT
+
@@@DDDD>FABBD@BDFB:FE;+2+2<)):CFB?<3?4?B*99???FFFE98)>F;778)7(5@E1CF?DDB#############################
@SRR1274127.2188 2188/1
GATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGGGTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGT
+
@?@DFDEDFBHFHGGG@DCHEEHG@FEHG@HGGGGIID=FDHGGHGCG?8?CFHHGHGAHEACHA@E<D@EFE>??C@@CD@B@AABCCC@BB@BCB9992
@SRR1274127.4127 4127/1
CTTATACTAGTATCCTTAATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTATCTATAAACCTAGC
+
??@DDFFFDHFFDGHGGGGGHEGGJEHIIAFDGIIGGIJGIGHHIJJIAFBGIGHJGEGIGGJGGGI=EGEEECHAB<?ACB?ABCCDDDDDDCCDDCDCD


Or if we want to check the quality of a dataset without wasting space downloading it:

$ ./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin


Of course there are caveats. You can’t just blindly put any reads in to salmon and get correct expression. You need to read the methods section of the related study to see what parameters use. The data might need some might preprocessing before it is useful. It is also very common that files uploaded to ENA should be merged before being input to processing tools. But this streaming approach greatly ease storage burden burden from working with public data.

K-means in TensorFlow

I have been getting a bit interested in the new fancy TensorFlow package. As a little exercise to figure out roughly how to use it, I figured I could implement a simple model. I chose to implement k-means clustering, since it’s very simple, but a bit different from the regression/classification models in TensorFlows examples.

We’ll make some ideal clustered data using scikit-learn.

from sklearn import datasets

Xi, yi = datasets.make_blobs(500, random_state=1111)
plt.scatter(*Xi.T, c='k', lw=0);
`




Now we want to divide these in to clusters. We will use the objective used by Bottou & Bengio 1995,

$$ E(w) = \sum_i \min_k (x_i - w_k)^2. $$

Or in words: the sum of the (squared) distances to the closest prototype $ w_k $.

For our implementation, first we need to put in data. Secondly, we need to define the variables we want to optimize. In this case we want the prototypes $ w $.

import tensorflow as tf

# Number of clusters
k = 3

# Input
Xi = Xi.astype(np.float32)
X = tf.placeholder_with_default(Xi, Xi.shape, name='data')

# Model variables, initiate randomly from data
idx = np.random.choice(range(Xi.shape[0]), k, replace=False)
Wi = Xi[idx].copy()
W = tf.Variable(Wi, name='prototypes')


Now we only need to write out how the objective function value is calculated from the data and the variables.

# Reshape data and prototypes tensors
Xc = tf.concat(2, k * [tf.expand_dims(X, 2)])
Wc = tf.transpose(tf.expand_dims(W, 2), [2, 1, 0])

# Define objective of model

distances = tf.reduce_sum((Xc - Wc) ** 2, 1)

cost = tf.reduce_sum(tf.reduce_min(distances, 1))


After the tensors have been formatted correctly this is very simple. We just calculate the distances from each point to each prototype. Then we sum the smallest of these per data-point.

The final value in cost is represented in such a way that TensorFlow automatically can figure out the gradients for minimizing it with respect to the variables in W.

# Make an object which will minimize cost w.r.t. W
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost)


This is the nice thing with TensorFlow, you don’t need to care about finding gradients for objective functions.

sess = tf.Session()

sess.run(tf.initialize_all_variables())

for i in range(100 + 1):
    sess.run(optimizer)
    if i % 10 == 0:
        print('Iteration {}, cost {}'.format(i, sess.run(cost)))

Iteration 0, cost 21533.69140625
Iteration 10, cost 950.78369140625
Iteration 20, cost 940.453369140625
Iteration 30, cost 940.4503784179688
Iteration 40, cost 940.450439453125
Iteration 50, cost 940.450439453125
Iteration 60, cost 940.450439453125
Iteration 70, cost 940.450439453125
Iteration 80, cost 940.450439453125
Iteration 90, cost 940.450439453125
Iteration 100, cost 940.450439453125


Once we’ve optimized the W we get the number for them. Then we can make cluster labels for the data points by looking at which prototype $ w_k $ is closest to a data point.

W_learned = sess.run(W)
y = sess.run(tf.arg_min(distances, 1))

sess.close()

plt.scatter(*Xi.T, c=y, lw=0, cmap=cm.Greys_r, vmax=k + 0.5, label='data');
plt.scatter(*W_learned.T, c='none', s=100, edgecolor='r', lw=2, label='prototypes');
plt.legend(scatterpoints=3);



This is also available in notebook form here: https://gist.github.com/vals/a01a37b14c4918df7937b30d43327837

Some intuition about the GPLVM

Over the last year I’ve been learning about the statistical framework of Gaussian Processes. I’ve been thinking about writing about this for a while, but I feel like I would just decrease the signal-to-noise ratio given the fantastic resources available.

One of my favourite models in the Gaussian Process framework is the Gaussian Process Latent Variable Model (GPLVM). The point of the GPLVM, is that if we have multivariate data, we can simultaneously model every variable with a Gaussian Process. It turns out then that the same way that one can pick hyper parameters for a regression model, we can find optimal, latent, predictor variables for the data.

The leap from regression to the latent variable model is a bit drastic though, so I was trying to think of how we could get some more intuition about this. Let us look at a minimal example. We will use the excellent GPy package for this.

We generate a small number of points (5) on a sine curve, and perform Gaussian Process Regression on this.

import GPy

X = np.linspace(0, np.pi, 5)[:, None]
Y = np.sin(X)

kernel = GPy.kern.RBF(1)
m = GPy.models.GPRegression(X, Y, kernel=kernel)
m.optimize()
m.plot(plot_limits=(-0.1, 3.2));


Now we have a curve which optimally describes the 5 points. Imagine we have a stray 6th point. Say we did 6 measurements, but for one of the points we lost the measurement values. Where could this measurement have happened?

Well, we have the model of the curve describing the data. If we guess the measurement values for the 6th point, we can add that as an observation, and evaluate the likelihood of the regression when including the point.

Let us perform this by guessing some potential points in a 20 by 20 grid around or known observations.

N = 20
xx = np.linspace(0, np.pi, N)
yy = np.linspace(0, 1, N)

ll = []
xxi = []
yyi = []
for i in range(N):
    for j in range(N):
        x = xx[i]
        xxi.append(x)
        Xs = np.vstack((X, np.array([[x]])))
        y = yy[j]
        yyi.append(y)
        Ys = np.vstack((Y, np.array([[y]])))
        ms = GPy.models.GPRegression(Xs, Ys, kernel=kernel,
             noise_var=m.Gaussian_noise.variance)
        ll.append(ms.log_likelihood())

plt.scatter(xxi, yyi, c=ll, s=50, vmin=-1e4,
            edgecolor='none', marker='s',
            cmap=cm.Greys_r);
plt.colorbar(label='log likelihood')
m.plot(ax=plt.gca(), c='r', legend=False,
       plot_limits=(-0.1, 3.2));


Here every square is a position where we guess the 6th point could have been, colored by the log likelihood of the GP regression. We see that it is much more likely the 6th point came from close to the curve defiend by the 5 known points.

Now, imagine that we know that the y value of the 6th point was 0.4, then we can find an x value such that the likelihood is maximized.

x = 1.2
y = 0.4
Xs = np.vstack((X, np.array([[x]])))
Ys = np.vstack((Y, np.array([[y]])))
gplvm = GPy.models.GPLVM(Ys, 1, X=Xs, kernel=kernel)
gplvm.Gaussian_noise.variance = m.Gaussian_noise.variance
gplvm.fix()
gplvm.X[-1].unfix();

gplvm.optimize()

gplvm.plot(legend=False, plot_limits=(-0.1, 3.2));
plt.scatter([x], [y], c='r', s=40);
ax = plt.gca()
ax.arrow(x, y, gplvm.X[-1, 0] - x + 0.1, gplvm.Y[-1, 0] - y,
        head_width=0.03, head_length=0.1,
        fc='r', ec='r', lw=2, zorder=3);


This is done with gradient descent, and the optimum it finds will in this particular case depend on where we initialize the x value of the 6th point. But if we had more variables than the y value above, and we want to find the x-value which is optimal for all variables at the same time, it will end up being less ambiguous. And we can do this for every point in the model at the same time, not just one.

This is illustrated in this animation.



Here there are two variables and we are looking for x-values which best describes the two variables as independant Gaussian processes.

Observations needed to estimate standard deviation

I was curious as to practically how well standard deviation can be estimated at different numbers of observed samples. I thought the plot was interesting, so figured I’d share.

Here is the Julia code I used:

using Distributions
using Gadfly

function std_of_sample(n)
    std(rand(Normal(2, 5), n))
end;

nmax = 200
n_observations = collect(0:nmax * 100) % nmax + 2;

p = plot(
    x=n, y=map(std_of_sample, n_observations),

    Guide.XLabel("Number of observations"),
    Guide.YLabel("Estimated σ"),
    Theme(default_color=color("black")),
    Coord.Cartesian(ymin=-3, ymax=13, xmin=-20, xmax=220),
    Guide.xticks(ticks=[0, 10, 25, 50, 100, 200])
);

draw(PNG(640px, 480px), p)


We just sample a different number of samples from a normal distribution with mean 2 and known standard deviation 5.