Swedish School Fires by Gaussian Process Regression

My former colleague Mikael Huss recently published an interesting data set on Kaggle which indicates the school fires reported in Swedish municipalities over last few years. This is coupled with some predictors for the towns over the same years, with aim to see if some predictors associate with an increase in school fires.

The data set is published along with some initial analysis. I was surprised at the poor performance when normalising the fire counts by population, to obtain a frequency.

Partially I got curious about why that was, but also was wondering if I could formulate a nice non-linar model for the data using Gaussian Process Regression.

I registered and downloaded the data from Kaggle. First I read in the fire cases.

%pylab inline
import pandas as pd
import seaborn as sns
Populating the interactive namespace from numpy and matplotlib
fires = pd.read_csv('school_fire_cases_1998_2014.csv')
sns.set_style('ticks')
sns.set_color_codes()
sns.set_context('talk')

Now let's see how these counts relates to population in municipalities

figsize(6, 4)
plt.loglog()
plt.scatter(fires.Population, fires.Cases, c='k', s=30);
sns.despine()
plt.xlabel('Population'), plt.ylabel('Cases');

There seem to be a relation, though it looks a bit binary, more clear at the high end where there are around 10 cases. These seem like a bit too few cases per town to beat out stochastic noise. To picture the underlying generative process a bit better, we ignore the time-dimension we have, and just consider the total number of cases in each town in this region.

sum_fires = pd.DataFrame(fires.groupby('Municipality').sum()['Cases']) \
.join(fires.groupby('Municipality').max()['Population'])
plt.loglog()
plt.scatter(sum_fires.Population, sum_fires.Cases, c='k', s=30);
sns.despine()
plt.xlabel('Population'), plt.ylabel('Cases');

Now the relation between the cases of school fires and the population of a town is much clearer!

This indicates that we should be able to normalise the cases succesfully in to a frequency. However, I would prefer putting the population as part of the model, which we will get into later.

First, let us look at our predictors per town. These are referred to as Key Performance Indicators in the data set.

indicators = pd.read_csv('simplified_municipality_indicators.csv')
indicators.columns
Index(['code', 'name', 'medianIncome', 'youthUnemployment2010',
       'youthUnemployment2013', 'unemployment2010', 'unemployment2013',
       'unemploymentChange', 'reportedCrime', 'populationChange',
       'hasEducation', 'asylumCosts', 'urbanDegree', 'satisfactionInfluence',
       'satisfactionGeneral', 'satisfactionElderlyCare', 'foreignBorn',
       'reportedCrimeVandalism', 'youngUnskilled', 'latitude', 'longitude',
       'population', 'populationShare65plus', 'municipalityType',
       'municipalityTypeBroad', 'refugees', 'rentalApartments', 'governing',
       'fokusRanking', 'foretagsklimatRanking', 'cars', 'motorcycles',
       'tractors', 'snowmobiles'],
      dtype='object')

Now, I have been a bit lazy, and I haven't actually read the documentation for these. I don't know the units for these, and which ones are normalised for population. Or which ones it would make sense to do that with.

To quickly get a hang of this, I like to use a trick from computational biology, which is to look at the correlation matrix of these variables, ordered by linkages based on similarity between the correlations.

cmp = sns.clustermap(indicators.corr(method='spearman'), yticklabels=False);

Generally, I don't like putting too much weight in to these linkage clusterings. But we do see two fairly distinct groups. One contains the population variable, so likely the other members of this group are counts rather than frequencies.

(This also shows some entertaining things, like the snowmobiles strongly correlating with latitude.)

Now, we want to use some of these to predict the number of cases of school fire in a given town. Say that \( C \) is the cases of school fire, and that \( P \) is the population. Following this, we let the remaining predictors be \( x_1, x_2, \) etc. The clear relation between \( C \) and \( P \) we noted above, and it would seem we could invsetigate \( \frac{C}{P} \), assuming a linear relation. But we could also behave a bit more agnoistically about this relation.

$$
\log C = f(\log P) + g(\log x_1, \log x_2, \ldots) + \varepsilon
$$

I don't have a particularly good argument for doing all this on a log scale. It is just that I've found it to be generally useful when there are several orders of magnitude of dynamic range.

Ideally here, the effect of population size should be captured by the function \( f \).

An issue though, is the predictors which correlate with population. Since we want all the effects of Population size to be captured in \( f \), we would need to remove this effect from the other predictors. Since we know that Population will predict the Cases quite well, finding one of these to be predictive will most likely just illustrate confounding with the Popultion effect. I can't really think of a clever way of dealing with this (except "normalising" them, but I want to avoid that strategy here, as the point is to try to be non-parametric). For now, we deal with this by not picking predictors from the cluster that correlates with population.

What we actually want to learn is, which of the predictors in \( g \) is more important?

By performing Gaussian Process Regression, we can investigate the functions \( f \) and \( g \). This is a Bayesian technique, where we first provide a prior distribution for each of the functions. If the prior distribution is such that the covariance of function values are parametrised by a covariance function between the predictor values, these distributions will be Gaussian Processes. By selecting your covariance function, you state what kind of functions you would expect if you hadn't observed any data. Once you have observations of data, you can find the posterior distribution of function values for \( f \) and \( g \).

For my research, I use Gaussian Process Regression and other related models to investigate the dynamics of gene expression.

In our case, we will use the Squared Exponential covariance function for \( f \), and an Automatic Relevence Determination version of the SE covariance function for \( g \). The ARD will allow us to find which predictor variables affect predictions from the model, which should relate to their importance.

To model the data as described above, we will use the new Gaussian Process package GPflow. This package implements several Gaussian Process models, and components for these, using the TensorFlow library.

import GPflow

For simplicity, let's combine everything in to one DataFrame.

data = sum_fires.join(indicators.set_index('name'))
data.columns
Index(['Cases', 'Population', 'code', 'medianIncome', 'youthUnemployment2010',
       'youthUnemployment2013', 'unemployment2010', 'unemployment2013',
       'unemploymentChange', 'reportedCrime', 'populationChange',
       'hasEducation', 'asylumCosts', 'urbanDegree', 'satisfactionInfluence',
       'satisfactionGeneral', 'satisfactionElderlyCare', 'foreignBorn',
       'reportedCrimeVandalism', 'youngUnskilled', 'latitude', 'longitude',
       'population', 'populationShare65plus', 'municipalityType',
       'municipalityTypeBroad', 'refugees', 'rentalApartments', 'governing',
       'fokusRanking', 'foretagsklimatRanking', 'cars', 'motorcycles',
       'tractors', 'snowmobiles'],
      dtype='object')

Now we prepare the data for Gaussian Process Regression. We log-scale both the dependent variable, and the predictors (taking care to deal with infinities.)

We also pick some predictors to work with. There are a large number of variables, and with the limited data we have, using all of them will probably just model very small effects which we wouldn't notice very well. Also recall that we don't want to use any of the ones correlating with population!

Y = data[['Cases']] \
        .replace(0, 1) \
        .pipe(np.log10) \
        .as_matrix()

model_kpis = ['Population', 'refugees', 'unemployment2013',
              'longitude', 'asylumCosts', 'satisfactionGeneral',
              'cars', 'fokusRanking', 'youngUnskilled', 'tractors']

X = data[model_kpis] \
        .replace(0, 1) \
        .apply(pd.to_numeric, errors='coerce') \
        .pipe(np.log10) \
        .replace(np.nan, 0) \
        .as_matrix()

A thing which makes these models so flexible, is that the way you express the type of expected functions and the relation between predictors in terms of the covariance function. Here we will formulate the relation by saying that

$$
F(X) = f(\log P) + g(\log x_1, \log x_2, \ldots),
$$
$$
k_F(\cdot, \cdot) = k_f(\cdot, \cdot) + k_g(\cdot, \cdot).
$$

(Here \( k_f(\cdot, \cdot) \) is a covariance function for \( f \) which only acts on the first column of \( X \). And \( k_g(\cdot, \cdot) \) acts on the remaining columns.)

Above, we put the Population in the first column of X, and the remaining columns contain the predicors which we want to use with the ARD SE covariance function.

N_m = len(model_kpis) - 1
kernel = GPflow.kernels.RBF(1, active_dims=[0]) + \
         GPflow.kernels.RBF(N_m , active_dims=list(range(1, N_m + 1)), ARD=True)

Now we are ready to define the model and provide the observed data.

m = GPflow.gpr.GPR(X, Y, kern=kernel)

The way investigate this model, is by selecting hyperparameters for the priors. Hyperparameters are parameters of the covariance functions which dictate features of the functions we are expected to see if we have not observations, which in turn affect the kind of posterior functions we would expect to see.

A nice feature with Gaussian Process models is that in many cases these can be optimized by finding the optimal likelihood of the model. This is also where the power of TensorFlow comes in. Internally in GPflow, you only need to define your objective function, and gradients will be calculated for you.

m.optimize()
compiling tensorflow function...
done
optimization terminated, setting model state





      fun: 47.661120041075691
 hess_inv: <13x13 LbfgsInvHessProduct with dtype=float64>
      jac: array([  2.77119459e-05,   3.10312238e-11,  -5.20177179e-11,
        -1.53390706e-11,  -2.40598480e-10,   7.86837771e-10,
         3.27595726e-04,  -3.28255760e-04,  -3.12000506e-10,
        -1.50492355e-04,   2.20556318e-04,   5.18355739e-05,
        -2.45822687e-03])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 154
      nit: 121
   status: 0
  success: True
        x: array([  7.83550523e+00,   4.15028062e+03,   3.49511056e+03,
         8.85917870e+03,   1.25124606e+04,   1.52042456e+03,
         1.27664713e+00,   8.35964994e-02,   5.61225444e+03,
        -1.24241599e+00,   1.31391936e+00,   2.30057289e+00,
        -2.57666517e+00])

After this optimization, we can investigate the model.

m
Namevaluespriorconstraint
model.kern.rbf_2.lengthscales[ 7.8359016 inf inf inf inf inf
1.52270404 0.73581972 inf]
None+ve
model.kern.rbf_2.variance[ 0.25362403]None+ve
model.kern.rbf_1.lengthscales[ 1.55196403]None+ve
model.kern.rbf_1.variance[ 2.39606716]None+ve
model.likelihood.variance[ 0.07327667]None+ve

The lengthscale parameters indicate how "dynamic" the function is with respect to the predictor variable. A long lengthscale means that longer distances in the predictor are needed before the function values changes. While a short one means the function values chanes a lot, and thus are sensitive with resposect to the predictor.

The trick of the ARD, is to allow a separate lengthscale for each predictor, but with shares noise. A predictor with a short lengthscale (in the optimal likelihood) will cause many changes as you change the predictor.

This means we can use the inverse of the ARD lengthscales as a measure of variable importance.

pdict = m.kern.get_parameter_dict()
pdict
{'model.kern.rbf_1.lengthscales': array([ 1.55196403]),
 'model.kern.rbf_1.variance': array([ 2.39606716]),
 'model.kern.rbf_2.lengthscales': array([ 7.8359016 ,         inf,         inf,         inf,         inf,
                inf,  1.52270404,  0.73581972,         inf]),
 'model.kern.rbf_2.variance': array([ 0.25362403])}
sensitivity = pdict['model.kern.rbf_2.variance'] / pdict['model.kern.rbf_2.lengthscales'] ** 2
figsize(6, 4)
plt.barh(np.arange(1, N_m + 1) - 0.4, sensitivity, color='k');
plt.yticks(np.arange(1, N_m + 1), model_kpis[1:], rotation=0);
plt.xlabel('Sensitivity');

It is possible illustrate the population effect by itself, by making a Gaussian Process Regression model using only this part of the covariance function, which was optimized for the full model.

i = 8

plt.loglog()

m_pop = GPflow.gpr.GPR(X, Y, kern=m.kern.rbf_1)
XX = np.linspace(3.5, 6.)[:, None]
YY_m, YY_v = m_pop.predict_f(XX)

plt.scatter(10 ** X[:, 0], 10 ** Y, c=10 ** X[:, i], s=40);
sns.despine()
plt.colorbar(label=model_kpis[i])

plt.plot(10 ** XX, 10 ** YY_m, c='r', lw=3, label='Population effect')
plt.plot(10 ** XX, 10 ** (YY_m + 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3)
plt.plot(10 ** XX, 10 ** (YY_m - 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3);

plt.legend(loc='upper left')

plt.xlabel('Population'); plt.ylabel('Cases');

It is still not perfectly visible how a predictor explains school fire cases when correcting for population. As a visualization, we can remove the mean population effect from the cases, and thus look at how the residuals relate to the predictor of choice.

Y_p_m, Y_p_v = m_pop.predict_f(X)
plt.loglog()
plt.scatter(10 ** X[:, 0], 10 ** (Y - Y_p_m), c=10 ** X[:, i], s=40);
sns.despine()
plt.colorbar(label=model_kpis[i])

plt.plot(10 ** XX, 10 ** (0 * YY_m), c='r', lw=3)
plt.plot(10 ** XX, 10 ** (2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3)
plt.plot(10 ** XX, 10 ** (- 2 * np.sqrt(YY_v)), c='r', linestyle='--', lw=3);

plt.xlabel('Population'); plt.ylabel('Residual of Cases');
plt.loglog()
plt.scatter(10 ** X[:, i], 10 ** (Y - Y_p_m), c='k', s=40);
sns.despine()

plt.xlabel(model_kpis[i]); plt.ylabel('Residual of Cases');
data.sort_values(model_kpis[i], ascending=False).head()[['Cases', model_kpis[i]]]
Cases youngUnskilled
Municipality
Perstorp 4 19.1
Haparanda 2 17.1
Södertälje 44 16.9
Skinnskatteberg 1 16.6
Gnesta 5 16.0

The variability of a spoonful of coffee

I like freshly ground coffee, especially made with a French press. The common advice regarding coffee is to dose by weight, due to measurement variability regarding beans in spoons. I was curious how large this variability is.



The different sizes and shapes of the coffee beans makes the volume of the packed beans not correspond to the mass of the coffee, which will be ground to powder.

Different coffee beans are also differently dense, here I’m looking at Brazil Santos coffee.

I made 10 measurements, where I took a level spoon, weighed the contents, then replaced them in my jar of beans before I repeated the measurement.



To quantify the variability, I made a model in Stan. Here a spoonful is modeled normally distributed with a standard deviation parameter which correspond to the between-spoonful variability.

import pystan

code = '''
data {
    int<lower> N;
    real<lower> y[N];
}
parameters {
    real<lower> mu;
    real<lower> sigma;
}
model {
    // Priors
    mu ~ uniform(0, 25);
    sigma ~ uniform(0, 25);

    // Likelihood
    y ~ normal(mu, sigma);
}
'''

data = {
    'N': 10,
    'y': [13.54, 14.22, 13.63, 13.20, 13.32,
          13.23, 13.98, 13.66, 13.51, 14.18]
}

fit = pystan.stan(model_code=code, data=data)

fit

Inference for Stan model: anon_model_393a8d8deb17adb090b969fb1275a378.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu     13.64  2.3e-3   0.14  13.34  13.55  13.64  13.73  13.91   4000    1.0
sigma   0.43  2.1e-3   0.13   0.26   0.35   0.41   0.49   0.76   4000    1.0
lp__    5.14    0.02   1.13   2.27   4.74   5.47   5.91   6.22   4000    1.0

Samples were drawn using NUTS(diag_e) at Mon Jul 11 20:54:38 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).


The y values in the data dictionaries are the weight measurements.



In the end this means that the confidence interval of the spoonful of coffee is between 12.8g and 14.5g.

Writing and Citing with Google Docs and Paperpile

Earlier this day Mick Watson posted a brief post about his scientific writing workflow using Mendeley and Word. I’ve been happily using Paperpile and Google Docs for similar goals, so though I’d post that as a comparison.

First, you just start a new doc for your paper. You write until you need to cite something. Then you hit ctrl+alt+shift+p and search for papers. Initially in your curated collection, but also in many databases on the web such as PubMed.

As you add references, they both get added to your own Paperpile collection, and stored in the Google Doc itself (so all collaborators can see them and manage citations).

Once you have a lot of citations, you pick a citation style, click Format citations in the Paperpile menu.

And you end up with a reference list styled to your liking.

You can then click each citation and edit it in terms of changing and adding/removing references.

In Mick’s post he makes a group in Mendeley to keep track of papers related to a given manuscript. You can also do this in Paperpile, but in practice i think it’s easier just to look at the documents individual manager (on the right in picture above).

Streaming RNA-seq data from ENA

For many of the projects I’m working on for my PhD I use published data. Up until now my strategy has been to download all read files of an experiment from ENA, then process them all with e.g. Salmon to get expression values. This feels a bit silly because sequencing read files are on the order of gigabytes in size, while a csv file of expression values is a few megabytes. In fact, currently my data directory has almost 50 terabytes of public data in it.

The other day I saw this gist from Mike Love. Supposedly it gives you the URL of a ftp hosted fastq from the ENA/SRA accession number of it. This is great, because you can just use curl on the URL, which by defualt streams a file in chunks, to fetch the contents of a sequencing read data set. We only need to know the name of it.

I made a small Bash script for streaming a given accession id.

#/bin/bash

fastq="$1"

prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1)

dir1=${accession:0:6}

a_len=${#accession}
if (( $a_len == 9 )); then
    dir2="";
elif (( $a_len == 10 )); then
    dir2=00${accession:9:1};
elif (( $a_len == 11)); then
    dir2=0${accession:9:2};
else
    dir2=${accession:9:3};
fi

url=$prefix/$dir1/$dir2/$accession/$fastq.gz

curl --keepalive-time 4 -s $url | zcat


We call this file stream_ena. You need to know the id of accession, and whether the file you want to look at is part of a pair. But then you have instant access to the contents of any published sequencing data-set!

If we want to look at some single-end data, we can just do

$ ./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG


If we want to quantify this data set with salmon, we can now simply run

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782


This will stream the entire contents of accession in to salmon directly from ENA without storing anything on disk, and quantified expression will be saved in the directory SRR3185782.

For a dataset with paired reads we would do for example

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127


Many sequencing tools supports streaming out box don care whether you are your disk or server online.

Say for example we want to look at 5 random reads:

$ seqtk sample -s $(date +%s) <(./stream_ena SRR1274127_1.fastq) 0.001 | head -n 20
@SRR1274127.753 753/1
TTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACGAGGGTTTATTTTTTTGGGTAGAACTGGAATAAAAGCT
+
BCCFFFFFHHHHHJJJJJJJJHIJIJJJIICGHIJGIJJIJJJJJJJJJJJJJJJJJIJJJGHHFFFD;ADDDDEEDDDD>&2>?ACDDDCDDDDDDDDDC
@SRR1274127.1464 1464/1
AATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCT
+
CCCFFFFFHHHHHJJIIHIIJJIJJIGJGJJJJJJJJJJJJJJJJJJJJJJJJJJFIIJJJJJJIJJIIJJIIIJJHHHHFFDFFFFEEDDDDDDB@BBB9
@SRR1274127.1672 1672/1
CCCTACTACTATCTCGCACCTGAAACACCCTAACATGACTAACACCCTTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTT
+
@@@DDDD>FABBD@BDFB:FE;+2+2<)):CFB?<3?4?B*99???FFFE98)>F;778)7(5@E1CF?DDB#############################
@SRR1274127.2188 2188/1
GATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGGGTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGT
+
@?@DFDEDFBHFHGGG@DCHEEHG@FEHG@HGGGGIID=FDHGGHGCG?8?CFHHGHGAHEACHA@E<D@EFE>??C@@CD@B@AABCCC@BB@BCB9992
@SRR1274127.4127 4127/1
CTTATACTAGTATCCTTAATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTATCTATAAACCTAGC
+
??@DDFFFDHFFDGHGGGGGHEGGJEHIIAFDGIIGGIJGIGHHIJJIAFBGIGHJGEGIGGJGGGI=EGEEECHAB<?ACB?ABCCDDDDDDCCDDCDCD


Or if we want to check the quality of a dataset without wasting space downloading it:

$ ./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin


Of course there are caveats. You can’t just blindly put any reads in to salmon and get correct expression. You need to read the methods section of the related study to see what parameters use. The data might need some might preprocessing before it is useful. It is also very common that files uploaded to ENA should be merged before being input to processing tools. But this streaming approach greatly ease storage burden burden from working with public data.

K-means in TensorFlow

I have been getting a bit interested in the new fancy TensorFlow package. As a little exercise to figure out roughly how to use it, I figured I could implement a simple model. I chose to implement k-means clustering, since it’s very simple, but a bit different from the regression/classification models in TensorFlows examples.

We’ll make some ideal clustered data using scikit-learn.

from sklearn import datasets

Xi, yi = datasets.make_blobs(500, random_state=1111)
plt.scatter(*Xi.T, c='k', lw=0);
`




Now we want to divide these in to clusters. We will use the objective used by Bottou & Bengio 1995,

$$ E(w) = \sum_i \min_k (x_i - w_k)^2. $$

Or in words: the sum of the (squared) distances to the closest prototype $ w_k $.

For our implementation, first we need to put in data. Secondly, we need to define the variables we want to optimize. In this case we want the prototypes $ w $.

import tensorflow as tf

# Number of clusters
k = 3

# Input
Xi = Xi.astype(np.float32)
X = tf.placeholder_with_default(Xi, Xi.shape, name='data')

# Model variables, initiate randomly from data
idx = np.random.choice(range(Xi.shape[0]), k, replace=False)
Wi = Xi[idx].copy()
W = tf.Variable(Wi, name='prototypes')


Now we only need to write out how the objective function value is calculated from the data and the variables.

# Reshape data and prototypes tensors
Xc = tf.concat(2, k * [tf.expand_dims(X, 2)])
Wc = tf.transpose(tf.expand_dims(W, 2), [2, 1, 0])

# Define objective of model

distances = tf.reduce_sum((Xc - Wc) ** 2, 1)

cost = tf.reduce_sum(tf.reduce_min(distances, 1))


After the tensors have been formatted correctly this is very simple. We just calculate the distances from each point to each prototype. Then we sum the smallest of these per data-point.

The final value in cost is represented in such a way that TensorFlow automatically can figure out the gradients for minimizing it with respect to the variables in W.

# Make an object which will minimize cost w.r.t. W
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost)


This is the nice thing with TensorFlow, you don’t need to care about finding gradients for objective functions.

sess = tf.Session()

sess.run(tf.initialize_all_variables())

for i in range(100 + 1):
    sess.run(optimizer)
    if i % 10 == 0:
        print('Iteration {}, cost {}'.format(i, sess.run(cost)))

Iteration 0, cost 21533.69140625
Iteration 10, cost 950.78369140625
Iteration 20, cost 940.453369140625
Iteration 30, cost 940.4503784179688
Iteration 40, cost 940.450439453125
Iteration 50, cost 940.450439453125
Iteration 60, cost 940.450439453125
Iteration 70, cost 940.450439453125
Iteration 80, cost 940.450439453125
Iteration 90, cost 940.450439453125
Iteration 100, cost 940.450439453125


Once we’ve optimized the W we get the number for them. Then we can make cluster labels for the data points by looking at which prototype $ w_k $ is closest to a data point.

W_learned = sess.run(W)
y = sess.run(tf.arg_min(distances, 1))

sess.close()

plt.scatter(*Xi.T, c=y, lw=0, cmap=cm.Greys_r, vmax=k + 0.5, label='data');
plt.scatter(*W_learned.T, c='none', s=100, edgecolor='r', lw=2, label='prototypes');
plt.legend(scatterpoints=3);



This is also available in notebook form here: https://gist.github.com/vals/a01a37b14c4918df7937b30d43327837

Some intuition about the GPLVM

Over the last year I’ve been learning about the statistical framework of Gaussian Processes. I’ve been thinking about writing about this for a while, but I feel like I would just decrease the signal-to-noise ratio given the fantastic resources available.

One of my favourite models in the Gaussian Process framework is the Gaussian Process Latent Variable Model (GPLVM). The point of the GPLVM, is that if we have multivariate data, we can simultaneously model every variable with a Gaussian Process. It turns out then that the same way that one can pick hyper parameters for a regression model, we can find optimal, latent, predictor variables for the data.

The leap from regression to the latent variable model is a bit drastic though, so I was trying to think of how we could get some more intuition about this. Let us look at a minimal example. We will use the excellent GPy package for this.

We generate a small number of points (5) on a sine curve, and perform Gaussian Process Regression on this.

import GPy

X = np.linspace(0, np.pi, 5)[:, None]
Y = np.sin(X)

kernel = GPy.kern.RBF(1)
m = GPy.models.GPRegression(X, Y, kernel=kernel)
m.optimize()
m.plot(plot_limits=(-0.1, 3.2));


Now we have a curve which optimally describes the 5 points. Imagine we have a stray 6th point. Say we did 6 measurements, but for one of the points we lost the measurement values. Where could this measurement have happened?

Well, we have the model of the curve describing the data. If we guess the measurement values for the 6th point, we can add that as an observation, and evaluate the likelihood of the regression when including the point.

Let us perform this by guessing some potential points in a 20 by 20 grid around or known observations.

N = 20
xx = np.linspace(0, np.pi, N)
yy = np.linspace(0, 1, N)

ll = []
xxi = []
yyi = []
for i in range(N):
    for j in range(N):
        x = xx[i]
        xxi.append(x)
        Xs = np.vstack((X, np.array([[x]])))
        y = yy[j]
        yyi.append(y)
        Ys = np.vstack((Y, np.array([[y]])))
        ms = GPy.models.GPRegression(Xs, Ys, kernel=kernel,
             noise_var=m.Gaussian_noise.variance)
        ll.append(ms.log_likelihood())

plt.scatter(xxi, yyi, c=ll, s=50, vmin=-1e4,
            edgecolor='none', marker='s',
            cmap=cm.Greys_r);
plt.colorbar(label='log likelihood')
m.plot(ax=plt.gca(), c='r', legend=False,
       plot_limits=(-0.1, 3.2));


Here every square is a position where we guess the 6th point could have been, colored by the log likelihood of the GP regression. We see that it is much more likely the 6th point came from close to the curve defiend by the 5 known points.

Now, imagine that we know that the y value of the 6th point was 0.4, then we can find an x value such that the likelihood is maximized.

x = 1.2
y = 0.4
Xs = np.vstack((X, np.array([[x]])))
Ys = np.vstack((Y, np.array([[y]])))
gplvm = GPy.models.GPLVM(Ys, 1, X=Xs, kernel=kernel)
gplvm.Gaussian_noise.variance = m.Gaussian_noise.variance
gplvm.fix()
gplvm.X[-1].unfix();

gplvm.optimize()

gplvm.plot(legend=False, plot_limits=(-0.1, 3.2));
plt.scatter([x], [y], c='r', s=40);
ax = plt.gca()
ax.arrow(x, y, gplvm.X[-1, 0] - x + 0.1, gplvm.Y[-1, 0] - y,
        head_width=0.03, head_length=0.1,
        fc='r', ec='r', lw=2, zorder=3);


This is done with gradient descent, and the optimum it finds will in this particular case depend on where we initialize the x value of the 6th point. But if we had more variables than the y value above, and we want to find the x-value which is optimal for all variables at the same time, it will end up being less ambiguous. And we can do this for every point in the model at the same time, not just one.

This is illustrated in this animation.



Here there are two variables and we are looking for x-values which best describes the two variables as independant Gaussian processes.

Observations needed to estimate standard deviation

I was curious as to practically how well standard deviation can be estimated at different numbers of observed samples. I thought the plot was interesting, so figured I’d share.

Here is the Julia code I used:

using Distributions
using Gadfly

function std_of_sample(n)
    std(rand(Normal(2, 5), n))
end;

nmax = 200
n_observations = collect(0:nmax * 100) % nmax + 2;

p = plot(
    x=n, y=map(std_of_sample, n_observations),

    Guide.XLabel("Number of observations"),
    Guide.YLabel("Estimated σ"),
    Theme(default_color=color("black")),
    Coord.Cartesian(ymin=-3, ymax=13, xmin=-20, xmax=220),
    Guide.xticks(ticks=[0, 10, 25, 50, 100, 200])
);

draw(PNG(640px, 480px), p)


We just sample a different number of samples from a normal distribution with mean 2 and known standard deviation 5.

Timecourse analysis with Sleuth

An extremely interesting application of RNA-sequencing analysis is to study samples over a time series. This allows you to identify patterns of expression over some response to a stimuli or developmental progression.

While Sleuth together with the Kallisto from the Pachter lab makes it easy to perform differential expression analysis in the two-condition case of an RNA-seq experiment, there is still some confusion about how to perform DE over a time series. I thought it would be useful to write an example text of how to detect DE transcripts over a time course with Sleuth.

The first problem is to define what we mean by differentially expressed in the context of a time series. For a treatment-vs-control type experiment it is simple: What genes are higher or lower expressed in the treatment condition compared to the control? The first generalization which comes to mind would be to find genes whose level increase or decrease with the time points. This means finding correlation with time points and boils down to linear regression of expression with time points. However, this will mask potentially interesting genes which could for example peak in the middle of the time series. What we want is to find if the expression of a gene follows a general pattern to a higher degree than just noise.

In a linear modelling framework, such as Sleuth, a common solution for this is to use natural splines. Given a number of degrees of freedom for a natural spline model, knots will be placed along the quantiles of the observations of the time axis, which will define basis polynomials with local support.

$$ Expression \sim \beta_0 + \sum_{i=1}^3 \beta_i B_{i}(Timepoint) + \varepsilon $$

The idea is now to compare the model of a genes expression which includes the polynomial terms, with a model that only includes the noise term.

$$ Expression \sim \beta_0 + \varepsilon $$

In the latest versions of Sleuth, there is a likelihood ratio test implemented for this sort of comparison of models. This is how time course analysis is implemented both in Ballgown and Monocle. The benefit of Sleuth is the ability of using the quantification bootstraps to attempt to separate technical variance from biological variance.

Let’s get to the actual practical example. We will be using data from the publication “High-resolution mapping of transcriptional dynamics across tissue development reveals a stable mRNA-tRNA interface”. In the paper they perform RNA-sequencing on a developmental time series of mice, from both brain and liver, at six time points. For the sake of this example, let us just focus on the brain samples.

We load Sleuth and define the locations for the Kallisto outputs.

library(sleuth)
sample_id <- list('do2174', 'do2175', 'do2176', 'do2177',
'do2183', 'do2184', 'do2185', 'do2186', 'do2187', 'do2188',
'do2189', 'do2190')
paths <- list('kallisto/do2174_RNAseq_brain_mmuBL6e15.5_CRI01p_kallisto_out',
'kallisto/do2175_RNAseq_brain_mmuBL6e18.5_CRI01p_kallisto_out',
'kallisto/do2176_RNAseq_brain_mmuBL6P0.5_CRI01p_kallisto_out',
'kallisto/do2177_RNAseq_brain_mmuBL6P4_CRI01p_kallisto_out',
'kallisto/do2183_RNAseq_brain_mmuBL6e15.5_CRI01p_kallisto_out',
'kallisto/do2184_RNAseq_brain_mmuBL6e18.5_CRI01p_kallisto_out',
'kallisto/do2185_RNAseq_brain_mmuBL6P0.5_CRI01p_kallisto_out',
'kallisto/do2186_RNAseq_brain_mmuBL6P4_CRI01p_kallisto_out',
'kallisto/do2187_RNAseq_brain_mmuBL6P22_CRI01p_kallisto_out',
'kallisto/do2188_RNAseq_brain_mmuBL6P29_CRI01p_kallisto_out',
'kallisto/do2189_RNAseq_brain_mmuBL6P22_CRI01p_kallisto_out',
'kallisto/do2190_RNAseq_brain_mmuBL6P29_CRI01p_kallisto_out')
names(paths) <- sample_id

s2c <- read.table("brain_times.tsv", header = TRUE, stringsAsFactors = FALSE)
s2c <- dplyr::mutate(s2c, path = paths)
s2c[] <- lapply(s2c, as.character)
s2c


In order to pass the spline based model to Sleuth, we need to make a design matrix. This can be done by using the splines library and pass a formula using the splines by the ns() functions to model.matrix.

library(splines)
day <- s2c$day
full_design <- model.matrix(formula(~ ns(day, df = 4)))
full_design


The df parameter essentially governs how smoothed the expression patterns will be over the time points. I picked 4 because it’s the default in Ballgown. In Monocle the default is 3, but it seems this smooths the expression too much. Lower values for the df parameters will capture less “dynamics”, but will have more statistical power. 

As described in the Sleuth vignette, it is quite handy to have the external gene name which is associated to every transcript, to quickly see roughly what gene we are looking at when we see a differentially expressed transcript.

library("biomaRt")
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "www.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id",
"ensembl_gene_id",
"external_gene_name"),
  mart = mart)
t2g <- dplyr::rename(t2g,
target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id,
ext_gene = external_gene_name)


Now we are read to start performing the Sleuth analysis

so <- sleuth_prep(s2c, full_model = full_design, target_mapping = t2g)
so <- sleuth_fit(so)
so <- sleuth_fit(so, formula = ~ 1, fit_name = "reduced")
so <- sleuth_lrt(so, "reduced", "full")



Sleuth has an interactive Shiny application associated with it to browse the results of the analysis. It’s not mandatory to use it though, and many of the handy features in it mostly makes sense for the treatment-vs-control style of experiments.

plot_qq(so, test = 'reduced:full', test_type = 'lrt', sig_level = 0.05)


We can extract the results of the likelihood ratio test as a dataframe to look at closer

lrt_results <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
table(lrt_results[,"qval"] < 0.05)

FALSE  TRUE 
49523  3083


This tells us that we find 3083 transcripts which have an expression pattern ove the time course which is significantly better explained by a natural spline over the time than by just random noise. Let’s have a look at the very top transcripts.

lrt_results %>% head(n = 20) %>% dplyr::select(target_id, qval, ens_gene, ext_gene)


I don’t know enough about brain development to say if these genes make any sense in the context. We can also plot a clustered heatmap of these top genes.

plot_transcript_heatmap(so, head(lrt_results, n = 20)$target_id, 'est_counts')


Unfortunately there doesn’t seem to be a way to rename the axis labels to some other information, or to enforce ordering the samples by time point. Let’s also plot the expression of a differntailly expressed transcript to see what it looks like.

tmp <- so$obs_raw %>% dplyr::filter(target_id == 'ENSMUST00000145067')
tmp <- dplyr::full_join(so$sample_to_covariates, tmp, by = 'sample')
tmp
tmp <- transform(tmp, day = as.numeric(day))
ggplot(tmp, aes(x=day, y=est_counts)) 
  + geom_point(shape=1) 
  + geom_smooth(method = loess)

Let me guess where you’re from

Let me guess where you’re from

On the website letmeguesswhereyourefrom.com you can enter a name, and an algorithm will print five countries the name seems come from. Try it out!

I made this web application as an exercise in applying machine learning, in this post I will describe

  • How I got data,
  • How I made the model,
  • How I optimized the model for use in a web application, and,
  • How I made the web application.

Getting training data

The first problem is where to get the data from. For the model to be representative I would want data from real people. There are many lists of “Most common names in …” for various countries. These however mostly consist of either first names or surnames. I doubted just for example first names would not contain enough information to train a model. For a while I thought I could get surname lists and first name lists and randomly pair elements from them to make larger training sets. Looking for real names, I found that Wikipedia has a list of people who have Wikipedia pages by country!

We can quickly get all this data

%pylab inline
import pandas as pd
import seaborn as sns
sns.set_color_codes()

import codecs
from unidecode import unidecode

import requests

wiki = 'https://en.wikipedia.org'
res = requests.get(wiki + '/wiki/Lists_of_people_by_nationality')


By inspecting the downloaded HTML I noticed a pattern on the rows that contain links to lists of people. So I parsed these and extract the country name and URL to the list.

list_urls = {}
for line in res.content.split('\n'):
    if line.startswith('<li><a href="/wiki/List_of_'):
        l =  line.split('<li><a href="')[-1] \
                 .split('" title=')[0]
        country = l.split('/wiki/List_of_')[-1]
        list_urls[country] = l

f1 = lambda s: s.startswith('<li><a href="/wiki/')
c1 = lambda s: s.split('" title="')[-1].split('"')[0]

lists = {}
for country in list_urls:
    content = requests.get(wiki + list_urls[country]).text.split('\n')
    lists[country] = map(unidecode, map(c1, filter(f1, content)))

with open('name_list.tsv', 'w') as fh:
    for country in lists:
        for name in lists[country]:
            fh.write('{}\t{}\n'.format(name, country))


There are some obvious problems with this. Not all lines I parse out are actually lists of names, but other lists. Additionally the larger countries have so many entries that they link to lists of sublists rather than actually lists. (This includes US and China.) The number of countries isn’t that high and it would be possible to just add these manually. But I don’t feel like doing that and will ignore those cases.

To make downstream analysis easier, I romanized all names using the unidecode package. This translates all characters in the names to ASCII.

In total the process fetches in total 68,634 names distributed over 214 countries.

Making the model

I knew that N-grams of words are commonly used for document classification, and thought I can make something similar for just letters in names. Turns out this was already nicely implemented in the scikit learn as a parameter to the CountVectorizer class.

For the most part I followed the scikit-learn text analytics tutorial. The tutorial said “bags of words” representations of documents typically have more than 100 0000 features. Since a “document” in my case is a name, there are not so many n-grams in each. To get a large number of featuers I set the CountVectorizer to consider all N-grams for all N between 1 and 5. I limited the maximum number of features to 100000 so that I would know how large the data sets would be.

As per the tutorial, I push the output of the CountVectorizer through a TfidfTransformer. The idea with this is to downscale weights of features that occur in many samples.

Finally, I use these features to train a Multinomial Naive Bayes model. (I’m not discussing much of the details of the methods here, as they are well described in the references. A thing about scikit-learn for applying machine learning that I like is also that you can use a lot of very nice tools without knowing much about the details.)

All these steps are combined to a pipeline, so we can easily just supply raw text and get a categorical prediction.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                                              ngram_range=(1,5),
                                              max_features=100000)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])


The data is read in, and split in to a training set that we will use to tune the model, and a testing set that we will finally use to evaluate the model.

df = pd.read_table('name_list.tsv', header=-1)
df.columns = ['name', 'country']
name_train, name_test, country_train, country_test = \
train_test_split(df.name, df.country, test_size=10000)


Now I could simply train the classifier using the training data.

name_clf.fit(name_train, country_train)

predicted = name_clf.predict(name_test)
np.mean(predicted == country_test)

Out []: 0.27860000000000001


While much better than randomly guessing countries (1 / 214 = 0.0047), it’s not a particularly impressive score.

Earlier the same day, I had listened to the Talking Machines podcast where Jennifer Listgarten described how they had trained a model to predict efficient CRISPR/Cas9 guide targets (avilable in a soon to be released service Azimuth). She said that predicting the best guide was a very hard problem. But predicting say the top 5 guides had very high accuracy. Trying a small number of guides is much more efficient than many possible guides anyway.

So I stole that strategy, and return a top 5 of predicted countries for a given name. Thus I change the model evaluation criterion. The strategy is very well suited for MultinomialNB, since prediction means evaluating the posterior probability for each class. We can just take these probabilities and sort out the highest ones.

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
np.mean((a[:, None] == b).max(1))

Out []: 0.4647


Better, but still not stellar.

The MultinomialNB model has a hyperparameter alpha. This is a smoothing parameter that helps with the problem that many features have 0 counts in any given document. This would cause the posterior probability to be 0. The default value for alpha is 1. We’ll try to do better by optimizing the alpha by randomized cross validation.

from sklearn.grid_search import RandomizedSearchCV

a = 10 ** np.linspace(-5, 0)
rscv = RandomizedSearchCV(name_clf,
param_distributions={'clf__alpha': a},
n_iter=10,
n_jobs=2)

rscv.fit(name_train, country_train)

rscv.best_params_

Out []: {'clf__alpha': 0.014563484775012445}


We plug this value in to out classifier pipeline, and try again.

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                                              ngram_range=(1,5),
                                              max_features=100000)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(alpha=0.014563484775012445))])

name_clf.fit(name_train, country_train)

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
np.mean((a[:, None] == b).max(1))

Out []: 0.66959999999999997


This is quite nice! We can make a function to check how it performs in practice.

def top_countries(name):
    log_proba = name_clf.predict_log_proba([name])
    t5 = log_proba.argsort(axis=1)[:,-5:]
    b = name_clf.get_params()['clf'].classes_[t5]
    for c, lp in zip(b[0], log_proba[:,t5].flatten())[::-1]:
        print '{}\t\t{}'.format(c, np.exp(lp))

top_countries('angela merkel')

Vietnamese      0.893538897127
Greeks      0.0294758574644
Baltic_Germans      0.0194029861994
Irish_people        0.015581222622
Bulgarians      0.00674997108663


While some of the suggestions are… counterintuitive… there is some German there! The strategy to return a top 5 seem to work out.

Making the model deployable

While I thought it would be nice to make the model usable through a web application, I didn’t want to spend a fortune doing so! Heroku offers free web application hosting in their Free tier with the limitation of 512 MB of memory. Additionally, the final size of the compiled web application “slug” can only be 300 MB.

The training data is 58 000 examples with 100 000 features each. The MultinomialNB model needs to store two dense matrices of size (58 000, 100 000). Running the model in the Python interpreter requires a bit over 1 GB of RAM. Storing the model with joblib.dump() as recommended in the documentation creates around 700 MB of data, so we wouldn’t be able to deploy it on Heroku.

The strategy I settled for was to put in a step in the pipeline to ignore non-informative features. To find which features contain the most information I used TruncatedSVD, a dimensionality reduction technique known to work well for “bags of words” models.

from sklearn.decomposition import TruncatedSVD

tf_pipeline = Pipeline([('vect', CountVectorizer(analyzer='char',
                                                 ngram_range=(1,5),
                                                 max_features=100000)),
                        ('tfidf', TfidfTransformer())])

name_features = tf_pipeline.fit_transform(name_train)

tsvd = TruncatedSVD(n_components=400)

tsvd.fit(name_features)

tsvd.explained_variance_ratio_.sum()

Out []: 0.27138872714390422


I picked 400 components because sqrt(100000) = 316.23 and I rounded up. There’s no rule for this or so, but this let me fit the TruncatedSVD for the entire training set in a reasonable amount of time. The total variance explained by 400 components is about 27%. It doesn’t sound as much, but it can inform us on the importance of the features.

We extract importances in the TruncatedSVD by looking at the lengths of the singular vectors stored in TruncatedSVD.components_.

idx = np.linalg.norm(tsvd.components_[:, :], axis=0).argsort()


We make a “Reducer” class which we can plug in to the pipeline to tell the pipeline which features to pass on from the feature extraction step to the MultinomialNB classifier.

from sklearn.base import TransformerMixin

class Reducer(TransformerMixin):
def __init__(self, idx):
    self.idx = idx

def fit_transform(self, X, y=None, **fit_params):
    return self.transform(X)

def transform(self, X):
    return X[:, self.idx]

def get_params(self, *k, **kw):
    return {'': ''}


We could include the important feature selection in a fit method, but to speed this up, we perform that calculation offline and provide the indexes of the features to the constructor of the object.

I checked a range of the number of features to pass on to the MultinomialNB by looking at the model score. This is bad thing to do and you shouldn’t do it! Now I’m causing an information leak between the parameter selection and the model evaluation. However, I think it’s relatively obvious that the score will be monotonically increasing with the number of features. And I had already decided I wanted to use between 10 000 and 20 000 features for the final model due to the computational budget. I was just curious at what rate things would get better if I’d allow more features.

Now we train the model with a reducer step that passes the 15 000 most informative features to the classifier step.

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                                              ngram_range=(1,5),
                                              max_features=100000)),
                     ('tfidf', TfidfTransformer()),
                     ('reducer', Reducer(idx[-15000:])),
                     ('clf', MultinomialNB(alpha=0.014563484775012445))])

name_clf.fit(name_train, country_train)

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
s = np.mean((a[:, None] == b).max(1))
s

Out []: 0.62849999999999995


A pretty modest performance decrease given that the model only takes 15% of the memory needed before! As a side effect, the prediction operation in the model is now a lot snappier as well, which feels good in the web app.

With the indices for the features saved, it is actually relatively quick to just train the model from the total amount of raw data. So for deploying the web app, I just have it load the data and train when it launches. It takes about 30 seconds, and the raw data (the list of names and countries) is about 2 MB in total.

Web application

The web application itself is written using the microframework Flask. I define one entry point that provides an API, where the name is queried and the predictor is called to return the top scoring countries. The API returns these in JSON format. The second entry point just serves a HTML file, which in turn loads a Javascript file that connects the input box and buttons to the API.

To get access to scientific Python tools like scikit-learn on Heroku I use the Conda Buildpack.

You can see the full implementation of the app on GitHub, as well as the Jyputer notebook where I made the model.

Finally, have a look at the result!

letmeguesswhereyourefrom.com

Comparing unpublished RNA-Seq gene expression quantifiers

During the conference BoG a new RNA-Seq gene expression quantification tool called Kallisto was announced and released by the Pachter group. This tools have me excited because it is following the trend of Sailfish, RNASkim and recently Salmon. The classical approach for gene (or transcript) expression is aligning reads to genome, then by looking at coverage figuring out which transcripts are expressed at which levels. These new tools in stead use information contained in reads where exact mapping is not completely necessary, and use that to deconvolve the mixture proportions of transcripts in a sample.

To compare these quantifiers, and also using results from a classical TopHat + Cufflinks combination as a control I used five batches of single cell RNA-Seq from three different published data sets: SRP030617 (batch 1), SRP041736 (batch 2 & 3), and EGAS00001001204 (batch 4 & 5). These are all data from human cells, in total 284 samples. The data have between ~100 000 reads and ~5 000 000 reads.

All these data sets have had ERCC spike-ins added to them. To compare quantification accuracy, I first quantify expression for all genes / transcripts plus the spike-ins using the different tools. Then I extracted the TPM values for the spike-ins and looked at the Pearson correlation of the log transformed TPM values and the log transformed concentration of ERCC spike-ins. The plots below shows the distribution of these Pearson R values for the different tools over the different dataset batches. The n refers to the number of samples in a given batch.

They all perform well, at least equally well to TopHat + Cufflinks, and generally have pretty high correlation values (very few are below 0.8). In some cases Salmon seem to perform better.

To make the task a bit harder, we can look at the input dilution of ERCC’s included in the publications of the data sets and look only at spike-ins present at low levels. In this case given the concentration and dilution in to the volume of the experiments, we look only at the spike-ins present at between 1 and 50 RNA molecule copies.

All the methods perform equally, varying from surprisingly good to pretty bad depending on data set.

Regarding speed, the TopHat + Cufflinks combo took something along the lines of a half a day per sample, using multithreading over 12 cores for each sample. (I didn’t save the timings). The Salmon quantification took about 15 minutes per sample, also with 12 cores multithreading, while the Kallisto took about 10 minutes per sample and does not do any multithreading.

It should be noted that since there are no alternative splice forms for the ERCC spike-ins this doesn’t tell us anything about correction or quantification of alternative transcripts.

Modelling Measles in 20th Century US

A while ago some people at the Wall Street Journal published a number of heatmaps of incidence of infectious diseases in states of the US over the 20th century. This started a bit of a trend online where people remade the plots with their own sensibilites of data presentation or aestethics.

After the fifth or so heatmap I got annoyed and made a non-heatmap version.

I considered the states as individual data points, just as others had treated the weeks of the years in the other visualizations. Since the point was to show a trend over the century I didn’t see the point of stratifying over states. (The plot also appeared on Washinton Post’s Wonkblog)

The data is from the Tycho project, an effort to digitize public health data from historical records to enable trend analysis.

I thought it would be interesting to look at some other features of the data beside the global downward trend, and in particular how the incidence rate varies over time. In total there are about 150 000 data points, which I figured would also be enough to tell differences between states when considering the seasonal trend over a year as well.

We make a linear model by fitting a spline for the yearly trend, and a cosine curve for the seasonal change in measles incidence over a year. We also choose to be even more specific by finding different seasonal parameters for each state in the US.

$$ y = \sum_{i=0}^n c_i \cdot B_i\left(x_\text{year}\right) + \alpha_\text{state} \cdot \cos\left(\frac{2 \pi}{52}\cdot ( x_\text{week} - 1)\right) + \beta_\text{state} \cdot \sin\left(\frac{2 \pi}{52}\cdot ( x_\text{week} - 1)\right) $$

This way we will capture the global trend over the century, as well as a seasonal component which varies with the week of the year for every state. The global trend curve look very similar to the old version.


Every faint dot in the plot is a (state, week) pair. There are up to 2 600 of these in every year, so seeing them all at once is tricky. But we can still see the trend.

As a zoomed in example, let us now focus on one state and investigate a seasonal component.


It is fairly clear both from looking at the data and the fit of the cosine curve how the measles incidence (which is defined as number of cases per 100 000 people) changes over the year, peaking in early April.

As an alternative representation of the same plot, we can put it in polar coordinates to emphasise the yearly periodicity.

We fitted the model such that we will have a different seasonal component for each state. To visualize them all at once we take the seasonal component for each state, assign a color value based on it’s level on a given week, and put this on a map. We do this for every week and animate this over the year, resulting in the video below.



In the video green corresponds to high incidence and blue to low. There is an appearance of a wave starting from some states and going outwards towards the coasts. This means some states have different peak times of the measles incidence than others. We can find the peak time of a state’s seasonal component by doing some simple trigonometry.

$$ \rho \cos\left(\frac{2 \pi t}{l} - \phi\right) = \rho \cos(\phi) \cos\left(\frac{2 \pi t}{l}\right) + \rho \sin(\phi) \sin\left(\frac{2 \pi t}{l}\right) = $$ $$ = \alpha \cos\left(\frac{2 \pi t}{l}\right) + \beta \sin\left(\frac{2 \pi t}{l}\right) $$ $$ \alpha = \rho \cos(\phi), \ \beta = \rho \sin(\phi) $$ $$ \rho^2 = \alpha^2 + \beta^2 $$ $$ \phi = \arccos\left(\frac{\alpha}{\rho}\right) $$

So $ \phi $ will (after scaling) give us the peak time for a state (meaning the offset of the cosine curve). After extracting the offset for each state, we make a new map illustrating the peak times of measles incidence over the year.


The color scale ranges from week 10 (white) in early March to week 14 (black) in early April. We didn’t provide the model with locations of the states, but the peak times do seem to cluster spatially.


Methodology

I also want to show how I did this.

%pylab inline

# Parsing and modelling
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

# Helps with datetime parsing and formatting
from isoweek import Week
import calendar

# For plotting the spline
from patsy.splines import BS
from scipy import interpolate

# For creating US maps
import us
from matplotlib.colors import rgb2hex
from bs4 import BeautifulSoup


The data we use from the Tycho project comes in a csv which we parse with Pandas.

measles = pd.read_csv('MEASLES_Incidence_1928-2003_20150413093130.csv', skiprows=2)
measles.index = pd.DatetimeIndex([Week(*t).monday() for t in zip(measles.YEAR, measles.WEEK)])
measles = measles.drop(['YEAR', 'WEEK'], 1)
measles = measles.convert_objects(convert_numeric=True)


Now we have tabular data of this form

measles.iloc[:5, :5]


ALABAMA ALASKA ARIZONA ARKANSAS CALIFORNIA
1928-01-02 3.67 NaN 1.90 4.11 1.38
1928-01-09 6.25 NaN 6.40 9.91 1.80
1928-01-16 7.95 NaN 4.50 11.15 1.31
1928-01-23 12.58 NaN 1.90 13.75 1.87
1928-01-30 8.03 NaN 0.47 20.79 2.38


This is a large matrix of states vs time points with incidence in each cell. For making a linear model we would prefer to have it in a form where each row is an observation, and the columns giving variables used.

year = []
week = []
state = []
incidence = []
for i in measles.index:
    for c in measles.columns:
        year.append(i.year)
        week.append(i.week)
        state.append(c)
        incidence.append(np.log10(measles.ix[i, c] + 1))

data = pd.DataFrame({'year': year,
                     'week': week,
                     'state': state,
                     'incidence': incidence})
data.iloc[:5, :5]


incidence state week year
0 0.669317 ALABAMA 1 1928
1 NaN ALASKA 1 1928
2 0.462398 ARIZONA 1 1928
3 0.708421 ARKANSAS 1 1928
4 0.376577 CALIFORNIA 1 1928


We can now use statsmodels to define and fit the linear model

df = 12
d = 3
model = smf.ols('incidence ~ bs(year, df=df, degree=d, include_intercept=True) \
                + np.cos(2 * np.pi / 52 * (week - 1)) : C(state) \
                + np.sin(2 * np.pi / 52 * (week - 1)) : C(state) \
                - 1', data=data).fit()


To be able to use the parameters fitted for the spline, a bit of plumbing is needed. We let patsy and statsmodels pick the knots for the spline, for which the coefficients are fitted by the OLS model. One could calculate these knots directly from the data. But it is simpler to grab the function for picking the knots from patsy. Then we extract the spline coefficients from the model parameters.

my_bs = BS()
my_bs.memorize_chunk(data.year, df=df, degree=d, include_intercept=False)
my_bs.memorize_finish()

coeffs = []

conf_int= model.conf_int(alpha=0.05)
for i in range(df):
    parameter_name = 'bs(year, df=df, degree=d, include_intercept=True)[{}]'.format(i)
    coeffs.append(model.params[parameter_name])

knots = my_bs._all_knots

tck = (np.array(knots), np.array(coeffs), 3)


The interpolate function from scipy uses triples of knots, coefficients and the degree of the basis polynomials as a parameter, which is used to evaluate any point in the domain of the spline.

x = np.arange(1928, 2004)
y = interpolate.splev(x, tck)


Once we have done this, we can create the first figure.

figsize(8, 6)

# For legend with high alpha dots
plt.scatter(-1, -1, c='k', edgecolor='w', label='Observations')
plt.scatter(data.year, data.incidence, alpha=0.01, edgecolor='w', c='k');

plt.plot(x, y, c='w', lw=5);
plt.plot(x, y, c='r', lw=3, label='Model fit');

plt.xlim(x.min() - 1, x.max() + 1);
plt.xlabel('Year')

plt.ylim(0, 2.);
loc, lab = plt.yticks()
lab = [np.round(10 ** l) for l in loc]
plt.yticks(loc, lab);
plt.ylabel('Measles incidence');

plt.legend(scatterpoints=5)

plt.title('U.S. Wide trend');

plt.tight_layout();
plt.savefig('per_year.png');



Next we extract the periodic component parameters for a given state and define a function of the week of the year.

state = 'CALIFORNIA'

alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
beta  = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]

def periodic(week):
    return alpha * np.cos(2 * np.pi / 52 * (week - 1)) + \
           beta * np.sin(2 * np.pi / 52 * (week - 1))


And using this we can create the second plot

figsize(8, 6)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
plt.scatter(data.query('state == "{}"'.format(state)).week,
            data.query('state == "{}"'.format(state)).incidence,
            c=data.query('state == "{}"'.format(state)).year,
            alpha=0.99,
            s=30,
            cmap=cm.Greys_r,
            edgecolor='w',
            label='Observations', color='k')

x = np.linspace(1, 52, 100)
year = 1950

yy = np.maximum(periodic(x) + y[year - 1928], 0)
plt.plot(x, yy, lw=5, c='w') + \
plt.plot(x, yy, lw=3, c='r', label='Model fit');

plt.xlim(x.min() - 1, x.max() + 1);
plt.ylim(0, 2);

loc, lab = plt.yticks()
lab = [np.round(10 ** l) for l in loc]
plt.yticks(loc, lab);
plt.ylabel('Measles incidence');

xtloc = 1.5 + 52. / 12 * np.arange(0, 13)
plt.xticks(xtloc, calendar.month_name[1:], rotation=30);
plt.xlabel('Time of year');

plt.title(state)
plt.legend(scatterpoints=5)

plt.colorbar(label='Year');

plt.tight_layout();
plt.savefig('ca_per_week.png');



To create the polar version of the same plot we don’t need to change much, just define the axis to be polar, and transform the x range from 1 to 52 to be between 0 and $2\pi$. Though we do also flip the orientation and rotate the 0 of the polar coordinates since having January 1st on top feels more intuitive, as well as clockwise direction.

fig = plt.figure()
ax = fig.add_subplot(111, projection='polar')

ww = data.query('state == "{}"'.format(state)).week
plt.scatter(np.pi / 2.0 - 2 * np.pi * ww / 52,
            data.query('state == "{}"'.format(state)).incidence,
            c=data.query('state == "{}"'.format(state)).year,
            alpha=0.99,
            s=30,
            cmap=cm.Greys_r,
            edgecolor='white',
            color='k',
            label='Observations')

x = np.linspace(1, 52, 100)

year = 1950
yy = np.maximum(periodic(x) + y[year - 1928], 0)
xx = np.pi / 2.0 - 2 * np.pi * x / 52
plt.plot(xx, yy, lw=5, c='w');
plt.plot(xx, yy, lw=3, c='r', label='Model fit');

plt.ylim(0, 2);

month_locs = np.pi / 2.0 + 2 * np.pi / 24 - 2 * np.pi / 12 * np.arange(13)
ax.set_xticks(month_locs);
ax.set_xticklabels(calendar.month_name);
ax.set_yticks([0.5, 1.0, 1.5, 2.0]);
ax.set_yticklabels([''] * 4);

ax.set_title(state);

plt.legend(scatterpoints=5, loc='upper center')

plt.colorbar(label='Year');

plt.tight_layout();
plt.savefig('ca_per_week_polar.png');



Now, the US state map (also known as a choropleth). This was WAY more complicated than I tought it would be! I imagined there would be some simple functions in matplotlib or similar package. But the ones I tried required map files I wasn’t familiar with and some of them I couldn’t get running. In particular I think GeoPandas is interesting, but I could not get it working.

After trying a bunch of packages and approaches, at some point I ended up with an SVG file with defined state paths. I’m not completely sure were it came from, it might have been from simplemapplot. Just to be sure, I put the SVG file up on a gist.

Before going on to making the visualization, let’s put all the states’ seasonal functions in a dictionary so we can easily use them.

def make_periodic(state):
    alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    beta  = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]

    def periodic(week):
        return alpha * np.cos(2 * np.pi / 52 * (week - 1)) + \
               beta * np.sin(2 * np.pi / 52 * (week - 1))

    return periodic

periodics = {}
for state in data.state.unique():
    periodics[state] = make_periodic(state)


Ok, so the approach that we take is to use BeautifulSoup to parse and edit the SVG file.

svg = open('output_state_map.svg', 'r').read() 
soup = BeautifulSoup(svg)


The way to change the colors is to change the style attributes of each state element in the SVG. The inspiration for this strategy was this IPython notebook where the same is done on an Iowa map.

First we set up a baseline path style which we will append the color to.

path_style = "font-size:12px;fill-rule:nonzero;stroke:#000000;" + \
             "stroke-width:1;stroke-linecap:butt;stroke-linejoin:bevel;" + \
             "stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;" + \
             "marker-start:none;fill:"


So what we do with this is to loop over the weeks, and for each week color all the states according to their seasonal incidence functions. The colors are converted to hex codes and assigned to the corresponding state’s path. We save one SVG per week.

for w in range(1, 53):
    all_states = soup.findAll(attrs={'class':'state'})
    for p in all_states:
        state = us.states.lookup(p['id']).name.upper()
        p['style'] = path_style + rgb2hex(cm.winter(periodics[state](w) + 0.4))

    h = soup.find(attrs={'id': 'hdlne'})
    mnd = Week(1990, w).monday()
    h.string = calendar.month_name[mnd.month]

    fo = open("/tmp/state_map_{:02}.svg".format(w), "wb")
    fo.write(soup.prettify());
    fo.close()


The modified SVG files are saved as if they are HTML files by BeatifulSoup. The files will still be viewable in web browsers. But for using conversion tools or vector graphics editing tools these need to be in proper SVG format. This is farily easy to fix, one just need to remove the <body> and <html> tags from the file.

%%bash
for f in /tmp/state_map_*.svg; do
    grep -v "body\|html" $f > $f.tmp
    mv $f.tmp $f
done


Once this has beed fixed, we can create an animated gif using the convert command from ImageMagick.

%%bash
convert -delay 6 -loop 0 /tmp/state_map_??.svg animated_state_map.gif


Because having a constantly animating gif on the page was extremely annoying, I converted it to a video using ffmpeg.

%%bash
ffmpeg -f gif -i animated_state_map.gif animated_state_map.mp4


The final bit is making the map with the seasonal peak times. We simple solve the equations described above for each state to find te offset of the cosine function.

state_peak = {}
for state in data.state.unique():
    alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    beta = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    rho = np.sqrt(alpha ** 2 + beta ** 2)
    theta = np.arccos(alpha / rho)

    peak_week = theta / (2 * np.pi) * 52
    state_peak[state] = peak_week

state_peak = pd.Series(state_peak)


Just like above we color the states by the value.

svg = open('output_state_map.svg', 'r').read() 
soup = BeautifulSoup(svg)

h = soup.find(attrs={'id': 'hdlne'})
h.string = 'Peak incidence'
h['x'] = 600

all_states = soup.findAll(attrs={'class':'state'})
for p in all_states:
    state = us.states.lookup(p['id']).name.upper()
    peak = state_peak[state]
    col = rgb2hex(cm.Greys((peak - state_peak.min()) / (state_peak.max() - state_peak.min())))
    p['style'] = path_style + col

fo = open("measles_spread.svg".format(w), "wb")
fo.write(soup.prettify());
fo.close()


To add the color bar I edited the resulting SVG file in Adobe Illustrator.

Improper applications of Principal Component Analysis on multimodal data

When reading papers I have noticed some strange PCA plots. An example of this I stumbled over today was this one, Figure 1B from Kim et al., 2015, Cell Stem Cell 16, 88–101.

The odd thing here is that there are two obvious subpopulations of points, but within each they appear to have the same slope of PC1 vs PC2. This indicates that there is a linear dependence between PC1 and PC2, with some other factor explaining the difference between the clusters. In the case of the paper they only make this plot as a way to illustrate the relation between the samples. But if there where to be any additional analysis of the components in this PCA, this relation between PC1 and PC2 tells us something is wrong.

I made an artificial example to highlight this issue using Python.

from sklearn.preprocessing import scale
np.random.seed(8249)

N = 500
mode = np.random.randint(0, 3, N) * 10

x = np.random.normal(0, 100, N)
y = mode + np.random.normal(0, 1, N)

X = np.vstack([x, y]).T
X = scale(X)

figsize(6, 6)

plt.scatter(X.T[0], X.T[1], c='k', edgecolor='w', s=50);
sns.axlabel('x', 'y');

plt.title('Generated data');


This data is 2-dimensional with one normally distributed variable accounting for the x-axis, and the y-axis being due to membership in one of three discrete classes, with some normally distributed noise added. Imagine that we don’t have the information about the two processes generating the points, and we wish to find a model for the points by latent factors. We apply principal component analysis to this data.

from sklearn.decomposition import PCA

pca = PCA()
Y = pca.fit_transform(X.copy())

figsize(6, 6)

plt.scatter(Y.T[0], Y.T[1], c='k', edgecolor='w', s=50);

sns.axlabel('Principal Component 1', 'Principal Component 2');

plt.title('Principal Component Analysis');


The issue is that principal component analysis seeks to primarilly find a projection of the data to one dimension which maximizes the variance. If your data is multivariately normally distributed the PCA will find the “longest” factor explaining the data. But this does not hold when the data one is working with is for example multimodal. Like in this trimodal case. The largest variance is obviosly over the diagonal of the square-like area spanned by the points. But for finding a latent factor model with this kind of data variance is simply not so useful as a measure.

In this case it might make sense to use Independent Component Analysis (ICA) in stead. It is a strategy for finding latent factors which aims to make the latent factors searched for as statistically independent as possible. It makes no assumption about the distribution of the data.

from sklearn.decomposition import FastICA

ica = FastICA()
Z = ica.fit_transform(Y.copy())

figsize(6, 6)

plt.scatter(Z.T[0], Z.T[1], c='k', edgecolor='w', s=50);
sns.axlabel('Independent Component 1', 'Independent Component 2');
plt.title('Independent Component Analysis');


One immidate caveat with ICA over PCA is that there is no longer an ordering of the components, component 1 does not necesarilly explain more of the variability of the data than component 2. They are both equally important for the inferred latent variable model.

It is clear here though that component 1 corresponds to the membership, and component 2 to the x-variable.

If you are expecting clusters in your data, PCA is probably not the right tool.

High contrast stacked distribution plots

A while ago a colleague showed me an image of a beautiful way of plotting several distributions to make them easily comparable. I have since failed to find these plots or managed to figure out what they are called. But the concept was to plot densities for several variables in such a way that they appeared behind each other in depth.

Let me first illustrate the effect before we get to an implementation.

%pylab inline
import pandas as pd
import seaborn as sns
sns.set_style('white')

xx = np.linspace(0, 4 * np.pi, 512)


We just make some random sinus curves for illustration

df = pd.DataFrame()
for i in range(1, 32):
    df[i] = np.sin((np.random.rand() + 0.5) * xx) + 1

figsize(16, 16)
for i, ind in enumerate(df):
    offset = 1.1 * i
    plt.fill_between(xx, df[ind] + offset, 0 * df[ind] + offset,
                     zorder=-i,
                     facecolor='k',
                     edgecolor='w',
                     lw=3)

sns.despine(left=True, bottom=True)
plt.xticks([])
plt.yticks([]);

plt.savefig('plots/1.png');

There are just a few small tricks to the effect. The zorder=-i parameter makes the curves stack in the correct order. Making the edge colors the same color as the background gives the effect of depth. Personally I like the very high contrast style of using black fills. (Another colleague who saw these plots called them “Moomin plots” )

You can also make some pretty, regular, patterns with this look.

df = pd.DataFrame()
for i in range(1, 16):
    df[i] = np.sin(i * xx) + 1

figsize(14, 8)
for i, ind in enumerate(df):
    offset = 1.1 * i
    plt.fill_between(xx, df[ind] + offset, 0 * df[ind] + offset,
                     zorder=-i,
                     facecolor='k',
                     edgecolor='w',
                     lw=3)

sns.despine(left=True, bottom=True)
plt.xticks([])
plt.yticks([]);

plt.savefig('plots/2.png');

So how can we plot distribution for comparison with this technique? First let us create some simulated data. In this case I’m randomly mixing unimodal and bimodal normally distributed data, this is what a lot of the data we study in my research group looks like. Letting the mean be between 0 and 16 is arbitrary of course, and I just picked the variance to be 2 because I thought it looked good.

def unimodal():
    m = random.randint(0, 16)
    return random.randn(256) * 2 + m

def bimodal():
    m1 = random.randint(0, 16)
    s1 = random.randn(128) * 2 + m1
    m2 = random.randint(0, 16)
    s2 = random.randn(128) * 2 + m2
    return np.hstack((s1, s2))

df = pd.DataFrame()
distributions = [unimodal, bimodal]
for i in range(16):
    distribution = random.choice(distributions)
    df[i] = distribution()


We’ll use the Gaussian KDE from scipy for this.

from scipy.stats import kde


Now we pick an x-axis which should cover the entire range of the data in all the columns of the DataFrame we keep the data in.

We handle the y-offsets of the data by picking half the height of the previous density plot. This can be tweaked and give slightly different effect dependong on what fraction of the height is used. I found half height to be most generally good looking though.

To make it easy to relate a label to a distribution we plot a line the density can “rest on”. If it is thick enough it will give the effect of the density growing up from it.

figsize(6, 8)
xx = np.linspace(df.min().min(), df.max().max(), 256)
yy = None
offset = 0
offs = [offset]
for i in df:
    if yy is not None:
        offset += 0.5 * yy.max()
        offs.append(offset)

    s = df[i]
    f = kde.gaussian_kde(s)
    yy = f(xx)

    plt.fill_between(xx, yy + offset, 0 * yy + offset,
                     zorder=-i,
                     facecolor='k',
                     edgecolor='w',
                     lw=3)

    plt.axhline(y=offset,
                zorder=-i,
                linestyle='-',
                color='k',
                lw=4)

sns.despine(left=True, bottom=True)
plt.yticks(offs, ['Sample {}'.format(s + 1) for s in df.columns]);

plt.tight_layout()
plt.savefig('plots/3.png');

I like these a lot, though it looks a bit messy. Of course, with real data the ordering of the variables on the y-axis would be informed by some prior knowledge so that trends could be easily spotted.

But even so, if we don’t know how we should order these for easy comparison between the distributions, we can use hierarchical clustering to arrange the distribution so they become more comparable.

import fastcluster as fc
from scipy.cluster.hierarchy import dendrogram

figsize(6, 8)

link = fc.linkage(df.T, method='ward')
dend = dendrogram(link, no_plot=True)
dend['leaves']

xx = np.linspace(df.min().min(), df.max().max(), 256)
yy = None
offset = 0
offs = [offset]
for i, ind in enumerate(df[dend['leaves']]):
    if yy is not None:
        offset += 0.5 * yy.max()
        offs.append(offset)

    s = df[ind]
    f = kde.gaussian_kde(s)
    yy = f(xx)

    plt.fill_between(xx, yy + offset, 0 * yy + offset,
                     zorder=-i,
                     facecolor='k',
                     edgecolor='w',
                     lw=3)

    plt.axhline(y=offset,
                zorder=-i,
                linestyle='-',
                color='k',
                lw=4)

sns.despine(left=True, bottom=True)
plt.yticks(offs, ['Sample {}'.format(s + 1) for s in df[dend['leaves']].columns]);

plt.tight_layout()
plt.savefig('plots/4.png');

Extract Cluster Elements by Color in Python Dendrograms

One aspect of using Python for data analysis is that hierarchical clustering dendrograms are rather cumbersome to work with. Both in terms of plotting next to a heatmap, and how to relate the input data to the resulting plot. Ideally the dendrogram function would return a proper instances of some Dendrogram class. But let’s work with what we have.

There are plenty (1, 2) of guides around on how to make R inspired heatmaps with dendrograms. Here I will describe how to parse out the cluster members as seen in a dendrogram, which can be handy if one notices interesting patterns in the corresponding heatmap.

Let’s start with importing some modules.

%pylab inline

from collections import defaultdict

import pandas as pd
from scipy.cluster.hierarchy import dendrogram, set_link_color_palette
from fastcluster import linkage
import seaborn as sns
from matplotlib.colors import rgb2hex, colorConverter

numpy.random.seed(0025)


We also set some prettier non-default colours.

sns.set_palette('Set1', 10, 0.65)
palette = sns.color_palette()
set_link_color_palette(map(rgb2hex, palette))
sns.set_style('white')


We should create some simple example data for the purpose of illustration.

x, y = 3, 10
df = pd.DataFrame(np.random.randn(x, y),
                  index=['sample_{}'.format(i) for i in range(1, x + 1)],
                  columns=['gene_{}'.format(i) for i in range(1, y + 1)])

link = linkage(df, metric='correlation', method='ward')

figsize(8, 3)
den = dendrogram(link, labels=df.index, abv_threshold_color='#AAAAAA')
plt.xticks(rotation=90)
no_spine = {'left': True, 'bottom': True, 'right': True, 'top': True}
sns.despine(**no_spine);


The dendrogram function will return a dictonary containing a representation of the tree as plotted.

den

{'color_list': ['#c13d3f', '#AAAAAA'],
 'dcoord': [[0.0, 0.70210454159043401, 0.70210454159043401, 0.0],
  [0.0, 1.1003608323279817, 1.1003608323279817, 0.70210454159043401]],
 'icoord': [[15.0, 15.0, 25.0, 25.0], [5.0, 5.0, 20.0, 20.0]],
 'ivl': ['sample_3', 'sample_1', 'sample_2'],
 'leaves': [2, 0, 1]}


The tree is represented as a collection of ∏ shaped components.

The three items named color_list, dcoord, icoord indexes these ∏’s

Obviosly color_list contains the colors. The lists in dcoord contain the y coordinates of the ∏’s (the distances) while icoord has the x coordinates. These would be the index coordinates.

In the minimal example above we have two ∏’s, and one can see that the x-coordinates are repeated once.

The coordinates go from left to right. So for the red ∏, the ‘legs’ are located at 15 and 25, while the grey one has legs at 5 and 20.

The apperant pattern is that legs positioned at leaves will end with 5. The reason is to nicely place the leg at the middle of the corresponding leaf index value. This also implies the actual list indices of the leaf are multiplied by 10.

Thus we first subtract 5 from each colors icoord, then divide by 10. If the resulting number is close to the closest integer, we consider this to be an index for a leaf. If the resulting number is not close to an integer index, it means the colored tree we got it from is from non-leaf parts of the trees.

For each leaf, we add it to a list per color in a dictionary.

cluster_idxs = defaultdict(list)
for c, pi in zip(den['color_list'], den['icoord']):
    for leg in pi[1:3]:
        i = (leg - 5.0) / 10.0
        if abs(i - int(i)) < 1e-5:
            cluster_idxs[c].append(int(i))

cluster_idxs

defaultdict(<type 'list'>, {'#c13d3f': [1, 2], '#AAAAAA': [0]})


Next we need to grab the labels of the leaves given the indexes. But before we do that, since it’s difficult to keep track of what color e.g. #c13d3f is, we make make an IPython notebook compatible HTML representation of the dictionary holding the information. Objects of this class will behave just like dictionaries, except for representing them as a HTML table.

class Clusters(dict):
    def _repr_html_(self):
        html = '<table style="border: 0;">'
        for c in self:
            hx = rgb2hex(colorConverter.to_rgb(c))
            html += '<tr style="border: 0;">' \
            '<td style="background-color: {0}; ' \
                       'border: 0;">' \
            '<code style="background-color: {0};">'.format(hx)
            html += c + '</code></td>'
            html += '<td style="border: 0"><code>' 
            html += repr(self[c]) + '</code>'
            html += '</td></tr>'

        html += '</table>'

        return html


(Note that the representation uses colerConverter from matplotlib, so it supports any matplotlib supported color representation, not only hex color strings.)

Now just go through the list of indices and fetch labels.

cluster_classes = Clusters()
for c, l in cluster_idxs.items():
    i_l = [den['ivl'][i] for i in l]
    cluster_classes[c] = i_l

cluster_classes


#c13d3f['sample_1', 'sample_2']
#AAAAAA['sample_3']


Let’s combine this to a nice reusable function, and try it out on a larger example.

def get_cluster_classes(den, label='ivl'):
    cluster_idxs = defaultdict(list)
    for c, pi in zip(den['color_list'], den['icoord']):
        for leg in pi[1:3]:
            i = (leg - 5.0) / 10.0
            if abs(i - int(i)) < 1e-5:
                cluster_idxs[c].append(int(i))

    cluster_classes = Clusters()
    for c, l in cluster_idxs.items():
        i_l = [den[label][i] for i in l]
        cluster_classes[c] = i_l

    return cluster_classes

x, y = 96, 10
df = pd.DataFrame(np.random.randn(x, y),
                  index=['sample_{}'.format(i) for i in range(1, x + 1)],
                  columns=['gene_{}'.format(i) for i in range(1, y + 1)])

link = linkage(df, metric='correlation', method='ward')

figsize(12, 4)
den = dendrogram(link, labels=df.index, abv_threshold_color='#AAAAAA')
plt.xticks(rotation=90)
sns.despine(**no_spine);


get_cluster_classes(den)


#c13d3f['sample_20', 'sample_87', 'sample_35', 'sample_2', 'sample_24', 'sample_13', 'sample_6', 'sample_82', 'sample_11', 'sample_12', 'sample_44', 'sample_94', 'sample_73', 'sample_77', 'sample_76']
#4e7ca1['sample_56', 'sample_81', 'sample_69', 'sample_9', 'sample_27', 'sample_21', 'sample_45', 'sample_52', 'sample_59', 'sample_46', 'sample_93', 'sample_10', 'sample_48', 'sample_78', 'sample_14', 'sample_50', 'sample_31', 'sample_91', 'sample_75', 'sample_86', 'sample_84', 'sample_55', 'sample_88', 'sample_43', 'sample_58', 'sample_33', 'sample_96', 'sample_34', 'sample_19', 'sample_68', 'sample_21']
#8e5d93['sample_4', 'sample_92', 'sample_15', 'sample_54', 'sample_1', 'sample_17', 'sample_42', 'sample_79', 'sample_41', 'sample_57', 'sample_63', 'sample_67', 'sample_22', 'sample_64', 'sample_83', 'sample_30', 'sample_53', 'sample_26', 'sample_3', 'sample_29', 'sample_28', 'sample_36', 'sample_7', 'sample_80', 'sample_8', 'sample_5', 'sample_71', 'sample_65', 'sample_39', 'sample_62', 'sample_85']
#5e9d5c['sample_23', 'sample_38', 'sample_90', 'sample_51', 'sample_25', 'sample_60', 'sample_74', 'sample_18', 'sample_61', 'sample_32', 'sample_49', 'sample_37', 'sample_70', 'sample_66', 'sample_16', 'sample_95', 'sample_40', 'sample_72', 'sample_47', 'sample_89', 'sample_40']


There is also an IPython notebook version of this post here

Riemannian Newton iteration for Rayleigh quotients on the sphere

In the above figure, each point is in initial value which will be converge to different eigenvectors of an orthogonal 3x3 matrix. The above figure is the equivalent of a Newton fractal, but applied to Rayleigh quotient iteration on a sphere. This post will go through an explanation of the figure, and the numerical method on the sphere, which can be applied to any manifold.

Rayleigh quotients and Eigenvectors

Given a Hermetian matrix \( M \) and a vector \( x \) the Rayleigh quotiend is defined by

$$ R(M, x) = \frac{x^TMx}{x^T x} $$

For the sake of this text, let us limit ourselves to looking at \( M = A^T A \) which are covariance matrices of some data matrix \( A \). This implies that \( M \) has unique real values eigenvalues. In addition we know that the eigenvectors of \( M \) will be orthogonal.

Using these eigenvectors as a basis, let us express an arbitrary vector \( x \) as a linear combination of eigenvectors.

$$ x = \sum_{i = 1}^n \alpha_i v_i $$

Now we have that

$$ R(M, x) = \frac{x^T A^T A x}{x^T x} = \frac{x^T A^T A x}{x^T x} = $$

$$ = \frac{\left(\sum_{j = 1}^n \alpha_j v_j\right)^T \left( A^T A \right) \left(\sum_{i = 1}^n \alpha_i v_i\right)} {\left(\sum_{j = 1}^n \alpha_j v_j\right)^T \left( \sum_{i = 1}^n \alpha_i v_i \right)} = $$

$$ = \frac{\sum_{i = 1}^n \alpha_i^2 \lambda_i}{\sum_{i = 1}^n \alpha_i^2} = \sum_{i = 1}^n \lambda_i \frac{(x^T v_i)^2}{(x^T x)(v_i^T v_I )} $$

since eigenvectors are orthogonal. Additionally, we see that the sum in the last equality will be minimized when \( x \) is orthogonal to the majority of the \( v_i \)’s. Which is locally the case for any \( x = v_j \), and globally when the eigenvector \( v_j \) corresponds to the smallest eigenvalue.

As \( R(M, x) \) is unaffected by lengths of vectors (due to division by the scalar product) it is defined on the sphere \( S^{n - 1} \), which is a Riemannian manifold.

$$ R(M, \cdot) : S^{n - 1} \rightarrow \mathbb{R} $$

Newton Iteration

The Newton Method of iteration is defined in \( \mathbb{R}^n \) by the procedure

  1. Solve \( D(\text{grad } \ f)(x)[\eta_x] = -\text{grad } \ f(x) \) for \( \eta_x \)
  2. Set \( x_{i + 1} := x_i + \eta_x \)

In each update of the iteration we solve for the update vector \( \eta_x \). Optimal points are found by finding the roots of the derivative of the function. Newton iteration is used to find local optima of differentiable functions by in each step making a local second order Taylor approximation of the given function. Since Newton iteration will have information about the local curvature it is normally quicker to converge than gradient descent.

A very classical assignment in courses on numerical methods is to implement the Newton method for root finding to find the complex roots of a polynomial. Usually this entails looking at which root the algorithm will find depending on initial condition. The result is the so called Newton Fractal. (Note that in this case we are looking for the roots of the function in question. For optimizing by finding stationary points we need to look at the derivative of the given function.)

For example, let us find the complex roots of the polynomial \( p(x) = x^3 - 1 \). The iteration becomes

  1. Solve \( p(x_k) + p’(x_k) \cdot \Delta x_k = 0 \) for \( \Delta x_k \)
  2. Set \( x_{k + 1} := x_k + \Delta x_k \)

Or in other words

$$ x_{k + 1} = x_k - \frac{x_k^3 - 1}{2x_k^2} $$

We can simply implement this iteration in the following fashion using Python with NumPy.

f = np.poly1d([1, 0, 0, -1])  # x^3 - 1
fp = np.polyder(f)

def newton(x, y, max_iters):
    c = complex(x, y)
    for i in range(max_iters):
        c = c - f(c) / fp(c)
        if np.abs(f(c)) < 1e-4:
            return c

    return 0.0


Now let us use this to illustrate the dependence on the initial value of the iteration. We simply define a function which performs the iteration until convergence over a grid of starting values in the complex plane. We color the grid by which root is found.

def create_newton_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  for x in range(width):
    real = min_x + x * pixel_size_x
    for y in range(height):
      imag = min_y + y * pixel_size_y
      color = newton(real, imag, iters)
      image[y, x] = color.imag + 1


This can be generalized to work on any Riemannian manifold \( \mathcal{M} \) with an affine connection \( \nabla \) to optimize real valued functions \( f \). Define the Hessian by

$$ \text{Hess} \ f(x_k):= \nabla_{\eta_k} \text{grad } \ f. $$

Also let \( R : T\mathcal{M} \rightarrow \mathcal{M} \) be a retraction on \( \mathcal{M} \), a smooth mapping which satisfies

  1. \( R(0_x) = x \), where \( 0_x \) is the origin of \( T_x\mathcal{M} \)
  2. \( \frac{d}{dt} R(t \xi_x) \vert_{t=0} = \xi_x \)

Now we can define the new iteration method by

  1. Solve \( \text{Hess} \ f(x_k) \eta_k = -\text{grad } \ f(x_k) \) for \( \eta_k \in T_{x_k} \mathcal{M} \)
  2. Set \( x_{k + 1} := R_{x_k}(\eta_k) \)


Newton Iteration on Spheres

As a first step towards working on the sphere, we need to look at how the iteration would work for manifolds which are sub-manifolds of \( \mathbb{R}^n \). In this case we pick \( \nabla \) to be the Levi-Civita connection. What we need is a way of calculating the gradient in \( \mathcal{M} \).

Every vector \( z \in T_x \mathbb{R}^n \cong \mathbb{R}^n \) admits a decomposition \( z = P_x z + P^\perp_x z \) where \( P_x z \in T_x\mathcal{M} \) and \( P^\perp_x z \in T^\perp_x\mathcal{M} \); the orthogonal complement to \( T_x\mathcal{M} \) in \( \mathbb{R}^n \). Now say that \( \bar{f} : \mathbb{R}^n \rightarrow \mathbb{R} \), and the function \( f \) is defined by the restriction \( f = \bar{f} \vert_\mathcal{M} \). Then the gradient can be calculated by

$$ \text{grad } \ f(x) = P_x \ \text{grad } \ \bar{f}(x) $$

Using this identity, we can now realize the Levi-Civita connection by

$$ \nabla_{\eta_k} \text{grad} \ f = P_{x_k} \text{D} (\text{grad} \ f(x_k))[\eta_k] =: \text{Hess} \ f(x_k) \eta_k $$

So now in the step solving for \( \eta_k \in T_{x_k} \mathcal{M} \) we need to know a partitioning.

Let us now pull back from this abstraction and consider the particular case when \( \mathcal{M} = S^{n - 1} \) as a sub-manifold of \( \mathbb{R}^n \). That means in this case we have \( f : S^{n - 1} \rightarrow \mathbb{R}^n \). We let the partitioning projection \( P_x \) in to \( T_x S^{n - 1} \) be defined by

$$ P_x := \left( I - xx^T \right). $$

(A projection to the orthogonal complement of the vector projection onto \( x \))

We now have the iteration defined by

1: Solve

$$ P_{x_k} \text{D}(\text{grad} \ f)(x_k)[\eta_k] = -\text{grad} f(x_k) $$

for \( \eta_k \in T_x S^{n - 1} \cong \mathbb{R}^n \)

2: Step and retract back to the sphere by setting

$$ x_{k + 1} := \frac{x_k + \eta_k}{\Vert x_k + \eta_k \Vert} $$

Newton Iteration of Rayleigh quotient

We now have all the necessary tools to solve the eigenvector problem by minimizing the Rayleigh quotient. For gradient evaluation, as \( \frac{\partial}{\partial x_j}(x^T x) = 2x_j\) and \( \frac{\partial}{\partial x_j}(x^T M x) = 2(Mx)_j \) it follows that

$$ \text{grad} \ R(M, x) = \frac{2}{x^T x} \left( Mx - R(M, x) \cdot x \right) $$

This leads to the Newton equation

$$ P_{x_k} M P_{x_k} \eta_k - \eta_k x_k^T M x_k = -P_{x_k} M x_k $$

We know have a concrete iteration method which we can implement and run using Python

I = np.eye(3)

# Retraction at x
def R(x, eta):
    return (x + eta) / linalg.norm(x + eta)

# Small helper function
def f(A, x):
    return dot(x.T, dot(A, x))

# Projection P
def P(xk):
    return I - outer(xk, xk)

# Make a random orthogonal matrix
A = random.randint(0, 10, (3, 3))
A = dot(A, A.T)

# Pick starting point
x0 = np.array([0,-1,0])
xk = x0

# Iterate and store path
path = [xk]
for i in range(10):
    Pxk = P(xk)
    M = (Pxk.dot(A).dot(Pxk) - f(A, xk) * I)
    y = -Pxk.dot(A).dot(xk)
    eta_k = linalg.solve(M, y)
    path.append(xk + eta_k)
    xk = R(xk, eta_k)
    path.append(xk)


Let us plot the path of the iteration over the sphere, in the following plot blue lines mean \( x_k + \eta_k \) (updates in the tangent space) and orange lines \( R_{x_k}(\eta_k) \) (retraction to the sphere).

In this case the iteration has already converged after five steps, the following five steps are also plotted, but are stable and can’t be seen. The iteration converges when the Rayleigh quotient is minimized, therefore, we have now found an eigenvector to the randomly chosen matrix A.

Now we have everything we need to generate the figure above. First we define a couple of functions.

def RNmS(A, x0):
    # Iteration
    N = 5
    xk = x0
    for k in range(0, N + 1):
        Pxk = P(xk)
        M = (Pxk.dot(A).dot(Pxk) - f(A, xk) * I)
        y = -Pxk.dot(A).dot(xk)
        eta_k = linalg.solve(M, y)
        xk = R(xk, eta_k)

    X = A.dot(xk) / xk
    return X.mean()

def find_nearest_index(array, value):
    index = ((array.real - value.real) ** 2 + \
             (array.imag - value.imag) ** 2).argmin(0)

    return index


The we pick a matrix and calculate the eigenvalues with the standard algorithm for comparison.

# Choice of 3x3 symmetric matrix for which we want to find the eigenvectors/-values.
# Note that different matrices produce very different images.
A = random.randint(0, 10, (3, 3))
A = dot(A, A.T)

EW, EV = linalg.eig(A)
print "Eigenvalues: ", 
print EW


Finally we make a spherical grid, perform the iteration for every point in the grid, and plot the result.

# Resolution
n = 512
nj = n * 1j

# u and v are parametric variables.
u = r_[0:2*pi:nj]
v = r_[0:pi:nj]

x = outer(cos(u), sin(v))
y = outer(sin(u), sin(v))
z = outer(ones(size(u)), cos(v))

# Figure out which eigenvectors/-values the algorithm converges to
# for all the spherical initial vectors.
s = zeros((n, n))
for i in range(0, n):
    for j in range(0, n):
        try:
            X = RNmS(A, array([x[i, j], y[i, j], z[i, j]]))
            s[i, j] = find_nearest_index(EW, X)
        except:
            raise

print X

print "Indexes: ",
print set(s.flatten())

bmap = brewer2mpl.get_map('RdBu', 'Diverging', 3)
cmap = bmap.get_mpl_colormap(N=3, gamma=1.0)

figsize(50, 50)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_aspect('equal')
ax.scatter(x, y, z, c=s, s=16, edgecolor='none', marker='s', cmap=cmap);
ax.axis("off");


References

Gene Discovery Rate in RNA-seq data

I was curious regarding at what rate new genes are detected as the number of reads in RNA-seq data increases. Let’s start by looking at the results.

The figure shows how many genes there are with nonzero counts at certain read depths.

In particular I was interested how this discovery rate looked for single-cell RNA-seq data. I found some read files at ENA which was from human cells. The two data sets used are SRR445722 and SRR499825.

In the interest of saving space on my tiny laptop, I only kept the first four million reads in the fastq files. For each read number investigated, reads are first randomly sampled from the fastq file using the reservoir sampling of seqtk. The sampled reads are then mapped to the human transcriptome (Homo_sapiens.GRCh37.73.cdna.abinitio.fa) using bwa mem. To map from transcripts to genes, I use eXpress (with the annotation file Homo_sapiens.GRCh37.73.gtf.gz). The final step is to count the number of genes which have counts different from 0.

In the interest of time and simplicity, I didn’t perform any quality or adapter trimming of the data.

For each read count, the sampling was performed five times with different random seeds. In the figure these replicates can barely be seen, but at some pints some smaller colored points can be glanced.

These are the pipeline for the process for the two datasets:

for READS in $(seq 10000 10000 1000000)
do
    for i in $(seq 5)
    do
        echo $READS reads >> sample_num_genes_SRR445722.out
        seqtk sample -s$(date +%s) SRR445722_4M_reads.fastq $(echo "scale=8; $READS / 4000000" | bc) | \
        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa - | \
        express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa
        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes.out
    done
done


and

for READS in $(seq 10000 10000 1000000)
do
    for i in $(seq 5)
    do
        echo $READS reads >> sample_num_genes_SRR499825.out

        SEED=$(date +%s)
        FRAC=$(echo "scale=8; $READS / 4000000" | bc)

        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa \
        <(seqtk sample -s$SEED SRR499825_1_4M_reads.fastq $FRAC) \
        <(seqtk sample -s$SEED SRR499825_2_4M_reads.fastq $FRAC) \
        | express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa

        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes_SRR499825.out
    done
done


They are slightly different since SRR499825 is paired-ended while SRR445722 is single ended.

I ran these for as long as I could leave my laptop on and stationary at a time.

What I find interesting about this is how these logarithm-like curves seem are so smooth.

Both studies these data come from have plenty of replicates, so a more thorough version of this study would be to try data from multiple cells and make sure that correlates. It would also be interesting to see how bulk RNA- seq compares to these data sets. I also only looked at data from human cells because I didn’t want include another set of transcriptome and annotation file to keep track of, but there are a lot more mouse single-cell RNA-seq studies on ENA than human.

It would also be fun to run the analysis much longer, to get up to millions of reads, where we start seeing the majority of genes as detected. And see precisely at which point this starts to happen.

EDIT

I reran the computations, but this time I let both data sets run to completion (depth of 1000 000 reads). I also saved the counts for each gene, so I could query detected genes with a lower cutoff. The new graph above show the detection rates for lower cutoffs of one, three and five mapped reads.

More interactive Python plotting with Clojure and Quil

The Python plotting library matplotlib is one of the libraries I read about a lot, and often, to try to keep track of everything possible to do with it. But I keep running in to situations where I don’t find it very useful.

Here I will write about one issue I had at a point, where I found it annoying when I wanted to sequentially build a graph, by adding scatter points to it one set at a time.

This summer I started playing with Clojure, and I enjoyed it quite a lot. Through looking at presentations for this I found Quil. I’ve only at some points played with Processing before, and I feel like quil better fits my way of thinking.

So, what I want, is to to my regular things with point data in IPython Notebook, and when I want to add it to a graph, I want to just send it to a quil sketch which is just running to the side of my IPython window. This way I can build my graphs over time.

I enjoy working with MongoDB, it has a tendency to just work where I need it and it is easy get started with. So I used this as a backend, and connected my quil sketch to the database using the monger library, which is extremely simple to use.

Here follows the entire code for the quill sketch that does what I want:

(ns square-vis.core
  (:gen-class)
  (:require [monger.core :as mg])
  (:require [monger.collection :as mc])
  (:use monger.operators)
  (:use quil.core))

(mg/connect!)
(mg/set-db! (mg/get-db "test"))

(defn draw-square
  [doc]
  (let [size (get doc :size)
        x    (get (get doc :coord) 0)
        y    (get (get doc :coord) 1)
        c    (get doc :c)]
    (no-stroke)
    (fill c)
    (rect x y size size)))

(defn -main
  [& args])

(defn setup []
  (smooth)
  (frame-rate 30)
  (background 200))

(defn draw []
  (background 200)
  (doseq [item (mc/find-maps "squares")] (draw-square item)))

(defsketch example                  
           :title "Squaaaares!"  
           :setup setup                      
           :draw draw                        
           :size [323 200])                  


This is only for making scatter plots with squares, in grayscale, which I like a lot. This piece of code connects to a database called “test”, and searches for documents in the collection named “squares”, and give them as Clojure maps, so they are easy to use in the rest of the sketch. The documents in the collection represents squares, with the following schema.

square = {'coord': [10, 20], 'size': 8, 'c': 155}


Some small examples:

from pymongo import Connection
conn = Connection()
coll = conn['test']['squares']

for i in range(28):
    for j in range(22):
        square = {'coord': [10 + 10 * i, 20 + 10 * j],
                  'size': 5, 'c': (i * j) ** 2 % 251}
        coll.insert(square)

coll.drop()
for k in linspace(0.0, 2.0):
    x = np.linspace(1, 350)
    y = k * x
    for xi, yi in zip(x, y):
        doc = {'coord': [xi, yi], 'size': 5 * k + 1, 'c': 100 * k}
        coll.save(doc)

print('running')
while True:
    try:
        for doc in coll.find():
            doc['coord'][0] += 1 * (random.random() - 0.5)
            doc['coord'][1] += 1 * (random.random() - 0.5)
            coll.save(doc)
    except KeyboardInterrupt:
        print('done')
        break

Non-standard Python on UPPMAX

Non-standard Python on UPPMAX

This is something I’ve had to look up at least 10 times in the last year, so to save myself from searching my email inbox yet another time I will write it out here. This post is specific to the Kalkyl cluster at UPPMAX.

To make a virtual environment with Python 2.7 on UPPMAX, do the following

mkvirtualenv --no-site-packages --python=/sw/comp/python/2.7.1_kalkyl/bin/python2.7 new_venv


What’s the problem?

Firstly, the standard Python on Kalkyl is 2.6, which is incompatible with most things we do, as we all tend to use 2.7.

UPPMAX uses the module system to manage different versions of software. Naively, you would think one could make a Python 2.7 by doing

$ module load python/2.7.1
$ mkvirtualenv --no-site-packages  new_venv


This will not work.

In stead, one need to point to the Python binary as it is stored on the filesysten; the one that the module system links to when module load is called. And this is located in /sw/comp/python/2.7.1_kalkyl.

Visualizing overlapping intervals

Visualizing overlapping intervals

Rather often I find myself wanting to see a pattern of how some interval data looks. I never found a good package for this, so always reverted to either drawing the intervals after each other on the x = y line. Or pile the intervals to histogram like graphs.

But today I got an idea how I could do it rather simply. What I want to do is stack the intervals so they don’t overlap, but also doesn’t need too much space on the Y-axis.

First let’s create some intervals that we want to try stacking.

intervals = [(2, 6), (7, 10), (1, 3), (4, 6), (8, 10), (1, 2), (3, 7), 
            (9, 10), (1, 2), (3, 5), (6, 7), (8, 10), (2, 5), (8, 10)]


Here is the basic idea: We sort the intervals by the start values. We give the first interval an initial Y-value. For the next interval, we check if the lower value is smaller than the upper value of the previous interval. If not, we just place the interval after the previous one. But if, we increment the Y-value, and check the same thing for the previous interval in that Y-category.

Here is a python implementation of this Y-categorization.

def intervals2layers(si):
    layers = [[si[0]]]
    for p in si[1:]:
        for lay in layers:
            if lay[-1][-1] < p[0]:
                lay.append(p)
                break
        else:
            layers.append([p])

    return layers


Let’s try it out!

si = sorted(intervals, key=lambda p: p[0])
layers = intervals2layers(si)

figsize(8, 4)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2, lw=30)

plt.xlim(0, 11)
plt.ylim(0, 6);

This works quite nicely. Let’s try something more problematic; we’ll generate 1024 intervals and see how these end up stacking using this strategy.

intervals = []
for i in range(1024):
    x1 = random.normalvariate(0.5, 0.25)
    x2 = x1 + random.random() / 16
    intervals.append((x1, x2))


figsize(20, 10)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2)

ax = plt.gca()
adjust_spines(ax,['bottom']);

Precisely what I’m after! One can see overall patterns of the distribution of all the intervals, but it’s also possible to see certain subintervals which might be interesting.

This is how some real data looks like.

Single Linkage Clustering Heuristic

Explaining the magic fudge in Mapper.

Background

A few months back I attended a couple of seminars by Gunner Carlsson talking about topological data analysis, and discussed some methods they employ at Ayasdi. Given my background in mathematics I thought it was quite nice to see some concrete applications of some rather abstract theory. In addition to this, their topology based analysis methods seem to be succesful with bioinformatical data where other methods fail.

This first paper about the method describes it quite well, and on the groups website they are also hosting a Matlab script which they used for the applications in that first paper. This scientific report gives a very nice summary of capabilities of topology based data analysis.

Using their refined implementation of the method on your data means contacting them and sending them your data to run their analysis on.

Therefore I thought a nice evening project would be to reimplement the Matlab script in Python, so it can be easily run anywhere, and eventually optimize the implementation somewhat.

A great start would be at the [upcoming Seqahead Hackathon] (http://seqahead.cs.tu-dortmund.de/meetings:2013-06-hackathon), where I’d hoped to perhaps get some collaboration.

Mapper basics

Just extremely briefly (see the scientific report or paper for details), the basic idea of Mapper is to divide the space of the data in to sections. Within these sections clustering is performed, these clusters are then combined depending on overlaps to form a graph which reveals a “shape” of the data.

Single Linkage Clustering

In the Matlab implementation of Mapper the authors use single linkage clustering. Briefly, single linkage clustering means initially taking every object as a cluster on its own, then pairwise link together “clusters” which are the closest to each other. This is repeated until the algorithm is left with one single cluster.

Software performing this kind of clustering usually returns a representation of the tree being built throughout the clustering, with the intermediate distances between clusters in each step preserved.

Mysteries in the Matlab implementation

When reading the Matlab implementation one comes across the parameter magicFudge. This is described in the comments of the script.

%  magicFudge: The number of histograms to divide each homology plot into
%              Increase this number if you expect to see more clusters
%              and decrease it if you are seeing more clusters.
%              Default : 10

Heuristic for number of clusters

This is further explained in the paper (page 5).

Now, to find the number of clusters we use the edge length at which each cluster was merged. The heuristic is that the interpoint distance within each cluster would be smaller than the distance between clusters, so shorter edges are required to connect points within each cluster, but relatively longer edges are required to merge the clusters. If we look at the histogram of edge lengths in C, it is observed experimentally, that shorter edges which connect points within each cluster have a relatively smooth distribution and the edges which are required to merge the clusters are disjoint from this in the histogram. If we determine the histogram of C using k intervals, then we expect to find a set of empty in- terval(s) after which the edges which are required to merge the clusters appear.

The authors also points out problems with this approach.

Although this heuristic has worked well for many datasets that we have tried, it suffers from the following limitations: (1) If the clusters have very different densities, it will tend to pick out clusters of high density only. (2) It is possible to construct examples where the clusters are distributed in such a way such that we re- cover the incorrect clustering. Due to such limitations, this part of the procedure is open to exploration and change in the future.

While I was reading the paper, I didn’t think much about this heuristic, but when trying to implement the method, it got me fairly confused.

I figured it would be nice to try the heuristic out and see how well it performs.

First we want to simulate some data, there’s a nice function in scikits-learn to generate blobs of data (clusters): sklearn.datasets.make_blobs. First generate some clusters.

from sklearn import datasets
num_clusters = 3
X, y = datasets.make_blobs(n_samples=N, centers=num_clusters)

(Throughout this post, I will omit commands for creating plots, but it can all be viewd in this IPython notebook.)

To perform single linkage clustering, we need a distance matrix between all points in the data set.

from scipy.spatial import distance
d = distance.cdist(X, X)

Then we can use, e.g. this new fastcluster package to perform the clustering.

import fastcluster
Z = fastcluster.linkage(d, method='single')

This creates the linkage Z, a (num_points - 1, 4)-shaped array where the distance between Z[i, 0] and Z[i, 1] is Z[i, 2].

This graph contains a dendrogram representation of the linkage.

In Mapper they also add the longest distance from the distance matrix to the array of inter cluster distances.

lens = np.zeros(Z.shape[0] + 1)
lens[:-1] = Z[:, 2]
lens[-1] = d.max()

They also introduce the notion of calling this collection of distances a “lens”. I like this as it invokes the idea of “unfocusing” the view of the data to see the overall shape.

Now just look at the histogram of distances.

This finds one cluster, as can be seen by doing the following.

z = np.nonzero(hst == 0)[0]
hst[z[0]:len(hst)].sum()

Out[613]: 1

It’s pretty impossible to see, but this cluster has the distances in the 9th bin in the histogram. This is obviously wrong, since we knew beforehand that we made three clusters in this example. Thus we should “focus” the lens by increasing the magicFudge, in this case I increase it to 64.

magic_fudge = 64
hst, _ = np.histogram(lens, bins=magic_fudge)

z = np.nonzero(hst == 0)[0]
hst[z[0]:len(hst)].sum()

Out[632] 3

Here we see the pattern discussed in the paper quite clearly. A nice almost normal distribution of small distances, breaking at the 9th bin (black arrow), followed by three bins at various locations with a single distance in.

Now to the issue that motivated this blog post: It took my many tries of generating example clusters, as well as trying different magicFudge-values for this last histogram.

Testing the heuristic

To see how the heuristic performed, I wrote some code that randomly generates between one and six clusters, performs the cluster count using nine different magicFudge-values, and collects how many times for each value the heuristic found the correct number of clusters.

magic_fudge_settings = [8, 16, 25, 32, 50, 64, 100, 128, 256]
hit_counts = [0] * len(magic_fudge_settings)
n_tests = 500
for i in range(n_tests):
    true_num_clusters = random.randint(1, 6)

    X, _ = datasets.make_blobs(n_samples=300, centers=true_num_clusters)

    d = distance.cdist(X, X)
    Z = fastcluster.linkage(d, method='single')
    lens = np.zeros(Z.shape[0] + 1)
    lens[:-1] = Z[:, 2]
    lens[-1] = d.max()

    for j, magic_fudge in enumerate(magic_fudge_settings):
        hst, _ = np.histogram(lens, bins=magic_fudge)
        z = np.nonzero(hst == 0)[0]
        if len(z) == 0:
            num_clusters = 1
        else:
            num_clusters = hst[z[0]:len(hst)].sum()

        if num_clusters == true_num_clusters:
            hit_counts[j] += 1

Results

Are best summarized with the following plot.

These results are quite bad in my opinion. However, when using this process to find clusters in order to make a graph of connections between them, it is probably not that important to be accurate. Just as long as enough overlapping clusters are found we will end up with a nice network that describes the data. But I very much doubt this is the cluster counter method used in the production implementation at Ayasdi.