Writing and Citing with Google Docs and Paperpile

Earlier this day Mick Watson posted a brief post about his scientific writing workflow using Mendeley and Word. I’ve been happily using Paperpile and Google Docs for similar goals, so though I’d post that as a comparison.

First, you just start a new doc for your paper. You write until you need to cite something. Then you hit ctrl+alt+shift+p and search for papers. Initially in your curated collection, but also in many databases on the web such as PubMed.

As you add references, they both get added to your own Paperpile collection, and stored in the Google Doc itself (so all collaborators can see them and manage citations).

Once you have a lot of citations, you pick a citation style, click Format citations in the Paperpile menu.

And you end up with a reference list styled to your liking.

You can then click each citation and edit it in terms of changing and adding/removing references.

In Mick’s post he makes a group in Mendeley to keep track of papers related to a given manuscript. You can also do this in Paperpile, but in practice i think it’s easier just to look at the documents individual manager (on the right in picture above).

Streaming RNA-seq data from ENA

For many of the projects I’m working on for my PhD I use published data. Up until now my strategy has been to download all read files of an experiment from ENA, then process them all with e.g. Salmon to get expression values. This feels a bit silly because sequencing read files are on the order of gigabytes in size, while a csv file of expression values is a few megabytes. In fact, currently my data directory has almost 50 terabytes of public data in it.

The other day I saw this gist from Mike Love. Supposedly it gives you the URL of a ftp hosted fastq from the ENA/SRA accession number of it. This is great, because you can just use curl on the URL, which by defualt streams a file in chunks, to fetch the contents of a sequencing read data set. We only need to know the name of it.

I made a small Bash script for streaming a given accession id.




accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1)


if (( $a_len == 9 )); then
elif (( $a_len == 10 )); then
elif (( $a_len == 11)); then


curl --keepalive-time 4 -s $url | zcat

We call this file stream_ena. You need to know the id of accession, and whether the file you want to look at is part of a pair. But then you have instant access to the contents of any published sequencing data-set!

If we want to look at some single-end data, we can just do

$ ./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1

If we want to quantify this data set with salmon, we can now simply run

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782

This will stream the entire contents of accession in to salmon directly from ENA without storing anything on disk, and quantified expression will be saved in the directory SRR3185782.

For a dataset with paired reads we would do for example

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127

Many sequencing tools supports streaming out box don care whether you are your disk or server online.

Say for example we want to look at 5 random reads:

$ seqtk sample -s $(date +%s) <(./stream_ena SRR1274127_1.fastq) 0.001 | head -n 20
@SRR1274127.753 753/1
@SRR1274127.1464 1464/1
@SRR1274127.1672 1672/1
@SRR1274127.2188 2188/1
@SRR1274127.4127 4127/1

Or if we want to check the quality of a dataset without wasting space downloading it:

$ ./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin

Of course there are caveats. You can’t just blindly put any reads in to salmon and get correct expression. You need to read the methods section of the related study to see what parameters use. The data might need some might preprocessing before it is useful. It is also very common that files uploaded to ENA should be merged before being input to processing tools. But this streaming approach greatly ease storage burden burden from working with public data.

K-means in TensorFlow

I have been getting a bit interested in the new fancy TensorFlow package. As a little exercise to figure out roughly how to use it, I figured I could implement a simple model. I chose to implement k-means clustering, since it’s very simple, but a bit different from the regression/classification models in TensorFlows examples.

We’ll make some ideal clustered data using scikit-learn.

from sklearn import datasets

Xi, yi = datasets.make_blobs(500, random_state=1111)
plt.scatter(*Xi.T, c='k', lw=0);

Now we want to divide these in to clusters. We will use the objective used by Bottou & Bengio 1995,

$$ E(w) = \sum_i \min_k (x_i - w_k)^2. $$

Or in words: the sum of the (squared) distances to the closest prototype $ w_k $.

For our implementation, first we need to put in data. Secondly, we need to define the variables we want to optimize. In this case we want the prototypes $ w $.

import tensorflow as tf

# Number of clusters
k = 3

# Input
Xi = Xi.astype(np.float32)
X = tf.placeholder_with_default(Xi, Xi.shape, name='data')

# Model variables, initiate randomly from data
idx = np.random.choice(range(Xi.shape[0]), k, replace=False)
Wi = Xi[idx].copy()
W = tf.Variable(Wi, name='prototypes')

Now we only need to write out how the objective function value is calculated from the data and the variables.

# Reshape data and prototypes tensors
Xc = tf.concat(2, k * [tf.expand_dims(X, 2)])
Wc = tf.transpose(tf.expand_dims(W, 2), [2, 1, 0])

# Define objective of model

distances = tf.reduce_sum((Xc - Wc) ** 2, 1)

cost = tf.reduce_sum(tf.reduce_min(distances, 1))

After the tensors have been formatted correctly this is very simple. We just calculate the distances from each point to each prototype. Then we sum the smallest of these per data-point.

The final value in cost is represented in such a way that TensorFlow automatically can figure out the gradients for minimizing it with respect to the variables in W.

# Make an object which will minimize cost w.r.t. W
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost)

This is the nice thing with TensorFlow, you don’t need to care about finding gradients for objective functions.

sess = tf.Session()


for i in range(100 + 1):
    if i % 10 == 0:
        print('Iteration {}, cost {}'.format(i, sess.run(cost)))

Iteration 0, cost 21533.69140625
Iteration 10, cost 950.78369140625
Iteration 20, cost 940.453369140625
Iteration 30, cost 940.4503784179688
Iteration 40, cost 940.450439453125
Iteration 50, cost 940.450439453125
Iteration 60, cost 940.450439453125
Iteration 70, cost 940.450439453125
Iteration 80, cost 940.450439453125
Iteration 90, cost 940.450439453125
Iteration 100, cost 940.450439453125

Once we’ve optimized the W we get the number for them. Then we can make cluster labels for the data points by looking at which prototype $ w_k $ is closest to a data point.

W_learned = sess.run(W)
y = sess.run(tf.arg_min(distances, 1))


plt.scatter(*Xi.T, c=y, lw=0, cmap=cm.Greys_r, vmax=k + 0.5, label='data');
plt.scatter(*W_learned.T, c='none', s=100, edgecolor='r', lw=2, label='prototypes');

This is also available in notebook form here: https://gist.github.com/vals/a01a37b14c4918df7937b30d43327837

Some intuition about the GPLVM

Over the last year I’ve been learning about the statistical framework of Gaussian Processes. I’ve been thinking about writing about this for a while, but I feel like I would just decrease the signal-to-noise ratio given the fantastic resources available.

One of my favourite models in the Gaussian Process framework is the Gaussian Process Latent Variable Model (GPLVM). The point of the GPLVM, is that if we have multivariate data, we can simultaneously model every variable with a Gaussian Process. It turns out then that the same way that one can pick hyper parameters for a regression model, we can find optimal, latent, predictor variables for the data.

The leap from regression to the latent variable model is a bit drastic though, so I was trying to think of how we could get some more intuition about this. Let us look at a minimal example. We will use the excellent GPy package for this.

We generate a small number of points (5) on a sine curve, and perform Gaussian Process Regression on this.

import GPy

X = np.linspace(0, np.pi, 5)[:, None]
Y = np.sin(X)

kernel = GPy.kern.RBF(1)
m = GPy.models.GPRegression(X, Y, kernel=kernel)
m.plot(plot_limits=(-0.1, 3.2));

Now we have a curve which optimally describes the 5 points. Imagine we have a stray 6th point. Say we did 6 measurements, but for one of the points we lost the measurement values. Where could this measurement have happened?

Well, we have the model of the curve describing the data. If we guess the measurement values for the 6th point, we can add that as an observation, and evaluate the likelihood of the regression when including the point.

Let us perform this by guessing some potential points in a 20 by 20 grid around or known observations.

N = 20
xx = np.linspace(0, np.pi, N)
yy = np.linspace(0, 1, N)

ll = []
xxi = []
yyi = []
for i in range(N):
    for j in range(N):
        x = xx[i]
        Xs = np.vstack((X, np.array([[x]])))
        y = yy[j]
        Ys = np.vstack((Y, np.array([[y]])))
        ms = GPy.models.GPRegression(Xs, Ys, kernel=kernel,

plt.scatter(xxi, yyi, c=ll, s=50, vmin=-1e4,
            edgecolor='none', marker='s',
plt.colorbar(label='log likelihood')
m.plot(ax=plt.gca(), c='r', legend=False,
       plot_limits=(-0.1, 3.2));

Here every square is a position where we guess the 6th point could have been, colored by the log likelihood of the GP regression. We see that it is much more likely the 6th point came from close to the curve defiend by the 5 known points.

Now, imagine that we know that the y value of the 6th point was 0.4, then we can find an x value such that the likelihood is maximized.

x = 1.2
y = 0.4
Xs = np.vstack((X, np.array([[x]])))
Ys = np.vstack((Y, np.array([[y]])))
gplvm = GPy.models.GPLVM(Ys, 1, X=Xs, kernel=kernel)
gplvm.Gaussian_noise.variance = m.Gaussian_noise.variance


gplvm.plot(legend=False, plot_limits=(-0.1, 3.2));
plt.scatter([x], [y], c='r', s=40);
ax = plt.gca()
ax.arrow(x, y, gplvm.X[-1, 0] - x + 0.1, gplvm.Y[-1, 0] - y,
        head_width=0.03, head_length=0.1,
        fc='r', ec='r', lw=2, zorder=3);

This is done with gradient descent, and the optimum it finds will in this particular case depend on where we initialize the x value of the 6th point. But if we had more variables than the y value above, and we want to find the x-value which is optimal for all variables at the same time, it will end up being less ambiguous. And we can do this for every point in the model at the same time, not just one.

This is illustrated in this animation.

Here there are two variables and we are looking for x-values which best describes the two variables as independant Gaussian processes.

Observations needed to estimate standard deviation

I was curious as to practically how well standard deviation can be estimated at different numbers of observed samples. I thought the plot was interesting, so figured I’d share.

Here is the Julia code I used:

using Distributions
using Gadfly

function std_of_sample(n)
    std(rand(Normal(2, 5), n))

nmax = 200
n_observations = collect(0:nmax * 100) % nmax + 2;

p = plot(
    x=n, y=map(std_of_sample, n_observations),

    Guide.XLabel("Number of observations"),
    Guide.YLabel("Estimated σ"),
    Coord.Cartesian(ymin=-3, ymax=13, xmin=-20, xmax=220),
    Guide.xticks(ticks=[0, 10, 25, 50, 100, 200])

draw(PNG(640px, 480px), p)

We just sample a different number of samples from a normal distribution with mean 2 and known standard deviation 5.

Timecourse analysis with Sleuth

An extremely interesting application of RNA-sequencing analysis is to study samples over a time series. This allows you to identify patterns of expression over some response to a stimuli or developmental progression.

While Sleuth together with the Kallisto from the Pachter lab makes it easy to perform differential expression analysis in the two-condition case of an RNA-seq experiment, there is still some confusion about how to perform DE over a time series. I thought it would be useful to write an example text of how to detect DE transcripts over a time course with Sleuth.

The first problem is to define what we mean by differentially expressed in the context of a time series. For a treatment-vs-control type experiment it is simple: What genes are higher or lower expressed in the treatment condition compared to the control? The first generalization which comes to mind would be to find genes whose level increase or decrease with the time points. This means finding correlation with time points and boils down to linear regression of expression with time points. However, this will mask potentially interesting genes which could for example peak in the middle of the time series. What we want is to find if the expression of a gene follows a general pattern to a higher degree than just noise.

In a linear modelling framework, such as Sleuth, a common solution for this is to use natural splines. Given a number of degrees of freedom for a natural spline model, knots will be placed along the quantiles of the observations of the time axis, which will define basis polynomials with local support.

$$ Expression \sim \beta_0 + \sum_{i=1}^3 \beta_i B_{i}(Timepoint) + \varepsilon $$

The idea is now to compare the model of a genes expression which includes the polynomial terms, with a model that only includes the noise term.

$$ Expression \sim \beta_0 + \varepsilon $$

In the latest versions of Sleuth, there is a likelihood ratio test implemented for this sort of comparison of models. This is how time course analysis is implemented both in Ballgown and Monocle. The benefit of Sleuth is the ability of using the quantification bootstraps to attempt to separate technical variance from biological variance.

Let’s get to the actual practical example. We will be using data from the publication “High-resolution mapping of transcriptional dynamics across tissue development reveals a stable mRNA-tRNA interface”. In the paper they perform RNA-sequencing on a developmental time series of mice, from both brain and liver, at six time points. For the sake of this example, let us just focus on the brain samples.

We load Sleuth and define the locations for the Kallisto outputs.

sample_id <- list('do2174', 'do2175', 'do2176', 'do2177',
'do2183', 'do2184', 'do2185', 'do2186', 'do2187', 'do2188',
'do2189', 'do2190')
paths <- list('kallisto/do2174_RNAseq_brain_mmuBL6e15.5_CRI01p_kallisto_out',
names(paths) <- sample_id

s2c <- read.table("brain_times.tsv", header = TRUE, stringsAsFactors = FALSE)
s2c <- dplyr::mutate(s2c, path = paths)
s2c[] <- lapply(s2c, as.character)

In order to pass the spline based model to Sleuth, we need to make a design matrix. This can be done by using the splines library and pass a formula using the splines by the ns() functions to model.matrix.

day <- s2c$day
full_design <- model.matrix(formula(~ ns(day, df = 4)))

The df parameter essentially governs how smoothed the expression patterns will be over the time points. I picked 4 because it’s the default in Ballgown. In Monocle the default is 3, but it seems this smooths the expression too much. Lower values for the df parameters will capture less “dynamics”, but will have more statistical power. 

As described in the Sleuth vignette, it is quite handy to have the external gene name which is associated to every transcript, to quickly see roughly what gene we are looking at when we see a differentially expressed transcript.

mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "www.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id",
  mart = mart)
t2g <- dplyr::rename(t2g,
target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id,
ext_gene = external_gene_name)

Now we are read to start performing the Sleuth analysis

so <- sleuth_prep(s2c, full_model = full_design, target_mapping = t2g)
so <- sleuth_fit(so)
so <- sleuth_fit(so, formula = ~ 1, fit_name = "reduced")
so <- sleuth_lrt(so, "reduced", "full")

Sleuth has an interactive Shiny application associated with it to browse the results of the analysis. It’s not mandatory to use it though, and many of the handy features in it mostly makes sense for the treatment-vs-control style of experiments.

plot_qq(so, test = 'reduced:full', test_type = 'lrt', sig_level = 0.05)

We can extract the results of the likelihood ratio test as a dataframe to look at closer

lrt_results <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
table(lrt_results[,"qval"] < 0.05)

49523  3083

This tells us that we find 3083 transcripts which have an expression pattern ove the time course which is significantly better explained by a natural spline over the time than by just random noise. Let’s have a look at the very top transcripts.

lrt_results %>% head(n = 20) %>% dplyr::select(target_id, qval, ens_gene, ext_gene)

I don’t know enough about brain development to say if these genes make any sense in the context. We can also plot a clustered heatmap of these top genes.

plot_transcript_heatmap(so, head(lrt_results, n = 20)$target_id, 'est_counts')

Unfortunately there doesn’t seem to be a way to rename the axis labels to some other information, or to enforce ordering the samples by time point. Let’s also plot the expression of a differntailly expressed transcript to see what it looks like.

tmp <- so$obs_raw %>% dplyr::filter(target_id == 'ENSMUST00000145067')
tmp <- dplyr::full_join(so$sample_to_covariates, tmp, by = 'sample')
tmp <- transform(tmp, day = as.numeric(day))
ggplot(tmp, aes(x=day, y=est_counts)) 
  + geom_point(shape=1) 
  + geom_smooth(method = loess)

Let me guess where you’re from

Let me guess where you’re from

On the website letmeguesswhereyourefrom.com you can enter a name, and an algorithm will print five countries the name seems come from. Try it out!

I made this web application as an exercise in applying machine learning, in this post I will describe

  • How I got data,
  • How I made the model,
  • How I optimized the model for use in a web application, and,
  • How I made the web application.

Getting training data

The first problem is where to get the data from. For the model to be representative I would want data from real people. There are many lists of “Most common names in …” for various countries. These however mostly consist of either first names or surnames. I doubted just for example first names would not contain enough information to train a model. For a while I thought I could get surname lists and first name lists and randomly pair elements from them to make larger training sets. Looking for real names, I found that Wikipedia has a list of people who have Wikipedia pages by country!

We can quickly get all this data

%pylab inline
import pandas as pd
import seaborn as sns

import codecs
from unidecode import unidecode

import requests

wiki = 'https://en.wikipedia.org'
res = requests.get(wiki + '/wiki/Lists_of_people_by_nationality')

By inspecting the downloaded HTML I noticed a pattern on the rows that contain links to lists of people. So I parsed these and extract the country name and URL to the list.

list_urls = {}
for line in res.content.split('\n'):
    if line.startswith('<li><a href="/wiki/List_of_'):
        l =  line.split('<li><a href="')[-1] \
                 .split('" title=')[0]
        country = l.split('/wiki/List_of_')[-1]
        list_urls[country] = l

f1 = lambda s: s.startswith('<li><a href="/wiki/')
c1 = lambda s: s.split('" title="')[-1].split('"')[0]

lists = {}
for country in list_urls:
    content = requests.get(wiki + list_urls[country]).text.split('\n')
    lists[country] = map(unidecode, map(c1, filter(f1, content)))

with open('name_list.tsv', 'w') as fh:
    for country in lists:
        for name in lists[country]:
            fh.write('{}\t{}\n'.format(name, country))

There are some obvious problems with this. Not all lines I parse out are actually lists of names, but other lists. Additionally the larger countries have so many entries that they link to lists of sublists rather than actually lists. (This includes US and China.) The number of countries isn’t that high and it would be possible to just add these manually. But I don’t feel like doing that and will ignore those cases.

To make downstream analysis easier, I romanized all names using the unidecode package. This translates all characters in the names to ASCII.

In total the process fetches in total 68,634 names distributed over 214 countries.

Making the model

I knew that N-grams of words are commonly used for document classification, and thought I can make something similar for just letters in names. Turns out this was already nicely implemented in the scikit learn as a parameter to the CountVectorizer class.

For the most part I followed the scikit-learn text analytics tutorial. The tutorial said “bags of words” representations of documents typically have more than 100 0000 features. Since a “document” in my case is a name, there are not so many n-grams in each. To get a large number of featuers I set the CountVectorizer to consider all N-grams for all N between 1 and 5. I limited the maximum number of features to 100000 so that I would know how large the data sets would be.

As per the tutorial, I push the output of the CountVectorizer through a TfidfTransformer. The idea with this is to downscale weights of features that occur in many samples.

Finally, I use these features to train a Multinomial Naive Bayes model. (I’m not discussing much of the details of the methods here, as they are well described in the references. A thing about scikit-learn for applying machine learning that I like is also that you can use a lot of very nice tools without knowing much about the details.)

All these steps are combined to a pipeline, so we can easily just supply raw text and get a categorical prediction.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import train_test_split

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])

The data is read in, and split in to a training set that we will use to tune the model, and a testing set that we will finally use to evaluate the model.

df = pd.read_table('name_list.tsv', header=-1)
df.columns = ['name', 'country']
name_train, name_test, country_train, country_test = \
train_test_split(df.name, df.country, test_size=10000)

Now I could simply train the classifier using the training data.

name_clf.fit(name_train, country_train)

predicted = name_clf.predict(name_test)
np.mean(predicted == country_test)

Out []: 0.27860000000000001

While much better than randomly guessing countries (1 / 214 = 0.0047), it’s not a particularly impressive score.

Earlier the same day, I had listened to the Talking Machines podcast where Jennifer Listgarten described how they had trained a model to predict efficient CRISPR/Cas9 guide targets (avilable in a soon to be released service Azimuth). She said that predicting the best guide was a very hard problem. But predicting say the top 5 guides had very high accuracy. Trying a small number of guides is much more efficient than many possible guides anyway.

So I stole that strategy, and return a top 5 of predicted countries for a given name. Thus I change the model evaluation criterion. The strategy is very well suited for MultinomialNB, since prediction means evaluating the posterior probability for each class. We can just take these probabilities and sort out the highest ones.

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
np.mean((a[:, None] == b).max(1))

Out []: 0.4647

Better, but still not stellar.

The MultinomialNB model has a hyperparameter alpha. This is a smoothing parameter that helps with the problem that many features have 0 counts in any given document. This would cause the posterior probability to be 0. The default value for alpha is 1. We’ll try to do better by optimizing the alpha by randomized cross validation.

from sklearn.grid_search import RandomizedSearchCV

a = 10 ** np.linspace(-5, 0)
rscv = RandomizedSearchCV(name_clf,
param_distributions={'clf__alpha': a},

rscv.fit(name_train, country_train)


Out []: {'clf__alpha': 0.014563484775012445}

We plug this value in to out classifier pipeline, and try again.

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(alpha=0.014563484775012445))])

name_clf.fit(name_train, country_train)

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
np.mean((a[:, None] == b).max(1))

Out []: 0.66959999999999997

This is quite nice! We can make a function to check how it performs in practice.

def top_countries(name):
    log_proba = name_clf.predict_log_proba([name])
    t5 = log_proba.argsort(axis=1)[:,-5:]
    b = name_clf.get_params()['clf'].classes_[t5]
    for c, lp in zip(b[0], log_proba[:,t5].flatten())[::-1]:
        print '{}\t\t{}'.format(c, np.exp(lp))

top_countries('angela merkel')

Vietnamese      0.893538897127
Greeks      0.0294758574644
Baltic_Germans      0.0194029861994
Irish_people        0.015581222622
Bulgarians      0.00674997108663

While some of the suggestions are… counterintuitive… there is some German there! The strategy to return a top 5 seem to work out.

Making the model deployable

While I thought it would be nice to make the model usable through a web application, I didn’t want to spend a fortune doing so! Heroku offers free web application hosting in their Free tier with the limitation of 512 MB of memory. Additionally, the final size of the compiled web application “slug” can only be 300 MB.

The training data is 58 000 examples with 100 000 features each. The MultinomialNB model needs to store two dense matrices of size (58 000, 100 000). Running the model in the Python interpreter requires a bit over 1 GB of RAM. Storing the model with joblib.dump() as recommended in the documentation creates around 700 MB of data, so we wouldn’t be able to deploy it on Heroku.

The strategy I settled for was to put in a step in the pipeline to ignore non-informative features. To find which features contain the most information I used TruncatedSVD, a dimensionality reduction technique known to work well for “bags of words” models.

from sklearn.decomposition import TruncatedSVD

tf_pipeline = Pipeline([('vect', CountVectorizer(analyzer='char',
                        ('tfidf', TfidfTransformer())])

name_features = tf_pipeline.fit_transform(name_train)

tsvd = TruncatedSVD(n_components=400)



Out []: 0.27138872714390422

I picked 400 components because sqrt(100000) = 316.23 and I rounded up. There’s no rule for this or so, but this let me fit the TruncatedSVD for the entire training set in a reasonable amount of time. The total variance explained by 400 components is about 27%. It doesn’t sound as much, but it can inform us on the importance of the features.

We extract importances in the TruncatedSVD by looking at the lengths of the singular vectors stored in TruncatedSVD.components_.

idx = np.linalg.norm(tsvd.components_[:, :], axis=0).argsort()

We make a “Reducer” class which we can plug in to the pipeline to tell the pipeline which features to pass on from the feature extraction step to the MultinomialNB classifier.

from sklearn.base import TransformerMixin

class Reducer(TransformerMixin):
def __init__(self, idx):
    self.idx = idx

def fit_transform(self, X, y=None, **fit_params):
    return self.transform(X)

def transform(self, X):
    return X[:, self.idx]

def get_params(self, *k, **kw):
    return {'': ''}

We could include the important feature selection in a fit method, but to speed this up, we perform that calculation offline and provide the indexes of the features to the constructor of the object.

I checked a range of the number of features to pass on to the MultinomialNB by looking at the model score. This is bad thing to do and you shouldn’t do it! Now I’m causing an information leak between the parameter selection and the model evaluation. However, I think it’s relatively obvious that the score will be monotonically increasing with the number of features. And I had already decided I wanted to use between 10 000 and 20 000 features for the final model due to the computational budget. I was just curious at what rate things would get better if I’d allow more features.

Now we train the model with a reducer step that passes the 15 000 most informative features to the classifier step.

name_clf = Pipeline([('vect', CountVectorizer(analyzer='char',
                     ('tfidf', TfidfTransformer()),
                     ('reducer', Reducer(idx[-15000:])),
                     ('clf', MultinomialNB(alpha=0.014563484775012445))])

name_clf.fit(name_train, country_train)

t5 = name_clf.predict_log_proba(name_test).argsort(axis=1)[:,-5:]
a = country_test
b = name_clf.get_params()['clf'].classes_[t5]
s = np.mean((a[:, None] == b).max(1))

Out []: 0.62849999999999995

A pretty modest performance decrease given that the model only takes 15% of the memory needed before! As a side effect, the prediction operation in the model is now a lot snappier as well, which feels good in the web app.

With the indices for the features saved, it is actually relatively quick to just train the model from the total amount of raw data. So for deploying the web app, I just have it load the data and train when it launches. It takes about 30 seconds, and the raw data (the list of names and countries) is about 2 MB in total.

Web application

The web application itself is written using the microframework Flask. I define one entry point that provides an API, where the name is queried and the predictor is called to return the top scoring countries. The API returns these in JSON format. The second entry point just serves a HTML file, which in turn loads a Javascript file that connects the input box and buttons to the API.

To get access to scientific Python tools like scikit-learn on Heroku I use the Conda Buildpack.

You can see the full implementation of the app on GitHub, as well as the Jyputer notebook where I made the model.

Finally, have a look at the result!


Comparing unpublished RNA-Seq gene expression quantifiers

During the conference BoG a new RNA-Seq gene expression quantification tool called Kallisto was announced and released by the Pachter group. This tools have me excited because it is following the trend of Sailfish, RNASkim and recently Salmon. The classical approach for gene (or transcript) expression is aligning reads to genome, then by looking at coverage figuring out which transcripts are expressed at which levels. These new tools in stead use information contained in reads where exact mapping is not completely necessary, and use that to deconvolve the mixture proportions of transcripts in a sample.

To compare these quantifiers, and also using results from a classical TopHat + Cufflinks combination as a control I used five batches of single cell RNA-Seq from three different published data sets: SRP030617 (batch 1), SRP041736 (batch 2 & 3), and EGAS00001001204 (batch 4 & 5). These are all data from human cells, in total 284 samples. The data have between ~100 000 reads and ~5 000 000 reads.

All these data sets have had ERCC spike-ins added to them. To compare quantification accuracy, I first quantify expression for all genes / transcripts plus the spike-ins using the different tools. Then I extracted the TPM values for the spike-ins and looked at the Pearson correlation of the log transformed TPM values and the log transformed concentration of ERCC spike-ins. The plots below shows the distribution of these Pearson R values for the different tools over the different dataset batches. The n refers to the number of samples in a given batch.

They all perform well, at least equally well to TopHat + Cufflinks, and generally have pretty high correlation values (very few are below 0.8). In some cases Salmon seem to perform better.

To make the task a bit harder, we can look at the input dilution of ERCC’s included in the publications of the data sets and look only at spike-ins present at low levels. In this case given the concentration and dilution in to the volume of the experiments, we look only at the spike-ins present at between 1 and 50 RNA molecule copies.

All the methods perform equally, varying from surprisingly good to pretty bad depending on data set.

Regarding speed, the TopHat + Cufflinks combo took something along the lines of a half a day per sample, using multithreading over 12 cores for each sample. (I didn’t save the timings). The Salmon quantification took about 15 minutes per sample, also with 12 cores multithreading, while the Kallisto took about 10 minutes per sample and does not do any multithreading.

It should be noted that since there are no alternative splice forms for the ERCC spike-ins this doesn’t tell us anything about correction or quantification of alternative transcripts.

Modelling Measles in 20th Century US

A while ago some people at the Wall Street Journal published a number of heatmaps of incidence of infectious diseases in states of the US over the 20th century. This started a bit of a trend online where people remade the plots with their own sensibilites of data presentation or aestethics.

After the fifth or so heatmap I got annoyed and made a non-heatmap version.

I considered the states as individual data points, just as others had treated the weeks of the years in the other visualizations. Since the point was to show a trend over the century I didn’t see the point of stratifying over states. (The plot also appeared on Washinton Post’s Wonkblog)

The data is from the Tycho project, an effort to digitize public health data from historical records to enable trend analysis.

I thought it would be interesting to look at some other features of the data beside the global downward trend, and in particular how the incidence rate varies over time. In total there are about 150 000 data points, which I figured would also be enough to tell differences between states when considering the seasonal trend over a year as well.

We make a linear model by fitting a spline for the yearly trend, and a cosine curve for the seasonal change in measles incidence over a year. We also choose to be even more specific by finding different seasonal parameters for each state in the US.

$$ y = \sum_{i=0}^n c_i \cdot B_i\left(x_\text{year}\right) + \alpha_\text{state} \cdot \cos\left(\frac{2 \pi}{52}\cdot ( x_\text{week} - 1)\right) + \beta_\text{state} \cdot \sin\left(\frac{2 \pi}{52}\cdot ( x_\text{week} - 1)\right) $$

This way we will capture the global trend over the century, as well as a seasonal component which varies with the week of the year for every state. The global trend curve look very similar to the old version.

Every faint dot in the plot is a (state, week) pair. There are up to 2 600 of these in every year, so seeing them all at once is tricky. But we can still see the trend.

As a zoomed in example, let us now focus on one state and investigate a seasonal component.

It is fairly clear both from looking at the data and the fit of the cosine curve how the measles incidence (which is defined as number of cases per 100 000 people) changes over the year, peaking in early April.

As an alternative representation of the same plot, we can put it in polar coordinates to emphasise the yearly periodicity.

We fitted the model such that we will have a different seasonal component for each state. To visualize them all at once we take the seasonal component for each state, assign a color value based on it’s level on a given week, and put this on a map. We do this for every week and animate this over the year, resulting in the video below.

In the video green corresponds to high incidence and blue to low. There is an appearance of a wave starting from some states and going outwards towards the coasts. This means some states have different peak times of the measles incidence than others. We can find the peak time of a state’s seasonal component by doing some simple trigonometry.

$$ \rho \cos\left(\frac{2 \pi t}{l} - \phi\right) = \rho \cos(\phi) \cos\left(\frac{2 \pi t}{l}\right) + \rho \sin(\phi) \sin\left(\frac{2 \pi t}{l}\right) = $$ $$ = \alpha \cos\left(\frac{2 \pi t}{l}\right) + \beta \sin\left(\frac{2 \pi t}{l}\right) $$ $$ \alpha = \rho \cos(\phi), \ \beta = \rho \sin(\phi) $$ $$ \rho^2 = \alpha^2 + \beta^2 $$ $$ \phi = \arccos\left(\frac{\alpha}{\rho}\right) $$

So $ \phi $ will (after scaling) give us the peak time for a state (meaning the offset of the cosine curve). After extracting the offset for each state, we make a new map illustrating the peak times of measles incidence over the year.

The color scale ranges from week 10 (white) in early March to week 14 (black) in early April. We didn’t provide the model with locations of the states, but the peak times do seem to cluster spatially.


I also want to show how I did this.

%pylab inline

# Parsing and modelling
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

# Helps with datetime parsing and formatting
from isoweek import Week
import calendar

# For plotting the spline
from patsy.splines import BS
from scipy import interpolate

# For creating US maps
import us
from matplotlib.colors import rgb2hex
from bs4 import BeautifulSoup

The data we use from the Tycho project comes in a csv which we parse with Pandas.

measles = pd.read_csv('MEASLES_Incidence_1928-2003_20150413093130.csv', skiprows=2)
measles.index = pd.DatetimeIndex([Week(*t).monday() for t in zip(measles.YEAR, measles.WEEK)])
measles = measles.drop(['YEAR', 'WEEK'], 1)
measles = measles.convert_objects(convert_numeric=True)

Now we have tabular data of this form

measles.iloc[:5, :5]

1928-01-02 3.67 NaN 1.90 4.11 1.38
1928-01-09 6.25 NaN 6.40 9.91 1.80
1928-01-16 7.95 NaN 4.50 11.15 1.31
1928-01-23 12.58 NaN 1.90 13.75 1.87
1928-01-30 8.03 NaN 0.47 20.79 2.38

This is a large matrix of states vs time points with incidence in each cell. For making a linear model we would prefer to have it in a form where each row is an observation, and the columns giving variables used.

year = []
week = []
state = []
incidence = []
for i in measles.index:
    for c in measles.columns:
        incidence.append(np.log10(measles.ix[i, c] + 1))

data = pd.DataFrame({'year': year,
                     'week': week,
                     'state': state,
                     'incidence': incidence})
data.iloc[:5, :5]

incidence state week year
0 0.669317 ALABAMA 1 1928
1 NaN ALASKA 1 1928
2 0.462398 ARIZONA 1 1928
3 0.708421 ARKANSAS 1 1928
4 0.376577 CALIFORNIA 1 1928

We can now use statsmodels to define and fit the linear model

df = 12
d = 3
model = smf.ols('incidence ~ bs(year, df=df, degree=d, include_intercept=True) \
                + np.cos(2 * np.pi / 52 * (week - 1)) : C(state) \
                + np.sin(2 * np.pi / 52 * (week - 1)) : C(state) \
                - 1', data=data).fit()

To be able to use the parameters fitted for the spline, a bit of plumbing is needed. We let patsy and statsmodels pick the knots for the spline, for which the coefficients are fitted by the OLS model. One could calculate these knots directly from the data. But it is simpler to grab the function for picking the knots from patsy. Then we extract the spline coefficients from the model parameters.

my_bs = BS()
my_bs.memorize_chunk(data.year, df=df, degree=d, include_intercept=False)

coeffs = []

conf_int= model.conf_int(alpha=0.05)
for i in range(df):
    parameter_name = 'bs(year, df=df, degree=d, include_intercept=True)[{}]'.format(i)

knots = my_bs._all_knots

tck = (np.array(knots), np.array(coeffs), 3)

The interpolate function from scipy uses triples of knots, coefficients and the degree of the basis polynomials as a parameter, which is used to evaluate any point in the domain of the spline.

x = np.arange(1928, 2004)
y = interpolate.splev(x, tck)

Once we have done this, we can create the first figure.

figsize(8, 6)

# For legend with high alpha dots
plt.scatter(-1, -1, c='k', edgecolor='w', label='Observations')
plt.scatter(data.year, data.incidence, alpha=0.01, edgecolor='w', c='k');

plt.plot(x, y, c='w', lw=5);
plt.plot(x, y, c='r', lw=3, label='Model fit');

plt.xlim(x.min() - 1, x.max() + 1);

plt.ylim(0, 2.);
loc, lab = plt.yticks()
lab = [np.round(10 ** l) for l in loc]
plt.yticks(loc, lab);
plt.ylabel('Measles incidence');


plt.title('U.S. Wide trend');


Next we extract the periodic component parameters for a given state and define a function of the week of the year.

state = 'CALIFORNIA'

alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
beta  = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]

def periodic(week):
    return alpha * np.cos(2 * np.pi / 52 * (week - 1)) + \
           beta * np.sin(2 * np.pi / 52 * (week - 1))

And using this we can create the second plot

figsize(8, 6)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
plt.scatter(data.query('state == "{}"'.format(state)).week,
            data.query('state == "{}"'.format(state)).incidence,
            c=data.query('state == "{}"'.format(state)).year,
            label='Observations', color='k')

x = np.linspace(1, 52, 100)
year = 1950

yy = np.maximum(periodic(x) + y[year - 1928], 0)
plt.plot(x, yy, lw=5, c='w') + \
plt.plot(x, yy, lw=3, c='r', label='Model fit');

plt.xlim(x.min() - 1, x.max() + 1);
plt.ylim(0, 2);

loc, lab = plt.yticks()
lab = [np.round(10 ** l) for l in loc]
plt.yticks(loc, lab);
plt.ylabel('Measles incidence');

xtloc = 1.5 + 52. / 12 * np.arange(0, 13)
plt.xticks(xtloc, calendar.month_name[1:], rotation=30);
plt.xlabel('Time of year');




To create the polar version of the same plot we don’t need to change much, just define the axis to be polar, and transform the x range from 1 to 52 to be between 0 and $2\pi$. Though we do also flip the orientation and rotate the 0 of the polar coordinates since having January 1st on top feels more intuitive, as well as clockwise direction.

fig = plt.figure()
ax = fig.add_subplot(111, projection='polar')

ww = data.query('state == "{}"'.format(state)).week
plt.scatter(np.pi / 2.0 - 2 * np.pi * ww / 52,
            data.query('state == "{}"'.format(state)).incidence,
            c=data.query('state == "{}"'.format(state)).year,

x = np.linspace(1, 52, 100)

year = 1950
yy = np.maximum(periodic(x) + y[year - 1928], 0)
xx = np.pi / 2.0 - 2 * np.pi * x / 52
plt.plot(xx, yy, lw=5, c='w');
plt.plot(xx, yy, lw=3, c='r', label='Model fit');

plt.ylim(0, 2);

month_locs = np.pi / 2.0 + 2 * np.pi / 24 - 2 * np.pi / 12 * np.arange(13)
ax.set_yticks([0.5, 1.0, 1.5, 2.0]);
ax.set_yticklabels([''] * 4);


plt.legend(scatterpoints=5, loc='upper center')



Now, the US state map (also known as a choropleth). This was WAY more complicated than I tought it would be! I imagined there would be some simple functions in matplotlib or similar package. But the ones I tried required map files I wasn’t familiar with and some of them I couldn’t get running. In particular I think GeoPandas is interesting, but I could not get it working.

After trying a bunch of packages and approaches, at some point I ended up with an SVG file with defined state paths. I’m not completely sure were it came from, it might have been from simplemapplot. Just to be sure, I put the SVG file up on a gist.

Before going on to making the visualization, let’s put all the states’ seasonal functions in a dictionary so we can easily use them.

def make_periodic(state):
    alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    beta  = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]

    def periodic(week):
        return alpha * np.cos(2 * np.pi / 52 * (week - 1)) + \
               beta * np.sin(2 * np.pi / 52 * (week - 1))

    return periodic

periodics = {}
for state in data.state.unique():
    periodics[state] = make_periodic(state)

Ok, so the approach that we take is to use BeautifulSoup to parse and edit the SVG file.

svg = open('output_state_map.svg', 'r').read() 
soup = BeautifulSoup(svg)

The way to change the colors is to change the style attributes of each state element in the SVG. The inspiration for this strategy was this IPython notebook where the same is done on an Iowa map.

First we set up a baseline path style which we will append the color to.

path_style = "font-size:12px;fill-rule:nonzero;stroke:#000000;" + \
             "stroke-width:1;stroke-linecap:butt;stroke-linejoin:bevel;" + \
             "stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;" + \

So what we do with this is to loop over the weeks, and for each week color all the states according to their seasonal incidence functions. The colors are converted to hex codes and assigned to the corresponding state’s path. We save one SVG per week.

for w in range(1, 53):
    all_states = soup.findAll(attrs={'class':'state'})
    for p in all_states:
        state = us.states.lookup(p['id']).name.upper()
        p['style'] = path_style + rgb2hex(cm.winter(periodics[state](w) + 0.4))

    h = soup.find(attrs={'id': 'hdlne'})
    mnd = Week(1990, w).monday()
    h.string = calendar.month_name[mnd.month]

    fo = open("/tmp/state_map_{:02}.svg".format(w), "wb")

The modified SVG files are saved as if they are HTML files by BeatifulSoup. The files will still be viewable in web browsers. But for using conversion tools or vector graphics editing tools these need to be in proper SVG format. This is farily easy to fix, one just need to remove the <body> and <html> tags from the file.

for f in /tmp/state_map_*.svg; do
    grep -v "body\|html" $f > $f.tmp
    mv $f.tmp $f

Once this has beed fixed, we can create an animated gif using the convert command from ImageMagick.

convert -delay 6 -loop 0 /tmp/state_map_??.svg animated_state_map.gif

Because having a constantly animating gif on the page was extremely annoying, I converted it to a video using ffmpeg.

ffmpeg -f gif -i animated_state_map.gif animated_state_map.mp4

The final bit is making the map with the seasonal peak times. We simple solve the equations described above for each state to find te offset of the cosine function.

state_peak = {}
for state in data.state.unique():
    alpha = model.params['np.cos(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    beta = model.params['np.sin(2 * np.pi / 52 * (week - 1)):C(state)[{}]'.format(state)]
    rho = np.sqrt(alpha ** 2 + beta ** 2)
    theta = np.arccos(alpha / rho)

    peak_week = theta / (2 * np.pi) * 52
    state_peak[state] = peak_week

state_peak = pd.Series(state_peak)

Just like above we color the states by the value.

svg = open('output_state_map.svg', 'r').read() 
soup = BeautifulSoup(svg)

h = soup.find(attrs={'id': 'hdlne'})
h.string = 'Peak incidence'
h['x'] = 600

all_states = soup.findAll(attrs={'class':'state'})
for p in all_states:
    state = us.states.lookup(p['id']).name.upper()
    peak = state_peak[state]
    col = rgb2hex(cm.Greys((peak - state_peak.min()) / (state_peak.max() - state_peak.min())))
    p['style'] = path_style + col

fo = open("measles_spread.svg".format(w), "wb")

To add the color bar I edited the resulting SVG file in Adobe Illustrator.

Improper applications of Principal Component Analysis on multimodal data

When reading papers I have noticed some strange PCA plots. An example of this I stumbled over today was this one, Figure 1B from Kim et al., 2015, Cell Stem Cell 16, 88–101.

The odd thing here is that there are two obvious subpopulations of points, but within each they appear to have the same slope of PC1 vs PC2. This indicates that there is a linear dependence between PC1 and PC2, with some other factor explaining the difference between the clusters. In the case of the paper they only make this plot as a way to illustrate the relation between the samples. But if there where to be any additional analysis of the components in this PCA, this relation between PC1 and PC2 tells us something is wrong.

I made an artificial example to highlight this issue using Python.

from sklearn.preprocessing import scale

N = 500
mode = np.random.randint(0, 3, N) * 10

x = np.random.normal(0, 100, N)
y = mode + np.random.normal(0, 1, N)

X = np.vstack([x, y]).T
X = scale(X)

figsize(6, 6)

plt.scatter(X.T[0], X.T[1], c='k', edgecolor='w', s=50);
sns.axlabel('x', 'y');

plt.title('Generated data');

This data is 2-dimensional with one normally distributed variable accounting for the x-axis, and the y-axis being due to membership in one of three discrete classes, with some normally distributed noise added. Imagine that we don’t have the information about the two processes generating the points, and we wish to find a model for the points by latent factors. We apply principal component analysis to this data.

from sklearn.decomposition import PCA

pca = PCA()
Y = pca.fit_transform(X.copy())

figsize(6, 6)

plt.scatter(Y.T[0], Y.T[1], c='k', edgecolor='w', s=50);

sns.axlabel('Principal Component 1', 'Principal Component 2');

plt.title('Principal Component Analysis');

The issue is that principal component analysis seeks to primarilly find a projection of the data to one dimension which maximizes the variance. If your data is multivariately normally distributed the PCA will find the “longest” factor explaining the data. But this does not hold when the data one is working with is for example multimodal. Like in this trimodal case. The largest variance is obviosly over the diagonal of the square-like area spanned by the points. But for finding a latent factor model with this kind of data variance is simply not so useful as a measure.

In this case it might make sense to use Independent Component Analysis (ICA) in stead. It is a strategy for finding latent factors which aims to make the latent factors searched for as statistically independent as possible. It makes no assumption about the distribution of the data.

from sklearn.decomposition import FastICA

ica = FastICA()
Z = ica.fit_transform(Y.copy())

figsize(6, 6)

plt.scatter(Z.T[0], Z.T[1], c='k', edgecolor='w', s=50);
sns.axlabel('Independent Component 1', 'Independent Component 2');
plt.title('Independent Component Analysis');

One immidate caveat with ICA over PCA is that there is no longer an ordering of the components, component 1 does not necesarilly explain more of the variability of the data than component 2. They are both equally important for the inferred latent variable model.

It is clear here though that component 1 corresponds to the membership, and component 2 to the x-variable.

If you are expecting clusters in your data, PCA is probably not the right tool.

High contrast stacked distribution plots

A while ago a colleague showed me an image of a beautiful way of plotting several distributions to make them easily comparable. I have since failed to find these plots or managed to figure out what they are called. But the concept was to plot densities for several variables in such a way that they appeared behind each other in depth.

Let me first illustrate the effect before we get to an implementation.

%pylab inline
import pandas as pd
import seaborn as sns

xx = np.linspace(0, 4 * np.pi, 512)

We just make some random sinus curves for illustration

df = pd.DataFrame()
for i in range(1, 32):
    df[i] = np.sin((np.random.rand() + 0.5) * xx) + 1

figsize(16, 16)
for i, ind in enumerate(df):
    offset = 1.1 * i
    plt.fill_between(xx, df[ind] + offset, 0 * df[ind] + offset,

sns.despine(left=True, bottom=True)


There are just a few small tricks to the effect. The zorder=-i parameter makes the curves stack in the correct order. Making the edge colors the same color as the background gives the effect of depth. Personally I like the very high contrast style of using black fills. (Another colleague who saw these plots called them “Moomin plots” )

You can also make some pretty, regular, patterns with this look.

df = pd.DataFrame()
for i in range(1, 16):
    df[i] = np.sin(i * xx) + 1

figsize(14, 8)
for i, ind in enumerate(df):
    offset = 1.1 * i
    plt.fill_between(xx, df[ind] + offset, 0 * df[ind] + offset,

sns.despine(left=True, bottom=True)


So how can we plot distribution for comparison with this technique? First let us create some simulated data. In this case I’m randomly mixing unimodal and bimodal normally distributed data, this is what a lot of the data we study in my research group looks like. Letting the mean be between 0 and 16 is arbitrary of course, and I just picked the variance to be 2 because I thought it looked good.

def unimodal():
    m = random.randint(0, 16)
    return random.randn(256) * 2 + m

def bimodal():
    m1 = random.randint(0, 16)
    s1 = random.randn(128) * 2 + m1
    m2 = random.randint(0, 16)
    s2 = random.randn(128) * 2 + m2
    return np.hstack((s1, s2))

df = pd.DataFrame()
distributions = [unimodal, bimodal]
for i in range(16):
    distribution = random.choice(distributions)
    df[i] = distribution()

We’ll use the Gaussian KDE from scipy for this.

from scipy.stats import kde

Now we pick an x-axis which should cover the entire range of the data in all the columns of the DataFrame we keep the data in.

We handle the y-offsets of the data by picking half the height of the previous density plot. This can be tweaked and give slightly different effect dependong on what fraction of the height is used. I found half height to be most generally good looking though.

To make it easy to relate a label to a distribution we plot a line the density can “rest on”. If it is thick enough it will give the effect of the density growing up from it.

figsize(6, 8)
xx = np.linspace(df.min().min(), df.max().max(), 256)
yy = None
offset = 0
offs = [offset]
for i in df:
    if yy is not None:
        offset += 0.5 * yy.max()

    s = df[i]
    f = kde.gaussian_kde(s)
    yy = f(xx)

    plt.fill_between(xx, yy + offset, 0 * yy + offset,


sns.despine(left=True, bottom=True)
plt.yticks(offs, ['Sample {}'.format(s + 1) for s in df.columns]);


I like these a lot, though it looks a bit messy. Of course, with real data the ordering of the variables on the y-axis would be informed by some prior knowledge so that trends could be easily spotted.

But even so, if we don’t know how we should order these for easy comparison between the distributions, we can use hierarchical clustering to arrange the distribution so they become more comparable.

import fastcluster as fc
from scipy.cluster.hierarchy import dendrogram

figsize(6, 8)

link = fc.linkage(df.T, method='ward')
dend = dendrogram(link, no_plot=True)

xx = np.linspace(df.min().min(), df.max().max(), 256)
yy = None
offset = 0
offs = [offset]
for i, ind in enumerate(df[dend['leaves']]):
    if yy is not None:
        offset += 0.5 * yy.max()

    s = df[ind]
    f = kde.gaussian_kde(s)
    yy = f(xx)

    plt.fill_between(xx, yy + offset, 0 * yy + offset,


sns.despine(left=True, bottom=True)
plt.yticks(offs, ['Sample {}'.format(s + 1) for s in df[dend['leaves']].columns]);


Extract Cluster Elements by Color in Python Dendrograms

One aspect of using Python for data analysis is that hierarchical clustering dendrograms are rather cumbersome to work with. Both in terms of plotting next to a heatmap, and how to relate the input data to the resulting plot. Ideally the dendrogram function would return a proper instances of some Dendrogram class. But let’s work with what we have.

There are plenty (1, 2) of guides around on how to make R inspired heatmaps with dendrograms. Here I will describe how to parse out the cluster members as seen in a dendrogram, which can be handy if one notices interesting patterns in the corresponding heatmap.

Let’s start with importing some modules.

%pylab inline

from collections import defaultdict

import pandas as pd
from scipy.cluster.hierarchy import dendrogram, set_link_color_palette
from fastcluster import linkage
import seaborn as sns
from matplotlib.colors import rgb2hex, colorConverter


We also set some prettier non-default colours.

sns.set_palette('Set1', 10, 0.65)
palette = sns.color_palette()
set_link_color_palette(map(rgb2hex, palette))

We should create some simple example data for the purpose of illustration.

x, y = 3, 10
df = pd.DataFrame(np.random.randn(x, y),
                  index=['sample_{}'.format(i) for i in range(1, x + 1)],
                  columns=['gene_{}'.format(i) for i in range(1, y + 1)])

link = linkage(df, metric='correlation', method='ward')

figsize(8, 3)
den = dendrogram(link, labels=df.index, abv_threshold_color='#AAAAAA')
no_spine = {'left': True, 'bottom': True, 'right': True, 'top': True}

The dendrogram function will return a dictonary containing a representation of the tree as plotted.


{'color_list': ['#c13d3f', '#AAAAAA'],
 'dcoord': [[0.0, 0.70210454159043401, 0.70210454159043401, 0.0],
  [0.0, 1.1003608323279817, 1.1003608323279817, 0.70210454159043401]],
 'icoord': [[15.0, 15.0, 25.0, 25.0], [5.0, 5.0, 20.0, 20.0]],
 'ivl': ['sample_3', 'sample_1', 'sample_2'],
 'leaves': [2, 0, 1]}

The tree is represented as a collection of ∏ shaped components.

The three items named color_list, dcoord, icoord indexes these ∏’s

Obviosly color_list contains the colors. The lists in dcoord contain the y coordinates of the ∏’s (the distances) while icoord has the x coordinates. These would be the index coordinates.

In the minimal example above we have two ∏’s, and one can see that the x-coordinates are repeated once.

The coordinates go from left to right. So for the red ∏, the ‘legs’ are located at 15 and 25, while the grey one has legs at 5 and 20.

The apperant pattern is that legs positioned at leaves will end with 5. The reason is to nicely place the leg at the middle of the corresponding leaf index value. This also implies the actual list indices of the leaf are multiplied by 10.

Thus we first subtract 5 from each colors icoord, then divide by 10. If the resulting number is close to the closest integer, we consider this to be an index for a leaf. If the resulting number is not close to an integer index, it means the colored tree we got it from is from non-leaf parts of the trees.

For each leaf, we add it to a list per color in a dictionary.

cluster_idxs = defaultdict(list)
for c, pi in zip(den['color_list'], den['icoord']):
    for leg in pi[1:3]:
        i = (leg - 5.0) / 10.0
        if abs(i - int(i)) < 1e-5:


defaultdict(<type 'list'>, {'#c13d3f': [1, 2], '#AAAAAA': [0]})

Next we need to grab the labels of the leaves given the indexes. But before we do that, since it’s difficult to keep track of what color e.g. #c13d3f is, we make make an IPython notebook compatible HTML representation of the dictionary holding the information. Objects of this class will behave just like dictionaries, except for representing them as a HTML table.

class Clusters(dict):
    def _repr_html_(self):
        html = '<table style="border: 0;">'
        for c in self:
            hx = rgb2hex(colorConverter.to_rgb(c))
            html += '<tr style="border: 0;">' \
            '<td style="background-color: {0}; ' \
                       'border: 0;">' \
            '<code style="background-color: {0};">'.format(hx)
            html += c + '</code></td>'
            html += '<td style="border: 0"><code>' 
            html += repr(self[c]) + '</code>'
            html += '</td></tr>'

        html += '</table>'

        return html

(Note that the representation uses colerConverter from matplotlib, so it supports any matplotlib supported color representation, not only hex color strings.)

Now just go through the list of indices and fetch labels.

cluster_classes = Clusters()
for c, l in cluster_idxs.items():
    i_l = [den['ivl'][i] for i in l]
    cluster_classes[c] = i_l


#c13d3f['sample_1', 'sample_2']

Let’s combine this to a nice reusable function, and try it out on a larger example.

def get_cluster_classes(den, label='ivl'):
    cluster_idxs = defaultdict(list)
    for c, pi in zip(den['color_list'], den['icoord']):
        for leg in pi[1:3]:
            i = (leg - 5.0) / 10.0
            if abs(i - int(i)) < 1e-5:

    cluster_classes = Clusters()
    for c, l in cluster_idxs.items():
        i_l = [den[label][i] for i in l]
        cluster_classes[c] = i_l

    return cluster_classes

x, y = 96, 10
df = pd.DataFrame(np.random.randn(x, y),
                  index=['sample_{}'.format(i) for i in range(1, x + 1)],
                  columns=['gene_{}'.format(i) for i in range(1, y + 1)])

link = linkage(df, metric='correlation', method='ward')

figsize(12, 4)
den = dendrogram(link, labels=df.index, abv_threshold_color='#AAAAAA')


#c13d3f['sample_20', 'sample_87', 'sample_35', 'sample_2', 'sample_24', 'sample_13', 'sample_6', 'sample_82', 'sample_11', 'sample_12', 'sample_44', 'sample_94', 'sample_73', 'sample_77', 'sample_76']
#4e7ca1['sample_56', 'sample_81', 'sample_69', 'sample_9', 'sample_27', 'sample_21', 'sample_45', 'sample_52', 'sample_59', 'sample_46', 'sample_93', 'sample_10', 'sample_48', 'sample_78', 'sample_14', 'sample_50', 'sample_31', 'sample_91', 'sample_75', 'sample_86', 'sample_84', 'sample_55', 'sample_88', 'sample_43', 'sample_58', 'sample_33', 'sample_96', 'sample_34', 'sample_19', 'sample_68', 'sample_21']
#8e5d93['sample_4', 'sample_92', 'sample_15', 'sample_54', 'sample_1', 'sample_17', 'sample_42', 'sample_79', 'sample_41', 'sample_57', 'sample_63', 'sample_67', 'sample_22', 'sample_64', 'sample_83', 'sample_30', 'sample_53', 'sample_26', 'sample_3', 'sample_29', 'sample_28', 'sample_36', 'sample_7', 'sample_80', 'sample_8', 'sample_5', 'sample_71', 'sample_65', 'sample_39', 'sample_62', 'sample_85']
#5e9d5c['sample_23', 'sample_38', 'sample_90', 'sample_51', 'sample_25', 'sample_60', 'sample_74', 'sample_18', 'sample_61', 'sample_32', 'sample_49', 'sample_37', 'sample_70', 'sample_66', 'sample_16', 'sample_95', 'sample_40', 'sample_72', 'sample_47', 'sample_89', 'sample_40']

There is also an IPython notebook version of this post here

Riemannian Newton iteration for Rayleigh quotients on the sphere

In the above figure, each point is in initial value which will be converge to different eigenvectors of an orthogonal 3x3 matrix. The above figure is the equivalent of a Newton fractal, but applied to Rayleigh quotient iteration on a sphere. This post will go through an explanation of the figure, and the numerical method on the sphere, which can be applied to any manifold.

Rayleigh quotients and Eigenvectors

Given a Hermetian matrix \( M \) and a vector \( x \) the Rayleigh quotiend is defined by

$$ R(M, x) = \frac{x^TMx}{x^T x} $$

For the sake of this text, let us limit ourselves to looking at \( M = A^T A \) which are covariance matrices of some data matrix \( A \). This implies that \( M \) has unique real values eigenvalues. In addition we know that the eigenvectors of \( M \) will be orthogonal.

Using these eigenvectors as a basis, let us express an arbitrary vector \( x \) as a linear combination of eigenvectors.

$$ x = \sum_{i = 1}^n \alpha_i v_i $$

Now we have that

$$ R(M, x) = \frac{x^T A^T A x}{x^T x} = \frac{x^T A^T A x}{x^T x} = $$

$$ = \frac{\left(\sum_{j = 1}^n \alpha_j v_j\right)^T \left( A^T A \right) \left(\sum_{i = 1}^n \alpha_i v_i\right)} {\left(\sum_{j = 1}^n \alpha_j v_j\right)^T \left( \sum_{i = 1}^n \alpha_i v_i \right)} = $$

$$ = \frac{\sum_{i = 1}^n \alpha_i^2 \lambda_i}{\sum_{i = 1}^n \alpha_i^2} = \sum_{i = 1}^n \lambda_i \frac{(x^T v_i)^2}{(x^T x)(v_i^T v_I )} $$

since eigenvectors are orthogonal. Additionally, we see that the sum in the last equality will be minimized when \( x \) is orthogonal to the majority of the \( v_i \)’s. Which is locally the case for any \( x = v_j \), and globally when the eigenvector \( v_j \) corresponds to the smallest eigenvalue.

As \( R(M, x) \) is unaffected by lengths of vectors (due to division by the scalar product) it is defined on the sphere \( S^{n - 1} \), which is a Riemannian manifold.

$$ R(M, \cdot) : S^{n - 1} \rightarrow \mathbb{R} $$

Newton Iteration

The Newton Method of iteration is defined in \( \mathbb{R}^n \) by the procedure

  1. Solve \( D(\text{grad } \ f)(x)[\eta_x] = -\text{grad } \ f(x) \) for \( \eta_x \)
  2. Set \( x_{i + 1} := x_i + \eta_x \)

In each update of the iteration we solve for the update vector \( \eta_x \). Optimal points are found by finding the roots of the derivative of the function. Newton iteration is used to find local optima of differentiable functions by in each step making a local second order Taylor approximation of the given function. Since Newton iteration will have information about the local curvature it is normally quicker to converge than gradient descent.

A very classical assignment in courses on numerical methods is to implement the Newton method for root finding to find the complex roots of a polynomial. Usually this entails looking at which root the algorithm will find depending on initial condition. The result is the so called Newton Fractal. (Note that in this case we are looking for the roots of the function in question. For optimizing by finding stationary points we need to look at the derivative of the given function.)

For example, let us find the complex roots of the polynomial \( p(x) = x^3 - 1 \). The iteration becomes

  1. Solve \( p(x_k) + p’(x_k) \cdot \Delta x_k = 0 \) for \( \Delta x_k \)
  2. Set \( x_{k + 1} := x_k + \Delta x_k \)

Or in other words

$$ x_{k + 1} = x_k - \frac{x_k^3 - 1}{2x_k^2} $$

We can simply implement this iteration in the following fashion using Python with NumPy.

f = np.poly1d([1, 0, 0, -1])  # x^3 - 1
fp = np.polyder(f)

def newton(x, y, max_iters):
    c = complex(x, y)
    for i in range(max_iters):
        c = c - f(c) / fp(c)
        if np.abs(f(c)) < 1e-4:
            return c

    return 0.0

Now let us use this to illustrate the dependence on the initial value of the iteration. We simply define a function which performs the iteration until convergence over a grid of starting values in the complex plane. We color the grid by which root is found.

def create_newton_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  for x in range(width):
    real = min_x + x * pixel_size_x
    for y in range(height):
      imag = min_y + y * pixel_size_y
      color = newton(real, imag, iters)
      image[y, x] = color.imag + 1

This can be generalized to work on any Riemannian manifold \( \mathcal{M} \) with an affine connection \( \nabla \) to optimize real valued functions \( f \). Define the Hessian by

$$ \text{Hess} \ f(x_k):= \nabla_{\eta_k} \text{grad } \ f. $$

Also let \( R : T\mathcal{M} \rightarrow \mathcal{M} \) be a retraction on \( \mathcal{M} \), a smooth mapping which satisfies

  1. \( R(0_x) = x \), where \( 0_x \) is the origin of \( T_x\mathcal{M} \)
  2. \( \frac{d}{dt} R(t \xi_x) \vert_{t=0} = \xi_x \)

Now we can define the new iteration method by

  1. Solve \( \text{Hess} \ f(x_k) \eta_k = -\text{grad } \ f(x_k) \) for \( \eta_k \in T_{x_k} \mathcal{M} \)
  2. Set \( x_{k + 1} := R_{x_k}(\eta_k) \)

Newton Iteration on Spheres

As a first step towards working on the sphere, we need to look at how the iteration would work for manifolds which are sub-manifolds of \( \mathbb{R}^n \). In this case we pick \( \nabla \) to be the Levi-Civita connection. What we need is a way of calculating the gradient in \( \mathcal{M} \).

Every vector \( z \in T_x \mathbb{R}^n \cong \mathbb{R}^n \) admits a decomposition \( z = P_x z + P^\perp_x z \) where \( P_x z \in T_x\mathcal{M} \) and \( P^\perp_x z \in T^\perp_x\mathcal{M} \); the orthogonal complement to \( T_x\mathcal{M} \) in \( \mathbb{R}^n \). Now say that \( \bar{f} : \mathbb{R}^n \rightarrow \mathbb{R} \), and the function \( f \) is defined by the restriction \( f = \bar{f} \vert_\mathcal{M} \). Then the gradient can be calculated by

$$ \text{grad } \ f(x) = P_x \ \text{grad } \ \bar{f}(x) $$

Using this identity, we can now realize the Levi-Civita connection by

$$ \nabla_{\eta_k} \text{grad} \ f = P_{x_k} \text{D} (\text{grad} \ f(x_k))[\eta_k] =: \text{Hess} \ f(x_k) \eta_k $$

So now in the step solving for \( \eta_k \in T_{x_k} \mathcal{M} \) we need to know a partitioning.

Let us now pull back from this abstraction and consider the particular case when \( \mathcal{M} = S^{n - 1} \) as a sub-manifold of \( \mathbb{R}^n \). That means in this case we have \( f : S^{n - 1} \rightarrow \mathbb{R}^n \). We let the partitioning projection \( P_x \) in to \( T_x S^{n - 1} \) be defined by

$$ P_x := \left( I - xx^T \right). $$

(A projection to the orthogonal complement of the vector projection onto \( x \))

We now have the iteration defined by

1: Solve

$$ P_{x_k} \text{D}(\text{grad} \ f)(x_k)[\eta_k] = -\text{grad} f(x_k) $$

for \( \eta_k \in T_x S^{n - 1} \cong \mathbb{R}^n \)

2: Step and retract back to the sphere by setting

$$ x_{k + 1} := \frac{x_k + \eta_k}{\Vert x_k + \eta_k \Vert} $$

Newton Iteration of Rayleigh quotient

We now have all the necessary tools to solve the eigenvector problem by minimizing the Rayleigh quotient. For gradient evaluation, as \( \frac{\partial}{\partial x_j}(x^T x) = 2x_j\) and \( \frac{\partial}{\partial x_j}(x^T M x) = 2(Mx)_j \) it follows that

$$ \text{grad} \ R(M, x) = \frac{2}{x^T x} \left( Mx - R(M, x) \cdot x \right) $$

This leads to the Newton equation

$$ P_{x_k} M P_{x_k} \eta_k - \eta_k x_k^T M x_k = -P_{x_k} M x_k $$

We know have a concrete iteration method which we can implement and run using Python

I = np.eye(3)

# Retraction at x
def R(x, eta):
    return (x + eta) / linalg.norm(x + eta)

# Small helper function
def f(A, x):
    return dot(x.T, dot(A, x))

# Projection P
def P(xk):
    return I - outer(xk, xk)

# Make a random orthogonal matrix
A = random.randint(0, 10, (3, 3))
A = dot(A, A.T)

# Pick starting point
x0 = np.array([0,-1,0])
xk = x0

# Iterate and store path
path = [xk]
for i in range(10):
    Pxk = P(xk)
    M = (Pxk.dot(A).dot(Pxk) - f(A, xk) * I)
    y = -Pxk.dot(A).dot(xk)
    eta_k = linalg.solve(M, y)
    path.append(xk + eta_k)
    xk = R(xk, eta_k)

Let us plot the path of the iteration over the sphere, in the following plot blue lines mean \( x_k + \eta_k \) (updates in the tangent space) and orange lines \( R_{x_k}(\eta_k) \) (retraction to the sphere).

In this case the iteration has already converged after five steps, the following five steps are also plotted, but are stable and can’t be seen. The iteration converges when the Rayleigh quotient is minimized, therefore, we have now found an eigenvector to the randomly chosen matrix A.

Now we have everything we need to generate the figure above. First we define a couple of functions.

def RNmS(A, x0):
    # Iteration
    N = 5
    xk = x0
    for k in range(0, N + 1):
        Pxk = P(xk)
        M = (Pxk.dot(A).dot(Pxk) - f(A, xk) * I)
        y = -Pxk.dot(A).dot(xk)
        eta_k = linalg.solve(M, y)
        xk = R(xk, eta_k)

    X = A.dot(xk) / xk
    return X.mean()

def find_nearest_index(array, value):
    index = ((array.real - value.real) ** 2 + \
             (array.imag - value.imag) ** 2).argmin(0)

    return index

The we pick a matrix and calculate the eigenvalues with the standard algorithm for comparison.

# Choice of 3x3 symmetric matrix for which we want to find the eigenvectors/-values.
# Note that different matrices produce very different images.
A = random.randint(0, 10, (3, 3))
A = dot(A, A.T)

EW, EV = linalg.eig(A)
print "Eigenvalues: ", 
print EW

Finally we make a spherical grid, perform the iteration for every point in the grid, and plot the result.

# Resolution
n = 512
nj = n * 1j

# u and v are parametric variables.
u = r_[0:2*pi:nj]
v = r_[0:pi:nj]

x = outer(cos(u), sin(v))
y = outer(sin(u), sin(v))
z = outer(ones(size(u)), cos(v))

# Figure out which eigenvectors/-values the algorithm converges to
# for all the spherical initial vectors.
s = zeros((n, n))
for i in range(0, n):
    for j in range(0, n):
            X = RNmS(A, array([x[i, j], y[i, j], z[i, j]]))
            s[i, j] = find_nearest_index(EW, X)

print X

print "Indexes: ",
print set(s.flatten())

bmap = brewer2mpl.get_map('RdBu', 'Diverging', 3)
cmap = bmap.get_mpl_colormap(N=3, gamma=1.0)

figsize(50, 50)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c=s, s=16, edgecolor='none', marker='s', cmap=cmap);


Gene Discovery Rate in RNA-seq data

I was curious regarding at what rate new genes are detected as the number of reads in RNA-seq data increases. Let’s start by looking at the results.

The figure shows how many genes there are with nonzero counts at certain read depths.

In particular I was interested how this discovery rate looked for single-cell RNA-seq data. I found some read files at ENA which was from human cells. The two data sets used are SRR445722 and SRR499825.

In the interest of saving space on my tiny laptop, I only kept the first four million reads in the fastq files. For each read number investigated, reads are first randomly sampled from the fastq file using the reservoir sampling of seqtk. The sampled reads are then mapped to the human transcriptome (Homo_sapiens.GRCh37.73.cdna.abinitio.fa) using bwa mem. To map from transcripts to genes, I use eXpress (with the annotation file Homo_sapiens.GRCh37.73.gtf.gz). The final step is to count the number of genes which have counts different from 0.

In the interest of time and simplicity, I didn’t perform any quality or adapter trimming of the data.

For each read count, the sampling was performed five times with different random seeds. In the figure these replicates can barely be seen, but at some pints some smaller colored points can be glanced.

These are the pipeline for the process for the two datasets:

for READS in $(seq 10000 10000 1000000)
    for i in $(seq 5)
        echo $READS reads >> sample_num_genes_SRR445722.out
        seqtk sample -s$(date +%s) SRR445722_4M_reads.fastq $(echo "scale=8; $READS / 4000000" | bc) | \
        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa - | \
        express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa
        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes.out


for READS in $(seq 10000 10000 1000000)
    for i in $(seq 5)
        echo $READS reads >> sample_num_genes_SRR499825.out

        SEED=$(date +%s)
        FRAC=$(echo "scale=8; $READS / 4000000" | bc)

        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa \
        <(seqtk sample -s$SEED SRR499825_1_4M_reads.fastq $FRAC) \
        <(seqtk sample -s$SEED SRR499825_2_4M_reads.fastq $FRAC) \
        | express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa

        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes_SRR499825.out

They are slightly different since SRR499825 is paired-ended while SRR445722 is single ended.

I ran these for as long as I could leave my laptop on and stationary at a time.

What I find interesting about this is how these logarithm-like curves seem are so smooth.

Both studies these data come from have plenty of replicates, so a more thorough version of this study would be to try data from multiple cells and make sure that correlates. It would also be interesting to see how bulk RNA- seq compares to these data sets. I also only looked at data from human cells because I didn’t want include another set of transcriptome and annotation file to keep track of, but there are a lot more mouse single-cell RNA-seq studies on ENA than human.

It would also be fun to run the analysis much longer, to get up to millions of reads, where we start seeing the majority of genes as detected. And see precisely at which point this starts to happen.


I reran the computations, but this time I let both data sets run to completion (depth of 1000 000 reads). I also saved the counts for each gene, so I could query detected genes with a lower cutoff. The new graph above show the detection rates for lower cutoffs of one, three and five mapped reads.

More interactive Python plotting with Clojure and Quil

The Python plotting library matplotlib is one of the libraries I read about a lot, and often, to try to keep track of everything possible to do with it. But I keep running in to situations where I don’t find it very useful.

Here I will write about one issue I had at a point, where I found it annoying when I wanted to sequentially build a graph, by adding scatter points to it one set at a time.

This summer I started playing with Clojure, and I enjoyed it quite a lot. Through looking at presentations for this I found Quil. I’ve only at some points played with Processing before, and I feel like quil better fits my way of thinking.

So, what I want, is to to my regular things with point data in IPython Notebook, and when I want to add it to a graph, I want to just send it to a quil sketch which is just running to the side of my IPython window. This way I can build my graphs over time.

I enjoy working with MongoDB, it has a tendency to just work where I need it and it is easy get started with. So I used this as a backend, and connected my quil sketch to the database using the monger library, which is extremely simple to use.

Here follows the entire code for the quill sketch that does what I want:

(ns square-vis.core
  (:require [monger.core :as mg])
  (:require [monger.collection :as mc])
  (:use monger.operators)
  (:use quil.core))

(mg/set-db! (mg/get-db "test"))

(defn draw-square
  (let [size (get doc :size)
        x    (get (get doc :coord) 0)
        y    (get (get doc :coord) 1)
        c    (get doc :c)]
    (fill c)
    (rect x y size size)))

(defn -main
  [& args])

(defn setup []
  (frame-rate 30)
  (background 200))

(defn draw []
  (background 200)
  (doseq [item (mc/find-maps "squares")] (draw-square item)))

(defsketch example                  
           :title "Squaaaares!"  
           :setup setup                      
           :draw draw                        
           :size [323 200])                  

This is only for making scatter plots with squares, in grayscale, which I like a lot. This piece of code connects to a database called “test”, and searches for documents in the collection named “squares”, and give them as Clojure maps, so they are easy to use in the rest of the sketch. The documents in the collection represents squares, with the following schema.

square = {'coord': [10, 20], 'size': 8, 'c': 155}

Some small examples:

from pymongo import Connection
conn = Connection()
coll = conn['test']['squares']

for i in range(28):
    for j in range(22):
        square = {'coord': [10 + 10 * i, 20 + 10 * j],
                  'size': 5, 'c': (i * j) ** 2 % 251}

for k in linspace(0.0, 2.0):
    x = np.linspace(1, 350)
    y = k * x
    for xi, yi in zip(x, y):
        doc = {'coord': [xi, yi], 'size': 5 * k + 1, 'c': 100 * k}

while True:
        for doc in coll.find():
            doc['coord'][0] += 1 * (random.random() - 0.5)
            doc['coord'][1] += 1 * (random.random() - 0.5)
    except KeyboardInterrupt:

Non-standard Python on UPPMAX

Non-standard Python on UPPMAX

This is something I’ve had to look up at least 10 times in the last year, so to save myself from searching my email inbox yet another time I will write it out here. This post is specific to the Kalkyl cluster at UPPMAX.

To make a virtual environment with Python 2.7 on UPPMAX, do the following

mkvirtualenv --no-site-packages --python=/sw/comp/python/2.7.1_kalkyl/bin/python2.7 new_venv

What’s the problem?

Firstly, the standard Python on Kalkyl is 2.6, which is incompatible with most things we do, as we all tend to use 2.7.

UPPMAX uses the module system to manage different versions of software. Naively, you would think one could make a Python 2.7 by doing

$ module load python/2.7.1
$ mkvirtualenv --no-site-packages  new_venv

This will not work.

In stead, one need to point to the Python binary as it is stored on the filesysten; the one that the module system links to when module load is called. And this is located in /sw/comp/python/2.7.1_kalkyl.

Visualizing overlapping intervals

Visualizing overlapping intervals

Rather often I find myself wanting to see a pattern of how some interval data looks. I never found a good package for this, so always reverted to either drawing the intervals after each other on the x = y line. Or pile the intervals to histogram like graphs.

But today I got an idea how I could do it rather simply. What I want to do is stack the intervals so they don’t overlap, but also doesn’t need too much space on the Y-axis.

First let’s create some intervals that we want to try stacking.

intervals = [(2, 6), (7, 10), (1, 3), (4, 6), (8, 10), (1, 2), (3, 7), 
            (9, 10), (1, 2), (3, 5), (6, 7), (8, 10), (2, 5), (8, 10)]

Here is the basic idea: We sort the intervals by the start values. We give the first interval an initial Y-value. For the next interval, we check if the lower value is smaller than the upper value of the previous interval. If not, we just place the interval after the previous one. But if, we increment the Y-value, and check the same thing for the previous interval in that Y-category.

Here is a python implementation of this Y-categorization.

def intervals2layers(si):
    layers = [[si[0]]]
    for p in si[1:]:
        for lay in layers:
            if lay[-1][-1] < p[0]:

    return layers

Let’s try it out!

si = sorted(intervals, key=lambda p: p[0])
layers = intervals2layers(si)

figsize(8, 4)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2, lw=30)

plt.xlim(0, 11)
plt.ylim(0, 6);

This works quite nicely. Let’s try something more problematic; we’ll generate 1024 intervals and see how these end up stacking using this strategy.

intervals = []
for i in range(1024):
    x1 = random.normalvariate(0.5, 0.25)
    x2 = x1 + random.random() / 16
    intervals.append((x1, x2))

figsize(20, 10)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2)

ax = plt.gca()

Precisely what I’m after! One can see overall patterns of the distribution of all the intervals, but it’s also possible to see certain subintervals which might be interesting.

This is how some real data looks like.

Single Linkage Clustering Heuristic

Explaining the magic fudge in Mapper.


A few months back I attended a couple of seminars by Gunner Carlsson talking about topological data analysis, and discussed some methods they employ at Ayasdi. Given my background in mathematics I thought it was quite nice to see some concrete applications of some rather abstract theory. In addition to this, their topology based analysis methods seem to be succesful with bioinformatical data where other methods fail.

This first paper about the method describes it quite well, and on the groups website they are also hosting a Matlab script which they used for the applications in that first paper. This scientific report gives a very nice summary of capabilities of topology based data analysis.

Using their refined implementation of the method on your data means contacting them and sending them your data to run their analysis on.

Therefore I thought a nice evening project would be to reimplement the Matlab script in Python, so it can be easily run anywhere, and eventually optimize the implementation somewhat.

A great start would be at the [upcoming Seqahead Hackathon] (http://seqahead.cs.tu-dortmund.de/meetings:2013-06-hackathon), where I’d hoped to perhaps get some collaboration.

Mapper basics

Just extremely briefly (see the scientific report or paper for details), the basic idea of Mapper is to divide the space of the data in to sections. Within these sections clustering is performed, these clusters are then combined depending on overlaps to form a graph which reveals a “shape” of the data.

Single Linkage Clustering

In the Matlab implementation of Mapper the authors use single linkage clustering. Briefly, single linkage clustering means initially taking every object as a cluster on its own, then pairwise link together “clusters” which are the closest to each other. This is repeated until the algorithm is left with one single cluster.

Software performing this kind of clustering usually returns a representation of the tree being built throughout the clustering, with the intermediate distances between clusters in each step preserved.

Mysteries in the Matlab implementation

When reading the Matlab implementation one comes across the parameter magicFudge. This is described in the comments of the script.

%  magicFudge: The number of histograms to divide each homology plot into
%              Increase this number if you expect to see more clusters
%              and decrease it if you are seeing more clusters.
%              Default : 10

Heuristic for number of clusters

This is further explained in the paper (page 5).

Now, to find the number of clusters we use the edge length at which each cluster was merged. The heuristic is that the interpoint distance within each cluster would be smaller than the distance between clusters, so shorter edges are required to connect points within each cluster, but relatively longer edges are required to merge the clusters. If we look at the histogram of edge lengths in C, it is observed experimentally, that shorter edges which connect points within each cluster have a relatively smooth distribution and the edges which are required to merge the clusters are disjoint from this in the histogram. If we determine the histogram of C using k intervals, then we expect to find a set of empty in- terval(s) after which the edges which are required to merge the clusters appear.

The authors also points out problems with this approach.

Although this heuristic has worked well for many datasets that we have tried, it suffers from the following limitations: (1) If the clusters have very different densities, it will tend to pick out clusters of high density only. (2) It is possible to construct examples where the clusters are distributed in such a way such that we re- cover the incorrect clustering. Due to such limitations, this part of the procedure is open to exploration and change in the future.

While I was reading the paper, I didn’t think much about this heuristic, but when trying to implement the method, it got me fairly confused.

I figured it would be nice to try the heuristic out and see how well it performs.

First we want to simulate some data, there’s a nice function in scikits-learn to generate blobs of data (clusters): sklearn.datasets.make_blobs. First generate some clusters.

from sklearn import datasets
num_clusters = 3
X, y = datasets.make_blobs(n_samples=N, centers=num_clusters)

(Throughout this post, I will omit commands for creating plots, but it can all be viewd in this IPython notebook.)

To perform single linkage clustering, we need a distance matrix between all points in the data set.

from scipy.spatial import distance
d = distance.cdist(X, X)

Then we can use, e.g. this new fastcluster package to perform the clustering.

import fastcluster
Z = fastcluster.linkage(d, method='single')

This creates the linkage Z, a (num_points - 1, 4)-shaped array where the distance between Z[i, 0] and Z[i, 1] is Z[i, 2].

This graph contains a dendrogram representation of the linkage.

In Mapper they also add the longest distance from the distance matrix to the array of inter cluster distances.

lens = np.zeros(Z.shape[0] + 1)
lens[:-1] = Z[:, 2]
lens[-1] = d.max()

They also introduce the notion of calling this collection of distances a “lens”. I like this as it invokes the idea of “unfocusing” the view of the data to see the overall shape.

Now just look at the histogram of distances.

This finds one cluster, as can be seen by doing the following.

z = np.nonzero(hst == 0)[0]

Out[613]: 1

It’s pretty impossible to see, but this cluster has the distances in the 9th bin in the histogram. This is obviously wrong, since we knew beforehand that we made three clusters in this example. Thus we should “focus” the lens by increasing the magicFudge, in this case I increase it to 64.

magic_fudge = 64
hst, _ = np.histogram(lens, bins=magic_fudge)

z = np.nonzero(hst == 0)[0]

Out[632] 3

Here we see the pattern discussed in the paper quite clearly. A nice almost normal distribution of small distances, breaking at the 9th bin (black arrow), followed by three bins at various locations with a single distance in.

Now to the issue that motivated this blog post: It took my many tries of generating example clusters, as well as trying different magicFudge-values for this last histogram.

Testing the heuristic

To see how the heuristic performed, I wrote some code that randomly generates between one and six clusters, performs the cluster count using nine different magicFudge-values, and collects how many times for each value the heuristic found the correct number of clusters.

magic_fudge_settings = [8, 16, 25, 32, 50, 64, 100, 128, 256]
hit_counts = [0] * len(magic_fudge_settings)
n_tests = 500
for i in range(n_tests):
    true_num_clusters = random.randint(1, 6)

    X, _ = datasets.make_blobs(n_samples=300, centers=true_num_clusters)

    d = distance.cdist(X, X)
    Z = fastcluster.linkage(d, method='single')
    lens = np.zeros(Z.shape[0] + 1)
    lens[:-1] = Z[:, 2]
    lens[-1] = d.max()

    for j, magic_fudge in enumerate(magic_fudge_settings):
        hst, _ = np.histogram(lens, bins=magic_fudge)
        z = np.nonzero(hst == 0)[0]
        if len(z) == 0:
            num_clusters = 1
            num_clusters = hst[z[0]:len(hst)].sum()

        if num_clusters == true_num_clusters:
            hit_counts[j] += 1


Are best summarized with the following plot.

These results are quite bad in my opinion. However, when using this process to find clusters in order to make a graph of connections between them, it is probably not that important to be accurate. Just as long as enough overlapping clusters are found we will end up with a nice network that describes the data. But I very much doubt this is the cluster counter method used in the production implementation at Ayasdi.

Teaching Python at SciLifeLab

This spring we organized the KTH course Scientific Programming in Python for Computational Biology, hosted at SciLifeLab Stockholm.

The course ran over seven weeks with one two hour session every wednesday, with alotted time after the session for students to ask questions about assignments. The examination was by passing the weekly assignments in a timely fashion. Our twist on this classic concept was that all hand ins had to be done via Github.

The structure of the course was inspired by Software Carpentry.

There was demand for a course that goes beyond the basics of Python, and one challange with this is that even thought there are plenty of online resources for introducing people to Python, structured lectures, and in particular exercises, of more advanced sections are hard to come by.

We had a brainstorming session where we talked about what kind of things we wanted to cover in the course. The course was intended for scientists, but in the spirit of software carpentry, we also wanted to teach good software engineering practices.

A survey to various mailing lists to gage the interest of a Python course for scientists and how they would feel about a more advanced course rather than the normal introductory courses.

Responses for the survey were great, though most people had stated they would prefer a simpler course, we felt that those are already available several times per year. When we announced the course we asked people who intended to attend to sign up on a mailing list. We got some 100 people signed up, but we figured plenty of people signed up for the list simply because it was easy, and wasn’t really planning to attend. However, when the course started, we had roughly 70 people on the first session! We had booked the largest presentation room in the building, but it was filled to the brim. Personally I had expected the course to be more like a 10-15 people seminar. Now it was something else.

Everything was in Github

In the first session, Roman Valls Guimera gave an interactive lecture where students were to set up their Github accounts, join the Github Organization for the course, and learn how to clone their repos for the course to their computers. Among many other things he taught the students how to structure a Python project. The thought was that at the end, the students should all have a repository for the course which we could later build upon with examples.

Part of the first assignment was to have everything set up so that you could push code to your personal repository in the course organization. Every student had a repository named after their last name. Surprisingly, there were no overlapping last names!

There were a couple of points of confusion here. Some inconsistencies in examples, along with a confusion even among us organizers about where the repo’s for the course should be kept, caused students repositories to be quite varied. But after just this first session we were lenient in the correction, enjoying the success of getting over 50 people to use git for the first time and join Github!

The takeaway from this in the end, is that consistency of organization and examples would have helped the students understand things quicker. And we organizers definitaly could have done a better job communicating with each other in the beginning.

A key here is that we forced the students to use version control, and while we didn’t go through anything particularly advanced, not even branching, getting people to do commits and push to a central server habitually is something that courses normally don’t do.

IPython Notebooks rather than slides

In the second session I talked about how to parse various file formats with Python, and gave recommendations for packages to use in various occasions. Actually, I can show exactly what I talked about, because for rather than using slides as is common, live coding, I used IPython notebooks which I kept on Github for students to read after the sessions.

Session 2 - Getting Data

Personally, I think the notebook is optimal for teaching Python. It takes a bit of explaining what it actually is, and how it works. (That you can run code in cells of a browser is not intuitive when you’re just getting used to the idea of running code in the terminal, as was the case for many students.)

It enables you to prepare a lecture in advance, like slides, but also gives you the abality to change things, or quickly show things you didn’t think of when students have questions.

The session also included a guest lecture by Mats Dahlberg showing how to talk to relational databases with Python in the second hour.

At the end, I went through how to talk to web services using requests. This is a bit unconventional, but we have the convention at SciLifeLab to write and use RESTful API’s, and then it makes a lot of sense to show people how to actually use these.

For the assignment the students were supposed to fetch some data from the web and do some calculation. But as a part of this they were also supposed to structure the repo, and make setup.py install scripts so they could be used globally.

Originally, I wanted to use data from the public transport network about the bus stop outside the SciLifeLab building, but it turned out that data was not available without a for-pay API key. So instead I used the New York City transit system data as an example. An idea I got from Python for Data Analysis.

To push the students to structure their repositories the way that was originally intended I explicitly wrote out what kind of structure they should have after the assignment was done:


This caused some to change their structure at least. Since we had so many students, we really wanted the repos to look basically the same, so we could easily correct assignments, and assist students that needed help without having to figure out their repo structure.

In the end, we never required the students to use the repos in the organization account, and allowed them to update the fork of the repo they had on their personal accounts in stead. Thinking back, it would have been much better to only look at the repos in the organization.

Self correcting assignments

The third session was about some modern Python idioms that people normally don’t see in an introductory course. You can see the notebooks online.

The session also included a brief introduction to NumPy, and I remember talking a fair amount about the difference between lists and arrays.

Session 3 - Modern Python Idioms

When preparing the session I was struggling with three things.

  1. I wanted a self contained example that would use what I had talked aboud.
  2. Many students still had inconsistant repository structures.
  3. I wanted to find time to write a script that would correct assignments automatically.

Out of all the nice things we did with the course, this is the one I’m the proudest of, I made the assignment so that the students were supposed to make a class who’s objects would be used to check that any course repository was correct. This is an assignment that will correct itself! A script was supposed to be written which would output the string “PASS” is the specification of the directory was satisfied, and “FAIL” otherwise.

I made sure to be very explicit in what kind of structure the RepoChecker class should check for, and pointed the students to use git mv if their own repos didn’t satisfy the structure the checker was looking for.

After this session almost every student had the required structure on their repos!

Use data about students generated through the course

The fourth session was much about not reinvinting the wheel, in particular about data containers. I mostly went through use cases for the more useful types in the built in collections module.

Session 4 - Don’t reinvent the wheel

I also talked about SciPy and the SciKits. I didn’t want to delve too deeply in to any particular SciPy submodule as the course was supposed to be wide, and honestly, for bioinformaticians SciPy doesn’t contain extremely useful submodules.

Something I was glad to show and go through was Pandas, a package I really enjoy using. One thing to note, is that though the rendering of a DataFrame as a table in the IPython notebook is really nice, some students got confused of whether Pandas worked in the terminal, so I had to open a terminal and show a couple of examples.

Data containers are used everywhere, but the different ones are used in rather particular instances. So I didn’t even try to think of an assignment that would use the ones I talked about. In stead, I wanted the students to do an assignment where they used Pandas.

Inspired by the previous weeks assignment, I thought the data could be self generated as well. As the course had been going on for a few weeks now, there was quite a bit of activity on the Github repositories. You can see the entire formulation of the assignment at the end of the linked notebook. (I was even more explicit in the description this time. Really, if you want to be able to correct things automatically, you need to be very explicit.)

In gist, the assignment was to use the RESTful API of Github to take data about commits to the organization repos and put this in a Pandas DataFrame of time series, which they then couls use to find trends in the data.

Result: All students found that the vast majority of commits occured the hour before the weekly session (when also the assignment deadline occured).

Many of the students had problems with this assignment, in particular due to one repository that ruined most people strategies for fetching the data. In retrospect I think this was actually a good thing, it really demonstrated how programming turns out a lot of the time in the real world, they had to figure out how to handle the special case.

Scientists should learn CI and testing

In the fifth session Roman returned, and talked about how to work with tests, and how to use a continous integration system. He actually blogged about this here: Automated Python education via unit testing and Travis-CI

Rather than writing too much about it I’ll encourage anyone interested to read his blog post.

The session also had Hossein Farahani showing the graph package Networkx.

(Try to) use the code the students created.

For the sixth session I talked about how to optimize Python code, and quite in depth about how to use iterators / generators.

Session 6 - Optimizing Python

This also lead to an interesting construct in Python that I see very few people use, the coroutines. These are in essence the reverse of a generator, have a look at the notebook if you’re interested.

At the end I talked about how to use various profilers. When preparing this part of the lecture I realized even more how handy the IPython notebook is for teaching. By using the %%bash cell magic, I could store bash commands, and run them and see the output in a very nice way. In the process I also learned about the ability of starting background processes in a notebook cell using %%script --bg bash.

For the assignment, again thinking that it’s the best for the students to work with what they have already created, I figured they could profile the scripts they had made and report the results (by saving the outputs in the repo).

This, however, did not work out at all. (Yes this means the headline was never achieved). One of the profilers I wanted them to use turned out to be more experimental than I had thought, and just plain crashed on the students code. Various other differences in how the students had structured their scripts caused them to not be able to run several of the profilers. In addition to this, their script weren’t doing enough job for the profilers to have much time finding anything interesting.

In the end, we had to give the students an extra week to complete the assignment. And in stead of them using their scripts I found a script online for factoring numbers using the Elliptic Curve Method. I made sure to give very explicit instructions, and made sure the profiling worked.

Parallelization is hard but rewarding

For the seventh and final session of the course I went through the two most common strategies for parallelization. The multiprocessing module for programmatic parallelization, and IPython.parallel for interactive explorative parallelization.

Session 7 - Parallel Python

This session was very popular, it seemed a lot of people realized many of their time consuming problems could be rather easily parallelized to cut down computational time significally. As far as the lecture went, I just described how to use the various components of the modules. For this session the students had plenty of questions, and the feedback about the notebook was that it should have contained more comments and explanatory text.

One particular thing, is that multiprocessing doesn’t work in interactive prompts, so I had to make use of the %%file cell magic to write tiny scripts for each example, then again use the %%bash magic to run them. It went rather smooth, and I think the students understood what was going on.

Another detail, is that I piped the interactive notebook from an 8 core desktop to my laptop, as I figured the examples would be quite dull with my dual core laptop.

For the assignment I wanted the students to factorize many numbers, and calculate some statistics about the factors afterwards. So that this would be emberrasingly parallel over the list of numbers to factor. I wanted them to implement this using both multiprocessing and IPython.parallal.

It turned out that IPython.parallel was very bad for this problem, and everyone was confused as to why the parallel implementation was slower than a serial implementation. I managed to make an implementation that was slightly faster than serial, but only when using a much longer list of numbers than the one specified, and only large numbers.

Meanwhile the multiprocessing implementation gave everyone speedups quite quickly.

Making nicer looking pie charts with matplotlib

Firstly, one should in general stay away from pie charts, showing an area when the data relates to the arc lengths is confusing. Among other things, resizing a pie chart have a tendency to make people change their interpretation of the relation between the slices.

But, sometimes people really want them.

Matplotlib, the Python plotting framework, has a pie function implemented. And it’s used like this:

import numpy as np
import matplotlib.pyplot as plt
from random import shuffle

slices = [1,2,3] * 4 + [20, 25, 30] * 2

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

cmap = plt.cm.prism
colors = cmap(np.linspace(0., 1., len(slices)))

labels = ["Some text"] * len(slices)

ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05)
ax.set_title("Figure 1");


The very first thing that comes to mind is that the black edges doesn’t really work. So as a first action, lets make them white. Unfortunately there’s no keyword for edgecolor in pie, so we need to go through the patches for the slices and change each of their edge color property.

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:

ax.set_title("Figure 2");


Ok, so the point I originally wanted to make with this point, is what you can notice in the top left corner of the pie chart. The label text overlaps when thin slices are too vertical.

Let’s go through some steps on how to mend this. First, if we sort the data by size, we will at least now for sure where the thin slices will end up.

slices = sorted(slices)

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:

ax.set_title("Figure 3");


That’s good, however, firstly, if we have more thin slices they will still after a while end up being too vertical and the label texts will again overlap. Secondly, in my opinion, it looks a bit odd.

For a nicer solution, take the sorted slices, and make four sections; two with large slices, and two with small slices.

large = slices[:len(slices) / 2]
small = slices[len(slices) / 2:]

reordered = large[::2] + small[::2] + large[1::2] + small[1::2]

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(reordered, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:

ax.set_title("Figure 4");


That looks much nicer, but it would be ideal if the thin slices where all as close to horizontal as possible. This we can fix by calculating the angle in degrees that we would need to shift the pie chart in order to get the first small section aligned around the horizontal axis.

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

angle = 180 + float(sum(small[::2])) / sum(reordered) * 360

pie_wedge_collection = ax.pie(reordered, colors=colors, labels=labels, labeldistance=1.05, startangle=angle);

for pie_wedge in pie_wedge_collection[0]:

ax.set_title("Figure 5");



The code for making the pie charts can also be seen in this gist