Gene Discovery Rate in RNA-seq data

I was curious about the rate at which new genes are detected as the number of reads in an RNA-seq data set increases. Let's start by looking at the results.

The figure shows how many genes have nonzero counts at a given read depth.

In particular I was interested in how this discovery rate looks for single-cell RNA-seq data. I found some read files at ENA which were from human cells. The two data sets used are SRR445722 and SRR499825.

In the interest of saving space on my tiny laptop, I only kept the first four million reads in the fastq files. For each read number investigated, reads are first randomly sampled from the fastq file using the reservoir sampling of seqtk. The sampled reads are then mapped to the human transcriptome (Homo_sapiens.GRCh37.73.cdna.abinitio.fa) using bwa mem. To map from transcripts to genes, I use eXpress (with the annotation file Homo_sapiens.GRCh37.73.gtf.gz). The final step is to count the number of genes which have counts different from 0.

In the interest of time and simplicity, I didn’t perform any quality or adapter trimming of the data.

For each read count, the sampling was performed five times with different random seeds. In the figure these replicates can barely be seen, but at some points smaller colored points can be glimpsed.

These are the pipelines for the two data sets:

# Single-end data (SRR445722): sample a fraction of the reads, map with
# bwa mem, quantify with eXpress, and count genes with nonzero counts.
for READS in $(seq 10000 10000 1000000)
do
    for i in $(seq 5)
    do
        echo $READS reads >> sample_num_genes_SRR445722.out
        seqtk sample -s$(date +%s) SRR445722_4M_reads.fastq $(echo "scale=8; $READS / 4000000" | bc) | \
        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa - | \
        express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa
        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes_SRR445722.out
    done
done


and

for READS in $(seq 10000 10000 1000000)
do
    for i in $(seq 5)
    do
        echo $READS reads >> sample_num_genes_SRR499825.out

        SEED=$(date +%s)
        FRAC=$(echo "scale=8; $READS / 4000000" | bc)

        bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa \
        <(seqtk sample -s$SEED SRR499825_1_4M_reads.fastq $FRAC) \
        <(seqtk sample -s$SEED SRR499825_2_4M_reads.fastq $FRAC) \
        | express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa

        cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes_SRR499825.out
    done
done


They are slightly different since SRR499825 is paired-end while SRR445722 is single-end.

I ran these for as long as I could leave my laptop on and stationary at a time.

What I find interesting about this is how smooth these logarithm-like curves are.

Both studies these data come from have plenty of replicates, so a more thorough version of this analysis would be to try data from multiple cells and make sure the results correlate. It would also be interesting to see how bulk RNA-seq compares to these data sets. I also only looked at data from human cells because I didn't want another transcriptome and annotation file to keep track of, but there are a lot more mouse single-cell RNA-seq studies on ENA than human ones.

It would also be fun to run the analysis much longer, up to many millions of reads, where the majority of genes start being detected, and see precisely at which point this happens.

EDIT

I reran the computations, but this time I let both data sets run to completion (a depth of 1,000,000 reads). I also saved the counts for each gene, so I could query detected genes with different lower cutoffs. The new graph above shows the detection rates for lower cutoffs of one, three, and five mapped reads.
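The cutoff query itself is simple. A minimal sketch, assuming the per-gene counts for each sampled depth were saved as tab-separated two-column tables (the file name and column layout here are hypothetical):

import pandas as pd

# Hypothetical file: gene name and estimated counts for one sampled read depth.
counts = pd.read_csv("counts_SRR445722_1000000.tsv", sep="\t",
                     names=["gene", "est_counts"])

# Number of genes detected at each lower cutoff of mapped reads.
for cutoff in (1, 3, 5):
    detected = (counts["est_counts"] >= cutoff).sum()
    print(cutoff, detected)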

More interactive Python plotting with Clojure and Quil

The Python plotting library matplotlib is one of the libraries I read about a lot, and often, to try to keep track of everything it can do. But I keep running into situations where I don't find it very useful.

Here I will write about one issue I had at one point: I found it annoying to sequentially build a graph by adding scatter points to it one set at a time.

This summer I started playing with Clojure, and I enjoyed it quite a lot. Through looking at presentations about it I found Quil. I've only played with Processing a few times before, and I feel like Quil better fits my way of thinking.

So, what I want is to do my regular things with point data in an IPython Notebook, and when I want to add something to a graph, I just send it to a Quil sketch running to the side of my IPython window. This way I can build my graphs over time.

I enjoy working with MongoDB; it has a tendency to just work where I need it and it is easy to get started with. So I used this as a backend, and connected my Quil sketch to the database using the monger library, which is extremely simple to use.

Here follows the entire code for the Quil sketch that does what I want:

(ns square-vis.core
  (:gen-class)
  (:require [monger.core :as mg])
  (:require [monger.collection :as mc])
  (:use monger.operators)
  (:use quil.core))

(mg/connect!)
(mg/set-db! (mg/get-db "test"))

(defn draw-square
  [doc]
  (let [size (get doc :size)
        x    (get (get doc :coord) 0)
        y    (get (get doc :coord) 1)
        c    (get doc :c)]
    (no-stroke)
    (fill c)
    (rect x y size size)))

(defn -main
  [& args])

(defn setup []
  (smooth)
  (frame-rate 30)
  (background 200))

(defn draw []
  (background 200)
  (doseq [item (mc/find-maps "squares")] (draw-square item)))

(defsketch example                  
           :title "Squaaaares!"  
           :setup setup                      
           :draw draw                        
           :size [323 200])                  


This is only for making scatter plots with squares, in grayscale, which I like a lot. The code connects to a database called "test", searches for documents in the collection named "squares", and returns them as Clojure maps, so they are easy to use in the rest of the sketch. The documents in the collection represent squares, with the following schema.

square = {'coord': [10, 20], 'size': 8, 'c': 155}


Some small examples:

import random

import numpy as np
from pymongo import Connection

conn = Connection()
coll = conn['test']['squares']

for i in range(28):
    for j in range(22):
        square = {'coord': [10 + 10 * i, 20 + 10 * j],
                  'size': 5, 'c': (i * j) ** 2 % 251}
        coll.insert(square)

coll.drop()
for k in np.linspace(0.0, 2.0):
    x = np.linspace(1, 350)
    y = k * x
    for xi, yi in zip(x, y):
        doc = {'coord': [xi, yi], 'size': 5 * k + 1, 'c': 100 * k}
        coll.save(doc)

print('running')
while True:
    try:
        for doc in coll.find():
            doc['coord'][0] += 1 * (random.random() - 0.5)
            doc['coord'][1] += 1 * (random.random() - 0.5)
            coll.save(doc)
    except KeyboardInterrupt:
        print('done')
        break

Non-standard Python on UPPMAX

This is something I’ve had to look up at least 10 times in the last year, so to save myself from searching my email inbox yet another time I will write it out here. This post is specific to the Kalkyl cluster at UPPMAX.

To make a virtual environment with Python 2.7 on UPPMAX, do the following

mkvirtualenv --no-site-packages --python=/sw/comp/python/2.7.1_kalkyl/bin/python2.7 new_venv


What’s the problem?

Firstly, the standard Python on Kalkyl is 2.6, which is incompatible with most things we do, as we all tend to use 2.7.

UPPMAX uses the module system to manage different versions of software. Naively, you would think one could make a Python 2.7 virtual environment by doing

$ module load python/2.7.1
$ mkvirtualenv --no-site-packages  new_venv


This will not work.

Instead, one needs to point to the Python binary as it is stored on the filesystem; the one that the module system links to when module load is called. This is located in /sw/comp/python/2.7.1_kalkyl.

Visualizing overlapping intervals

Rather often I find myself wanting to see the pattern in some interval data. I never found a good package for this, so I always reverted to either drawing the intervals one after another along the x = y line, or piling the intervals into histogram-like graphs.

But today I got an idea for how to do it rather simply. What I want is to stack the intervals so they don't overlap, but also don't take up too much space on the Y-axis.

First let’s create some intervals that we want to try stacking.

intervals = [(2, 6), (7, 10), (1, 3), (4, 6), (8, 10), (1, 2), (3, 7), 
            (9, 10), (1, 2), (3, 5), (6, 7), (8, 10), (2, 5), (8, 10)]


Here is the basic idea: we sort the intervals by their start values and give the first interval an initial Y-value. For the next interval, we check whether its lower value is larger than the upper value of the previous interval on that level. If it is, we simply place the interval after the previous one. If it isn't, we increment the Y-value and check the same thing against the last interval in that Y-category.

Here is a python implementation of this Y-categorization.

def intervals2layers(si):
    """Stack sorted intervals into layers so that no layer contains overlaps."""
    layers = [[si[0]]]
    for p in si[1:]:
        # Put the interval in the first layer whose last interval ends
        # before this one starts; otherwise open a new layer.
        for lay in layers:
            if lay[-1][-1] < p[0]:
                lay.append(p)
                break
        else:
            layers.append([p])

    return layers


Let’s try it out!

si = sorted(intervals, key=lambda p: p[0])
layers = intervals2layers(si)

figsize(8, 4)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2, lw=30)

plt.xlim(0, 11)
plt.ylim(0, 6);

This works quite nicely. Let’s try something more problematic; we’ll generate 1024 intervals and see how these end up stacking using this strategy.

import random

intervals = []
for i in range(1024):
    x1 = random.normalvariate(0.5, 0.25)
    x2 = x1 + random.random() / 16
    intervals.append((x1, x2))


si = sorted(intervals, key=lambda p: p[0])
layers = intervals2layers(si)

figsize(20, 10)
for i, lay in enumerate(layers):
    x1, x2 = zip(*lay)
    plt.hlines([i + 1] * len(x1), x1, x2)

ax = plt.gca()
adjust_spines(ax, ['bottom']);

Precisely what I’m after! One can see overall patterns of the distribution of all the intervals, but it’s also possible to see certain subintervals which might be interesting.

This is how some real data looks.

Single Linkage Clustering Heuristic

Explaining the magic fudge in Mapper.

Background

A few months back I attended a couple of seminars by Gunnar Carlsson about topological data analysis, where he discussed some methods they employ at Ayasdi. Given my background in mathematics I thought it was quite nice to see some concrete applications of rather abstract theory. In addition to this, their topology-based analysis methods seem to be successful with bioinformatic data where other methods fail.

The first paper about the method describes it quite well, and on the group's website they also host a Matlab script which they used for the applications in that first paper. This scientific report gives a very nice summary of the capabilities of topology-based data analysis.

Using their refined implementation of the method on your data means contacting them and sending them your data to run their analysis on.

Therefore I thought a nice evening project would be to reimplement the Matlab script in Python, so it can be easily run anywhere, and eventually optimize the implementation somewhat.

A great place to start would be the [upcoming Seqahead Hackathon](http://seqahead.cs.tu-dortmund.de/meetings:2013-06-hackathon), where I hoped to perhaps find some collaborators.

Mapper basics

Just extremely briefly (see the scientific report or paper for details), the basic idea of Mapper is to divide the space of the data into sections. Within each section clustering is performed; these clusters are then combined based on overlaps to form a graph which reveals a "shape" of the data.

Single Linkage Clustering

In the Matlab implementation of Mapper the authors use single linkage clustering. Briefly, single linkage clustering means initially taking every object as a cluster on its own, then pairwise linking together the clusters which are closest to each other. This is repeated until the algorithm is left with one single cluster.

Software performing this kind of clustering usually returns a representation of the tree being built throughout the clustering, with the intermediate distances between clusters in each step preserved.

Mysteries in the Matlab implementation

When reading the Matlab implementation one comes across the parameter magicFudge. This is described in the comments of the script.

%  magicFudge: The number of histograms to divide each homology plot into
%              Increase this number if you expect to see more clusters
%              and decrease it if you are seeing more clusters.
%              Default : 10

Heuristic for number of clusters

This is further explained in the paper (page 5).

Now, to find the number of clusters we use the edge length at which each cluster was merged. The heuristic is that the interpoint distance within each cluster would be smaller than the distance between clusters, so shorter edges are required to connect points within each cluster, but relatively longer edges are required to merge the clusters. If we look at the histogram of edge lengths in C, it is observed experimentally, that shorter edges which connect points within each cluster have a relatively smooth distribution and the edges which are required to merge the clusters are disjoint from this in the histogram. If we determine the histogram of C using k intervals, then we expect to find a set of empty interval(s) after which the edges which are required to merge the clusters appear.

The authors also point out problems with this approach.

Although this heuristic has worked well for many datasets that we have tried, it suffers from the following limitations: (1) If the clusters have very different densities, it will tend to pick out clusters of high density only. (2) It is possible to construct examples where the clusters are distributed in such a way such that we recover the incorrect clustering. Due to such limitations, this part of the procedure is open to exploration and change in the future.

While I was reading the paper I didn't think much about this heuristic, but when trying to implement the method it got me fairly confused.

I figured it would be nice to try the heuristic out and see how well it performs.
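As a sketch of my reading of the heuristic (not the actual Mapper implementation): histogram the merge distances into magicFudge bins, find the first empty bin, and count the edges that fall above it.

import numpy as np

def count_clusters(lens, magic_fudge=10):
    # A sketch of the heuristic as I read it, not Mapper's own code.
    hst, _ = np.histogram(lens, bins=magic_fudge)
    # The first empty bin separates within-cluster edges from the
    # (longer) edges that merge separate clusters.
    z = np.nonzero(hst == 0)[0]
    if len(z) == 0:
        return 1
    return hst[z[0]:].sum()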

First we want to simulate some data; there's a nice function in scikit-learn to generate blobs of data (clusters): sklearn.datasets.make_blobs. Let's generate some clusters.

from sklearn import datasets

N = 300  # number of points to sample
num_clusters = 3
X, y = datasets.make_blobs(n_samples=N, centers=num_clusters)

(Throughout this post, I will omit the commands for creating plots, but it can all be viewed in this IPython notebook.)

To perform single linkage clustering, we need a distance matrix between all points in the data set.

from scipy.spatial import distance
d = distance.cdist(X, X)

Then we can use, e.g. this new fastcluster package to perform the clustering.

import fastcluster
Z = fastcluster.linkage(d, method='single')

This creates the linkage Z, a (num_points - 1, 4)-shaped array where the distance between clusters Z[i, 0] and Z[i, 1] is Z[i, 2].

This graph contains a dendrogram representation of the linkage.

In Mapper they also add the longest distance from the distance matrix to the array of inter-cluster distances.

lens = np.zeros(Z.shape[0] + 1)
lens[:-1] = Z[:, 2]
lens[-1] = d.max()

They also introduce the notion of calling this collection of distances a “lens”. I like this as it invokes the idea of “unfocusing” the view of the data to see the overall shape.

Now just look at the histogram of distances.

This finds one cluster, as can be seen by doing the following.

hst, _ = np.histogram(lens, bins=10)  # 10 is the default magicFudge
z = np.nonzero(hst == 0)[0]
hst[z[0]:len(hst)].sum()

Out[613]: 1

It's pretty much impossible to see, but this cluster has its distances in the 9th bin of the histogram. This is obviously wrong, since we knew beforehand that we made three clusters in this example. Thus we should "focus" the lens by increasing the magicFudge; in this case I increase it to 64.

magic_fudge = 64
hst, _ = np.histogram(lens, bins=magic_fudge)

z = np.nonzero(hst == 0)[0]
hst[z[0]:len(hst)].sum()

Out[632]: 3

Here we see the pattern discussed in the paper quite clearly: a nice, almost normal distribution of small distances, breaking at the 9th bin (black arrow), followed by three bins at various locations with a single distance in each.

Now to the issue that motivated this blog post: it took me many tries of generating example clusters, as well as trying different magicFudge values, to arrive at this last histogram.

Testing the heuristic

To see how the heuristic performed, I wrote some code that randomly generates between one and six clusters, performs the cluster count using nine different magicFudge-values, and collects how many times for each value the heuristic found the correct number of clusters.

import random

magic_fudge_settings = [8, 16, 25, 32, 50, 64, 100, 128, 256]
hit_counts = [0] * len(magic_fudge_settings)
n_tests = 500
for i in range(n_tests):
    true_num_clusters = random.randint(1, 6)

    X, _ = datasets.make_blobs(n_samples=300, centers=true_num_clusters)

    d = distance.cdist(X, X)
    Z = fastcluster.linkage(d, method='single')
    lens = np.zeros(Z.shape[0] + 1)
    lens[:-1] = Z[:, 2]
    lens[-1] = d.max()

    for j, magic_fudge in enumerate(magic_fudge_settings):
        hst, _ = np.histogram(lens, bins=magic_fudge)
        z = np.nonzero(hst == 0)[0]
        if len(z) == 0:
            num_clusters = 1
        else:
            num_clusters = hst[z[0]:len(hst)].sum()

        if num_clusters == true_num_clusters:
            hit_counts[j] += 1

Results

The results are best summarized by the following plot.

These results are quite bad in my opinion. However, when using this process to find clusters in order to make a graph of connections between them, it is probably not that important to be accurate: as long as enough overlapping clusters are found we will end up with a nice network that describes the data. But I very much doubt this is the cluster counting method used in the production implementation at Ayasdi.

Teaching Python at SciLifeLab

This spring we organized the KTH course Scientific Programming in Python for Computational Biology, hosted at SciLifeLab Stockholm.

The course ran over seven weeks with one two-hour session every Wednesday, with time allotted after each session for students to ask questions about the assignments. Examination was by passing the weekly assignments in a timely fashion. Our twist on this classic concept was that all hand-ins had to be done via Github.

The structure of the course was inspired by Software Carpentry.

There was demand for a course that goes beyond the basics of Python, and one challenge with this is that even though there are plenty of online resources for introducing people to Python, structured lectures, and in particular exercises, covering more advanced topics are hard to come by.

We had a brainstorming session where we talked about what kind of things we wanted to cover in the course. The course was intended for scientists, but in the spirit of software carpentry, we also wanted to teach good software engineering practices.

We sent a survey to various mailing lists to gauge the interest in a Python course for scientists, and to ask how people would feel about a more advanced course rather than the usual introductory ones.

The responses to the survey were great. Though most people stated they would prefer a simpler course, we felt that those are already available several times per year. When we announced the course we asked people who intended to attend to sign up on a mailing list. We got some 100 people signed up, but we figured plenty of people signed up for the list simply because it was easy, and weren't really planning to attend. However, when the course started, we had roughly 70 people at the first session! We had booked the largest presentation room in the building, but it was filled to the brim. Personally I had expected the course to be more like a 10-15 person seminar. Now it was something else.

Everything was in Github

In the first session, Roman Valls Guimera gave an interactive lecture where students were to set up their Github accounts, join the Github Organization for the course, and learn how to clone their repos for the course to their computers. Among many other things he taught the students how to structure a Python project. The thought was that at the end, the students should all have a repository for the course which we could later build upon with examples.

Part of the first assignment was to have everything set up so that you could push code to your personal repository in the course organization. Every student had a repository named after their last name. Surprisingly, there were no overlapping last names!

There were a couple of points of confusion here. Some inconsistencies in examples, along with confusion even among us organizers about where the repos for the course should be kept, caused the students' repositories to be quite varied. But after just this first session we were lenient in the correction, enjoying the success of getting over 50 people to use git for the first time and join Github!

The takeaway from this, in the end, is that consistency of organization and examples would have helped the students understand things more quickly. And we organizers definitely could have done a better job communicating with each other in the beginning.

A key here is that we forced the students to use version control, and while we didn’t go through anything particularly advanced, not even branching, getting people to do commits and push to a central server habitually is something that courses normally don’t do.

IPython Notebooks rather than slides

In the second session I talked about how to parse various file formats with Python, and gave recommendations for packages to use on various occasions. Actually, I can show exactly what I talked about, because rather than using slides or, as is common, live coding, I used IPython notebooks which I kept on Github for students to read after the sessions.

Session 2 - Getting Data

Personally, I think the notebook is optimal for teaching Python. It takes a bit of explaining what it actually is and how it works. (That you can run code in the cells of a browser page is not intuitive when you're just getting used to the idea of running code in the terminal, as was the case for many students.)

It enables you to prepare a lecture in advance, like slides, but also gives you the ability to change things, or quickly show things you didn't think of when students have questions.

The session also included a guest lecture by Mats Dahlberg showing how to talk to relational databases with Python in the second hour.

At the end, I went through how to talk to web services using requests. This is a bit unconventional, but at SciLifeLab we have the convention of writing and using RESTful APIs, and then it makes a lot of sense to show people how to actually use them.

For the assignment the students were supposed to fetch some data from the web and do some calculation with it. But as part of this they were also supposed to structure the repo and write a setup.py that installs their scripts, so the scripts could be used globally.
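A minimal sketch of what such a setup.py could look like, assuming the repository layout shown further down (the name fields are placeholders, not anything a student actually wrote):

from setuptools import setup, find_packages

setup(
    name='lastname',                      # placeholder: the student's repo name
    version='0.1.0',
    packages=find_packages(),
    scripts=['scripts/getting_data.py'],  # installed onto the PATH
)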

Originally, I wanted to use data from the public transport network about the bus stop outside the SciLifeLab building, but it turned out that data was not available without a for-pay API key. So instead I used the New York City transit system data as an example. An idea I got from Python for Data Analysis.

To push the students to structure their repositories the way that was originally intended I explicitly wrote out what kind of structure they should have after the assignment was done:

[lastname]/
    [lastname]/
        __init__.py
    scripts/
        getting_data.py
    README.md
    setup.py

This caused some to change their structure, at least. Since we had so many students, we really wanted the repos to look basically the same, so we could easily correct assignments and assist students that needed help without having to figure out their repo structure.

In the end, we never required the students to use the repos in the organization account, and allowed them to update the forks of the repos they had on their personal accounts instead. Thinking back, it would have been much better to only look at the repos in the organization.

Self correcting assignments

The third session was about some modern Python idioms that people normally don’t see in an introductory course. You can see the notebooks online.

The session also included a brief introduction to NumPy, and I remember talking a fair amount about the difference between lists and arrays.

Session 3 - Modern Python Idioms

When preparing the session I was struggling with three things.

  1. I wanted a self-contained example that would use what I had talked about.
  2. Many students still had inconsistent repository structures.
  3. I wanted to find time to write a script that would correct assignments automatically.

Out of all the nice things we did with the course, this is the one I'm proudest of: I made the assignment so that the students were supposed to make a class whose objects would be used to check that any course repository was correct. This is an assignment that corrects itself! A script was supposed to be written which would output the string "PASS" if the specification of the directory was satisfied, and "FAIL" otherwise.

I made sure to be very explicit in what kind of structure the RepoChecker class should check for, and pointed the students to use git mv if their own repos didn’t satisfy the structure the checker was looking for.
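A minimal sketch of what such a checker might have looked like, based on the layout from the earlier assignment (the class and method names are my own, not the exact specification):

import os


class RepoChecker(object):
    """Sketch of a checker for the required course repository layout."""

    REQUIRED = [
        '{name}/__init__.py',
        'scripts/getting_data.py',
        'README.md',
        'setup.py',
    ]

    def __init__(self, repo_path):
        self.repo_path = repo_path
        self.name = os.path.basename(os.path.abspath(repo_path))

    def check(self):
        # Every required path must exist relative to the repository root.
        for template in self.REQUIRED:
            path = os.path.join(self.repo_path, template.format(name=self.name))
            if not os.path.exists(path):
                return False
        return True


if __name__ == '__main__':
    print('PASS' if RepoChecker('.').check() else 'FAIL')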

After this session almost every student had the required structure on their repos!

Use data about students generated through the course

The fourth session was much about not reinventing the wheel, in particular regarding data containers. I mostly went through use cases for the more useful types in the built-in collections module.

Session 4 - Don’t reinvent the wheel

I also talked about SciPy and the SciKits. I didn't want to delve too deeply into any particular SciPy submodule as the course was supposed to be broad, and honestly, for bioinformaticians SciPy doesn't contain extremely useful submodules.

Something I was glad to show and go through was Pandas, a package I really enjoy using. One thing to note is that though the rendering of a DataFrame as a table in the IPython notebook is really nice, some students got confused about whether Pandas worked in the terminal, so I had to open a terminal and show a couple of examples.

Data containers are used everywhere, but the different ones are used in rather particular instances. So I didn't even try to think of an assignment that would use the ones I talked about. Instead, I wanted the students to do an assignment where they used Pandas.

Inspired by the previous week's assignment, I thought the data could be self-generated as well. As the course had been going on for a few weeks now, there was quite a bit of activity on the Github repositories. You can see the entire formulation of the assignment at the end of the linked notebook. (I was even more explicit in the description this time. Really, if you want to be able to correct things automatically, you need to be very explicit.)

In gist, the assignment was to use the RESTful API of Github to fetch data about commits to the organization repos and put this in a Pandas DataFrame of time series, which they could then use to find trends in the data.
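Roughly, the core of it looks something like this (a sketch with a hypothetical organization name, ignoring authentication, rate limits, and pagination):

import requests
import pandas as pd

ORG = 'some-course-org'  # hypothetical organization name

# List the repositories in the organization, then fetch the commits per repo.
repos = requests.get('https://api.github.com/orgs/{0}/repos'.format(ORG)).json()

rows = []
for repo in repos:
    url = 'https://api.github.com/repos/{0}/{1}/commits'.format(ORG, repo['name'])
    for commit in requests.get(url).json():
        rows.append({'repo': repo['name'],
                     'time': commit['commit']['author']['date']})

df = pd.DataFrame(rows)
df['time'] = pd.to_datetime(df['time'])

# For example, count commits per hour of the day across all repos.
commits_per_hour = df['time'].dt.hour.value_counts().sort_index()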

Result: all students found that the vast majority of commits occurred in the hour before the weekly session (when the assignment deadline also occurred).

Many of the students had problems with this assignment, in particular due to one repository that ruined most people's strategies for fetching the data. In retrospect I think this was actually a good thing: it really demonstrated how programming turns out a lot of the time in the real world, and they had to figure out how to handle the special case.

Scientists should learn CI and testing

In the fifth session Roman returned and talked about how to work with tests and how to use a continuous integration system. He actually blogged about this here: Automated Python education via unit testing and Travis-CI

Rather than writing too much about it I’ll encourage anyone interested to read his blog post.

The session also had Hossein Farahani showing the graph package Networkx.

(Try to) use the code the students created.

For the sixth session I talked about how to optimize Python code, and quite in-depth about how to use iterators and generators.

Session 6 - Optimizing Python

This also led to an interesting construct in Python that I see very few people use: coroutines. These are in essence the reverse of a generator; have a look at the notebook if you're interested.

At the end I talked about how to use various profilers. When preparing this part of the lecture I realized even more how handy the IPython notebook is for teaching. By using the %%bash cell magic, I could store bash commands, run them, and see the output in a very nice way. In the process I also learned about the ability to start background processes in a notebook cell using %%script --bg bash.

For the assignment, again thinking that it's best for the students to work with what they have already created, I figured they could profile the scripts they had made and report the results (by saving the outputs in the repo).

This, however, did not work out at all. (Yes, this means the headline was never achieved.) One of the profilers I wanted them to use turned out to be more experimental than I had thought, and just plain crashed on the students' code. Various other differences in how the students had structured their scripts meant several of the profilers couldn't be run at all. In addition to this, their scripts weren't doing enough work for the profilers to find anything interesting.

In the end, we had to give the students an extra week to complete the assignment. And instead of them using their own scripts, I found a script online for factoring numbers using the Elliptic Curve Method. I made sure to give very explicit instructions, and made sure the profiling worked.

Parallelization is hard but rewarding

For the seventh and final session of the course I went through the two most common strategies for parallelization: the multiprocessing module for programmatic parallelization, and IPython.parallel for interactive, explorative parallelization.

Session 7 - Parallel Python

This session was very popular; it seemed a lot of people realized many of their time-consuming problems could rather easily be parallelized to cut down computational time significantly. As far as the lecture went, I just described how to use the various components of the modules. For this session the students had plenty of questions, and the feedback about the notebook was that it should have contained more comments and explanatory text.

One particular thing is that multiprocessing doesn't work in interactive prompts, so I had to make use of the %%file cell magic to write tiny scripts for each example, and then again use the %%bash magic to run them. It went rather smoothly, and I think the students understood what was going on.

Another detail is that I piped the interactive notebook from an 8-core desktop to my laptop, as I figured the examples would be quite dull on my dual-core laptop.

For the assignment I wanted the students to factorize many numbers and calculate some statistics about the factors afterwards, so that the problem would be embarrassingly parallel over the list of numbers to factor. I wanted them to implement this using both multiprocessing and IPython.parallel.

It turned out that IPython.parallel was very poorly suited to this problem, and everyone was confused as to why the parallel implementation was slower than a serial one. I managed to make an implementation that was slightly faster than serial, but only when using a much longer list of numbers than the one specified, and only with large numbers.

Meanwhile the multiprocessing implementation gave everyone speedups quite quickly.
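The multiprocessing side of the assignment is essentially a Pool.map over the list of numbers. A minimal sketch, with naive trial division standing in for the actual factoring code:

from collections import Counter
from multiprocessing import Pool


def factorize(n):
    """Return the prime factors of n by naive trial division
    (a stand-in for the Elliptic Curve Method script)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


if __name__ == '__main__':
    numbers = range(1000000, 1001000)
    pool = Pool(processes=4)
    all_factors = pool.map(factorize, numbers)  # embarrassingly parallel
    pool.close()
    pool.join()

    # Some simple statistics about the factors.
    counts = Counter(f for factors in all_factors for f in factors)
    print(counts.most_common(5))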

Making nicer looking pie charts with matplotlib

Firstly, one should in general stay away from pie charts; showing an area when the data really relates to the arc lengths is confusing. Among other things, resizing a pie chart has a tendency to make people change their interpretation of the relation between the slices.

But, sometimes people really want them.

Matplotlib, the Python plotting framework, has a pie function implemented. And it’s used like this:

import numpy as np
import matplotlib.pyplot as plt
from random import shuffle

slices = [1,2,3] * 4 + [20, 25, 30] * 2
shuffle(slices)

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

cmap = plt.cm.prism
colors = cmap(np.linspace(0., 1., len(slices)))

labels = ["Some text"] * len(slices)

ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05)
ax.set_title("Figure 1");

[Figure 1]

The very first thing that comes to mind is that the black edges don't really work. So as a first action, let's make them white. Unfortunately there's no keyword argument for edge color in pie, so we need to go through the patches for the slices and change each one's edge color property.

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:
    pie_wedge.set_edgecolor('white')

ax.set_title("Figure 2");

[Figure 2]

OK, so the point I originally wanted to make with this post is what you can notice in the top left corner of the pie chart: the label text overlaps when thin slices are too vertical.

Let's go through some steps for how to mend this. First, if we sort the data by size, we will at least know for sure where the thin slices will end up.

slices = sorted(slices)

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(slices, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:
    pie_wedge.set_edgecolor('white')

ax.set_title("Figure 3");

[Figure 3]

That's better. However, if we have more thin slices they will still, after a while, end up being too vertical, and the label texts will again overlap. Secondly, in my opinion, it looks a bit odd.

For a nicer solution, take the sorted slices, and make four sections; two with large slices, and two with small slices.

large = slices[:len(slices) // 2]
small = slices[len(slices) // 2:]

reordered = large[::2] + small[::2] + large[1::2] + small[1::2]

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

pie_wedge_collection = ax.pie(reordered, colors=colors, labels=labels, labeldistance=1.05);

for pie_wedge in pie_wedge_collection[0]:
    pie_wedge.set_edgecolor('white')

ax.set_title("Figure 4");

[Figure 4]

That looks much nicer, but it would be ideal if the thin slices were all as close to horizontal as possible. This we can fix by calculating the angle in degrees that we would need to rotate the pie chart in order to get the first small section aligned around the horizontal axis.

fig = plt.figure(figsize=[10, 10])
ax = fig.add_subplot(111)

angle = 180 + float(sum(small[::2])) / sum(reordered) * 360

pie_wedge_collection = ax.pie(reordered, colors=colors, labels=labels, labeldistance=1.05, startangle=angle);

for pie_wedge in pie_wedge_collection[0]:
    pie_wedge.set_edgecolor('white')

ax.set_title("Figure 5");

[Figure 5]

Perfect!

The code for making the pie charts can also be seen in this gist.

Loadings with scikit-learn PCA

The past couple of weeks I’ve been taking a course in data analysis for *omics data. One part of the course was about using PCA to explore your data.

Principal Component Analysis, in essence, takes high-dimensional data and finds a projection such that the variance is maximized along the first basis vector. It then finds a second basis vector which is orthogonal to the first and maximizes the remaining variance, and so on for as many components as one wishes to explore.

The basis vectors are referred to as 'principal components', and the data points projected onto the principal components are called 'scores'.

One can also look at how much each variable in the original data contributes to the new basis vectors. These contributions are referred to as 'loadings', and give an idea of which variables in the original data contribute most to the variance in the scores.

For the course we were instructed in various analysis suites with all sorts of GUIs and predefined analyses. I, however, felt that this was a perfect excuse to finally get some experience with Pandas and scikit-learn.

So just imagine I have some data in the DataFrame data. This is how you do a PCA with two components in scikit-learn and plot the result.

[Figure: scatter plot of the scores for the first two principal components]
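A minimal sketch of the same idea, assuming data is a pandas DataFrame with samples as rows:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit a two-component PCA and project the data onto the components (the scores).
pca = PCA(n_components=2)
scores = pca.fit_transform(data.values)  # assumes data is a pandas DataFrame

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')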

Very intuitive and tidy. However, the problem comes when one wants to look at the loadings. Most packages, for example R, will give you the loadings as well as the scores. Here the only things we get are these:

pca.components_
pca.explained_variance_
pca.explained_variance_ratio_
pca.mean_

(In scikit-learn, fields ending with _ are fields that were generated by training/fitting.)

Looking under the hood of PCA on GitHub reveals that the PCA.fit method is just a wrapper for PCA._fit(), which returns the PCA object itself to allow for chaining method calls. The _fit() method performs an SVD on the data matrix, and sets the field pca.components_ to the first n_components columns of the right singular matrix. The rows of this new matrix will be the loading points!

Thus we can plot the loadings in this fashion

[Figure: scatter plot of the loadings]
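A sketch of how one might plot them from pca.components_, with the column names of data labeling each point (my reconstruction, not necessarily the original code):

# Each column of pca.components_ (shape 2 x n_features) holds the loadings
# of one original variable on the two components.
loadings = pca.components_

plt.scatter(loadings[0, :], loadings[1, :])
for name, x, y in zip(data.columns, loadings[0, :], loadings[1, :]):
    plt.annotate(name, (x, y))

plt.xlabel('PC1 loading')
plt.ylabel('PC2 loading')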

Splitting a file

Let us say we have a large file where every line is a separate piece of data. Continue to say that the lines can be classified by some criterion, and we would like to split the large file in to several smaller files based on this classification.

As an example, I created a 100 MB file of just numbers counting upwards on each line, padded with zeroes, using this script
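Something along these lines (a sketch of the idea; the file name and the exact padding width are my guesses, chosen to land at roughly 100 MB):

# Write ~1.3 million zero-padded line numbers, giving a file of roughly 100 MB.
# The file name 'numbers.txt' is my own choice for the example.
with open('numbers.txt', 'w') as fh:
    for i in range(1300000):
        fh.write('%075d\n' % i)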

This file has about 1.3 million lines. Let us split it into several files based on what the number on each line is modulo 2000! We can quickly implement this as
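a naive version along these lines (a sketch of the idea, not necessarily the original script):

# Naive splitter: open, append to, and close the output file for every single line.
with open('numbers.txt') as source:
    for line in source:
        num = int(line) % 2000
        with open('number_%04d.txt' % num, 'a') as out:
            out.write(line)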

We try the script out and time it, which is quite a dull experience since what we get will be this:

By doing some investigation one notices that what is actually taking a lot of the time is the open() function. For every one of those 1.3 million lines in the source file we open a file, write to it, and close it. Just doing wc -l on one of the output files reveals that each file must have been opened and closed roughly 644 times! This sounds fairly unnecessary, and it is.

Now, in this particular case, we know for sure that we will need to open precisely 2000 files, and we can figure out the names of those files a priori. So a strategy would be to open all possible number_XXXX.txt at the beginning and close them all after the writing is complete. But to keep things a bit more interesting, allow us to feign some ignorance and say that we only figure out what the output file is supposed to be when reading the lines from the source file.

Cached file handles

For this, I created a class that inherits from dict, where the names of the files are the keys and the file handles are the values.
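A sketch of the idea (my reconstruction from the description, not the exact script):

class FileSplitter(dict):
    """Sketch: cache file handles, with file names as keys and open handles as values."""

    def get_handle(self, fname):
        try:
            return self[fname]
        except KeyError:
            # Not opened yet: open the file and cache the handle.
            handle = open(fname, 'a')
            self[fname] = handle
            return handle

    def close_all(self):
        for handle in self.values():
            handle.close()


splitter = FileSplitter()
with open('numbers.txt') as source:
    for line in source:
        num = int(line) % 2000
        splitter.get_handle('number_%04d.txt' % num).write(line)

splitter.close_all()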

The main logic is to try to fetch the file handle for the file you wish to write to, opening it and storing the name and handle in case it isn't already open. So what are the results of this?

A tenth of the previous time!

While we are doing comparisons…

Let’s try out the “PyPy” everyone is talking about.

Not particularly impressive… But of course, the time is still dominated by opening and closing files.

That’s pretty cool! Less than half the time of CPython. I’m not going to pretend to understand why, but it’s good to keep in mind if somethings needs to be fast (and doesn’t require too many third party modules).

Great! Now we can split files quickly!

Well… no. Not really. Not in general anyway.

See, I actually chose to split the source file into 2000 files for a reason. Let us try the cached file splitter with 3000 files instead.

...
    num = int(line) % 3000
...

Now it gets a bit grittier. What actually happens when open() is called in Python is that a file descriptor is reserved, and the number of available file descriptors is limited by the operating system. You can find out how many are available by running ulimit -n, and in many cases you can actually change the limit by simply appending the desired number of file descriptors to that command. On my laptop the number is set to 2560 descriptors. In other places I've experienced it to be around 16000, in which case splitting into 3000 files shouldn't pose a problem.

So why not just increase the descriptor limit and continue? You may not always have the privilege to do so, and in some cases even 16000 might be too few. Also, one cannot always figure out from the beginning how many file descriptors will be needed.

What we can do is introduce some resource management into the FileSplitter.
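Something along these lines, continuing the sketch from above: catch the "Too many open files" error and close a batch of cached handles before retrying (the batch size of 100 is an arbitrary choice of mine).

import errno


class ManagedFileSplitter(FileSplitter):
    """Sketch: like FileSplitter, but closes some cached handles when the
    operating system refuses to hand out more file descriptors."""

    def get_handle(self, fname):
        try:
            return self[fname]
        except KeyError:
            try:
                handle = open(fname, 'a')
            except IOError as e:
                if e.errno != errno.EMFILE:  # EMFILE: "Too many open files"
                    raise
                # Close some of the cached handles; the order of self.items()
                # is hash order, so the choice is essentially arbitrary.
                for name, cached in list(self.items())[:100]:
                    cached.close()
                    del self[name]
                handle = open(fname, 'a')
            self[fname] = handle
            return handle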

There are many problems here. First of all, one should really investigate how many files need to be closed when too many are open, to get past the exception. And for this particular problem there will always be a regular pattern of cache misses forcing us to open more files, since we will always go from 0 to 2999, in that order, over and over again, while the list of dictionary items returned by self.items() is ordered by the hashes and should be quite random. For a less trivial problem, where some files are accessed more often than others, randomly closing files might work out better than it does in this case.

Anyway, having increased the number of files to split into to 3000, for the naive splitting we now have

And for the resource managed version we get

So there is still a performance gain, but not nearly as impressive as before!

Conclusions

Splitting a large file into a moderate number of other files can be done efficiently, if one is aware of the limitations of the system (or is in control enough to change them).

What brought this on?

For work I've been working on a script to demultiplex FASTQ files; that means taking lines of data from a large file and putting them into different files based on a string of six symbols, from an alphabet of five characters, at the end of each line (the alphabet being {A, C, G, T, N}). The theoretical maximal number of files would be \( 5^6 = 15625 \), but this should never really happen; for one thing, N's are supposed to be rare.

The system where I did most of the work had a file descriptor limit of 16000, but when I had to do some of the work on my personal computer I ran into the resource issue. And sure, I could have increased my ulimit -n, but I figured making the script safe was a good idea. It should maybe be mentioned that the resource-managed version runs almost as fast as the normal cached version if the "Too many files" exception is never reached.

Solution path for 'Towers of Hanoi'

A few years ago I figured out a nice strategy for solving the Towers of Hanoi puzzle in the minimum number of moves. By adding a few restrictions to which moves are allowed one gets one single legal move in every (except the initial) step.

To be able to play with the game easily I represented it as integers stacked on each other, representing the sizes of the corresponding tiles. So for example the initial position for three tiles would look like this:

1
2
3
-   -   -

For some reason, it occurred to me that if one took the sum of every column in such a representation, one would get a point in \( \mathbb{R}^3 \) with the property that \( x + y + z = s \), where \( s \) is the sum of all numbers up to the number of tiles. This means every such point, after every move in the game, will lie in a plane.

If one plays the game for a few low values of \( s \) and plots the path between the points generated by the moves, one starts seeing a pattern, which is natural given the symmetric and recursive structure of the game. At the time I imagined a nice fractal pattern would form if one let the number of tiles grow large, but I never investigated it.

Now, on a whim, after doing some things in Python during the day that reminded me of the Towers puzzle, I wrote a script that plays the game for a given number of tiles, saves the point at every step, and plots the path when it is done.
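Roughly along these lines (a sketch using the standard recursive solution rather than the restricted-move strategy described above; not the original script):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3d projection


def hanoi_path(n):
    """Play Towers of Hanoi with n tiles and record the column sums
    (x, y, z) after every move."""
    pegs = [list(range(1, n + 1)), [], []]  # peg 0 holds tiles 1..n, 1 on top
    points = [tuple(sum(p) for p in pegs)]

    def move(k, src, dst, aux):
        if k == 0:
            return
        move(k - 1, src, aux, dst)
        pegs[dst].insert(0, pegs[src].pop(0))  # move the k:th tile
        points.append(tuple(sum(p) for p in pegs))
        move(k - 1, aux, dst, src)

    move(n, 0, 2, 1)
    return points


points = hanoi_path(10)
xs, ys, zs = zip(*points)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(xs, ys, zs)
plt.show()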

Writing the script was fun, but the results were far from as interesting or pretty as I had thought. Here is a picture of the path for ten tiles.

Path for the solution of Towers of Hanoi with 10 tiles

The basic problem, visually, is that the path folds in over itself when the number of tiles gets larger than about four.