Loadings with scikit-learn PCA

The past couple of weeks I’ve been taking a course in data analysis for *omics data. One part of the course was about using PCA to explore your data.

Principal Component Analysis, in essence, takes high-dimensional data and finds a projection such that the variance along the first basis vector is maximized. It then finds a second basis vector, orthogonal to the first, that maximizes the variance given that constraint, and so on for as many components as one wishes to explore.

The basis vectors are referred to as ‘principal components’, and the data points projected onto the principal components are called ‘scores’.

One can also find the data points’ distances from the vectors that become the new basis vectors. These are referred to as ‘loadings’, and they give an idea of which basis directions in the original data contribute the most to the variance in the scores.

For the course we were instructed in various analysis suites with all sorts of GUIs and predefined analyses. I, however, felt that this was a perfect excuse to finally get some experience with Pandas and scikit-learn.

So just imagine I have some data in the DataFrame data. This is how you do a PCA for two components with scikit-learn and plot the result.
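In outline it looks something like the following sketch, assuming the observations are the rows of data and that matplotlib is used for the plotting:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Fit a two-component PCA and project the data onto the components.
    pca = PCA(n_components=2)
    scores = pca.fit_transform(data.values)

    # Scatter plot of the scores on the first two principal components.
    plt.scatter(scores[:, 0], scores[:, 1])
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.show()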

Scatter plot of the scores on the two principal components

Very intuitive and tidy. However, the problem comes when one wants to look at the loadings. Most packages, for example R, will give you the loadings as well as the scores. Here the only things we get are these:

pca.components_
pca.explained_variance_
pca.explained_variance_ratio_
pca.mean_

(In scikit-learn, fields ending with _ are fields that were generated by training/fitting.)

Looking under the hood of PCA, on GitHub, reveals that the PCA.fit method is just a wrapper for PCA._fit(), which returns the PCA object itself to allow for chaining method calls. The _fit() method performs an SVD on the data matrix and sets the field pca.components_ to the first n_components columns of the right singular matrix. The rows of this new matrix will be the loading points!

Thus we can plot the loadings in this fashion:
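Roughly like this, continuing from the snippet above (pca, data and plt as before); note that in recent scikit-learn versions pca.components_ has shape (n_components, n_features), so it is transposed here to get one loading point per original variable, and the axis labels are just illustrative:

    # One row per original variable, one column per principal component.
    loadings = pca.components_.T

    plt.scatter(loadings[:, 0], loadings[:, 1])
    for name, (x, y) in zip(data.columns, loadings):
        plt.annotate(name, (x, y))
    plt.xlabel('PC1 loading')
    plt.ylabel('PC2 loading')
    plt.show()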

Scatter plot of the loadings

Splitting a file

Let us say we have a large file where every line is a separate piece of data. Say further that the lines can be classified by some criterion, and that we would like to split the large file into several smaller files based on this classification.

As an example, I created a 100 MB file of just numbers counting upwards on each line, padded with zeroes, using this script:
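Something along these lines; the numbers.txt name, the padding width and the exact line count below are guesses chosen to land at roughly 100 MB:

    # Write zero-padded line numbers until the file is roughly 100 MB
    # (1.3 million lines of 76 bytes each).
    with open('numbers.txt', 'w') as out:
        for i in range(1300000):
            out.write('%075d\n' % i)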

This file has about 1.3 million lines. Let us split it into several files based on the number on each line modulo 2000! We can quickly implement this as follows:
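A naive version would look roughly like this, with the number_XXXX.txt naming that comes up again below:

    # Naive splitter: open, write to, and close the target file for
    # every single line in the source file (append mode, since each
    # target file is revisited many times).
    with open('numbers.txt') as source:
        for line in source:
            num = int(line) % 2000
            with open('number_%04d.txt' % num, 'a') as out:
                out.write(line)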

We try the script out and time it, which is quite a dull experience since what we get will be this:

By doing some investigation one notices that what is actually taking a lot of the time is the open() function. For every one of those 1.3 million lines in the source file we open a file, write to it, and close it. Just doing wc -l on one of the output files reveals that each file must have been opened and closed roughly 644 times! This sounds fairly unnecessary, and it is.

Now, in this particular case, we know for sure that we will need to open precisely 2000 files, and we can figure out the names of those files a priori. So a strategy would be to open all possible number_XXXX.txt files at the beginning and close them all after the writing is complete. But to keep things a bit more interesting, allow us to feign some ignorance and say that we only figure out what the output file is supposed to be when reading the lines from the source file.

Cached file handles

For this, I created a class that inherits from dict: the names of the files are the keys, and the file handles are the values. Here follows the entire script:
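Or rather, since the original is not reproduced here, a sketch of it; the number_%04d.txt pattern and the 2000 output files are carried over from above:

    import sys

    class FileSplitter(dict):
        """Cache of open file handles, keyed by file name."""
        def __missing__(self, name):
            # First time this file name is seen: open the file and
            # cache the handle under its name.
            handle = open(name, 'w')
            self[name] = handle
            return handle

        def close_all(self):
            for handle in self.values():
                handle.close()

    def split(source_name, n=2000):
        splitter = FileSplitter()
        with open(source_name) as source:
            for line in source:
                num = int(line) % n
                splitter['number_%04d.txt' % num].write(line)
        splitter.close_all()

    if __name__ == '__main__':
        split(sys.argv[1])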

The main logic is to try to fetch the file handle for the file you wish to write to; if it isn't open yet, open it and store the name and handle. So what are the results of this?

A tenth of the previous time!

While we are doing comparisons…

Let’s try out the “PyPy” everyone is talking about.

Running the naive splitter under PyPy is not particularly impressive… But of course, the time is still dominated by opening and closing files.

The cached version under PyPy, on the other hand: that's pretty cool! Less than half the time of CPython. I'm not going to pretend to understand why, but it's good to keep in mind if something needs to be fast (and doesn't require too many third-party modules).

Great! Now we can split files quickly!

Well… no. Not really. Not in general anyway.

See, I actually chose to split up the source file into 2000 files for a reason. Let us try the cached file splitter with 3000 files instead.

...
    num = int(line) % 3000
...

Now it gets a bit grittier. What actually happens when open() is called in Python is that a file descriptor is reserved, and the number of available file descriptors is limited by the operating system. You can find out how many are available by running ulimit -n, and in many cases you can actually change the limit by simply appending the desired number of file descriptors to that command. On my laptop the number is set to 2560 descriptors. In other places I've seen it set to around 16000, in which case splitting into 3000 files shouldn't pose a problem.
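
As a side note, the same limit can also be inspected, and raised up to the hard limit, from within Python itself using the standard resource module:

    import resource

    # The soft limit is what open() actually runs into; the hard limit
    # is how far an unprivileged process may raise the soft limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(soft, hard)

    # Raise the soft limit as far as the hard limit allows.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))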

So why not increase the descriptor limit and continue? You may not always have the privilege to do so, and in some cases even 16000 might be too few. Also one cannot always figure out from the beginning how many file descriptors will be needed.

What we can do is introduce some resource management into the FileSplitter.
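
Roughly in this fashion: catch the ‘too many open files’ error, close a batch of the cached handles, and open files in append mode so that a file which gets closed and later revisited keeps what was already written to it. The batch size of 100 below is an arbitrary choice, and the rest of the class stays as before:

    import errno

    class FileSplitter(dict):
        """File handle cache that closes some cached handles when the
        file descriptor limit is hit."""
        def __missing__(self, name):
            try:
                handle = open(name, 'a')
            except IOError as e:
                if e.errno != errno.EMFILE:
                    raise
                # Too many open files: close an arbitrary batch of
                # cached handles and try again.
                for victim, old_handle in list(self.items())[:100]:
                    old_handle.close()
                    del self[victim]
                handle = open(name, 'a')
            self[name] = handle
            return handle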

There are many problems here. First of all, one should really investigate how many files need to be closed when too many are open in order to get past the exception. For this particular problem there will also always be a regular pattern of cache misses forcing us to open more files, since we will always go through the numbers 0 to 2999 in that order, over and over again. The list of dictionary items returned by self.items(), on the other hand, is ordered by the hashes of the keys and should be quite random. On a less trivial problem some files might be accessed more often than others, and in that case randomly closing files might work out better than it does here.

Anyway, with the number of files to split into increased to 3000, for the naive splitting we now have

And for the resource-managed version we get

So there is still a performance gain, but nowhere near as impressive as before!

Conclusions

Splitting a large file into a moderate number of other files can be done efficiently, if one is aware of the limitations of the system (or is in enough control to change them).

What brought this on?

For work I’ve been working on a script to demultiplex FASTQ files; that means taking lines of data from a large file and putting them into different files based on a string of six symbols, from an alphabet of five characters, at the end of each line (the alphabet being {A, C, G, T, N}). The theoretical maximum number of files would be \( 5^6 = 15625 \), but this should never really happen; for one thing, N’s are supposed to be rare.
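
In code the routing itself is just a matter of slicing the barcode off the line and handing the line to the cached splitter; a simplified sketch using the FileSplitter class from before (the reads.fastq name is made up, and a real FASTQ record of course spans four lines, which the real script has to take into account):

    splitter = FileSplitter()
    with open('reads.fastq') as source:
        for line in source:
            # The last six characters of the line are the barcode.
            barcode = line.rstrip('\n')[-6:]
            splitter['%s.fastq' % barcode].write(line)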

The system where I did most of the work had a file descriptor limit of 16000, but when I had to do some of the work on my personal computer I ran into the resource issue. And sure, I could have increased my ulimit -n, but I figured making the script safe was a good idea. It should maybe be mentioned that the resource-managed version runs almost as fast as the normal cached version if the "Too many files" exception is never reached.

Solution path for ‘Towers of Hanoi’

A few years ago I figured out a nice strategy for solving the Towers of Hanoi puzzle in the minimum number of moves. By adding a few restrictions on which moves are allowed, one is left with a single legal move at every step (except the initial one).

To be able to play with the game easily I represented it as integers stacked on each other, representing the sizes of the corresponding tiles. So, for example, the initial position for three tiles would look like this:

1
2
3
-   -   -

For some reason, it occurred to me that if one took the sum of every column in such a representation, one would get a point in \( \mathbb{R}^3 \) with the property that \( x + y + z = s \), where \( s \) is the sum of all numbers up to the number of tiles, i.e. \( s = n(n+1)/2 \) for \( n \) tiles. This means that every such point, after every move in the game, will lie in a plane.
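For the three-tile starting position above, for instance, the point is \( (1+2+3, 0, 0) = (6, 0, 0) \), and after moving the top tile to the middle peg it becomes \( (5, 1, 0) \); both satisfy \( x + y + z = 6 \).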

If one plays the game for a few low values of \( s \) and plots the path between the points generated by the moves, one starts seeing a pattern, which is natural given the symmetric and recursive structure of the game. At the time I imagined a nice fractal pattern would form if one let the number of tiles grow large, but I never investigated it.

Now, on a whim, after doing some other things in Python during the day that reminded me of the Towers puzzle, I wrote a script that plays the game for a given number of tiles, saves the point at every step, and plots the path when it is done.
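
A sketch of such a script, using the plain recursive solution (which also gives the minimal number of moves) rather than the restricted-moves strategy above; it records the column sums after every move and plots a two-dimensional projection of the path with matplotlib:

    import matplotlib.pyplot as plt

    def play(n_tiles):
        """Solve Towers of Hanoi recursively, recording the column
        sums (x, y, z) after every move."""
        pegs = [list(range(n_tiles, 0, -1)), [], []]
        points = [tuple(sum(peg) for peg in pegs)]

        def move(n, source, target, spare):
            if n == 0:
                return
            move(n - 1, source, spare, target)
            pegs[target].append(pegs[source].pop())
            points.append(tuple(sum(peg) for peg in pegs))
            move(n - 1, spare, target, source)

        move(n_tiles, 0, 2, 1)
        return points

    points = play(10)
    # Since x + y + z is constant, the z coordinate is redundant and
    # plotting x against y already shows the whole path.
    plt.plot([p[0] for p in points], [p[1] for p in points])
    plt.show()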

Writing the script was fun, but the results were not nearly as interesting or pretty as I had thought. Here is a picture of the path for ten tiles.

Path for the solution of Towers of Hanoi with 10 tiles

The basic problem, visually, is that the path folds in over itself when the number of tiles gets larger than about four.