The variability of a spoonful of coffee

I like freshly ground coffee, especially made with a French press. The common advice regarding coffee is to dose by weight , due to measurement variability regarding beans in spoons. I was curious how large this variability is.

The different sizes and shapes of the coffee beans makes the volume of the packed beans not correspond to the mass of the coffee, which will be ground to powder.

Different coffee beans are also differently dense, here I’m looking at Brazil Santos coffee.

I made 10 measurements, where I took a level spoon, weighed the contents, then replaced them in my jar of beans before I repeated the measurement.

To quantify the variability, I made a model in Stan. Here a spoonful is modeled normally distributed with a standard deviation parameter which correspond to the between-spoonful variability.

import pystan

code = '''
data {
    int<lower> N;
    real<lower> y[N];
}
parameters {
    real<lower> mu;
    real<lower> sigma;
}
model {
    // Priors
    mu ~ uniform(0, 25);
    sigma ~ uniform(0, 25);

    // Likelihood
    y ~ normal(mu, sigma);
}
'''

data = {
    'N': 10,
    'y': [13.54, 14.22, 13.63, 13.20, 13.32,
          13.23, 13.98, 13.66, 13.51, 14.18]
}

fit = pystan.stan(model_code=code, data=data)

fit

Inference for Stan model: anon_model_393a8d8deb17adb090b969fb1275a378.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
mu     13.64  2.3e-3   0.14  13.34  13.55  13.64  13.73  13.91   4000    1.0
sigma   0.43  2.1e-3   0.13   0.26   0.35   0.41   0.49   0.76   4000    1.0
lp__    5.14    0.02   1.13   2.27   4.74   5.47   5.91   6.22   4000    1.0

Samples were drawn using NUTS(diag_e) at Mon Jul 11 20:54:38 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

The $ y $ values in the data dictionaries are the weight measurements.

 

In the end this means that the confidence interval of the spoonful of coffee is between 12.8g and 14.5g.

Streaming RNA-seq data from ENA

For many of the projects I'm working on for my PhD I use published data. Up until now my strategy has been to download all read files of an experiment from ENA, then process them all with e.g. Salmon to get expression values. This feels a bit silly because sequencing read files are on the order of gigabytes in size, while a csv file of expression values is a few megabytes. In fact, currently my data directory has almost 50 terabytes of public data in it.

The other day I saw this gist from Mike Love. Supposedly it gives you the URL of a ftp hosted fastq from the ENA/SRA accession number of it. This is great, because you can just use curl on the URL, which by defualt streams a file in chunks, to fetch the contents of a sequencing read data set. We only need to know the name of it.

I made a small Bash script for streaming a given accession id.

#/bin/bash

fastq="$1"

prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1)

dir1=${accession:0:6}

a_len=${#accession}
if (( $a_len == 9 )); then
    dir2="";
elif (( $a_len == 10 )); then
    dir2=00${accession:9:1};
elif (( $a_len == 11)); then
    dir2=0${accession:9:2};
else
    dir2=${accession:9:3};
fi

url=$prefix/$dir1/$dir2/$accession/$fastq.gz

curl --keepalive-time 4 -s $url | zcat

We call this file stream_ena. You need to know the id of accession, and whether the file you want to look at is part of a pair. But then you have instant access to the contents of any published sequencing data-set!

If we want to look at some single-end data, we can just do

$ ./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG

If we want to quantify this data set with salmon, we can now simply run

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782

This will stream the entire contents of accession in to salmon directly from ENA without storing anything on disk, and quantified expression will be saved in the directory SRR3185782.

For a dataset with paired reads we would do for example

$ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127

Many sequencing tools supports streaming out box don care whether you are your disk or server online.

Say for example we want to look at 5 random reads:

$ seqtk sample -s $(date +%s) <(./stream_ena SRR1274127_1.fastq) 0.001 | head -n 20
@SRR1274127.753 753/1
TTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACGAGGGTTTATTTTTTTGGGTAGAACTGGAATAAAAGCT
+
BCCFFFFFHHHHHJJJJJJJJHIJIJJJIICGHIJGIJJIJJJJJJJJJJJJJJJJJIJJJGHHFFFD;ADDDDEEDDDD>&2>?ACDDDCDDDDDDDDDC
@SRR1274127.1464 1464/1
AATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCT
+
CCCFFFFFHHHHHJJIIHIIJJIJJIGJGJJJJJJJJJJJJJJJJJJJJJJJJJJFIIJJJJJJIJJIIJJIIIJJHHHHFFDFFFFEEDDDDDDB@BBB9
@SRR1274127.1672 1672/1
CCCTACTACTATCTCGCACCTGAAACACCCTAACATGACTAACACCCTTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTT
+
@@@DDDD>FABBD@BDFB:FE;+2+2<)):CFB?<3?4?B*99???FFFE98)>F;778)7(5@E1CF?DDB#############################
@SRR1274127.2188 2188/1
GATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGGGTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGT
+
@?@DFDEDFBHFHGGG@DCHEEHG@FEHG@HGGGGIID=FDHGGHGCG?8?CFHHGHGAHEACHA@E<D@EFE>??C@@CD@B@AABCCC@BB@BCB9992
@SRR1274127.4127 4127/1
CTTATACTAGTATCCTTAATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTATCTATAAACCTAGC
+
??@DDFFFDHFFDGHGGGGGHEGGJEHIIAFDGIIGGIJGIGHHIJJIAFBGIGHJGEGIGGJGGGI=EGEEECHAB<?ACB?ABCCDDDDDDCCDDCDCD

Or if we want to check the quality of a dataset without wasting space downloading it:

$ ./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin

Of course there are caveats. You can't just blindly put any reads in to salmon and get correct expression. You need to read the methods section of the related study to see what parameters use. The data might need some might preprocessing before it is useful. It is also very common that files uploaded to ENA should be merged before being input to processing tools. But this streaming approach greatly ease storage burden burden from working with public data.