# Streaming RNA-seq data from ENA

For many of the projects I’m working on for my PhD I use published data. Up until now my strategy has been to download all read files of an experiment from ENA, then process them all with e.g. Salmon to get expression values. This feels a bit silly because sequencing read files are on the order of gigabytes in size, while a csv file of expression values is a few megabytes. In fact, currently my data directory has almost 50 terabytes of public data in it.

The other day I saw this gist from Mike Love. Supposedly it gives you the URL of a ftp hosted fastq from the ENA/SRA accession number of it. This is great, because you can just use curl on the URL, which by defualt streams a file in chunks, to fetch the contents of a sequencing read data set. We only need to know the name of it.

I made a small Bash script for streaming a given accession id.

#/bin/bash

fastq="$1" prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1) dir1=${accession:0:6}

a_len=${#accession} if (($a_len == 9 )); then
dir2="";
elif (( $a_len == 10 )); then dir2=00${accession:9:1};
elif (( $a_len == 11)); then dir2=0${accession:9:2};
else
dir2=${accession:9:3}; fi url=$prefix/$dir1/$dir2/$accession/$fastq.gz

curl --keepalive-time 4 -s $url | zcat  We call this file stream_ena. You need to know the id of accession, and whether the file you want to look at is part of a pair. But then you have instant access to the contents of any published sequencing data-set! If we want to look at some single-end data, we can just do $ ./stream_ena SRR3185782.fastq | head
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG


If we want to quantify this data set with salmon, we can now simply run

$salmon quant -l IU \ -i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \ -r <(./stream_ena SRR3185782.fastq) -o SRR3185782  This will stream the entire contents of accession in to salmon directly from ENA without storing anything on disk, and quantified expression will be saved in the directory SRR3185782. For a dataset with paired reads we would do for example $ salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127


Many sequencing tools supports streaming out box don care whether you are your disk or server online.

Say for example we want to look at 5 random reads:

$seqtk sample -s$(date +%s) <(./stream_ena SRR1274127_1.fastq) 0.001 | head -n 20
@SRR1274127.753 753/1
TTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACGAGGGTTTATTTTTTTGGGTAGAACTGGAATAAAAGCT
+
@SRR1274127.1464 1464/1
AATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCT
+
CCCFFFFFHHHHHJJIIHIIJJIJJIGJGJJJJJJJJJJJJJJJJJJJJJJJJJJFIIJJJJJJIJJIIJJIIIJJHHHHFFDFFFFEEDDDDDDB@BBB9
@SRR1274127.1672 1672/1
CCCTACTACTATCTCGCACCTGAAACACCCTAACATGACTAACACCCTTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTT
+
@@@DDDD>FABBD@BDFB:FE;+2+2<)):CFB?<3?4?B*99???FFFE98)>F;778)7(5@E1CF?DDB#############################
@SRR1274127.2188 2188/1
GATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGGGTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGT
+
@?@DFDEDFBHFHGGG@DCHEEHG@FEHG@HGGGGIID=FDHGGHGCG?8?CFHHGHGAHEACHA@E<D@EFE>??C@@CD@B@AABCCC@BB@BCB9992
@SRR1274127.4127 4127/1
CTTATACTAGTATCCTTAATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTATCTATAAACCTAGC
+
??@DDFFFDHFFDGHGGGGGHEGGJEHIIAFDGIIGGIJGIGHHIJJIAFBGIGHJGEGIGGJGGGI=EGEEECHAB<?ACB?ABCCDDDDDDCCDDCDCD


\$ ./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin