nxn

nxn

Share this post

nxn
nxn
Gene Discovery Rate in RNA-seq data

Gene Discovery Rate in RNA-seq data

Valentine Svensson's avatar
Valentine Svensson
Nov 08, 2013
Share

I was curious regarding at what rate new genes are detected as the number of reads in RNA-seq data increases. Let’s start by looking at the results.

The figure shows how many genes there are with nonzero counts at certain read depths.

In particular I was interested how this discovery rate looked for single-cell RNA-seq data. I found some read files at ENA which was from human cells. The two data sets used are SRR445722 and SRR499825.

In the interest of saving space on my tiny laptop, I only kept the first four million reads in the fastq files. For each read number investigated, reads are first randomly sampled from the fastq file using the reservoir sampling of seqtk. The sampled reads are then mapped to the human transcriptome (Homo_sapiens.GRCh37.73.cdna.abinitio.fa) using bwa mem. To map from transcripts to genes, I use eXpress (with the annotation file Homo_sapiens.GRCh37.73.gtf.gz). The final step is to count the number of genes which have counts different from 0.

In the interest of time and simplicity, I didn’t perform any quality or adapter trimming of the data.

For each read count, the sampling was performed five times with different random seeds. In the figure these replicates can barely be seen, but at some pints some smaller colored points can be glanced.

These are the pipeline for the process for the two datasets:

for READS in $(seq 10000 10000 1000000) do for i in $(seq 5) do echo $READS reads >> sample_num_genes_SRR445722.out seqtk sample -s$(date +%s) SRR445722_4M_reads.fastq $(echo "scale=8; $READS / 4000000" | bc) | \ bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa - | \ express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes.out done done

and

for READS in $(seq 10000 10000 1000000) do for i in $(seq 5) do echo $READS reads >> sample_num_genes_SRR499825.out SEED=$(date +%s) FRAC=$(echo "scale=8; $READS / 4000000" | bc) bwa mem -t 2 Homo_sapiens.GRCh37.73.cdna.abinitio.fa \ <(seqtk sample -s$SEED SRR499825_1_4M_reads.fastq $FRAC) \ <(seqtk sample -s$SEED SRR499825_2_4M_reads.fastq $FRAC) \ | express --no-bias-correct Homo_sapiens.GRCh37.73.cdna.abinitio.fa cut -f5 results.xprs | grep -v "\b0\b" | wc -l >> sample_num_genes_SRR499825.out done done

They are slightly different since SRR499825 is paired-ended while SRR445722 is single ended.

I ran these for as long as I could leave my laptop on and stationary at a time.

What I find interesting about this is how these logarithm-like curves seem are so smooth.

Both studies these data come from have plenty of replicates, so a more thorough version of this study would be to try data from multiple cells and make sure that correlates. It would also be interesting to see how bulk RNA- seq compares to these data sets. I also only looked at data from human cells because I didn’t want include another set of transcriptome and annotation file to keep track of, but there are a lot more mouse single-cell RNA-seq studies on ENA than human.

It would also be fun to run the analysis much longer, to get up to millions of reads, where we start seeing the majority of genes as detected. And see precisely at which point this starts to happen.

EDIT

I reran the computations, but this time I let both data sets run to completion (depth of 1000 000 reads). I also saved the counts for each gene, so I could query detected genes with a lower cutoff. The new graph above show the detection rates for lower cutoffs of one, three and five mapped reads.


Subscribe to nxn

By Valentine Svensson · Launched a year ago
What do you mean 'heterogeneity'?
Share
SCVI — Integrating or not?
In modern single cell analysis workflows you fit a model to your data that allow you to ask questions about the data.
Mar 18 • 
Valentine Svensson
Actionable scRNA-seq clusters
Recently Jean Fan published a blog post about "real" scRNA-seq clusters. The general idea is that if there is not differential expression between…
Mar 5, 2018 • 
Valentine Svensson
Merging SCVI models
There are massive amounts of single-cell RNA-seq data, and data is still being produced.
May 21 • 
Valentine Svensson

Ready for more?

© 2025 Valentine Svensson
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share