# Low mapping rate 5 - Human DNA contamination

This is (most likely) the final post in the series investigating the low mapping rate of our Smart-seq2 data from our study on the malaria immune response. If you have read the previous posts, you might have noticed a population of cells which has been stuck at an extremely low mapping rate, no matter how much things improved for the other cells. It turns out this population of cells is contaminated with human material.

In our study we investigated CD4+ T cells sorted from spleens of infected mice. There are a number of potential entry points for human material to contaminate the samples.

1. Human cells can be sorted with mouse cells.
2. Human material can enter the plate of cell lysate during cDNA generation.
3. Human material can end up in the plate when creating the DNA sequencing library.

These different potential vectors of human contamination will lead to data with different characteristics. If human cells are sorted together with mouse cells, the data will have heterogeneous expression patterns, as single-cell data does, only with human genes rather than mouse genes. On the other hand, if human material enters the plate of sorted cells while cDNA is being generated, human mRNA will be converted to cDNA, but with consistent bulk-like expression patterns across the different wells. Finally, if human material enters the plate during library preparation, no reverse transcription will be performed, and DNA from the human material will end up being sequenced instead.
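The reasoning above can be written down as a rough decision rule. This is only a sketch: the function name, the input fractions, and the 0.5 threshold are hypothetical and not taken from the analysis in this post. It also only separates human RNA from human DNA. Telling sorted human cells apart from contamination during cDNA generation would additionally require comparing expression patterns across wells.

```python
def likely_contamination_stage(human_transcript_frac, human_genomic_frac,
                               threshold=0.5):
    """Guess where human material entered, from per-sample read fractions.

    human_transcript_frac: fraction of reads assigned to human transcripts.
    human_genomic_frac: fraction of reads that only align to the human genome.
    threshold: hypothetical cutoff for calling a source "dominant".
    """
    if human_genomic_frac >= threshold:
        # No reverse transcription happened: the reads come from human DNA.
        return "library preparation (human DNA)"
    if human_transcript_frac >= threshold:
        # Human mRNA was converted to cDNA: sorted human cells, or
        # contamination during cDNA generation.
        return "sorting or cDNA generation (human RNA)"
    return "no dominant human contamination"


# Made-up example fractions, for illustration only.
print(likely_contamination_stage(0.05, 0.80))
print(likely_contamination_stage(0.70, 0.02))
```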

To analyse this, I added all human Gencode transcripts to the Salmon reference from the previous post, along with a human 18S rRNA sequence. This will account for the first two possibilities. For the case of human DNA contamination, I extracted the unmapped reads from Salmon and aligned them to the human genome with HISAT2. With the remaining unmapped/unaligned reads I calculated the final mapping rate.
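As a minimal sketch of the final calculation, assuming you have three per-sample counts (total reads, reads mapped by Salmon, and previously unmapped reads that HISAT2 aligned to the human genome), the final mapping rate is the fraction of reads explained by either step. The function name and the example numbers are made up for illustration.

```python
def final_mapping_rate(total_reads, salmon_mapped, hisat2_aligned):
    """Fraction of reads explained by Salmon or by the HISAT2 rescue step.

    hisat2_aligned counts only reads that Salmon left unmapped, so the two
    counts never overlap and can simply be added.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    unmapped_after_salmon = total_reads - salmon_mapped
    if not 0 <= hisat2_aligned <= unmapped_after_salmon:
        raise ValueError("hisat2_aligned must be between 0 and the "
                         "number of reads Salmon left unmapped")
    return (salmon_mapped + hisat2_aligned) / total_reads


# Made-up counts: Salmon explains 100 of 1000 reads, HISAT2 another 650.
rate = final_mapping_rate(total_reads=1000, salmon_mapped=100,
                          hisat2_aligned=650)
print(f"final mapping rate: {rate:.2f}")
```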

It is clear that we have solved the mystery of the extremely low-mapping population! By breaking up the mapped reads into their sources of contribution for each sample, like we have done before, we can see which of the contamination cases have happened.
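One way to do this kind of breakdown is to sum the `NumReads` column of a Salmon `quant.sf` file per source. The sketch below assumes the source can be recognised from the start of the reference name (`ENSMUST` for mouse Gencode transcripts, `ENST` for human Gencode transcripts, `ERCC-` for spike-ins). That prefix table, the function names, and the toy rows are assumptions for illustration. The other sequences in the reference would need prefixes matching whatever names they were given in the FASTA file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping from reference-name prefix to source category.
PREFIX_TO_SOURCE = [
    ("ENSMUST", "mouse transcripts"),
    ("ENST", "human transcripts"),
    ("ERCC-", "ERCC spike-ins"),
]


def source_of(name):
    """Return the source category for a reference sequence name."""
    for prefix, source in PREFIX_TO_SOURCE:
        if name.startswith(prefix):
            return source
    return "other"


def reads_by_source(quant_sf_text):
    """Sum NumReads per source from the text of a Salmon quant.sf file."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        totals[source_of(row["Name"])] += float(row["NumReads"])
    return dict(totals)


# Toy quant.sf content with made-up values.
example = "\n".join([
    "Name\tLength\tEffectiveLength\tTPM\tNumReads",
    "ENSMUST00000000001.4\t3262\t3012.0\t10.0\t120.0",
    "ENST00000000233.9\t1103\t853.0\t5.0\t30.0",
    "ERCC-00002\t1061\t811.0\t50.0\t400.0",
])
print(reads_by_source(example))
```

Dividing each total by the number of reads in the sample gives the per-source contribution to the mapping rate.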

In the entirety of plate 20003_6, as well as in stretches of plate 20003_8, we see that by far the greatest contribution of material is from human intergenic DNA, suggesting that the contamination happened during library preparation.

At this point I want to illustrate how many of the reads we have now found an explanation for, compared to the original reference.

The mapping rate has moved from a heterogeneous stretch to a clearer distribution that is common across the plates. In the end the mapping rate is only 75% on average, but this is a great improvement from before, and I haven't managed to see anything systematic about the remaining reads.

To summarise, the Salmon reference now contains mouse Gencode genes, ERCC spike-in sequences, mouse ribosomal RNA, a TSO concatemer "bait" sequence, Pseudomonas 16S and 23S sequences, human Gencode genes, and human 18S. Additionally, the unmapped reads are aligned to the human genome.

I hope this series of posts has been helpful, and in particular illustrative of the many failure modes of scRNA-seq experiments. This was all within a single experiment!