Detecting scRNA-seq study duplicates using sentence embeddings
I have been maintaining a spreadsheet of publications that generated single-cell transcriptomics data for about five years. It is linked at the top of this website, and we wrote up a paper about it a few years ago (Svensson, da Veiga Beltrame, and Pachter 2020). Earlier today it had 1,933 entries.
Naturally, over time, mistakes such as inserting the same publication twice are bound to happen. I took some time today to identify accidental duplicates.
My initial approach was to search for entries with exactly the same DOI (digital object identifier; the unique string that identifies a publication). I found four papers with this strategy and deduplicated them.
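As a rough illustration of the exact-DOI check, here is a minimal pandas sketch. The toy table and the `doi_key` column are made up for this example; the only thing taken from the spreadsheet is that it has a `DOI` column. DOIs are case-insensitive, so the sketch lowercases them before comparing, which is an extra step not described above.

```python
import pandas as pd

# Toy stand-in for the spreadsheet; the rows are illustrative only.
data = pd.DataFrame({
    'DOI': ['10.1000/example.1', '10.1000/example.2', '10.1000/EXAMPLE.1'],
    'Title': ['Paper A', 'Paper B', 'Paper A'],
})

# DOIs are case-insensitive, so compare a normalized key (hypothetical column).
data['doi_key'] = data['DOI'].str.strip().str.lower()

# keep=False marks every member of a duplicated group, not just the repeats.
doi_duplicates = data[data.duplicated('doi_key', keep=False)]
print(doi_duplicates[['DOI', 'Title']])
```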
In the spreadsheet, the DOI is the only required field; the author list, title, publication, and date are collected automatically by a macro that calls the CrossRef API. Even though DOIs are unique identifiers of papers, the same study is sometimes duplicated under different DOIs. The main reason is that a paper gets one DOI when it is posted on a preprint server such as bioRxiv, and a second DOI once it is published in a peer-reviewed journal.
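The spreadsheet macro itself is not shown here, but the metadata lookup it performs can be sketched in Python against the public CrossRef REST API (`https://api.crossref.org/works/<DOI>`). This is an assumption-laden sketch rather than the macro's actual logic: the function names and the output keys `Authors`, `Journal`, and `Date` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def parse_crossref_work(message):
    """Pull author list, title, journal, and date out of a CrossRef 'message' record."""
    authors = ', '.join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in message.get('author', [])
    )
    # CrossRef returns title and container-title as lists.
    title = (message.get('title') or [''])[0]
    journal = (message.get('container-title') or [''])[0]
    # 'issued' holds date-parts such as [[2020, 11]]; month and day may be missing.
    date_parts = message.get('issued', {}).get('date-parts', [[]])[0]
    date = '-'.join(f'{int(p):02d}' for p in date_parts if p is not None)
    return {'Authors': authors, 'Title': title, 'Journal': journal, 'Date': date}


def fetch_crossref_work(doi):
    """Fetch metadata for one DOI from the CrossRef REST API (needs network access)."""
    url = 'https://api.crossref.org/works/' + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return parse_crossref_work(json.load(response)['message'])
```

Calling `fetch_crossref_work('10.1093/database/baaa073')` would look up the paper cited at the top of this post.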
To identify preprint-journal duplicates, my first strategy was to find papers with exactly the same title. This identified five studies, each of which turned out to be entered twice: once as a bioRxiv preprint and once as a journal publication.
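The exact-title search can be sketched the same way: group by title and keep the titles listed under more than one DOI. The toy rows and variable names below are illustrative only.

```python
import pandas as pd

# Toy stand-in for the spreadsheet; the rows are illustrative only.
data = pd.DataFrame({
    'DOI': ['10.1101/example.1', '10.1000/example.2', '10.1000/example.3'],
    'Title': ['Paper A', 'Paper B', 'Paper A'],
})

# Count the distinct DOIs listed under each exact title.
n_dois = data.groupby('Title')['DOI'].nunique()

# Titles that appear under more than one DOI are candidate duplicates.
shared_titles = n_dois[n_dois > 1].index
same_title = data[data['Title'].isin(shared_titles)].sort_values('Title')
print(same_title)
```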
A paper that has been revised before being submitted to a journal, then further revised through the review process, and finally has to adhere to the style guide of the journal, is unlikely to retain exactly the same title. To find papers that were likely duplicates, I needed a way to identify titles that probably describe the same paper even if there are slight variations.
The quickest and simplest approach I came up with was to use the OpenAI API to create sentence embeddings of the titles. I then calculated all the pairwise distances between the embeddings and looked at the pairs of titles that were closest to each other. This ended up being very simple and effective!
from openai import OpenAI
import numpy as np
import pandas as pd
import sklearn.metrics
from tqdm import tqdm

# `data` is the spreadsheet as a pandas DataFrame with 'DOI' and 'Title' columns.
client = OpenAI()

# Embed the titles in 10 batches.
responses = []
for chunk in tqdm(np.array_split(data, 10)):
    query = chunk['Title'].to_list()
    response = client.embeddings.create(input = query, model = 'text-embedding-ada-002')
    responses += [response]

# Stack the per-batch embeddings into one array with a row per title.
embeddings_list = []
for response in responses:
    embeddings = np.array([d.embedding for d in response.data])
    embeddings_list += [embeddings]

embeddings = np.vstack(embeddings_list)

# Pairwise distances (Euclidean by default); the upper triangle keeps each pair once.
pdists = sklearn.metrics.pairwise_distances(embeddings)
mask = np.triu(np.ones(pdists.shape), k = 1).astype(bool)
pdistsl = pd.DataFrame(pdists).where(mask).stack().reset_index()

# Print the 20 closest pairs of titles.
top_similar = pdistsl.sort_values(0).head(20)
for _, r in top_similar.iterrows():
    print('Distance:', r[0])
    d_r_0 = data.iloc[r['level_0'].astype(int)]
    d_r_1 = data.iloc[r['level_1'].astype(int)]
    print(d_r_0['DOI'], '\n|', d_r_0['Title'])
    print(d_r_1['DOI'], '\n|', d_r_1['Title'])
    print()

Distance: 0.04542879727657215
10.1101/2020.10.07.329839
| Single-nucleus transcriptome analysis reveals cell type-specific molecular signatures across reward circuitry in the human brain
10.1016/j.neuron.2021.09.001
| Single-nucleus transcriptome analysis reveals cell-type-specific molecular signatures across reward circuitry in the human brain

Distance: 0.051210665464434646
10.1101/2020.07.11.193458
| Single-nucleus RNA-seq2 reveals a functional crosstalk between liver zonation and ploidy
10.1038/s41467-021-24543-5
| Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy

Distance: 0.07003401486501365
10.1101/2020.03.02.955757
| Diversification of molecularly defined myenteric neuron classes revealed by single cell RNA-sequencing
10.1038/s41593-020-00736-x
| Diversification of molecularly defined myenteric neuron classes revealed by single-cell RNA sequencing

Distance: 0.08156853865981699
10.1101/2021.07.19.452956
| The Tabula Sapiens: a multiple organ single cell transcriptomic atlas of humans
10.1126/science.abl4896
| The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans

Distance: 0.1182708273417854
10.1101/2020.04.22.056341
| Deconvolution of Cell Type-Specific Drug Responses in Human Tumor Tissue with Single-Cell RNA-seq
10.1186/s13073-021-00894-y
| Deconvolution of cell type-specific drug responses in human tumor tissue with single-cell RNA-seq

Distance: 0.14183682263019862
10.1101/2020.01.19.911701
| Surveying Brain Tumor Heterogeneity by Single-Cell RNA Sequencing of Multi-sector Biopsies
10.1093/nsr/nwaa099
| Surveying brain tumor heterogeneity by single-cell RNA-sequencing of multi-sector biopsies

Distance: 0.15672052837461234
10.21203/rs.3.rs-745435/v1
| Single cell analysis of endometriosis reveals a coordinated transcriptional program driving immunotolerance and angiogenesis across eutopic and ectopic tissues.
10.1038/s41556-022-00961-5
| Single-cell analysis of endometriosis reveals a coordinated transcriptional programme driving immunotolerance and angiogenesis across eutopic and ectopic tissues

Distance: 0.16437164718666886
10.1101/2020.06.17.156943
| Chromatin potential identified by shared single cell profiling of RNA and chromatin
10.1016/j.cell.2020.09.056
| Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin

Distance: 0.16911884570096825
10.1101/2021.04.24.441206
| Single-cell landscapes of primary glioblastomas and matched organoids and cell lines reveal variable retention of inter- and intra-tumor heterogeneity
10.1016/j.ccell.2022.02.016
| Single-cell landscapes of primary glioblastomas and matched explants and cell lines show variable retention of inter- and intratumor heterogeneity

Distance: 0.183893761793663
10.1101/2020.02.12.946509
| No detectable alloreactive transcriptional responses during donor-multiplexed single-cell RNA sequencing of peripheral blood mononuclear cells
10.1186/s12915-020-00941-x
| No detectable alloreactive transcriptional responses under standard sample preparation conditions during donor-multiplexed single-cell RNA sequencing of peripheral blood mononuclear cells

Distance: 0.18895108556159476
10.1101/2020.01.13.891630
| Single-cell transcriptome analysis reveals cell-cell communication and thyrocyte diversity in the zebrafish thyroid gland
10.15252/embr.202050612
| Single‐cell transcriptome analysis reveals thyrocyte diversity in the zebrafish thyroid gland

Distance: 0.2003776161396695
10.21203/rs.3.rs-599203/v1
| A Single-cell Interactome of Human Tooth Germ Elucidates Signaling Networks Regulating Dental Development
10.1186/s13578-021-00691-5
| A single-cell interactome of human tooth germ from growing third molar elucidates signaling networks regulating dental development

Distance: 0.23986737520644938
10.1101/2022.01.12.476082
| Scalable in situ single-cell profiling by electrophoretic capture of mRNA
10.1038/s41587-022-01455-3
| Scalable in situ single-cell profiling by electrophoretic capture of mRNA using EEL FISH

Distance: 0.25840869095237246
10.2337/db16-0405
| Single-Cell Transcriptomics of the Human Endocrine Pancreas
10.1016/j.cels.2016.09.002
| A Single-Cell Transcriptome Atlas of the Human Pancreas

Distance: 0.26278269347286093
10.15252/embj.2018100811
| A single‐cell transcriptome atlas of the adult human retina
10.1093/nsr/nwaa179
| A single-cell transcriptome atlas of the aging human and macaque retina

Distance: 0.26422422020526076
10.1038/s41467-018-08079-9
| Single-cell transcriptomic analysis of mouse neocortical development
10.1101/2020.04.23.056390
| Single-cell transcriptomic analysis identifies neocortical developmental differences between human and mouse

Distance: 0.2916387113759244
10.1038/s41586-022-04518-2
| A single-cell atlas of human and mouse white adipose tissue
10.1038/s41467-023-36983-2
| An integrated single cell and spatial transcriptomic map of human white adipose tissue

Distance: 0.29364709869497707
10.1016/j.devcel.2020.05.010
| Single-Cell RNA Sequencing of Human, Macaque, and Mouse Testes Uncovers Conserved and Divergent Features of Mammalian Spermatogenesis
10.1016/j.devcel.2020.07.018
| Single-Cell RNA Sequencing of the Cynomolgus Macaque Testis Reveals Conserved Transcriptional Profiles during Mammalian Spermatogenesis

Distance: 0.2940628150528307
10.1126/science.aar4362
| Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo
10.1101/2021.10.21.465298
| Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis

Distance: 0.2991743187594748
10.1101/2022.02.01.478648
| Single-cell RNA profiling of Plasmodium vivax liver stages reveals parasite- and host-specific transcriptomic signatures and drug targets
10.1371/journal.pntd.0010633
| Single-cell RNA sequencing of Plasmodium vivax sporozoites reveals stage- and species-specific transcriptomic signatures
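A note on reading the distances: `sklearn.metrics.pairwise_distances` uses the Euclidean metric unless told otherwise. If the embeddings have unit length (my understanding of the OpenAI embedding models, but an assumption worth checking with `np.linalg.norm`), Euclidean distance and cosine similarity carry the same information, since distance² = 2 × (1 − cosine similarity). The sketch below checks that identity on random unit vectors standing in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random unit vectors standing in for title embeddings (not real data).
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Euclidean distances between all pairs, computed by broadcasting.
diffs = vectors[:, None, :] - vectors[None, :, :]
euclidean = np.sqrt((diffs ** 2).sum(axis=-1))

# For unit vectors, cosine similarity is the plain dot product.
cosine = vectors @ vectors.T

# Check the identity: distance^2 = 2 * (1 - cosine similarity).
print(np.allclose(euclidean ** 2, 2 * (1 - cosine)))
```

Under that assumption, the closest pair above (distance about 0.045) would have a cosine similarity of roughly 0.999, and the pairs near distance 0.3 roughly 0.955.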
The majority of highly similar article titles were bioRxiv preprints with their matched journal publications, and a couple of medRxiv and ResearchSquare preprints. Some were genuinely different papers that just happened to have very similar titles. Through this process, I could remove 14 more duplicates. In total, I discovered 23 paper duplicates!
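One way to speed up the manual review of such a list is to flag pairs where exactly one DOI comes from a preprint server. The sketch below is a heuristic I am adding for illustration, not part of the original analysis. It relies on the DOI prefixes visible in the results: `10.1101` (bioRxiv and medRxiv, though the prefix is shared with Cold Spring Harbor Laboratory Press journals) and `10.21203` (Research Square). The function names are hypothetical.

```python
# DOI prefixes of the preprint servers that show up in the results above.
# 10.1101 covers bioRxiv and medRxiv; 10.21203 is Research Square.
PREPRINT_PREFIXES = ('10.1101/', '10.21203/')


def is_preprint(doi):
    """Return True if the DOI starts with a known preprint-server prefix."""
    return doi.lower().startswith(PREPRINT_PREFIXES)


def looks_like_preprint_journal_pair(doi_a, doi_b):
    """True when exactly one of the two DOIs belongs to a preprint server."""
    return is_preprint(doi_a) != is_preprint(doi_b)


# Closest pair in the results: a bioRxiv preprint and its Neuron publication.
print(looks_like_preprint_journal_pair(
    '10.1101/2020.10.07.329839', '10.1016/j.neuron.2021.09.001'))

# Two journal papers with similar titles.
print(looks_like_preprint_journal_pair(
    '10.2337/db16-0405', '10.1016/j.cels.2016.09.002'))
```

The flag only narrows things down. Some pairs in the list combine a preprint with an unrelated journal paper, so the titles still need to be read.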
I was particularly impressed by how easy it is to get high-quality sentence embeddings these days. There are probably technically simpler strategies for finding similar titles, such as removing punctuation and converting all characters to lowercase, since punctuation and capitalization mostly depend on journal style guides. In practice, though, just getting text embeddings is now easier than any ad hoc strategy.
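For comparison, the ad hoc normalization mentioned above might look like the following sketch (the function name is hypothetical). It catches pairs that differ only in capitalization and hyphenation, but not titles that were reworded during review.

```python
import re


def normalize_title(title):
    """Lowercase a title and replace runs of punctuation with single spaces."""
    return re.sub(r'[\W_]+', ' ', title.lower()).strip()


# Differs only in capitalization and hyphenation: caught by normalization.
preprint = 'Chromatin potential identified by shared single cell profiling of RNA and chromatin'
journal = 'Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin'
print(normalize_title(preprint) == normalize_title(journal))  # True

# Words were added for the journal version: normalization misses this pair.
short = 'Scalable in situ single-cell profiling by electrophoretic capture of mRNA'
long = 'Scalable in situ single-cell profiling by electrophoretic capture of mRNA using EEL FISH'
print(normalize_title(short) == normalize_title(long))  # False
```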
A notebook with code related to this post is available on GitHub.
References
Svensson, Valentine, Eduardo da Veiga Beltrame, and Lior Pachter. 2020. “A Curated Database Reveals Trends in Single-Cell Transcriptomics.” Database: The Journal of Biological Databases and Curation 2020 (November). https://doi.org/10.1093/database/baaa073.