Organizing scRNA-seq data with LaminDB
Occasionally, when I’m curious about something scRNA-seq related, I analyze some data. Often this leads to posts on this website.
I have made it a habit to try to find recent, somewhat interesting scRNA-seq data for these kinds of analyses. This is because I don’t want to end up mentally overfitting to a particular dataset, and I also consider it a way to keep up with how the scale and complexity of datasets are changing over time.
Over the years, I have ended up collecting a large number of scRNA-seq datasets. I did a global search for ‘*.h5ad’ on my Google Drive, and found over 300 files in various directories scattered throughout my drive.
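That kind of global search is easy to reproduce locally with a recursive glob. A minimal sketch (the directory names here are made up for illustration):

```python
from pathlib import Path
import tempfile

# Build a toy directory tree standing in for scattered Drive folders
# (the folder names are invented for this example).
root = Path(tempfile.mkdtemp())
for sub in ["post-2019/cortex", "post-2021/pbmc", "misc"]:
    d = root / sub
    d.mkdir(parents=True)
    (d / "data.h5ad").touch()
(root / "misc" / "notes.txt").touch()

# Recursive search for AnnData files, analogous to searching '*.h5ad' on Drive.
h5ads = sorted(root.rglob("*.h5ad"))
print(len(h5ads))  # 3
```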
Due to the increasing popularity and scale of scRNA-seq experiments, products have emerged for organizing large collections of data, such as LaminDB.
LaminDB is designed for tracking biological data in general. It offers plugins for managing ontologies relevant to scRNA-seq data, as well as native support for data in the AnnData format.
The LaminDB database tracks metadata about the data, as well as the location of the data. It can track data locally on a computer, but it is probably most useful when storing your data on cloud storage.
After filtering duplicates and non-public H5ADs from my list, I ingested the remaining H5ADs into LaminDB. I also spent some time hunting down the original publications for the datasets and tagged each one with the DOI of its source publication.
LaminDB is organized around ‘artifacts’ and ‘collections’. Each H5AD file is considered an ‘artifact’, which can be annotated with ‘features’ and ‘labels’.
Since each artifact (H5AD file) has an associated article DOI, I was able to use the single cell studies database to add further annotations to the files: the organism, tissue, and technique associated with the DOI.
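Conceptually, this is a single merge on the DOI column. A toy sketch with pandas, using made-up records and illustrative column names rather than the exact schema of the single cell studies database:

```python
import pandas as pd

# Artifact-level table with the DOI tagged during ingestion (toy data).
artifacts = pd.DataFrame({
    "key": ["a.h5ad", "b.h5ad"],
    "doi": ["10.1101/aaa", "10.1101/bbb"],
})

# Stand-in for the single cell studies database; column names are illustrative.
studies = pd.DataFrame({
    "DOI": ["10.1101/aaa", "10.1101/bbb"],
    "Organism": ["Mouse", "Human"],
    "Tissue": ["Brain", "Blood"],
    "Technique": ["Chromium", "Smart-seq2"],
})

# One left merge on DOI transfers publication-level annotations to every file.
annotated = artifacts.merge(studies, left_on="doi", right_on="DOI", how="left")
print(annotated[["key", "Organism", "Tissue", "Technique"]])
```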
Features and labels can be arbitrarily registered, and you can build your own ontologies to organize your data. For single-cell genomics data, Lamin provides a plugin, ‘bionty’, with pre-defined labels and features using relevant public ontologies, including organisms and tissues. The data from the single cell studies database was used to create these labels for the artifacts.
The LaminDB database can be explored on lamin.ai, and can be loaded by running lamin load vals/scrna in the terminal. In my case, the database is hosted on AWS. After running the load command, the database is directly accessible in Python through the ‘lamindb’ package, which is typically imported as ln.
For example, we can count how many files are tracked by the database.
> ln.Artifact.filter().all().count()
236
These are the 236 H5AD files I ingested to the database after deduplication. This deduplication happened automatically as I ingested files based on the hash of the file contents.
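The idea behind content-hash deduplication is simple: two files with identical bytes hash to the same value, regardless of filename or location. A simplified illustration (LaminDB's actual hashing scheme differs in detail, e.g. it can hash only part of large files):

```python
import hashlib
import tempfile
from pathlib import Path

def content_hash(path, chunk_size=1 << 20):
    """SHA-1 of a file's contents, read in chunks to handle large files."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    """Keep the first file seen for each distinct content hash."""
    seen = {}
    for p in paths:
        seen.setdefault(content_hash(p), p)
    return list(seen.values())

# Demo: two identical files and one distinct file -> two unique artifacts.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.h5ad").write_bytes(b"counts-matrix")
(tmp / "b.h5ad").write_bytes(b"counts-matrix")  # duplicate content, new name
(tmp / "c.h5ad").write_bytes(b"other-matrix")
unique = deduplicate(sorted(tmp.glob("*.h5ad")))
print(len(unique))  # 2
```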
Files of interest can be identified by filtering based on metadata (which in this case is on the publication level).
> # ontology lookups come from the bionty plugin, here imported as lb
> organisms = lb.Organism.lookup()
> tissues = lb.Tissue.lookup()
> techniques = lb.ExperimentalFactor.lookup()
> query = (
ln.Artifact
.filter(organism = organisms.mouse)
.filter(tissues = tissues.brain)
.filter(experimental_factors = techniques.chromium)
)
> print(query.df().head())
uid storage_id
id
15 88k1xpSWBdJzEbYUf6MZ 1 \
16 UDH0CAo3JHo0uQUiAtUV 1
21 kFvwhKtIBY0mhQTOqaOX 1
22 PBAwUla9eERbI1KJQADZ 1
23 HW9uS90M2PVyeX50o6ct 1
key suffix accessor
id
15 cortex-align-2018/Zeisel 2018/l1_cortex2.h5ad .h5ad AnnData \
16 cortex-align-2018/Zeisel 2018/l1_cortex3.h5ad .h5ad AnnData
21 hca-benchmark-2019/Ding/2019-05-Ding-et-al-bio... .h5ad AnnData
22 hca-benchmark-2019/Ding/PBMC/PBMC_read_counts.... .h5ad AnnData
23 hca-benchmark-2019/Ding/PBMC/PBMC_umi_counts.h5ad .h5ad AnnData
description version size hash hash_type n_objects
id
15 None None 77363974 SlY49Cv9vnMGofdSCMOj90 sha1-fl None \
16 None None 64354637 ZVFW2H_B1ZlD9KK_WKS11T sha1-fl None
21 None None 224711660 e7bzC5ZhAV-e_swr_6jQie sha1-fl None
22 None None 307531280 DaQIiOBzYf5VjXmOmWE1tQ sha1-fl None
23 None None 296120754 aKSloGAGbodFRA8PlBR1Vz sha1-fl None
n_observations transform_id run_id visibility key_is_virtual
id
15 None None None 1 True \
16 None None None 1 True
21 None None None 1 True
22 None None None 1 True
23 None None None 1 True
created_at updated_at
id
15 2023-11-06 07:03:39.184199+00:00 2024-01-21 07:04:35.258135+00:00 \
16 2023-11-06 07:04:08.454681+00:00 2024-01-21 07:04:38.216085+00:00
21 2023-11-06 07:16:59.839767+00:00 2024-01-21 07:06:08.275496+00:00
22 2023-11-06 16:19:26.830698+00:00 2024-01-21 07:06:18.767072+00:00
23 2023-11-06 16:20:20.747104+00:00 2024-01-21 07:06:29.318662+00:00
created_by_id
id
15 1
16 1
21 1
22 1
23 1
Artifacts, here representing H5AD files, carry a variety of metadata: provenance (the source of the file and the transformations it has gone through) as well as annotations added to the files.
> artifact = query.first()
> artifact.describe()
Artifact(uid='88k1xpSWBdJzEbYUf6MZ', key='cortex-align-2018/Zeisel 2018/l1_cortex2.h5ad', suffix='.h5ad', accessor='AnnData', size=77363974, hash='SlY49Cv9vnMGofdSCMOj90', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-01-21 07:04:35 UTC)
Provenance:
🗃️ storage: Storage(uid='3wVRAheC', root='s3://vals-scrna', type='s3', region='us-west-1', updated_at=2023-11-05 20:36:19 UTC, created_by_id=1)
👤 created_by: User(uid='8joZB4lw', handle='vals', updated_at=2023-11-05 20:36:19 UTC)
⬇️ input_of (core.Run): ['2024-01-25 15:18:34 UTC']
Features:
external: FeatureSet(uid='b20JpuwZ5gfBUAvcS8jP', n=3, registry='core.Feature', hash='iKuEmOyrwt4opSDzC-dF', updated_at=2024-01-21 07:29:48 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'mouse'
🔗 tissue (1, bionty.Tissue): 'brain'
🔗 technique (1, bionty.ExperimentalFactor): 'Chromium'
Labels:
🏷️ organism (1, bionty.Organism): 'mouse'
🏷️ tissues (1, bionty.Tissue): 'brain'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'Chromium'
Since LaminDB has native support for H5AD files and AnnData objects, the contents of an artifact, represented by a link to an H5AD file, can be loaded into the local Python session directly from the cloud storage it resides in.
> adata = artifact.load()
> adata
AnnData object with n_obs × n_vars = 20811 × 27998
obs: 'Age', 'AnalysisPool', 'AnalysisProject', 'CellConc', 'Cell_Conc', 'ChipID', 'Class', 'ClassProbability_Astrocyte', 'ClassProbability_Astrocyte,Immune', 'ClassProbability_Astrocyte,Neurons', 'ClassProbability_Astrocyte,Oligos', 'ClassProbability_Astrocyte,Vascular', 'ClassProbability_Bergmann-glia', 'ClassProbability_Blood', 'ClassProbability_Blood,Vascular', 'ClassProbability_Enteric-glia', 'ClassProbability_Enteric-glia,Cycling', 'ClassProbability_Ependymal', 'ClassProbability_Ex-Neurons', 'ClassProbability_Ex-Vascular', 'ClassProbability_Immune', 'ClassProbability_Immune,Neurons', 'ClassProbability_Immune,Oligos', 'ClassProbability_Neurons', 'ClassProbability_Neurons,Cycling', 'ClassProbability_Neurons,Oligos', 'ClassProbability_Neurons,Satellite-glia', 'ClassProbability_Neurons,Vascular', 'ClassProbability_OEC', 'ClassProbability_Oligos', 'ClassProbability_Oligos,Cycling', 'ClassProbability_Oligos,Vascular', 'ClassProbability_Satellite-glia', 'ClassProbability_Satellite-glia,Cycling', 'ClassProbability_Satellite-glia,Schwann', 'ClassProbability_Schwann', 'ClassProbability_Ttr', 'ClassProbability_Vascular', 'Clusters', 'Comments', 'DateCaptured', 'Date_Captured', 'DonorID', 'Flowcell', 'Fraction Reads in Cells', 'Label', 'Mean Reads per Cell', 'Median Genes per Cell', 'MitoRiboRatio', 'NGI_PlateWell', 'NumPooledAnimals', 'Num_Pooled_Animals', 'Number of Reads', 'Outliers', 'PCRCycles', 'PCR_Cycles', 'PassedQC', 'PlugDate', 'Plug_Date', 'Project', 'Q30 Bases in Barcode', 'Q30 Bases in UMI', 'Reads Mapped Confidently to Exonic Regions', 'Reads Mapped Confidently to Intergenic Regions', 'Reads Mapped Confidently to Intronic Regions', 'Reads Mapped Confidently to Transcriptome', 'SampleID', 'SampleIndex', 'SampleOK', 'Sample_Index', 'SeqComment', 'SeqLibDate', 'SeqLibOk', 'Seq_Comment', 'Seq_Lib_Date', 'Seq_Lib_Ok', 'Serial_Number', 'Sex', 'Species', 'Strain', 'Subclass', 'TargetNumCells', 'Target_Num_Cells', 'TimepointPool', 'Tissue', 'Transcriptome', 'Valid 
Barcodes', '_KMeans_10', '_LogCV', '_LogMean', '_NGenes', '_Total', '_Valid', '_X', '_Y', '_tSNE1', '_tSNE2', 'cDNAConcNanogramPerMicroliter', 'cDNALibOk', 'cDNA_Lib_Ok', 'ngperul_cDNA', 'ClusterName', 'Description', 'Probable_location'
var: 'Accession', '_LogCV', '_LogMean', '_Selected', '_Total', '_Valid'
Say we want to look at some human 10x Chromium data, but we’re not sure which tissue, or even what tissues are represented in the database. We can query the database and investigate other pieces of metadata for the results.
> query = (
ln.Artifact
.filter(organism = organisms.human)
.filter(experimental_factors = techniques.chromium)
)
> (
query
.df(include = 'tissues__name')
.explode('tissues__name')
.groupby(['tissues__name'])
.size()
.sort_values(ascending = False)
.head(5)
)
tissues__name
blood 39
colon 35
cell culture 30
brain 3
lung 3
dtype: int64
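The explode/groupby pattern above generalizes to any many-to-many label. A self-contained toy version with made-up data, since each artifact can carry several tissue labels:

```python
import pandas as pd

# Each artifact can have multiple tissue labels, so the column holds lists.
df = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "tissues__name": [["blood"], ["blood", "rectum"], ["colon"]],
})

# explode() turns each list element into its own row, so size() counts
# label occurrences across artifacts rather than artifacts themselves.
counts = (
    df.explode("tissues__name")
      .groupby("tissues__name")
      .size()
      .sort_values(ascending=False)
)
print(counts)  # blood: 2, colon: 1, rectum: 1
```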
Based on these results, we decide that blood would be an interesting tissue for our current question.
> query = (
ln.Artifact
.filter(organism = organisms.human)
.filter(tissues = tissues.blood)
.filter(experimental_factors = techniques.chromium)
)
> print(query.df().head())
uid storage_id
id
90 UWETAZbzAURlLQCyTV1d 1 \
10 ZbAhp5ASiaOaKBqpOFQ2 1
11 bEIwnqnBRf3ewAjRzNOY 1
24 S2hi2e0q1t0rr2KL667H 1
25 GZN0nW18UUQ1JOfUPyQR 1
key suffix accessor
id
90 Boland et al/GSE125527.h5ad .h5ad AnnData \
10 220222 Exploratory analysis with scVI/GSE13826... .h5ad AnnData
11 220222 Exploratory analysis with scVI/GSE13826... .h5ad AnnData
24 hca-benchmark-2019/h5ads/10X2x5Kcell250Kreads_... .h5ad AnnData
25 hca-benchmark-2019/h5ads/10X2x5Kcell250Kreads_... .h5ad AnnData
description version size hash hash_type
id
90 None None 526865612 5c33pdXfI7munQb_UJV8bO sha1-fl \
10 None None 1246663095 hFW_l02ns5NOOE9pCOX64h sha1-fl
11 None None 619244593 Fj40Pif5bUgrAt4vllmHY2 sha1-fl
24 None None 232265080 yZysuTc6RctVHhd2E2H3lc sha1-fl
25 None None 245881240 w9nksKG3qtOHaKmYxNYmge sha1-fl
n_objects n_observations transform_id run_id visibility key_is_virtual
id
90 None None None None 1 True \
10 None None None None 1 True
11 None None None None 1 True
24 None None None None 1 True
25 None None None None 1 True
created_at updated_at
id
90 2023-11-07 04:42:48.631068+00:00 2024-01-21 07:04:08.645070+00:00 \
10 2023-11-06 06:58:09.539943+00:00 2024-01-21 07:04:00.774051+00:00
11 2023-11-06 06:59:46.752959+00:00 2024-01-21 07:04:04.738173+00:00
24 2023-11-06 16:25:39.070645+00:00 2024-01-21 07:06:44.435484+00:00
25 2023-11-06 16:26:15.070279+00:00 2024-01-21 07:06:59.559525+00:00
created_by_id
id
90 1
10 1
11 1
24 1
25 1
Then we can have a closer look at the first result, and load the data.
> artifact = query.first()
> artifact.describe()
Artifact(uid='UWETAZbzAURlLQCyTV1d', key='Boland et al/GSE125527.h5ad', suffix='.h5ad', accessor='AnnData', size=526865612, hash='5c33pdXfI7munQb_UJV8bO', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-01-21 07:04:08 UTC)
Provenance:
🗃️ storage: Storage(uid='3wVRAheC', root='s3://vals-scrna', type='s3', region='us-west-1', updated_at=2023-11-05 20:36:19 UTC, created_by_id=1)
👤 created_by: User(uid='8joZB4lw', handle='vals', updated_at=2023-11-05 20:36:19 UTC)
⬇️ input_of (core.Run): ['2024-01-25 15:18:34 UTC']
Features:
external: FeatureSet(uid='b20JpuwZ5gfBUAvcS8jP', n=3, registry='core.Feature', hash='iKuEmOyrwt4opSDzC-dF', updated_at=2024-01-21 07:29:48 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'human'
🔗 tissue (2, bionty.Tissue): 'rectum', 'blood'
🔗 technique (1, bionty.ExperimentalFactor): 'Chromium'
Labels:
🏷️ organism (1, bionty.Organism): 'human'
🏷️ tissues (2, bionty.Tissue): 'rectum', 'blood'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'Chromium'
> adata = artifact.load()
> adata
AnnData object with n_obs × n_vars = 68627 × 10160
obs: 'cell_id', 'patient_assignment', 'tissue_assignment', 'disease_assignment', 'celltype'
> print(adata.obs)
cell_id patient_assignment tissue_assignment
AAACCTGCATGGTAGG-1 1 C1 R \
AAACGGGAGATAGGAG-1 2 C1 R
AACACGTGTGTGCCTG-1 3 C1 R
AACCGCGGTTCAGACT-1 4 C1 R
AACCGCGTCTCGCATC-1 5 C1 R
... ... ... ...
AAAGATGAGAATGTGT-30 68623 C9 PBMC
CGACTTCGTGACTACT-30 68624 C9 PBMC
GTATTCTCAAGCCATT-30 68625 C9 PBMC
TGCCCATAGTTGCAGG-30 68626 C9 PBMC
TGGTTCCAGACGCACA-30 68627 C9 PBMC
disease_assignment celltype
AAACCTGCATGGTAGG-1 healthy T
AAACGGGAGATAGGAG-1 healthy T
AACACGTGTGTGCCTG-1 healthy T
AACCGCGGTTCAGACT-1 healthy T
AACCGCGTCTCGCATC-1 healthy T
... ... ...
AAAGATGAGAATGTGT-30 healthy T
CGACTTCGTGACTACT-30 healthy T
GTATTCTCAAGCCATT-30 healthy T
TGCCCATAGTTGCAGG-30 healthy T
TGGTTCCAGACGCACA-30 healthy T
[68627 rows x 5 columns]
Previously, my workflow would have been to recall that at some point I downloaded some human blood data, try to remember whether I used it in a blog post, and hope that a copy of the data was stored in my Google Drive folder for that post. With the LaminDB database, all data I have used previously is instantly accessible.
At this point, I have only ingested H5AD files and annotated them. LaminDB also supports tracking metadata for all the single cells and genes in the H5AD files using standardized ontologies. This seems extremely useful for searching for data of interest, and keeping it harmonized, but would require quite a lot of work for my current collection.
Even at the level of just tracking H5AD files, the database is a far better solution than my many scattered Google Drive folders.
Alex Wolf and Sunny Sun of Lamin helped me add annotations to my artifacts and provided feedback on this post, and Lamin offered to host my collection so it can be made available publicly through lamin.ai. Notebooks with code related to this work are available on GitHub.