Bulk RNA-seq¶
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage test-bulkrna --schema bionty
💡 connected lamindb: testuser1/test-bulkrna
import lamindb as ln
from pathlib import Path
import bionty as bt
import pandas as pd
import anndata as ad
💡 connected lamindb: testuser1/test-bulkrna
Ingest data¶
Access¶
We start by simulating an nf-core RNA-seq run, which yields a count matrix artifact.
(See Nextflow for how to run this with Nextflow.)
# pretend we're running a bulk RNA-seq pipeline
ln.track(transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()
💡 saved: Transform(uid='mW5MekmWlxYoUl4Q', name='nf-core RNA-seq', type='pipeline', reference='https://nf-co.re/rnaseq', created_by_id=1, updated_at='2024-08-06 18:33:02 UTC')
💡 saved: Run(uid='SLo51hIsaNN9KWOWpWhC', transform_id=1, created_by_id=1)
Artifact(uid='0LjedQgK7Lo6qRVsY6Yx', description='Merged Bulk RNA counts', key='output_dir/salmon.merged.gene_counts.tsv', suffix='.tsv', size=3787, hash='xxw0k3au3KtxFcgtbEr4eQ', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-08-06 18:33:04 UTC')
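To double-check what this simulated pipeline run registered, you can optionally list the transform registry (a quick sketch using the registry's .df() accessor):
# optional sanity check: list the transforms registered so far
ln.Transform.df()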
Transform¶
ln.settings.transform.stem_uid = "s5V0dNMVwL9i"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: anndata==0.10.8 bionty==0.47.1 lamindb==0.75.0 pandas==1.5.3
💡 saved: Transform(uid='s5V0dNMVwL9i6K79', version='0', name='Bulk RNA-seq', key='bulkrna', type='notebook', created_by_id=1, updated_at='2024-08-06 18:33:04 UTC')
💡 saved: Run(uid='KGRMbsfm3JLeo3xx0ly7', transform_id=2, created_by_id=1)
Run(uid='KGRMbsfm3JLeo3xx0ly7', started_at='2024-08-06 18:33:04 UTC', is_consecutive=True, transform_id=2, created_by_id=1)
Let’s query the artifact:
artifact = ln.Artifact.filter(description="Merged Bulk RNA counts").one()
df = artifact.load()
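Before reshaping anything, it can help to glance at the dimensions and column dtypes of the loaded table (plain pandas, nothing LaminDB-specific):
# optional: dimensions and column dtypes of the loaded count table
print(df.shape)
print(df.dtypes.head())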
If we look at it, we realize that it deviates from the tidy data standard [Wickham14], from the conventions of statistics & machine learning [Hastie09, Murphy12], and from the major Python & R data packages.
Variables are not in columns and observations are not in rows:
df
 | gene_id | gene_name | RAP1_IAA_30M_REP1 | RAP1_UNINDUCED_REP1 | RAP1_UNINDUCED_REP2 | WT_REP1 | WT_REP2 |
---|---|---|---|---|---|---|---|
0 | Gfp_transgene_gene | Gfp_transgene_gene | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
1 | HRA1 | HRA1 | 0.0 | 8.572 | 0.0 | 0.0 | 0.0 |
2 | snR18 | snR18 | 3.0 | 8.000 | 4.0 | 8.0 | 8.0 |
3 | tA(UGC)A | TGA1 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
4 | tL(CAA)A | SUP56 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
120 | YAR064W | YAR064W | 0.0 | 2.000 | 0.0 | 0.0 | 0.0 |
121 | YAR066W | YAR066W | 3.0 | 13.000 | 8.0 | 5.0 | 11.0 |
122 | YAR068W | YAR068W | 9.0 | 28.000 | 24.0 | 5.0 | 7.0 |
123 | YAR069C | YAR069C | 0.0 | 0.000 | 0.0 | 0.0 | 1.0 |
124 | YAR070C | YAR070C | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
125 rows × 7 columns
Let’s change that and move observations into rows:
df = df.T
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
gene_name | Gfp_transgene_gene | HRA1 | snR18 | TGA1 | SUP56 | TRN1 | tS(AGA)A | TFC3 | VPS8 | EFB1 | ... | FLO1 | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
RAP1_IAA_30M_REP1 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.72 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
7 rows × 125 columns
Now it's clear that the first two rows are in fact not observations, but descriptions of the variables (or features) themselves.
Let's create an AnnData object to model this. First, create a DataFrame for the variables:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_id | gene_name
---|---
Gfp_transgene_gene | Gfp_transgene_gene
HRA1 | HRA1
snR18 | snR18
tA(UGC)A | TGA1
tL(CAA)A | SUP56
Now, let’s create an AnnData:
# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs × n_vars = 5 × 125
var: 'gene_name'
The AnnData object is in tidy form and complies with conventions of statistics and machine learning:
adata.to_df()
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RAP1_IAA_30M_REP1 | 0.0 | 0.000 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.720 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.000 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
5 rows × 125 columns
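If you also want per-sample metadata in adata.obs, one option is to parse it out of the sample names. The snippet below is only a sketch and assumes the names follow the <condition>_REP<n> pattern seen above:
# hypothetical sketch: derive obs columns from sample names like "WT_REP1"
import re

adata.obs["condition"] = [re.sub(r"_REP\d+$", "", name) for name in adata.obs_names]
adata.obs["replicate"] = [int(re.search(r"_REP(\d+)$", name).group(1)) for name in adata.obs_names]
adata.obs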
Validate¶
Let's create an Artifact object from this AnnData.
Almost all gene IDs are validated:
genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)
❗ did not create Gene records for 2 non-validated stable_ids: 'Gfp_transgene_gene', 'YAR062W'
# register the validated genes obtained from Bionty
ln.save(genes)
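If you prefer to check identifiers up front before creating any records, the registry also offers an inspect helper; a hedged sketch:
# optional: report which stable IDs validate against the Gene registry
bt.Gene.inspect(adata.var.index, bt.Gene.stable_id, organism="saccharomyces cerevisiae")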
Register¶
efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()
curated_file = ln.Artifact.from_anndata(
    adata,
    description="Curated bulk RNA counts"
)
Let's save this artifact:
curated_file.save()
Artifact(uid='L2Y7GInXNYcIVbpA0Mx9', description='Curated bulk RNA counts', suffix='.h5ad', type='dataset', _accessor='AnnData', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', _hash_type='md5', visibility=1, _key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-06 18:33:06 UTC')
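Optionally, you can check where the saved .h5ad landed in storage via the artifact's path attribute:
# optional: storage location of the saved artifact
curated_file.path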
Link to validated metadata records:
curated_file.features._add_set_from_anndata(var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae")
❗ 2 terms (1.60%) are not validated for stable_id: Gfp_transgene_gene, YAR062W
curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)
curated_file.describe()
Artifact(uid='L2Y7GInXNYcIVbpA0Mx9', description='Curated bulk RNA counts', suffix='.h5ad', type='dataset', _accessor='AnnData', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', _hash_type='md5', visibility=1, _key_is_virtual=True, updated_at='2024-08-06 18:33:06 UTC')
Provenance
.created_by = 'testuser1'
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna'
.transform = 'Bulk RNA-seq'
.run = '2024-08-06 18:33:04 UTC'
Labels
.organisms = 'saccharomyces cerevisiae'
.experimental_factors = 'RNA-Seq'
Features
'assay' = 'RNA-Seq'
'organism' = 'saccharomyces cerevisiae'
Feature sets
'var' = 'None', 'TGA1', 'SUP56', 'TRN1', 'TFC3', 'VPS8', 'EFB1', 'SSA1', 'ERP2', 'FUN14', 'SPO7', 'MDM10', 'SWC3', 'CYS3', 'DEP1', 'SYN8', 'NTG1'
Query data¶
We now have two artifacts in the registry:
ln.Artifact.df()
id | uid | version | description | key | suffix | type | _accessor | size | hash | _hash_type | n_objects | n_observations | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2 | L2Y7GInXNYcIVbpA0Mx9 | None | Curated bulk RNA counts | None | .h5ad | dataset | AnnData | 28180 | 6bieh8XjOCCz6bJToN4u1g | md5 | None | None | 1 | True | 1 | 2 | 2 | 1 | 2024-08-06 18:33:06.398477+00:00 |
1 | 0LjedQgK7Lo6qRVsY6Yx | None | Merged Bulk RNA counts | output_dir/salmon.merged.gene_counts.tsv | .tsv | None | None | 3787 | xxw0k3au3KtxFcgtbEr4eQ | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-08-06 18:33:04.397164+00:00 |
curated_file.view_lineage()
Let's query by gene:
genes = bt.Gene.lookup()
genes.spo7
Gene(uid='2pkcLeMEB6aS', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', synonyms='', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation ', created_by_id=1, run_id=2, organism_id=1, source_id=19, updated_at='2024-08-06 18:33:06 UTC')
# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()
# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()
id | uid | version | description | key | suffix | type | _accessor | size | hash | _hash_type | n_objects | n_observations | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2 | L2Y7GInXNYcIVbpA0Mx9 | None | Curated bulk RNA counts | None | .h5ad | dataset | AnnData | 28180 | 6bieh8XjOCCz6bJToN4u1g | md5 | None | None | 1 | True | 1 | 2 | 2 | 1 | 2024-08-06 18:33:06.398477+00:00 |
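You can also query by the labels linked earlier and pull a queried artifact back into memory. This is a hedged sketch that assumes the organism label is exposed on Artifact through the organisms field (as .describe() above suggests):
# query artifacts by organism label (field name "organisms" taken from .describe() above)
ln.Artifact.filter(organisms__name="saccharomyces cerevisiae").df()

# load the artifact that links to the SPO7-containing feature set back into memory
queried_artifact = ln.Artifact.filter(feature_sets=feature_set).first()
queried_adata = queried_artifact.load()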
# clean up test instance
!lamin delete --force test-bulkrna
!rm -r test-bulkrna
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/lamin", line 8, in <module>
sys.exit(main())
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 367, in __call__
return super().__call__(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamin_cli/__main__.py", line 164, in delete
return delete(instance, force=force)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/_delete.py", line 98, in delete
n_objects = check_storage_is_empty(
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/core/upath.py", line 779, in check_storage_is_empty
raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb contains 1 objects ('_is_initialized' ignored) - delete them prior to deleting the instance