Analysis flow

Here, we’ll track typical data transformations that occur during analysis, such as subsetting a dataset.

For a more general introduction, read Project flow first.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-usecase --schema bionty

import lamindb as ln
import bionty as bt
from lamin_utils import logger

💡 connected lamindb: testuser1/analysis-usecase

Register an initial dataset

Here we register an initial artifact by running the pipeline script register_example_file.py.

!python analysis-flow-scripts/register_example_file.py

💡 connected lamindb: testuser1/analysis-usecase

💡 saved: Transform(uid='K4wsS5DTYdFp6K79', version='0', name='register_example_file.py', key='register_example_file.py', type='script', created_by_id=1, updated_at='2024-08-06 18:35:11 UTC')
💡 saved: Run(uid='t62HOVWEIcvS4Ag6GjiH', transform_id=1, created_by_id=1)

✅ added 3 records with Feature.name for columns: 'cell_type', 'tissue', 'disease'

💡 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
💡 saving labels for 'cell_type'

✅ added 3 records from public with CellType.name for cell_type: 'T cell', 'hematopoietic stem cell', 'hepatocyte'
❗ 1 non-validated categories are not saved in CellType.name: ['my new cell type']!
      → to lookup categories, use lookup().cell_type
      → to save, run .add_new_from('cell_type')
💡 saving labels for 'tissue'

💡 saving labels for 'disease'

✅ added 1 record with CellType.name for cell_type: 'my new cell type'

✅ created 1 Organism record from Bionty matching name: 'human'

💡 mapping var_index on Gene.ensembl_gene_id
❗    found 99 validated terms: ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079', 'ENSG00000002330', 'ENSG00000002549', 'ENSG00000002586', 'ENSG00000002587', 'ENSG00000002726', 'ENSG00000002745', 'ENSG00000002746', 'ENSG00000002822', 'ENSG00000002834', 'ENSG00000002919', 'ENSG00000002933', 'ENSG00000003056', 'ENSG00000003096', 'ENSG00000003137', 'ENSG00000003147', 'ENSG00000003249', 'ENSG00000003393', 'ENSG00000003400', 'ENSG00000003402', 'ENSG00000003436', 'ENSG00000003509', 'ENSG00000003756', 'ENSG00000003987', 'ENSG00000003989', 'ENSG00000004059', 'ENSG00000004139', 'ENSG00000004142', 'ENSG00000004399', 'ENSG00000004455', 'ENSG00000004468', 'ENSG00000004478', 'ENSG00000004487', 'ENSG00000004534', 'ENSG00000004660', 'ENSG00000004700', 'ENSG00000004766', 'ENSG00000004776', 'ENSG00000004777', 'ENSG00000004779', 'ENSG00000004799', 'ENSG00000004809', 'ENSG00000004838', 'ENSG00000004846', 'ENSG00000004848', 'ENSG00000004864', 'ENSG00000004866', 'ENSG00000004897', 'ENSG00000004939', 'ENSG00000004948', 'ENSG00000004961', 'ENSG00000004975', 'ENSG00000005001', 'ENSG00000005007', 'ENSG00000005020', 'ENSG00000005022', 'ENSG00000005059', 'ENSG00000005073', 'ENSG00000005075', 'ENSG00000005100', 'ENSG00000005102', 'ENSG00000005108', 'ENSG00000005156', 'ENSG00000005175', 'ENSG00000005187', 'ENSG00000005189', 'ENSG00000005194', 'ENSG00000005206', 'ENSG00000005238', 'ENSG00000005243', 'ENSG00000005249', 'ENSG00000005302', 'ENSG00000005339', 'ENSG00000005379', 'ENSG00000005381', 'ENSG00000005421', 'ENSG00000005436', 'ENSG00000005448', 'ENSG00000005469']
      → save terms via .add_validated_from_var_index()
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name

✅ tissue is validated against Tissue.name
✅ disease is validated against Disease.name
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/IIVJ2io8MEIBERrIYGWL.h5ad')

✅ storing artifact 'IIVJ2io8MEIBERrIYGWL' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/IIVJ2io8MEIBERrIYGWL.h5ad'
💡 parsing feature names of X stored in slot 'var'
❗    Your Gene registry is empty, consider populating it first!
   → use `.import_from_source()` to import records from a source, e.g. a public ontology
❗ skip linking features to artifact in slot 'var'
💡 parsing feature names of slot 'obs'
✅    3 terms (75.00%) are validated for name
❗    1 term (25.00%) is not validated for name: cell_type_id
✅    linked: FeatureSet(uid='xuBUNi7IQzIp4OXKTzQE', n=3, registry='Feature', hash='3oDgqiOGp7x48LU91NvOJQ', created_by_id=1, run_id=1)
✅ saved 1 feature set for slot: 'obs'

Pull the registered dataset, apply a transformation, and register the result

Track the current notebook:

ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()

💡 notebook imports: bionty==0.47.1 lamin_utils==0.13.2 lamindb==0.75.0

💡 saved: Transform(uid='eNef4Arw8nNM6K79', version='0', name='Analysis flow', key='analysis-flow', type='notebook', created_by_id=1, updated_at='2024-08-06 18:35:25 UTC')

💡 saved: Run(uid='vqaqqdiBSJt2h3fnMFQB', transform_id=2, created_by_id=1)

Run(uid='vqaqqdiBSJt2h3fnMFQB', started_at='2024-08-06 18:35:25 UTC', is_consecutive=True, transform_id=2, created_by_id=1)

artifact = ln.Artifact.filter(description="anndata with obs").one()
artifact.describe()

Artifact(uid='IIVJ2io8MEIBERrIYGWL', description='anndata with obs', suffix='.h5ad', type='dataset', _accessor='AnnData', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', _hash_type='md5', n_observations=40, visibility=1, _key_is_virtual=True, updated_at='2024-08-06 18:35:23 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase'
    .transform = 'register_example_file.py'
    .run = '2024-08-06 18:35:11 UTC'
  Labels
    .tissues = 'kidney', 'liver', 'heart', 'brain'
    .cell_types = 'T cell', 'hematopoietic stem cell', 'hepatocyte', 'my new cell type'
    .diseases = 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell', 'hepatocyte', 'my new cell type'
    'disease' = 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
    'tissue' = 'kidney', 'liver', 'heart', 'brain'
  Feature sets
    'obs' = 'cell_type', 'tissue', 'disease'

Get a backed AnnData object

adata = artifact.open()
adata

AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object IIVJ2io8MEIBERrIYGWL.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")
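The lookup objects expose each record name as an attribute, which is why `cell_types.t_cell` can stand in for the string 'T cell' in the next cell. As a rough mental model only (this is not LaminDB's implementation), the sketch below builds a similar object from a list of names. The helper `make_lookup` and its normalization rule (lowercase, non-alphanumeric runs become underscores) are assumptions inferred from the attribute names used in this notebook.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Hypothetical helper: map normalized attribute names to the original strings."""

    def to_attr(name):
        # Assumed normalization: lowercase, runs of non-alphanumerics become "_"
        return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

    return SimpleNamespace(**{to_attr(name): name for name in names})


toy_cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(toy_cell_types.t_cell)  # T cell
print(toy_cell_types.hematopoietic_stem_cell)  # hematopoietic stem cell
```

Attribute access of this kind lets an editor autocomplete the available names, which helps avoid typos in free-text labels.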

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))

adata_subset = adata[subset_obs]
adata_subset

AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']

adata_subset.obs[["cell_type", "disease"]].value_counts()

cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
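The mask above combines two membership tests with a logical AND: a row is kept only if its cell type is in the first list and its disease is in the second. The plain-Python sketch below mirrors that logic on a few hypothetical toy rows (not the real dataset), with `collections.Counter` standing in for `value_counts()`.

```python
from collections import Counter

# Hypothetical toy rows standing in for adata.obs; not the real dataset
toy_obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: both membership tests must pass (logical AND)
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in toy_obs
]
toy_subset = [row for row, keep in zip(toy_obs, mask) if keep]

# Count (cell_type, disease) pairs, analogous to value_counts()
pair_counts = Counter((row["cell_type"], row["disease"]) for row in toy_subset)
print(mask)  # [True, False, True, False]
print(pair_counts)
```

A row that satisfies only one of the two conditions, such as the T cell with Alzheimer disease, is dropped.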

Register the subsetted AnnData:

curate = ln.Curate.from_anndata(
    adata_subset.to_memory(), 
    var_index=bt.Gene.ensembl_gene_id, 
    categoricals={
        "cell_type": bt.CellType.name, 
        "disease": bt.Disease.name, 
        "tissue": bt.Tissue.name,
    },
    organism="human"
)

curate.validate()

💡 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

💡 mapping var_index on Gene.ensembl_gene_id

❗    found 99 validated terms: ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079', 'ENSG00000002330', 'ENSG00000002549', 'ENSG00000002586', 'ENSG00000002587', 'ENSG00000002726', 'ENSG00000002745', 'ENSG00000002746', 'ENSG00000002822', 'ENSG00000002834', 'ENSG00000002919', 'ENSG00000002933', 'ENSG00000003056', 'ENSG00000003096', 'ENSG00000003137', 'ENSG00000003147', 'ENSG00000003249', 'ENSG00000003393', 'ENSG00000003400', 'ENSG00000003402', 'ENSG00000003436', 'ENSG00000003509', 'ENSG00000003756', 'ENSG00000003987', 'ENSG00000003989', 'ENSG00000004059', 'ENSG00000004139', 'ENSG00000004142', 'ENSG00000004399', 'ENSG00000004455', 'ENSG00000004468', 'ENSG00000004478', 'ENSG00000004487', 'ENSG00000004534', 'ENSG00000004660', 'ENSG00000004700', 'ENSG00000004766', 'ENSG00000004776', 'ENSG00000004777', 'ENSG00000004779', 'ENSG00000004799', 'ENSG00000004809', 'ENSG00000004838', 'ENSG00000004846', 'ENSG00000004848', 'ENSG00000004864', 'ENSG00000004866', 'ENSG00000004897', 'ENSG00000004939', 'ENSG00000004948', 'ENSG00000004961', 'ENSG00000004975', 'ENSG00000005001', 'ENSG00000005007', 'ENSG00000005020', 'ENSG00000005022', 'ENSG00000005059', 'ENSG00000005073', 'ENSG00000005075', 'ENSG00000005100', 'ENSG00000005102', 'ENSG00000005108', 'ENSG00000005156', 'ENSG00000005175', 'ENSG00000005187', 'ENSG00000005189', 'ENSG00000005194', 'ENSG00000005206', 'ENSG00000005238', 'ENSG00000005243', 'ENSG00000005249', 'ENSG00000005302', 'ENSG00000005339', 'ENSG00000005379', 'ENSG00000005381', 'ENSG00000005421', 'ENSG00000005436', 'ENSG00000005448', 'ENSG00000005469']
      → save terms via .add_validated_from_var_index()

✅ var_index is validated against Gene.ensembl_gene_id

✅ cell_type is validated against CellType.name

✅ disease is validated against Disease.name

✅ tissue is validated against Tissue.name

True
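The `True` returned here means that every categorical column and the variable index mapped onto known registry entries. Conceptually, this kind of validation is a set-membership check per column. The sketch below illustrates only that idea: `find_non_validated` and the toy registries are hypothetical, and LaminDB's curation does more than this, for example matching against public ontologies as the logs show.

```python
def find_non_validated(columns, registries):
    """Hypothetical helper: return values per column that are absent from its registry."""
    missing = {}
    for column, values in columns.items():
        unknown = sorted(set(values) - registries[column])
        if unknown:
            missing[column] = unknown
    return missing


# Toy registries of allowed names (assumed for illustration)
toy_registries = {
    "cell_type": {"T cell", "hematopoietic stem cell", "hepatocyte"},
    "disease": {"chronic kidney disease", "liver lymphoma"},
}
toy_columns = {
    "cell_type": ["T cell", "my new cell type", "T cell"],
    "disease": ["liver lymphoma", "chronic kidney disease", "liver lymphoma"],
}

non_validated = find_non_validated(toy_columns, toy_registries)
is_valid = not non_validated
print(non_validated)  # {'cell_type': ['my new cell type']}
print(is_valid)  # False
```

In this toy case the check fails until 'my new cell type' is added to the registry, which parallels the "non-validated categories" message in the registration log.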

artifact = curate.save_artifact(description="anndata with obs subset")

artifact.describe()

Artifact(uid='Nn0BqIP6PuluchOxMoP8', description='anndata with obs subset', suffix='.h5ad', type='dataset', _accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', _hash_type='md5', n_observations=20, visibility=1, _key_is_virtual=True, updated_at='2024-08-06 18:35:29 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase'
    .transform = 'Analysis flow'
    .run = '2024-08-06 18:35:25 UTC'
  Labels
    .tissues = 'kidney', 'liver'
    .cell_types = 'T cell', 'hematopoietic stem cell'
    .diseases = 'chronic kidney disease', 'liver lymphoma'
  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell'
    'disease' = 'chronic kidney disease', 'liver lymphoma'
    'tissue' = 'kidney', 'liver'
  Feature sets
    'obs' = 'cell_type', 'tissue', 'disease'

Examine data flow

Query for a subsetted .h5ad artifact labeled with “hematopoietic stem cell” or “T cell”:

cell_types = bt.CellType.lookup()

my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()

my_subset

Artifact(uid='Nn0BqIP6PuluchOxMoP8', description='anndata with obs subset', suffix='.h5ad', type='dataset', _accessor='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', _hash_type='md5', n_observations=20, visibility=1, _key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=2, run_id=2, updated_at='2024-08-06 18:35:29 UTC')
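The filter keywords above use double-underscore suffixes (`__endswith`, `__in`) to choose the comparison applied to a field, a naming style familiar from Django's ORM. To make the semantics concrete, here is a plain-Python sketch that evaluates the same kind of conditions over dictionaries. The helper `matches` and the toy records are hypothetical and are not how LaminDB executes queries.

```python
def matches(record, **conditions):
    """Hypothetical helper: test a dict against field__lookup style conditions."""
    for key, expected in conditions.items():
        field, _, lookup = key.partition("__")
        value = record[field]
        if lookup == "":
            ok = value == expected
        elif lookup == "endswith":
            ok = value.endswith(expected)
        elif lookup == "in":
            # For a multi-valued field, any shared element counts as a match
            values = value if isinstance(value, list) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


toy_artifacts = [
    {"description": "anndata with obs", "suffix": ".h5ad",
     "cell_types": ["T cell", "hepatocyte"]},
    {"description": "anndata with obs subset", "suffix": ".h5ad",
     "cell_types": ["T cell", "hematopoietic stem cell"]},
]

hits = [
    a for a in toy_artifacts
    if matches(a, suffix=".h5ad", description__endswith="subset",
               cell_types__in=["hematopoietic stem cell", "T cell"])
]
print([a["description"] for a in hits])  # ['anndata with obs subset']
```

All conditions must hold at once, so the first toy record is excluded by the `description__endswith` test even though it shares a cell type with the list.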

Common questions that might arise are:

What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?

Let’s answer these using LaminDB:

print("--> What is the history of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(artifact.features)
logger.print(artifact.labels)

print("\n\n--> Which notebook analyzed and registered this artifact\n")
logger.print(artifact.transform)

print("\n\n--> By whom\n")
logger.print(artifact.created_by)

print("\n\n--> And which artifact is its parent\n")
display(artifact.run.input_artifacts.df())

--> What is the history of this artifact?

[Figure: lineage graph rendered by artifact.view_lineage()]

--> Which features and labels are associated with it?

  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell'
    'disease' = 'chronic kidney disease', 'liver lymphoma'
    'tissue' = 'kidney', 'liver'
  Feature sets
    'obs' = 'cell_type', 'tissue', 'disease'

  Labels
    .tissues = 'kidney', 'liver'
    .cell_types = 'T cell', 'hematopoietic stem cell'
    .diseases = 'chronic kidney disease', 'liver lymphoma'

--> Which notebook analyzed and registered this artifact

Transform(uid='eNef4Arw8nNM6K79', version='0', name='Analysis flow', key='analysis-flow', type='notebook', created_by_id=1, updated_at='2024-08-06 18:35:25 UTC')

--> By whom

User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at='2024-08-06 18:35:07 UTC')

--> And which artifact is its parent

	uid	version	description	key	suffix	type	_accessor	size	hash	_hash_type	n_objects	n_observations	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
1	IIVJ2io8MEIBERrIYGWL	None	anndata with obs	None	.h5ad	dataset	AnnData	46992	IJORtcQUSS11QBqD-nTD0A	md5	None	40	1	True	1	1	1	1	2024-08-06 18:35:23.952704+00:00
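All of the answers above come from following links between records: an artifact points to the run that created it, and that run points to the transform it executed and to the artifacts it read as input. The sketch below is a simplified toy model of those links using dataclasses. The class names mirror those in the output, but the fields and the `lineage` helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    name: str


@dataclass
class Run:
    transform: Transform
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that created this artifact


def lineage(artifact):
    """Hypothetical helper: walk back through creating runs and their inputs."""
    steps = []
    todo = [artifact]
    while todo:
        current = todo.pop()
        if current.run is None:
            continue
        steps.append((current.run.transform.name, current.description))
        todo.extend(current.run.input_artifacts)
    return steps


raw = Artifact("anndata with obs", run=Run(Transform("register_example_file.py")))
subset = Artifact(
    "anndata with obs subset",
    run=Run(Transform("Analysis flow"), input_artifacts=[raw]),
)

for transform_name, description in lineage(subset):
    print(f"{transform_name} -> {description}")
```

In this toy model, asking for the parent of the subset amounts to reading `subset.run.input_artifacts`, which parallels the `artifact.run.input_artifacts` call used above.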