Concatenate datasets to a single array store¶
In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.
Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices defined by arbitrary metadata (see this blog post). This is what CELLxGENE does to create Census: a number of .h5ad files are concatenated into a single tiledbsoma array store (CELLxGENE: scRNA-seq).
Note
This notebook is based on the tiledbsoma documentation.
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma
import tiledbsoma.io
from functools import reduce
💡 connected lamindb: testuser1/test-scrna
ln.settings.transform.stem_uid = "oJN8WmVrxI8m"
ln.settings.transform.version = "1"
ln.track()
Show code cell output
💡 notebook imports: lamindb==0.75.0 pandas==1.5.3 scanpy==1.9.6 tiledbsoma==1.12.3
💡 saved: Transform(uid='oJN8WmVrxI8m5zKv', version='1', name='Concatenate datasets to a single array store', key='scrna6', type='notebook', created_by_id=1, updated_at='2024-08-06 18:32:25 UTC')
💡 saved: Run(uid='O5Mmou9m2mXaX98GugSR', transform_id=6, created_by_id=1)
Run(uid='O5Mmou9m2mXaX98GugSR', started_at='2024-08-06 18:32:25 UTC', is_consecutive=True, transform_id=6, created_by_id=1)
Query the collection of h5ad files that we'd like to convert into a single array.
collection = ln.Collection.filter(
name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()
Show code cell output
Collection(uid='1BzyBzHTHpjklT2dprxh', version='2', name='My versioned scRNA-seq collection', hash='ALSqXiQ6gOCCwEGmVnuLQA', visibility=1, updated_at='2024-08-06 18:31:55 UTC')
Provenance
.created_by = 'testuser1'
.transform = 'Standardize and append a batch of data'
.run = '2024-08-06 18:31:33 UTC'
.input_of_runs = ["'2024-08-06 18:32:05 UTC'", "'2024-08-06 18:32:18 UTC'"]
Feature sets
'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
'obs' = 'donor', 'tissue', 'cell_type', 'assay'
Prepare the array store¶
Prepare a path and a context for a new tiledbsoma.Experiment.

We will create our array store at the storage root of the LaminDB instance, in a folder named "scrna.tiledbsoma".
storage_settings = ln.setup.settings.storage
soma_path = (storage_settings.root / "scrna.tiledbsoma").as_posix()  # we could use any S3 path here
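For a local test instance like this one, soma_path resolves to a plain directory path; for a cloud-backed instance it would be an S3 URI such as s3://my-bucket/scrna.tiledbsoma (bucket name hypothetical). You can always inspect it:

print(soma_path)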
If our path is on AWS S3, we need to create a context with region information (exception: us-east-1). You can find more about tiledb configuration parameters in the tiledb documentation.
if storage_settings.type == "s3":  # if the storage location is on AWS S3
    storage_region = storage_settings.region
    ctx = tiledbsoma.SOMATileDBContext(tiledb_config={"vfs.s3.region": storage_region})
else:
    ctx = None
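Other tiledb configuration parameters can be passed through tiledb_config in the same way. A sketch with illustrative values (the region and timeout below are hypothetical; check the tiledb documentation for the parameters available in your version):

ctx_custom = tiledbsoma.SOMATileDBContext(
    tiledb_config={
        "vfs.s3.region": "us-west-2",  # hypothetical region
        "vfs.s3.connect_timeout_ms": 10800,  # hypothetical connect timeout
    }
)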
Prepare the AnnData objects¶
We need to prepare the AnnData objects in the collection to be concatenated into one tiledbsoma.Experiment. They need to have the same .var and .obs columns; .uns and .obsp should be removed.
adatas = [artifact.load() for artifact in collection.ordered_artifacts]
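Before harmonizing, it can help to inspect what we loaded; a quick, optional sanity check of shapes and column counts:

for i, adata in enumerate(adatas):
    print(i, adata.shape, adata.obs.columns.size, adata.var.columns.size)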
Compute the intersection of all columns. All AnnData objects should have the same columns in their .obs, .var, and .raw.var to be ingested into one tiledbsoma.Experiment.
obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
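Optionally, inspect which columns survive the intersection:

print(obs_columns.tolist())
print(var_columns.tolist())
print(var_raw_columns.tolist())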
Prepare the AnnData objects for concatenation: create id fields, sanitize index names, intersect columns, and drop unneeded slots. Here we have to drop .obsp, .uns, and also the dataframe columns that are not in the intersections obtained above, otherwise the ingestion will fail. We will need to provide obs and var field names in tiledbsoma.io.register_anndatas, so we create these fields (obs_id, var_id) from the dataframe indices.
for i, adata in enumerate(adatas):
    # drop the slots that can't be ingested
    del adata.obsp
    del adata.uns
    # keep only the shared .obs columns, create the id field, track the dataset
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    # keep only the shared .var columns and create the id field
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    # same for .raw.var
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
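As an optional check that the harmonization worked, all objects should now share identical .obs and .var columns:

assert all(adata.obs.columns.equals(adatas[0].obs.columns) for adata in adatas)
assert all(adata.var.columns.equals(adatas[0].var.columns) for adata in adatas)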
Create the array store¶
Register all the AnnData objects. Pass experiment_uri=None because the tiledbsoma.Experiment doesn't exist yet:
registration_mapping = tiledbsoma.io.register_anndatas(
    experiment_uri=None,
    adatas=adatas,
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
    append_obsm_varm=True,
)
Ingest the AnnData objects sequentially, providing the context. This saves the AnnData objects in one array store.
for adata in adatas:
    tiledbsoma.io.from_anndata(
        experiment_uri=soma_path,
        anndata=adata,
        measurement_name="RNA",
        registration_mapping=registration_mapping,
        context=ctx,
    )
Show code cell output
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
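Optionally, we can already sanity-check the combined store by opening it directly with tiledbsoma before registering it (a minimal sketch; obs now spans all ingested datasets):

with tiledbsoma.Experiment.open(soma_path, context=ctx) as exp:
    print(exp.obs.count)  # total number of cells across all datasets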
Register the array store¶
Register the created tiledbsoma.Experiment store in lamindb:
soma_artifact = ln.Artifact(soma_path, description="My scRNA-seq SOMA Experiment").save()
soma_artifact.describe()
Show code cell output
Artifact(uid='7ucpUuNCF2gC9uLh8P4w', description='My scRNA-seq SOMA Experiment', key='scrna.tiledbsoma', suffix='.tiledbsoma', size=15054524, hash='lJc_py5PTIGZzkGuzTHi8w', _hash_type='md5-d', n_objects=143, visibility=1, _key_is_virtual=False, updated_at='2024-08-06 18:32:32 UTC')
Provenance
.created_by = 'testuser1'
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
.transform = 'Concatenate datasets to a single array store'
.run = '2024-08-06 18:32:25 UTC'
Query the array store¶
Open and query the experiment. We can use the registered Artifact. We query X and obs from the array store.
with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    ms_rna = soma_store["ms"]["RNA"]
    n_obs = len(obs)
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()
    print(obs.read().concat().to_pandas())
Show code cell output
soma_joinid cell_type \
0 0 dendritic cell
1 1 B cell, CD19-positive
2 2 dendritic cell
3 3 B cell, CD19-positive
4 4 effector memory CD4-positive, alpha-beta T cel...
... ... ...
1713 1713 naive thymus-derived CD4-positive, alpha-beta ...
1714 1714 naive thymus-derived CD4-positive, alpha-beta ...
1715 1715 naive thymus-derived CD4-positive, alpha-beta ...
1716 1716 CD8-positive, alpha-beta memory T cell
1717 1717 naive thymus-derived CD4-positive, alpha-beta ...
obs_id dataset
0 GCAGGGCTGGATTC-1 0
1 CTTTAGTGGTTACG-6 0
2 TGACTGGAACCATG-7 0
3 TCAATCACCCTTCG-8 0
4 CGTTATACAGTACC-8 0
... ... ...
1713 Pan_T7991594_CTCACACTCCAGGGCT 1
1714 Pan_T7980358_CGAGCACAGAAGATTC 1
1715 CZINY-0064_AGACCATCACGCTGCA 1
1716 CZINY-0050_TCGATTTAGATGTTGA 1
1717 CZINY-0064_AGTGTTGTCCGAGCTG 1
[1718 rows x 4 columns]
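If you'd rather work with an AnnData object again, the store can also be exported back, for example with tiledbsoma.io.to_anndata; a minimal sketch that pulls the full experiment (for large stores you would typically slice with an axis query first):

with soma_artifact.open() as soma_store:
    adata_concat = tiledbsoma.io.to_anndata(soma_store, measurement_name="RNA")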
Update the array store¶
Calculate PCA from the queried X.
pca_array = sc.pp.pca(X, n_comps=2)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/anndata/_core/anndata.py:430: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
warnings.warn(
soma_artifact
Artifact(uid='7ucpUuNCF2gC9uLh8P4w', description='My scRNA-seq SOMA Experiment', key='scrna.tiledbsoma', suffix='.tiledbsoma', size=15054524, hash='lJc_py5PTIGZzkGuzTHi8w', _hash_type='md5-d', n_objects=143, visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=6, run_id=6, updated_at='2024-08-06 18:32:32 UTC')
Open the array store in write mode and add PCA. When the store is updated, the corresponding artifact also gets updated with a new version.
with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array,
    )
Show code cell output
❗ The hash of the tiledbsoma store has changed, creating a new version of the artifact.
❗ artifact version 2 will _update_ the state of folder /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/scrna.tiledbsoma - to _retain_ the old state by duplicating the entire folder, do _not_ pass `is_new_version_of`
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
return _abc_instancecheck(cls, instance)
Note that the artifact has changed and now carries a new version:
soma_artifact
Artifact(uid='7ucpUuNCF2gC9uLhMRxz', version='2', description='My scRNA-seq SOMA Experiment', key='scrna.tiledbsoma', suffix='.tiledbsoma', size=15074968, hash='E-NTO5u9aSorktMBIEHGCQ', _hash_type='md5-d', n_objects=152, visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=6, run_id=6, updated_at='2024-08-06 18:32:34 UTC')
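To list both versions of the array store artifact, we can query its version family (a minimal sketch):

soma_artifact.versions.df()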