
CellTypist

Cell types classify cells based on public and private knowledge from studying transcription, morphology, function, and other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined, and better understood.

In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.

In the next notebook, Standardize metadata on-the-fly, we’ll demonstrate how to curate a dataset analyzed with CellTypist and track it with LaminDB.

# pip install 'lamindb[jupyter,bionty]'
!lamin load use-cases-registries
💡 connected lamindb: testuser1/use-cases-registries
# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import bionty as bt
💡 connected lamindb: testuser1/use-cases-registries

Access CellTypist records

As a first step, we read in CellTypist’s immune cell encyclopedia:

import pandas as pd
description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

celltypist_df = pd.read_excel(celltypist_source_v2_url)

The table provides a Cell Ontology ID from the public Cell Ontology for the majority of records.

celltypist_df.head()
   High-hierarchy cell types  Low-hierarchy cell types               Description                                        Cell Ontology ID  Curated markers
0  B cells                    B cells                                B lymphocytes with diverse cell surface immuno...  CL:0000236        CD79A, MS4A1, CD19
1  B cells                    Follicular B cells                     resting mature B lymphocytes found in the prim...  CL:0000843        CXCR5, TNFRSF13B, CD22
2  B cells                    Proliferative germinal center B cells  proliferating germinal center B cells              CL:0000844        MKI67, SUGCT, AICDA
3  B cells                    Germinal center B cells                proliferating mature B cells that undergo soma...  CL:0000844        POU2AF1, CD40, SUGCT
4  B cells                    Memory B cells                         long-lived mature B lymphocytes which are form...  CL:0000787        CR2, CD27, MS4A1

A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
                                                         High-hierarchy cell types  Description                                        Curated markers
Cell Ontology ID  Low-hierarchy cell types
CL:0000236        B cells                                B cells                    B lymphocytes with diverse cell surface immuno...  CD79A, MS4A1, CD19
CL:0000843        Follicular B cells                     B cells                    resting mature B lymphocytes found in the prim...  CXCR5, TNFRSF13B, CD22
CL:0000844        Proliferative germinal center B cells  B cells                    proliferating germinal center B cells              MKI67, SUGCT, AICDA
                  Germinal center B cells                B cells                    proliferating mature B cells that undergo soma...  POU2AF1, CD40, SUGCT
CL:0000787        Memory B cells                         B cells                    long-lived mature B lymphocytes which are form...  CR2, CD27, MS4A1
                  Age-associated B cells                 B cells                    CD11c+ T-bet+ memory B cells associated with a...  FCRL2, ITGAX, TBX21
CL:0000788        Naive B cells                          B cells                    mature B lymphocytes which express cell-surfac...  IGHM, IGHD, TCL1A
CL:0000818        Transitional B cells                   B cells                    immature B cell precursors in the bone marrow ...  CD24, MYO1C, MS4A1
CL:0000817        Large pre-B cells                      B-cell lineage             proliferative B lymphocyte precursors derived ...  MME, CD24, MKI67
                  Small pre-B cells                      B-cell lineage             non-proliferative B lymphocyte precursors deri...  MME, CD24, IGLL5
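
To list which ontology IDs are shared by several CellTypist labels, a simple grouping step is enough. The sketch below uses only the Python standard library on a hand-copied subset of the rows shown above rather than the full celltypist_df; the variable names (rows, labels_by_id, shared) are illustrative and not part of the notebook.

```python
from collections import defaultdict

# Hand-copied subset of (Cell Ontology ID, low-hierarchy label) pairs from the table above
rows = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000817", "Large pre-B cells"),
    ("CL:0000817", "Small pre-B cells"),
]

# Group the CellTypist labels by their ontology ID
labels_by_id = defaultdict(list)
for ontology_id, label in rows:
    labels_by_id[ontology_id].append(label)

# Keep only the IDs that map to more than one label
shared = {oid: labels for oid, labels in labels_by_id.items() if len(labels) > 1}
for oid, labels in shared.items():
    print(oid, "->", labels)
```

IDs that appear in shared are the ones where CellTypist distinguishes cell types more finely than the Cell Ontology term it points to.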

Validate CellTypist records

For any cell type record that can be validated against the public Cell Ontology, we’d like to ensure that it’s actually validated.

This prevents us from referring to the same cell type with different identifiers.

We need a Bionty object for this:

bionty = bt.CellType.public()
bionty
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2024-02-13
#terms: 2918

We can now validate the "Cell Ontology ID" column:

bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don’t validate:

bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
97 terms (99.00%) are not validated for name: B cells, Follicular B cells, Proliferative germinal center B cells, Germinal center B cells, Memory B cells, Age-associated B cells, Naive B cells, Transitional B cells, Large pre-B cells, Small pre-B cells, Pre-pro-B cells, Pro-B cells, Cycling B cells, Cycling DCs, Cycling gamma-delta T cells, Cycling monocytes, Cycling NK cells, Cycling T cells, DC, DC1, ...
   detected 6 terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()

A search shows that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
ontology_id definition synonyms parents __agg__ __ratio__
name
B cell CL:0000236 A Lymphocyte Of B Lineage That Is Capable Of B... B-cell|B-lymphocyte|B lymphocyte [CL:0000945] b cell 92.307692
B-2 B cell CL:0000822 A Conventional B Cell Subject To Antigenic Sti... B2 B-cell|B-2 B-cell|B2 B cell|B2 B-lymphocyte... [CL:0000785] b-2 b cell 85.714286
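
To build intuition for why “B cells” ranks “B cell” first, here is a minimal sketch of case-insensitive fuzzy matching using difflib from the Python standard library. It is an illustration only: the scoring that bionty.search() uses internally may be computed differently, and the candidate list and the function name similarity are assumptions made for this example.

```python
from difflib import SequenceMatcher

# A few candidate names taken from search outputs in this notebook
candidates = ["B cell", "B-2 B cell", "T cell"]


def similarity(query: str, name: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, query.lower(), name.lower()).ratio()


query = "B cells"
scores = {name: similarity(query, name) for name in candidates}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {100 * score:.2f}")
```

The singular form differs from the query by one character, so it scores highest among these candidates.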

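Let’s strip the trailing "s" and check whether more names validate. A few more do: the share of non-validated names drops from 99.00% to 94.90%, and 31 terms are now detected as differing only in casing or as matching synonyms.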

bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
93 terms (94.90%) are not validated for name: Follicular B cell, Proliferative germinal center B cell, Germinal center B cell, Memory B cell, Age-associated B cell, Naive B cell, Transitional B cell, Large pre-B cell, Small pre-B cell, Pre-pro-B cell, Pro-B cell, Cycling B cell, Cycling DC, Cycling gamma-delta T cell, Cycling monocyte, Cycling NK cell, Cycling T cell, DC, DC1, DC2, ...
   detected 31 terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, DC1, DC2, Endothelial cell, Epithelial cell, Erythrocyte, ETP, Fibroblast, Granulocyte, Neutrophil, ILC2, ILC3, NK cell, Alveolar macrophage, ...
→  standardize terms via .standardize()
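
One caveat about the stripping step: str.rstrip("s") removes every trailing "s", not just one. For labels such as “B cells” this makes no difference, but a word ending in “ss” would be over-trimmed. A minimal sketch of a stricter helper, assuming Python 3.9+ for str.removesuffix; the function name strip_plural_s and the word "class" are made up for illustration.

```python
def strip_plural_s(label: str) -> str:
    """Remove exactly one trailing 's', if present (requires Python 3.9+)."""
    return label.removesuffix("s")


labels = ["B cells", "Memory B cells", "Cycling DCs"]
singular = [strip_plural_s(label) for label in labels]
print(singular)

# rstrip("s") strips all trailing "s" characters, removesuffix("s") strips only one
print("class".rstrip("s"))
print(strip_plural_s("class"))
```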

Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore don’t have an ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}

Register CellTypist records

Let’s first add the “High-hierarchy cell types” as a column "parent".

This lets LaminDB populate the parents and children fields, so that you can query hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

# add the standardized name for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
celltypist_df.head(2)
   ct_name             description                                        ontology_id  parent   name
0  B cells             B lymphocytes with diverse cell surface immuno...  CL:0000236   None     B cell
1  Follicular B cells  resting mature B lymphocytes found in the prim...  CL:0000843   B cells  follicular B cell
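
The reshaping above can be summarized without pandas. The sketch below is a simplified stand-in that works on a few hand-copied rows, with the standardized names taken from outputs shown in this notebook; the names rows, public_names, and records are illustrative only.

```python
# Hand-copied rows: (high-hierarchy label, low-hierarchy label, Cell Ontology ID)
rows = [
    ("B cells", "B cells", "CL:0000236"),
    ("B cells", "Follicular B cells", "CL:0000843"),
    ("B-cell lineage", "Large pre-B cells", "CL:0000817"),
    ("B-cell lineage", "Small pre-B cells", "CL:0000817"),
]

# Standardized Cell Ontology names for these IDs, as they appear in this notebook's outputs
public_names = {
    "CL:0000236": "B cell",
    "CL:0000843": "follicular B cell",
    "CL:0000817": "precursor B cell",
}

records = []
for high, low, ontology_id in rows:
    records.append(
        {
            "ct_name": low,
            "ontology_id": ontology_id,
            # a label that is its own high-hierarchy term gets no parent
            "parent": None if high == low else high,
            # several CellTypist labels can share one standardized name
            "name": public_names[ontology_id],
        }
    )

for record in records:
    print(record)
```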

Now, let’s create records from the public ontology:

public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)

Let’s now amend the public ontology records so that they also carry the CellTypist labels as synonyms.

public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except ValueError:
        pass
❌ input synonyms ['DC2'] already associated with the following records:
created_at created_by_id run_id updated_at id uid name ontology_id abbr synonyms description source_id
0 2024-08-06 18:30:47.274635+00:00 1 None 2024-08-06 18:30:47.274646+00:00 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None pDC|type 2 DC|plasmacytoid T cell|T-associated... A Dendritic Cell Type Of Distinct Morphology, ... 31
❌ input synonyms ['ILC2'] already associated with the following records:
created_at created_by_id run_id updated_at id uid name ontology_id abbr synonyms description source_id
0 2024-08-06 18:30:47.275296+00:00 1 None 2024-08-06 18:30:47.275307+00:00 114 4ny4oBnr group 2 innate lymphoid cell CL:0001069 None natural helper cell|ILC2|nuocyte An Innate Lymphoid Cell That Is Capable Of Pro... 31
❌ input synonyms ['ILC3'] already associated with the following records:
created_at created_by_id run_id updated_at id uid name ontology_id abbr synonyms description source_id
0 2024-08-06 18:30:47.275326+00:00 1 None 2024-08-06 18:30:47.275337+00:00 115 3tILnbqv group 3 innate lymphoid cell CL:0001071 None ILC3 An Innate Lymphoid Cell That Constituitively E... 31
❌ input synonyms ['pDC'] already associated with the following records:
created_at created_by_id run_id updated_at id uid name ontology_id abbr synonyms description source_id
0 2024-08-06 18:30:47.274635+00:00 1 None 2024-08-06 18:30:47.274646+00:00 92 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None pDC|type 2 DC|plasmacytoid T cell|T-associated... A Dendritic Cell Type Of Distinct Morphology, ... 31
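
The errors above arise because a synonym can only belong to one record. The following toy sketch mimics that rule with plain dictionaries and collects the skipped synonyms instead of silently passing over them. It is not LaminDB's implementation: add_synonym_checked is a hypothetical helper, the choice of target record is an assumption, and "ILC2 cells" is a made-up label used only to show a successful addition.

```python
# Toy registry: record name -> set of synonyms (values copied from outputs in this notebook)
registry = {
    "group 2 innate lymphoid cell": {"natural helper cell", "ILC2", "nuocyte"},
    "group 3 innate lymphoid cell": {"ILC3"},
    "group 2 innate lymphoid cell, human": {"ILC2, human"},
}


def add_synonym_checked(registry, target, synonym):
    """Add `synonym` to `target` unless another record already owns it."""
    for name, synonyms in registry.items():
        if name != target and synonym in synonyms:
            raise ValueError(f"{synonym!r} already associated with {name!r}")
    registry[target].add(synonym)


target = "group 2 innate lymphoid cell, human"  # assumed target record
skipped = []
for synonym in ["ILC2", "ILC2 cells"]:
    try:
        add_synonym_checked(registry, target, synonym)
    except ValueError as err:
        # keep a note of what was skipped and why, for later review
        skipped.append((synonym, str(err)))

for synonym, reason in skipped:
    print(f"skipped {synonym}: {reason}")
```

Keeping a list of skipped synonyms makes it easy to review afterwards which CellTypist labels could not be attached to their intended record.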

Add parent-child relationships between the CellTypist records

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_nonval)
['B-cell lineage', 'Erythroid', 'T cells', 'Cycling cells']

Let’s get the top hits from a search:

for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))
Term: B-cell lineage
ontology_id definition synonyms parents __agg__ __ratio__
name
obsolete cell by lineage CL:0000220 None None [] obsolete cell by lineage 73.684211
obsolete cell line cell CL:0007014 Obsolete: A Cultured Cell That Has Been Passag... passaged cultured cell [] obsolete cell line cell 64.864865
Term: Erythroid
ontology_id definition synonyms parents __agg__ __ratio__
name
erythroid progenitor cell CL:0000038 A Progenitor Cell Committed To The Erythroid L... None [CL:0000764, CL:0000839] erythroid progenitor cell 90.0
megakaryocyte-erythroid progenitor cell CL:0000050 A Progenitor Cell Committed To The Megakaryocy... colony forming unit erythroid megakaryocyte|CF... [CL:0002032, CL:0011026, CL:0000763] megakaryocyte-erythroid progenitor cell 90.0
Term: T cells
ontology_id definition synonyms parents __agg__ __ratio__
name
T cell CL:0000084 A Type Of Lymphocyte Whose Defining Characteri... T-lymphocyte|T lymphocyte|T-cell [CL:0000542] t cell 92.307692
exhausted T cell CL:0011025 None An effector T cell that displays impaired effe... [CL:0000911] exhausted t cell 80.000000
Term: Cycling cells
ontology_id definition synonyms parents __agg__ __ratio__
name
circulating cell CL:0000080 A Cell Which Moves Among Different Tissues Of ... None [CL:0000000] circulating cell 75.862069
lining cell CL:0000213 A Cell Within An Epithelial Cell Sheet Whose M... boundary cell [CL:0000215] lining cell 75.000000

So we decide to:

  • Add “T cells” to the synonyms of the public “T cell” record

  • Add “Erythroid” to the synonyms of the public “erythroid lineage cell” record

  • Create the remaining 2 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_public(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_public(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
❗ records with similar names exist! did you mean to load one of them?
uid name ontology_id abbr synonyms description source_id run_id created_by_id updated_at
id
24 2KfvYuU7 erythroid lineage cell CL:0000764 None Mid erythroid|Early erythroid|Late erythroid|e... A Immature Or Mature Cell In The Lineage Leadi... 31 None 1 2024-08-06 18:30:48.138269+00:00
99 2liiRiq1 lymphoid lineage restricted progenitor cell CL:0000838 None None A Progenitor Cell Restricted To The Lymphoid L... 31 None 1 2024-08-06 18:30:47.274859+00:00
100 1PUkYUbI myeloid lineage restricted progenitor cell CL:0000839 None None A Progenitor Cell Restricted To The Myeloid Li... 31 None 1 2024-08-06 18:30:47.274890+00:00
105 7GpphKmr lymphocyte of B lineage CL:0000945 None None A Lymphocyte Of B Lineage With The Commitment ... 31 None 1 2024-08-06 18:30:47.275039+00:00
111 3yMnmkVh hematopoietic oligopotent progenitor cell, lin... CL:0001060 None None A Hematopoietic Oligopotent Progenitor Cell Th... 31 None 1 2024-08-06 18:30:47.275219+00:00
116 5c7URAC4 lymphocyte of B lineage, CD19-positive CL:0001200 None None A Lymphocyte Of B Lineage That Is Cd19-Positive. 31 None 1 2024-08-06 18:30:47.275371+00:00
120 5ECPCxc3 hematopoietic lineage restricted progenitor cell CL:0002031 None None A Hematopoietic Progenitor Cell That Is Capabl... 31 None 1 2024-08-06 18:30:47.275490+00:00
❗ `.from_public()` is deprecated, use `.from_source()`!'
❗ `.from_public()` is deprecated, use `.from_source()`!'
❗ records with similar names exist! did you mean to load one of them?
uid name ontology_id abbr synonyms description source_id run_id created_by_id updated_at
id
1 ryEtgi1y B cell CL:0000236 None Cycling B cells|B cells|B-lymphocyte|B-cell|B ... A Lymphocyte Of B Lineage That Is Capable Of B... 31 None 1 2024-08-06 18:30:47.722744+00:00
2 2EhFTUoZ follicular B cell CL:0000843 None Fo B cell|follicular B lymphocyte|follicular B... A Resting Mature B Cell That Has The Phenotype... 31 None 1 2024-08-06 18:30:47.517920+00:00
3 4IowPafD germinal center B cell CL:0000844 None GC B cell|GC B-lymphocyte|germinal center B-ce... A Rapidly Cycling Mature B Cell That Has Disti... 31 None 1 2024-08-06 18:30:47.557619+00:00
4 2cUPBtY8 memory B cell CL:0000787 None memory B lymphocyte|memory B-lymphocyte|Age-as... A Memory B Cell Is A Mature B Cell That Is Lon... 31 None 1 2024-08-06 18:30:47.594382+00:00
5 3jdCg7zi naive B cell CL:0000788 None Naive B cells|naive B lymphocyte|naive B-lymph... A Naive B Cell Is A Mature B Cell That Has The... 31 None 1 2024-08-06 18:30:47.612199+00:00
6 75GaqGOI transitional stage B cell CL:0000818 None transitional stage B-cell|Transitional B cells... An Immature B Cell Of An Intermediate Stage Be... 31 None 1 2024-08-06 18:30:47.630132+00:00
7 4rQVc1EA precursor B cell CL:0000817 None Large pre-B cells|Small pre-B cells A Precursor B Cell Is A B Cell With The Phenot... 31 None 1 2024-08-06 18:30:47.667510+00:00
8 2FWj3GpL early pro-B cell CL:0002046 None Pre-pro-B cells A Pro-B Cell That Is Cd22-Positive, Cd34-Posit... 31 None 1 2024-08-06 18:30:47.685474+00:00
9 1gKQ0rC2 pro-B cell CL:0000826 None progenitor B cell|progenitor B lymphocyte|prog... A Progenitor Cell Of The B Cell Lineage, With ... 31 None 1 2024-08-06 18:30:47.704035+00:00
10 5tVkOPwK dendritic cell, human CL:0001056 None Migratory DCs|Cycling DCs|DC3|DC A Dendritic Cell With The Phenotype Hla-Dra-Po... 31 None 1 2024-08-06 18:30:47.915471+00:00
11 1HuNn2EP gamma-delta T cell CL:0000798 None Cycling gamma-delta T cells|CRTAM+ gamma-delta... A T Cell That Expresses A Gamma-Delta T Cell R... 31 None 1 2024-08-06 18:30:49.141931+00:00
13 37mWPv6o natural killer cell CL:0000623 None NK cell|CD16+ NK cells|NK cells|Cycling NK cel... A Lymphocyte That Can Spontaneously Kill A Var... 31 None 1 2024-08-06 18:30:48.519057+00:00
14 22LvKd01 T cell CL:0000084 None CD8a/a|T lymphocyte|T-lymphocyte|T-cell|T cell... A Type Of Lymphocyte Whose Defining Characteri... 31 None 1 2024-08-06 18:30:49.646530+00:00
15 1uF1evnz conventional dendritic cell CL:0000990 None DC1|type 1 DC|cDC|dendritic reticular cell Conventional Dendritic Cell Is A Dendritic Cel... 31 None 1 2024-08-06 18:30:47.862024+00:00
16 1lOJ8BJ3 immature conventional dendritic cell CL:0000840 None Transitional DC An Immature Cell Of The Conventional Dendritic... 31 None 1 2024-08-06 18:30:47.933278+00:00
20 VEvqzCKG megakaryocyte progenitor cell CL:0000553 None megakaryocytic progenitor cell|megakaryoblast|... The Earliest Cytologically Identifiable Precur... 31 None 1 2024-08-06 18:30:48.773244+00:00
21 1J6s4gSi endothelial cell CL:0000115 None Endothelial cells|endotheliocyte An Endothelial Cell Comprises The Outermost La... 31 None 1 2024-08-06 18:30:48.036727+00:00
22 68LNvDH7 epithelial cell CL:0000066 None epitheliocyte|Epithelial cells A Cell That Is Usually Found In A Two-Dimensio... 31 None 1 2024-08-06 18:30:48.056330+00:00
24 2KfvYuU7 erythroid lineage cell CL:0000764 None Erythroid|Mid erythroid|Early erythroid|Late e... A Immature Or Mature Cell In The Lineage Leadi... 31 None 1 2024-08-06 18:30:49.629374+00:00
31 6xDuAGup granulocyte monocyte progenitor cell CL:0000557 None granulocyte/monocyte precursor|granulocyte-mac... A Hematopoietic Progenitor Cell That Is Commit... 31 None 1 2024-08-06 18:30:48.293451+00:00
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
bt.CellType(name="B-cell lineage").save()
💡 returning existing CellType record with same name: 'B-cell lineage'
CellType(uid='5gxL2SWr', name='B-cell lineage', created_by_id=1, updated_at='2024-08-06 18:30:49 UTC')

Now let’s add the parent records:

celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.filter(name=row["parent"]).one()
        record.parents.add(parent_record)
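
Once parent links exist, hierarchical questions reduce to walking those links. The sketch below illustrates the idea on a small dictionary that is consistent with records in this notebook; it is a plain-Python illustration, not how LaminDB stores or queries the relationships, and the names parents, children, and ancestors are chosen for this example.

```python
from collections import defaultdict

# child name -> parent names; a small subset consistent with the records in this notebook
parents = {
    "follicular B cell": ["B cell"],
    "memory B cell": ["B cell", "mature B cell"],
    "precursor B cell": ["B-cell lineage"],
}

# Invert the mapping to get the children of each parent
children = defaultdict(list)
for child, parent_names in parents.items():
    for parent in parent_names:
        children[parent].append(child)


def ancestors(name, parents):
    """Collect all ancestors of `name` by following parent links transitively."""
    found = set()
    stack = list(parents.get(name, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


print(sorted(children["B cell"]))
print(sorted(ancestors("memory B cell", parents)))
```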

Access the registry

The CellTypist vocabulary registered above is now available in LaminDB. To retrieve the full registry as a pandas DataFrame, we can use .df():

bt.CellType.df()
uid name ontology_id abbr synonyms description source_id run_id created_by_id updated_at
id
139 5gxL2SWr B-cell lineage None None None None NaN None 1 2024-08-06 18:30:49.683352+00:00
140 5jshKSVL Cycling cells None None None None NaN None 1 2024-08-06 18:30:49.662782+00:00
14 22LvKd01 T cell CL:0000084 None CD8a/a|T lymphocyte|T-lymphocyte|T-cell|T cell... A Type Of Lymphocyte Whose Defining Characteri... 31.0 None 1 2024-08-06 18:30:49.646530+00:00
24 2KfvYuU7 erythroid lineage cell CL:0000764 None Erythroid|Mid erythroid|Early erythroid|Late e... A Immature Or Mature Cell In The Lineage Leadi... 31.0 None 1 2024-08-06 18:30:49.629374+00:00
68 7j3YpGzu T-helper 17 cell CL:0000899 None Th17 T-lymphocyte|Th17 T cell|Th17 T lymphocyt... Cd4-Positive, Alpha-Beta T Cell With The Pheno... 31.0 None 1 2024-08-06 18:30:49.446783+00:00
... ... ... ... ... ... ... ... ... ... ...
71 6Sq9ZVSG professional antigen presenting cell CL:0000145 None None A Cell Capable Of Processing And Presenting Li... 31.0 None 1 2024-08-06 18:30:47.273998+00:00
70 4y4o4m6R blood cell CL:0000081 None None A Cell Found Predominately In The Blood. 31.0 None 1 2024-08-06 18:30:47.273964+00:00
69 4bKGljt0 cell CL:0000000 None None A Material Entity Of Anatomical Origin (Part O... 31.0 None 1 2024-08-06 18:30:47.273924+00:00
38 1aLpWgJc group 3 innate lymphoid cell, human CL:0001078 None ILC3, human A Group 3 Innate Lymphoid Cell In The Human Wi... 31.0 None 1 2024-08-06 18:30:46.869199+00:00
37 6NmzCwsn group 2 innate lymphoid cell, human CL:0001081 None ILC2, human A Group 2 Innate Lymphoid Cell In The Human Wi... 31.0 None 1 2024-08-06 18:30:46.869169+00:00

140 rows × 10 columns

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = bt.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='2cUPBtY8', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B lymphocyte|memory B-lymphocyte|Age-associated B cells|memory B-cell|Memory B cells', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', created_by_id=1, source_id=31, updated_at='2024-08-06 18:30:47 UTC')

See cell type hierarchy:

db_lookup.memory_b_cell.view_parents()
[Figure: graph of the parent hierarchy of the memory B cell record]

Access parents of a record:

db_lookup.memory_b_cell.parents.list()
[CellType(uid='ryEtgi1y', name='B cell', ontology_id='CL:0000236', synonyms='Cycling B cells|B cells|B-lymphocyte|B-cell|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', created_by_id=1, source_id=31, updated_at='2024-08-06 18:30:47 UTC'),
 CellType(uid='71xItrKo', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B lymphocyte|mature B-cell|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', created_by_id=1, source_id=31, updated_at='2024-08-06 18:30:47 UTC')]

Move on to the next registry: GO pathways