Search & lookup terms
Entities and ontologies can be complex, each with many different identifiers.
Here we show Bionty’s lookup model for species, genes, proteins and cell markers. You’ll see how to:
- access the reference table via .df()
- look up an entity term with autocompletion via .lookup()
- search for an entity term by free text via .search()
import bionty as bt
.fields: fields of an ontology reference
gene_bt = bt.Gene()
gene_bt
Gene
Species: human
Source: ensembl, release-110
#terms: 77043
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
✅ Gene.validate(): strictly validate values
🧐 Gene.inspect(): full inspection of values
👽 Gene.map_synonyms(): map synonyms to standardized names
🪜 Gene.diff(): difference between two versions
🔗 Gene.ontology: Pronto.Ontology object
gene_bt.fields
{'biotype',
'description',
'ensembl_gene_id',
'ncbi_gene_id',
'symbol',
'synonyms'}
Fields can be accessed as attributes, which enables autocompletion. You can pass these attributes to the field parameter of any Bionty function in place of strings:
gene_bt.ncbi_gene_id
ncbi_gene_id
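
The idea of exposing field names as attributes can be sketched in plain Python. This is only an illustration of the pattern, not Bionty's implementation: the `FieldNames` class and the `pick` helper are hypothetical names introduced here, and the record values are copied from the TCF7 entry on this page.

```python
class FieldNames:
    """Hypothetical helper: each field name becomes an attribute holding that name."""

    def __init__(self, names):
        for name in names:
            setattr(self, name, name)


def pick(record: dict, field: str):
    """Hypothetical function that accepts a field name as a plain string."""
    return record[field]


fields = FieldNames(["symbol", "ncbi_gene_id", "ensembl_gene_id"])
record = {
    "symbol": "TCF7",
    "ncbi_gene_id": "6932",
    "ensembl_gene_id": "ENSG00000081059",
}

# The attribute evaluates to the same string, so both calls are equivalent.
via_attribute = pick(record, fields.ncbi_gene_id)
via_string = pick(record, "ncbi_gene_id")
print(via_attribute, via_string)
```

Because the attribute is just the field name, a typo is caught as an AttributeError at the call site instead of surfacing later as a missing key.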
.df(): reference table
Data scientists love DataFrames, and every entity has a reference table containing all the fields.
df = gene_bt.df()
df.head()
| | ensembl_gene_id | symbol | ncbi_gene_id | biotype | description | synonyms |
---|---|---|---|---|---|---|
0 | ENSG00000000003 | TSPAN6 | 7105 | protein_coding | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | TM4SF6|T245|TSPAN-6 |
1 | ENSG00000000005 | TNMD | 64102 | protein_coding | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | TEM|MYODULIN|CHM1L|TENDIN|BRICD4 |
2 | ENSG00000000419 | DPM1 | 8813 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... | CDGIE|MPDS |
3 | ENSG00000000457 | SCYL3 | 57147 | protein_coding | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | PACE-1|PACE1 |
4 | ENSG00000000460 | C1orf112 | 55732 | protein_coding | chromosome 1 open reading frame 112 [Source:HG... | FLJ10706|APOLO1|FLIP |
To access the records for several gene symbols at once, select the corresponding rows with pandas. A symbol can map to more than one row, as LMNA and BRCA1 do here:
df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
symbol | ensembl_gene_id | ncbi_gene_id | biotype | description | synonyms |
---|---|---|---|---|---|
LMNA | ENSG00000160789 | 4000 | protein_coding | lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] | CMD1A|LGMD1B|LMNL1|MADA|LMN1|PRO1|HGPS |
LMNA | LRG_254 | None | LRG_gene | lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] | CMD1A|LGMD1B|LMNL1|MADA|LMN1|PRO1|HGPS |
TCF7 | ENSG00000081059 | 6932 | protein_coding | transcription factor 7 [Source:HGNC Symbol;Acc... | TCF-1 |
BRCA1 | ENSG00000012048 | 672 | protein_coding | BRCA1 DNA repair associated [Source:HGNC Symbo... | BRCC1|FANCS|RNF53|PPP1R53 |
BRCA1 | LRG_292 | None | LRG_gene | BRCA1 DNA repair associated [Source:HGNC Symbo... | BRCC1|FANCS|RNF53|PPP1R53 |
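
Since a symbol such as LMNA or BRCA1 returns both an Ensembl row and an LRG row, it can help to check for duplicates and filter by biotype before relying on a one-to-one mapping. The following is a minimal sketch on a hand-built frame that copies a few columns from the rows shown above; it does not call Bionty, and the choice to drop the LRG_gene rows is an assumption about what a downstream analysis wants.

```python
import pandas as pd

# A few rows copied from the reference table output above.
df = pd.DataFrame(
    {
        "ensembl_gene_id": [
            "ENSG00000160789", "LRG_254", "ENSG00000081059",
            "ENSG00000012048", "LRG_292",
        ],
        "symbol": ["LMNA", "LMNA", "TCF7", "BRCA1", "BRCA1"],
        "ncbi_gene_id": ["4000", None, "6932", "672", None],
        "biotype": [
            "protein_coding", "LRG_gene", "protein_coding",
            "protein_coding", "LRG_gene",
        ],
    }
)

by_symbol = df.set_index("symbol")

# A symbol that occurs more than once yields one row per occurrence.
rows_per_symbol = by_symbol.index.value_counts()

# Keep only the Ensembl gene records by dropping the LRG entries.
ensembl_only = by_symbol[by_symbol["biotype"] != "LRG_gene"]
symbol_to_id = ensembl_only.loc[["LMNA", "TCF7", "BRCA1"], "ensembl_gene_id"].to_dict()
print(symbol_to_id)
```

After filtering, each of the three symbols maps to a single Ensembl ID in this small sample; on the full table it is worth re-checking `index.is_unique` rather than assuming it.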
.lookup(): look up terms and records with autocompletion
Terms can be looked up with autocompletion using a lookup object.
lookup = gene_bt.lookup()
The dot accessor works with normalized terms (lower case, containing only alphanumeric characters and underscores):
lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
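
To make the normalization rule concrete, here is a rough sketch of how a term could be turned into an attribute-friendly key. The function `normalize_key` is hypothetical and is not Bionty's code; in particular, replacing other characters with underscores is an assumption, and the `bt_` prefix for keys that start with a digit is modelled on the `bt_100126572` key that appears further down this page.

```python
import re


def normalize_key(term: str, digit_prefix: str = "bt_") -> str:
    """Hypothetical normalization: lower case, alphanumerics and underscores only."""
    # Replace every run of other characters with a single underscore (assumed rule).
    key = re.sub(r"[^0-9a-z_]+", "_", term.lower())
    # A Python attribute name cannot start with a digit, so add a prefix.
    if key and key[0].isdigit():
        key = digit_prefix + key
    return key


for term in ["TCF7", "TSPAN-6", "100126572"]:
    print(term, "->", normalize_key(term))
```

Every key produced this way is a valid Python identifier, which is what makes the dot accessor and tab completion possible.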
To look up the exact original strings, convert the lookup object to a dict and use the bracket accessor [], which also supports autocompletion:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
By default, the name field is used to generate lookup keys (for genes, as in the examples above, this is the symbol).
You can specify another field to look up:
lookup = gene_bt.lookup(gene_bt.ncbi_gene_id)
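If multiple entries match a key, they are returned as a list; a key with a single match, as below, returns the record itself. Note the bt_ prefix in the dot accessor: a Python attribute name cannot start with a digit, whereas the dict key keeps the original string.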
lookup.bt_100126572
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
lookup_dict = lookup.dict()
lookup_dict["100126572"]
Gene(ensembl_gene_id='ENSG00000203733', symbol='GJE1', ncbi_gene_id='100126572', biotype='protein_coding', description='gap junction protein epsilon 1 [Source:HGNC Symbol;Acc:HGNC:33251]', synonyms='CX23')
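
The "single record or list" behaviour described above can be sketched with a small dictionary builder. This is an illustration under stated assumptions rather than Bionty's implementation: `GeneRecord` and `build_lookup` are hypothetical names, the records are copied from the tables on this page, and skipping records whose key is missing is a choice made for this sketch.

```python
from collections import namedtuple

# Hypothetical record type holding a few of the fields shown on this page.
GeneRecord = namedtuple("GeneRecord", ["ensembl_gene_id", "symbol", "ncbi_gene_id"])

records = [
    GeneRecord("ENSG00000160789", "LMNA", "4000"),
    GeneRecord("LRG_254", "LMNA", None),
    GeneRecord("ENSG00000081059", "TCF7", "6932"),
]


def build_lookup(records, field):
    """Group records by the chosen field; keep a list only when a key repeats."""
    grouped = {}
    for record in records:
        key = getattr(record, field)
        if key is None:  # records without a value for this field get no key
            continue
        grouped.setdefault(key, []).append(record)
    return {k: (v[0] if len(v) == 1 else v) for k, v in grouped.items()}


by_symbol = build_lookup(records, "symbol")
by_ncbi = build_lookup(records, "ncbi_gene_id")
print(by_symbol["LMNA"])  # two records, returned as a list
print(by_ncbi["6932"])    # one record, returned directly
```

Code that consumes such a lookup needs to handle both shapes, for example by checking `isinstance(value, list)` before accessing record attributes.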
.search(): search a term against a field
celltype_bt = bt.CellType()
Matching scores are stored in the __ratio__ column:
celltype_bt.search("cytotoxic T cells").head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
cytotoxic T cell | CL:0000910 | A Mature T Cell That Differentiated And Acquir... | cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... | [CL:0000911] | 96.969697 |
obsolete cytotoxic T cell | CL:0000491 | Obsolete: A Cell Responsible For Spontaneous C... | None | [] | 76.190476 |
Tc2 cell | CL:0000918 | A Cd8-Positive, Alpha-Beta Positive T Cell Exp... | Th2 non-TFH CD8-positive T cell|Th2 CD8-positi... | [CL:0000908] | 76.190476 |
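
To get a feel for what a similarity score on a 0 to 100 scale means, the sketch below ranks a few names with the standard library's `difflib.SequenceMatcher`. This is a stand-in chosen for illustration and is not necessarily the scorer Bionty uses; the helper name `ratio` is an assumption. It only compares the query with the names, so its scores do not have to agree with every row of the table above.

```python
from difflib import SequenceMatcher


def ratio(query: str, candidate: str) -> float:
    """Similarity on a 0-100 scale: 2 * matched characters / total characters."""
    return 100.0 * SequenceMatcher(None, query, candidate).ratio()


query = "cytotoxic T cells"
names = ["cytotoxic T cell", "obsolete cytotoxic T cell", "nodal myocyte"]

# Sort candidates from best to worst match.
ranked = sorted(((ratio(query, name), name) for name in names), reverse=True)
for score, name in ranked:
    print(f"{score:.2f}  {name}")
```

For the pair "cytotoxic T cells" and "cytotoxic T cell", 16 of the 33 characters in total match on each side, which gives 2 × 16 / 33 ≈ 96.97.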
By default, search also matches against each of the synonyms:
celltype_bt.search("P cell").head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | myocytus nodalis|P cell|cardiac pacemaker cell | [CL:0002086] | 100.000000 |
double-positive, alpha-beta thymocyte | CL:0000809 | A Thymocyte Expressing The Alpha-Beta T Cell R... | DP cell|DP thymocyte|double-positive, alpha-be... | [CL:0000790] | 92.307692 |
PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0000167, CL:0000164] | 92.307692 |
You can turn off synonym matching with synonyms_field=None:
celltype_bt.search("P cell", synonyms_field=None).head(3)
name | ontology_id | definition | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0000167, CL:0000164] | 92.307692 |
cap cell | CL:0000676 | None | None | [CL:0000378, CL:0000548] | 85.714286 |
GIP cell | CL:0002278 | An Enteroendocrine Cell Of Duodenum And Jejunu... | type K enteroendocrine cell | [CL:0000167, CL:0000164] | 85.714286 |
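
The effect of synonym matching can be approximated by scoring the query against the name and against each pipe-separated synonym, then keeping the best score. The sketch below assumes that "best of name and synonyms" rule and again uses `difflib.SequenceMatcher` as a stand-in scorer; the `score` helper and its `use_synonyms` flag are hypothetical and only mirror the idea behind synonyms_field=None.

```python
from difflib import SequenceMatcher


def ratio(query: str, candidate: str) -> float:
    """Similarity on a 0-100 scale (stand-in scorer)."""
    return 100.0 * SequenceMatcher(None, query, candidate).ratio()


def score(query: str, name: str, synonyms, use_synonyms: bool = True) -> float:
    """Best score over the name and, optionally, each '|'-separated synonym."""
    candidates = [name]
    if use_synonyms and synonyms:
        candidates += synonyms.split("|")
    return max(ratio(query, candidate) for candidate in candidates)


# Name and synonyms copied from the "nodal myocyte" row above.
name = "nodal myocyte"
synonyms = "myocytus nodalis|P cell|cardiac pacemaker cell"

with_synonyms = score("P cell", name, synonyms)
without_synonyms = score("P cell", name, synonyms, use_synonyms=False)
print(with_synonyms, without_synonyms)
```

With synonyms enabled, the exact synonym "P cell" gives a perfect score; with synonyms disabled, only the name "nodal myocyte" is compared, so the entry drops out of the top results, as in the two tables above.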
Match against another field (default is “name”):
celltype_bt.search("CD8 postive alpha beta T cells", field=celltype_bt.definition).head(
3
)
definition | ontology_id | name | synonyms | parents | __ratio__ |
---|---|---|---|---|---|
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. | CL:0000625 | CD8-positive, alpha-beta T cell | CD8-positive, alpha-beta T lymphocyte|CD8-posi... | [CL:0000791] | 95.081967 |
A Mature Alpha-Beta T Cell That Expresses An Alpha-Beta T Cell Receptor And The Cd4 Coreceptor. | CL:0000624 | CD4-positive, alpha-beta T cell | CD4-positive, alpha-beta T lymphocyte|CD4-posi... | [CL:0000791] | 91.803279 |
A Cd8-Positive, Alpha-Beta T Cell That Has Differentiated Into A Memory T Cell. | CL:0000909 | CD8-positive, alpha-beta memory T cell | CD8-positive, alpha-beta memory T lymphocyte|C... | [] | 85.294118 |