Terminological resources for Text Mining over Biomedical Scientific Literature

Statistics

This section provides frequency information about the data (e.g. ambiguity, most frequent types).

Average (term) ambiguity = number_of_entries_in_the_db / number_of_different_terms.

Average synonymy (i.e. ID ambiguity) = number_of_entries_in_the_db / number_of_different_ids.

Average synonymy can be understood as the average synset size, where a synset is the collection of terms that can express a given ID. Each ID has its own synset (i.e. the synset can be named after its ID). If there is ambiguity, some synsets overlap.
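
The two averages can be sketched as follows. This is a minimal illustration, assuming the database can be read as a list of (term, ID) pairs with one pair per entry; the function name and the toy entries are hypothetical and not taken from the resource itself.

```python
from collections import defaultdict

def average_ambiguity_and_synonymy(entries):
    """entries: iterable of (term, id) pairs, one pair per database entry."""
    entries = list(entries)
    ids_by_term = defaultdict(set)  # term -> IDs the term can denote
    synsets = defaultdict(set)      # ID -> terms that express it (its synset)
    for term, ident in entries:
        ids_by_term[term].add(ident)
        synsets[ident].add(term)
    avg_ambiguity = len(entries) / len(ids_by_term)
    avg_synonymy = len(entries) / len(synsets)
    return avg_ambiguity, avg_synonymy, synsets

# Hypothetical toy entries: "CYTB" is ambiguous between two IDs.
toy_entries = [
    ("CYTB", "P1"), ("Cytochrome b", "P1"),
    ("CYTB", "P2"), ("COB", "P2"),
]
amb, syn, synsets = average_ambiguity_and_synonymy(toy_entries)
print(amb, syn)
print(synsets["P1"] & synsets["P2"])  # overlap caused by the ambiguous term
```

In the toy data the ambiguous term "CYTB" belongs to both synsets, which is the overlap described above.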

Terms

Different terms: 2347734

Average ambiguity of terms: 2.00633035940187

Frequency distribution of terms in the term list (top 15). This essentially shows the ambiguity (number of IDs) of the terms.
Frequency  Term  Description
6182       hypothetical protein
2891       MHC class I antigen
2241       Cytochrome b
1973       CYTB
1769       Cytochrome b-c1 complex subunit 3
1769       Complex III subunit 3
1769       Ubiquinol-cytochrome-c reductase complex cytochrome b subunit
1769       Complex III subunit III
1706       HLA-B
1623       COB
1620       MTCYB
1602       MT-CYB
1436       MHC class II antigen
1394       major histocompatibility complex, class I, B
1386       B7.2
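
A ranking like the one above can be produced by counting how often each term occurs in the term list. The sketch below assumes (term, ID) pairs as input; `top_terms` and the toy data are hypothetical names used only for illustration.

```python
from collections import Counter

def top_terms(entries, k=15):
    """Return the k terms that occur in the most entries, with their counts."""
    return Counter(term for term, _ in entries).most_common(k)

# Hypothetical toy entries
toy_entries = [("CYTB", "P1"), ("CYTB", "P2"), ("COB", "P2"), ("CYTB", "P3")]
print(top_terms(toy_entries, k=2))
```
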
Frequency distribution of terms by token count (i.e. how many terms consist of one word, two words, and so on). Only unique terms are considered, so the frequencies sum to the number of different terms (2347734).
Frequency  Token count  Description
1205278    1
372704     2
258818     3
191899     4
116389     5
76966      6
44508      7
29174      8
18545      9
11859      10
7734       11
4665       12
3057       13
1714       14
737        15
476        16
474        19
434        18
339        17
275        20
178        21
162        22
132        29
130        28
126        27
121        31
119        25
112        26
110        23
109        30
109        24
85         32
75         33
56         34
26         35
18         36
12         37
3          40
3          38
1          61
1          39
1          43
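
A token count distribution of this kind can be sketched as below. How the original statistics tokenize terms is not stated; splitting on whitespace is an assumption, and the function name and toy terms are hypothetical.

```python
from collections import Counter

def token_count_distribution(terms):
    """Map token count -> number of unique terms with that many tokens."""
    unique_terms = set(terms)  # duplicates are counted once
    # Assumption: tokens are whitespace-separated words.
    return Counter(len(term.split()) for term in unique_terms)

# Hypothetical toy term list with one duplicate
toy_terms = ["CYTB", "Cytochrome b", "MHC class I antigen", "CYTB", "HLA-B"]
dist = token_count_distribution(toy_terms)
for n_tokens, freq in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(freq, n_tokens)
```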

Type ambiguity of terms

Frequency distribution of most type-ambiguous terms (top 20).
Ambiguity  Term  Types
8          TPP   molecule:ChEBI, molecule:ChemIDplus, compound:CAS, human:UNIPROT, PROT, molecule:KEGG_COMPOUND, enzyme:EC, GEN
8          CTX   disease:UMLS, molecule:KEGG_DRUG, human:UNIPROT, drug:KEGG, drug:DrugBank, PROT, GEN, compound:CAS
8          MEA   disease:UMLS, human:UNIPROT, molecule:ChEMBL, molecule:ChemIDplus, PROT, drug:DrugBank, GEN, compound:CAS
8          CPZ   molecule:KEGG_DRUG, human:UNIPROT, drug:KEGG, molecule:ChemIDplus, drug:DrugBank, PROT, GEN, compound:CAS
8          PCP   compound:CAS, disease:UMLS, drug:KEGG, human:UNIPROT, PROT, molecule:KEGG_COMPOUND, GEN, enzyme:EC
8          CP    disease:UMLS, molecule:ChEBI, human:UNIPROT, drug:KEGG, drug:DrugBank, PROT, GEN, compound:CAS
8          PAH   molecule:ChEBI, molecule:DrugBank, compound:CAS, human:UNIPROT, PROT, molecule:SUBMITTER, enzyme:EC, GEN
7          ETA   molecule:ChEBI, human:UNIPROT, drug:DrugBank, PROT, molecule:KEGG_COMPOUND, GEN, compound:CAS
7          ADH   disease:UMLS, human:UNIPROT, PROT, GEN, molecule:KEGG_COMPOUND, enzyme:EC, compound:CAS
7          CD    disease:UMLS, molecule:DrugBank, MI, human:UNIPROT, PROT, GEN, compound:CAS
7          CDP   molecule:NIST_Chemistry_WebBook, molecule:UniProt, human:UNIPROT, PROT, drug:DrugBank, molecule:KEGG_COMPOUND, compound:CAS
7          TTP   disease:UMLS, molecule:ChEBI, human:UNIPROT, PROT, molecule:KEGG_COMPOUND, GEN, compound:CAS
7          AMP   molecule:UniProt, molecule:ChEBI, human:UNIPROT, PROT, GEN, molecule:KEGG_COMPOUND, compound:CAS
7          ATP   molecule:UniProt, molecule:ChEMBL, drug:KEGG, human:UNIPROT, PROT, molecule:KEGG_COMPOUND, compound:CAS
7          PAM   disease:UMLS, molecule:ChemIDplus, human:UNIPROT, PROT, GEN, enzyme:EC, compound:CAS
7          PGA   molecule:NIST_Chemistry_WebBook, disease:UMLS, molecule:ChEBI, human:UNIPROT, PROT, GEN, compound:CAS
7          HEP   molecule:NIST_Chemistry_WebBook, disease:UMLS, human:UNIPROT, PROT, enzyme:EC, GEN, compound:CAS
7          DNA   molecule:UniProt, MI, human:UNIPROT, PROT, molecule:KEGG_COMPOUND, compound:CAS, molecule:IUPAC
7          NA    molecule:ChEBI, molecule:ChEMBL, drug:KEGG, human:UNIPROT, PROT, GEN, compound:CAS
7          PA    molecule:ChEBI, drug:KEGG, human:UNIPROT, PROT, drug:DrugBank, GEN, compound:CAS
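
Type ambiguity, as tabulated above, is the number of distinct types under which a term is listed. A minimal sketch, assuming (term, type) pairs as input; the function name and toy data are hypothetical.

```python
from collections import defaultdict

def type_ambiguity(entries, k=20):
    """entries: iterable of (term, type) pairs.
    Returns the k most type-ambiguous terms as (ambiguity, term, types)."""
    types_by_term = defaultdict(set)
    for term, type_ in entries:
        types_by_term[term].add(type_)  # repeated types count once
    ranked = sorted(types_by_term.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(len(types), term, sorted(types)) for term, types in ranked[:k]]

# Hypothetical toy entries; "ATP" is listed twice under the same type.
toy_entries = [
    ("ATP", "PROT"), ("ATP", "compound:CAS"), ("ATP", "PROT"),
    ("HLA-B", "GEN"),
]
for ambiguity, term, types in type_ambiguity(toy_entries):
    print(ambiguity, term, ", ".join(types))
```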

Ambiguity of UniProtKB terms

        Unchanged         No whitespace     Alphanumeric      Lowercase         Alpha
ID_ORG  2.43562737417767  2.43811966525815  2.46853332590569  2.55692978139286  9.99613695360654
ID      1.05875795915954  1.05921690918326  1.06319099451482  1.06834813308531  4.13384102456848
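
The column names suggest that ambiguity was recomputed after progressively stronger term normalization. The source does not define these steps, so the sketch below is a guess: it assumes the normalizations are cumulative (each column also applies the previous ones) and that ambiguity is the number of distinct (normalized term, ID) pairs divided by the number of distinct normalized terms. All names and toy entries are hypothetical.

```python
import re

# Assumed, cumulative normalization steps named after the table columns.
NORMALIZERS = {
    "Unchanged": lambda t: t,
    "No whitespace": lambda t: re.sub(r"\s+", "", t),
    "Alphanumeric": lambda t: re.sub(r"[^A-Za-z0-9]", "", t),
    "Lowercase": lambda t: re.sub(r"[^A-Za-z0-9]", "", t).lower(),
    "Alpha": lambda t: re.sub(r"[^A-Za-z]", "", t).lower(),
}

def ambiguity_after(normalize, entries):
    """Average number of IDs per normalized term."""
    pairs = {(normalize(term), ident) for term, ident in entries}
    terms = {term for term, _ in pairs}
    return len(pairs) / len(terms)

# Hypothetical toy entries: spelling variants collapse step by step.
toy_entries = [
    ("MT-CYB", "P1"), ("MTCYB", "P2"), ("mt cyb", "P3"),
    ("CYB5", "P4"), ("CYB", "P5"),
]
results = {name: ambiguity_after(fn, toy_entries) for name, fn in NORMALIZERS.items()}
for name, value in results.items():
    print(name, value)
```

On the toy data the ambiguity grows with each step, and removing digits merges "CYB5" with "CYB". This is consistent with the sharp rise in the Alpha column above, although the toy data cannot confirm how the original figures were computed.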

IDs

Different IDs: 1462783

Average ambiguity of IDs: 3.220115355456
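
Since both averages are defined as the number of entries divided by a count, multiplying each average by its count should recover the same number of entries. The short check below uses only the figures reported in this section; the variable names are illustrative.

```python
different_terms = 2347734
avg_term_ambiguity = 2.00633035940187
different_ids = 1462783
avg_id_ambiguity = 3.220115355456

# entries = count * average, by the definitions of the two averages
entries_from_terms = different_terms * avg_term_ambiguity
entries_from_ids = different_ids * avg_id_ambiguity
print(round(entries_from_terms), round(entries_from_ids))  # both 4710330
```

Both products round to 4710330, so the term and ID statistics describe the same set of entries.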

Frequency distribution of IDs in the term list (top 15).
Frequency  ID  Description
5811       CLKB:HUMAN
1703       CLKB:MOUSE
746        compound:CAS:CAS:95422-24-5
701        compound:CAS:CAS:83665-54-7
639        compound:CAS:CAS:499-02-5
634        compound:CAS:CAS:4836-13-9
549        compound:CAS:CAS:103-90-2
542        compound:CAS:CAS:119459-68-6
528        CLKB:_CLKB
517        compound:CAS:CAS:13243-65-7
482        compound:CAS:CAS:2847-00-9
459        compound:CAS:CAS:69408-81-7
456        compound:CAS:CAS:100676-10-6
438        compound:CAS:CAS:122864-73-7
381        compound:CAS:CAS:63712-45-8

Types

Different types: 46

Frequency distribution of types in the term list.
Frequency  Type                             Description
1599655    PROT                             UniProtKB protein name
1005216    GEN                              UniProtKB gene name
820721     human:UNIPROT
624331     compound:CAS
424748     disease:UMLS
27430      enzyme:EC
25893      molecule:IUPAC
22866      symptom:UMLS
20900      drug:DrugBank
17829      molecule:ChEBI
17001      ocs                              NCBI common name, species or below
16341      molecule:ChemIDplus
12414      drug:KEGG
10765      molecule:ChEMBL
10687      molecule:KEGG_COMPOUND
8878       ogs2                             oss name, genus abbreviated (e.g. 'A. thaliana')
8877       oss                              NCBI scientific name, species or below
8727       CLKB                             CLKB cell line name
4469       molecule:NIST_Chemistry_WebBook
3917       molecule:UniProt
3316       oca                              NCBI common name, above species
2908       molecule:PDBeChem
2561       osa                              NCBI scientific name, above species
2270       molecule:Chemical_Ontology
2189       MI                               PSI-MI term
1061       molecule:DrugBank
1056       molecule:JCBN
1011       molecule:SUBMITTER
551        molecule:KEGG_DRUG
264        molecule:UM-BBD
258        molecule:MolBase
231        molecule:CBN
215        molecule:LIPID_MAPS
186        molecule:WHO_MedNet
146        molecule:IUBMB
121        molecule:Patent
96         molecule:KEGG_GLYCAN
79         molecule:IUPHAR
53         molecule:RESID
41         molecule:COMe
24         molecule:EMBL
14         molecule:PDB
7          ogs1                             NCBI selected genus name (e.g. 'Arabidopsis')
4          molecule:EuroFIR
2          molecule:Beilstein
1          molecule:EBI_Industry_Programme