Abstract
A Study of Term Replacement in a Large Biomedical Corpus
We present an approach that combines a terminological resource with a large domain corpus to systematically study how domain terminology changes over time. Term replacement can be observed at the lexical level among synonymous terms and arises from competition between terms describing the same phenomenon. In the typical pattern, occurrences of one or more synonyms drop while the remaining synonyms in the same synonym set keep their level of occurrence or gain occurrences. To study term replacement over time, we extract synonym sets from the largest biomedical terminology resource, the UMLS Metathesaurus. We use the entire PubMed (http://www.ncbi.nlm.nih.gov/pubmed) dataset as a chronological reference corpus in which to study occurrences of the extracted synonyms. The PubMed dataset contains over 20 million documents (titles and, for part of the records, abstracts) published between 1881 and 2012. Term occurrences in PubMed are identified with the MetaMap (http://metamap.nlm.nih.gov/) term recognizer. We propose to capture term replacement by dividing the chronological reference corpus into time periods and using linear regression models to analyze the trend in occurrences of each synonym over time. The full presentation will include details of the approach for capturing term replacement and experimental results showing how pervasive term replacement is in the biomedical domain. Our experiments on the disease subsets of the UMLS terminology reveal that term replacement can be observed in a substantial part of the extracted synonym sets.
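The regression step can be illustrated with a minimal sketch. It assumes that occurrences have already been counted per time period for each synonym in a set, fits an ordinary least-squares line to each synonym's counts, and flags the set when at least one synonym declines while at least one other stays level or rises. The function names, the `threshold` parameter, the decision rule, and the toy counts are illustrative assumptions and are not taken from the study, which may also normalize counts or apply different criteria.

```python
from typing import Dict, List, Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the least-squares line fitted to the points (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("time periods must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def analyze_synonym_set(
    periods: Sequence[float],
    counts_by_term: Dict[str, Sequence[float]],
    threshold: float = 0.0,  # hypothetical cut-off on the slope
) -> dict:
    """Fit one trend line per synonym and apply an assumed replacement rule."""
    slopes = {t: ols_slope(periods, c) for t, c in counts_by_term.items()}
    declining: List[str] = [t for t, s in slopes.items() if s < -threshold]
    others: List[str] = [t for t, s in slopes.items() if s >= -threshold]
    return {
        "slopes": slopes,
        "declining": declining,
        "stable_or_rising": others,
        # Assumed rule: some synonym drops while another holds or gains.
        "replacement": bool(declining) and bool(others),
    }


if __name__ == "__main__":
    # Toy counts for four consecutive periods; not real PubMed figures.
    periods = [0, 1, 2, 3]
    toy_set = {"term_a": [40, 30, 20, 10], "term_b": [10, 20, 30, 40]}
    print(analyze_synonym_set(periods, toy_set))
```

In practice the per-period counts would probably be normalized by the number of documents in each period, since the volume of the corpus grows strongly over time; the sketch omits this step.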