Machine learning for disambiguation of clinical trial scientist names

Student: N.N.

Supervisor: Fabio Rinaldi


To understand and plan clinical trials, it is important to identify and characterize the scientists who typically run them. However, information about clinical trial scientists is sparse and their names are ambiguous. Characterizing them further requires gathering their total scientific output, such as all the trials they have run in the past and the scientific articles they have published. This can be done with the aid of supervised machine learning: a model is trained on a gold standard corpus of pairs of scientific articles and clinical trials, and learns to judge whether both items in a pair belong to the same scientist.

A not-yet-published gold standard corpus based on MEDLINE abstracts and records is available for this project. The project would entail creating features from the MEDLINE abstracts and records, following the scientific state of the art, and then training and testing machine learning models to find those that achieve the best performance. Because the topic of this project is novel, there is a possibility of publication, either in a journal or at a conference, if it is carried out successfully.
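
As a rough illustration of what "creating features" for an article/trial pair could look like, the sketch below turns one pair into a small feature vector. Everything in it is an assumption made for illustration: the Record fields, the name format "Last, First", and the four features are hypothetical and do not describe the actual corpus or the features the project would end up using.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical minimal view of a MEDLINE article or a trial record."""
    person_name: str   # assumed format: "Last, First Middle"
    text: str          # e.g. title plus abstract, or trial description
    affiliation: str


def tokens(s: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def name_parts(name: str) -> tuple[str, str]:
    """Split 'Last, First Middle' into (last name, first initial)."""
    last, _, rest = name.partition(",")
    return last.strip().lower(), rest.strip()[:1].lower()


def pair_features(article: Record, trial: Record) -> dict[str, float]:
    """Illustrative features for one (article, trial) pair."""
    a_last, a_init = name_parts(article.person_name)
    t_last, t_init = name_parts(trial.person_name)
    return {
        "same_last_name": float(a_last == t_last),
        "same_first_initial": float(bool(a_init) and a_init == t_init),
        "text_jaccard": jaccard(tokens(article.text), tokens(trial.text)),
        "affiliation_jaccard": jaccard(
            tokens(article.affiliation), tokens(trial.affiliation)
        ),
    }


# Made-up example pair, not taken from any real corpus.
article = Record("Smith, John A", "Randomized trial of drug X in asthma",
                 "University of Zurich")
trial = Record("Smith, J", "Drug X in adult asthma patients",
               "Zurich University Hospital")
features = pair_features(article, trial)
```

Feature vectors of this kind, together with the same-scientist labels from the gold standard, would then be the input to whichever supervised classifier is trained and evaluated.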


The proposed project is based on a collaboration with a major
pharma company and involves occasional trips to Basel (which
will be fully reimbursed).