Where Computer Science, Linguistics and Biology Meet: Using Lexical Chaining to Analyze Biomedical Text
Chrysanne DiMarco, PhD
David R. Cheriton School of Computer Science
University of Waterloo
May 9, 2007
12:00 PM - 1:30 PM
Davis Centre 1304, University of Waterloo
View Video of Presentation in HI Alive Archive: Research Seminars Archive 2006-2007
Biomedical information extraction is becoming an increasingly important focus in Computational Linguistics research. To perform more semantics-based information extraction, we require specialized domain models, but creating such models can be very difficult and time-consuming. We have developed a hybrid methodology for constructing a domain-specific ontology, "PPIWordNet", which integrates key concepts about protein-protein interactions with the Gene Ontology.
In addition, we present a method for using our PPIWordNet ontology in discourse-based information extraction to analyze full-text articles on protein interactions. Our discourse-analysis approach uses "lexical chaining" to extract sequences of semantically related words that represent the topic structure of the text. We show that the domain-specific PPIWordNet ontology significantly improves the performance of the lexical-chaining analysis. Moreover, the topic structure represented by the lexical chains contains important information about protein interactions that, we propose, may be useful in evaluating the biological validity of these interactions.
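As a rough illustration of the general idea behind lexical chaining, the sketch below groups the words of a sentence into chains whenever they share a concept in a small lookup table. It is only a minimal sketch under stated assumptions: the table `TOY_CONCEPTS`, the functions `related` and `build_chains`, and the greedy first-match strategy are all hypothetical and invented for this example. They are not taken from PPIWordNet, the Gene Ontology, or the method presented in the talk.

```python
from typing import Dict, List, Set

# Hypothetical toy "ontology": each word maps to a set of concept labels.
# These entries are illustrative only and are NOT taken from PPIWordNet.
TOY_CONCEPTS: Dict[str, Set[str]] = {
    "protein": {"molecule"},
    "kinase": {"molecule", "enzyme"},
    "enzyme": {"molecule", "enzyme"},
    "binds": {"interaction"},
    "interaction": {"interaction"},
    "yeast": {"organism"},
    "cell": {"organism"},
}


def related(a: str, b: str) -> bool:
    """Assumed relatedness test: same word, or at least one shared concept."""
    if a == b:
        return True
    return bool(TOY_CONCEPTS.get(a, set()) & TOY_CONCEPTS.get(b, set()))


def build_chains(tokens: List[str]) -> List[List[str]]:
    """Greedily attach each known word to the first chain it is related to."""
    chains: List[List[str]] = []
    for tok in tokens:
        if tok not in TOY_CONCEPTS:
            continue  # words outside the toy ontology are ignored
        for chain in chains:
            if any(related(tok, member) for member in chain):
                chain.append(tok)
                break
        else:
            # No existing chain is related to this word: start a new one.
            chains.append([tok])
    return chains


if __name__ == "__main__":
    sentence = "the kinase binds the protein in yeast cell"
    for chain in build_chains(sentence.split()):
        print(chain)
```

In this toy setting, the quality of the chains depends entirely on which words and concepts the lookup table covers, which gives some intuition for why a domain-specific ontology could matter for chaining over specialized text.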
This talk describes collaborative work with Xiaofen He, Shady Hassan Shehata, Gabriel Musso, and Zhou Yu.
About the Speaker
Chrysanne DiMarco has been a member of the Artificial Intelligence Group in the David R. Cheriton School of Computer Science at the University of Waterloo since graduating from the University of Toronto in 1990. Her research interests include computational models of natural language pragmatics, computational linguistics, health informatics, and biomedical information extraction. Professor DiMarco is the project leader of the HealthDoc Project (1994-present), which has been developing Web-based natural language generation systems for producing health information tailored to the medical condition and personal characteristics of an individual. Professor DiMarco is also the President of Inkpot Software Inc., the spin-off company from the HealthDoc Project.