Abstract

A Semi-Supervised Pattern-Learning Approach to Extract Pharmacogenomics-Specific Drug-Gene Pairs from Biomedical Literature

Rong Xu and Quanqiu Wang

Personalized medicine aims to deliver the right drug to the right patient at the right dose. Pharmacogenomics (PGx), the study of genetic variants that may affect drug response, is important for personalized medicine. Computational approaches to studying the relationships between genes and drug response are emerging as an active area of research for personalized medicine. Currently, the systematic study of drug-gene relationships is limited because a large-scale, machine-understandable drug-gene relationship knowledge base is difficult to build and to keep up to date. The scientific literature contains rich information on drug-gene relationships and is therefore the ultimate knowledge source for PGx studies and for personalized medicine. However, this information is largely buried in free text with limited machine understandability. There is a need to develop automatic approaches to extract structured drug-gene relationships from the biomedical literature. In this study, we present a semi-supervised approach to extracting drug-gene relationships from MEDLINE. The technique starts from a single seed pattern and iteratively learns the various ways in which the relationship may be expressed in 20 million MEDLINE abstracts. Our approach achieved high precision (0.961-1.00) in extracting drug-gene relationships from MEDLINE and found many drug-gene pairs that are not available in PharmGKB, a large-scale, manually curated PGx knowledge base.
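The iterative idea of starting from one seed pattern and learning further expressions of the same relationship can be illustrated with a minimal bootstrapping sketch. It is not the authors' implementation: the abstract does not state the seed pattern, the pattern representation, or how drugs and genes are recognized, so the lexicons `DRUGS` and `GENES`, the toy sentences, the seed string, and the function names below are all hypothetical. A pattern here is simply the text between a drug mention and a later gene mention in the same sentence.

```python
# Toy lexicons and corpus: hypothetical stand-ins, not data from the paper.
DRUGS = {"warfarin", "clopidogrel", "codeine"}
GENES = {"CYP2C9", "CYP2C19", "CYP2D6"}

SENTENCES = [
    "warfarin is metabolized by CYP2C9 in the liver",
    "clopidogrel is metabolized by CYP2C19",
    "clopidogrel is activated by CYP2C19 in most patients",
    "codeine is activated by CYP2D6",
    "codeine dose is unrelated to CYP2C9",
]

# Hypothetical seed pattern; the abstract does not say which one was used.
SEED_PATTERN = "is metabolized by"


def find_mentions(sentence):
    """Return (drug, middle_text, gene) triples where a drug precedes a gene."""
    tokens = sentence.split()
    triples = []
    for i, token in enumerate(tokens):
        if token not in DRUGS:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] in GENES:
                triples.append((token, " ".join(tokens[i + 1:j]), tokens[j]))
    return triples


def bootstrap(sentences, seed_pattern, max_iterations=10):
    """Alternate between extracting pairs and learning patterns until stable."""
    mentions = [triple for s in sentences for triple in find_mentions(s)]
    patterns = {seed_pattern}
    pairs = set()
    for _ in range(max_iterations):
        # Step 1: known patterns yield drug-gene pairs.
        new_pairs = {(d, g) for d, middle, g in mentions if middle in patterns}
        # Step 2: known pairs yield other ways of expressing the relationship.
        new_patterns = {middle for d, middle, g in mentions if (d, g) in new_pairs}
        if new_pairs <= pairs and new_patterns <= patterns:
            break  # nothing new was learned in this round
        pairs |= new_pairs
        patterns |= new_patterns
    return patterns, pairs


patterns, pairs = bootstrap(SENTENCES, SEED_PATTERN)
print(sorted(patterns))
print(sorted(pairs))
```

In this toy corpus the seed reaches the pattern "is activated by" through the shared clopidogrel/CYP2C19 pair, which in turn yields the codeine/CYP2D6 pair. The sentence with "dose is unrelated to" is never picked up, because no known pair links to it. A real system would also need entity recognition and some form of pattern ranking or filtering, none of which this sketch attempts.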