Using Text Mining of PubMed Abstracts As An Evidence Source in Computational Predictions of WW Domain-Mediated Protein-Protein Interactions

Olhovsky, Marina

dc.contributor.advisor	Pearlman, Ronald E
dc.creator	Olhovsky, Marina
dc.date.accessioned	2016-09-20T16:22:54Z
dc.date.available	2016-09-20T16:22:54Z
dc.date.copyright	2015-08-25
dc.date.issued	2016-09-20
dc.date.updated	2016-09-20T16:22:54Z
dc.degree.discipline	Biology
dc.degree.level	Master's
dc.degree.name	MSc - Master of Science
dc.description.abstract	Protein-protein interactions (PPIs) are a key regulatory mechanism in coordinating a multitude of processes vital to normal cellular function. A number of small-scale and high-throughput wet-lab methods exist for accurately identifying PPIs; however, despite their accuracy, these methods are expensive in both time and money. Complementing experimental methods with computational predictions increases the effectiveness of small-scale wet-lab methodologies in identifying high-quality protein interaction networks. Computational predictions are made by applying bioinformatics and machine-learning algorithms to large-scale training sets obtained from wet-lab experiments, or by extracting information on PPIs from high volumes of published data that do not directly identify protein interactions but are nonetheless correlated with them. A disadvantage of computational predictions is their high degree of inaccuracy, namely too many false positives and false negatives. To improve the accuracy of computational predictions, it is important to consider interactions that are likely to occur in vivo under certain biological conditions, termed context. One technique for improving prediction accuracy is analyzing data obtained via different types of experiments that consider different features of the co-occurring proteins, such as co-localization, co-expression, correlated mutations, or semantic similarity. These experimental sources and their resulting data are called sources of evidence. Integrating data from multiple independent supporting evidence sources improves prediction accuracy. In this work, I used text mining of PubMed abstracts as an evidence source for protein interactions. I hypothesized that proteins whose names are frequently mentioned in the same abstract are more likely to interact in vivo than randomly chosen proteins.
A comparison of three text mining techniques (gene name co-occurrence, MeSH term indexing, and co-occurrence with a controlled vocabulary) shows that co-occurrence with a controlled vocabulary yields the highest precision and recall. I therefore concluded that gene name co-occurrence with a controlled vocabulary can be used as a novel evidence source for prediction of WW domain-mediated PPIs.
dc.identifier.uri	http://hdl.handle.net/10315/32083
dc.language.iso	en
dc.rights	Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject	Bioinformatics
dc.subject.keywords	Text mining
dc.subject.keywords	Protein interaction
dc.subject.keywords	Python
dc.subject.keywords	Precision
dc.subject.keywords	Recall
dc.subject.keywords	WW domain
dc.title	Using Text Mining of PubMed Abstracts As An Evidence Source in Computational Predictions of WW Domain-Mediated Protein-Protein Interactions
dc.type	Electronic Thesis or Dissertation
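
The abstract describes gene name co-occurrence and evaluates it by precision and recall. The following is a minimal Python sketch of that general idea, not the thesis's actual implementation. The gene names (AAA1, BBB2, CCC3, DDD4), the toy abstracts, the reference interaction set, and the min_count threshold are all hypothetical placeholders chosen for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, gene_names):
    """Count, for each unordered gene pair, the abstracts that mention both names."""
    # Whole-word, case-insensitive match for each gene name.
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in gene_names
    }
    counts = Counter()
    for text in abstracts:
        # Sort so every pair is stored in one canonical (alphabetical) order.
        present = sorted(n for n, p in patterns.items() if p.search(text))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def precision_recall(predicted, known):
    """Precision and recall of a predicted pair set against a reference set."""
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical toy data: placeholder gene names and invented sentences.
gene_names = ["AAA1", "BBB2", "CCC3", "DDD4"]
abstracts = [
    "AAA1 binds BBB2 through its WW domains.",
    "BBB2 recruits AAA1 to the membrane.",
    "CCC3 modifies DDD4.",
    "AAA1 and DDD4 levels were measured.",
]
# Hypothetical reference set of "known" interactions (pairs in alphabetical order).
known = {("AAA1", "BBB2"), ("CCC3", "DDD4")}

counts = cooccurrence_counts(abstracts, gene_names)

# Assumed rule: predict an interaction when a pair co-occurs in at least
# min_count abstracts. The threshold value is illustrative only.
min_count = 2
predicted = {pair for pair, n in counts.items() if n >= min_count}

precision, recall = precision_recall(predicted, known)
print(predicted, precision, recall)
```

Raising min_count in this sketch trades recall for precision: pairs mentioned together only once are dropped, which removes the incidental AAA1/DDD4 co-mention but also misses the CCC3/DDD4 pair from the reference set.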