Extending Topic Models With Syntax and Semantics Relationships

Delpisheh, Elnaz

dc.contributor.advisor	An, Aijun
dc.creator	Delpisheh, Elnaz
dc.date.accessioned	2015-12-16T19:09:47Z
dc.date.available	2015-12-16T19:09:47Z
dc.date.copyright	2015-05-06
dc.date.issued	2015-12-16
dc.date.updated	2015-12-16T19:09:47Z
dc.degree.discipline	Computer Science
dc.degree.level	Doctoral
dc.degree.name	PhD - Doctor of Philosophy
dc.description.abstract	Probabilistic topic modeling is a powerful tool for uncovering the hidden thematic structure of documents. These hidden structures are useful for extracting the concepts of documents and for other data mining tasks, such as information retrieval. Latent Dirichlet allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text corpora. LDA represents documents as bags of words, which neglects the important structure of documents. In this work, we propose three extended LDA models that incorporate the syntactic and semantic structures of text documents into probabilistic topic models. Our first proposed topic model enriches text documents with collapsed typed dependency relations to effectively acquire syntactic and semantic dependencies between consecutive and nonconsecutive words of text documents. This representation has several benefits. It captures relations between words regardless of whether they are adjacent. In addition, the labels of the collapsed typed dependency relations help to eliminate less important relations, i.e., relations involving prepositions. Moreover, in this thesis, we introduce a method to enforce topic similarity among conceptually similar words. As a result, this algorithm leads to more coherent topic distributions over words. Our second and third proposed generative topic models incorporate term importance into latent topic variables by boosting the probability of important terms and consequently decreasing the probability of less important terms, to better reflect the themes of documents. In essence, we assign weights to terms by employing corpus-level and document-level approaches. We incorporate term importance using a nonuniform base measure for an asymmetric prior over topic-term distributions in the LDA framework. This leads to better estimates for important terms that occur less frequently in documents.
Experimental studies have been conducted to show the effectiveness of our work across a variety of text mining applications. Furthermore, we employ our topic models to build a personalized content-based news recommender system. Our proposed recommender system eases reading and navigation through online newspapers. In essence, the recommender system acts as a filter, delivering only news articles that can be considered relevant to a user. This recommender system has been used by The Globe and Mail, a Canadian news organization that covers national and international news.
dc.identifier.uri	http://hdl.handle.net/10315/30620
dc.language.iso	en
dc.rights	Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject	Computer science
dc.subject	Computer engineering
dc.subject.keywords	Data mining
dc.subject.keywords	Text mining
dc.subject.keywords	Topic modeling
dc.subject.keywords	LDA
dc.subject.keywords	Latent Dirichlet allocation
dc.subject.keywords	Recommender system
dc.subject.keywords	Computational linguistics
dc.title	Extending Topic Models With Syntax and Semantics Relationships
dc.type	Electronic Thesis or Dissertation

Files

Original bundle

Name: Delpisheh_Elnaz_2015_PhD.pdf
Size: 1.37 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.83 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.38 KB
Format: Plain Text

Collections

Computer Science and Engineering