Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders
dc.contributor.advisor | An, Aijun | |
dc.contributor.advisor | Davoudi, Heidar | |
dc.contributor.author | Abbasi, Saeed | |
dc.date.accessioned | 2024-11-07T11:11:57Z | |
dc.date.available | 2024-11-07T11:11:57Z | |
dc.date.copyright | 2024-08-14 | |
dc.date.issued | 2024-11-07 | |
dc.date.updated | 2024-11-07T11:11:56Z | |
dc.degree.discipline | Computer Science | |
dc.degree.level | Master's | |
dc.degree.name | MSc - Master of Science | |
dc.description.abstract | The subdivision of documents into semantically coherent segments is a fundamental challenge in Natural Language Processing (NLP), with notable applications in information retrieval and question answering. Effective text segmentation is crucial for enhancing Retrieval-Augmented Generation (RAG) systems by providing coherent segments that improve the contextual accuracy of responses. We introduce a weighted sliding window framework, WeSWin, that effectively segments arbitrarily long documents using Transformers. WeSWin consists of overlapping document partitioning followed by the weighted aggregation of multiple sentence predictions within the generated windows, ensuring that sentences with larger context visibility contribute more to the final label. Additionally, we propose a multi-task training framework, WeSWin-Ret, which combines text segmentation with an auxiliary sentence retrieval task. This approach injects query-awareness into the embedding space of the shared Transformer, resulting in improved segmentation performance. Extensive experiments demonstrate that our methods outperform state-of-the-art approaches. On the Wiki-727k benchmark, both our WeSWin and WeSWin-Ret models surpass existing works based on BERT, RoBERTa, or Longformer Transformers. Notably, our RoBERTa baseline often matches Longformer's performance while being significantly more efficient in training and inference. We validate our model's robustness on domain-specific segmentation benchmarks, including en_city, en_disease, and an industrial automotive dataset, demonstrating generalizability across domains. Lastly, our model proves to be highly effective in enhancing downstream RAG applications by providing cohesive chunks for knowledge retrieval. | |
dc.identifier.uri | https://hdl.handle.net/10315/42470 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Artificial intelligence | |
dc.subject | Linguistics | |
dc.subject.keywords | Text segmentation | |
dc.subject.keywords | Document segmentation | |
dc.subject.keywords | Semantic text segmentation | |
dc.subject.keywords | Neural text segmentation | |
dc.subject.keywords | Text chunking | |
dc.subject.keywords | Semantic text chunking | |
dc.subject.keywords | Transformers | |
dc.subject.keywords | Natural language processing | |
dc.subject.keywords | Retrieval-augmented generation | |
dc.subject.keywords | Large language models | |
dc.title | Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders | |
dc.type | Electronic Thesis or Dissertation |
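The abstract describes WeSWin's core mechanism: partition a document into overlapping windows, score each sentence in every window it appears in, and aggregate the per-window predictions with weights that favour sentences seeing more surrounding context. The sketch below illustrates that aggregation scheme under stated assumptions; the window size, stride, and distance-to-edge weighting function are illustrative choices, not the thesis's actual hyperparameters, and the per-window scores would in practice come from a Transformer encoder.

```python
# Hedged sketch of weighted sliding-window aggregation as described in the
# abstract. The weighting function and window parameters are assumptions
# for illustration, not the thesis's actual configuration.

def make_windows(n_sentences, window_size, stride):
    """Partition sentence indices 0..n-1 into overlapping windows."""
    windows, start = [], 0
    while True:
        end = min(start + window_size, n_sentences)
        windows.append(list(range(start, end)))
        if end == n_sentences:
            break
        start += stride
    return windows

def context_weight(pos, window_len):
    """Weight a sentence by its context visibility inside a window:
    sentences nearer the window centre see more bidirectional context,
    so they receive a larger weight (distance to the nearer edge)."""
    return min(pos + 1, window_len - pos)

def aggregate(n_sentences, windows, window_scores):
    """Combine per-window boundary scores into one score per sentence
    via a context-weighted average over all windows containing it."""
    totals = [0.0] * n_sentences
    weights = [0.0] * n_sentences
    for win, scores in zip(windows, window_scores):
        for pos, sent in enumerate(win):
            w = context_weight(pos, len(win))
            totals[sent] += w * scores[pos]
            weights[sent] += w
    return [t / w for t, w in zip(totals, weights)]
```

With a window size of 3 and stride of 2 over 5 sentences, sentence 2 appears in both windows, and its final score blends the two per-window predictions; sentences seen only once keep their single score unchanged.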
Files
Original bundle
- Name: Abbasi_Saeed_2024_Masters.pdf
- Size: 2.66 MB
- Format: Adobe Portable Document Format