Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders

dc.contributor.advisor: An, Aijun
dc.contributor.advisor: Davoudi, Heidar
dc.contributor.author: Abbasi, Saeed
dc.date.accessioned: 2024-11-07T11:11:57Z
dc.date.available: 2024-11-07T11:11:57Z
dc.date.copyright: 2024-08-14
dc.date.issued: 2024-11-07
dc.date.updated: 2024-11-07T11:11:56Z
dc.degree.discipline: Computer Science
dc.degree.level: Master's
dc.degree.name: MSc - Master of Science
dc.description.abstract: The subdivision of documents into semantically coherent segments is a fundamental challenge in Natural Language Processing (NLP), with notable applications in information retrieval and question answering. Effective text segmentation is crucial for enhancing Retrieval-Augmented Generation (RAG) systems by providing coherent segments that improve the contextual accuracy of responses. We introduce a weighted sliding window framework, WeSWin, that effectively segments arbitrarily long documents using Transformers. WeSWin consists of overlapping document partitioning followed by the weighted aggregation of multiple sentence predictions within the generated windows, ensuring that sentences with larger context visibility contribute more to the final label. Additionally, we propose a multi-task training framework, WeSWin-Ret, which combines text segmentation with an auxiliary sentence retrieval task. This approach injects query-awareness into the embedding space of the shared Transformer, resulting in improved segmentation performance. Extensive experiments demonstrate that our methods outperform state-of-the-art approaches. On the Wiki-727k benchmark, both our WeSWin and WeSWin-Ret models surpass existing works based on BERT, RoBERTa, or Longformer Transformers. Notably, our RoBERTa baseline often matches Longformer's performance while being significantly more efficient in training and inference. We validate our model's robustness on domain-specific segmentation benchmarks, including en_city, en_disease, and an industrial automotive dataset, demonstrating generalizability across domains. Lastly, our model proves highly effective in enhancing downstream RAG applications by providing cohesive chunks for knowledge retrieval. (An illustrative sketch of the weighted sliding-window aggregation follows this record.)
dc.identifier.uri: https://hdl.handle.net/10315/42470
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject: Linguistics
dc.subject.keywords: Text segmentation
dc.subject.keywords: Document segmentation
dc.subject.keywords: Semantic text segmentation
dc.subject.keywords: Neural text segmentation
dc.subject.keywords: Text chunking
dc.subject.keywords: Semantic text chunking
dc.subject.keywords: Transformers
dc.subject.keywords: Natural language processing
dc.subject.keywords: Retrieval-augmented generation
dc.subject.keywords: Large language models
dc.title: Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders
dc.type: Electronic Thesis or Dissertation
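The following is a minimal Python sketch of the weighted sliding-window aggregation the abstract describes: sentences are covered by several overlapping windows, each window's per-sentence boundary probabilities are weighted by how much context the sentence sees within that window, and the weighted average decides the final label. The window size, stride, decision threshold, the triangular context-visibility weighting, and the score_window stub are all illustrative assumptions, not the thesis's actual configuration, which uses a Transformer encoder as the per-window classifier.

from typing import Callable, List
import random

def make_windows(n_sentences: int, window: int = 10, stride: int = 5) -> List[range]:
    # Partition sentence indices 0..n-1 into overlapping windows.
    starts = range(0, max(n_sentences - window, 0) + 1, stride)
    wins = [range(s, min(s + window, n_sentences)) for s in starts]
    # Cover any trailing sentences missed by the last strided window.
    if wins and wins[-1].stop < n_sentences:
        wins.append(range(max(n_sentences - window, 0), n_sentences))
    return wins

def context_weight(pos: int, window_len: int) -> float:
    # Assumed triangular weighting: sentences near the window centre see
    # the most bidirectional context, so they get the largest weight.
    centre = (window_len - 1) / 2
    return 1.0 - abs(pos - centre) / (centre + 1)

def segment(sentences: List[str],
            score_window: Callable[[List[str]], List[float]],
            threshold: float = 0.5) -> List[int]:
    # Aggregate per-window boundary probabilities into one label per
    # sentence, weighting each prediction by its context visibility.
    n = len(sentences)
    num, den = [0.0] * n, [0.0] * n
    for win in make_windows(n):
        probs = score_window([sentences[i] for i in win])
        for pos, i in enumerate(win):
            w = context_weight(pos, len(win))
            num[i] += w * probs[pos]
            den[i] += w
    return [i for i in range(n) if num[i] / den[i] >= threshold]

# Stand-in for a Transformer sentence-boundary classifier (random stub).
def dummy_scorer(window: List[str]) -> List[float]:
    return [random.random() for _ in window]

print(segment([f"Sentence {i}." for i in range(25)], dummy_scorer))

One consequence of this weighting, under the triangular assumption above: a sentence near a window's edge, where the model sees little surrounding context, contributes little there and is instead dominated by the windows in which it appears more centrally.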

Files

Original bundle
Name: Abbasi_Saeed_2024_Masters.pdf
Size: 2.66 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text
