Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders
dc.contributor.advisor | An, Aijun | |
dc.contributor.advisor | Davoudi, Heidar | |
dc.contributor.author | Abbasi, Saeed | |
dc.date.accessioned | 2024-11-07T11:11:57Z | |
dc.date.available | 2024-11-07T11:11:57Z | |
dc.date.copyright | 2024-08-14 | |
dc.date.issued | 2024-11-07 | |
dc.date.updated | 2024-11-07T11:11:56Z | |
dc.degree.discipline | Computer Science | |
dc.degree.level | Master's | |
dc.degree.name | MSc - Master of Science | |
dc.description.abstract | The subdivision of documents into semantically coherent segments is a fundamental challenge in Natural Language Processing (NLP), with notable applications in information retrieval and question answering. Effective text segmentation is crucial for enhancing Retrieval-Augmented Generation (RAG) systems by providing coherent segments that improve the contextual accuracy of responses. We introduce a weighted sliding window framework, WeSWin, that effectively segments arbitrarily long documents using Transformers. WeSWin consists of overlapping document partitioning followed by the weighted aggregation of multiple sentence predictions within the generated windows, ensuring that sentences with larger context visibility contribute more to the final label. Additionally, we propose a multi-task training framework, WeSWin-Ret, which combines text segmentation with an auxiliary sentence retrieval task. This approach injects query-awareness into the embedding space of the shared Transformer, resulting in improved segmentation performance. Extensive experiments demonstrate that our methods outperform state-of-the-art approaches. On the Wiki-727k benchmark, both our WeSWin and WeSWin-Ret models surpass existing works based on BERT, RoBERTa, or Longformer Transformers. Notably, our RoBERTa baseline often matches Longformer's performance while being significantly more efficient in training and inference. We validate our model's robustness on domain-specific segmentation benchmarks, including en_city, en_disease, and an industrial automotive dataset, demonstrating generalizability across domains. Lastly, our model proves to be highly effective in enhancing downstream RAG applications by providing cohesive chunks for knowledge retrieval. | |
dc.identifier.uri | https://hdl.handle.net/10315/42470 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Computer science | |
dc.subject | Artificial intelligence | |
dc.subject | Linguistics | |
dc.subject.keywords | Text segmentation | |
dc.subject.keywords | Document segmentation | |
dc.subject.keywords | Semantic text segmentation | |
dc.subject.keywords | Neural text segmentation | |
dc.subject.keywords | Text chunking | |
dc.subject.keywords | Semantic text chunking | |
dc.subject.keywords | Transformers | |
dc.subject.keywords | Natural language processing | |
dc.subject.keywords | Retrieval-augmented generation | |
dc.subject.keywords | Large language models | |
dc.title | Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders | |
dc.type | Electronic Thesis or Dissertation |
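The abstract describes WeSWin's core mechanism: partition a document into overlapping windows, score each sentence in every window it appears in, and aggregate the per-window predictions with weights that favour sentences seeing more surrounding context. The sketch below illustrates that aggregation scheme under stated assumptions; the window size, stride, and distance-to-edge weighting function are illustrative choices, not the thesis's actual hyperparameters, and the per-window scores would in practice come from a Transformer encoder.

```python
# Hedged sketch of weighted sliding-window aggregation as described in the
# abstract. The weighting function and window parameters are assumptions
# for illustration, not the thesis's actual configuration.

def make_windows(n_sentences, window_size, stride):
    """Partition sentence indices 0..n-1 into overlapping windows."""
    windows, start = [], 0
    while True:
        end = min(start + window_size, n_sentences)
        windows.append(list(range(start, end)))
        if end == n_sentences:
            break
        start += stride
    return windows

def context_weight(pos, window_len):
    """Weight a sentence by its context visibility inside a window:
    sentences nearer the window centre see more bidirectional context,
    so they receive a larger weight (distance to the nearer edge)."""
    return min(pos + 1, window_len - pos)

def aggregate(n_sentences, windows, window_scores):
    """Combine per-window boundary scores into one score per sentence
    via a context-weighted average over all windows containing it."""
    totals = [0.0] * n_sentences
    weights = [0.0] * n_sentences
    for win, scores in zip(windows, window_scores):
        for pos, sent in enumerate(win):
            w = context_weight(pos, len(win))
            totals[sent] += w * scores[pos]
            weights[sent] += w
    return [t / w for t, w in zip(totals, weights)]
```

With a window size of 3 and stride of 2 over 5 sentences, sentence 2 appears in both windows, and its final score blends the two per-window predictions; sentences seen only once keep their single score unchanged.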
Files
Original bundle
- Name: Abbasi_Saeed_2024_Masters.pdf
- Size: 2.66 MB
- Format: Adobe Portable Document Format