Neural Document Segmentation Using Weighted Sliding Windows with Transformer Encoders

Date

2024-11-07

Authors

Abbasi, Saeed

Abstract

The subdivision of documents into semantically coherent segments is a fundamental challenge in Natural Language Processing (NLP), with notable applications in information retrieval and question answering. Effective text segmentation is crucial for enhancing Retrieval-Augmented Generation (RAG) systems by providing coherent segments that improve the contextual accuracy of responses. We introduce a weighted sliding window framework, WeSWin, that effectively segments arbitrarily long documents using Transformers. WeSWin consists of overlapping document partitioning followed by the weighted aggregation of multiple sentence predictions within the generated windows, ensuring that sentences with larger context visibility contribute more to the final label. Additionally, we propose a multi-task training framework, WeSWin-Ret, which combines text segmentation with an auxiliary sentence retrieval task. This approach injects query-awareness into the embedding space of the shared Transformer, resulting in improved segmentation performance. Extensive experiments demonstrate that our methods outperform state-of-the-art approaches. On the Wiki-727k benchmark, both our WeSWin and WeSWin-Ret models surpass existing works based on BERT, RoBERTa, or LongFormer Transformers. Notably, our RoBERTa baseline often matches LongFormer’s performance while being significantly more efficient in training and inference. We validate our model’s robustness on domain-specific segmentation benchmarks, including en_city, en_disease, and an industrial automotive dataset, demonstrating generalizability across domains. Lastly, our model proves to be highly effective in enhancing downstream RAG applications by providing cohesive chunks for knowledge retrieval.
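To make the aggregation idea in the abstract concrete, below is a minimal Python sketch of one plausible reading of weighted sliding-window segmentation. The function name, the placeholder score_window scorer (standing in for the Transformer's per-sentence boundary predictions), and the edge-distance weighting are illustrative assumptions, not the exact scheme used by WeSWin or WeSWin-Ret.

```python
from typing import Callable, List

def weighted_sliding_window_segment(
    sentences: List[str],
    score_window: Callable[[List[str]], List[float]],  # per-sentence boundary probabilities
    window_size: int = 10,
    stride: int = 5,
    threshold: float = 0.5,
) -> List[bool]:
    """Aggregate overlapping-window predictions into per-sentence boundary labels.

    Each sentence may appear in several windows; its predictions are averaged
    with weights that favour occurrences with more surrounding context
    (i.e. positions farther from the window edges). This is an assumed
    weighting, chosen only to illustrate the idea of context visibility.
    """
    n = len(sentences)
    weighted_sum = [0.0] * n
    weight_total = [0.0] * n

    start = 0
    while start < n:
        end = min(start + window_size, n)
        window = sentences[start:end]
        probs = score_window(window)  # one boundary probability per sentence in the window
        for offset, p in enumerate(probs):
            idx = start + offset
            # Context visibility: distance to the nearest window edge (+1 so it is never zero).
            weight = min(offset, len(window) - 1 - offset) + 1
            weighted_sum[idx] += weight * p
            weight_total[idx] += weight
        if end == n:
            break
        start += stride

    return [weighted_sum[i] / weight_total[i] >= threshold for i in range(n)]


if __name__ == "__main__":
    # Toy scorer: treat sentences ending in "!" as likely segment boundaries.
    def toy_scorer(window: List[str]) -> List[float]:
        return [0.9 if s.endswith("!") else 0.1 for s in window]

    doc = ["Sentence %d." % i if i % 4 else "Sentence %d!" % i for i in range(12)]
    print(weighted_sliding_window_segment(doc, toy_scorer, window_size=6, stride=3))
```

In practice the scorer would be a fine-tuned Transformer encoder producing a boundary probability for each sentence in the window; the aggregation step above is independent of that choice.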

Keywords

Computer science, Artificial intelligence, Linguistics
