Dual-Agent Deep Reinforcement Learning Approach to GPU Job Scheduling
Abstract
Public cloud GPU clusters are increasingly used for distributed deep-learning workloads, making the job scheduler critical for minimizing job waiting and completion times. Scheduling, however, is inherently complex and NP-hard. Existing approaches typically treat job scheduling and GPU allocation as separate problems, leading to suboptimal performance. Deep reinforcement learning (DRL) based scheduling methods, while flexible, often overlook two challenges. First, they focus on minimizing total job completion time and ignore fairness in waiting times. Second, distributed training speed is strongly influenced by GPU communication costs, which are frequently neglected. To address these issues, we introduce AttentiveSched, a DRL-based framework that jointly optimizes job selection and GPU assignment. AttentiveSched takes the cluster topology into account for informed scheduling. Its two agents (job and GPU) use attention mechanisms to capture global relationships in the input sequence. By incorporating fairness, job completion time, and communication costs into its rewards, AttentiveSched outperforms heuristic, meta-heuristic, and other DRL-based schedulers on real-world datasets.
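To make the multi-objective reward concrete, here is a minimal sketch of how the three signals named in the abstract (job completion time, waiting-time fairness, and GPU communication cost) could be folded into one scalar reward. The specific terms and weights below are illustrative assumptions, not the paper's actual formulation:

```python
# Illustrative sketch only -- not AttentiveSched's published reward.
# Combines mean job completion time, waiting-time fairness (variance of
# waiting times), and a communication-cost proxy (cross-node GPU pairs).
from statistics import pvariance

def comm_cost(gpu_nodes):
    """Count cross-node GPU pairs in one job's allocation; inter-node
    links are the expensive ones for distributed training."""
    pairs = 0
    n = len(gpu_nodes)
    for i in range(n):
        for j in range(i + 1, n):
            if gpu_nodes[i] != gpu_nodes[j]:
                pairs += 1
    return pairs

def reward(completion_times, waiting_times, gpu_nodes,
           w_jct=1.0, w_fair=0.5, w_comm=0.1):
    """Negative weighted sum: the agent maximizes reward, so it is pushed
    to shrink completion time, waiting-time spread, and comm cost.
    Weights are arbitrary placeholders."""
    jct_term = sum(completion_times) / len(completion_times)  # mean JCT
    fair_term = pvariance(waiting_times)  # 0 when all jobs wait equally
    return -(w_jct * jct_term + w_fair * fair_term
             + w_comm * comm_cost(gpu_nodes))
```

Under this sketch, co-locating a job's GPUs on one node yields a strictly higher reward than spreading them across nodes, all else equal, which is the behavior the abstract's topology-aware design aims for.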