Dual-Agent Deep Reinforcement Learning Approach to GPU Job Scheduling
Abstract
Public cloud GPU clusters are increasingly used for distributed deep-learning workloads, making the job scheduler critical for minimizing job waiting and completion times. Scheduling, however, is inherently complex and NP-hard. Existing approaches typically treat job scheduling and GPU allocation as separate problems, leading to suboptimal performance. Deep reinforcement learning (DRL) based scheduling methods, while flexible, often overlook two challenges. First, they focus on minimizing total job completion time and ignore fairness in waiting times. Second, distributed training speed is strongly influenced by GPU communication costs, which are frequently neglected. To address these issues, we introduce AttentiveSched, a DRL-based framework that jointly optimizes job selection and GPU assignment. AttentiveSched takes the cluster topology into account for informed scheduling. Its two agents (job and GPU) use attention mechanisms to capture global relationships in the input sequence. By incorporating fairness, job completion time, and communication costs into its rewards, AttentiveSched outperforms heuristic, meta-heuristic, and other DRL-based schedulers on real-world datasets.
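To make the multi-objective reward concrete, here is a minimal sketch of how the three signals named in the abstract (job completion time, waiting-time fairness, and GPU communication cost) could be folded into one scalar reward. The specific terms and weights below are illustrative assumptions, not the paper's actual formulation:

```python
# Illustrative sketch only -- not AttentiveSched's published reward.
# Combines mean job completion time, waiting-time fairness (variance of
# waiting times), and a communication-cost proxy (cross-node GPU pairs).
from statistics import pvariance

def comm_cost(gpu_nodes):
    """Count cross-node GPU pairs in one job's allocation; inter-node
    links are the expensive ones for distributed training."""
    pairs = 0
    n = len(gpu_nodes)
    for i in range(n):
        for j in range(i + 1, n):
            if gpu_nodes[i] != gpu_nodes[j]:
                pairs += 1
    return pairs

def reward(completion_times, waiting_times, gpu_nodes,
           w_jct=1.0, w_fair=0.5, w_comm=0.1):
    """Negative weighted sum: the agent maximizes reward, so it is pushed
    to shrink completion time, waiting-time spread, and comm cost.
    Weights are arbitrary placeholders."""
    jct_term = sum(completion_times) / len(completion_times)  # mean JCT
    fair_term = pvariance(waiting_times)  # 0 when all jobs wait equally
    return -(w_jct * jct_term + w_fair * fair_term
             + w_comm * comm_cost(gpu_nodes))
```

Under this sketch, co-locating a job's GPUs on one node yields a strictly higher reward than spreading them across nodes, all else equal, which is the behavior the abstract's topology-aware design aims for.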