AdapTrain: Adaptive Model Partitioning for Efficient Independent Subnet Training on Heterogeneous and Dynamic Cloud Infrastructures
Abstract
Modern distributed training systems face significant challenges in heterogeneous computing environments: disparities in computational resources among workers often lead to resource underutilization and extended training durations, particularly in resource-constrained settings. To address these challenges, we propose Adaptive Model Partitioning for Efficient Independent Subnet Training on Heterogeneous and Dynamic Cloud Infrastructures (AdapTrain), a novel framework that dynamically adjusts model partitioning to match the computational capacities of heterogeneous workers. By ensuring that all workers complete each training round at roughly the same time, AdapTrain reduces synchronization overhead and thereby minimizes total end-to-end training time. Its adaptive design sustains robust performance under the workload variations, inherent resource heterogeneity, and multi-tenancy effects prevalent in cloud computing environments. An experimental evaluation on production workloads shows that AdapTrain accelerates model convergence by more than 8x compared to current training methods. Furthermore, AdapTrain integrates seamlessly into existing systems, introducing negligible performance overhead while significantly improving training efficiency.
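To make the core idea concrete, the sketch below illustrates capacity-proportional partitioning: given measured per-worker throughputs, it assigns each worker a share of the model sized so that all workers finish a round at about the same time. This is a minimal illustration under assumed inputs (the function name `partition_sizes` and the unit-based model size are hypothetical, not part of AdapTrain's published interface).

```python
def partition_sizes(total_units, throughputs):
    """Split a model of `total_units` partitionable units across workers
    in proportion to their measured throughputs, so that faster workers
    receive larger subnets and all workers finish a round together.

    Hypothetical helper for illustration; AdapTrain's actual partitioning
    logic may differ.
    """
    total = sum(throughputs)
    # Ideal (fractional) share for each worker.
    shares = [total_units * t / total for t in throughputs]
    # Round down, then distribute the leftover units to the workers
    # with the largest fractional remainders (largest-remainder method),
    # so the shares always sum exactly to total_units.
    floored = [int(s) for s in shares]
    leftover = total_units - sum(floored)
    by_remainder = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - floored[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        floored[i] += 1
    return floored


# Example: a worker with twice the throughput gets twice the subnet.
print(partition_sizes(100, [1, 1, 2]))  # -> [25, 25, 50]
```

Re-running this assignment whenever measured throughputs drift (e.g., due to multi-tenancy interference) is one way to keep round completion times aligned as conditions change.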