PAMBA: Partition-aware and Multi-SLO Batching for Serverless Inference on Heterogeneous Clouds

Authors

Abedini, Alireza

Abstract

Serverless computing offers elasticity and fine-grained billing for machine learning inference, but efficiently supporting large models under diverse latency service-level objectives (SLOs) remains challenging. In particular, existing approaches face a wide cost–performance gap between CPU and GPU execution, while batching and resource selection become increasingly complex under heterogeneous workloads and multiple SLOs. This thesis presents PAMBA, a partition-aware and multi-SLO batching system for serverless inference on heterogeneous clouds. PAMBA combines multi-SLO batching with analytical latency and cost models for CPU, GPU, and partitioned execution, enabling consistent provisioning decisions across different execution modes. To bridge the CPU–GPU gap, the system employs a customized partitioning strategy derived from latency-optimal partitioning, adapted to satisfy serverless resource constraints and jointly consider latency feasibility and per-request cost. This adaptation allows partitioned execution to emerge as an effective intermediate regime between monolithic CPU and GPU deployments. By jointly optimizing execution mode selection, batching, and resource allocation, PAMBA enables flexible inference deployment across a wide range of SLOs and arrival rates, including scenarios where GPU resources are unavailable or inefficiently utilized. Experimental results on convolutional neural networks demonstrate that PAMBA identifies distinct execution frontiers and reduces inference cost compared to existing serverless batching techniques, while maintaining SLO feasibility across heterogeneous workloads.
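
To make the provisioning logic concrete, the sketch below shows how an SLO-feasibility check combined with analytical cost models could select among CPU, GPU, and partitioned execution. This is not code from the thesis: the latency coefficients, per-second prices, function names, and the blended partitioned rate are all invented for illustration, and a real system would fit such models from offline profiling.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-second prices; real values depend on the provider's billing.
CPU_PRICE_PER_S = 0.00002
GPU_PRICE_PER_S = 0.00035

@dataclass
class ModeEstimate:
    name: str            # "cpu", "gpu", or "partitioned"
    latency_s: float     # predicted latency for a batch of the given size
    cost_per_req: float  # predicted cost per request in the batch

def predict_modes(batch_size: int) -> list[ModeEstimate]:
    """Toy analytical latency models: a fixed setup term plus a per-request
    service term. All coefficients here are made up for illustration."""
    cpu_lat = 0.50 + 0.080 * batch_size
    gpu_lat = 1.20 + 0.004 * batch_size   # higher startup cost, cheap marginal cost
    part_lat = 0.70 + 0.030 * batch_size  # early layers on CPU, remainder on GPU
    part_price = 0.5 * (CPU_PRICE_PER_S + GPU_PRICE_PER_S)  # crude blended rate
    return [
        ModeEstimate("cpu", cpu_lat, CPU_PRICE_PER_S * cpu_lat / batch_size),
        ModeEstimate("gpu", gpu_lat, GPU_PRICE_PER_S * gpu_lat / batch_size),
        ModeEstimate("partitioned", part_lat, part_price * part_lat / batch_size),
    ]

def choose_mode(batch_size: int, slo_s: float) -> Optional[ModeEstimate]:
    """Among the modes whose predicted latency meets the SLO,
    pick the one with the lowest per-request cost."""
    feasible = [m for m in predict_modes(batch_size) if m.latency_s <= slo_s]
    return min(feasible, key=lambda m: m.cost_per_req) if feasible else None

if __name__ == "__main__":
    for slo in (0.8, 1.0, 1.5):
        best = choose_mode(batch_size=8, slo_s=slo)
        print(f"SLO={slo}s ->", best.name if best else "no feasible mode")
```

In this toy configuration, the tightest SLO rules out every mode, the moderate one is reachable only by the CPU–GPU split, and the loosest one makes monolithic CPU the cheapest feasible choice, mirroring the intermediate regime between CPU and GPU deployments described above.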

Keywords

Computer science
