PAMBA: Partition-aware and Multi-SLO Batching for Serverless Inference on Heterogeneous Clouds

Abedini, Alireza

dc.contributor.advisor	Khazaei, Hamzeh
dc.contributor.author	Abedini, Alireza
dc.date.accessioned	2026-03-10T16:20:53Z
dc.date.available	2026-03-10T16:20:53Z
dc.date.copyright	2026-02-05
dc.date.issued	2026-03-10
dc.date.updated	2026-03-10T16:20:52Z
dc.degree.discipline	Computer Science
dc.degree.level	Master's
dc.degree.name	MSc - Master of Science
dc.description.abstract	Serverless computing offers elasticity and fine-grained billing for machine learning inference, but efficiently supporting large models under diverse latency service-level objectives (SLOs) remains challenging. In particular, existing approaches face a wide cost–performance gap between CPU and GPU execution, while batching and resource selection become increasingly complex under heterogeneous workloads and multiple SLOs. This thesis presents PAMBA, a partition-aware and multi-SLO batching system for serverless inference on heterogeneous clouds. PAMBA combines multi-SLO batching with analytical latency and cost models for CPU, GPU, and partitioned execution, enabling consistent provisioning decisions across different execution modes. To bridge the CPU–GPU gap, the system employs a customized partitioning strategy derived from latency-optimal partitioning, adapted to satisfy serverless resource constraints and jointly consider latency feasibility and per-request cost. This adaptation allows partitioned execution to emerge as an effective intermediate regime between monolithic CPU and GPU deployments. By jointly optimizing execution mode selection, batching, and resource allocation, PAMBA enables flexible inference deployment across a wide range of SLOs and arrival rates, including scenarios where GPU resources are unavailable or inefficiently utilized. Experimental results on convolutional neural networks demonstrate that PAMBA identifies distinct execution frontiers and reduces inference cost compared to existing serverless batching techniques, while maintaining SLO feasibility across heterogeneous workloads.
dc.identifier.uri	https://hdl.handle.net/10315/43654
dc.language	en
dc.rights	Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject	Computer science
dc.subject.keywords	Serverless computing
dc.subject.keywords	ML inference
dc.subject.keywords	Multi-SLO batching
dc.subject.keywords	Model partitioning
dc.subject.keywords	Optimization
dc.subject.keywords	Cloud computing
dc.title	PAMBA: Partition-aware and Multi-SLO Batching for Serverless Inference on Heterogeneous Clouds
dc.type	Electronic Thesis or Dissertation

Files

Original bundle

Name: Abedini_Alireza_2026_MSc.pdf
Size: 2.17 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text

Collections

Computer Science