PAMBA: Partition-aware and Multi-SLO Batching for Serverless Inference on Heterogeneous Clouds

dc.contributor.advisor: Khazaei, Hamzeh
dc.contributor.author: Abedini, Alireza
dc.date.accessioned: 2026-03-10T16:20:53Z
dc.date.available: 2026-03-10T16:20:53Z
dc.date.copyright: 2026-02-05
dc.date.issued: 2026-03-10
dc.date.updated: 2026-03-10T16:20:52Z
dc.degree.discipline: Computer Science
dc.degree.level: Master's
dc.degree.name: MSc - Master of Science
dc.description.abstract: Serverless computing offers elasticity and fine-grained billing for machine learning inference, but efficiently supporting large models under diverse latency service-level objectives (SLOs) remains challenging. In particular, existing approaches face a wide cost–performance gap between CPU and GPU execution, while batching and resource selection become increasingly complex under heterogeneous workloads and multiple SLOs. This thesis presents PAMBA, a partition-aware and multi-SLO batching system for serverless inference on heterogeneous clouds. PAMBA combines multi-SLO batching with analytical latency and cost models for CPU, GPU, and partitioned execution, enabling consistent provisioning decisions across different execution modes. To bridge the CPU–GPU gap, the system employs a customized partitioning strategy derived from latency-optimal partitioning, adapted to satisfy serverless resource constraints and jointly consider latency feasibility and per-request cost. This adaptation allows partitioned execution to emerge as an effective intermediate regime between monolithic CPU and GPU deployments. By jointly optimizing execution mode selection, batching, and resource allocation, PAMBA enables flexible inference deployment across a wide range of SLOs and arrival rates, including scenarios where GPU resources are unavailable or inefficiently utilized. Experimental results on convolutional neural networks demonstrate that PAMBA identifies distinct execution frontiers and reduces inference cost compared to existing serverless batching techniques, while maintaining SLO feasibility across heterogeneous workloads.
dc.identifier.uri: https://hdl.handle.net/10315/43654
dc.language: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Computer science
dc.subject.keywords: Serverless computing
dc.subject.keywords: ML inference
dc.subject.keywords: Multi-SLO batching
dc.subject.keywords: Model partitioning
dc.subject.keywords: Optimization
dc.subject.keywords: Cloud computing
dc.title: PAMBA: Partition-aware and Multi-SLO Batching for Serverless Inference on Heterogeneous Clouds
dc.type: Electronic Thesis or Dissertation

Files

Original bundle

Name: Abedini_Alireza_2026_MSc.pdf
Size: 2.17 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.87 KB
Format: Plain Text

Name: YorkU_ETDlicense.txt
Size: 3.39 KB
Format: Plain Text
