Abstract

The increasing computational demand of deep learning workloads has driven the need for efficient GPU sharing. Spatial sharing of a GPU among workloads is an effective approach to increase resource utilization and reduce the monetary and environmental costs of running deep learning workloads. Spatial sharing allows multiple workloads to execute concurrently on a GPU by partitioning GPU resources via various supported hardware and software mechanisms. Common spatial sharing techniques, such as NVIDIA MPS and NVIDIA MIG, achieve performance isolation by partitioning compute or memory resources among individual workloads. However, managing GPU resources across multiple colocated workloads presents significant challenges, particularly performance degradation due to resource contention. Existing approaches to mitigating interference often require extensive profiling of all colocation candidates, making them impractical for deployment. In this RPE, we propose a lightweight, prediction-based approach to effectively colocate workloads on a spatially shared GPU. We use NVIDIA MPS, a commonly used spatial sharing mechanism that partitions GPU Streaming Multiprocessors (SMs) to achieve compute isolation, as our framework. We evaluate our solution on 7 commonly used deep learning training and inference workloads, accurately predicting colocation interference from kernel metrics collected during exclusive (non-shared) execution, with limited training data and minimal training time, eliminating the need for extensive online profiling. Experimental results show that our method outperforms existing rule-based and prediction-based policies by 16% and 10%, respectively, and achieves performance within 10% of an offline-optimal oracle policy. As future work, we plan to extend our solution beyond pairs of colocated workloads.
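For context, the MPS-based SM partitioning described in the abstract can be sketched with the standard NVIDIA MPS workflow below. The workload scripts (`train_a.py`, `infer_b.py`) and the 70/30 SM split are illustrative placeholders, not the thesis's actual workloads or policy output; the per-client `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` limit requires a Volta-or-newer GPU.

```shell
# Start the MPS control daemon on GPU 0 (Volta or newer assumed).
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Each MPS client caps the fraction of SMs it may occupy via
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE; the two (hypothetical)
# workloads then run concurrently on the same GPU with
# compute isolation between them.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=70 python train_a.py &
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 python infer_b.py &
wait

# Shut down the MPS daemon when both workloads finish.
echo quit | nvidia-cuda-mps-control
```

A colocation policy such as the one the abstract proposes would choose the workload pairing and could adjust these percentages based on its interference prediction, rather than fixing them by hand.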

Year

8-21-2024

Document Type

Thesis

Keywords

Cloud computing, Systems for ML, GPU Sharing

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Advisor

Anshul Gandhi
