Kubernetes supports efficient utilization of resources by enabling applications to request the precise amounts of resources they need. Unlike CPU requests, however, GPU requests in Kubernetes cannot be fractional: the GPU count requested in a pod manifest must be an integer. This means a GPU is fully allocated to a single container even if that container needs only a fraction of the GPU for its workload. Without support for fractional GPUs, GPU resources are invariably over-provisioned, leading to waste. This is especially true for inference workloads that process a handful of data samples in real time. To address this limitation, we have developed user-friendly solutions that allow a single GPU to be shared by multiple containers, improving GPU utilization and saving cost. In this talk, we will demo our solutions and share performance results.
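For context, the integer-only restriction looks like this in practice. The sketch below shows a pod requesting one GPU through the `nvidia.com/gpu` extended resource exposed by the NVIDIA device plugin (the pod and image names are placeholders); Kubernetes rejects fractional values such as `0.5` for this resource:

```yaml
# Illustrative pod manifest; name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: inference-demo
spec:
  containers:
  - name: inference
    image: example.com/inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # must be a whole number; 0.5 would be rejected
```

Even if the container uses only a fraction of the device, the whole GPU is reserved for it, which is the waste the talk's sharing solutions aim to eliminate.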