Closing the HPC-Cloud Convergence Gap: Multi-Tenant Slingshot RDMA for Kubernetes

Bibliographic Details
Published in Proceedings / IEEE International Conference on Cluster Computing, pp. 1-10
Main Authors Friese, Philipp A., Eleliemy, Ahmed, Haus, Utz-Uwe, Schulz, Martin
Format Conference Proceeding
Language English
Published IEEE 02.09.2025
ISSN 2168-9253
DOI 10.1109/CLUSTER59342.2025.11186471

Summary: Converged HPC-Cloud computing is an emerging computing paradigm that aims to support increasingly complex and multi-tenant scientific workflows. These systems require reconciliation of the isolation requirements of native cloud workloads and the performance demands of HPC applications. In this context, networking hardware is a critical boundary component: it is the conduit for high-throughput, low-latency communication and enables isolation across tenants. HPE Slingshot is a high-speed network interconnect that provides up to 200 Gbps of throughput per port and targets high-performance computing (HPC) systems. The Slingshot host software, including hardware drivers and network middleware libraries, is designed to meet HPC deployments, which predominantly use single-tenant access modes. Hence, the Slingshot stack is not suited for secure use in multi-tenant deployments, such as converged HPC-Cloud deployments. In this paper, we design and implement an extension to the Slingshot stack targeting converged deployments on the basis of Kubernetes. Our integration provides secure, container-granular, and multi-tenant access to Slingshot RDMA networking capabilities at minimal overhead.