Batching Model Slices for Resource-Efficient Execution of Transformer Models in Edge AI
| Published in | Proceedings / IEEE International Conference on Mobile Data Management, pp. 271-280 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.06.2025 |
| ISSN | 2375-0324 |
| DOI | 10.1109/MDM65600.2025.00060 |
| Summary: | Vision Transformer (ViT) models and their variants (e.g., Swin Transformers) have become prevalent in recent years due to their higher accuracy in vision AI applications. However, their efficient execution in edge computing environments (e.g., mobile phones and embedded platforms) remains a challenge because of the heavy computational demands (both GPU cycles and GPU memory) of these large models. To serve these models efficiently at the edge, we introduce a novel approach that combines model slicing and smart batching to distribute workloads between resource-constrained client devices and powerful edge servers. Model slicing breaks a large model into smaller segments, called slices: the client executes a variable number of initial slices (head slices), and a nearby edge server runs the remaining slices (tail slices). Smart batching lets the server queue requests from multiple clients and batch them together for inference, leading to better GPU resource utilization. We propose two batching strategies at the server: one runs faster but requires more GPU memory, while the other demands less memory at the cost of a slight overhead from internal data movement. Experimental results show that our approach reduces inference time by up to 67% while maintaining high GPU utilization, demonstrating the viability of this approach for distributed AI systems. |
|---|---|
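The summary above describes splitting a ViT into client-side head slices and server-side tail slices, with the server batching intermediate activations from multiple clients before running the tail. Below is a minimal sketch of that idea, assuming a PyTorch-style encoder; the block count, tensor sizes, fixed split point, and helper names are illustrative assumptions rather than details from the paper, and neither of the paper's two server-side batching strategies is modeled here.

```python
# Minimal, illustrative sketch (not the paper's implementation) of head/tail
# model slicing with server-side batching, assuming a PyTorch-style encoder.
import torch
import torch.nn as nn

def split_model(blocks: nn.Sequential, split_point: int):
    """Split a stack of transformer blocks into head (client) and tail (server) slices."""
    children = list(blocks.children())
    head = nn.Sequential(*children[:split_point])
    tail = nn.Sequential(*children[split_point:])
    return head, tail

# Toy stand-in for a ViT encoder: 12 identical transformer blocks (hypothetical sizes).
blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
    for _ in range(12)
])
head, tail = split_model(blocks, split_point=4)  # fixed split here; the paper varies it per client

# Client side: each device runs its head slices on its own input (one image = 197 tokens here).
client_inputs = [torch.randn(1, 197, 192) for _ in range(8)]  # 8 pending requests
intermediates = [head(x) for x in client_inputs]

# Server side ("smart batching"): queue intermediates from several clients, concatenate
# them into one batch, and run the tail slices in a single pass for better GPU utilization.
batch = torch.cat(intermediates, dim=0)
with torch.no_grad():
    outputs = tail(batch)
print(outputs.shape)  # torch.Size([8, 197, 192])
```

In the system the summary describes, the split point would differ per client, and the server would pick between its two batching strategies depending on available GPU memory versus the cost of internal data movement.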