Batching Model Slices for Resource-Efficient Execution of Transformer Models in Edge AI

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Mobile Data Management, pp. 271 - 280
Main Authors: Mubark, Waleed Hassan; Uddin, Md Yusuf Sarwar
Format: Conference Proceeding
Language: English
Published: IEEE, 02.06.2025
ISSN: 2375-0324
DOI: 10.1109/MDM65600.2025.00060

Summary: Vision Transformer (ViT) models and their variants (e.g., Swin Transformers) have become prevalent in recent years due to their higher accuracy in vision AI applications. However, their efficient execution in edge computing environments (e.g., mobile phones and embedded platforms) remains a challenge due to the heavy computational demands (both GPU cycles and GPU memory) of these large models. To serve these models efficiently at the edge, we introduce a novel approach that combines model slicing and smart batching to distribute workloads between resource-constrained client devices and powerful edge servers. Model slicing breaks a large model into smaller segments, called slices: the client executes a variable number of initial slices (head slices) and a nearby edge server runs the remaining slices (tail slices). Smart batching enables the server to queue requests from multiple clients and batch them together for inference, leading to better GPU resource utilization. We propose two batching strategies at the server: one runs faster but requires more GPU memory, while the other demands less memory at the cost of a slight overhead from internal data movement. Experimental results show that our approach achieves inference time reductions of up to 67% while maintaining high GPU utilization, demonstrating significant improvements in inference speed and the viability of this approach for distributed AI systems.
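
Illustration (not part of the record and not the authors' implementation): the following PyTorch sketch shows the general idea of splitting a transformer backbone into client-side head slices and server-side tail slices, and of batching several clients' intermediate activations for one tail-slice forward pass. The block count, embedding size, split point, and class names (HeadSlice, TailSlice, split_index) are illustrative assumptions, not values from the paper.

    # Minimal sketch of model slicing + server-side batching, assuming a plain
    # stack of transformer encoder blocks as a stand-in for a ViT backbone.
    import torch
    import torch.nn as nn

    EMBED_DIM, NUM_BLOCKS, NUM_TOKENS = 192, 12, 197  # assumed ViT-Tiny-like sizes

    blocks = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=3, batch_first=True)
        for _ in range(NUM_BLOCKS)
    )

    class HeadSlice(nn.Module):
        """Client side: runs the first `split_index` blocks on token embeddings."""
        def __init__(self, blocks, split_index):
            super().__init__()
            self.blocks = nn.ModuleList(blocks[:split_index])
        def forward(self, tokens):
            for blk in self.blocks:
                tokens = blk(tokens)
            return tokens  # intermediate activations sent to the edge server

    class TailSlice(nn.Module):
        """Server side: runs the remaining blocks on a batch of activations."""
        def __init__(self, blocks, split_index):
            super().__init__()
            self.blocks = nn.ModuleList(blocks[split_index:])
        def forward(self, tokens):
            for blk in self.blocks:
                tokens = blk(tokens)
            return tokens

    split_index = 4                     # clients may choose different split points
    head = HeadSlice(blocks, split_index).eval()
    tail = TailSlice(blocks, split_index).eval()

    with torch.no_grad():
        # Each client computes its head-slice output locally.
        client_outputs = [head(torch.randn(1, NUM_TOKENS, EMBED_DIM)) for _ in range(8)]

        # Server-side batching: requests queued at the same slice boundary are
        # concatenated and the tail slices run once for the whole batch.
        batch = torch.cat(client_outputs, dim=0)
        batched_result = tail(batch)

    print(batched_result.shape)         # torch.Size([8, 197, 192])

This sketch only batches requests that stopped at the same slice boundary; how requests with different split points are grouped, and how the two server-side batching strategies trade GPU memory against internal data movement, are specifics of the paper not reproduced here.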