Batching Model Slices for Resource-Efficient Execution of Transformer Models in Edge AI
| Published in | Proceedings / IEEE International Conference on Mobile Data Management, pp. 271-280 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.06.2025 |
| ISSN | 2375-0324 |
| DOI | 10.1109/MDM65600.2025.00060 |
| Summary: | Vision Transformer (ViT) models and their variants (e.g., Swin Transformers) have become prevalent in recent years due to their higher accuracy in vision AI applications. However, their efficient execution in edge computing environments (e.g., mobile phones and embedded platforms) remains a challenge because of the heavy computational demands (both GPU cycles and GPU memory) of these large models. To serve these models efficiently at the edge, we introduce a novel approach that combines model slicing and smart batching to distribute workloads between resource-constrained client devices and powerful edge servers. Model slicing breaks a large model into smaller segments, called slices: the client executes a variable number of initial slices (head slices), and a nearby edge server runs the remaining slices (tail slices). Smart batching lets the server queue requests from multiple clients and batch them together for inference, leading to better GPU resource utilization. We propose two batching strategies at the server: one runs faster but requires more GPU memory, while the other demands less memory at the cost of a slight overhead from internal data movement. Experimental results show that our approach reduces inference time by up to 67% while maintaining high GPU utilization, demonstrating the viability of this approach for distributed AI systems. |
|---|---|
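The summary above describes splitting a ViT into client-side head slices and server-side tail slices, with the server batching intermediate activations from multiple clients before running the tail. Below is a minimal sketch of that idea, assuming a PyTorch-style encoder; the block count, tensor sizes, fixed split point, and helper names are illustrative assumptions rather than details from the paper, and neither of the paper's two server-side batching strategies is modeled here.

```python
# Minimal, illustrative sketch (not the paper's implementation) of head/tail
# model slicing with server-side batching, assuming a PyTorch-style encoder.
import torch
import torch.nn as nn

def split_model(blocks: nn.Sequential, split_point: int):
    """Split a stack of transformer blocks into head (client) and tail (server) slices."""
    children = list(blocks.children())
    head = nn.Sequential(*children[:split_point])
    tail = nn.Sequential(*children[split_point:])
    return head, tail

# Toy stand-in for a ViT encoder: 12 identical transformer blocks (hypothetical sizes).
blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
    for _ in range(12)
])
head, tail = split_model(blocks, split_point=4)  # fixed split here; the paper varies it per client

# Client side: each device runs its head slices on its own input (one image = 197 tokens here).
client_inputs = [torch.randn(1, 197, 192) for _ in range(8)]  # 8 pending requests
intermediates = [head(x) for x in client_inputs]

# Server side ("smart batching"): queue intermediates from several clients, concatenate
# them into one batch, and run the tail slices in a single pass for better GPU utilization.
batch = torch.cat(intermediates, dim=0)
with torch.no_grad():
    outputs = tail(batch)
print(outputs.shape)  # torch.Size([8, 197, 192])
```

In the system the summary describes, the split point would differ per client, and the server would pick between its two batching strategies depending on available GPU memory versus the cost of internal data movement.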