Context-aware CLIP for Enhanced Food Recognition

Bibliographic Details
Published in: Advances in artificial intelligence research (Online), Vol. 5, No. 1, pp. 7-13
Main Author: Öztürk Ergün, Övgü
Format: Journal Article
Language: English
Published: 16.06.2025
ISSN: 2757-7422
DOI: 10.54569/aair.1707867


More Information
Summary: Generalization of food image recognition frameworks is difficult due to the wide variety of food categories in cuisines across cultures, and the performance of deep neural network models depends heavily on the training dataset. To overcome this problem, we propose to extract context information from images in order to increase the discrimination capacity of networks. In this work, we utilize the CLIP architecture with ingredient context derived automatically from food images. A list of ingredients is associated with each food category; after a voting process, this list is modeled as text and fed to the CLIP architecture together with the input image. Experimental results on the Food101 dataset show that this approach significantly improves the model's performance, achieving a 2% overall increase in accuracy. The improvement varies across food classes, with gains ranging from 0.5% to as much as 22%. The proposed framework, CLIP fed with ingredient text, achieves 81.80% top-1 overall accuracy over 101 classes, outperforming Yolov8 (81.46%).
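
The summary describes the pipeline only at a high level; the paper's exact voting procedure and prompt format are not given here. Below is a minimal, hypothetical sketch of how ingredient-derived text could be paired with an image in a zero-shot CLIP classifier, using the Hugging Face transformers CLIP API. The ingredient lists, the majority-vote helper, and the prompt template are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ingredient-augmented CLIP classification (not the paper's code).
# Assumes Hugging Face transformers, Pillow, and a local image file; the ingredient
# candidates and prompt template are illustrative placeholders.
from collections import Counter
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative per-category ingredient candidates (e.g., collected from several sources).
INGREDIENT_VOTES = {
    "pizza": [["dough", "tomato", "cheese"], ["cheese", "tomato", "basil"], ["dough", "cheese"]],
    "sushi": [["rice", "fish", "seaweed"], ["rice", "fish"], ["rice", "seaweed", "fish"]],
}

def vote_ingredients(candidate_lists, top_k=3):
    """Keep the top_k ingredients that appear most often across the candidate lists."""
    counts = Counter(item for lst in candidate_lists for item in lst)
    return [name for name, _ in counts.most_common(top_k)]

def build_prompt(category, candidate_lists):
    """Compose the class name and its voted ingredients into a text prompt."""
    ingredients = ", ".join(vote_ingredients(candidate_lists))
    return f"a photo of {category}, a dish containing {ingredients}"

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = list(INGREDIENT_VOTES)
prompts = [build_prompt(c, INGREDIENT_VOTES[c]) for c in classes]

image = Image.open("food.jpg")  # placeholder path to a query image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=-1)
print(classes[probs.argmax().item()], round(probs.max().item(), 3))
```

In this sketch the voted ingredient text simply enriches the class prompt that CLIP scores against the image; how the reported framework combines the ingredient context during training or inference would need to be checked against the full paper.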