Real-time optical flow processing on embedded GPU: an hardware-aware algorithm to implementation strategy

Bibliographic Details
Published in Journal of real-time image processing Vol. 19; no. 2; pp. 317-329
Main Authors Seznec, Mickaël; Gac, Nicolas; Orieux, François; Naik, Alvin Sashala
Format Journal Article
Language English
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.04.2022
Springer Nature B.V
Springer Verlag
ISSN 1861-8200
1861-8219
DOI 10.1007/s11554-021-01187-8

Summary: Determining the optical flow of a video is a compute-intensive task essential for computer vision. To achieve this processing in real time, the whole algorithm deployment chain must be designed with efficiency as the first priority. Development is usually divided into two parts: first, designing an algorithm that meets precision constraints; then, implementing and optimizing its execution on the targeted platform. We argue that unifying those operations enhances performance on the embedded processor. This paper is based on an industrial use case of computer vision. The objective is to determine dense optical flow in real time on an embedded GPU platform: the Nvidia AGX Xavier. The CLG (combined local–global) optical flow method, initially chosen, is analyzed to understand the convergence speed of its underlying optimization problem. The Jacobi solver is selected for implementation because of its parallel nature. The whole multi-level processing is then ported to the GPU, using several specific optimization strategies. In particular, we use the roofline model to analyze the impact of fusing the solver’s iterations. As a result, with a 30 W power budget, our implementation runs at 60 FPS on 640 × 512 images with four-level processing. We hope this example provides feedback on the issues that arise when porting a method to a parallel platform and serves as a basis for further implementations of computer vision algorithms on specialized hardware.
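
Note: the summary gives the Jacobi solver's parallel nature as the reason for choosing it. The following is a minimal sketch of a generic Jacobi iteration in Python. It is not the authors' GPU implementation and does not build the CLG system. The function names and the 3 × 3 system are hypothetical and chosen for illustration only. Each entry of the new iterate reads only the previous iterate, so all entries could be computed independently, which is the property that maps well to a GPU.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep for the linear system a @ x = b.

    Every new entry depends only on the previous iterate x,
    so the entries can be updated independently of each other.
    """
    n = len(b)
    return [
        (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
        for i in range(n)
    ]


def jacobi_solve(a, b, iterations=50):
    """Run a fixed number of Jacobi sweeps starting from the zero vector."""
    x = [0.0] * len(b)
    for _ in range(iterations):
        x = jacobi_step(a, b, x)
    return x


# Hypothetical diagonally dominant system (assumption, not from the paper);
# diagonal dominance guarantees that the Jacobi iteration converges.
A = [
    [4.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 4.0],
]
B = [5.0, 6.0, 5.0]

solution = jacobi_solve(A, B)
print(solution)
```

For this toy system the exact solution is the all-ones vector, and the iterate approaches it as the number of sweeps grows.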