No-Reference Video Quality Assessment Using Local Structural and Quality-Aware Deep Features

Bibliographic Details
Published in: IEEE Transactions on Instrumentation and Measurement, Vol. 72, pp. 1-12
Main Authors: Vishwakarma, Anish Kumar; Bhurchandi, Kishor M.
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
ISSN: 0018-9456; 1557-9662
DOI: 10.1109/TIM.2023.3273654

Summary: Due to the growing demand for high-quality video services in 4G and 5G applications, measuring the quantitative quality of video services is expected to become a vital task. The no-reference video quality assessment (NR-VQA) work published so far regresses computationally complex statistical transforms or convolutional neural network (CNN) features to predict a quality score. In this article, we propose a novel NR-VQA scheme using systematic sampling of spatiotemporal planes (XY, XT, and YT) based on the high standard deviation (σ) of their high-frequency bands to represent distortion. The human visual system (HVS) is highly sensitive to structural information in visual scenes, and distortions disrupt these structural properties. The proposed scheme encodes two-level, 3-D structural video information using novel local spatiotemporal tetra patterns (LSTP) on the highest-σ planes sampled from each block of planes. In addition, we extract quality-aware deep features from the second-highest-σ sampled video frames (XY-spatial) of each block using a fine-tuned CNN model. The extracted LSTP and deep quality-aware features of the two highest-σ frames are average pooled and concatenated with the top 100 σ values of the other frames to form the final video-level features. Finally, the concatenated features are fed to a support vector regression (SVR) model to predict the perceptual quality scores of test videos. The proposed method is evaluated on ten publicly available standard video quality assessment (VQA) databases containing synthetic, authentic, and mixed distortions. Extensive experiments indicate that the proposed model outperforms state-of-the-art VQA models and is consistent with human subjective assessment.
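
To illustrate the σ-based systematic sampling described in the abstract, the Python sketch below selects, for each temporal block, the XY, XT, and YT planes whose high-frequency band has the largest standard deviation and regresses simple pooled features with an SVR. This is not the authors' implementation: the Laplacian high-pass filter, the 30-frame block length, the per-plane σ features, and the placeholder MOS scores are assumptions made only for demonstration; the paper's LSTP descriptor and fine-tuned CNN features are not reproduced here.

    # Hypothetical sketch of sigma-based spatiotemporal plane sampling + SVR.
    import numpy as np
    from scipy import ndimage
    from sklearn.svm import SVR

    def highfreq_sigma(plane):
        """Std. deviation of the high-frequency band of a 2-D plane.
        A Laplacian high-pass filter stands in for whichever band
        decomposition the paper actually uses (assumption)."""
        high = ndimage.laplace(plane.astype(np.float64))
        return high.std()

    def sample_planes(video, block_len=30):
        """For each temporal block, keep the XY, XT, and YT planes whose
        high-frequency band has the largest sigma.
        video: ndarray of shape (T, H, W), grayscale frames."""
        T, H, W = video.shape
        selected = []
        for start in range(0, T - block_len + 1, block_len):
            block = video[start:start + block_len]           # (t, H, W)
            xy = [block[t] for t in range(block.shape[0])]   # spatial planes
            xt = [block[:, y, :] for y in range(H)]          # temporal-horizontal planes
            yt = [block[:, :, x] for x in range(W)]          # temporal-vertical planes
            for planes in (xy, xt, yt):
                sigmas = np.array([highfreq_sigma(p) for p in planes])
                selected.append(planes[int(sigmas.argmax())])
        return selected

    # Toy usage: pooled per-video features regressed onto placeholder scores.
    rng = np.random.default_rng(0)
    videos = [rng.random((60, 64, 64)) for _ in range(8)]
    feats = np.array([[highfreq_sigma(p) for p in sample_planes(v)] for v in videos])
    scores = rng.random(8) * 100                             # placeholder MOS values
    model = SVR(kernel="rbf").fit(feats, scores)
    print(model.predict(feats[:2]))

In the actual method, the per-plane features would be the LSTP codes and CNN quality-aware features rather than raw σ values, but the selection-by-highest-σ and SVR regression steps follow the same pattern.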