No-Reference Video Quality Assessment Using Local Structural and Quality-Aware Deep Features
| Published in | IEEE Transactions on Instrumentation and Measurement, Vol. 72, pp. 1-12 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2023 |
| Subjects | |
| ISSN | 0018-9456, 1557-9662 |
| DOI | 10.1109/TIM.2023.3273654 |
| Summary: | Due to the growing demand for high-quality video services in 4G and 5G applications, quantitatively measuring the quality of video services is expected to become a vital task. The no-reference video quality assessment (NR-VQA) work published so far regresses computationally complex statistical transforms or convolutional neural network (CNN) features to predict a quality score. In this article, we propose a novel NR-VQA scheme that systematically samples spatiotemporal planes (XY, XT, and YT) based on the high standard deviation (σ) of their high-frequency bands to represent distortion. The human visual system (HVS) is highly sensitive to structural information in visual scenes, and distortions disrupt these structural properties. The proposed scheme encodes two-level, 3-D structural video information using novel local spatiotemporal tetra patterns (LSTPs) computed on the highest-σ planes sampled from each block of planes. In addition, we extract quality-aware deep features from the second-highest-σ sampled video frames (XY, spatial) of each block using a fine-tuned CNN model. The extracted LSTP and deep quality-aware features of the two highest-σ frames are average-pooled and concatenated with the top 100 σ values of the other frames to form the final video-level features. Finally, the concatenated features are fed to a support vector regression (SVR) model to predict the perceptual quality scores of test videos. The proposed method is evaluated on ten publicly available standard video quality assessment (VQA) databases containing synthetic, authentic, and mixed distortions. Extensive experiments indicate that the proposed model outperforms state-of-the-art VQA models and is consistent with human subjective assessment. |
|---|---|
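To make the pipeline described in the abstract concrete, the following is a minimal, illustrative sketch, not the authors' code. It assumes grayscale videos stored as NumPy arrays, uses a Gaussian high-pass residual for the high-frequency σ computation, substitutes a plain uniform local binary pattern (LBP) histogram for the paper's LSTP descriptor, omits the fine-tuned CNN branch entirely, and regresses toy scores with scikit-learn's SVR. All function names, block sizes, and filter parameters below are assumptions chosen for illustration.

```python
# Illustrative sketch only (not the paper's implementation): sigma-based
# spatiotemporal plane sampling + a stand-in local descriptor + SVR.
# The LSTP descriptor and the CNN branch from the abstract are NOT reproduced;
# a uniform LBP histogram serves purely as a placeholder structural feature.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import local_binary_pattern
from sklearn.svm import SVR

def highfreq_sigma(plane):
    """Std. dev. of the high-frequency band (plane minus its Gaussian-blurred copy)."""
    return float(np.std(plane - gaussian_filter(plane, sigma=1.5)))

def plane_features(video, block_size=16, top_k=100):
    """video: (T, H, W) grayscale array in [0, 1]. Returns one video-level feature vector."""
    T, H, W = video.shape
    # Collect XY (spatial), XT, and YT planes together with their high-frequency sigma.
    planes = [video[t] for t in range(T)]                # XY planes
    planes += [video[:, y, :] for y in range(0, H, 8)]   # XT planes (subsampled)
    planes += [video[:, :, x] for x in range(0, W, 8)]   # YT planes (subsampled)
    sigmas = np.array([highfreq_sigma(p) for p in planes])

    # Within each block of consecutive planes, keep only the highest-sigma plane
    # (the systematic sampling step described in the abstract) and describe it
    # with an LBP histogram (placeholder for LSTP).
    descs = []
    for start in range(0, len(planes), block_size):
        idx = start + int(np.argmax(sigmas[start:start + block_size]))
        lbp = local_binary_pattern(planes[idx], P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        descs.append(hist)
    pooled = np.mean(descs, axis=0)                      # average pooling over blocks

    # Append the top-k sigma values of the remaining planes (zero-padded if short).
    top = np.sort(sigmas)[::-1][:top_k]
    top = np.pad(top, (0, max(0, top_k - top.size)))
    return np.concatenate([pooled, top])

# Toy usage: random "videos" with synthetic quality scores standing in for MOS labels.
rng = np.random.default_rng(0)
videos = [rng.random((24, 64, 64)) for _ in range(8)]
scores = rng.uniform(1, 5, size=8)
X = np.stack([plane_features(v) for v in videos])
model = SVR(kernel="rbf", C=10.0).fit(X, scores)
print(model.predict(X[:2]))
```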