The Precision and Repeatability of Media Quality Comparisons: Measurements and New Statistical Methods

Bibliographic Details
Published in: IEEE Transactions on Broadcasting, Vol. 69, No. 2, pp. 1-18
Main Author: Pinson, Margaret H.
Format: Journal Article
Language: English
Published: New York: IEEE, 01.06.2023
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 0018-9316
EISSN: 1557-9611
DOI: 10.1109/TBC.2023.3236528

More Information
Summary: This paper calculates confidence intervals for 89 datasets that use the 5-level Absolute Category Rating (ACR) method to evaluate the quality of speech, video, images, and video with audio. This data allows us to compute the subjective test confidence interval (ΔSCI) for 5-level ACR tests. We use a confusion matrix to compare conclusions reached by 88 lab-to-lab comparisons, 22 method-to-method comparisons, and 12 comparisons between expert and naïve subjects. We estimate the differences in conclusions reached by ad hoc evaluations, compared to subjective tests. We recommend using the disagree incidence rate to identify lab-to-lab differences (i.e., the likelihood that significantly different stimulus pairs receive opposing rank order from the two labs). Disagree incidence rates above 0.31% are unusual enough to warrant investigation, and disagree incidence rates above 1.0% indicate differences in method, test environment, test implementation, or subject demographics. These incidence rates form the basis for a new statistical method that calculates the confidence interval of a metric (ΔMCI). When ΔMCI is used to make decisions, the equivalence to a video-quality test (EVQT) method determines whether a metric acts similarly to a subjective test. When ΔMCI is not used, the metric is likened to a certain number of people in a video-quality test (PVQT). This information will help users make better decisions when applying quality metrics. The algorithm code is made available for any purpose. Most of the ratings used in this paper come from open datasets.
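Illustrative note: the disagree incidence rate described in the summary can be sketched in a few lines of code. The sketch below is an illustration under stated assumptions, not the paper's released algorithm: it assumes a per-pair two-sample t-test within each lab, a 0.05 significance level, and normalization over all shared stimulus pairs. The function name disagree_incidence and the synthetic data are hypothetical.

# Illustrative sketch only; the paper's released code defines the exact
# procedure. This version assumes a per-pair two-sample t-test, which may
# differ from the authors' statistical method.
from itertools import combinations

import numpy as np
from scipy import stats


def disagree_incidence(scores_lab_a, scores_lab_b, alpha=0.05):
    """Estimate a disagree incidence rate between two labs.

    scores_lab_a, scores_lab_b: dicts mapping stimulus id -> array of
    per-subject ACR ratings (1-5) collected by each lab.
    Returns the fraction of stimulus pairs that both labs rate as
    significantly different but in opposing rank order (an assumed
    operationalization of the definition quoted in the summary).
    """
    stimuli = sorted(set(scores_lab_a) & set(scores_lab_b))
    disagrees = 0
    pairs = 0
    for s1, s2 in combinations(stimuli, 2):
        pairs += 1
        # Two-sample t-test on the raw ratings within each lab.
        t_a, p_a = stats.ttest_ind(scores_lab_a[s1], scores_lab_a[s2])
        t_b, p_b = stats.ttest_ind(scores_lab_b[s1], scores_lab_b[s2])
        # A "disagree": both labs see a significant difference, but the
        # sign of the difference (rank order of the pair) is opposite.
        if p_a < alpha and p_b < alpha and np.sign(t_a) != np.sign(t_b):
            disagrees += 1
    return disagrees / pairs if pairs else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic example: 10 stimuli, 24 subjects per lab, 5-level ACR scale.
    lab_a = {i: np.clip(rng.normal(3 + 0.2 * i, 0.8, 24).round(), 1, 5)
             for i in range(10)}
    lab_b = {i: np.clip(rng.normal(3 + 0.2 * i, 0.8, 24).round(), 1, 5)
             for i in range(10)}
    print(f"disagree incidence rate: {disagree_incidence(lab_a, lab_b):.2%}")

The resulting rate can then be compared against the 0.31% and 1.0% thresholds quoted in the summary to decide whether a lab-to-lab difference warrants investigation.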