Subjective video quality

Subjective video quality is video quality as experienced by humans. It is concerned with how video is perceived by a viewer (also called an "observer" or "subject") and denotes their opinion of a particular video sequence. Measuring subjective video quality is necessary because objective metrics such as PSNR have been shown to correlate poorly with subjective ratings. Subjective ratings may also be used as ground truth to develop new objective metrics.

Subjective video quality tests are psychophysical experiments in which a number of viewers rate a given set of stimuli. These tests are quite expensive in terms of time (preparation and execution) and human resources and must therefore be carefully designed.

In subjective video quality tests, SRCs ("Sources", i.e. original video sequences) are typically processed under various conditions (HRCs, "Hypothetical Reference Circuits") to generate PVSs ("Processed Video Sequences").[1]

Measurement

The main idea of measuring subjective video quality is similar to the Mean Opinion Score (MOS) evaluation for audio. To evaluate the subjective video quality of a video processing system, the following steps are typically taken:

1. choose the source video sequences (SRCs),
2. choose the settings of the system under test and generate the PVSs,
3. choose a test method, i.e. how the sequences are presented and how opinions are collected,
4. invite a panel of viewers,
5. carry out the test in a defined environment,
6. analyze the ratings, e.g. by calculating the MOS.

Many parameters of the viewing conditions may influence the results, such as room illumination, display type, brightness, contrast, resolution, viewing distance, and the age and educational level of viewers. It is therefore advised to report this information along with the obtained ratings.
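
As a hypothetical illustration, such viewing conditions can be recorded in a structured form alongside each session; all field names and example values below are made up:

```python
# A hypothetical structured record of viewing conditions to report alongside
# the ratings; all field names and example values are made up.
from dataclasses import dataclass

@dataclass
class ViewingConditions:
    room_illuminance_lux: float
    display_type: str
    display_resolution: str
    viewing_distance_m: float
    viewer_age: int

session = ViewingConditions(20.0, "OLED", "3840x2160", 1.6, 31)
```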

Source selection

Typically, a system should be tested with a representative range of contents and content characteristics. For example, one may select excerpts from contents of different genres, such as action movies, news shows, and cartoons. The length of the source videos depends on the purpose of the test, but sequences of no less than 10 seconds are typically used.

The amount of motion and spatial detail should also cover a broad range, ensuring that the test contains sequences of varying complexity.
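
In ITU-T P.910, spatial detail and motion are quantified by the spatial information (SI) and temporal information (TI) indicators.[2] The following is a minimal sketch of these measures, assuming grayscale frames given as 2-D NumPy float arrays; the function name is illustrative:

```python
# A minimal sketch of the SI and TI indicators from ITU-T P.910, computed with
# NumPy/SciPy. Assumes `frames` is a sequence of at least two grayscale frames,
# each a 2-D float array; the function name is illustrative.
import numpy as np
from scipy import ndimage

def si_ti(frames):
    """Return (SI, TI): the maximum over time of the spatial std of the
    Sobel-filtered frame, and of the spatial std of frame differences."""
    si_values, ti_values = [], []
    prev = None
    for frame in frames:
        sobel_h = ndimage.sobel(frame, axis=0)  # horizontal gradient
        sobel_v = ndimage.sobel(frame, axis=1)  # vertical gradient
        si_values.append(np.hypot(sobel_h, sobel_v).std())
        if prev is not None:
            ti_values.append((frame - prev).std())
        prev = frame
    return max(si_values), max(ti_values)
```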

Sources should be of pristine quality. There should be no visible coding artifacts or other properties that would lower the quality of the original sequence.

Settings

The design of the HRCs depends on the system under study. Typically, multiple independent variables are introduced at this stage, and they are varied with a number of levels. For example, to test the quality of a video codec, independent variables may be the video encoding software, a target bitrate, and the target resolution of the processed sequence.

It is advised to select settings that result in ratings which cover the full quality range. In other words, assuming an Absolute Category Rating scale, the test should show sequences that viewers would rate from bad to excellent.
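
As a hypothetical sketch of such a factorial design, each combination of the independent variables below forms one HRC, and every SRC processed with every HRC yields one PVS; all names and values are made up:

```python
# A hypothetical factorial test design: each combination of the independent
# variables is one HRC, and every SRC processed with every HRC is one PVS.
# All names and values are made up.
from itertools import product

codecs = ["x264", "x265"]          # video encoding software
bitrates_kbps = [500, 1000, 2000]  # target bitrates
resolutions = ["1280x720", "1920x1080"]

hrcs = list(product(codecs, bitrates_kbps, resolutions))
srcs = ["news", "sports", "cartoon"]
pvs_list = [(src, hrc) for src in srcs for hrc in hrcs]
print(f"{len(hrcs)} HRCs -> {len(pvs_list)} PVSs")  # 12 HRCs -> 36 PVSs
```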

Viewers

Viewers are also called "observers" or "subjects". In order to obtain representative ratings, a certain number of viewers should be invited. This number is not strictly defined: according to ITU-T, any number between 4 and 40 is possible, where 4 is the absolute minimum for statistical reasons, and inviting more than 40 subjects adds no value.[2] It has been claimed that at least 10 subjects are needed to obtain meaningful averaged ratings.[3]
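
As a back-of-the-envelope illustration (not taken from the cited sources), the confidence interval of a mean rating narrows only with the square root of the number of viewers, which is one reason why adding subjects beyond a few dozen yields diminishing returns:

```python
# A back-of-the-envelope illustration, not from the cited sources: the ~95%
# confidence interval of the mean rating narrows with the square root of the
# number of viewers. The rating standard deviation of 0.8 is assumed.
import math

rating_std = 0.8  # assumed std of ratings on a 5-point scale
for n in (4, 10, 24, 40):
    half_width = 1.96 * rating_std / math.sqrt(n)
    print(f"{n:2d} viewers -> MOS +/- {half_width:.2f}")
```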

Viewers should be non-experts in the sense of not being professionals in the field of video coding or related domains. This requirement is introduced to avoid potential subject bias.

Typically, viewers are screened for normal or corrected-to-normal vision.

Test environment

Subjective quality tests can be done in any environment. However, due to possible influence factors from heterogeneous contexts, it is typically advised to perform tests in a neutral environment, such as a dedicated laboratory room. Such a room may be sound-proofed, with walls painted in neutral grey and properly calibrated light sources. Several recommendations specify these conditions.

Crowdsourcing has recently been used for subjective video quality evaluation, and more generally, in the context of Quality of Experience.[4] Here, viewers give ratings using their own computer, at home, rather than taking part in a subjective quality test in laboratory rooms.

Analysis of results

Opinions of viewers are typically averaged into the Mean Opinion Score (MOS). To this end, the labels of categorical scales may be translated into numbers. MOS values should always be reported with their statistical confidence intervals so that the general agreement between observers can be evaluated.
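
As a minimal sketch, assuming the 5-point ACR scale of ITU-T P.910 and made-up ratings, the MOS and an approximate 95% confidence interval can be computed as follows:

```python
# A minimal sketch of MOS computation with an approximate 95% confidence
# interval. The label-to-number mapping is the 5-point ACR scale of ITU-T
# P.910; the ratings themselves are made up. A t-distribution would be more
# appropriate for small samples; the normal approximation keeps this short.
import math
import statistics

ACR = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}
labels = ["good", "fair", "good", "excellent", "fair", "good"]
ratings = [ACR[label] for label in labels]

mos = statistics.mean(ratings)
ci = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```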

Often, additional measures are taken before evaluating the results. Subject screening is a process in which viewers whose ratings are considered invalid or unreliable are rejected from further analysis. The reliability can be determined by various procedures, some of which are outlined in ITU-R and ITU-T recommendations.[2][5]
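
The sketch below shows an illustrative screening rule, not one of the standardized procedures: it rejects viewers whose ratings correlate poorly with the mean rating per PVS over all viewers; the threshold of 0.75 is arbitrary.

```python
# An illustrative screening rule, not the standardized BT.500 procedure:
# reject viewers whose ratings correlate poorly with the mean rating per PVS.
# The 0.75 threshold is arbitrary.
import numpy as np

def screen_subjects(ratings, threshold=0.75):
    """ratings: array of shape (num_viewers, num_pvs).
    Returns the indices of viewers to keep."""
    mean_per_pvs = ratings.mean(axis=0)
    kept = []
    for i, viewer_ratings in enumerate(ratings):
        r = np.corrcoef(viewer_ratings, mean_per_pvs)[0, 1]
        if r >= threshold:
            kept.append(i)
    return kept
```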

Standardized testing methods

There are many ways to select proper sequences, system settings, and test methodologies, and a few of them have been standardized. They are thoroughly described in several ITU-R and ITU-T recommendations, among them ITU-R BT.500[5] and ITU-T P.910.[2] While the two overlap in certain aspects, the BT.500 recommendation has its roots in broadcasting, whereas P.910 focuses on multimedia content.

A standardized testing method usually describes the following aspects:

- how the sequences are presented to the viewers (e.g. single- or double-stimulus),
- the scale on which opinions are collected,
- the viewing conditions, such as display setup and viewing distance,
- how the ratings are analyzed.

Another recommendation, ITU-T P.913,[6] gives researchers more freedom to conduct subjective quality tests in environments different from a typical testing laboratory, while still requiring them to report all details necessary to make such tests reproducible.

Examples

Single-Stimulus

In single-stimulus methods, each sequence is rated on its own, without an explicit reference. Examples include:

- ACR (Absolute Category Rating), in which viewers rate each sequence on a categorical scale, e.g. from "bad" to "excellent" (ITU-T P.910)[2]
- ACR-HR (Absolute Category Rating with Hidden Reference), a variant of ACR in which the unimpaired source is included among the test sequences without being identified as such (ITU-T P.910)[2]
- SSCQE (Single Stimulus Continuous Quality Evaluation), in which viewers continuously rate a longer sequence while watching it (ITU-R BT.500)[5]

Double-Stimulus or Multiple Stimulus

In double- or multiple-stimulus methods, viewers judge sequences in relation to a reference or to each other. Examples include:

- DSIS (Double Stimulus Impairment Scale), also called DCR (Degradation Category Rating), in which viewers see the reference followed by the processed sequence and rate the impairment of the latter (ITU-R BT.500, ITU-T P.910)[2][5]
- DSCQS (Double Stimulus Continuous Quality Scale), in which viewers rate the quality of both the reference and the processed sequence on a continuous scale (ITU-R BT.500)[5]
- PC (Pair Comparison), in which viewers see pairs of sequences and state which of the two they prefer (ITU-T P.910)[2]

Choice of methodology

Which method to choose largely depends on the purpose of the test and possible constraints in time and other resources. Some methods suffer less from context effects (i.e. the order in which stimuli are presented influencing the results), which are unwanted test biases.[7] ITU-T P.910 notes that methods such as DCR should be used for testing the fidelity of transmission, especially in high-quality systems. ACR and ACR-HR are better suited for qualification tests and, since they give absolute results, for comparisons between systems. The PC method has a high discriminatory power but requires longer test sessions.
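
The following toy calculation illustrates the session-length argument: rating n sequences on an absolute scale takes n judgments, whereas comparing every pair takes n(n-1)/2.

```python
# Why Pair Comparison sessions grow long: rating n sequences absolutely takes
# n judgments, while comparing every pair takes n*(n-1)/2.
for n in (5, 10, 20):
    print(f"{n} sequences: {n} absolute ratings vs. {n * (n - 1) // 2} pair comparisons")
```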

References

  1. ITU-T Tutorial: Objective perceptual assessment of video quality: Full reference television, 2004.
  2. ITU-T Rec. P.910: Subjective video quality assessment methods for multimedia applications, 2008.
  3. Winkler, Stefan. "On the properties of subjective ratings in video quality experiments". Proc. Quality of Multimedia Experience, 2009.
  4. Hossfeld, Tobias (2014-01-15). "Best Practices for QoE Crowdtesting: QoE Assessment With Crowdsourcing". IEEE Transactions on Multimedia.
  5. ITU-R BT.500: Methodology for the subjective assessment of the quality of television pictures, 2012.
  6. ITU-T P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment, 2014.
  7. Pinson, Margaret and Wolf, Stephen. "Comparing Subjective Video Quality Testing Methodologies". SPIE Video Communications and Image Processing Conference, Lugano, Switzerland, July 2003.