Mean opinion score
Mean opinion score (MOS) is a test that has been used for decades in telephony networks to obtain the human user's view of the quality of the network. Historically, and as implied by the word Opinion in its name, MOS was a subjective measurement in which listeners would sit in a "quiet room" and score call quality as they perceived it; per ITU-T recommendation P.800, "The talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200-300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum."
Measuring Voice over IP (VoIP) is more objective: the score is instead calculated from the performance of the IP network over which the call is carried. The calculation is defined in the ITU-T PESQ P.862 standard. Like most standards, the implementation is somewhat open to interpretation by the equipment or software manufacturer. Moreover, because phones have improved technologically, a calculated MOS of 3.9 in a VoIP network may actually sound better than a score above 4.0 from the older subjective tests.
In multimedia (audio, voice telephony, or video), especially when codecs are used to reduce the bandwidth requirement (for example, of a digitized voice connection from the standard 64 kbit/s PCM), the MOS provides a numerical indication of the quality of the received media, as perceived by users, after compression and/or transmission. The MOS is expressed as a single number in the range 1 to 5, where 1 is the lowest perceived audio quality and 5 is the highest.
MOS tests for voice are specified by ITU-T recommendation P.800.
The MOS is generated by averaging the results of a set of standard, subjective tests where a number of listeners rate the heard audio quality of test sentences read aloud by both male and female speakers over the communications medium being tested. A listener is required to give each sentence a rating using the following rating scheme:
MOS | Quality | Impairment |
---|---|---|
5 | Excellent | Imperceptible |
4 | Good | Perceptible but not annoying |
3 | Fair | Slightly annoying |
2 | Poor | Annoying |
1 | Bad | Very annoying |
The MOS is the arithmetic mean of all the individual scores, and can range from 1 (worst) to 5 (best).
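As a minimal sketch of this averaging step, the snippet below computes a MOS from a handful of listener ratings. The ratings and the function name `mean_opinion_score` are hypothetical and chosen only for illustration; a real P.800 test involves many listeners and test sentences.

```python
# Hypothetical ratings (1 = Bad ... 5 = Excellent) from eight listeners
# for a single test condition; these are made-up values, not measurements.
ratings = [4, 5, 3, 4, 4, 3, 5, 3]


def mean_opinion_score(scores):
    """Return the arithmetic mean of opinion scores on the 1-5 scale."""
    if not scores:
        raise ValueError("at least one rating is required")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("each rating must be an integer from 1 to 5")
    return sum(scores) / len(scores)


print(mean_opinion_score(ratings))  # 31 / 8 = 3.875
```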
Compressor/decompressor (codec) systems and digital signal processing (DSP) are commonly used in voice communications, and can be configured to conserve bandwidth, but there is a trade-off between voice quality and bandwidth conservation. The best codecs provide the most bandwidth conservation while producing the least degradation of voice quality. Bandwidth can be measured quantitatively, but voice quality requires human interpretation, although estimates of voice quality can be made by automatic test systems.
A similar process can be used to evaluate subjective video quality.
As an example, the following are mean opinion scores for one implementation of different codecs:[1]
Codec | Data rate [kbit/s] | Mean opinion score (MOS) |
---|---|---|
G.711 (ISDN) | 64 | 4.1 |
iLBC | 15.2 | 4.14 |
AMR | 12.2 | 4.14 |
G.729 | 8 | 3.92 |
G.723.1 r63 | 6.3 | 3.9 |
GSM EFR | 12.2 | 3.8 |
G.726 ADPCM | 32 | 3.85 |
G.729a | 8 | 3.7 |
G.723.1 r53 | 5.3 | 3.65 |
G.728 | 16 | 3.61 |
GSM FR | 12.2 | 3.5 |
One consideration when planning a VoIP deployment is the bandwidth usage for a particular codec versus the potential MOS. For example, G.711, with a bit rate of 64 kbit/s, achieves a maximum MOS of 4.1, whereas G.729, with a much lower bit rate of 8 kbit/s, can achieve a MOS of about 3.9. G.729 is "compressed eight times smaller than G.711 while sounding almost as good."[2]
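The trade-off can be made concrete with a little arithmetic. The sketch below takes a few rows from the codec table and reports, for each codec, how many times smaller its data rate is than G.711's and how its MOS differs. The helper name `compare` and the choice of G.711 as the reference are assumptions made for illustration.

```python
# (data rate in kbit/s, MOS) copied from rows of the codec table above.
CODECS = {
    "G.711": (64, 4.1),
    "iLBC": (15.2, 4.14),
    "AMR": (12.2, 4.14),
    "G.729": (8, 3.92),
    "G.726 ADPCM": (32, 3.85),
}


def compare(codec, reference="G.711"):
    """Return (bandwidth ratio, MOS change) of `codec` relative to `reference`."""
    rate, mos = CODECS[codec]
    ref_rate, ref_mos = CODECS[reference]
    return ref_rate / rate, mos - ref_mos


for name in CODECS:
    ratio, mos_change = compare(name)
    print(f"{name:12s} {ratio:5.2f}x smaller than G.711, MOS change {mos_change:+.2f}")
```

For G.729 this gives a ratio of 8, which matches the "eight times smaller" figure quoted above, at a cost of roughly 0.18 MOS points.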
A drawback of obtaining MOS ratings is that the process can be time-consuming and expensive, as it requires hiring experts to make the assessments. When a voice coding system is under development, or a developer has to test and compare several audio systems, it is very important to be able to run a quick check.
Some suitable English-language phrases used for determining a MOS as suggested by ITU-T recommendation P.800 are:
- You will have to be very quiet.
- There was nothing to be seen.
- I want a minute with the inspector.
- Did he need any money?
Analytical formulas exist for estimating the MOS from the packet loss percentage and the packet duration in ms (see the paper referenced in External Links): Predicted MOS = 4.0 - 0.7 ln(%loss) - 0.1 ln(size_ms).
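A minimal sketch of this formula is shown below. The function name `predicted_mos` and the example inputs are hypothetical. Note that the formula as quoted is undefined at 0% loss (the logarithm of zero) and is not bounded to the 1 to 5 scale; the sketch rejects non-positive inputs and otherwise returns the raw value without clamping.

```python
import math


def predicted_mos(loss_percent, packet_ms):
    """Evaluate 4.0 - 0.7 ln(%loss) - 0.1 ln(size_ms) as quoted in the text.

    loss_percent: packet loss in percent (must be > 0)
    packet_ms:    packet duration in milliseconds (must be > 0)
    """
    if loss_percent <= 0 or packet_ms <= 0:
        raise ValueError("both inputs must be positive for ln() to be defined")
    return 4.0 - 0.7 * math.log(loss_percent) - 0.1 * math.log(packet_ms)


# Example inputs (assumed for illustration): 20 ms packets at 1%, 2% and 5% loss.
for loss in (1, 2, 5):
    print(f"{loss}% loss, 20 ms packets: predicted MOS {predicted_mos(loss, 20):.2f}")
```

At 1% loss the loss term vanishes (ln 1 = 0), so the prediction depends only on the packet duration, giving about 3.70 for 20 ms packets.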
See also
- Subjective video quality
- MUSHRA ITU BS.1534 Recommendation
- PSQM Perceptual Speech Quality Measure (ITU-T P.861 - withdrawn and replaced with PESQ ITU-T P.862)
- PESQ Perceptual Evaluation of Speech Quality, a mechanism for automated assessment of the speech quality experienced by the user of a telephone system. It is standardised as ITU-T recommendation P.862 (02/01).
- POLQA Perceptual Objective Listening Quality Assessment, which is replacing PESQ. It is standardised as ITU-T recommendation P.863.
- PEVQ Perceptual Evaluation of Video Quality, a measurement algorithm for the automated assessment of video quality.
- PEAQ Perceptual Evaluation of Audio Quality, a measurement algorithm for the automated assessment of audio quality.
- Absolute Category Rating
- MNRU