ABX test

An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. The subject is then required to identify X as either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.

ABX tests can easily be performed as double-blind trials, eliminating any possible unconscious influence from the researcher or the test supervisor. Because samples A and B are provided just prior to sample X, the difference does not have to be discerned from assumption based on long-term memory or past experience. Thus, the ABX test answers whether or not, under ideal circumstances, a perceptual difference can be found.

ABX tests are commonly used in evaluations of digital audio data compression methods; sample A is typically an uncompressed sample, and sample B is a compressed version of A. Audible compression artifacts that indicate a shortcoming in the compression algorithm can be identified with subsequent testing. ABX tests can also be used to compare the different degrees of fidelity loss between two different audio formats at a given bitrate.

ABX tests can be used to audition input, processing, and output components as well as cabling: virtually any audio product or prototype design.

History

The history of ABX testing and naming dates back to 1950 in a paper published by two Bell Labs researchers, W. A. Munson and Mark B. Gardner, titled Standardizing Auditory Tests.^[1]

"The purpose of the present paper is to describe a test procedure which has shown promise in this direction and to give descriptions of equipment which have been found helpful in minimizing the variability of the test results. The procedure, which we have called the “ABX” test, is a modification of the method of paired comparisons. An observer is presented with a time sequence of three signals for each judgment he is asked to make. During the first time interval he hears signal A, during the second, signal B, and finally signal X. His task is to indicate whether the sound heard during the X interval was more like that during the A interval or more like that during the B interval. For a threshold test, the A interval is quiet, the B interval is signal, and the X interval is either quiet or signal. "

The test has evolved to other variations such as user control over duration and sequence of testing. One such example was the hardware ABX comparator in 1977, built by the ABX company in Troy, Michigan and documented by one of his founders, David Clark in his Audio Engineering Society Journal Paper, High-Resolution Subjective Testing Using a Double-Blind Comparator^[2]

REFINEMENTS TO THE A/B TEST
The author's first experience with double-blind audibility testing was as a member of the SMWTMS Audio Club in early 1977. A button was provided which would select at random component A or B. Identifying one of these, the X component was greatly hampered by not having the known A and B available for reference.

This was corrected by using three interlocked pushbuttons, A, B, and X. Once an X was selected, it would remain that particular A or B until it was decided to move on to another random selection.
However, another problem quickly became obvious. There was always an audible relay transition time delay when switching from A to B. When switching from A to X, however, the time delay would be missing if X was really A and present if X was really B. This extraneous cue was removed by inserting a fixed length dropout time when any change was made. The dropout time was selected to be 50 ms which produces a slight consistent click while allowing subjectively instant comparison.

The ABX company is now defunct and hardware comparators in general as commercial offerings extinct. Myriad of software tools exist such as Foobar ABX plug-in for performing file comparisons. But hardware equipment testing requires building custom implementations.

Hardware tests

Two QSC ABX Comparators in a traveling rack

ABX test equipment utilizing relays to switch between two different hardware paths can help determine if there are perceptual differences in cables and components. Video, audio and digital transmission paths can be compared. If the switching is microprocessor controlled, double-blind tests are possible.

Loudspeaker level and line level audio comparisons could be performed on an ABX test device offered for sale as the ABX Comparator by QSC Audio Products from 1998 to 2004. Other hardware solutions have been fabricated privately by individuals or organizations for internal testing.

Confidence

If only one ABX trial were performed, random guessing would incur a 50% chance of choosing the correct answer, the same as flipping a coin. In order to make a statement having some degree of confidence, many trials must be performed. By increasing the number of trials, the likelihood of statistically asserting a person's ability to distinguish A and B is enhanced for a given confidence level. A 95% confidence level is commonly considered statistically significant.^[3] The company QSC, in the ABX Comparator user manual, recommended a minimum of ten listening trials in each round of tests.^[4]

Results required for a 95% confidence level:^[5]^[6]

Number of trials	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25
Minimum number correct	9	9	10	10	11	12	12	13	13	14	15	15	16	16	17	18

QSC recommended that no more than 25 trials be performed, as listener fatigue can set in, making the test less sensitive (less likely to reveal one's actual ability to discern the difference between A and B).^[4] However a more sensitive test can be obtained by pooling the results from a number of such tests using separate individuals or tests from the same listener conducted in between rest breaks. For a large number of total trials N, a significant result (one with 95% confidence) can be claimed if the number of correct responses exceeds $N/2+\sqrt{N}$ . Important decisions are normally based on a higher level of confidence, since an erroneous "significant result" would be claimed in one of 20 such tests simply by chance.

Software tests

The foobar2000 and the Amarok audio players support software-based ABX testing, the latter using a third-party script. Lacinato ABX is a cross-platform testing tool for Linux, Windows, and 64-bit Mac. aveX is an open-source software mainly developed for Linux which also provides test-monitoring from a remote computer. ABX patcher is an ABX implementation for Max/MSP. More ABX software can be found at the archived PCABX website.

Potential flaws

ABX is a type of forced choice testing. The listener at all times can vote whether "X" sounds the same as "A" or "B." Both answers are available to him. Such answers could be on merit, i.e. the listener indeed tried to identify whether X sounded closer to A or B. Or just voted randomly without even listening. Simply looking at the outcome of the test, i.e. X out of Y answers correct is not revealing of this problem. If not caught, incorrect tests will dilute the results of others who intently took the test and subjects the outcome to Simpson's paradox, resulting in false summary results.

This problem becomes more acute if the differences are small, or the content is selected that is not very revealing of the differences under test. The user may get frustrated and simply aim to finish the test by voting randomly. In this regard, forced choice tests such as ABX tend to favor negative outcome when differences are small if proper protocols are not used to guard against this problem.

Best practices as for example outlined in ^[7] calls for 1) existence of controls and 2) screening of listeners:

A major consideration is the inclusion of appropriate control conditions. Typically, control conditions include the presentation of unimpaired audio materials, introduced in ways that are unpredictable to the subjects. It is the differences between judgement of these control stimuli and the potentially impaired ones that allows one to conclude that the grades are actual assessments of the impairments.

3.2.2 Post-screening of subjects

Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications. The first class is never justifiable. Whenever a subjective listening test is performed with the test method recommended here, the required information for the second class of post-screening is automatically available. A suggested statistical method for doing this is described in Attachment 1.

The methods are primarily used to eliminate subjects who cannot make the appropriate discriminations. The application of a post-screening method may clarify the tendencies in a test result. However, bearing in mind the variability of subjects’ sensitivities to different artefacts,caution should be exercised.

Other flaws include lack of listener training and familiarization with the test and content selected:

4.1 Familiarization or training phase
Prior to formal grading, subjects must be allowed to become thoroughly familiar with the test facilities, the test environment, the grading process, the grading scales and the methods of their use. Subjects should also become thoroughly familiar with the artefacts under study. For the most sensitive tests they should be exposed to all the material they will be grading later in the formal grading sessions. During familiarization or training, subjects should be preferably together in groups (say, consisting of three subjects), so that they can interact freely and discuss the artefacts they detect with each other.

Other problems might arise from the abx equipment itself, as outlined by the previous Clark reference where the equipment provides a tell, allowing the listener to identify the source. Lack of transparency of the ABX fixture creates similar problems.

Since auditory tests such as ABX rely on short-term memory which only lasts a few seconds, it is critical that the test fixture include mechanisms for the listener to locate short segments that can be compared quickly. Pops and glitches in switching apparatus likewise must be eliminated as otherwise they dominate what is stored in listener memory as opposed to the system under test.

Alternatives

Algorithmic Audio Compression Evaluation

Since ABX testing requires human beings for evaluation of lossy audio codecs, it is time-consuming and costly. Therefore, cheaper approaches have been developed, e.g. PEAQ, which is an implementation of the ODG.

MUSHRA

In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. A 0-100 RATING scale makes it possible to rate very small differences.

Discrimination testing

Alternative general methods are used in discrimination testing, such as paired comparison, duo–trio, and triangle testing. Of these, duo–trio and triangle testing are particularly close to ABX testing. Schematically:

Duo–trio: AXY – one known, two unknown (one equals A, other equals B), test is which unknown is the known: X = A (and Y = B), or Y = A (and X = B).
Triangle: XXY – three unknowns (two are A and one is B or one is A and two are B), test which is the odd one out: Y = 1, Y = 2, or Y = 3.

In this context, ABX testing is also known as "duo–trio" in "balanced reference" mode – both knowns are presented as references, rather than one alone.^[8]

References

↑ Munson, W.A.; Gardner, Mark (June 18, 1950). "Standardizing Auditory Tests". Acoustical Society of America 22: 675. doi:10.1121/1.1917190. Retrieved January 2015.
↑ Clark, David (May 1, 1982). "High-Resolution Subjective Testing Using a Double-Blind Comparator". Audio Engineering Society 30 (5): 330–338. Retrieved January 2015.
↑ David Clark (1982). "High-Resolution Subjective Testing Using a Double-Blind Comparator". AES Journal 30 (5).
↑ 4.0 4.1 QSC ABX Comparator user manual. (1998) p. 10
↑ David Carlstrom. "Probability of Experimental Result Being the Same as Random Guesses". ABX Web Page. Retrieved 2011-12-14. ] at
↑ P-value
↑ "Recommendation ITU-R BS.1116-2". Retrieved January 2015.
↑ Meilgaard, Morten; Gail Vance Civille; B. Thomas Carr (1999). Sensory evaluation techniques (3 ed.). CRC Press. pp. 68–70. ISBN 0-8493-0276-5.