Mel-frequency cepstrum

In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.

Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of an audio clip (a "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound, for example in audio compression.
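The effect of equal spacing on the mel scale can be illustrated with a short sketch. The article does not state a conversion formula, so the code below assumes one widely used variant, m = 2595 log10(1 + f/700); other variants exist, and the function names are illustrative only.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_frequencies(f_min, f_max, n):
    """Return n frequencies in Hz that are equally spaced on the mel scale."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n - 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n)]


# Ten band centres between 0 Hz and 8 kHz (range chosen for illustration).
centers = mel_spaced_frequencies(0.0, 8000.0, 10)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Under this formula the gaps between neighbouring centres grow with frequency, so low frequencies are sampled more densely than high ones, unlike the uniform spacing of the normal cepstrum.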

MFCCs are commonly derived as follows:[1]

  1. Take the Fourier transform of (a windowed excerpt of) a signal.
  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
  3. Take the logs of the powers at each of the mel frequencies.
  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
  5. The MFCCs are the amplitudes of the resulting spectrum.

There can be variations on this process, for example, differences in the shape or spacing of the windows used to map the scale.[2]
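The five steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the Hamming window, the 2595/700 mel formula, the filter count (20), the number of coefficients kept (13), the unnormalized DCT-II and the small log floor are all choices made here rather than taken from the cited sources. A direct DFT is used only to keep the sketch self-contained; a practical implementation would use an FFT routine.

```python
import cmath
import math


def hz_to_mel(f_hz):
    # Assumed mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: Hamming-window the frame and take the squared DFT magnitude."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 2: triangular overlapping windows, equally spaced in mel."""
    m_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(m_max * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values):
    """Step 4: unnormalized DCT-II of a list of values."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    mel_power = [sum(w * p for w, p in zip(row, power)) for row in bank]
    # Step 3: log of each mel-band power (floor avoids log(0); an assumption).
    log_mel = [math.log(e + 1e-10) for e in mel_power]
    # Step 5: the MFCCs are the leading DCT amplitudes.
    return dct2(log_mel)[:n_coeffs]


# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
```

Changing the window shape, the filter spacing or the DCT normalization gives the kinds of variation mentioned above, so coefficients from different implementations are not necessarily directly comparable.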

Applications

MFCCs are often used in speech recognition systems, such as systems that automatically recognize numbers spoken into a telephone.

They are also common in speaker recognition, which is the task of recognizing people from their voices.

They are also increasingly used in music information retrieval applications such as genre classification and audio similarity measurement.

Noise sensitivity

MFCC values are not very robust in the presence of additive noise, so some researchers have proposed modifications to the basic MFCC algorithm to compensate, for example raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the DCT, which reduces the influence of low-energy components.[3]
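The power-raising idea can be sketched as below. This is an illustration of the general principle only, not necessarily the exact formulation in [3]: the function names are hypothetical, the example values are made up, and clamping negative log amplitudes to zero before exponentiation is an assumption added here so that the power is well defined.

```python
import math


def dct2(values):
    """Unnormalized DCT-II of a list of values."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def raise_log_mel(log_mel, power=2.0):
    """Raise each log-mel amplitude to `power`.

    Negative values are clamped to zero first (an assumption of this sketch).
    """
    return [max(v, 0.0) ** power for v in log_mel]


def modified_cepstrum(log_mel, power=2.0):
    """Apply the power step, then the DCT, in place of a plain DCT."""
    return dct2(raise_log_mel(log_mel, power))


# Made-up log-mel amplitudes: two strong bands and two weak ones.
log_mel = [0.5, 4.0, 6.0, 0.2]
raised = raise_log_mel(log_mel, power=2.0)

# Share of the total contributed by the weakest band, before and after.
share_before = min(log_mel) / sum(log_mel)
share_after = min(raised) / sum(raised)
coeffs = modified_cepstrum(log_mel, power=2.0)
```

In this example the weakest band's share of the total shrinks after the power is applied, which is the sense in which low-energy components have less influence on the resulting coefficients.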

References

  1. ^ Min Xu et al. (2004). "HMM-based audio keyword generation", in Kiyoharu Aizawa, Yuichi Nakamura, Shin'ichi Satoh: Advances in Multimedia Information Processing - PCM 2004: 5th Pacific Rim Conference on Multimedia. Springer. 
  2. ^ Fang Zheng, Guoliang Zhang and Zhanjiang Song, Comparison of Different Implementations of MFCC, J. Computer Science & Technology, 16(6): 582-589, Sept. 2001.
  3. ^ V. Tyagi and C. Wellekens, On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition, in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE International Conference on, vol. 1, 2005, pp. 529–532.
  • P. Mermelstein, Distance measures for speech recognition, psychological and instrumental, in Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed., pp. 374–388. Academic, New York, 1976.
  • S.B. Davis, and P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28(4), 1980, pp. 357–366.
  • T. Ganchev, N. Fakotakis, and G. Kokkinakis, Comparative evaluation of various MFCC implementations on the speaker verification task, in 10th International Conference on Speech and Computer (SPECOM 2005), vol. 1, 2005, pp. 191–194.

