Voice activity detection

From Wikipedia, the free encyclopedia

Voice activity detection (VAD), or a voice activity detector, is an algorithm used in speech processing to detect the presence or absence of human speech in audio samples. The main uses of VAD are in speech coding and speech recognition. A VAD may indicate not only the presence or absence of speech, but also whether the speech is voiced or unvoiced, sustained or early, and so on.

During a voice over IP (VoIP) conversation there are silent periods in which nobody is talking. With VAD, the packets that would otherwise carry silence can be suppressed, and the silent periods can be used to transmit traffic other than voice.

Voice Activity Detection (VAD)

The process of separating conversational speech from silence is called voice activity detection (VAD). The primary function of a voice activity detector is to indicate the presence of speech in order to facilitate speech processing, and possibly to provide delimiters for the beginning and end of a speech segment. It was first investigated for use in Time Assigned Speech Interpolation (TASI) systems. VAD is an important enabling technology for a variety of speech-based applications. For these purposes, various types of VAD algorithms have been proposed that trade off delay, sensitivity, accuracy and computational cost.

VAD Applications

  • A VAD is an integral part of many speech communication systems, such as audio conferencing, echo cancellation, speech recognition, speech encoding, and hands-free telephony.
  • In multimedia applications, a VAD makes it possible to run voice and data applications simultaneously.
  • Similarly, in Universal Mobile Telecommunications Systems (UMTS), it controls and reduces the average bit rate and enhances the overall coding quality of speech.
  • In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous Transmission (DTX) mode, a VAD is essential for enhancing system capacity by reducing co-channel interference and power consumption in portable digital devices.

For a wide range of applications such as digital mobile radio, Digital Simultaneous Voice and Data (DSVD) or speech storage, it is desirable to provide a discontinuous transmission of speech-coding parameters. The advantages can be a lower average power consumption in mobile handsets, a higher average bit rate for simultaneous services like data transmission or a higher capacity on storage chips. However, the improvement depends mainly on the percentage of pauses during speech and the reliability of the VAD used to detect these intervals. On the one hand, it is advantageous to have a low percentage of speech activity but, on the other hand, clipping of active speech should be avoided to preserve the quality. This is the crucial problem for a VAD algorithm under heavy noise conditions.
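
The dependence of the saving on the percentage of pauses can be illustrated with a small calculation. The sketch below assumes a simple two-rate model in which active frames are sent at the full codec rate and inactive frames at a much lower rate; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def average_bit_rate(active_rate_bps, silence_rate_bps, speech_activity):
    """Average bit rate of a two-rate discontinuous transmission model.

    speech_activity is the fraction of frames the VAD marks as speech (0..1).
    """
    if not 0.0 <= speech_activity <= 1.0:
        raise ValueError("speech_activity must be between 0 and 1")
    return (speech_activity * active_rate_bps
            + (1.0 - speech_activity) * silence_rate_bps)


# Hypothetical figures: 12200 bit/s while speaking, 1000 bit/s during pauses,
# and a VAD that marks 40% of the frames as speech.
full_rate = 12200
avg = average_bit_rate(full_rate, 1000, 0.4)
saving = 1.0 - avg / full_rate
print(f"average rate: {avg:.0f} bit/s, saving: {saving:.1%}")
```

Under this model, a VAD that wrongly marks noise as speech raises `speech_activity` and erodes the saving, while a VAD that marks speech as noise lowers the rate at the cost of clipping.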

Robust VAD

Robust VAD algorithms have been proposed to solve some of the problems that ordinary VAD algorithms cannot.

Voice activity detection is an outstanding problem for speech transmission, enhancement and recognition. The variety and the varying nature of speech and background noise make it especially challenging. Earlier algorithms are based on the Itakura LPC distance measure, energy levels, timing, pitch, zero-crossing rates, cepstral features, adaptive noise modeling of voice signals, and the periodicity measure. Unfortunately, these algorithms perform poorly at low signal-to-noise ratios (SNRs), especially when the noise is non-stationary. Consistent accuracy cannot be achieved because most algorithms rely on a threshold level for comparison, and this threshold is often assumed to be fixed or is calculated in the silence (voice-inactive) intervals. During the last decade numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD decision on speech processing systems.

Technical Overview of VAD

The basic function of a VAD algorithm is to extract measured features or quantities from the input signal and to compare these values with thresholds, which are usually derived from the characteristics of the noise and speech signals. A voice-active decision is then made if the measured values exceed the thresholds. VAD in non-stationary noise requires a time-varying threshold value. This value is usually calculated in the voice-inactive segments.
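
As a minimal sketch of this idea, the code below uses short-term frame energy as the only feature and a threshold proportional to a running noise estimate that is updated only in voice-inactive frames. It is not the method of any particular standard; the function names, the `margin` and `alpha` parameters, and the assumption that the first frame contains only noise are all illustrative choices.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(frames, margin=3.0, alpha=0.9, floor=1e-10):
    """Return one boolean per frame, True where speech is declared.

    margin: how many times the noise energy a frame must exceed (assumed).
    alpha:  smoothing factor of the noise estimate (assumed).
    floor:  lower bound on the noise estimate to avoid a zero threshold.
    """
    flags = []
    noise = None
    for frame in frames:
        energy = frame_energy(frame)
        if noise is None:
            noise = energy  # assumption: the first frame is noise only
        threshold = margin * max(noise, floor)
        active = energy > threshold
        if not active:
            # Time-varying threshold: track the noise only when no speech
            # is detected, so that speech does not inflate the estimate.
            noise = alpha * noise + (1.0 - alpha) * energy
        flags.append(active)
    return flags


# Synthetic frames: low-level "noise" followed by a louder "speech" burst.
quiet = [0.01] * 160
loud = [0.5] * 160
print(energy_vad([quiet, quiet, quiet, loud, loud, quiet, quiet]))
```

A detector this simple fails when speech is buried below the noise level, which is why practical algorithms combine several features.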

The VAD is more critical in non-stationary noise environments, because the constantly varying noise statistics have to be updated and a misclassification error strongly affects system performance. A representative set of recently published VAD methods formulates the decision rule on a frame-by-frame basis using instantaneous measures of the divergence distance between speech and noise. The measures used in VAD methods include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.

A VAD can be decomposed into two steps:

  • the computation of metrics
  • the application of a classification rule.
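
The two steps can be kept separate in code, which makes it easy to swap either the metrics or the rule. The sketch below is only an illustration: the metrics (frame energy and zero-crossing rate), the rule, and the threshold values are assumptions picked for this example and are not taken from a specific VAD.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def compute_metrics(frame):
    """Step 1: compute the metrics for one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    return {"energy": energy, "zcr": zero_crossing_rate(frame)}


def classify(metrics, energy_threshold=1e-3, zcr_threshold=0.25):
    """Step 2: apply a classification rule to the metrics.

    Illustrative rule: declare speech when the frame is energetic and its
    zero-crossing rate is low, which targets voiced speech only.
    """
    return metrics["energy"] > energy_threshold and metrics["zcr"] < zcr_threshold


# A 100 Hz tone sampled at 8 kHz stands in for a voiced frame; a low-level
# alternating sequence stands in for high-frequency background noise.
tone = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(160)]
hiss = [0.01, -0.01] * 80
for name, frame in (("tone", tone), ("hiss", hiss)):
    print(name, classify(compute_metrics(frame)))
```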

Whatever the VAD method, a compromise has to be made between having voice detected as noise and having noise detected as voice. A VAD operating in a mobile environment must be able to detect speech in the presence of a wide range of acoustic background noise. In these difficult detection conditions it is vital that a VAD should "fail-safe", indicating "speech detected" when the decision is in doubt, so that no clipping is introduced. The biggest difficulty in detecting speech in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It is impossible to distinguish between speech and noise using simple level detection techniques when parts of the speech utterance are buried below the noise.
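
One common way to bias a detector toward "speech detected" is a hangover, which holds the flag active for a few frames after the raw decision drops. The text above does not prescribe this technique; it is shown here only as an example of a fail-safe bias, and the function name and the `hangover_frames` parameter are hypothetical.

```python
def apply_hangover(raw_flags, hangover_frames=3):
    """Keep the speech flag active for hangover_frames after each raw detection."""
    smoothed = []
    remaining = 0
    for flag in raw_flags:
        if flag:
            remaining = hangover_frames  # restart the hold period
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1               # still inside the hold period
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


raw = [False, True, False, False, True, False, False, False, False, False]
print(apply_hangover(raw, hangover_frames=2))
```

A longer hangover reduces clipping of weak speech endings but increases the amount of noise transmitted as speech, which is the compromise described above.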

Evaluation of VAD Performance

The performance of a VAD can be measured in terms of activity and the degree and severity of clipping. To evaluate the amount of clipping and how often noise is detected as speech, the VAD output is compared with that of an ideal VAD. The performance of a VAD is evaluated on the basis of the following four traditional parameters:

  • FEC (Front End Clipping): clipping introduced in passing from noise to speech activity;
  • MSC (Mid Speech Clipping): clipping due to speech misclassified as noise;
  • OVER: noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise;
  • NDS (Noise Detected as Speech): noise interpreted as speech within a silence period.
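
The four parameters can be counted by comparing the per-frame flags of the VAD under test with those of the ideal VAD. The sketch below is one possible reading of the definitions above, not a standardized procedure: missed frames count as FEC until the detector first fires inside a speech segment and as MSC afterwards, and false detections count as OVER while the flag stays active after a speech segment and as NDS otherwise. The function name is hypothetical, and results are raw frame counts rather than percentages.

```python
def clipping_metrics(reference, detected):
    """Count FEC, MSC, OVER and NDS frames.

    reference: per-frame flags of the ideal VAD (truthy means speech).
    detected:  per-frame flags of the VAD under test.
    """
    if len(reference) != len(detected):
        raise ValueError("both flag sequences must have the same length")
    counts = {"FEC": 0, "MSC": 0, "OVER": 0, "NDS": 0}
    in_onset = False  # inside a speech segment, detector has not fired yet
    in_tail = False   # speech ended but the detector flag is still active
    prev_ref = False
    prev_det = False
    for ref, det in zip(reference, detected):
        if ref:
            if not prev_ref:
                in_onset = True          # a new speech segment starts
            if det:
                in_onset = False
            elif in_onset:
                counts["FEC"] += 1       # clipped at the noise-to-speech transition
            else:
                counts["MSC"] += 1       # clipped in the middle of speech
            in_tail = False
        else:
            if prev_ref:
                in_tail = bool(prev_det)  # flag active on the last speech frame
            if det and in_tail:
                counts["OVER"] += 1      # flag remained active after speech
            elif det:
                counts["NDS"] += 1       # isolated false detection in silence
            if not det:
                in_tail = False
        prev_ref, prev_det = bool(ref), bool(det)
    return counts


ideal = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
tested = [0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
print(clipping_metrics(ideal, tested))
```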

Although the method described above provides useful objective information about the performance of a VAD, it gives only an initial estimate of the subjective effect. For example, the effects of speech signal clipping can at times be hidden by the presence of background noise, depending on the model chosen for the comfort noise synthesis, so some of the clipping measured with objective tests is in reality not audible. It is therefore important to carry out subjective tests on VADs, the main aim of which is to ensure that the perceived clipping is acceptable. This kind of test requires a number of listeners to judge recordings containing the processing results of the VADs being tested. The listeners have to give marks on the following features:

  • Quality;
  • Comprehension difficulty;
  • Audibility of clipping.

These marks, obtained by listening to several speech sequences, are then used to calculate average results for each of the features listed above, thus providing a global estimate of the behavior of the VAD being tested. To conclude, whereas objective methods are very useful for an initial evaluation of the quality of a VAD, subjective methods are more significant. However, because they are more expensive (they require the participation of a number of people for a few days), they are generally used only when a proposal is about to be standardized.
