Speech and Audio Processing: Lecture-3
1. Parametric Coders
Parametric coders model the speech signal using a set of model parameters such as the spectral envelope, pitch and energy contour. The parameters extracted at the encoder are quantized and transmitted to the decoder, which synthesizes speech according to the specified model. The speech production model does not account for the quantization noise, nor does it try to preserve waveform similarity between the synthesized and the original speech signals.
Model parameter estimation may be an open-loop process with no feedback from quantization or speech synthesis. Because parametric coders do not preserve waveform similarity, measuring the signal-to-noise ratio (SNR) is meaningless: the SNR often becomes negative when expressed in dB, as the input and output waveforms may not be phase-aligned.
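The effect of phase misalignment on SNR can be illustrated with a minimal sketch. The test tones, the 8 kHz sampling rate and the helper names `snr_db` and `sine` are assumptions chosen for illustration; they are not taken from the lecture.

```python
import math

def snr_db(reference, test):
    """SNR in dB, treating the sample-wise difference as noise."""
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, test))
    return 10.0 * math.log10(signal_power / noise_power)

def sine(freq_hz, phase_rad, n_samples=800, fs_hz=8000):
    """Unit-amplitude sine wave (hypothetical test signal)."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz + phase_rad)
            for n in range(n_samples)]

original = sine(200.0, 0.0)
small_shift = sine(200.0, 0.1)      # same tone, slightly shifted in phase
half_cycle = sine(200.0, math.pi)   # same tone, shifted by half a cycle

snr_small = snr_db(original, small_shift)
snr_half = snr_db(original, half_cycle)
print(round(snr_small, 1), round(snr_half, 1))
```

The half-cycle-shifted tone would sound identical to the original, yet its SNR is negative, which is why SNR is not a useful quality measure for coders that do not preserve the waveform.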
2. Waveform-approximating Coders
Waveform coders minimize the error between the synthesized and the original speech waveforms. Examples of this type of coder are Pulse Code Modulation (PCM) and Adaptive Differential Pulse Code Modulation (ADPCM). PCM transmits a quantized value for each speech sample.
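Per-sample quantization can be sketched as follows. This is a simplified uniform quantizer under assumed conditions (samples normalized to the range -1 to 1, 8 bits per sample); telephony PCM normally applies logarithmic companding, which is omitted here. The function names are hypothetical.

```python
def pcm_encode(sample, bits=8):
    """Map a sample in [-1, 1) to an integer index (uniform quantizer)."""
    levels = 1 << bits
    step = 2.0 / levels
    index = int((sample + 1.0) / step)
    return max(0, min(levels - 1, index))   # clip out-of-range inputs

def pcm_decode(index, bits=8):
    """Reconstruct the sample at the centre of the quantization interval."""
    step = 2.0 / (1 << bits)
    return -1.0 + (index + 0.5) * step

samples = [-1.0, -0.5, 0.0, 0.25, 0.999]
codes = [pcm_encode(s) for s in samples]
decoded = [pcm_decode(c) for c in codes]
```

For in-range inputs the reconstruction error is at most half a quantizer step, which is 1/256 for the 8-bit setting assumed here.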
ADPCM employs an adaptive pole-zero predictor and quantizes the error signal with an adaptive quantizer step size. The ADPCM predictor coefficients and the quantizer step size are backward adaptive and are updated at the sampling rate. More recent waveform-approximating coders based on time-domain analysis-by-synthesis, such as Code Excited Linear Prediction (CELP), explicitly make use of the vocal tract model and long-term prediction.
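The idea of backward adaptation can be sketched with a deliberately simplified differential coder. This is not a standard ADPCM algorithm: it assumes a fixed first-order predictor in place of the adaptive pole-zero predictor, and the step-size rule, constants and class name are hypothetical. What it does show is that the state is updated only from the transmitted codes, so the decoder can follow the encoder without side information.

```python
import math

class BackwardAdaptiveCodec:
    """Toy differential coder; state is updated only from transmitted codes."""

    def __init__(self, bits=4, step=0.02, leak=0.9):
        self.max_code = (1 << (bits - 1)) - 1   # 7 for 4-bit codes
        self.step = step                        # current quantizer step size
        self.leak = leak                        # fixed predictor coefficient
        self.prev = 0.0                         # last reconstructed sample

    def _predict(self):
        return self.leak * self.prev

    def _update(self, code):
        # Reconstruct with the current step, then adapt the step from the code.
        recon = self._predict() + code * self.step
        if abs(code) > self.max_code // 2:
            self.step = min(self.step * 1.5, 0.5)    # outer levels: grow
        else:
            self.step = max(self.step * 0.9, 1e-4)   # inner levels: shrink
        self.prev = recon
        return recon

    def encode(self, sample):
        error = sample - self._predict()
        code = round(error / self.step)
        code = max(-self.max_code, min(self.max_code, code))
        self._update(code)
        return code

    def decode(self, code):
        return self._update(code)

encoder = BackwardAdaptiveCodec()
decoder = BackwardAdaptiveCodec()
signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = [encoder.encode(s) for s in signal]
decoded = [decoder.decode(c) for c in codes]
```

Because both sides run the same `_update` on the same codes, the decoder's reconstruction matches the encoder's internal reconstruction sample by sample.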
CELP coders buffer the speech signal, perform block-based analysis, and transmit the prediction filter coefficients along with an index for the excitation vector. They also employ perceptual weighting so that the quantization noise spectrum is masked by the signal level.
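One commonly used form of perceptual weighting, which the lecture does not specify, is the filter W(z) = A(z/g1) / A(z/g2), where A(z) is the linear prediction inverse filter. The sketch below assumes that form; the values of `gamma1` and `gamma2`, the example coefficients and the function names are illustrative assumptions.

```python
def bandwidth_expand(lpc, gamma):
    """Coefficients of A(z/gamma), given lpc = [1, a1, a2, ...] for A(z)."""
    return [a * gamma ** k for k, a in enumerate(lpc)]

def weighting_filter(lpc, gamma1=0.9, gamma2=0.6):
    """Numerator and denominator of W(z) = A(z/gamma1) / A(z/gamma2)."""
    return bandwidth_expand(lpc, gamma1), bandwidth_expand(lpc, gamma2)

def iir_filter(num, den, signal):
    """Direct-form filter; den[0] is assumed to be 1."""
    out = []
    for n in range(len(signal)):
        acc = sum(num[k] * signal[n - k] for k in range(len(num)) if n - k >= 0)
        acc -= sum(den[k] * out[n - k] for k in range(1, len(den)) if n - k >= 0)
        out.append(acc)
    return out

lpc = [1.0, -0.9]                       # hypothetical first-order predictor
num, den = weighting_filter(lpc)
impulse = [1.0] + [0.0] * 7
weighted = iir_filter(num, den, impulse)
```

In an analysis-by-synthesis search the coding error would be passed through such a filter before its energy is minimized, which de-emphasizes the error near spectral peaks where the signal can mask it.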
For systems that connect to the Public Switched Telephone Network (PSTN) and associated systems, the quality requirements are strict and must conform to constraints and guidelines imposed by the relevant regulatory bodies, e.g. ITU (previously CCITT). Such systems demand high quality (toll quality) coding. Private commercial networks and military systems may compromise the quality to lower the capacity requirements.
2. Coding Delay
Coding delay may be algorithmic (the buffering of speech for analysis), computational (the time taken to process the stored speech samples) or due to transmission. Only the first two concern the speech coding subsystem, although very often the coding scheme is tailored such that transmission can be initiated even before the algorithm has completed processing all of the information in the analysis frame.
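The two delay components that belong to the coder can be added up as in the sketch below. The frame size, look-ahead and processing time are hypothetical figures, and the look-ahead term is an assumption that is not mentioned in the text above; set it to zero for a coder that buffers only the analysis frame.

```python
def buffering_delay_ms(n_samples, sample_rate_hz):
    """Time needed to collect n_samples at the given sampling rate."""
    return 1000.0 * n_samples / sample_rate_hz

def coding_delay_ms(frame_samples, sample_rate_hz, processing_ms,
                    lookahead_samples=0):
    """Algorithmic (buffering) delay plus computational delay."""
    algorithmic = buffering_delay_ms(frame_samples + lookahead_samples,
                                     sample_rate_hz)
    return algorithmic + processing_ms

# Hypothetical coder: 160-sample frame and 40-sample look-ahead at 8 kHz,
# with 10 ms of processing time per frame.
frame_only = buffering_delay_ms(160, 8000)
total = coding_delay_ms(160, 8000, 10.0, lookahead_samples=40)
print(frame_only, total)
```

Transmission delay is excluded because, as noted above, it lies outside the speech coding subsystem.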
For PSTN applications, low delay is essential if the major problem of echo is to be minimized. So, extra echo cancellers will be required if coders with long delays are introduced.
3. Robustness
For many applications, the speech source coding rate typically occupies only a fraction of the total channel capacity, the rest being used for forward error correction (FEC) and signalling.
For mobile connections, which suffer greatly from both random and burst errors, a coding scheme's built-in tolerance to channel errors is vital for acceptable average overall performance, i.e. communication quality. For other applications employing less severe channels, e.g. fibre-optic links, the problems due to channel errors are reduced significantly and robustness can be traded for higher clean-channel speech quality. This is a major difference between wireless mobile systems and fixed-link systems.
In addition to the channel noise, coders may need to operate in noisy background environments. As background noise can degrade the performance of speech parameter extraction, it is crucial that the coder is designed in such a way that it can maintain good performance at all times.
4. Complexity and Cost
As more sophisticated algorithms are devised, the computational complexity is increased.
One technique for reducing power consumption whilst also improving channel efficiency is digital speech interpolation (DSI). DSI exploits the fact that only around half of a speech conversation is actually active speech; thus, during inactive periods, the channel can be used for other purposes, including limiting the transmitter activity, hence saving power. An important subsystem of DSI is the voice activity detector (VAD), which must operate efficiently and reliably to ensure that real speech is not mistaken for silence and vice versa.
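A very simple VAD can be sketched as a per-frame energy threshold. This is only an illustration under assumed settings (160-sample frames, a fixed threshold of -40 dB relative to full scale, a synthetic tone standing in for speech); practical detectors are considerably more elaborate, and the function names are hypothetical.

```python
import math

def frame_energy_db(frame):
    """Mean power of a frame in dB relative to full scale."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)   # offset avoids log of zero

def simple_vad(samples, frame_len=160, threshold_db=-40.0):
    """Return one True/False activity decision per complete frame."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(frame_energy_db(frame) > threshold_db)
    return decisions

silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
decisions = simple_vad(silence + tone)
activity = sum(decisions) / len(decisions)   # fraction of active frames
print(decisions, activity)
```

In a DSI system, frames flagged as inactive are the ones during which the channel or the transmitter could be released.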