Perceptual Audio Coding and Filtering Properties of the Cochlea
This page reviews the principal mechanisms of perceptual audio coding. The underlying psychoacoustic principles are explained, and their relation to the filtering properties of the cochlea and to higher cortical processing stages is pointed out.

Perceptual Audio Coding
Mp3 (MPEG-1 Layer 3, the predecessor of MPEG-2 and MPEG-4 Advanced Audio Coding (AAC)) is probably still the best-known audio format that exploits perceptual coding of audio signals. AAC is a more efficient extension that generally achieves better sound quality and allows for a wider range of bandwidths, but it relies on the same coding principles as Mp3. Both formats are standardized by ISO and IEC, but only the decoder is fully specified; the encoder implementation is left open. This has led to a variety of available encoders that differ in reproduction quality, achievable bit rate, performance and coding efficiency.

In contrast to classical signal compression algorithms, where the goal is to represent information with a minimum number of bits while maintaining signal reproduction quality, perceptual audio coding takes into account knowledge of the human auditory system and reduces the bit rate by removing information that is perceptually irrelevant for most listeners. This lossy compression is achieved by exploiting properties of the human auditory system and statistical redundancies. A commonly used coding bit rate for Mp3 is 128 kbit/s, and efficient encoders typically achieve a data reduction factor of around 10 when compressing CD-quality audio (16 bit PCM, 44.1 kHz, two channels, ≈ 1411 kbit/s). Put differently, around 90 % of the data stored on a CD cannot be perceived by the listener. CD quality is what users typically expect when listening to music (There is a long-running debate about whether CD quality is good enough to reproduce the original analog audio. Among many different expert opinions, these two references might be of interest for further reading.). The requirement for more efficient audio coding arose from network, multimedia system and storage applications, and Mp3 was originally created for the more efficient transmission of audiovisual content.
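The figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are made up for illustration, and the two-channel (stereo) assumption is what yields the ≈ 1411 kbit/s figure.

```python
def pcm_bitrate_kbps(bits_per_sample=16, sample_rate_hz=44100, channels=2):
    """Bit rate of uncompressed PCM audio in kbit/s (stereo assumed by default)."""
    return bits_per_sample * sample_rate_hz * channels / 1000.0


def compression_factor(coded_kbps, source_kbps):
    """How many times smaller the coded stream is than the source."""
    return source_kbps / coded_kbps


def discarded_fraction(coded_kbps, source_kbps):
    """Fraction of the source data that is not kept in the coded stream."""
    return 1.0 - coded_kbps / source_kbps


if __name__ == "__main__":
    cd = pcm_bitrate_kbps()
    print("CD bit rate [kbit/s]:", cd)
    print("Compression factor at 128 kbit/s:", compression_factor(128, cd))
    print("Discarded fraction at 128 kbit/s:", discarded_fraction(128, cd))
```

At 128 kbit/s the factor comes out slightly above 10 and the discarded fraction slightly above 90 %, in line with the rounded values in the text.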

The theoretical limit of perceptual audio coding was investigated by Johnston, which led to the notion of perceptual entropy. Based on measurements, the perceptual entropy was estimated at around 2 bits per sample for CD-quality audio. State-of-the-art encoders confirm such an efficiency for transparent (near) CD-quality audio coding. The quality of a perceptual coding algorithm is typically evaluated by listening tests and, more recently, also in combination with a standardized algorithm for the objective measurement of perceived audio quality called Perceptual Evaluation of Audio Quality (PEAQ).
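To relate the perceptual entropy estimate to a bit rate, the bits per sample can simply be multiplied by the sample rate. The sketch below assumes a 44.1 kHz sample rate and treats the estimate as a per-channel figure; both the function names and that per-channel reading are assumptions made for illustration.

```python
def rate_kbps(bits_per_sample, sample_rate_hz=44100):
    """Bit rate per channel in kbit/s for a given average number of bits per sample."""
    return bits_per_sample * sample_rate_hz / 1000.0


def reduction_vs_pcm(bits_per_sample, pcm_bits=16):
    """Reduction factor relative to PCM with pcm_bits per sample."""
    return pcm_bits / bits_per_sample


if __name__ == "__main__":
    print("Rate at 2 bits/sample [kbit/s per channel]:", rate_kbps(2))
    print("Reduction relative to 16 bit PCM:", reduction_vs_pcm(2))
```

Under these assumptions, about 2 bits per sample corresponds to roughly 88 kbit/s per channel, i.e. a factor of 8 below 16 bit PCM.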

Basic Architecture of a Perceptual Audio Encoder


Most perceptual audio encoders can be described with the basic architecture shown in the figure. The analysis filter bank approximates the temporal and spectral analysis properties of the human auditory system. The input is segmented into frames, which are transformed into a set of parameters that can be quantized and encoded. The quantization and coding stage exploits statistical redundancies and relies on thresholds that are delivered by the perceptual model for bit allocation and quantization noise shaping. The perceptual model describes masking thresholds as a function of frequency for coding. Finally, the encoding stage uses standard lossless coding techniques, such as Huffman coding. For a technical explanation of the algorithms and an example implementation, see the online books of J.O. Smith.

Psychoacoustic Principles Used for Perceptual Audio Coding
The basic idea of perceptual audio coding is to shape the quantization noise such that it is masked by the audio signal itself and is therefore not perceivable by the listener. This is achieved by exploiting psychoacoustic principles including the absolute threshold of hearing, critical band frequency analysis and auditory masking. As the playback level is often unknown at the coding stage, conservative estimates of the absolute hearing threshold are typically used for signal normalization during the coding procedure. Auditory masking describes phenomena in which the perception of one sound is affected by the presence of another sound. Masking effects occur in the frequency domain (simultaneous masking) as well as in the time domain (non-simultaneous masking).
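How a masking threshold translates into a quantizer setting can be illustrated with a uniform quantizer, whose noise power is approximately step²/12 when the step is small compared to the signal range. The sketch below is not taken from any particular encoder: the function names, the test signal (uniform random samples) and the allowed noise power are assumptions chosen for illustration.

```python
import math
import random


def step_for_noise_power(allowed_noise_power):
    """Largest uniform quantizer step whose noise power (step^2 / 12)
    stays at the allowed level, e.g. a masking threshold."""
    return math.sqrt(12.0 * allowed_noise_power)


def quantize(value, step):
    """Uniform mid-tread quantizer."""
    return step * round(value / step)


def measured_noise_power(samples, step):
    """Mean squared quantization error over the given samples."""
    return sum((s - quantize(s, step)) ** 2 for s in samples) / len(samples)


def demo(allowed_noise_power=1e-4, count=20000, seed=0):
    """Quantize random samples and return (step, measured noise power)."""
    rng = random.Random(seed)
    samples = [rng.uniform(-1.0, 1.0) for _ in range(count)]
    step = step_for_noise_power(allowed_noise_power)
    return step, measured_noise_power(samples, step)


if __name__ == "__main__":
    step, noise = demo()
    print("step size:", step, "measured noise power:", noise)
```

A higher masking threshold in a band permits a larger step and therefore fewer bits for that band, which is the lever the perceptual model provides to the bit allocation.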

Simultaneous Masking


For simultaneous masking, the frequency resolution of the cochlea plays a central role. Inside the cochlea a frequency-to-place transformation takes place, creating distinct regions that are tuned to different frequency bands. These distinct frequency regions are called critical bands of hearing, and their width is the critical bandwidth. The critical bandwidth remains roughly constant at ≈ 100 Hz for center frequencies up to 500 Hz and increases to approximately 20 % of the center frequency above 500 Hz. The first 24 critical bands are described by the Bark scale. The presence of a tone excites the basilar membrane, which affects the detection threshold for a second tone inside its critical band (intra-band masking). Neighboring bands are affected as well (inter-band masking). The effect on neighboring bands is described by the spreading function. A measured spreading function for a critical-band noise masker of varying intensity is shown in the figure on the right-hand side. As illustrated in the figure, a masker masks higher frequency bands more effectively than lower frequency bands, which is referred to as the upward spread of masking. The spreading function is thought to be a byproduct of the mechanical filter properties of the cochlea, where the outer hair cells amplify the motion of the basilar membrane in order to increase frequency resolution. The reason for the upward spread of masking is not clearly identified; in addition to mechanical excitation, suppression also plays a role. Further, since the second peak in the figure arises around 2 kHz (the second harmonic of 1 kHz) at higher sound pressure levels, the nonlinear transfer characteristic of the inner and middle ear also plays a role.
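The critical band concept can be made concrete with analytic approximations that are widely used in the audio coding literature. The sketch below assumes the Zwicker and Terhardt approximations for the Bark scale and the critical bandwidth, and Schroeder's level-independent spreading function; these formulas are not given on this page, and the level dependence visible in the measured spreading function is deliberately ignored. The function names are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Approximate critical band rate in Bark (Zwicker/Terhardt approximation)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth in Hz around a center frequency."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (center_hz / 1000.0) ** 2) ** 0.69


def spreading_db(delta_bark):
    """Schroeder's spreading function in dB.

    delta_bark is the maskee position minus the masker position in Bark.
    The slope is shallow towards higher bands and steep towards lower
    bands, which models the upward spread of masking.
    """
    shifted = delta_bark + 0.474
    return 15.81 + 7.5 * shifted - 17.5 * math.sqrt(1.0 + shifted ** 2)


if __name__ == "__main__":
    for freq in (100, 500, 1000, 5000, 10000):
        print(freq, "Hz ->", round(hz_to_bark(freq), 2), "Bark,",
              "bandwidth", round(critical_bandwidth_hz(freq), 1), "Hz")
    for delta in (-2, -1, 0, 1, 2):
        print("spread at", delta, "Bark:", round(spreading_db(delta), 1), "dB")
```

With these approximations the bandwidth is close to 100 Hz at low frequencies and a little under 20 % of the center frequency at several kHz, and a masker attenuates far less towards higher bands than towards lower ones.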

The presence of a strong noise or tone masker thus creates an excitation on the basilar membrane that is strong enough to effectively block the transmission of a weaker signal in its critical band, and through the spread of masking neighboring bands are affected as well. Two types of simultaneous masking have been observed: noise-masking-tone and tone-masking-noise. In noise-masking-tone, the presence of a noise allows one to predict a threshold below which a tone is masked; in tone-masking-noise, the presence of a tone allows one to predict a threshold below which a noise spectrum is masked. Different thresholds have been reported for pure tones and critical-band limited noise. For the perceptual coding of music, these thresholds are interpolated depending on the content of the time-frequency analysis of the perceptual encoder before the spreading function is taken into account. The objective signal-to-noise ratio (SNR) can be very low, e.g. 20 dB, but depends on the audio content, while the subjective SNR is high enough to achieve transparent coding. For comparison, an audio CD has an SNR of 96 dB.
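The interpolation between tone-like and noise-like maskers can be sketched as a weighted average of two threshold offsets. The example below assumes the offsets commonly quoted for Johnston's model (14.5 dB plus the Bark band index for a tonal masker, 5.5 dB for a noise masker) and a tonality value between 0 and 1; these constants and the function names are assumptions for illustration and are not stated on this page. The second function reproduces the 96 dB figure for 16 bit PCM.

```python
import math


def masking_offset_db(tonality, bark_band):
    """Offset in dB by which the masking threshold lies below the masker energy.

    tonality: 1.0 for a purely tone-like masker, 0.0 for a purely noise-like one.
    bark_band: index of the critical band (assumed constants, see lead-in).
    """
    tone_masking_noise_db = 14.5 + bark_band  # tonal masker: large offset
    noise_masking_tone_db = 5.5               # noise masker: small offset
    return (tonality * tone_masking_noise_db
            + (1.0 - tonality) * noise_masking_tone_db)


def masking_threshold_db(masker_level_db, tonality, bark_band):
    """Level in dB below which a signal in the same band is assumed to be masked."""
    return masker_level_db - masking_offset_db(tonality, bark_band)


def pcm_snr_db(bits):
    """Ratio of full scale to one quantization step for linear PCM, in dB."""
    return 20.0 * math.log10(2 ** bits)


if __name__ == "__main__":
    for tonality in (0.0, 0.5, 1.0):
        print("tonality", tonality, "-> threshold",
              masking_threshold_db(70.0, tonality, bark_band=10), "dB")
    print("16 bit PCM SNR:", round(pcm_snr_db(16), 1), "dB")
```

Under these assumptions a noise-like masker tolerates quantization noise only a few dB below its own level, whereas a tonal masker requires a considerably larger margin.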

Non-simultaneous Masking


Abrupt transients (or strong attacks) in audio signals can cause masking effects in the time domain. Perception is affected both before the transient (pre- or backward masking) and after it (post- or forward masking), as illustrated in the figure. The backward masking region lasts on the order of milliseconds, while the forward masking region lasts longer, on the order of tens of milliseconds.

Temporal masking is still not fully understood and remains an active research topic. However, there is evidence that higher cortical processing is involved in this phenomenon. It remains unclear whether the effect is related to the integration of sounds, to the interruption or inhibition of neural processing, and/or to differences in transmission velocities. Forward and backward masking show different characteristics and are therefore thought to arise from different properties of the human auditory system.

Masking and Joint Stereo Coding
An efficient technique common in audio coding is joint stereo coding. As the left and right audio channels of a music signal are typically highly correlated, it is sometimes more efficient to code the sum and difference (L+R, L-R) of the two channels. In the case of Mp3 the potential of sum/difference coding was not fully exploited; an efficient technique would compare the thresholds for left/right and sum/difference coding and dynamically choose the more efficient one. Special care has to be taken when calculating the masking thresholds, because joint channel coding can cause audible artifacts due to binaural listening.
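Sum/difference coding and the per-frame choice between the two representations can be sketched as follows. This is a simplified illustration, not the Mp3 or AAC procedure: the orthonormal scaling by 1/√2, the function names, and the decision rule (comparing the product of channel energies as a crude stand-in for the threshold comparison described above) are assumptions.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def ms_encode(left, right):
    """Convert left/right samples to mid (sum) and side (difference) signals."""
    mid = [(l + r) * SCALE for l, r in zip(left, right)]
    side = [(l - r) * SCALE for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode and recover the left/right samples."""
    left = [(m + s) * SCALE for m, s in zip(mid, side)]
    right = [(m - s) * SCALE for m, s in zip(mid, side)]
    return left, right


def energy(signal):
    return sum(v * v for v in signal)


def prefer_mid_side(left, right):
    """Crude proxy for the coding decision: choose mid/side when it
    concentrates the energy into one channel more than left/right does."""
    mid, side = ms_encode(left, right)
    return energy(mid) * energy(side) < energy(left) * energy(right)


if __name__ == "__main__":
    left = [math.sin(0.1 * t) for t in range(256)]
    correlated_right = [0.9 * v for v in left]
    silent_right = [0.0] * len(left)
    print("correlated channels, use mid/side:", prefer_mid_side(left, correlated_right))
    print("one silent channel, use mid/side:", prefer_mid_side(left, silent_right))
```

For strongly correlated channels almost all energy ends up in the mid signal, so the side signal needs few bits; when one channel is silent, plain left/right coding is the better choice under this proxy.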

Artifacts
For Mp3 and AAC the coding bit rate is chosen rather than a compression factor, because the compression factor is content dependent. A lower bit rate yields a higher compression ratio, and a higher bit rate leads to a lower compression ratio with a lower probability of artifacts. This leads to working regions (or bit rates) where a particular algorithm performs best and improves only slightly for higher bit rates. In contrast to noise and distortion artifacts from playback equipment, which we are all used to when listening to CDs, audible artifacts from perceptual encoders can be annoying. If the bit rate is too low for transparent coding, the resulting noise and distortions can be described as a time-varying signal in which the distortions are not harmonically related and the noise is band-limited; as the bandwidth might change from frame to frame, the signal can sound rough.

Loss of Bandwidth
If the encoder runs out of bits, there is a basic tradeoff between frequency bandwidth and accurate coding of lower frequency content. This can lead to a coded frequency bandwidth that changes from frame to frame, which can sound very unpleasant. Generally this artifact is counteracted by limiting the frequency bandwidth for low bit rates.

Preecho
Preecho is the most difficult error to avoid and is related to the frame size of the perceptual encoder. If a strong attack of an audio signal occurs in the middle of a frame, the quantization noise permitted by the calculated threshold is spread over the whole frame. It can therefore extend beyond the short backward masking region before the attack and become audible. There are various techniques to minimize the occurrence of preechos, such as an analysis filterbank with variable frame size.
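The mechanism behind preecho can be reproduced with a toy transform coder. The sketch below uses a plain DFT over a single frame instead of the filter banks of real encoders, and the frame length, attack position, test signal and quantizer step are arbitrary assumptions; all function names are hypothetical. The frame is silent until the attack, yet after coarse quantization of the spectrum the decoded frame contains noise before the attack.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def quantize_spectrum(spectrum, step):
    """Uniformly quantize real and imaginary parts of every coefficient."""
    return [complex(step * round(c.real / step), step * round(c.imag / step))
            for c in spectrum]


def pre_echo_energy(frame_len=64, attack_at=32, step=50.0):
    """Return the coding error energy (before the attack, after the attack)."""
    frame = [0.0] * frame_len
    for t in range(attack_at, frame_len):
        age = t - attack_at
        # Decaying sinusoid that starts abruptly at the attack position.
        frame[t] = 100.0 * math.exp(-0.1 * age) * math.sin(2 * math.pi * 5 * age / frame_len)
    decoded = idft(quantize_spectrum(dft(frame), step))
    error = [d - f for d, f in zip(decoded, frame)]
    before = sum(e * e for e in error[:attack_at])
    after = sum(e * e for e in error[attack_at:])
    return before, after


if __name__ == "__main__":
    before, after = pre_echo_energy()
    print("error energy before the attack:", before)
    print("error energy after the attack:", after)
```

Because the quantization error of each coefficient is spread over the entire frame by the inverse transform, a shorter frame confines the noise to a shorter interval before the attack, which is why variable frame sizes help.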

Relation to the Filtering Properties of the Cochlea
To sum up, perceptual coding makes extensive use of the properties of the human auditory system. The absolute hearing threshold is related to properties of the cochlea but also to the acoustic and mechanical properties of the middle and outer ear. In simultaneous masking, the intra- and inter-critical-band masking thresholds (the spreading function) arise from the filtering properties of the cochlea. However, the upward spread of masking cannot be explained by the properties of the cochlea alone, and other phenomena such as suppression might play a role. Finally, the phenomenon of temporal masking can only be explained by higher cortical processing in the auditory system. The artifacts that can arise from joint stereo coding, whose detection involves binaural listening, likewise suggest that various stages of the human auditory system are involved.