Sensory Systems/Computer Models/Auditory System Simulation

Working with Sound
Audio signals can be stored in a variety of formats. They can be uncompressed or compressed, and the encoding can be open or proprietary. On Windows systems, the most common format is the WAV-format ( WAV). It contains a header with information about the number of channels, sample rate, bits per sample etc. This header is followed by the data themselves. The usual bitstream encoding is the linear pulse-code modulation (LPCM, ( Pulse-code_modulation) format.

Many programing languages provide commands for reading and writing WAV-files. When working with data in other formats, you have two options:
 * You can either you convert them into WAV-format, and go on from there. A very comprehensive free cross-platform solution to record, convert and stream audio and video is ffmpeg (http://www.ffmpeg.org/).
 * Or you can obtain special programs moduls for reading/writing the desired format.

Reminder of Fourier Transformations (Fourier_transform)
To transform a continuous function, one uses the Fourier Integral:


 * $$F(k)=\int_{-\infty}^{\infty} {f(t)} \cdot e^{-2 \pi ikt} dt$$

where k represents frequency. Note that F(k) is a complex value: its absolute value gives us the amplitude of the function, and its phase defines the phase-shift between cosine and sine components.

The inverse transform is given by


 * $$f(t)=\int_{-\infty}^{\infty} F(k) \cdot e^{2 \pi ikt} dk$$



If the data are sampled with a constant sampling frequency and there are N data points,


 * $$f(\tau)= \frac{1}{N}*\sum_{n=0}^{N-1} F_n e^{2 \pi in \tau /N} $$

The coefficients Fn can be obtained by


 * $$ F_n = \sum_{\tau = 0}^{N-1} f(\tau) \cdot e^{-2 \pi in \tau/N} $$

Since there are a discrete, limited number of data points and with a discrete, limited number of waves, this transform is referred to as Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) is just a special case of the DFT, where the number of points is a power of 2: $$N = 2^n$$.

Note that each $$F_n$$ is a complex number: its magnitude defines to the amplitude of the corresponding frequency component in the signal; and the phase of $$F_n$$ defines the corresponding phase (see illustration). If the signal in the time domain "f(t)" is real valued, as is the case with most measured data, this puts a constraint on the corresponding frequency components: in that case we have


 * $$ F_n = F_{N-n}^* $$

A frequent source of confusion is the question: “Which frequency corresponds to $$F_n$$?” If there are N data points and the sampling period is $$T_s$$, the $$n^{th}$$ frequency is given by


 * $$ f_n = \frac{n}{N \cdot T_s}, 1 \le n \le N (in \; Hz) $$

In other words, the lowest frequency is $$\frac{1}{N \cdot T_s} $$ [in Hz], while the highest independent frequency is $$ \frac{1}{2T_s}$$ due to the Nyquist-Shannon theorem. Note that in MATLAB, the first return value corresponds to the offset of the function, and the second value to n=1!

Power Spectrum of Stationary Signals
Most FFT functions and algorithms return the complex Fourier coefficients $$F_n$$. If we are only interested in the magnitude of the contribution at the corresponding frequency, we can obtain this information by


 * $$ P_n = \frac{1}{N} \cdot F_n \cdot F_n^* = \frac{1}{N} \cdot |F_n|^2 $$

This is the power spectrum of our signal, and tells us how big the contribution of the different frequencies is.

Power Spectrum of Non-stationary Signals
Often one has to deal with signals that are changing their characteristics over time. In that case, one wants to know how the power spectrum changes with time. The simplest way is to take only a short segment of data at a time, and calculate the corresponding power spectrum. This approach is called Short Time Fourier Transform (STFT). However in that case edge effects can significantly distort the signals, since we are assuming that our signal is periodic.



To eliminate edge artifacts, the signals can be filtered, or "windowed" ( Window_function). An examples of such a window is shown in the figure above. While some windows provide better frequency resolution (e.g. the rectangular window), others exhibit fewer artifacts such as spectral leakage (e.g. Hanning window). For a selected section of the signal, the data resulting from windowing are obtained by multiplying the signal with the window (left Figure):

An example can show how cutting a signal, and applying a window to it, can affect the spectral power distribution, is shown in the right figure above. (The corresponding Python code can be found at ) Note that decreasing the width of the sample window increases the width of the corresponding powerspectrum!

Stimulation strength for one time window
To obtain the power spectrum for one selected time window, the first step is to calculate the power spectrum through the Fast Fourier Transform (FFT) of the time signal. The result is the sound intensity in frequency domain, and the corresponding frequencies. The second step is to concentrate those intensities on a few distinct frequencies ("binning"). The result is a sound signal consisting of a few distinct frequencies - the location of the electrodes in the simulated cochlea. Back conversion into the time domain gives the simulated sound signal for that time window.

The following Python function does sound processing on a given signal.

Sound Transduction by Pinna and Outer Ear
The outer ear is divided into two parts: the visible part on the side of the head (the pinna), and the external auditory meatus (outer ear canal) leading to the eardrum, as shown in the figure below. With such a structure, the outer ear contributes the ‘spectral cues’ for people’s sound localization abilities, making people not only have the ability to detect and identify a sound, but also have the ability to localize a sound source.

Pinna Function
The Pinna’s cone shape enables it to gather sound waves and funnel them into the out ear canal. On top of that, its various folds make the pinna a resonant cavity which amplifies certain frequencies. Furthermore, the interference effects resulting from the sound reflection caused by the pinna are directionally dependent and will attenuate other frequencies. Therefore, the pinna could be simulated as a filter function applied to the incoming sound, modulating its amplitude and phase spectra.



The resonance of the pinna cavity can be approximated well by 6 normal modes. Among these normal modes, the first mode, which mainly depends on the concha depth (i.e. the depth of the bowl-shaped part of the pinna nearest the ear canal), is the dominant one.

The cancellation of certain frequencies caused by the pinna reflection is called “pinna notch”. As shown in the right figure, sound transmitted by the pinna goes through two paths, a direct path and a longer reflected path. The different paths have different length, and thereby produce phase differences. When the frequency of incoming sound signal reaches certain criterion, which is that the path difference is half of the sound wavelength, the interference of sounds via direct and reflected paths will be destructive. This phenomenon is called “pinna notch”. Normally the notch frequency could happen in the range from 6k Hz to 16k Hz depending on the pinna shape. It is also seen that the frequency response of pinna is directionally dependent. This makes the pinna contribute to the spatial cues for sound localization.

Ear Canal Function
The outer ear canal is approximately 25 mm long and 8 mm in diameter, with a tortuous path from the entrance of the canal to the eardrum. The outer ear canal can be modeled as a cylinder closed at one end which leads to a resonant frequency around 3k Hz. This way the outer ear canal amplifies sounds in a frequency range important for human speech.

Simulation of Outer Ear
Based on the main functions of the outer ear, it is easy to simulate the sound transduction by the pinna and outer ear canal with a filter, or a filter bank, if we know the characteristics of the filter.

Many researchers are working on the simulation of human auditory system, which includes the simulation of the outer ear. In the next chapter, a Pinna-Related Transfer Function model is first introduced, followed by two MATLAB toolboxes developed by Finnish and British research groups, respectively.

Model of Pinna-Related Transfer Function by Spagnol
This part is entirely from the paper published by S.Spagnol, M.Geronazzo, and F.Avanzini. In order to model the functions of the pinna, Spagnol developed a reconstruction model of the Pinna-Related Transfer Function (PRTF), which is a frequency response characterizing how sound is transduced by the pinna. This model is composed by two distinct filter blocks, accounting for resonance function and reflection function of the pinna respectively, as shown in the figure below.



There are two main resonances in the interesting frequency range of the pinna, which can be represented by two second-order peak filters with fixed bandwidth $$f_b = 5 kHz $$ :
 * $$H_{res} (z)= \frac{V_0 (1-h)(1-z^{-2})}{1+2dhz^{-1}+(2h-1)z^{-2}} $$

where
 * $$h= \frac{1}{1+\tan(\pi\frac{f_B}{f_s})} $$
 * $$ d= -\cos(2\pi \frac{f_C}{f_s} )$$
 * $$V_0=10^{\frac{G}{20}}$$

and $$f_s$$ is the sampling frequency, $$f_C$$ the central frequency, and $$G$$ the notch depth. For the reflection part, three second-order notch filters of the form are designed with the parameters including center frequency $$f_C$$, notch depth $$G$$, and bandwidth $$f_B$$.
 * $$H_{refl}(z)= \frac{1+(1+k)\frac{H_0}{2}+d(1-k)z^{-1}+(-k-(1+k)\frac{H_0}{2})z^{-2}} {1+d(1-k) z^{-1}-kz^{-2}}$$

where $$d$$ is the same as previously defined for the resonance function, and
 * $$V_0=10^{\frac{-G}{20}}$$
 * $$H_0= V_0-1$$
 * $$k= \frac{\tan(\pi\frac{f_B}{f_s})-V_0}{\tan(\pi\frac{f_B}{f_s})+V_0}$$

each accounting for a different spectral notch.

By cascading the three in-series placed notch filters after the parallel two peak filters, an eighth-order filter is designed to model the PRTF. By comparing the synthetic PRTF with the original one, as shown in the figures below, Spagnol concluded that the synthesis model for PRTF was overall effective. This model may have missing notches due to the limitation of cutoff frequency. Approximation errors may also be brought in due to the possible presence of non-modeled interfering resonances.



HUTear MATLAB Toolbox
HUTear is a MATLAB Toolbox for auditory modeling developed by Lab of Acoustics and Audio Signal Processing at Helsinki University of Technology. This open source toolbox could be downloaded from here. The structure of the toolbox is shown in the right figure.

In this model, there is a block for “Outer and Middle Ear” (OME) simulation. This OME model is developed on the basis of Glassberg and Moor. The OME filter is usually a linear filter. Auditory filter is generated with taking the "Equal Loudness Curves at 60 dB"(ELC)/"Minimum Audible Field"(MAF)/"Minimum Audible Pressure at ear canal"(MAP) correction into account. This model accounts for the outer ear simulation. By specifying different parameters with the "OEMtool", you may compare the MAP IIR approximation and MAP data, as shown in the figure below.

MATLAB Model of the Auditory Periphery (MAP)
MAP is developed by researchers in the Hearing Research Lab at University of Essex, England. Being a computer model of physiological basis of human hearing, MAP is an open-source code package for testing, developing the model, which could be downloaded from here. Its model structure is shown in the right figure.



Within the MAP model, there is the “Outer Middle Ear (OME)” sub-model, allowing the user to test and create an OME model. In this OME model, the function of the outer ear is modeled as a resonance function. The resonances are composed by two parallel bandpass filters, respectively, representing concha resonance and outer ear canal resonance. These two filters are specified by the pass frequency range, gain and order. By adding the output of resonance filters to the original sound pressure wave, the output of the outer ear model is obtained.

To test the OME model, run the function named “testOME.m”. A figure plotting the external ear resonances and stapes peak displacement will be displayed. (as shown in the figure below)

Summary
The outer ear, including pinna and outer ear canal, can be simulated as a linear filter, or a filter bank. This reflects its resonance and reflection effect to incoming sound. It is worth noting that since the pinna shape varies from person to person, the model parameters, like the resonant frequencies, depend on the subject.

One aspect not included in the models described above is the Head-Related Transfer Function(HRTF). The HRTF describes how an ear receives a sound from a point sound source in space. It is not introduced here because it goes beyond the effect of the outer ear (pinna and outer ear canal) as it is also influenced by the effects of head and torso. There are plenty of literature and publications for HRTF for the interested reader.(wiki, tutorial 1,2, reading list for spatial audio research including HRTF)

Simulation of the Inner Ear
The shape and organisation of the basilar membrane means that different frequencies resonate particularly strongly at different points along the membrance. This leads to a tonotopic organisation of the sensitivity to frequency ranges along the membrane, which can be modeled as being an array of overlapping band-pass filters known as "auditory filters". The auditory filters are associated with points along the basilar membrane and determine the frequency selectivity of the cochlea, and therefore the listener’s discrimination between different sounds. They are non-linear, level-dependent and the bandwidth decreases from the base to apex of the cochlea as the tuning on the basilar membrane changes from high to low frequency. The bandwidth of the auditory filter is called the critical bandwidth, as first suggested by Fletcher (1940). If a signal and masker are presented simultaneously then only the masker frequencies falling within the critical bandwidth contribute to masking of the signal. The larger the critical bandwidth the lower the signal-to-noise ratio (SNR) and the more the signal is masked.



Another concept associated with the auditory filter is the "equivalent rectangular bandwidth" (ERB). The ERB shows the relationship between the auditory filter, frequency, and the critical bandwidth. An ERB passes the same amount of energy as the auditory filter it corresponds to and shows how it changes with input frequency. At low sound levels, the ERB is approximated by the following equation according to Glasberg and Moore:



ERB = 24.7*(4.37F + 1) \, $$

where the ERB is in Hz and F is the centre frequency in kHz.

It is thought that each ERB is the equivalent of around 0.9mm on the basilar membrane.

Gammatone Filters


One filter type used to model the auditory filters is the "gammatone filter". It provides a simple linear filter for describing the movement of one location of the basilar membrane for a given sound input, which is therefore easy to implement. Linear filters are popular for modeling different aspects of the auditory system. In general, they are IIR-filters (infinite impulse response) incorporating feedforward and feedback, which are defined by


 * $$ \sum\limits_{j = 0}^m {{a_{j + 1}}y(k - j)} = \sum\limits_{i = 0}^n {{b_{i + 1}}x(k - i)} $$

where a1=1. In other words, the coefficients ai and bj uniquely determine this type of filter. The feedback-character of these filters can be made more obvious by re-shuffling the equation


 * $$ y(k) = {b_1}x(k) + {b_2}x(k - 1) + ... + {b_{n + 1}}x(k - n) - \left( {{a_2}y(k - 1) + ... + {a_{m + 1}}y(k - m)} \right) $$

(In contrast, FIR-filters, or finite impulse response filters, only involve feedforward: for them $$a_i=0 $$ for i>1.)



Linear filters cannot account for nonlinear aspects of the auditory system. They are nevertheless used in a variety of models of the auditory system. The gammatone impulse response is given by



g(t) = at^{n-1} e^{-2\pi bt} \cos(2\pi ft + \phi), \, $$

where $$f$$ is the frequency, $$\phi$$ is the phase of the carrier, $$a$$ is the amplitude, $$n$$ is the filter's order, $$b$$ is the filter's bandwidth, and $$t$$ is time.

This is a sinusoid with an amplitude envelope which is a scaled gamma distribution function.

Variations and improvements of the gammatone model of auditory filtering include the gammachirp filter, the all-pole and one-zero gammatone filters, the two-sided gammatone filter, and filter cascade models, and various level-dependent and dynamically nonlinear versions of these.

For computer simulations, efficient implementations of gammatone models are available for Matlab and for Python .

When working with gammatone filters, we can elegantly exploit Parseval's Theorem to determine the energy in a given frequency band:


 * $$ \int_{ - \infty }^\infty {{{\left| {f(t)} \right|}^2}dt = } \int_{ - \infty }^\infty  {{{\left| {F(\omega )} \right|}^2}d\omega } $$