Sensory Systems/Computer Models/Efficient Coding

Why do we need efficient coding?
As already described, visual signals are processed in the visual cortex in order to interpret the incoming information. Having seen how visual information is processed, the question arises how that information could be coded efficiently.

Input Data Amount
Especially in the visual system, the amount of data is huge: the retina senses about 10^10 bits/sec, of which approximately 3–6 × 10^6 bits/sec are transmitted through each optic nerve by about 1 million axons. Of these, only about 10^4 bits/sec reach layer IV of V1. Since consciousness is estimated to have a capacity of ≤ 100 bits/sec, reducing the amount of data is not only sufficient but necessary.

Processing Speed and Accuracy
In humans, neural cells fire at rates of approximately 0.2 Hz to 10 Hz. The coding of information also relies on the exact timing and frequency of firing. To make matters more difficult, the processing network also has to deal with noise: retinal noise, i.e., "spontaneous fluctuations in the electrical signals of the retina’s photoreceptors", arises in the rods through thermal decomposition of rhodopsin, creating events "that cannot be distinguished from the events which occur when light falls on the rods and a quantum is absorbed", and also arises in the cones, where it has a molecular origin. It is argued that retinal noise limits visual sensitivity far more than noise in the central nervous system, which is induced by random activity at the synapses of nerve cells creating additional action potentials.

Energy Consumption
All neural activity needs energy: the brain consumes about 20 % of the resting metabolism. An increase of one action potential per neuron per second raises oxygen consumption by 145 mL per 100 g of grey matter per hour. The human circulatory system provides about 1.5 l of blood per minute to the brain, supplying it with energy and oxygen. "For an action potential frequency of 4 Hz in active cells, approximately 15% of a set of neurons should be simultaneously active to encode a condition".

Solution
To deal with the circumstances of huge data amount to be processed by a nervous system limited in speed, accuracy and available energy, efficient coding is needed.

In the auditory system, the basic structures on which human (verbal) communication relies are the phonemes, i.e. the distinct basic sound elements of a language that distinguish one word from another. For example, the word "eye" consists of just one phoneme, /ai/, whereas the word "code" consists of the phonemes /k/, /əʊ/, /d/.

Analogously, for the visual system, an efficient code would consist of image structures as basic elements that can be combined in order to represent the sensed environment (i.e. the image). As a model that preserves the basic characteristics of visual receptive fields, Olshausen & Field proposed an optimization algorithm that finds a sparse code while preserving the information of the image.

Technical Demonstration
The principle of information compression can be nicely demonstrated with the k-means method, applied to (2-dimensional) images; an implementation is available in the Python library scikit-learn. The idea, as illustrated in Figure 1, is to compress an image (or data in general), process it, and afterwards transform it back. The processing step becomes much more efficient this way. In contrast to the methods found in biological systems, there also exist lossless compression schemes, e.g., lossless wavelet transforms, that allow an exact back-transformation.

Lossless compression is, however, not required in biological systems. The information loss incurred by the k-means algorithm mentioned above is illustrated in the scikit-learn documentation.
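As a hedged sketch of this lossy compression step, the following snippet quantizes an image's colors with scikit-learn's KMeans; the image here is synthetic random data, and the palette size k = 8 is an illustrative choice:

```python
# Sketch: color quantization with k-means (scikit-learn), illustrating
# lossy compression of an image's color space.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "image": 100x100 pixels, RGB values in [0, 1].
image = rng.random((100, 100, 3))
pixels = image.reshape(-1, 3)           # one row per pixel

k = 8                                   # size of the compressed palette
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Each pixel is replaced by its nearest palette color (the lossy step).
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)

# Instead of 3 floats per pixel we now need only one palette index
# plus the k palette colors -- a large reduction in data volume.
print(quantized.shape)                  # (100, 100, 3)
```

After quantization the image contains at most k distinct colors, so the information loss is directly visible as a coarser palette.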

Introduction

Between the late 1990s and the beginning of the 21st century, Bruno Olshausen and Michael Lewicki studied how natural images and natural sounds, respectively, are encoded by the brain, and tried to create models that replicate this process as accurately as possible. It was found that both types of input signal could be modeled with very similar methods. The goal of efficient coding theory is to convey a maximal amount of information about a stimulus using a set of statistically independent features. Efficient coding of natural images gives rise to a population of localized, oriented, Gabor-wavelet-like filters; gammatone filters are their equivalent in the auditory system. To distinguish shapes in an image, the most important operation is edge detection, which is achieved with Gabor filters. In sound processing, sound onsets or 'acoustic edges' can be encoded by a pool of filters similar to a gammatone filterbank.

Vision

In 1996, Bruno Olshausen and his team were the first to create a learning algorithm that finds sparse linear codes for natural images. Maximizing sparseness yields a group of localized, oriented, bandpass receptive fields, analogous to those found in the primary visual cortex.

They start out assuming that an image $$I(x,y)$$ can be depicted as a linear superposition of basis functions, $$\phi_i (x,y) $$:

$$I(x,y) = \sum_{i} a_i \phi_i (x,y) $$

The coefficients $$a_i$$ depend on which basis functions $$\phi_i (x,y) $$ are chosen, and are different for each image. The objective of efficient coding is to find a family of $$\phi_i (x,y) $$ that spans the image space and yields coefficients $$a_i$$ which are as statistically independent as possible.
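The superposition above can be made concrete with a small numerical sketch. The basis here is an arbitrary complete orthonormal set (not the learned Gabor-like basis of the paper), so the reconstruction is exact:

```python
# Sketch: expressing an image patch as a linear superposition
# I = sum_i a_i * phi_i of basis images (all names are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 8                                   # 8x8 image patch
patch = rng.random((n, n))

# Build a complete orthonormal basis of n*n "basis images" via QR.
Q, _ = np.linalg.qr(rng.random((n * n, n * n)))
basis = Q.T.reshape(n * n, n, n)        # phi_i(x, y), i = 0 .. n*n-1

# Coefficients a_i: projections of the patch onto each basis image.
a = np.array([np.sum(patch * phi) for phi in basis])

# Superposition: with a complete basis the patch is recovered exactly.
reconstruction = np.tensordot(a, basis, axes=1)
print(np.allclose(reconstruction, patch))   # True
```

With an overcomplete, non-orthogonal basis (as in the actual model) the coefficients are no longer simple projections and must be found by optimization, which is what the cost function below does.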

Natural scenes contain many higher-order forms of statistical structure that are non-Gaussian, so principal component analysis, which only removes second-order correlations, is unsuitable for attaining these two objectives. Statistical dependencies among a set of coefficients are present whenever the joint entropy is less than the sum of the individual entropies:

$$H(a_1, a_2,...,a_n) < \sum_{i} H(a_i)$$

Entropy here means the Shannon entropy, i.e. the expected information content of a variable; the joint entropy measures the uncertainty associated with a set of variables. It is assumed that natural images have a 'sparse structure', meaning that an image can be expressed in terms of a small number of features drawn from a larger set. The objective is thus to find a code that lowers the entropy, in which the probability distribution of each coefficient is unimodal and peaked around zero. This can be formulated as an optimization problem:
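A minimal numerical illustration of this inequality, using two perfectly dependent binary variables (an illustrative toy distribution, not image data):

```python
# Sketch: for statistically dependent variables the joint Shannon
# entropy is strictly less than the sum of the marginal entropies.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint distribution of two perfectly correlated binary variables.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
H_joint = entropy(joint.ravel())                  # 1 bit
H_marginals = (entropy(joint.sum(axis=1))
               + entropy(joint.sum(axis=0)))      # 1 + 1 = 2 bits

print(H_joint < H_marginals)                      # True: dependency detected
```

For independent variables the two quantities would be equal; the 1-bit gap here is exactly the redundancy an efficient code should remove.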

$$E = - [\text{preserve information}] - \lambda[\text{sparseness of } a_i]$$

where $$\lambda$$ is a positive weighting coefficient. The first term evaluates the mean squared error between the natural image and the reconstructed image:

$$[\text{preserve information}] = - \sum_{x,y} [ I(x,y) - \sum_{i}a_i\phi_i (x,y) ]^2$$

The second term assigns a higher cost the less sparsely the coefficients of a given image are distributed. It is computed by summing each coefficient's activity passed through a nonlinear function $$S(x)$$:

$$[\text{sparseness of } a_i] = - \sum_{i} S\left ( \frac{a_i}{\sigma} \right )$$

where $$\sigma$$ is a scaling constant. For $$S(x)$$, functions are chosen that, among activity states with equal variance, favor those with the fewest non-zero coefficients (e.g. $$-e^{-x^2}$$, $$\log(1+x^2)$$, $$\left\vert x \right\vert$$).

Learning is achieved by minimizing the total cost $$E$$ over the $$a_i$$; the $$\phi_i$$ then converge by gradient descent on $$E$$ averaged over many images. The algorithm allows the basis functions to be overcomplete in dimension and non-orthogonal, without sacrificing sparseness.
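A simplified sketch of the coefficient-inference step: gradient descent on $$E$$ over the $$a_i$$ for a fixed random basis. All sizes and step sizes are illustrative, and $$S(x) = \log(1+x^2)$$ is taken from the list above; the full algorithm would additionally update the basis itself:

```python
# Sketch: minimizing E = ||I - sum_i a_i phi_i||^2 + lambda * sum_i S(a_i/sigma)
# over the coefficients a_i by gradient descent (simplified version of the
# Olshausen & Field scheme; all parameter values are illustrative).
import numpy as np

rng = np.random.default_rng(2)
n, m = 64, 128                          # flattened 8x8 patch, overcomplete basis
phi = rng.standard_normal((n, m))
phi /= np.linalg.norm(phi, axis=0)      # unit-norm basis vectors
image = rng.standard_normal(n)

lam, sigma = 0.1, 1.0

def S(x):                               # sparseness penalty S(x) = log(1 + x^2)
    return np.log1p(x ** 2)

def cost(a):
    residual = image - phi @ a
    return np.sum(residual ** 2) + lam * np.sum(S(a / sigma))

a = np.zeros(m)
step = 0.01
for _ in range(500):
    residual = image - phi @ a
    # Gradient of the reconstruction error plus gradient of the penalty.
    grad = -2 * phi.T @ residual + lam * (2 * a / sigma**2) / (1 + (a / sigma)**2)
    a -= step * grad

print(cost(a) < cost(np.zeros(m)))      # True: cost reduced from the zero code
```

The log penalty pushes small coefficients toward zero while leaving a few large ones, which is what produces the sparse, unimodal coefficient distributions described above.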

After the learning process, the algorithm was tested on artificial datasets, confirming that it is suited to detecting sparse structure in data. The learned basis functions are well localized, oriented and selective to diverse spatial scales. Mapping the response of each $$a_i$$ to spots at every position showed that the receptive fields resemble the basis functions. Together, all basis functions form a complete image code spanning the joint space of spatial position, orientation and scale in a manner similar to wavelet codes.

To conclude, the results of Olshausen's team show that the two sufficient objectives for the emergence of localized, oriented, bandpass receptive fields are that information be preserved and the representation be sparse.

Audition



Lewicki published his findings after Olshausen, in 2002. He applied the efficient coding approach inspired by the earlier paper to derive efficient codes for different classes of natural sounds: animal vocalizations, environmental sounds, and human speech.

His team used independent component analysis (ICA), which extracts a linear decomposition of signals that minimizes both correlations and higher-order statistical dependencies. The learning algorithm yields a filter for each dataset, which can be interpreted as a time-frequency window; the filter shape is determined by the statistical structure of the ensemble.

When applied to the different sample sounds, the method obtained filters with time-frequency windows similar to those of a wavelet for environmental sounds, where sound is localized in both time and frequency (Fig. 1c). For animal vocalizations, a tiling pattern similar to a Fourier transform is obtained, where sound is localized in frequency but not in time (Fig. 1d). Speech contains a mixture of both, with a 2:1 weighting of environmental to animal sounds (Fig. 1e). This is because speech is composed of harmonic vowels and non-harmonic consonants. These patterns had previously been observed experimentally in animals and humans.
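How ICA separates a linear mixture into statistically independent components can be sketched with scikit-learn's FastICA on a toy two-source mixture; the signals and mixing matrix are illustrative, not Lewicki's sound ensembles:

```python
# Sketch: recovering statistically independent components from a
# linear mixture of signals with FastICA (scikit-learn).
import numpy as np
from sklearn.decomposition import FastICA

# Synthetic "sound": two independent sources mixed linearly.
t = np.linspace(0, 1, 2000)
sources = np.c_[np.sin(40 * t),              # sinusoidal source
                np.sign(np.sin(55 * t))]     # square-wave source
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
mixed = sources @ mixing.T                   # what the "ear" receives

ica = FastICA(n_components=2, random_state=0, whiten="unit-variance")
recovered = ica.fit_transform(mixed)         # independent components

print(recovered.shape)                       # (2000, 2)
```

Applied to segments of natural sound instead of this toy mixture, the rows of the learned unmixing matrix play the role of the filters whose time-frequency windows are discussed above.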

To pin down the core differences between these three types of sounds, Lewicki's team analyzed bandwidth, filter sharpness, and the temporal envelope. Bandwidth increases as a function of center frequency for environmental sounds, whereas it stays constant for animal vocalizations; for speech it increases as well, but less steeply than for environmental sounds. Due to the time/frequency trade-off, the temporal envelope curves behave analogously. Comparing the sharpness with respect to center frequency obtained from physiological measurements and from speech data with that of the combined sound ensembles confirmed the consistency of the model.

It must be noted that several approximations were necessary for this analysis. It did not include variations in sound intensity, although the auditory system obeys certain intensity thresholds according to which frequencies are selected. Moreover, the physiological measurements with which these results are compared were made using isolated pure tones, which in turn limits the scope of the model but does not discredit it. Finally, the filters' symmetry in time does not match the physiologically characterized 'gammatone filters'. The algorithm could be modified to be causal, in which case the filters' temporal envelopes would become asymmetric, similar to gammatone filters.

Conclusion

An analogy surfaces between the two systems. Neurons in the visual cortex encode the location and spatial frequency of visual stimuli; the trade-off between these two variables is similar to that between timing and frequency in auditory coding.

Another interesting aspect of this parallel is why ICA explains the neural response properties of the earlier stages of the auditory system, while in the visual system it explains the response properties of cortical neurons. The neuronal anatomy of the two systems differs: in the visual system a bottleneck occurs at the optic nerve, where information from 100 million photoreceptors is condensed into 1 million optic nerve fibers, and the information is then expanded by a factor of 50 in the cortex. In the auditory system no such bottleneck occurs; information from 3000 cochlear inner hair cells projects directly onto 30,000 auditory nerve fibers. ICA thus corresponds to the point of expansion in each representation.