= Multisensory Integration =

Overview
Human beings are able to process sensory information from a variety of sources, across different sensory modalities like sight, touch, smell, and sound. These sensory signals provide us with important information about our environment, and help us make critical decisions in life-or-death situations. Signals stemming entirely from one modality provide us with information about one facet of an object, for example its color or shape. Relying on input from only one sensory modality is not very effective, however, as objects generally provide signals across various modalities, and cannot be fully described by analyzing only one of these signals. For example, a coconut-flavored jelly bean has the same shape as a vanilla-flavored bean, and deforms the same way under touch, but it tastes drastically different. In order to correctly identify the bean, a person would have to both touch and taste it, and combine the resulting stimuli to arrive at a coherent explanation for what the object truly is. Only then can the brain decide how to respond. This task of processing and converging sensory data across more than one modality is known as multisensory integration.

Functional multisensory integration is critical in allowing humans to have a coherent and robust perception of their environment. The human sensory system is constantly bombarded with a cacophony of stimuli from its surroundings, and the brain must correctly determine which sources these stimuli correspond to, and single out those that are critical for controlling behavior.

Multisensory Illusions
The complexity of multisensory integration often results in surprising phenomena. The McGurk effect, for example, describes how the simultaneous processing of two different stimuli can create a convergent result entirely different from either of the constituent parts. In a study by McGurk and MacDonald in 1976, participants who watched a video of a person speaking a particular phoneme, dubbed with the audio of a person speaking a second phoneme, reported that they had in fact heard and seen a third, different phoneme. Visual representation of the phoneme “ga” combined with the auditory representation of the phoneme “ba” resulted in the converged percept of the phoneme “da”. McGurk and MacDonald’s explanation for this was that particular phoneme groups are easily confused depending on the modality in which the stimulus is presented. For example, it is visually hard to differentiate between “ga” and “da”, and audibly difficult to differentiate between “ba” and “da”. The brain therefore perceives “da” as the most likely common explanation for the two conflicting stimuli, and settles on it as the most reliable percept.

A second common audiovisual illusion is known as the double-flash illusion. Participants are shown a combination of one to four flashes and zero to four beeps, and asked to determine the number of flashes that occurred. When more beeps occurred than flashes, participants reported seeing illusory flashes to match their auditory perception.

Theories of information processing
While the exact method for multisensory integration is not yet understood, various theories exist for how the brain chooses to segregate or combine multiple signals into coherent percepts.

Colavita Visual Dominance Effect
Visual dominance refers to the idea that visual stimuli are prioritized for processing above other modalities within the brain. When a visual stimulus is presented simultaneously with an auditory or tactile one, often only the visual stimulus is recognized. A study presented in 1973 by F.B. Colavita exposed subjects to three types of stimuli: purely auditory, purely visual, or concurrent audiovisual. After each test, subjects were asked to identify what kind of stimulus had occurred. Colavita found that while subjects responded correctly to both types of unimodal targets, they often failed to recognize the auditory stimulus when it occurred simultaneously with a visual one. Similar results were found when combinations of tactile and visual stimuli were tested. The study suggested that in multimodal sensing, vision takes some sort of precedence within the brain, even to the point of entirely extinguishing the subjects’ perception of the other stimuli. Hartcher-O’Brien et al. attribute the Colavita effect to an imbalance in access to processing resources between the various modalities. Visual stimuli may have better access to processing and awareness than simultaneous stimuli of a different modality, resulting in this sensory dominance. Furthermore, vision is one of the most accurate and reliable modalities, as few external disturbances can alter the stimulus so that it provides false information to the brain (compare this to an audio signal, which can reverberate off walls or be carried off-course by strong wind, misleading the listener about the true location of the source). The neurological basis of the Colavita effect is not yet entirely understood; results remain inconclusive even on whether the effect occurs at a reflexive or voluntary level.

General visual dominance has been observed in various species, including cows, birds, and humans. However, some studies have demonstrated that the Colavita effect can also be reversed: for example, Sinnett and Ngo showed that under certain conditions, participants respond more strongly to auditory stimuli than to visual ones. This may be because humans and animals rely most strongly on auditory stimuli during especially stressful situations, suggesting that visual dominance is context-dependent.

Modality appropriateness
In 1980, Welch and Warren developed the theory of modality appropriateness, which states that the priority of a modality in multisensory integration depends on that modality’s suitability for a particular situation. The greatest support for this theory is that different sensory modalities are better suited to different sensory tasks. For example, when determining the exact location of a source (a task referred to as spatial processing), visual stimuli dominate all others, even when localizing a sound source. This effect is well illustrated by television: an actor’s voice seems to come from the actor’s mouth instead of the television’s loudspeakers, because our visual system recognizes the moving lips and dominates the sensory processing. Similarly, auditory stimuli dominate in determining the exact timing of an event, or the order of events (temporal processing).

Studies by Alais and Burr showed that modality dominance decreases with sensory uncertainty. When participants were solving spatial processing tasks, they initially relied most heavily on the visual stimuli, as expected from the theories of modality appropriateness and visual dominance. However, as the quality of the visual stimulus was purposefully degraded through blurring and filtering, participants began prioritizing the information from the accompanying auditory stimulus. Thus, the modality appropriateness theory suggests that the brain continuously weighs the reliability of each stimulus and prioritizes them accordingly, so as to produce the most trustworthy combination.
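The reliability weighting described above is often formalized as an inverse-variance weighted average of the single-cue estimates. The sketch below is only an illustration under that assumption (independent Gaussian noise on each cue); the function name `combine_cues` and all numbers are hypothetical and are not taken from the studies mentioned.

```python
def combine_cues(visual_estimate, visual_sd, auditory_estimate, auditory_sd):
    """Reliability-weighted average of two location estimates (in degrees).

    Assumption: each cue is corrupted by independent Gaussian noise, and
    reliability is defined as inverse variance (1 / sd**2).
    """
    visual_reliability = 1.0 / visual_sd ** 2
    auditory_reliability = 1.0 / auditory_sd ** 2
    total = visual_reliability + auditory_reliability

    # The noisier a cue is, the smaller its weight in the combined estimate.
    visual_weight = visual_reliability / total
    combined = (visual_weight * visual_estimate
                + (1.0 - visual_weight) * auditory_estimate)

    # The combined estimate is less noisy than either cue alone.
    combined_sd = (1.0 / total) ** 0.5
    return combined, combined_sd, visual_weight


# Hypothetical conflict: vision places the source at 0 deg, hearing at 10 deg.
sharp = combine_cues(0.0, 1.0, 10.0, 4.0)    # clear image: vision dominates
blurred = combine_cues(0.0, 8.0, 10.0, 4.0)  # blurred image: hearing dominates

print("sharp image:   estimate %.2f deg, visual weight %.2f" % (sharp[0], sharp[2]))
print("blurred image: estimate %.2f deg, visual weight %.2f" % (blurred[0], blurred[2]))
```

With the sharp image the combined estimate stays close to the visual location; once the visual noise exceeds the auditory noise, the estimate shifts toward the sound, which is the qualitative pattern described in the paragraph above.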

Bayesian integration
Bayesian integration gives a statistical backbone to the theory of modality appropriateness. It suggests that the brain uses Bayesian inference to determine the most likely common source of a group of multimodal stimuli. Bayesian inference is a method of statistical inference based on Bayes’ theorem:

$$P(H|E) = \frac{P(E|H)\cdot P(H)}{P(E)}$$

Bayes’ formula calculates the probability of a current hypothesis H given evidence E, based on previous statistical knowledge about the separate probabilities of the hypothesis P(H), evidence P(E), and of observing the evidence if the hypothesis is true P(E|H). As more data becomes available and the evidence set is expanded, the hypothesis probability can be refined based on whether the new data further support or detract from the current hypothesis. Bayesian updating is a useful way of determining how to best understand the data in order to determine the most likely explanation.
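As a small worked example of this updating process, the sketch below applies Bayes' theorem repeatedly to a binary hypothesis. It assumes that P(E) can be expanded with the law of total probability, P(E) = P(E|H)·P(H) + P(E|not H)·P(not H); the function name `posterior` and the probabilities used are hypothetical.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H|E) for a binary hypothesis H using Bayes' theorem.

    prior               -- P(H) before seeing the evidence
    likelihood_if_true  -- P(E|H)
    likelihood_if_false -- P(E|not H)
    """
    # P(E), expanded over the two cases "H is true" and "H is false".
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    return likelihood_if_true * prior / evidence


# Start undecided, then observe three pieces of evidence that are each
# more likely under H (0.8) than under not-H (0.3). Numbers are made up.
belief = 0.5
history = [belief]
for _ in range(3):
    belief = posterior(belief, 0.8, 0.3)  # yesterday's posterior is today's prior
    history.append(belief)

print(["%.3f" % b for b in history])
```

Each supporting observation raises the probability of the hypothesis; evidence that is more likely under the alternative would lower it in the same way.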

Applied to multisensory integration, Bayesian inference evaluates the probability of multimodal stimuli E corresponding to a particular source event H, and stimuli within E are combined or segregated accordingly to produce the hypothesis with the greatest probability. Bayesian integration is therefore most effective in brains that have been exposed to plentiful sensory experiences, and thus have access to large statistical data sets from which to form the a priori probabilities.
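To make this concrete, the sketch below picks the most probable source for a pair of conflicting cues, loosely modeled on the “ba”/“da”/“ga” example from the illusions section. It assumes the visual and auditory cues are conditionally independent given the spoken syllable, and every likelihood value is invented for illustration; it is not a model fitted to any data.

```python
def most_probable_source(prior, *likelihoods):
    """Return (best hypothesis, posterior dict) for independent cues.

    prior       -- dict mapping hypothesis -> P(H)
    likelihoods -- one dict per cue, mapping hypothesis -> P(cue | H)
    """
    unnormalized = {}
    for hypothesis, probability in prior.items():
        for likelihood in likelihoods:
            probability *= likelihood[hypothesis]
        unnormalized[hypothesis] = probability

    total = sum(unnormalized.values())  # plays the role of P(E)
    post = {h: p / total for h, p in unnormalized.items()}
    return max(post, key=post.get), post


# Hypothetical values: the lips look like "ga" or "da" but hardly like "ba";
# the sound resembles "ba" or "da" but hardly "ga".
prior = {"ba": 1 / 3, "da": 1 / 3, "ga": 1 / 3}
seen_lips = {"ba": 0.05, "da": 0.45, "ga": 0.50}    # P(lip movement | syllable)
heard_sound = {"ba": 0.50, "da": 0.45, "ga": 0.05}  # P(sound | syllable)

best, post = most_probable_source(prior, seen_lips, heard_sound)
print(best, {h: round(p, 3) for h, p in post.items()})
```

Neither cue on its own favors “da”, yet it is the only syllable that is reasonably compatible with both, so it ends up with the highest posterior probability.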

Three general principles
While none of the presented theories explains all of the studied multisensory experiences, together they lead to three general principles that have consistently been shown to hold:
 * 1) The spatial rule: multisensory integration is strongest when the various unisensory stimuli are sourced in approximately the same location.
 * 2) The temporal rule: multisensory integration is strongest when the various unisensory stimuli occur at approximately the same time.
 * 3) Principle of inverse effectiveness: multisensory integration is strongest when the constituent unisensory stimuli would provide only weak signals individually.
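The principle of inverse effectiveness is often quantified with a multisensory enhancement index: the percentage by which the response to the combined stimulus exceeds the best unisensory response. The sketch below assumes that definition; the function name and the response values (for example, spike counts) are hypothetical.

```python
def enhancement_percent(combined_response, unisensory_responses):
    """Percent gain of the combined response over the best single-modality one."""
    best_unisensory = max(unisensory_responses)
    return 100.0 * (combined_response - best_unisensory) / best_unisensory


# Made-up responses: the absolute gain is the same (5 units) in both cases,
# but it is proportionally much larger when the single cues are weak.
weak_cues = enhancement_percent(8.0, [2.0, 3.0])
strong_cues = enhancement_percent(30.0, [20.0, 25.0])

print("weak cues:   %.0f%% enhancement" % weak_cues)
print("strong cues: %.0f%% enhancement" % strong_cues)
```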

Details of Visual-Auditory Integration
As reliable as visual cues typically are on their own, stimuli are perceived as stronger and more reliable when accompanied by auditory ones. Indeed, the thresholds for detecting stimuli represented by concurrent auditory and visual cues are much lower than the thresholds for the same stimuli represented by only one modality. Many of the presented theories and effects focus on the integration of visual and auditory stimuli into coherent percepts. Audiovisual integration is one of the most studied forms of sensory integration, as the two modalities often convey information about the same events, and the appropriate combination of the two is critical for everyday life. Each of the two sensory systems provides critical information that the other would not have been able to pick up: for example, the visual system cannot determine the location of an object hidden in the shadows, while the auditory system can supply this knowledge.

Unimodal sensory information is first processed in the cortex, as has been detailed for auditory and visual stimuli. Multisensory integration occurs primarily in the superior colliculus (SC), located in the midbrain (though poorly understood multisensory processing clusters have been found in some other regions of the brain as well). The SC is made up of seven layers of alternating white and grey matter. Information from the retina and from regions of the cortex arrives at the outer layer of the SC, which thus forms a topographic map of the entire visual field. The deeper layers of the SC contain converged two-dimensional multisensory maps from combinations of the visual, auditory, and somatosensory modalities.

Convergence of multimodal stimuli in the SC follows what is known as the ‘spatial rule’, requiring that stimuli from different modalities fall on the same or adjacent receptive fields within the SC in order to excite a neuron. Signals are then sent out of the SC to neurons in the spinal cord, cerebellum, thalamus, and occipital lobe. These neurons in turn propagate their signals to muscles and further neural structures to direct a person’s orientation or behavior in response to the stimuli.

The resulting neural excitation is strongest and most unified if a visual stimulus arrives before an auditory one. This temporal offset is believed to be necessary to compensate for the relatively slow processing of the visual stimulus. While transduction of an auditory stimulus is an entirely mechanical process lasting around one millisecond, visual processing requires phototransduction within the retina as well as various neurochemical processes, and therefore lasts around 50 ms. The visual stimulus must therefore reach the eye approximately 50 ms before the sound reaches the ear in order for the two stimuli to be perceived as coincident. Luckily, light travels faster than sound, so the two signals naturally arrive with a slight offset even when caused by a single event. This natural offset is not always 50 ms, however, because it depends on the distance to the source, and studies have described a way in which the brain can compensate for this problem. Data presented by Alais and Burr suggest an active process within the brain that exploits the depth cues of the auditory stimulus in order to temporally align it with the visual one. The distance to a sound source can be reliably deduced from the ratio of direct to reverberant energy within the signal. The brain then seems to use implicit knowledge of the speed of sound to determine the time lag induced by that distance. This approximation of the speed of sound is most likely experience-based, and refined with each successive sensory experience.
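The timing argument above can be checked with simple arithmetic. The sketch below uses the latencies quoted in the text (about 50 ms for vision, about 1 ms for hearing) and assumes a speed of sound of 343 m/s (dry air at roughly 20 °C), with the travel time of light ignored; the function names are hypothetical.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumption: dry air at about 20 degrees C
VISUAL_LATENCY_MS = 50.0        # approximate value quoted in the text
AUDITORY_LATENCY_MS = 1.0       # approximate value quoted in the text


def perceived_lag_ms(distance_m):
    """Time of the auditory percept minus time of the visual percept.

    Positive values mean the sound is registered after the image.
    Light travel time is negligible at everyday distances and is ignored.
    """
    sound_travel_ms = 1000.0 * distance_m / SPEED_OF_SOUND_M_PER_S
    return (sound_travel_ms + AUDITORY_LATENCY_MS) - VISUAL_LATENCY_MS


def distance_of_natural_alignment_m():
    """Distance at which physics alone cancels the processing difference."""
    latency_gap_s = (VISUAL_LATENCY_MS - AUDITORY_LATENCY_MS) / 1000.0
    return latency_gap_s * SPEED_OF_SOUND_M_PER_S


print("aligned without compensation at %.1f m" % distance_of_natural_alignment_m())
for distance in (1.0, 10.0, 50.0):
    print("%5.1f m: lag %+.1f ms" % (distance, perceived_lag_ms(distance)))
```

Under these assumptions the two percepts line up on their own only for sources roughly 17 m away. Nearer events would register the sound first and farther events the image first, which is why a distance-dependent correction of the kind described above would be useful.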

The middle layers of the SC also play an important role in the division of attention, both automatic (exogenous) and voluntary (endogenous). In the case of voluntary attention, the brain can actively select certain stimuli from the SC layers on which to focus attention and processing. Stimuli that are more attention-grabbing are furthermore often perceived to have occurred before those that are less salient, even if they coincided in reality. For example, if the color and direction of motion of an object are altered simultaneously, the color change is perceived to have occurred first, because it attracts more of the brain’s immediate attention.

Low-level sensory integration
The previous examples discuss sensory integration achieved through neurological processes of various complexities. But sensory integration also occurs at much lower, more instinctive levels.

For example, the vestibular system provides a plethora of sensory information. It is responsible for detecting our bodies’ motion, keeping our balance, and determining our orientation in space. To do this, it primarily relies on a mechanical system in our inner ear, which interprets the flow of fluid inside the semicircular canals and otoliths to determine the corresponding movement of our head. The vestibular and visual systems are heavily integrated, most notably through the vestibulo-ocular reflex, which induces eye movements in response to detected head movement in order to stabilize the image currently in the visual field and maintain balance. (This is similar to when we are told to focus on one specific spot while we are spinning as a means of preventing or reducing the normally induced nausea.) When the semicircular canals detect head rotation, the stimulus is sent to the vestibular nuclei in the brainstem via the vestibular nerve. The vestibular nuclei receive the stimulus and in turn stimulate the contralateral oculomotor nucleus, which contains neurons that induce the eye muscle activity. This cross-modality stimulation occurs automatically, without higher-level neural processing of the stimulus, as opposed to the other multisensory effects described above.
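The stabilizing effect of the vestibulo-ocular reflex can be illustrated with a toy calculation in which the eyes rotate opposite to the head. This is only a sketch: the `vor_gain` parameter (ratio of eye rotation to head rotation), the sinusoidal head movement, and all numbers are assumptions made for illustration, not measurements.

```python
import math


def gaze_error_deg(head_angle_deg, vor_gain=1.0):
    """Gaze direction relative to a fixed target, in degrees.

    The eyes counter-rotate by vor_gain times the head rotation, and
    gaze direction is the sum of head angle and eye-in-head angle.
    """
    eye_angle_deg = -vor_gain * head_angle_deg
    return head_angle_deg + eye_angle_deg


# Hypothetical head oscillation with 10 deg amplitude, sampled 20 times per cycle.
head_angles = [10.0 * math.sin(2.0 * math.pi * step / 20.0) for step in range(20)]

worst_ideal = max(abs(gaze_error_deg(angle)) for angle in head_angles)
worst_weak = max(abs(gaze_error_deg(angle, vor_gain=0.6)) for angle in head_angles)

print("largest gaze error with gain 1.0: %.1f deg" % worst_ideal)
print("largest gaze error with gain 0.6: %.1f deg" % worst_weak)
```

With a gain of 1 the eye movement cancels the head movement and the image stays fixed on the retina; with a lower gain, part of the head movement shows up as image slip.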

This interaction also happens in reverse, providing a good example of low-level multisensory integration: certain visual movements can stimulate the vestibular system, even when we are not moving. For example, when we sit in a stationary train and watch an adjacent train slowly pull out of the station, we feel that our own train is moving. The moving image stimulus is relayed via the nucleus of the optic tract (NOT), leading to activity in the vestibular nucleus that is typically caused by head movement. Our body also responds to the induced stimulus, adapting our posture to counteract the perceived - but nonexistent - acceleration. Similarly, watching point-of-view videos of a rollercoaster can make us feel dizzy even when we are sitting still. In this way a visual stimulus can directly drive vestibular activity and postural responses, without the involvement of higher-level brain processing.