Hearing is our system for analyzing patterns of air waves. Like vision, which transduces light waves into neural signals, hearing transduces air waves into neural signals. In vision, the transducer is the eye. In hearing, the transducer is the ear.
Human Ear
The ear comes in three sections: the outer, middle and inner ear. The outer and middle ear sections are filled with air. They have direct and indirect access, respectively, to the outside world. The inner ear is fluid-filled and self-contained. This is where hearing actually occurs. The outer ear collects sound waves. The middle ear amplifies them. The inner ear transduces them into neural signals.
These are the steps in the hearing process.
1. Pinna
The outer portion of the ear is called the pinna (plural: pinnae). This is the visible ear. It funnels sound toward the eardrum, the tympanic membrane. The pinna (Latin for wing or fin) is composed of cartilage. There is one ear on each side of the head, a bit more than halfway back. The symmetry and location help in sound localization.
Humans have erect ears. We are more like German Shepherds than Beagles. The advantage of an erect ear is that it maximizes the collection of sound waves. The disadvantage is that it is unprotected.
The ear is an oval, with a circular lobe added to the bottom. The lobe can be “attached” to the neck, offering no movement, or “unattached,” meaning its base moves more easily. The difference is genetic and has no specific advantage.
The oval of the ear is a rim, tilted slightly downward. A bent Y-shaped mesa sits in the center. The opening to the auditory canal is located in the curve of the Y. The canal itself is covered with a flap called the tragus. The tragus also improves your ability to hear sounds from behind you.
The shape of the ear’s ridges treats high and low frequencies differently. Low frequencies hit the ear and are bounced into the auditory canal; unfocused low sounds are reflected away. High frequencies bounce off the ridges, slowing them down. This delay causes phase cancellation between the direct and reflected waves. The combination creates a pinna notch, an attenuation of certain frequencies that improves signal clarity.
2. Tympanic Membrane
The auditory canal ends at a membrane that blocks air from entering the middle ear. This slightly indented cone is the tympanic membrane. It looks like a circle with a bone on its back. Air waves make the membrane vibrate, which causes the malleus bone to move.
The membrane is fragile, so keep objects away from it. Feel free to flood it with water if you want to clean it. Alternatively, you can use your tongue, if you are a giraffe. Otherwise, leave it alone.
3. Ossicular Chain
The middle ear is an air-filled cavity. Inward pressure on the tympanic membrane is balanced by the Eustachian tube, a small canal that leads to the throat. When flying, “popping” your ears uses swallowing or air pressure to push air into the middle ear and balance the pressure on the tympanic membrane.
The middle ear cavity contains the ossicular chain, three interconnected bones. The malleus, incus and stapes are connected by swivel joints; moving one bone moves all three. Air waves hitting the tympanic membrane are amplified by the mechanical movement of the bones. The middle ear acts as a preamplifier.
4. Windows
Two windows (membranes) separate the inner ear from the middle ear. The stapes bangs on the oval window, sending waves of fluid up the cochlear canal.
The pressure exerted on the oval window is balanced by pressure on the round window, located just below it. When the oval window is pushed in, the round window pushes out.
5. Cochlea
If the ossicular chain is an amplifier, the cochlea is a frequency analyzer. It takes complex patterns of air waves and separates them into tones of different frequencies. It transduces vibrations into neural signals.
Uncoil the snail-like structure of the cochlea, and you have a long triangle. The base of the triangle is thick, and both the oval and round windows are located there.
There are two connected canals filled with perilymph fluid. At the base of the triangle, the oval window connects to the upper canal, and the round window connects to the lower canal. The two canals are joined at the apex, forming a continuous flow of fluid from one to the other. Fluid travels up one canal and down the other.
Lengthwise, the outer triangle holds an inverted triangle, called the Organ of Corti. It is filled with endolymph, and holds hair-like fibers called stereocilia. The stereocilia are connected by extracellular links into bundles. They are graded in height, arranged in pseudo-hexagonal bundles.
It is the mechanical movement of these cilia which transduces vibrations into neural patterns. The inner structure is narrow where the base of the outer triangle is thick, and broad at the apex. This inner triangle is laid out like a xylophone: high tones near the base, low tones near the apex of the outer triangle.
Roll it all up like a giant crescent roll. Pressure on the oval window causes fluid to be pushed toward the apex. High frequency waves don’t travel far, so they don’t have far to go to stimulate sensors. Low frequency sound waves travel a long way; ask your neighbors whether your stereo’s subwoofer is bothering them. Like the rumble of trucks on the freeway, low frequencies travel nearly all the way up to the apex.
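To make the xylophone layout concrete, here is a small sketch using the Greenwood function, a standard approximation that maps a position along the uncoiled cochlea to the frequency that peaks there. The constants below are the commonly cited human fit, so treat the outputs as rough estimates.

```python
# A rough sketch of the cochlea's place-to-frequency map (Greenwood function).
# Positions are fractions of the cochlea's length, measured from the apex.
def greenwood_frequency(position_from_apex):
    """Approximate best frequency (Hz) at a point along the cochlea."""
    A, a, k = 165.4, 2.1, 0.88  # commonly cited human constants
    return A * (10 ** (a * position_from_apex) - k)

print(round(greenwood_frequency(0.0)))  # apex: ~20 Hz (low rumbles)
print(round(greenwood_frequency(0.5)))  # middle: ~1,700 Hz (speech range)
print(round(greenwood_frequency(1.0)))  # base: ~20,700 Hz (upper limit)
```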
6. Auditory Nerve
The Ascending Auditory Pathway starts at the cochlea and ends in the temporal lobes. The first stop is the cochlear nucleus. This is where all the tiny fibers of the cochlea are bound together to create the auditory nerve. This signal presents the information from one ear. Think of it as a mono signal.
The next stop is the Superior Olivary Nucleus, also called the superior olivary complex (SOC). There is one on each side of the head, and they share their input with the other side. Think of it as a stereo signal because each olivary nucleus has information from both ears.
The dual ear data is compared and used in a reflexive localization process. That is, without having to think about it, you have a sense of where sound is coming from. If the signal strength is greater from the left ear, you sense the source is located on that side. Note that both SOCs do this calculation.
The SOCs are in the pons. They take the input from the cochlear nucleus and output it to the lateral superior olive for detecting differences in sound levels, and to the medial superior olive for detecting differences in time (which ear hears first).
The next stop is the Inferior Colliculus, where visual location information is added.
The next stop is the medial geniculate nucleus (MGN), part of the thalamus. The MGN helps with selective attention, so we can listen to one conversation and not another.
7. Auditory Cortex
The temporal lobes are specialized, and tonotopically organized. One hemisphere, usually the left, specializes in language. The other hemisphere specializes in sounds. Both are spatially organized by frequency. Frequency bands run parallel to the surface. Lower frequencies are inferior (lower); higher frequencies are farther up the side of the cortex. Soft sounds are processed close to the surface, and louder sounds deeper in.
The auditory cortex is also organized in concentric circles. The primary auditory cortex is in the middle, surrounded by the secondary auditory cortex which is surrounded by, of course, the tertiary auditory cortex.
The primary auditory cortex (PAC) receives input from the MGN. Its neurons are organized by the frequency they respond to best. Low frequency neurons are at one end; neurons specializing in high frequencies are at the other. The PAC identifies loudness, pitch and rhythm. It passes the signals on to the secondary auditory cortex (SAC).
The SAC processes auditory data further, identifying harmony, melody and more complicated rhythm patterns.
The tertiary auditory cortex integrates our musical experience. It sends signals to the parietal and frontal lobes to identify which instrument is being played, and how instruments can sound different when they play the same note.
The regions of the auditory cortex work together to analyze complex sounds from multiple sources and make sense of them. Groupings are made based on harmony, timing and pitch. Streams of input can be separated into conversations. All these tasks are done in real time.
Other Ears
Humans aren’t the only ones who can hear. But it is difficult to test other species. With people, we simply ask them to report when they hear a tone. You can’t get a bear to raise a paw during a hearing test. Still, it is clear we are not alone in our ability to hear.
We are not even the best at it. We are put to shame by bats, owls, cats, dogs, dolphins and horses. But we are much better at hearing than armadillos and spiders. Elephants and goldfish hear lower tones than we do but not as high. Dogs can hear sounds from 65-45,000 Hz. Dolphins are in the 100-150,000 Hz. range. Rabbits and raccoons can detect frequencies between 100 and 40,000 Hz.
The human ear theoretically responds to sounds between 20 cycles and 20,000 cycles per second (Hz). In practice, it is more typically 30 to 16k Hz. There is a lot of variation between people, particularly at the high end. This is also where older people lose ability. In general, human physiology optimizes for language. We will drop off the highs and lows but keep the ability to communicate for as long as possible.
Some amount of practice matters too. Different languages have different frequency preferences. German uses deep guttural sounds; its speakers usually use the 100 to 3,000 Hz range. French speakers tend to be bimodal, using 100 to 300 Hz and 1k-2k Hz. English speakers tend toward the high end, with a 2k-12k Hz range. These are habitual ranges, not physical limits.
Theories
We know that it is hard to fool our hearing system. We can flash pictures at 24 fps to create films that look real enough to entertain us. And we can up the frame rate to 60 fps to look very lifelike. But the auditory system is harder to fool. CDs sample at 44,100 times per second, not 24 like movies, but they still don’t sound like a live performance.
We know the auditory system does a great job. We just don’t know how it does it.
But we have some theories.
Pitch
Pitch varies from low to high based on frequency, instructions and amplitude.
Frequency is measured in cycles per second (cps), also called Hertz (Hz). One cycle includes an ebb and flow. Graphically, it curves up to a high point, curves down to its starting value, curves down to its lowest point, and curves back up to its starting value. An octave is a doubling of frequency: 30, 60, and 120 Hz are each an octave apart. Similarly, 1,000 Hz and 2,000 Hz are an octave apart.
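A quick sketch of that doubling rule (the 30 Hz starting pitch is just an example):

```python
# Each octave doubles the frequency of the one below it.
base = 30  # Hz, an arbitrary starting pitch
print([base * 2 ** n for n in range(4)])  # [30, 60, 120, 240]
```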
Two prominent encoding theories are used. Bekesy’s Place Theory does a good job of explaining the encoding of frequencies above 5,000 Hz. Wever’s Volley Theory does a good job of explaining the encoding of frequencies below 500 Hz. Either theory explains the 500-5,000 Hz range well.
Von Bekesy’s Place Theory suggests that along the Organ of Corti, different pitches are located at different places. The pitch that is registered is determined by where the peak of the traveling wave through the fluid-filled canal occurs. High notes are near the stapes; low tones are near the helicotrema, the spot in the apex where the two outside canals meet. The competing theory, Wever’s Volley Theory, argues that neurons fire in a volley (sequence) to match the frequency of a specific pitch.
Instructions have a surprising influence on pitch perception. What subjects are asked to do produces different results. If people are asked to indicate when a tone changes, they produce a just-noticeable difference (JND) scale. If subjects are asked to indicate a change to a musical note, they create an octave scale. Being asked about music influences our perception. We seem to have separate systems for tones and music, or separate ways of interpreting the same data.
Amplitude (loudness) impacts our perception of pitch by overemphasis. Loud high tones sound higher, and loud low tones sound lower. Middle tones remain the same.
Loudness
The variation of sound levels from soft to loud is based on amplitude and frequency. Just as amplitude impacts our perception of pitch, pitch impacts amplitude. Different pitches have different loudness contours.
Amplitude is the degree of change in pressure, measured in decibels (dB). The decibel is a logarithmic scale. If silence is at 0 dB, a sound 10 times more intense is 10 dB. A sound 100 times more intense than silence (leaves rustling) is 20 dB. At 1,000 times more powerful it is 30 dB. At 40 dB, an environment is 10,000 times more intense than silence. This is how loud a quiet suburb is.
Speaking conversationally is at 60 dB, a million times louder than silence. A subway train is 100 dB. A jet taking off, a rock concert, a gunshot and a firecracker are in the 120-140 dB range. They are very loud. They are deafeningly loud. You can lose your hearing from exposure to extreme levels of sound.
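Here is a minimal sketch of the decibel arithmetic, assuming the reference intensity is the threshold of hearing (the “silence” above):

```python
import math

def intensity_to_db(intensity_ratio):
    """Convert a ratio of sound intensity (relative to the threshold
    of hearing) into decibels on the logarithmic scale."""
    return 10 * math.log10(intensity_ratio)

print(intensity_to_db(10))         # 10 dB: 10x the reference
print(intensity_to_db(100))        # 20 dB: leaves rustling
print(intensity_to_db(10_000))     # 40 dB: a quiet suburb
print(intensity_to_db(1_000_000))  # 60 dB: conversation
```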
Your body tries to protect you from loud sounds. When the body senses a loud noise is coming, the attenuation reflex kicks in. Without thinking about it, the muscles surrounding the ossicular chain contract, holding the little bones more tightly and preventing damage. The result dampens low frequencies more than high frequencies. It preserves language tones, and decreases rumble and boom.
It seems that loudness is encoded by specialization, and a combination of spatial and temporal summation. Some neurons seem to respond to a range of intensity values. There are low volume, mid level and high loudness neurons. Within each range, firing rates would pinpoint actual values.
If a set of hair cells responds to 10-30 dB, for example, quiet sounds of 10 dB would make its neurons fire slowly. When a signal reaches 20 dB, they would fire faster. This is temporal (time) summation. A rapid firing rate would trigger another neuron to report loud sounds. When neurons reach their maximum firing rate and just can’t fire faster, other neurons in the area are stimulated. This is spatial summation.
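As a toy illustration of that idea (not a physiological model; the dB range and firing rates are invented for the example):

```python
def encode_loudness(db, neuron_range=(10, 30), max_rate=100):
    """Toy model: firing rate rises with loudness inside one neuron's
    range (temporal summation); beyond its maximum, neighboring
    neurons are recruited (spatial summation)."""
    low, high = neuron_range
    if db <= low:
        return {"rate": 0, "recruit_neighbors": False}
    fraction = min((db - low) / (high - low), 1.0)
    return {"rate": fraction * max_rate, "recruit_neighbors": db > high}

print(encode_loudness(15))  # slow firing
print(encode_loudness(25))  # faster firing
print(encode_loudness(40))  # saturated, so nearby neurons take over
```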
Timbre
Timbre is the set of qualities and characteristics that distinguish one voice or instrument from another. A choir, a cello, and a piano can hit the same note at the same loudness but will sound different. Even singers in a choir have different timbres. Some are more like flutes, and some more like trumpets.
Timbre reflects the number of frequencies and overtones present. We don’t hear harmonics as separate tones but as voice qualities. A fundamental of 100 Hz, a first harmonic of 200 Hz, and a second harmonic of 300 Hz are not heard as separate sounds but as one richer, fuller sound. One of the reasons we like listening to choirs and singing groups is that the combination of different voices adds more harmonics and overtones. We like timbre.
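A small sketch of that example in code, with made-up amplitudes for the harmonics:

```python
import numpy as np

sample_rate = 44_100
t = np.arange(0, 1.0, 1 / sample_rate)  # one second of audio

fundamental = np.sin(2 * np.pi * 100 * t)              # 100 Hz
first_harmonic = 0.5 * np.sin(2 * np.pi * 200 * t)     # 200 Hz, quieter
second_harmonic = 0.25 * np.sin(2 * np.pi * 300 * t)   # 300 Hz, quieter still

# We hear the sum as one fuller, richer voice, not as three separate tones.
complex_tone = fundamental + first_harmonic + second_harmonic
```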
Converting Sounds To Objects
Usually we hear more than one sound at a time. All the sounds enter our ears mixed together. Yet somehow we are able to separate conversation from background noise. The cochlea works as a frequency analyzer. It separates fundamental frequencies and overtones. This is the relatively easy part. Each part of the cochlea responds to only certain frequencies. It is like having a series of band-pass filters.
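A rough sketch of that band-pass analogy, using SciPy; the band edges are arbitrary choices for illustration, not measurements of the cochlea:

```python
import numpy as np
from scipy.signal import butter, sosfilt

sample_rate = 44_100
t = np.arange(0, 0.5, 1 / sample_rate)
# A mixture: a 300 Hz hum and a 3,000 Hz whistle arriving together.
mixture = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 3_000 * t)

def band(signal, low_hz, high_hz):
    """Keep only the frequencies between low_hz and high_hz."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfilt(sos, signal)

low_region = band(mixture, 100, 1_000)     # mostly the 300 Hz hum
high_region = band(mixture, 1_000, 8_000)  # mostly the 3,000 Hz whistle
```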
What’s difficult is creating auditory objects. This is achieved through Auditory Scene Analysis (ASA).
The auditory nerve reflects all of the frequency components of the sound. It forms a power spectrum. We convert this complex signal to a perception of separate objects, also called streams.
Streaming is the auditory equivalent of visual object perception. It comes in two parts. First, there is the relatively automatic and pre-attentive stage. This saves our attentional resources, and leaves us free to consciously get on with other things, including the second phase of auditory perception.
Complex sounds are separated into components, then reassembled into groups. This grouping is done on the basis of time, forming sequential groups, or by frequency, if the sounds occur simultaneously.
The second phase is to convert the streams of data into mental representations. This stage is task oriented. We pay attention to certain streams and ignore others because we are trying to hear what our friend is saying while at dinner in a crowded restaurant.
Both stages are part of our attentional system. We consciously and pre-consciously use cues to decide what we are hearing. These cues include frequencies, changes over time, similarity, spatial location and vision.
Pitch Cues
We are quite good at analyzing complex sounds, but we get confused when sounds are similar. Scheffers (1983) found we need a 6% difference in frequency to separate voices. Closer than that, we try to unite them. Moore et al. (1986) slightly mistuned harmonics in complex tones. If mistuned, we perceive them as separate entities. If properly tuned, we interpret them as timbre. Frequency relationships bond harmonics together.
Change Cues
Streams occur in a timeline. It is the onset and offset of harmonics that helps group them into objects. This is analogous to the starting and stopping of words in a sentence.
Similarity Cues
We tend to group sounds based on similarity of pitch or timbre. We are less likely to group on the basis of time. Grouping by similarity allows us to hear speech in noise.
Visual Cues
We use visual cues more often than it might seem. It is one of the reasons that we prefer videoconferencing to phone calls. Our highest preference, of course, is meeting people in person.
What you hear impacts what you see. Imagine some visual and auditory stimuli are moving from left to right. When they are in sync, everything seems normal. But when they are moved out of step, the auditory signal affects people’s ability to see the motion. We focus on one and ignore the other.
What you see impacts what you hear too. Ventriloquists “throw” their voices because the visual stimulus overrides our auditory localization process. A related demonstration of vision shaping hearing is the McGurk effect.
In the 1970s, Harry McGurk and John MacDonald demonstrated a robust illusion that works in all languages and at all ages. It works with real and with highly reduced, stylized faces. It even works with subjects simply touching a face, and with those unaware they are looking at a face.
It turns out we lip read more than we suspect. If, for example, you watch a person saying “ga” but the soundtrack says “da,” you hear “ga.” Vision overrides hearing. Similarly, seeing “vase” but hearing “base” produces “vase.”
You know from your own experience that it doesn’t seem right when your loudspeaker isn’t next to the TV. Even very young babies dislike sound coming from the “wrong” place.
Spatial Location Cues
We use cues for locating sounds, which also tells us something about their distance and motion.
Azimuth (horizontal) cues come mainly from differences between the left and right ears. The head shadows the far ear, so sound there is slightly quieter and arrives slightly later.
Vertical (up and down) cues are primarily pinna-induced. They are also the result of head movements. The pinna influences distance cues as well. Distance is mostly determined by intensity and clarity. Sounds in the distance are often quiet and muffled.
Time cues come from differences in arrival times. Fibers cross in the trapezoid bodies to each superior olive, signaling an interaural time difference (ITD). A large ITD means sound hits one ear well before the other. The process is more complicated than it sounds.
The coincidence detection model (Jeffress, 1948) works for low-frequency sound localization, but less well at higher frequencies. Rayleigh’s duplex theory works well for pure tones but not so well for complex sounds, such as voices.
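A back-of-the-envelope sketch of the interaural time difference; the head width and the straight-line geometry here are simplifying assumptions for illustration:

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second in air
HEAD_WIDTH = 0.18       # meters, a rough ear-to-ear distance

def itd_seconds(azimuth_degrees):
    """Extra travel time to the far ear for a source at the given angle
    (0 degrees = straight ahead, 90 degrees = directly to one side)."""
    return (HEAD_WIDTH / SPEED_OF_SOUND) * math.sin(math.radians(azimuth_degrees))

print(f"{itd_seconds(90) * 1e6:.0f} microseconds")  # about 525, roughly half a millisecond
print(f"{itd_seconds(10) * 1e6:.0f} microseconds")  # much smaller for a nearly central source
```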
Vestibular
Want to jump ahead?
- What Is Perception?
- Perceptual Efficiency
- Vision
- Taste
- Smell
- Touch, Temperature, Pain & Itch
- Hearing
- Vestibular
- Visceral
- Proprioception
- Time
Photo by Pauline Loroy on Unsplash