Speech perception
Jacqueline Vaissière | Université de la Sorbonne Nouvelle Paris 3
Introduction

Speech perception is a difficult matter, less well understood than speech production. It is a seamless ability, grounded in innate capacities and typically in place by early infancy, yet it remains far less open to conscious scrutiny than speech production.

Speech perception (SP) can be delineated in two distinct ways.

Broadly, SP is the understanding of the full range of information carried by the spoken message. It includes 1) the decoding of the linguistic content with the help of preexisting syntactic, semantic, and pragmatic analyses (equivalent to the written text); 2) the speaker's feelings about what he is saying (doubt, conviction); 3) his relationship with the listener (friendship, irritation); 4) his identity, personality traits, gender, age, and physical and mental state, thanks to cues that are not controlled by the speaker; 5) the affective impact of the message, which can create empathy or sadness in the listener. Narrowly, SP is the phonetic analysis of the incoming signal: the continuous and varying acoustic signal is transformed into a sequence of discrete speech units (features, phonemes, diphones, or syllables; the hierarchy is not well established), and these units interact with our mental lexicon.

Related fields

Both speech and nonspeech sounds are perceived by the same ear and transmitted to the brain via the same auditory nerve. Consequently, they share common neural mechanisms. The ear performs a non-linear frequency analysis of incoming signals within the cochlea, akin to a filter bank whose output can be depicted as a spectrogram. The fine-tuned analysis of low-frequency components (the frequencies of the first spectral peaks and harmonics) is crucial for the detection of nasality, breathiness, speaker characteristics, and voice quality. Both speech and nonspeech sounds undergo temporal and frequency masking, along with discontinuity reinforcement, before they are transmitted to the auditory cortex. Critical bands are frequency bandwidths within which the ear cannot resolve individual frequencies: when two formants are in close proximity, they are perceived as a single formant (as in the case of quantal vowels). Integration times also come into play, such as the integration of intensity at the syllable level.
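
As a rough illustration of the filter-bank analogy above, the following Python sketch (assuming NumPy and SciPy are available) pools short-time spectral power into Bark-spaced bands using the Zwicker-Terhardt approximation of the critical-band scale; it is a coarse stand-in for cochlear analysis, not a calibrated auditory model. Two spectral peaks that fall within roughly one band, like two closely spaced formants, end up merged into a single energy maximum.

```python
import numpy as np
from scipy.signal import stft

def hz_to_bark(f):
    """Zwicker-Terhardt approximation of the Bark (critical-band) scale."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def critical_band_spectrogram(signal, fs, n_bands=24):
    """Pool STFT power into Bark-spaced bands, mimicking a cochlear filter bank."""
    freqs, times, Z = stft(signal, fs=fs, nperseg=int(0.025 * fs))  # 25 ms frames
    power = np.abs(Z) ** 2
    bark = hz_to_bark(freqs)
    edges = np.linspace(0.0, hz_to_bark(fs / 2.0), n_bands + 1)
    bands = np.zeros((n_bands, power.shape[1]))
    for b in range(n_bands):
        in_band = (bark >= edges[b]) & (bark < edges[b + 1])
        if in_band.any():
            bands[b] = power[in_band].sum(axis=0)  # total energy within one band
    return times, edges, 10.0 * np.log10(bands + 1e-12)  # band energies in dB

# Example: two tones about 0.8 Bark apart are largely merged by the band analysis.
fs = 16000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 2200 * t) + np.sin(2 * np.pi * 2500 * t)
times, edges, S = critical_band_spectrogram(sig, fs)
```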

Speech perception is intricately connected to various research fields, including psychoacoustics, which investigates how the human auditory system perceives sounds of all kinds, including noise, speech, and music; auditory physiology, which explores the ear's anatomy and the functions of the auditory nervous system; the ever-evolving fields of fundamental neurobiology and behavioral neuroscience, which investigate the roles of the peripheral and central nervous systems in brain function, mainly using brain imaging; and psycholinguistics, which centers on the mechanisms involved in language acquisition, usage, processing, comprehension, and representation in the mind and brain, mainly using behavioral paradigms.

Research questions

Critical unanswered research questions in the field of speech perception include:

  • At which level are speech and nonspeech sounds processed differently? When individuals are instructed to perceive certain acoustic stimuli as speech, they process them differently. For example, even if /i/ and /a/ have different physical intensities, if the listener judges that the speaker exerted similar articulatory effort to pronounce both vowels, the two vowels will be perceived as having the same loudness (as observed in Lehiste, 1970).
  • How closely are speech perception and speech production interconnected? Brain imaging studies have shown that motor areas associated with speech production are active during speech perception. This suggests that the brain's motor system plays a role in understanding speech, but it is not proof. Additionally, individuals with Broca's aphasia have difficulty speaking but can still understand speech, while individuals with Wernicke's aphasia have difficulty understanding speech but may speak fluently. This supports the idea that speech perception and production are interconnected but can be affected independently.
  • What about the link with the visual cortex? The McGurk effect demonstrates that when individuals hear one sound (e.g., "da") but see a person's face articulating a different sound (e.g., "ba"), they may perceive a different consonant altogether. This suggests that visual information from the speaker's face influences speech perception and can override the auditory input.
  • How do auditory input, linguistic and non-linguistic information contribute to word recognition in speech perception? Auditory input, higher-level linguistic information (such as the mental lexicon, syntax, and pragmatics), and even visual sensory input all play a role.
  • At what point in the process does a word become a candidate, get eliminated, or get recognized by the listener?
Experimental designs for studying speech perception

Two types of methods can be used:

  • Behavioral paradigms involve participants in various tasks related to discrimination and identification. In a discrimination task, the listener is asked whether two stimuli are the same or different. In an identification task, participants explicitly label a sound: stimuli are typically presented one at a time, and the listener answers questions like "Do you hear /p/, /t/, or /k/?" (forced choice) or "What phoneme do you hear?" (free choice). Reaction time refers to the interval between the presentation of a stimulus and the participant's response through a basic action, such as pressing a key (a minimal sketch of such a trial follows this list). A prolonged reaction time indicates hesitation and the presence of conflicting cues, shedding light on the role that otherwise imperceptible variations can play in the perception process. In the case of unborn children, neonates, and young infants, diverse methods are used to detect changes in their responses after they become accustomed to stimuli (equivalent to a discrimination task). These methods include monitoring heart rate (fetuses), non-nutritive sucking rate (newborns), head-turning behavior, and neuroimaging techniques.
  • Neuroimaging techniques offer windows into the brain and allow the neural correlates of behavioral findings to be visualized. They are used in awake or asleep children and adults (and in non-human primates) to provide insights into brain activity and connectivity when listening to speech. How does the brain react when a subject listens to a long series of sounds representing the same phoneme and a different phoneme is then introduced (equivalent to a discrimination task)? The most commonly used techniques include Magnetic Resonance Imaging (MRI) for spatial localization, and Magnetoencephalography (MEG) and Electroencephalography (EEG) for temporal resolution.
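
As a concrete illustration of a behavioral trial, here is a minimal Python sketch of a forced-choice identification block with reaction-time logging. The play_stimulus function is a hypothetical placeholder, and keyboard input gives only approximate timing; a real experiment would rely on dedicated presentation software with calibrated audio and millisecond-accurate response collection.

```python
import random
import time

def play_stimulus(stimulus_id):
    """Hypothetical placeholder: in a real experiment this would play a
    pre-synthesized audio file through calibrated equipment."""
    print(f"[playing stimulus {stimulus_id}]")

def forced_choice_block(stimuli, choices=("p", "t", "k"), n_trials=12):
    """Run a block of forced-choice identification trials and log the
    response and reaction time (keyboard timing is only approximate)."""
    results = []
    for trial in range(n_trials):
        stim = random.choice(stimuli)
        play_stimulus(stim)
        t0 = time.monotonic()
        response = input(f"Do you hear {', '.join('/' + c + '/' for c in choices)}? ")
        rt = time.monotonic() - t0
        results.append({"trial": trial, "stimulus": stim,
                        "response": response.strip(), "rt_s": round(rt, 3)})
    return results

# Example: a continuum of ambiguous stop stimuli, labeled step_01 ... step_07.
if __name__ == "__main__":
    data = forced_choice_block([f"step_{i:02d}" for i in range(1, 8)])
    print(data)
```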

The listeners may be asked to identify or discriminate the language or regional dialect, the utterance modality, the speaker's identity, gender, age, physical condition, emotional state, the speaker's attitude toward the content (such as irony or incredulity), and his attitude toward the interlocutor (such as disgust or compassion). They may be asked to locate word, phrase, and utterance boundaries or stresses, to press a button when they hear a given phoneme or word, or to evaluate intelligibility, naturalness, speech quality, voice quality, etc.

The diversity of participants contributes to a comprehensive understanding of speech perception across contexts, ages, and linguistic backgrounds: early and late second- or third-language learners, proficient or struggling readers, children and adults with normal hearing or profound hearing loss (either born deaf or having lost hearing after acquiring language), cochlear-implant users, monolingual or bilingual individuals, listeners from various cultures, listeners who may or may not have visual access to the speaker's face, musicians and non-musicians, listeners with language disorders, etc.

A multitude of software tools, such as Praat, make it possible to synthesize, visualize, cross-splice, and manipulate speech, and to create continua such as a /ri/–/li/ continuum for use in speech perception experiments. Formant and articulatory synthesizers are also readily available for creating stimuli. Other software packages support listening experiments (e.g., labeling, discrimination, reaction time measurement, magnitude estimation, and so on).
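
By way of illustration, the sketch below (Python with NumPy/SciPy; all formant and bandwidth values are assumed for the example, and a real /ri/–/li/ continuum would instead manipulate the third-formant trajectory of natural or high-quality synthetic tokens) builds a small continuum of static vowel-like stimuli by interpolating formant frequencies between two endpoints and filtering a pulse train through a cascade of second-order resonators.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(signal, f_formant, bandwidth, fs):
    """Second-order digital resonator approximating one formant."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * f_formant / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]   # denominator (pole pair)
    b = [1.0 - 2.0 * r * np.cos(theta) + r * r]  # scale for unity gain at 0 Hz
    return lfilter(b, a, signal)

def synth_vowel(formants, fs=16000, f0=120, dur=0.3):
    """Crude source-filter synthesis: an impulse train at f0 filtered
    through a cascade of formant resonators."""
    n = int(dur * fs)
    source = np.zeros(n)
    source[::int(fs / f0)] = 1.0                 # glottal-pulse-like impulse train
    out = source
    for f in formants:
        out = resonator(out, f, bandwidth=90.0, fs=fs)
    return out / (np.abs(out).max() + 1e-12)     # normalize amplitude

def continuum(start_formants, end_formants, n_steps=7):
    """Linear interpolation of formant frequencies between two endpoints."""
    start = np.asarray(start_formants, dtype=float)
    end = np.asarray(end_formants, dtype=float)
    return [synth_vowel(start + (end - start) * k / (n_steps - 1))
            for k in range(n_steps)]

# Assumed, purely illustrative endpoints: an /i/-like and an /a/-like spectrum.
steps = continuum([280, 2250, 2900], [700, 1200, 2600])
```

Each element of steps can then be written to a WAV file (for example with scipy.io.wavfile.write) and used as a stimulus in an identification or discrimination task.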

Models and theories on speech perception

Several theories have been proposed in the field of speech perception, each accumulating partial evidence that supports its claims. These theories are, to varying degrees, compatible with one another. The brain adapts to specific contexts and employs numerous strategies to process various facets of speech sounds simultaneously, extracting the information pertinent to the listener while discarding irrelevant details.

The following models concern the identification of vowels and consonants.

  • Alvin Liberman's Motor Theory of Speech Perception: listeners are capable of detecting the relevant articulatory gestures in speech, thanks to an inherited speech-specific module. For instance, when they hear sounds like /p/, /t/, or /k/, they retrieve the intended neuromotor commands for lip closure or for vocal tract constriction by the tongue apex or dorsum. The theory posits that speech perception relies on prior awareness of articulatory gestures; however, even newborns can perceive contrasts between these sounds, even though they may never have produced them. The theory has mainly been tested on consonants.
  • Browman and Goldstein's Articulatory Phonology Model posits that abstract articulatory gestures serve as the fundamental units of phonological contrast in both speech production and perception. The "gestures" are intended articulatory tasks implemented by vocal tract variables.
  • Fowler's Direct Realist Theory of Speech Perception: speech perception targets gestures, but not intended gestures as in the two previous models, and there is no dedicated "speech module". Instead, the acoustic signal directly specifies the gestures that structure it.
  • Stevens's Invariance Theory primarily concerns the acoustic properties of the consonants. For example, it has been hypothesized that the short-time spectrum sampled at the onset of a stop consonant should exhibit gross properties that uniquely specify the consonantal place of articulation independent of the following vowel.
  • Perceptual Magnet Theory (Patricia Kuhl): to recognize speech sounds, the listener judges the distance between what they hear and a set of vowel prototypes stored in the brain (a toy sketch of this prototype-distance computation follows the list).
  • Chomsky's Distinctive Feature Detector Theory suggests the existence of innate detectors for distinctive features in speech sounds.
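
As a toy illustration of the prototype idea in the Perceptual Magnet Theory above, the sketch below classifies an incoming vowel token by its Euclidean distance to stored (F1, F2) prototypes. The prototype values are made up for the example, and the sketch captures only the distance-to-prototype computation, not the perceptual warping around prototypes that the magnet effect actually describes.

```python
import math

# Made-up (F1, F2) prototypes in Hz for a few vowel categories; real prototypes
# would be shaped by the listener's native-language experience.
PROTOTYPES = {
    "i": (280, 2250),
    "a": (700, 1200),
    "u": (320, 800),
}

def nearest_prototype(f1, f2):
    """Return the vowel category whose stored prototype is closest to the
    incoming (F1, F2) measurement, together with that distance."""
    best, best_d = None, float("inf")
    for vowel, (p1, p2) in PROTOTYPES.items():
        d = math.hypot(f1 - p1, f2 - p2)
        if d < best_d:
            best, best_d = vowel, d
    return best, best_d

print(nearest_prototype(350, 2000))   # -> ('i', ...) for an /i/-like token
```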

The following models are more closely linked to access to the mental lexicon.

  • Episodic Model (Stephen Goldinger): detailed memory traces of each word heard by the listener are stored in their mental lexicon.
  • Cohort Model (William Marslen-Wilson): the listener forms an online cohort of candidate words compatible with the initial sounds heard; as the speech unfolds, candidates are ruled out (a minimal sketch of this narrowing appears after the list).
  • TRACE Model (James McClelland & Jeffrey Elman) suggests that speech perception relies on simultaneous constraints and the integration of independent sources of information. It does not assign a special status to word beginnings.
  • Fuzzy Logical Model of Perception (Dominic Massaro): independent sources of information are integrated to provide an overall degree of support for each alternative; perceptual identification and interpretation are guided by the relative degree of support among the alternatives.
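
As a minimal illustration of the cohort-narrowing idea mentioned above, the sketch below filters a toy lexicon by the portion of the word heard so far. It uses plain orthographic prefix matching over invented entries; an actual cohort model operates over phoneme sequences with graded activation rather than all-or-none elimination.

```python
# Toy lexicon; a real cohort model would use phonemic transcriptions
# and graded activation rather than all-or-none prefix matching.
LEXICON = ["captain", "capital", "captive", "cat", "candle", "dog"]

def cohort(heard_so_far, lexicon=LEXICON):
    """Return the words still compatible with the input heard so far."""
    return [w for w in lexicon if w.startswith(heard_so_far)]

# As more of the word is heard, the cohort shrinks until one candidate remains.
for heard in ["c", "ca", "cap", "capt", "capti"]:
    print(heard, "->", cohort(heard))
```
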
Some factors influencing speech perception

Numerous factors can influence speech perception, making it a complex and multifaceted process, and these factors interact in intricate ways.

Some of these factors include:

  • Background noise. For example, voicing and nasality features are little affected by noise, but place of articulation is severely degraded by low-pass filtering and noisy channels.
  • Coarticulation. Subtle differences in vowel quality, such as the difference between the phonetic realizations of /u/ in /tut/ and in /ʁuʁ/, can only be heard if the vowel is extracted from continuous speech, pointing to the importance of the span of the listening window. The listener compensates for the effects of coarticulation at the level of the syllable.
  • Linguistic experience. The native language shapes the perception of segmental and prosodic cues. For instance, when presented with the English three-syllable word 'MacDonald', a Japanese listener 'hears' and repeats six syllables, /ma.kɯ.do.na.ɾɯ.do/, following the phonotactic properties of Japanese.
  • Familiarity with the speaker. Knowing the speaker and his or her speaking style can facilitate speech perception.
  • Expectations: The listener is biased by what he expects to hear.
  • Speech rate and style. Fast speech may be more challenging to comprehend than slower, well-articulated speech.
  • Contextual, pragmatic cues help listeners interpret meaning and resolve ambiguities.
  • Visual cues from the speaker's face can enhance speech perception, especially in noisy environments or when speech is unclear.
Perception of the segmental aspects of speech

Regardless of their level of sophistication, no acoustic, aerodynamic, articulatory, or physiological analysis can definitively determine whether an intervocalic stop will be perceived as voiced or voiceless, or whether a sentence will be perceived as a question or a simple request for confirmation. Perceptual experiments are necessary. The ear is the final judge.

Phonological features are distinctive characteristics that set one phoneme apart from another in a specific language. Common phonological features include voicing, place and manner of articulation, nasality, vowel height and backness, rounding, length, voice quality, tone, etc.

Each of these phonological features has a set of acoustic correlates: duration, fundamental frequency, harmonic and formant values and relative amplitudes, formant transitions, spectral shape at specific moments, transient and continuous noise, etc. The acoustic correlates associated with the realization of a given phonological feature depend mainly on the phoneme's position within the syllable (onset or coda) and the syllable's position within the word. In American English, as many as 16 acoustic properties contribute to the distinction between 'rapid' and 'rabid' (Lisker, 1986). A perceived /t/ may be realized as a glottal stop or a flap in post-stress position ("butter" /ˈbʌtɚ/ > [ˈbʌʔɚ] or [ˈbʌɾɚ]). It can also surface with nasal release ("garden") or as a nasal consonant ("want to" > "wanna").

Perceptual hallucinations: There can be a discrepancy between the speech signal and what is actually perceived by listeners. For example, a phoneme or a word may be absent or replaced by silence or noise, but still may be ‘perceived’ by the listener; word-final lengthening may be interpreted by the listener as the presence of a pause, etc.

Very young infants exhibit the ability to distinguish nonnative phonemes, although this sensitivity tends to wane between 6 and 12 months as their perceptual abilities become more aligned with the phonetic categories of their native language.

Perception of prosodic aspects of speech

Prosody is a term often used to describe the stress, melodic (intonation), and rhythmic (temporal) aspects of speech. However, it is difficult to study stress, rhythm, and intonation separately, since they are perceived simultaneously and jointly contribute to the particular rhythm of the language. For instance, intonational events are aligned with stressed syllables or boundary syllables, and the regularities of intonational movements contribute to the perceived rhythm. Recurrent rising, lengthened syllables dominate the perception of French rhythm and are imitated in the babbling of French infants, which differs from the falling patterns in the babbling of English infants. Parameters other than those used at the segmental level include pauses and semi-global or global cues, such as the resetting of the fundamental frequency (Fo) baseline and the manipulation of Fo range and register. They also include parameters that are not part of the phonological system of the language in question, such as breathiness or creakiness. In the same way, it is not possible to draw a strict distinction between the segmental (individual sounds) and suprasegmental (larger units of speech) aspects of speech. Prosodic cues are carried by spectral, temporal, and amplitude variations in the realizations of phonemes, such as allophonic variations of the segments according to their position relative to word stress and boundaries.

Independently of the language spoken, the resetting of the Fo and intensity baselines and the strengthening of articulation seem to signal to the listener the presence of a left boundary, while pauses and lengthening signal a right boundary. Similarly, breathy voice may be associated with sensuality, intimacy, and femininity, and higher Fo can convey a sense of dominance, continuity, or interrogation. Interpretation is, however, partly language-dependent: in some cultures, breathy voice may be associated with pathology, and what is perceived as a positive emotion varies across cultures and among different people.

Rhythm is primarily a perceptual phenomenon. It is based on the regular repetition and alternation of similar events at all levels of the prosodic structure, forming coherent patterns. Events, such as syllables in French and stresses in English, are perceived as more regular than they physically are, because they are assimilated to the expected patterns. The child, most likely already in the womb, since he is able to recognize his own language very soon after birth, learns the multi-level regularities present in a language very early. Neonates can even differentiate between languages after habituation.

The language-specific regularities generate a set of implicit or explicit expectations, such as a statistical preference for 1) open over closed syllables, simple over complex syllable structures, and roughly isochronous syllables in French as compared to English; 2) rising intonation in French versus falling intonation in English; or 3) the alternation between strong and weak syllables in English, etc.

The listener is very attentive to disruptions of the expected pattern. A perceived lengthening of the interval between two stresses (in English) or a larger-than-expected rise is interpreted as signaling the underlying presence of a boundary.


References

Several journals focus specifically on speech perception. These journals include Ear and Hearing, the Journal of Speech, Language, and Hearing Research, Perception and Psychophysics, and Hearing Research, among others.

General

Handel, Stephen. 1993. Listening: An introduction to the perception of auditory events. Cambridge, Massachusetts: MIT Press.

Johnson, Keith. 1997. Acoustic and auditory phonetics. Cambridge, MA: Blackwell.

Pisoni, David B. & Robert E. Remez. 2008. The handbook of speech perception. Hoboken, New Jersey: John Wiley & Sons.

Ohala, John J. 1995. Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America 99. 1718–1725.

Cooper, William E. 1979. Speech perception and production: Studies in selective adaptation. Norwood, NJ: Ablex.

Eimas, Peter D., E. R. Siqueland, P. Jusczyk & J. Vigorito. 1971. Speech perception in infants. Science 171. 303–306.

Raphael, Lawrence J., Gloria J. Borden & Katherine S. Harris. 2007. Speech science primer: Physiology, acoustics, and perception of speech. Lippincott Williams & Wilkins.

Psychoacoustics

Chistovich, Ludmilla A. & Valentina V. Lublinskaya. 1979. The ‘center of gravity’ effect in vowel spectra and critical distance between the formants: Psychoacoustical study of the perception of vowel-like stimuli. Hearing Research 1(3). 185–195.

Kozhevnikov, Valerii Aleksandrovich & Ludmila Andreevna Chistovich. 1965. Speech, articulation and perception. Washington, DC: U.S. Department of Commerce.

Models of speech perception

Diehl, Randy L. 1981. Feature detectors for speech: a critical reappraisal. Psychological Bulletin 89(1).

Kuhl, Patricia K. 1991. Human adults and human infants show a ’perceptual magnet effect’ for the prototypes of speech categories, monkeys do not. Perception & psychophysics 50(2). 93–107.

Liberman, Alvin M & Ignatius G Mattingly. 1985. The motor theory of speech perception revised. Cognition 21(1). 1–36.

Marslen-Wilson, William D. 1987. Functional parallelism in spoken word recognition. Cognition 25(1-2). 71–102.

Massaro, Dominic W. 1987. Categorical partition: a fuzzy-logical model of categorization behavior.

McClelland, James L. & Jeffrey L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology 18(1). 1–86.

McGurk, Harry & John MacDonald. 1976. Hearing lips and seeing voices. Nature 264(5588). 746–748.

Stevens, Kenneth N. & Sheila E. Blumstein. 1981. The search for invariant acoustic correlates of phonetic features. In Peter D. Eimas & Joanne L. Miller (eds.), Perspectives on the study of speech, 1–38.

Liberman, Alvin M., Franklin S. Cooper, Donald P. Shankweiler & Michael Studdert-Kennedy. 1967. Perception of the speech code. Psychological Review 74(6). 431–461.

Segmental aspects

Stevens, Kenneth N. 2000. Acoustic phonetics. Cambridge, Massachusetts: The MIT Press. https://mitpress.mit.edu/books/acoustic-phonetics.

Lisker, Leigh. 1986. “Voicing” in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and speech 29(1). 3–11.

Warren, Richard M. & Gary L. Sherman. 1974. Phonemic restorations based on subsequent context. Perception and Psychophysics 16(1). 150–156.

Prosodic aspects

Barbosa, Plinio. 2012. Panorama of experimental prosody research. Proc. of the VIIth GSCP International Conference. Speech and Corpora, 33. 33-42.

Cutler, Anne. 2005. Lexical stress. In David B. Pisoni & Robert E. Remez (eds.), The handbook of speech perception, 264–299. Hoboken, New Jersey: John Wiley & Sons.

Hadding, Kerstin & Michael Studdert-Kennedy. 1974. Are you asking me, telling me, or talking to yourself? Journal of Phonetics 2(1). 7–14.

Lehiste, Ilse. 1970. Suprasegmentals. Cambridge, Massachusetts: MIT Press.

Pierrehumbert, Janet B. 1979. The perception of fundamental frequency declination. The Journal of the Acoustical Society of America 66(2). 363–69.

Studdert-Kennedy, Michael & Kerstin Hadding-Koch. 1973. Auditory and linguistic processes in the perception of intonation contours. Language and Speech 16(4). 293–313.

Thorsen, Nina G. 1980. A study of the perception of sentence intonation — evidence from Danish. The Journal of the Acoustical Society of America 67(3). 1014–1030.

Vaissière, Jacqueline. 2008. Perception of intonation. In David B. Pisoni & Robert E. Remez (eds.), The handbook of speech perception, 236–263. Hoboken, New Jersey: John Wiley & Sons. https://halshs.archives-ouvertes.fr/halshs-00185517/.

More specifically on rhythm

Allen, George D. 1975. Speech rhythm: its relation to performance universals and articulatory timing. Journal of Phonetics 3(2). 75–86.

Delattre, Pierre C. 1963. Comparing the prosodic features of English, German, Spanish and French. Heidelberg, Germany: Julius Groos Verlag.

Fraisse, Paul. 1956. Les structures rythmiques: Étude psychologique. Louvain, Belgium: Publications Universitaires de Louvain.

Lehiste, Ilse. 1977. Isochrony reconsidered. Journal of Phonetics 5(3). 253–263.

Woodrow, Herbert. 1951. Time perception. In Stanley Smith Stevens (ed.), Handbook of experimental psychology, 1224–1236. Hoboken, New Jersey: Wiley-Blackwell.