Speech perception is a difficult subject, less fully understood than speech production. It is a seamless and seemingly innate ability, typically acquired in early infancy, yet it is far less open to conscious scrutiny than speech production.
Speech perception (SP) can be delineated in two distinct ways.
Broadly, SP is the understanding of the full range of information transmitted by the spoken message. It includes 1) the decoding of the linguistic information, drawing on preexisting syntactic, semantic, and pragmatic analyses (equivalent to the written text); 2) the speaker's feelings about what he is saying (doubt, conviction); 3) his relationship with the listener (friendship, irritation); 4) his identity, personality traits, gender, age, and physical and mental state, thanks to cues that are not controlled by the speaker; and 5) the affective impact of the message, which can create empathy or sadness in the listener.

Narrowly, SP is understood as the phonetic analysis of the incoming signal. The continuous and varying acoustic signal is transformed into a sequence of discrete speech units: features, phonemes, diphones, or syllables (the hierarchy is not well established). These units then interact with our mental lexicon.
Both speech and nonspeech sounds are perceived by the same ear and transmitted to the brain via the same auditory nerve; consequently, they share common neural mechanisms. Within the cochlea, the ear performs a form of non-linear frequency analysis on incoming signals, akin to a filter bank whose output can be depicted in a spectrogram. The fine-tuned analysis of low-frequency components (the frequencies of the first spectral peaks and harmonics) is crucial for the detection of nasality, breathiness, speaker characteristics, and voice quality. Both speech and nonspeech sounds undergo temporal and frequency masking, along with discontinuity reinforcement, before they are transmitted to the auditory cortex. Critical bands represent frequency bandwidths within which the ear cannot distinguish individual frequencies: when two formants are in close proximity, they are perceived as a single formant, as in the case of quantal vowels. Integration times also come into play, for example the integration of intensity at the syllable level.
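As a minimal illustration of the critical-band idea, the sketch below converts formant frequencies to the Bark scale and checks whether two formants fall within the roughly 3.5 Bark critical distance reported in the psychoacoustic literature (Chistovich & Lublinskaya 1979), in which case they tend to merge into a single perceived spectral peak. The Traunmüller conversion formula and the threshold value are standard textbook figures, not taken from this article; the example frequencies are invented.

```python
def hz_to_bark(f_hz: float) -> float:
    """Traunmüller's approximation of the Bark (critical-band) scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def perceived_as_one_peak(f_a: float, f_b: float,
                          critical_distance_bark: float = 3.5) -> bool:
    """Two formants closer than ~3.5 Bark tend to merge perceptually
    into a single spectral prominence (Chistovich's critical distance)."""
    return abs(hz_to_bark(f_a) - hz_to_bark(f_b)) < critical_distance_bark

# F2 and F3 close together merge into one perceived peak;
# F1 and F2 of a vowel like [i] are far apart and stay distinct.
print(perceived_as_one_peak(2200.0, 2900.0))  # True: within the critical distance
print(perceived_as_one_peak(300.0, 2300.0))   # False: clearly separate peaks
```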
Speech perception is intricately connected to various research fields: psychoacoustics, which investigates how the human auditory system perceives sounds of all kinds, including noise, speech, and music; auditory physiology, which explores the ear's anatomy and the functions of the auditory nervous system; the ever-evolving fields of fundamental neurobiology and behavioral neuroscience, which delve into the roles of the peripheral and central nervous systems in relation to brain function, mainly using brain imaging; and psycholinguistics, which centers on the mechanisms involved in language acquisition, usage, processing, comprehension, and representation in the mind and brain, mainly using behavioral paradigms.
Many critical research questions in the field of speech perception remain unanswered.
Two types of methods can be used:
Listeners may be asked to identify or discriminate the language or regional dialect, utterance modality, the speaker's identity, gender, age, physical condition, emotional state, the speaker's attitude toward the content (such as irony or incredulity), and his attitude toward the interlocutor (such as disgust or compassion). They may be asked to locate word, phrase, and utterance boundaries or stresses, to press a button when they hear a given phoneme or word, or to evaluate intelligibility, naturalness, speech quality, voice quality, etc.
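A hypothetical sketch of one such behavioral task follows: a forced-choice identification trial that plays a stimulus, records the listener's response, and logs the reaction time. The stimulus names and the play_stimulus() helper are placeholders for real experiment software, not an actual API.

```python
import time

def play_stimulus(stimulus_id: str) -> None:
    """Placeholder for actual audio playback of a stimulus file."""
    print(f"[playing {stimulus_id}]")

def run_identification_trial(stimulus_id: str, choices=("pa", "ba")) -> dict:
    """One forced-choice identification trial with reaction-time logging."""
    play_stimulus(stimulus_id)
    t0 = time.perf_counter()
    response = input(f"Which did you hear? {'/'.join(choices)}: ").strip()
    rt_ms = (time.perf_counter() - t0) * 1000.0
    return {"stimulus": stimulus_id, "response": response, "rt_ms": rt_ms}

# Hypothetical usage over a small block of trials:
# results = [run_identification_trial(s) for s in ("step1.wav", "step4.wav")]
```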
The diversity of participants contributes to a comprehensive understanding of speech perception across different contexts: listeners of different ages and linguistic backgrounds; early and late second- or third-language learners; proficient and struggling readers; children and adults with normal hearing, with profound hearing loss (whether born deaf or deafened after acquiring language skills), or with cochlear implants; monolingual and bilingual individuals; listeners from various cultures; listeners who may or may not have visual access to the speaker's face; musicians and non-musicians; listeners with language problems; etc.
A multitude of software tools, such as Praat, provide the capability to synthesize, visualize, cross-splice, and manipulate speech, and to create continua, such as a /ri/-/li/ continuum, for use in speech perception experiments. Formant and articulatory synthesizers are also readily available for creating stimuli. Other software packages support listening experiments (e.g., labeling, discrimination, reaction time measurement, magnitude estimation, and so on).
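A minimal sketch of how such a continuum can be specified: interpolating a single formant target in equal steps between two endpoints. Here the varied parameter is F3, the main cue distinguishing English /r/ from /l/; the endpoint frequencies are illustrative round numbers, and in practice the targets would be rendered with a formant synthesizer or Praat's manipulation tools rather than used directly.

```python
import numpy as np

def formant_continuum(f_start: float, f_end: float, n_steps: int) -> np.ndarray:
    """Equally spaced formant targets from one endpoint to the other."""
    return np.linspace(f_start, f_end, n_steps)

# Illustrative F3 targets for a 7-step /ri/-/li/ continuum:
# a low F3 (~1600 Hz) cues /r/, a high F3 (~2600 Hz) cues /l/.
f3_targets = formant_continuum(1600.0, 2600.0, 7)
print(np.round(f3_targets))  # [1600. 1767. 1933. 2100. 2267. 2433. 2600.]
```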
Several theories have been proposed in the field of speech perception, each accumulating partial evidence in support of its claims. These theories are, to varying degrees, compatible with one another. Remarkably, the brain adapts to specific contexts and employs numerous strategies to process various facets of speech sounds simultaneously, extracting the information pertinent to the listener while discarding irrelevant details.
The following models concern the identification of vowels and consonants.
The following models are more closely linked to access to the mental lexicon.
Numerous factors can influence speech perception, making it a complex and multifaceted process; these factors interact in intricate ways.
Some of these factors include:
Regardless of their level of sophistication, no acoustic, aerodynamic, articulatory, or physiological analysis can definitively determine whether an intervocalic stop will be perceived as voiced or voiceless, or whether a sentence will be perceived as a question or a simple request for confirmation. Perceptual experiments are necessary. The ear is the final judge.
Phonological features are distinctive characteristics that set one phoneme apart from another in a specific language. Common phonological features composing the phonemes include voicing, place and manner of articulation, nasality, vowel height and backness, rounding, length, voice quality, tone, etc.
Each of these phonological features has a set of acoustic correlates: duration, fundamental frequency, harmonic and formant values and relative amplitudes, formant transitions, spectral shape at specific moments, transient and continuous noise, etc. The acoustic correlates associated with the realization of a given phonological feature depend mainly on the phoneme's position within the syllable (onset or coda) and the syllable's position within the word. In American English, as many as 16 acoustic properties contribute to the distinction between 'rapid' and 'rabid' (Lisker, 1986). A perceived /t/ may be realized as a glottal stop or as a flap in post-stress position ("butter" /ˈbʌtɚ/ > [ˈb̥ʌʔɚ] or [ˈb̥ʌɾɚ]). It can also manifest as a nasally released stop ("garden") or as a nasal consonant in phrases like "want to" > "wanna".
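To make the notion of an acoustic correlate concrete, here is a deliberately naive sketch classifying a stop as voiced or voiceless from a single cue, voice onset time (VOT), using the roughly 30 ms short-lag/long-lag boundary often cited for English stops. The boundary value is a textbook approximation, not taken from this article, and a single-cue threshold is exactly what the surrounding text warns against: many cues trade against each other, and only perceptual tests settle how a token is heard.

```python
def classify_stop_by_vot(vot_ms: float, boundary_ms: float = 30.0) -> str:
    """Naive single-cue decision: short-lag VOT -> 'voiced' percept,
    long-lag VOT -> 'voiceless'. Real perception integrates many cues."""
    return "voiced" if vot_ms < boundary_ms else "voiceless"

print(classify_stop_by_vot(10.0))  # 'voiced'    (e.g., English /b/)
print(classify_stop_by_vot(60.0))  # 'voiceless' (e.g., English /p/)
```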
Perceptual hallucinations: there can be a discrepancy between the speech signal and what listeners actually perceive. For example, a phoneme or a word may be absent or replaced by silence or noise but still be 'perceived' by the listener; word-final lengthening may be interpreted by the listener as the presence of a pause, etc.
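A minimal numpy sketch of how a phonemic-restoration stimulus in the spirit of Warren & Sherman (1974) can be prepared: the samples corresponding to one phoneme are excised and replaced by amplitude-matched noise. The sample rate, segment boundaries, and stand-in waveform are all invented for illustration.

```python
import numpy as np

def replace_segment_with_noise(signal: np.ndarray, start: int, stop: int,
                               rng=np.random.default_rng(0)) -> np.ndarray:
    """Excise signal[start:stop] and replace it with white noise of
    matching RMS amplitude, as in phonemic-restoration stimuli."""
    out = signal.copy()
    segment = signal[start:stop]
    rms = np.sqrt(np.mean(segment ** 2)) if segment.size else 0.0
    out[start:stop] = rng.normal(0.0, rms, stop - start)
    return out

# Stand-in waveform (a 220 Hz tone at 16 kHz); real stimuli are recordings.
sr = 16000
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
masked = replace_segment_with_noise(speech, start=4000, stop=5200)
```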
Very young infants exhibit the ability to distinguish nonnative phonemes, although this sensitivity tends to wane between 6 and 12 months as their perceptual abilities become more aligned with the phonetic categories of their native language.
Prosody is a term often used to describe the stress, melodic (intonation), and rhythmic (temporal) aspects of speech. It is challenging, however, to study stress, rhythm, and intonation separately, since they are perceived simultaneously and jointly contribute to the particular rhythm of a language. For instance, intonational events are aligned with stressed syllables or boundary syllables, and the regularities of intonational movements contribute to the perceived rhythm. Recurrent rising, lengthened syllables dominate the perception of French rhythm and are also imitated in the babbling of French infants, which differs from the babbling of English infants, with its falling patterns.

Beyond the parameters used at the segmental level, prosody draws on pause and on semi-global or global cues, such as resetting the fundamental frequency (Fo) baseline and manipulating the range and register of Fo. It also draws on parameters that are not part of the phonological system of the language in question, such as breathiness or creakiness. In the same way, it is not possible to draw a strict distinction between the segmental (individual sounds) and suprasegmental (larger units of speech) aspects of speech: prosodic cues are carried by spectral, temporal, and amplitude variations in the realizations of phonemes, such as allophonic variations of segments according to their position relative to word stress and boundaries.
Independently of the language spoken, resetting of the Fo and intensity baselines and strengthening of the articulation seem to signal to the listener the presence of a left boundary, while pause and lengthening evoke a right boundary. Similarly, breathy voice might be associated with sensuality, intimacy, and femininity, and higher Fo can convey a sense of dominance, continuity, or interrogation. Interpretation is nevertheless partly language-dependent: in some cultures, breathy voice might be associated with pathology, and what is perceived as a positive emotion may vary across cultures and among different people.
Rhythm is primarily a perceptual phenomenon. It is based on the regular repetition and alternation of similar events at all levels of the prosodic structure, forming coherent patterns. Events such as syllables in French and stresses in English are perceived as more regular than they actually are, because they are expected to conform to recurring patterns. The child learns the multi-level regularities present in a language very early, most likely already in the womb, since newborns are able to recognize their own language very soon after birth. Neonates can even differentiate between languages after habituation.
These language-specific regularities generate a set of implicit or explicit expectations, such as a statistical preference for 1) open over closed syllables, simple over complex syllable structures, and quasi-isochronous syllables in French as compared to English; 2) rising intonation in French versus falling intonation in English; or 3) alternation between strong and weak syllables in English, etc.
The listener is very attentive to disruptions of the expected pattern: a perceived lengthening of the interval between two stresses (in English) or a larger-than-expected rise is interpreted as the underlying presence of a boundary.
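As a rough illustration of this kind of regularity, the sketch below computes the coefficient of variation of inter-stress intervals: the lower the value, the closer the sequence is to isochrony, and an unusually long interval stands out as a candidate boundary. The onset times are invented for illustration; real measurements would come from annotated recordings.

```python
import numpy as np

def interval_regularity(onsets_s: list[float]) -> float:
    """Coefficient of variation of inter-onset intervals:
    0 would mean perfect isochrony; higher values, less regular timing."""
    intervals = np.diff(onsets_s)
    return float(np.std(intervals) / np.mean(intervals))

# Invented stress onsets (in seconds); the stretched final interval is the
# kind of deviation a listener may interpret as a boundary.
onsets = [0.00, 0.42, 0.85, 1.25, 1.66, 2.45]
print(round(interval_regularity(onsets), 2))  # ≈ 0.31
```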
Several journals focus specifically on speech perception. These include Ear and Hearing, the Journal of Speech, Language, and Hearing Research, Perception & Psychophysics, and Speech Hearing Research, among others.
General
Handel, Stephen. 1993. Listening: An introduction to the perception of auditory events. Cambridge, MA: MIT Press.
Johnson, Keith. 1997. Acoustic and auditory phonetics. Cambridge, MA: Blackwell.
Pisoni, David B. & Robert E. Remez. 2008. The handbook of speech perception. Hoboken, New Jersey: John Wiley & Sons.
Ohala, John J. 1996. Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America 99. 1718–1725.
Cooper, William E. 1979. Speech perception and production: Studies in selective adaptation. Norwood, NJ: Ablex.
Eimas, Peter D., Einar R. Siqueland, Peter Jusczyk & James Vigorito. 1971. Speech perception in infants. Science 171. 303–306.
Raphael, Lawrence J., Gloria J. Borden & Katherine S. Harris. 2007. Speech science primer: Physiology, acoustics, and perception of speech. Philadelphia: Lippincott Williams & Wilkins.
Psychoacoustics
Chistovich, Ludmila A. & Valentina V. Lublinskaya. 1979. The 'center of gravity' effect in vowel spectra and critical distance between the formants: Psychoacoustical study of the perception of vowel-like stimuli. Hearing Research 1(3). 185–195.
Kozhevnikov, Valerii Aleksandrovich & Ludmila Andreevna Chistovich. 1965. Speech, articulation and perception. Washington, DC: U.S. Department of Commerce.
Models of speech perception
Diehl, Randy L. 1981. Feature detectors for speech: a critical reappraisal. Psychological Bulletin 89(1).
Kuhl, Patricia K. 1991. Human adults and human infants show a 'perceptual magnet effect' for the prototypes of speech categories, monkeys do not. Perception & Psychophysics 50(2). 93–107.
Liberman, Alvin M. & Ignatius G. Mattingly. 1985. The motor theory of speech perception revised. Cognition 21(1). 1–36.
Marslen-Wilson, William D. 1987. Functional parallelism in spoken word recognition. Cognition 25(1-2). 71–102.
Massaro, Dominic W. 1987. Categorical partition: A fuzzy logical model of categorization behavior. In Stevan Harnad (ed.), Categorical perception: The groundwork of cognition. Cambridge: Cambridge University Press.
McClelland, James L. & Jeffrey L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology 18(1). 1–86.
McGurk, Harry & John MacDonald. 1976. Hearing lips and seeing voices. Nature 264(5588). 746–748.
Stevens, Kenneth N. & Sheila E. Blumstein. 1981. The search for invariant acoustic correlates of phonetic features. In Peter D. Eimas & Joanne L. Miller (eds.), Perspectives on the study of speech, 1–38. Hillsdale, NJ: Lawrence Erlbaum.
Liberman, Alvin M., Franklin S. Cooper, Donald P. Shankweiler & Michael Studdert-Kennedy. 1967. Perception of the speech code. Psychological Review 74(6). 431–461.
Segmental aspects
Stevens, Kenneth N. 2000. Acoustic phonetics. Cambridge, MA: MIT Press. https://mitpress.mit.edu/books/acoustic-phonetics.
Lisker, Leigh. 1986. “Voicing” in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and speech 29(1). 3–11.
Warren, Richard M. & Gary L. Sherman. 1974. Phonemic restorations based on subsequent context. Perception and Psychophysics 16(1). 150–156.
Prosodic aspects
Barbosa, Plinio. 2012. Panorama of experimental prosody research. In Proceedings of the VIIth GSCP International Conference: Speech and Corpora, 33–42.
Cutler, Anne. 2005. Lexical stress. In David B. Pisoni & Robert E. Remez (eds.), The handbook of speech perception, 264–299. Hoboken, New Jersey: John Wiley & Sons.
Hadding, Kerstin & Michael Studdert-Kennedy. 1974. Are you asking me, telling me, or talking to yourself? Journal of Phonetics 2(1). 7–14.
Lehiste, Ilse. 1970. Suprasegmentals. Cambridge, MA: MIT Press.
Pierrehumbert, Janet B. 1979. The perception of fundamental frequency declination. The Journal of the Acoustical Society of America 66(2). 363–369.
Studdert-Kennedy, Michael & Kerstin Hadding-Koch. 1973. Auditory and linguistic processes in the perception of intonation contours. Language and Speech 16(4). 293–313.
Thorsen, Nina G. 1980. A study of the perception of sentence intonation — evidence from Danish. The Journal of the Acoustical Society of America 67(3). 1014–1030.
Vaissière, Jacqueline. 2008. Perception of intonation. In David B. Pisoni & Robert E. Remez (eds.), The handbook of speech perception, 236–263. Hoboken, New Jersey: John Wiley & Sons. https://halshs.archives-ouvertes.fr/halshs-00185517/.
More specifically on rhythm
Allen, George D. 1975. Speech rhythm: its relation to performance universals and articulatory timing. Journal of Phonetics 3(2). 75–86.
Delattre, Pierre C. 1963. Comparing the prosodic features of English, German, Spanish and French. Heidelberg, Germany: Julius Groos Verlag.
Fraisse, Paul. 1956. Les structures rythmiques: Étude psychologique. Louvain, Belgium: Publications Universitaires de Louvain.
Lehiste, Ilse. 1977. Isochrony reconsidered. Journal of Phonetics 5(3). 253–263.
Woodrow, Herbert. 1951. Time perception. In Stanley Smith Stevens (ed.), Handbook of experimental psychology, 1224–1236. Hoboken, New Jersey: Wiley-Blackwell.