Tools and methods for acoustic and perceptual analysis
Nicolas Audibert | Université Sorbonne Nouvelle

This entry provides an overview of the methods and technical tools most widely used in acoustic and perceptual phonetics. To limit redundancy with the other entries, we first briefly introduce the general characteristics of the information collected and some uses of these data for speech analysis, before mentioning some of the tools used to obtain them. The first section focuses on tools for annotating acoustic data, and in particular on forced alignment. The second section deals with tools for extracting acoustic measurements from the speech signal. Finally, the third section presents some tools dedicated to the collection of perceptual data.

Acoustic data annotation

In the vast majority of cases, data annotation is a prerequisite for the acoustic analysis of speech, which requires knowledge of certain characteristics of the speech units being analyzed. The most common level of annotation is that of the phonetic segment (also called phone), which corresponds to the realization of the individual sounds produced one after the other in the speech stream. Annotation then consists of associating symbols (IPA or an equivalent coding such as SAMPA notation) with the start and end time boundaries of the annotated segments. Depending on the needs of the analysis, this level of annotation can be replaced or supplemented by higher-level units (syllables, words, sequences separated by silent pauses, speech turns or other features relevant to the analysis of spoken communication) and/or by discrete acoustic events.

Although other annotation tools associated with different formats may be applied to physiological speech data or video streams, Praat's TextGrid format (Boersma, 2001) has become the standard for annotating acoustic speech data. This format, also compatible with many other analysis tools, allows multiple mutually independent annotation tiers to be defined, each containing either a set of temporal intervals (each associated with a start time, an end time and a symbolic label) or annotations of discrete events.
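
By way of illustration, the following minimal sketch reads the intervals of the first tier of a TextGrid through the Parselmouth library introduced below; the file name is a placeholder, and the assumption that tier 1 is a phone-level interval tier would need to be adapted to the actual annotation.

```python
import parselmouth
from parselmouth.praat import call

# Placeholder file name; tier 1 is assumed to be a phone-level interval tier.
tg = parselmouth.read("example.TextGrid")

n_intervals = int(call(tg, "Get number of intervals", 1))
for i in range(1, n_intervals + 1):
    label = call(tg, "Get label of interval", 1, i)
    start = call(tg, "Get starting point", 1, i)
    end = call(tg, "Get end point", 1, i)
    if label.strip():  # skip unlabelled intervals (e.g. silences left empty)
        print(f"{label}\t{start:.3f}\t{end:.3f}")
```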

The classic approach to phonetic annotation consists of manual segmentation into phones based on auditory cues and, above all, visual inspection of the spectrogram and oscillogram. Since the early 2000s, the development of forced alignment systems has made it possible to automate this annotation process and obtain a segmentation into phones and words. For healthy adult speech, this automatic segmentation is in most cases of sufficient quality to derive usable duration and acoustic measurements, provided that an accurate (generally orthographic) transcription is available. Some of these systems now offer grapheme-to-phoneme conversion and acoustic models for many languages, and sometimes the possibility of adding models for new languages.

Among the forced alignment tools currently in widespread use and capable of handling a large number of languages are the online tool WebMAUS (Kisler et al., 2017), integrated with the BAS Web Services speech annotation platform, and the Montreal Forced Aligner command-line tool (McAuliffe et al., 2017), which offers the user numerous parameterization and adaptation possibilities.
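
As an illustration, here is a hedged sketch of a call to the Montreal Forced Aligner from Python. It assumes that MFA (version 2.x) is installed, that the pretrained english_us_arpa dictionary and acoustic model are suitable for the data, and that corpus_dir contains audio files paired with orthographic transcripts; all paths and model names are placeholders.

```python
import subprocess

# Download pretrained models (names are placeholders for models suited to the data).
subprocess.run(["mfa", "model", "download", "dictionary", "english_us_arpa"], check=True)
subprocess.run(["mfa", "model", "download", "acoustic", "english_us_arpa"], check=True)

# Align a corpus of audio files with matching orthographic transcripts;
# the output directory will contain one TextGrid per audio file.
subprocess.run(
    ["mfa", "align", "corpus_dir", "english_us_arpa", "english_us_arpa", "aligned_textgrids"],
    check=True,
)
```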

Extraction of acoustic measurements

The approach to quantitative acoustic analysis most commonly used in phonetics consists of extracting measurements assumed to be representative of a particular phenomenon on a given class of segments, taking the annotation of the data into account. In addition to temporal measurements derived directly from the annotation, a significant proportion of these measurements are projections of part of the spectral information, obtained by means of a Fourier transform applied to a frame of the acoustic signal whose duration depends on the goals of the analysis. Less commonly, the spectral envelope is estimated using Linear Predictive Coding (see, for example, Fulop (2011) for a detailed presentation of the different types of spectral representations of speech and the associated calculation methods). These measurements can be considered at a fixed point in time, typically the middle of the realization of a segment, regarded as its most stable point, or over a series of consecutive and partially overlapping frames to account for some of the dynamic spectral information, as represented in a spectrogram. In most cases, only the power spectrum is taken into account when extracting acoustic measurements, the phase being ignored. Moreover, most measurements derived from spectral information focus on frequencies not exceeding 8000 Hz, which makes them compatible with the 16000 Hz sampling frequency classically used in automatic speech processing applications.
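
For instance, the following sketch computes the power spectrum of a single windowed frame with NumPy and SciPy; the file name, the 25 ms frame duration and the analysis point are arbitrary assumptions.

```python
import numpy as np
from scipy.io import wavfile

# Placeholder mono file, assumed to be sampled at 16000 Hz.
fs, signal = wavfile.read("example.wav")

frame_dur = 0.025   # 25 ms analysis frame (assumption)
center = 0.500      # analysis point in seconds, e.g. the midpoint of an annotated segment
n = int(frame_dur * fs)
start = int(center * fs) - n // 2
frame = signal[start:start + n].astype(float) * np.hamming(n)

# Power spectrum: the phase is discarded, and only frequencies up to 8000 Hz are kept.
spectrum = np.fft.rfft(frame)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
power_db = 10 * np.log10(np.abs(spectrum[freqs <= 8000]) ** 2 + 1e-12)
```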

The spectral measure classically employed to characterize vowel articulation is formant frequency, generally defined as the frequency of spectral energy peaks (see for example Maurer (2016)). Reference can be made to Kent & Vorperian (2018) for a review of the various methods for analyzing formant frequencies, the main potential sources of error, and the measures that can be derived from formant values to characterize vowels. Although the overwhelming majority of approaches are limited to the frequencies of the first two formants, which is generally considered sufficient for English vowels, consideration of the third formant may prove necessary to characterize the rounding contrast and/or the convergence between close formants in the case of focal vowels (Vaissière, 2011). Beyond the characterization of vowels themselves, formant frequency transitions at the onset of vowel realization, in particular that of the second formant, make it possible to characterize the place of articulation of the preceding consonant and certain aspects of consonant-to-vowel coarticulation, and form the basis of locus equations (Sussman et al., 1991).
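
A minimal sketch of formant frequency estimation with Parselmouth (Burg method with Praat's default settings) is given below; the file name, the analysis time and the 5500 Hz formant ceiling are assumptions to be adapted to the recording and the speaker.

```python
import parselmouth

# Placeholder file; the 5500 Hz ceiling corresponds to Praat's default for a female voice.
snd = parselmouth.Sound("vowel.wav")
formants = snd.to_formant_burg(maximum_formant=5500.0)

t = 0.500  # e.g. the midpoint of an annotated vowel, taken from the TextGrid
f1 = formants.get_value_at_time(1, t)
f2 = formants.get_value_at_time(2, t)
f3 = formants.get_value_at_time(3, t)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz, F3 = {f3:.0f} Hz")
```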

Spectral measurements applied to consonants focus mainly on the characterization of frication noise, particularly in the production of fricatives, but also, to a lesser extent, in the release of stops when the annotation is precise enough to delimit it. The most common measures for characterizing the noise spectrum are based on spectral moments (mainly the first spectral moment, or center of gravity, CoG) or on the frequency of the energy peak, but other measures taking into account the energy distribution at frequencies up to 14 kHz have also been proposed to better account for the spectral properties of fricatives (Shadle et al., 2023).
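
As an example, the sketch below computes the first two spectral moments of a pre-segmented fricative with Parselmouth; the file name is a placeholder and the power weighting follows Praat's default.

```python
import parselmouth

# Placeholder file assumed to contain only the frication noise interval.
snd = parselmouth.Sound("fricative.wav")
spectrum = snd.to_spectrum()

cog = spectrum.get_center_of_gravity(power=2.0)   # first spectral moment (CoG)
sd = spectrum.get_standard_deviation(power=2.0)   # second spectral moment
print(f"Centre of gravity: {cog:.0f} Hz, spectral standard deviation: {sd:.0f} Hz")
```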

In addition, other frequently used acoustic measurements seek to capture voice-related characteristics, first and foremost the fundamental frequency f0, which primarily determines the perceived pitch. See Vaysse et al. (2022) for a review of different fundamental frequency estimation algorithms and a comparative evaluation of their performance in the particularly challenging context of pathological speech. Most other acoustic voice measurements, which seek to capture features related to various dimensions of voice quality, rely on fundamental frequency detection or other periodicity measures. This is the case for measurements of amplitude differences between harmonics (see, for example, Kreiman et al. (2012)), which reflect the degree of vocal tension. Measures of the degree of aperiodicity in the voice, through the irregularity of the duration of glottal cycles (jitter) or of their amplitude (shimmer), or through the energy of the harmonic structure relative to that of the noise (harmonics-to-noise ratio, HNR), are also in common use.
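
The following sketch illustrates how such voice measurements can be obtained with Parselmouth, mostly by calling the corresponding Praat commands; the file name and the 75-600 Hz pitch range are assumptions to be adapted to the speaker, and the remaining arguments reproduce Praat's default settings.

```python
import parselmouth
from parselmouth.praat import call

# Placeholder recording, e.g. a sustained vowel; the pitch range is an assumption.
snd = parselmouth.Sound("voice.wav")

pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=600.0)
mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")

# Jitter and shimmer are computed from the sequence of glottal pulses (PointProcess).
pulses = call(snd, "To PointProcess (periodic, cc)", 75.0, 600.0)
jitter_local = call(pulses, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer_local = call([snd, pulses], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

# Mean harmonics-to-noise ratio (HNR) in dB.
harmonicity = snd.to_harmonicity_cc()
mean_hnr = call(harmonicity, "Get mean", 0, 0)
```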

Among other acoustic measures of consonants, Voice Onset Time (VOT), applied to stop consonants, is derived from temporal annotation based on acoustic information, most often obtained manually but in some cases extracted automatically (see, for example, Stuart-Smith et al. (2015)).

One may also mention the use of parameters derived from cepstral representations of the speech signal, such as Mel-frequency cepstral coefficients (MFCCs), which describe the main characteristics of the spectrum in a non-redundant way with a restricted number of numerical values, and have therefore been widely used in automatic speech processing (although they tend to be replaced by modeling directly from the acoustic signal in end-to-end deep neural networks). Cepstral representations also form the basis of some voice measurements, notably cepstral peak prominence (CPP; Hillenbrand & Houde, 1996), used as an acoustic correlate of breathiness and to characterize dysphonic voices.
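
For illustration, a minimal sketch of MFCC extraction with Parselmouth is given below; the file name and the choice of 12 coefficients per frame are assumptions.

```python
import parselmouth

# Placeholder file; 12 coefficients per frame is a common but arbitrary choice.
snd = parselmouth.Sound("example.wav")
mfcc = snd.to_mfcc(number_of_coefficients=12)

features = mfcc.to_array()  # one column of cepstral coefficients per analysis frame
print(features.shape)
```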

Among the tools available for extracting these different acoustic measures, the most widely used in phonetics is undoubtedly Praat (Boersma, 2001), most of whose analysis methods are also accessible via the Parselmouth Python library (Jadoul et al., 2018). For the analysis of large datasets, openSMILE (Eyben et al., 2010) can also be used to efficiently extract a large number of acoustic measurements from consecutive frames of fixed duration. In addition to the numerous tools dedicated to the extraction of a specific measurement, toolboxes such as SPTK (Yoshimura et al., 2023) offer a wide range of command-line tools for acoustic speech analysis. One may also mention tools more specifically dedicated to voice analysis, such as VoiceSauce (Shue et al., 2011), as well as the MDVP software developed by Kay Elemetrics, which is often considered the standard for voice analysis in a medical context.

Tools for studying speech perception

Whatever the paradigm used, the study of speech perception requires experimental protocols that combine the randomized presentation of audio stimuli, sometimes paired with visual stimuli, with the collection of subjects' responses to the proposed task, generally in the form of a forced choice or a rating on a Likert scale. In tasks in which subjects' reaction time is measured, and in cross-modal priming experiments, it is also necessary to ensure fine synchronization between stimulus presentation and response collection. In addition to the perceptual task itself, most protocols are complemented by the collection of demographic information on the participants and any other metadata required for the subsequent analysis of the results.

Numerous tools dedicated to setting up these protocols in a controlled laboratory context have been developed for general use in experimental psychology or other social science fields, such as OpenSesame (Mathôt et al., 2012) or PsychoPy (Peirce et al., 2019), which enable experiments to be set up from a graphical user interface, with support for behavioral measures such as eye-tracking and the possibility of adding extensions coded in the Python language. Other tools, such as PERCEVAL (André et al., 2003), are more specifically oriented towards speech perception.
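
As an illustration, the sketch below implements a single forced-choice trial in PsychoPy; the stimulus file, response keys and timeout are placeholders, and a real protocol would add stimulus randomization, response logging and more careful audio-visual synchronization.

```python
from psychopy import core, event, sound, visual

# Placeholder stimulus and response keys for a two-alternative forced-choice trial.
win = visual.Window(fullscr=False)
prompt = visual.TextStim(win, text="Press F or J")
stimulus = sound.Sound("stimulus.wav")
clock = core.Clock()

prompt.draw()
win.flip()
stimulus.play()
clock.reset()  # reaction times are measured from (approximate) stimulus onset
keys = event.waitKeys(maxWait=3.0, keyList=["f", "j"], timeStamped=clock)
response, rt = keys[0] if keys else (None, None)

win.close()
core.quit()
```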

In recent years, a number of platforms dedicated to setting up online experiments have been developed. While online experiments obviously do not offer the same level of control over environmental conditions as a laboratory setting, they do provide easier access to a larger number of listeners, especially when specific inclusion criteria (e.g. multilingualism or expertise in a particular field) constrain recruitment possibilities. While some platforms, such as JATOS (Lange et al., 2015), offer great versatility but require installation on a server and programming skills, others allow both hosting of experiment data and easy handling by researchers and students. This is the case for PsyToolkit (Stoet, 2017), which, unlike other comparable tools, has the advantage of being free for academic use.


References

André, C., Ghio, A., Cavé, C., & Teston, B. (2003). PERCEVAL: a Computer-Driven System for Experimentation on Auditory and Visual Perception. International Congress of Phonetic Sciences (ICPhS), 1421–1424.

Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5(9), 341–345.

Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor. Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462.

Fulop, S. A. (2011). Speech Spectrum Analysis. Springer Science & Business Media.

Hillenbrand, J., & Houde, R. A. (1996). Acoustic Correlates of Breathy Vocal Quality: Dysphonic Voices and Continuous Speech. Journal of Speech, Language, and Hearing Research, 39(2), 311–321. https://doi.org/10.1044/jshr.3902.311

Jadoul, Y., Thompson, B., & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71, 1–15. https://doi.org/10.1016/j.wocn.2018.07.001

Kent, R. D., & Vorperian, H. K. (2018). Static measurements of vowel formant frequencies and bandwidths: A review. Journal of Communication Disorders, 74, 74–97.

Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326–347.

Kreiman, J., Shue, Y.-L., Chen, G., Iseli, M., Gerratt, B. R., Neubauer, J., & Alwan, A. (2012). Variability in the relationships among voice quality, harmonic amplitudes, open quotient, and glottal area waveform shape in sustained phonation. The Journal of the Acoustical Society of America, 132(4), 2625–2632. https://doi.org/10.1121/1.4747007

Lange, K., Kühn, S., & Filevich, E. (2015). “Just Another Tool for Online Studies” (JATOS): An Easy Solution for Setup and Management of Web Servers Supporting Online Studies. PLOS ONE, 10(6), e0130834. https://doi.org/10.1371/journal.pone.0130834

Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-source, graphical experiment builder for the social sciences. Behavior Research Methods, 44(2), 314–324. https://doi.org/10.3758/s13428-011-0168-7

Maurer, D. (2016). Acoustics of the Vowel – Preliminaries. Peter Lang International Academic Publishers.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Proceedings of Interspeech 2017, 498–502. https://doi.org/10.21437/Interspeech.2017-1386

Peirce, J., Gray, J. R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1), 195–203. https://doi.org/10.3758/s13428-018-01193-y

Shadle, C. H., Chen, W.-R., Koenig, L. L., & Preston, J. L. (2023). Refining and extending measures for fricative spectra, with special attention to the high-frequency range. The Journal of the Acoustical Society of America, 154(3), 1932–1944. https://doi.org/10.1121/10.0021075

Shue, Y.-L., Keating, P., Vicenik, C., & Yu, K. (2011). VoiceSauce: A program for voice analysis. Proceedings of the Seventeenth International Congress of Phonetic Sciences, 1846–1849.

Stoet, G. (2017). PsyToolkit: A Novel Web-Based Method for Running Online Questionnaires and Reaction-Time Experiments. Teaching of Psychology, 44(1), 24–31. https://doi.org/10.1177/0098628316677643

Stuart-Smith, J., Sonderegger, M., Rathcke, T., & Macdonald, R. (2015). The private life of stops: VOT in a real-time corpus of spontaneous Glaswegian. Laboratory Phonology, 6(3–4), 505–549.

Sussman, H. M., McCaffrey, H. A., & Matthews, S. A. (1991). An investigation of locus equations as a source of relational invariance for stop place categorization. The Journal of the Acoustical Society of America, 90(3), 1309–1325. https://doi.org/10.1121/1.401923

Vaissière, J. (2011). On the acoustic and perceptual characterization of reference vowels in a cross-language perspective. The 17th International Congress of Phonetic Sciences (ICPhS XVII), 52–59.

Vaysse, R., Astésano, C., & Farinas, J. (2022). Performance analysis of various fundamental frequency estimation algorithms in the context of pathological speech. The Journal of the Acoustical Society of America, 152(5), 3091–3101.

Yoshimura, T., Fujimoto, T., Oura, K., & Tokuda, K. (2023). SPTK4: An open-source software toolkit for speech signal processing. 12th Speech Synthesis Workshop (SSW) 2023.