Speech processing is a multidisciplinary field that integrates linguistics, computer science, and engineering concepts to analyze and model human speech (Rabiner & Schafer, 2011; Jurafsky & Martin, 2024). Its methodologies and tools are increasingly employed in linguistic research, where they assist in the analysis, transformation, and synthesis of spoken language data. By automating tasks such as transcription, segmentation, and acoustic analysis, speech processing facilitates the study of linguistic phenomena across domains such as phonetics, phonology, morphology, syntax, semantics, and pragmatics.
Speech processing focuses on the computational modeling and manipulation of speech signals. The field encompasses tasks at varying levels of complexity, from low-level signal analysis, such as voice activity detection (VAD, the identification of moments when speech is the dominant event; Lavechin et al., 2020) and source separation (recovering a signal of interest from a mixture; Scheibler et al., 2023), to higher-level tasks, such as emotion recognition (Khalil et al., 2019) and spoken language understanding (SLU; Ghannay et al., 2021). These tasks have been applied to linguistic research to address questions about speech production, perception, and variation (Rilliard et al., 2018). For example, forced alignment enables precise phonetic data labeling, allowing, e.g., detailed investigations into segmental timing, prosodic patterns, or vowel quality (McAuliffe et al., 2017). Speaker diarization (i.e., the attribution of speech segments to their respective speakers in multi-speaker recordings; Bredin, 2023) and speaker identification (Nagrani et al., 2017) contribute to the analysis of conversational dynamics, sociolinguistic variation, and speaker-specific traits.
Speech generation and transformation techniques, such as text-to-speech (TTS) synthesis (Dutoit, 1997) and vocoding (Dudley, 1938), offer experimental tools for linguists to study perception and production using controlled stimuli. Algorithms like TD-PSOLA (time-domain pitch-synchronous overlap and add; Moulines & Charpentier, 1990) and Kawahara's STRAIGHT vocoder (Kawahara et al., 1999) enable the manipulation of speech parameters, including fundamental frequency (F0), segmental duration, and, for the latter, spectral features, which can be used to create or control stimuli for perceptual experiments. These tools provide opportunities for experimental validation within linguistic research.
This text provides an overview of speech processing methods, organizing them into tasks that range from signal-level analysis to higher-order transformations and applications. It highlights their relevance to linguistic research, particularly for graduate students, aiming to demonstrate how computational approaches can support and extend traditional linguistic methodologies.
VAD is a preprocessing step in speech processing that identifies speech segments within an audio signal, i.e., moments when speech is the dominant event in the audio scene rather than music or noise. It is a key step before subsequent tasks such as diarization, automatic speech recognition (ASR), and speaker identification, as it isolates the audio portions containing spoken voice. This functionality is essential for linguistic research, among other fields, where spoken language analysis often relies on segmenting speech from mixed or noisy recordings. In fieldwork settings, where recording conditions are variable, VAD facilitates transcription, phonetic analysis, and the preparation of corpora for linguistic investigation.
Current approaches to VAD frequently rely on machine learning (ML) techniques, including deep neural networks, which allow for improved adaptability across diverse acoustic environments. Tools such as the Pyannote.audio framework incorporate these techniques to achieve effective speech segmentation in a range of applications (Bredin, 2023). This framework also supports integration with related tasks, including speaker diarization and source separation, making it particularly valuable for studies involving conversational dynamics or the interaction of overlapping speakers.
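As an illustration, a VAD pipeline of this kind can be applied in a few lines of Python. The following is a minimal sketch assuming the pyannote.audio package and its pretrained "pyannote/voice-activity-detection" pipeline; model names, versions, authentication details, and the example file name are illustrative and may differ across releases.

```python
# Minimal sketch: voice activity detection with a pretrained pyannote.audio
# pipeline (model identifier and authentication handling may vary by version).
from pyannote.audio import Pipeline

# Loading the pretrained pipeline requires a Hugging Face access token.
pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                                    use_auth_token="HF_TOKEN")

vad = pipeline("fieldwork_recording.wav")

# Print the start and end time (in seconds) of each detected speech region.
for segment in vad.get_timeline().support():
    print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")
```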
In linguistic research, the application of VAD has enabled more efficient processing of audio data, particularly in large-scale studies. By reducing the need for manual segmentation, it contributes to the construction of large annotated datasets that can serve as input for tasks such as ASR, TTS synthesis, or phonetic studies. For example, automatic identification of speech intervals can aid in quantifying prosodic patterns or segmental timing in cross-linguistic comparisons.
Diarization refers to the process of partitioning an audio stream into continuous speech segments according to who is speaking, identifying and labeling the distinct speaker turns that make up a conversation or interaction. Each resulting segment is labeled with a unique ID corresponding to the identified speaker.
This task is particularly well suited for analyzing conversations or multiparty interactions, as it allows for the precise attribution of speech segments to specific speakers. Diarization techniques typically rely on acoustic features along with ML algorithms such as clustering methods or deep learning approaches (Anguera et al., 2012; Wang et al., 2018). The development of diarization tools has advanced significantly in recent years, particularly with the integration of speaker embeddings and deep neural networks, improving accuracy in complex scenarios (Bredin, 2023). Again [1], the Pyannote.audio framework offers a performant open-source solution. A spin-off of this project, pyannoteAI, provides a commercial tool advertised as more powerful and easier to use.
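In practice, diarization can be run with the same framework as VAD. The sketch below assumes the pyannote.audio package and one of its pretrained diarization pipelines; the model identifier, token handling, and file name are illustrative.

```python
# Minimal sketch: speaker diarization with a pretrained pyannote.audio
# pipeline (exact model identifier and API details vary between releases).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")

diarization = pipeline("interview.wav")

# Each track is a time interval attributed to an anonymous speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f}s  {turn.end:7.2f}s  {speaker}")
```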
In linguistic studies, speaker diarization allows the investigation of various aspects of spoken language, including turn-taking, conversational dynamics, and discourse structure. It is particularly useful in studying how speakers alternate turns in a conversation, which can provide insights into social and cultural communication practices, such as interruptions or pauses (Sacks et al., 1974). It allows researchers to analyze the temporal organization of dialogue and how speech acts are distributed among speakers (Jefferson, 2004). It is an important step in the corpus construction process, where distinguishing between different speakers in recorded conversations allows for further analysis of speech patterns and linguistic phenomena (Barras et al., 2006).
Speaker identification refers to the process of recognizing or verifying an individual based on their voice (Jahangir et al., 2020). It is achieved by extracting vocal features, such as F0, formant structure, or speech rate, and comparing them to a database of known speakers. Traditional techniques, such as Gaussian mixture models (GMMs; Reynolds, 2002) or identity vectors (i-Vectors; Dehak et al., 2011), rely on statistical models to characterize speaker-specific vocal traits. More recently, deep learning methods, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been applied to speaker identification, improving the accuracy and robustness of speaker recognition systems (Snyder et al., 2018; Bai & Zhang, 2021).
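The comparison step common to these approaches can be sketched independently of any specific toolkit: a fixed-dimensional embedding (e.g., an i-vector or x-vector) is extracted from the unknown utterance and scored against enrolled speakers, here with cosine similarity. The `extract_embedding` function below is a hypothetical placeholder for whichever front end is actually used.

```python
import numpy as np

def extract_embedding(wav_path: str) -> np.ndarray:
    """Placeholder: return a fixed-dimensional speaker embedding
    (e.g., an i-vector or x-vector) for the given recording."""
    raise NotImplementedError

def identify(test_wav: str, enrolled: dict[str, np.ndarray]) -> str:
    """Return the enrolled speaker whose embedding is closest
    (by cosine similarity) to the test utterance."""
    test = extract_embedding(test_wav)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {name: cosine(test, emb) for name, emb in enrolled.items()}
    return max(scores, key=scores.get)
```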
In phonetic research, speaker identification has a wide range of applications, particularly in the fields of sociophonetics and forensic phonetics. In sociophonetics, the ability to identify individual speakers allows for the study of phonetic variation and the social factors influencing speech. For instance, by analyzing speech samples from specific speakers in a community, researchers can investigate how factors such as age, gender, or social class affect pronunciation (Adda-Decker & Lamel, 2017). Moreover, speaker identification can be used to explore dialectal variation and the ways in which linguistic features correlate with regional or social identities (Wells, 1982).
In forensic phonetics, speaker identification is used to determine the identity of individuals in criminal investigations from audio recordings (Batliner et al., 2020). This process involves comparing unknown voice samples to a database of known speakers, which can provide evidence in various legal cases (Rose, 2002). Advances in speaker identification systems, driven by ML approaches, have further increased the reliability of such methods in legal contexts (Hansen & Hasan, 2015).
Source separation refers to the task of disentangling audio mixtures into their constituent sound sources, such as separating speech from background noise or distinguishing between overlapping speakers. This task is critical in scenarios where clear and isolated audio signals are required for analysis. Traditional methods, such as independent component analysis (ICA), leverage statistical independence among sources to achieve separation (Comon & Jutten, 2010). Recent advancements, however, have been driven by deep learning techniques, including supervised and unsupervised approaches that exploit large-scale datasets and neural architectures to improve separation accuracy and robustness (Hershey et al., 2016; Subakan et al., 2021).
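As a simple illustration of the statistical approach, independent component analysis can be applied to a multichannel recording with scikit-learn. The sketch below assumes the soundfile and scikit-learn packages and a two-channel mixture of two sources; real field recordings rarely satisfy ICA's assumptions this neatly, so this is an illustration of the principle rather than a field-ready recipe.

```python
import numpy as np
import soundfile as sf
from sklearn.decomposition import FastICA

# Load a stereo recording: x has shape (n_samples, 2).
x, sr = sf.read("two_speakers_stereo.wav")

# Estimate two maximally independent sources from the two channels.
ica = FastICA(n_components=2, random_state=0)
sources = ica.fit_transform(x)          # shape (n_samples, 2)

# Rescale and save each estimated source separately.
for i in range(sources.shape[1]):
    s = sources[:, i] / np.max(np.abs(sources[:, i]))
    sf.write(f"estimated_source_{i}.wav", s, sr)
```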
For linguistic research, source separation techniques are particularly useful in field recordings, where environmental noise and overlapping speech are common. These methods enable researchers to isolate individual speakers, making it possible to analyze linguistic phenomena that might otherwise be obscured. Note that the separation process may alter the original signal in unknown and uncontrolled ways, so the acoustic characteristics of a given speaker's voice may not be faithfully retrieved with such methods; the same caveat already applies to the original signal when recordings were made in adverse conditions.
These tools are essential for creating clean datasets for training ASR systems or developing linguistic corpora for endangered languages, where recording conditions are often suboptimal (Avanzi & de Mareüil, 2017). Modern frameworks, such as Pyannote and other source-separation pipelines, offer ready-to-use implementations of state-of-the-art separation models (Kalda et al., 2024), making them accessible to linguistic researchers without extensive programming expertise.
Forced alignment is an automated process that aligns an audio recording with its corresponding transcription at the utterance, word, or phoneme level, producing precise time-stamped annotations for each segment. Tools handling this task combine speech recognition algorithms with phonetic models to map transcriptions onto the temporal structure of speech. Early implementations of forced alignment (Schiel, 1999), as well as more recent systems (McAuliffe et al., 2017), still rely on hidden Markov models (HMMs). Other recent systems incorporate deep learning approaches to improve accuracy and robustness, such as Zhu et al.'s (2022) phone-to-audio alignment system based on wav2vec 2.0 (Baevski et al., 2020).
In phonetic and phonological research, forced alignment is widely utilized to study speech timing, segmental duration, and articulation. Researchers use it to analyze how phonetic segments are realized within different prosodic contexts, enabling investigations into segmental reduction, coarticulation, and syllable timing (Adda-Decker et al., 2005). By providing precise temporal information, forced alignment also facilitates the study of suprasegmental features, such as stress, rhythm, and intonation patterns, across diverse linguistic datasets (Ladd, 2008; Raso et al., 2023). Available tools handling forced alignment include the Montreal Forced Aligner (MFA; McAuliffe et al., 2017) and MAUS [2] (Schiel, 1999).
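In practice, aligners such as MFA are run from the command line over a corpus and a pronunciation dictionary and produce Praat TextGrid files; segment durations can then be collected with a short script. The sketch below assumes the third-party textgrid package, a phone tier named "phones", and an ARPABET-style label set; tier names and labels depend on the aligner's configuration.

```python
# Sketch: collecting vowel durations from a forced-alignment TextGrid,
# assuming the `textgrid` package and a tier named "phones".
import textgrid

VOWELS = {"AA", "AE", "AH", "AO", "EH", "IY", "UW"}   # example ARPABET subset

tg = textgrid.TextGrid.fromFile("utterance.TextGrid")
phone_tier = tg.getFirst("phones")

for interval in phone_tier:
    label = interval.mark.strip()
    if label and label.rstrip("012") in VOWELS:       # drop stress digits
        duration = interval.maxTime - interval.minTime
        print(f"{label}\t{duration:.3f}s")
```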
For linguists working with large corpora, forced alignment automates a traditionally time-consuming task, allowing for the efficient processing of large speech corpora. This automation enables extensive analyses of phonetic variation across languages, dialects, or sociolinguistic groups, contributing to a broader understanding of speech patterns and linguistic diversity (Labov et al., 2006). Beyond linguistic analysis, it has practical applications in language documentation, facilitating the transcription and annotation of speech. For instance, there has been a growing interest in applying these methodologies to under-resourced languages (Strickland et al., 2024).
ASR systems convert spoken language into written text. They have applications in linguistic research, particularly in creating transcriptions for large spoken corpora and enabling phonological analysis. As with forced alignment, early systems relied on statistical techniques such as HMMs paired with GMMs. Modern ASR systems, however, are primarily based on deep neural networks, which significantly enhance performance by learning robust representations from vast amounts of labeled and unlabeled data (Hinton et al., 2012; Radford et al., 2023). These models [3] are designed to accommodate variability in accents, languages, speaking styles, and environmental noise, making them increasingly robust and accurate (Evrard, 2021).
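For orientation, such pretrained models can be used directly from Python. The following minimal sketch assumes the openai-whisper package (Radford et al., 2023); the file name, model size, and language setting are illustrative.

```python
import whisper

# Load a multilingual model; larger checkpoints are slower but more accurate.
model = whisper.load_model("small")

# Transcribe a recording; the language can be forced or auto-detected.
result = model.transcribe("interview.wav", language="fr")

print(result["text"])                      # full transcription
for seg in result["segments"]:             # rough utterance-level timing
    print(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text']}")
```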
ASR systems are critical in transcribing large spoken corpora in linguistic research, enabling efficient data processing for studies across various subfields. Combined with forced alignment systems, they facilitate phonological analysis by providing large-scale datasets for examining segmental and suprasegmental features, such as vowel variation or intonation patterns. In field linguistics and language documentation, they accelerate the transcription process, aiding in the development of linguistic resources. While modern systems, especially those pretrained on large datasets, can generalize across many languages, adapting them to low-resource languages remains an active area of research (Jimerson et al., 2023).
ASR systems support integration with other speech technologies, such as speaker diarization and forced alignment, to create richly annotated datasets for conversational and interactional analysis (e.g., Bain et al.'s 2023 WhisperX). Independent evaluation of their use on a large untranscribed media corpus showed word and phone error rates for French at or below 10%, with errors mostly related to untranscribed fillers and hesitation words (Devauchelle et al., 2024).
Speech emotion recognition (SER) involves analyzing vocal cues to infer a speaker's emotional state (El Ayadi et al., 2011). This task focuses on prosodic features such as F0, speech intensity, and rate, typically mapped to emotional categories like happiness, sadness, anger, or neutrality (Scherer, 2003; Cowie et al., 2001). Computational approaches include traditional ML classification methods, such as support vector machines, and modern deep learning techniques, which leverage large datasets to enhance performance (Khalil et al., 2019).
In addition to categorical approaches, dimensional models of emotion provide an alternative framework for understanding emotional expression. A widely utilized model for categorizing emotions is the circumplex model, which maps emotional states along two main continuous dimensions: valence, describing the spectrum from positive to negative affect, and arousal, indicating levels from calm to excited (Russell, 1980). This approach enables more nuanced analyses of emotional expression, as speech can vary in intensity and affective quality even within traditional categorical boundaries. Dimensional models have been successfully integrated into emotion recognition systems to capture subtleties in vocal expression, offering a more comprehensive understanding of emotional dynamics in speech (Gunes & Pantic, 2010; Macary et al., 2021).
Emotion recognition links phonetic and acoustic variations with the broader psychological and pragmatic aspects of language use. In linguistic studies, it may be applied to examine how speakers encode emotions in speech and how prosody interacts with syntactic and semantic structures in diverse discourse contexts (Gussenhoven, 2004; Ladd, 2008). Dimensional approaches are particularly beneficial for cross-cultural investigations where researchers examine how emotional cues differ across social groups, cultural contexts, and situational settings (Shochi et al., 2009).
A popular open-source toolkit for audio analysis, openSMILE (open-source Speech and Music Interpretation by Large-space Extraction; Eyben et al., 2010), allows for the extraction of several features typically associated with emotion recognition (Eyben et al., 2015), e.g., speech intensity, F0, spectral information (alpha ratio, Hammarberg index, spectral slope), voice quality (jitter, shimmer), formants, and LPC coefficients.
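openSMILE also ships with a Python wrapper. A minimal sketch, assuming the opensmile package and its eGeMAPS configuration (Eyben et al., 2015), with an illustrative file name:

```python
import opensmile

# Extract the eGeMAPS functional features (one row per file).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("utterance.wav")   # pandas DataFrame
print(features.filter(like="F0semitone").T)      # e.g., F0-related functionals
```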
While progress has been significant, challenges remain, including the subjective nature of emotion labeling, variability in emotional expression across individuals and cultures, and the integration of contextual factors into computational models (Cauzinille et al., 2022).
SLU is a higher-level speech processing task that moves beyond ASR’s transcription capabilities to focus on extracting semantic content, speaker intent, and discourse meaning from spoken input (Ghannay et al., 2021). It encompasses a range of tasks, including domain classification, intent recognition, slot filling, and semantic parsing, which collectively enable a machine to “understand” and act upon human speech (Tur & De Mori, 2011). This field is central to applications such as virtual assistants, dialogue systems, and human-computer interaction, where the interpretation of the user intent is critical.
In linguistic research, SLU systems contribute to computational pragmatics by enabling the analysis of discourse structures, implicature, and speaker intent. Researchers could employ them to investigate how meaning is constructed and conveyed through speech, incorporating both lexical and prosodic cues. For instance, they can be used to analyze spoken corpora and identify patterns of politeness strategies, indirect speech acts, or pragmatic markers, providing insights into the interaction between form and function in language use (Jurafsky & Martin, 2024).
Current solutions consist of end-to-end systems, while traditional approaches rely on a pipeline architecture, which typically involves distinct modules for ASR and natural language understanding (NLU). After the speech is transcribed into text by the ASR system, it is processed by the NLU module. While this modular design allows for independent optimization and interpretability, it also introduces cascading errors: inaccuracies in the ASR output directly affect NLU performance (Tur & De Mori, 2011; Haghani et al., 2018).
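The cascading behavior is easy to picture as a two-stage function: the intent predicted by the NLU stage can only be as good as the transcript it receives. The sketch below is purely illustrative; `transcribe` and `classify_intent` are hypothetical placeholders for whichever ASR and NLU components are actually used.

```python
# Illustrative pipeline SLU: ASR followed by NLU. Any error introduced by
# `transcribe` propagates directly into `classify_intent`'s decision.

def transcribe(wav_path: str) -> str:
    """Placeholder for an ASR system (e.g., a Whisper-based recognizer)."""
    raise NotImplementedError

def classify_intent(text: str) -> str:
    """Placeholder NLU module: toy keyword rules standing in for a
    trained intent classifier."""
    text = text.lower()
    if "weather" in text:
        return "ask_weather"
    if "play" in text:
        return "play_music"
    return "unknown"

def pipeline_slu(wav_path: str) -> str:
    hypothesis = transcribe(wav_path)      # stage 1: speech -> text
    return classify_intent(hypothesis)     # stage 2: text -> intent
```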
In contrast, end-to-end SLU systems bypass the intermediate transcription step, directly mapping speech inputs to semantic representations or intents. This integrated approach leverages deep learning models to jointly learn acoustic and semantic features, resulting in streamlined architectures and potentially improved performance, particularly in noisy environments or when dealing with informal speech (Radford et al., 2023). However, end-to-end systems often require large amounts of annotated data. They are also more challenging to interpret.
The choice between these approaches depends on application requirements, such as the availability of annotated data, the need for intermediate transcriptions, and system interpretability. For linguists, traditional SLU may offer more transparency in analyzing speech patterns, whereas end-to-end systems provide a flexible framework for exploring direct relationships between acoustic signals and semantic content.
Voice modification encompasses different techniques for altering speech attributes such as F0, speech rate, and timbre. Tools implementing these techniques are widely used in linguistic studies, especially in experimental phonetics and prosody research. By modifying F0 contours, vowel length, or spectral characteristics, linguists can investigate how these features influence speech perception, prosodic structure, or speaker identity.
Early methods such as TD-PSOLA (Moulines & Charpentier, 1990) have long been used to manipulate F0 and duration. The algorithm operates by segmenting speech into small overlapping frames and modifying these frames synchronously with the signal periods (1/F0), allowing for smooth adjustments with minimal distortion. A widespread platform allowing its application is Praat (Boersma & Weenink, 2024).
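This kind of manipulation can also be scripted. The sketch below uses the parselmouth Python interface to Praat and its standard Manipulation commands to raise the F0 contour by 20%; the file names and manipulation parameters are illustrative, not a validated recipe for any particular study.

```python
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("stimulus.wav")

# Build a Manipulation object (time step, pitch floor and ceiling in Hz).
manipulation = call(sound, "To Manipulation", 0.01, 75, 600)

# Extract the pitch tier, scale all F0 values by 1.2, and put it back.
pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Multiply frequencies", sound.xmin, sound.xmax, 1.2)
call([pitch_tier, manipulation], "Replace pitch tier")

# Resynthesize with the overlap-add method and save the result.
shifted = call(manipulation, "Get resynthesis (overlap-add)")
shifted.save("stimulus_f0x1.2.wav", "WAV")
```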
Another popular method is based on Kawahara et al.'s (1999) STRAIGHT vocoder [4], which decomposes the speech signal into its source and filter components. This separation enables independent modification of F0 and the spectral envelope, making STRAIGHT particularly effective for perceptual studies and prosodic analysis. It is used in experimental linguistics for fine-tuning speech features, especially in perceptual experiments where precise control over the acoustic features is necessary. It has since been superseded by WORLD (Kawahara & Morise, 2024), which achieves reduced computational cost compared to STRAIGHT without compromising quality. WORLD is also employed in commercial products, as a singing synthesis engine in UTAU and as an analysis method for the CeVIO voice synthesizer.
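The WORLD analysis/synthesis cycle is available in Python through the pyworld bindings. The following sketch, assuming that package and the soundfile library (file names illustrative), re-synthesizes an utterance with its F0 contour scaled while keeping the spectral envelope intact:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("utterance.wav")           # mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

# WORLD analysis: F0 contour, spectral envelope, aperiodicity.
f0, sp, ap = pw.wav2world(x, fs)

# Raise F0 by 20% and resynthesize; the spectral envelope is unchanged.
y = pw.synthesize(f0 * 1.2, sp, ap, fs)
sf.write("utterance_f0x1.2.wav", y, fs)
```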
Recent advancements in deep learning have introduced neural vocoders, such as WaveNet (van den Oord et al., 2016) and HiFi-GAN (Kong et al., 2020), which provide high-fidelity speech modeling with flexible control over acoustic features. These methods offer significant improvements in naturalness and flexibility compared to traditional ones, enabling linguists to manipulate fine-grained acoustic properties. Additionally, frameworks like ControlVC (Chen & Duan, 2023), as well as diffusion-based voice conversion models (Popov et al., 2022), combine neural architectures with interpretable control, facilitating real-time speech modification and applications in experimental phonetics.
TTS systems convert written text into synthetic speech (Dutoit, 1997). Early state-of-the-art systems, in terms of quality, relied on concatenative synthesis, which pieced together pre-recorded speech units. Modern implementations have significantly improved the naturalness and intelligibility of synthetic speech. These systems employ deep learning architectures to model the intricate temporal and spectral characteristics of human speech, yielding outputs that closely approximate natural vocal expressions, e.g., Tacotron 2 (Wang et al., 2017; Shen et al., 2018), VITS [5] (Kim et al., 2021), and VALL-E 2 (Chen et al., 2024).
In linguistic research, they serve as experimental tools for the controlled manipulation of speech stimuli. They enable researchers to systematically alter phonetic, prosodic, or syntactic properties to investigate their effects on speech perception and comprehension (Evrard et al., 2015; Barbosa, 2007). For example, linguists may test hypotheses about F0 contours, stress patterns, or segmental contrasts by generating stimuli with precise acoustic properties. Modern neural network architectures are typically end-to-end and thus do not allow for parametrization. A notable exception is FastSpeech 2 [6] (Ren et al., 2020), which enables fine-grained control over F0, intensity, and duration through explicit variance predictors.
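Even without parametric control, off-the-shelf neural TTS models can be used to generate stimuli with very little code. The sketch below assumes the Coqui TTS package and one of its pretrained VITS voices; the model identifier, example text, and output path are illustrative and may vary across releases.

```python
# Sketch: synthesizing a stimulus with a pretrained VITS model through the
# Coqui TTS Python API (model name and API details may vary by version).
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/vits")
tts.tts_to_file(text="The experiment will begin shortly.",
                file_path="stimulus_vits.wav")
```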
TTS systems also support sociolinguistic and dialectological studies by simulating various speaking styles, dialects, or accents (e.g., the various "voices" and dialects of English proposed by the Google TTS system). By generating synthetic speech that replicates regional or social linguistic features, researchers can examine listeners' perceptions of identity, status, or solidarity based on speech characteristics (Coupland, 2007). Some systems even propose audiovisual speech synthesis, using an avatar to produce speech and speech-related gestures (GRETA; Pelachaud et al., 2002). They allow for studying the complex interaction of facial, body, and audio information during communication (e.g., Blomsma et al., 2022).
Furthermore, TTS systems can contribute to documenting and preserving endangered languages by creating synthetic voices that simulate native speakers, facilitating educational and cultural initiatives in under-resourced linguistic communities (Strickland et al., 2023).
Despite their progress, capturing fine-grained linguistic variations, such as those influenced by context or emotion (Evrard et al., 2015), remains a challenge. Addressing these issues requires the integration of higher-level linguistic models (i.e., including pragmatics) with neural TTS architectures, as well as the development of corpora that represent diverse linguistic and cultural contexts. One important limitation is linked to the system's input (text), which generally follows written language codes and rules rather than the dynamics, syntax, and pragmatics of oral interaction.
Speech processing provides a comprehensive set of tools for analyzing, generating, and transforming spoken language, with significant applications in linguistic research. These techniques address a broad spectrum of linguistic phenomena, from low-level signal processing, such as voice activity detection and source separation, to high-level tasks like emotion recognition and spoken language understanding. For linguists, they allow for studying speech production, perception, and variation. These tools enable acoustic analyses, as well as experimental manipulation of speech stimuli.
Moreover, the integration of advanced machine learning models into speech processing continues to expand the possibilities, opening new avenues for linguistic research. Recent advances in self-supervised learning (SSL) for speech signals, exemplified by models such as wav2vec 2.0, together with large-scale weakly supervised models such as Whisper, have significantly enhanced the ability to extract meaningful representations from raw audio data. These frameworks reduce the need for large-scale labeled datasets, making it feasible to study under-resourced languages and diverse linguistic phenomena. By leveraging SSL, researchers can perform most speech processing tasks with improved accuracy and efficiency. Additionally, these models facilitate the integration of speech and text modalities, enabling more holistic approaches to linguistic analysis.
These methods enhance traditional research approaches and facilitate the exploration of complex interactions between speech, language, and cognition. For graduate students and researchers, familiarity with these tools and their applications is increasingly important for addressing linguistic challenges and contributing to the understanding of human language in both theoretical and applied contexts.
[1] As for VAD (see section 2.1).
[2] https://www.bas.uni-muenchen.de/Bas/BasMAUS.html (see also the web interface, which offers more services: https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface).
[3] See, e.g., https://openai.com/index/whisper/ for a description of the Whisper system.
[4] https://www.isc.meiji.ac.jp/~mmorise/straight/english/links.html
[5] Variational Inference with adversarial learning for end-to-end Text-to-Speech (Kim et al., 2021).
[6] A non-official open-source implementation: https://github.com/ming024/FastSpeech2
Adda-Decker, M., & Lamel, L. (2017). Discovering speech reductions across speaking styles and languages. In Cangemi, F., Clayards, M., Niebuhr, O., Schuppler, B., & Zellers, M. (Eds.), Rethinking reduction: Interdisciplinary perspectives on conditions, mechanisms, and domains for phonetic variation (pp. 101–128). De Gruyter Mouton.
Adda-Decker, M., de Mareüil, P. B., Adda, G., & Lamel, L. (2005). Investigating syllabic structures and their variation in spontaneous French. Speech communication, 46(2), 119–139.
Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on audio, speech, and language processing, 20(2), 356-370.
Avanzi, M., & de Mareüil, P. B. (2017). Identification of regional French accents in (northern) France, Belgium, and Switzerland. Journal of Linguistic Geography, 5(1), 17–40.
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
Bai, Z., & Zhang, X. L. (2021). Speaker recognition based on deep learning: An overview. Neural Networks, 140, 65–99.
Bain, M., Huh, J., Han, T., Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. Proc. Interspeech 2023, 4489-4493, doi: 10.21437/Interspeech.2023-78.
Barbosa, P. N. A. (2007). From syntax to acoustic duration: A dynamical model of speech rhythm production. Speech Communication, 49(9), 725-742.
Barras, C., Zhu, X., Meignier, S., & Gauvain, J. L. (2006). Multistage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1505-1512.
Batliner, A., Hantke, S., & Schuller, B. (2020). Ethics and good practice in computational paralinguistics. IEEE Transactions on Affective Computing, 13(3), 1236-1253.
Blomsma, P., Skantze, G., & Swerts, M. (2022). Backchannel behavior influences the perceived personality of human and artificial communication partners. Frontiers in Artificial Intelligence, 5, 835298.
Boersma, P., & Weenink, D. (1992–2024). Praat: doing phonetics by computer [Computer program]. Version 6.4.24, retrieved December 7, 2022 from https://www.fon.hum.uva.nl/praat/.
Bredin, H. (2023). pyannote.audio 2.1 speaker diarization pipeline: Principle, benchmark, and recipe. In 24th Interspeech Conference, pp. 1983–1987. ISCA.
Cauzinille, J., Evrard, M., Kiselov, N., & Rilliard, A. (2022). Annotation of expressive dimensions on a multimodal French corpus of political interviews. In First Workshop on Natural Language Processing for Political Sciences (PoliticalNLP) (pp. 91-97).
Chen, M. & Duan, Z. (2023). ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Speed. Proc. INTERSPEECH 2023, 2098-2102, doi: 10.21437/Interspeech.2023-1788.
Chen, S., Liu, S., Zhou, L., Liu, Y., Tan, X., Li, J., ... & Wei, F. (2024). VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2406.05370.
Comon, P., & Jutten, C. (Eds.). (2010). Handbook of Blind Source Separation: Independent component analysis and applications. Academic press.
Coupland, N. (2007). Style: Language Variation and Identity. Cambridge University Press.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, G., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J. G. (2001). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798.
Devauchelle, S., Rilliard, A., Doukhan, D., and Ondel-Yang, L. (2024). Describing voice in French Media Archives: Age and Gender Effects on Pitch and Articulation Characteristics. In Studi AISV 2024.
Dudley, H. W. (1938). System for the artificial production of vocal or other sounds (Patent No. US2121142A). United States Patent and Trademark Office.
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern recognition, 44(3), 572-587.
Evrard, M., Delalez, S., d'Alessandro, C., & Rilliard, A. (2015). Comparison of chironomic stylization versus statistical modeling of prosody for expressive speech synthesis. In INTERSPEECH (pp. 3370-3374).
Evrard, M. (2021). Transformers in automatic speech recognition. In ECCAI Advanced Course on Artificial Intelligence (pp. 123-139). Cham: Springer International Publishing.
Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of ACM Multimedia (pp. 1459–1462). https://doi.org/10.1145/1873951.1874246
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., ... & Truong, K. P. (2015). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE transactions on affective computing, 7(2), 190-202.
Ghannay, S., Caubrière, A., Mdhaffar, S., Laperrière, G., Jabaian, B., & Estève, Y. (2021). Where are we in semantic concept extraction for Spoken Language Understanding? In Speech and Computer: 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings 23 (pp. 202-213). Springer International Publishing.
Gunes, H., & Pantic, M. (2010). Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emotions, 1(1), 68–99.
Gussenhoven, C. (2004). The Phonology of Tone and Intonation. Cambridge University Press.
Haghani, P., Narayanan, A., Bacchiani, M., Chuang, G., Gaur, N., Moreno, P., ... & Waters, A. (2018). From audio to semantics: Approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 720-726). IEEE.
Hansen, J. H., & Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6), 74-99.
Hershey, J. R., Chen, Z., Le Roux, J., & Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35. IEEE.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., ... & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
Jahangir, R., Teh, Y. W., Memon, N. A., Mujtaba, G., Zareei, M., Ishtiaq, U., ... & Ali, I. (2020). Text-independent speaker identification through feature fusion and deep neural network. IEEE Access, 8, 32187-32202.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In Conversation Analysis: Studies from the First Generation (pp. 13–31). John Benjamins Publishing.
Jimerson, R., Liu, Z., & Prud’Hommeaux, E. (2023). An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 1008-1016).
Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. https://web.stanford.edu/~jurafsky/slp3
Kalda, J., Pagés, C., Marxer, R., Alumäe, T., & Bredin, H. (2024). PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings. The Speaker and Language Recognition Workshop (Odyssey 2024), 115–122. https://doi.org/10.21437/odyssey.2024-17
Kawahara, H., Masuda-Katsuse, I., & De Cheveigne, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech communication, 27(3-4), 187-207.
Kawahara, H., & Morise, M. (2024). Interactive tools for making vocoder-based signal processing accessible: Flexible manipulation of speech attributes for explorational research and education. Acoustical Science and Technology, 45(1), 48-51.
Kim, J., Kong, J., & Son, J. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning (pp. 5530-5540). PMLR.
Khalil, R. A., Jones, E., Babar, M. I., Jan, T., Zafar, M. H., & Alhussain, T. (2019). Speech emotion recognition using deep learning techniques: A review. IEEE Access, 7, 117327–117345.
Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020.
Labov, W., Ash, S., & Boberg, C. (2006). The Atlas of North American English: Phonetics, Phonology, and Sound Change. Mouton de Gruyter.
Ladd, D. R. (2008). Intonational Phonology (2nd ed.). Cambridge University Press.
Lavechin, M., Bousbib, R., Bredin, H., Dupoux, E., Cristia, A., Gill, M. P., & Garcia-Perera, L. P. (2020). End-to-end Domain-Adversarial Voice Activity Detection. In Interspeech 2020.
Macary, M., Tahon, M., Estève, Y., & Rousseau, A. (2021). On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT) (pp. 373-380). IEEE.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal forced aligner: Trainable text-speech alignment using Kaldi. In Interspeech (Vol. 2017, pp. 498–502).
Moulines, E., & Charpentier, F. (1990). Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9(5–6), 453–467.
Nagrani, A., Chung, J., & Zisserman, A. (2017). VoxCeleb: a large-scale speaker identification dataset. Interspeech 2017.
Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., & Poggi, I. (2002). Embodied contextual agent in information delivering application. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 2 (pp. 758-765).
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., Kudinov, M. S., & Wei, J. (2022). Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme. In International Conference on Learning Representations.
Rabiner, L. R., & Schafer, R. W. (2011). Theory and Applications of Digital Speech Processing. Pearson.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International conference on machine learning (pp. 28492–28518). PMLR.
Raso, T., Rilliard, A., Santos, S. M., & De Moraes, J. A. (2023). Discourse Markers as information units formally conveyed by prosody. In Discourse Markers-Theories and Methods.
Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations.
Reynolds, D. A. (2002). Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17(1), 91–108.
Rilliard, A., d'Alessandro, C., & Evrard, M. (2018). Paradigmatic variation of vowels in expressive speech: Acoustic description and dimensional analysis. The Journal of the Acoustical Society of America, 143(1), 109-122.
Rose, P. (2002). Forensic Speaker Identification. CRC Press.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking in conversation. Language, 50(4), 696–735.
Scheibler, R., Ji, Y., Chung, S. W., Byun, J., Choe, S., & Choi, M. S. (2023). Diffusion-based generative speech source separation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.
Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1–2), 227–256.
Schiel, F. (1999). Automatic Phonetic Transcription of Non-Prompted Speech. 14th International Congress of Phonetic Sciences (ICPhS), San Francisco, USA. Ohala, John J. (ed.).
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., ... & Wu, Y. (2018). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4779–4783). IEEE.
Shochi, T., Rilliard, A., Aubergé, V., & Erikson, D. (2009). Intercultural perception of English, French, and Japanese social affective prosody. The role of prosody in Affective Speech, pp. 31–60.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5329–5333). IEEE.
Strickland, E., Aubakirova, D., Doncenco, D., Torres, D., & Evrard, M. (2023). NaijaTTS: A pitch-controllable TTS model for Nigerian Pidgin. In ISCA Speech Synthesis Workshop.
Strickland, E., Lacheret-Dujour, A., Kahane, S., Evrard, M., ... & Guillaume, B. (2024). New Methods for Exploring Intonosyntax: Introducing an Intonosyntactic Treebank for Nigerian Pidgin. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., & Zhong, J. (2021). Attention is all you need in speech separation. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25. IEEE.
Tur, G., & De Mori, R. (2011). Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. Wiley.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), p. 125.
Wang, Q., Downey, C., Wan, L., Mansfield, P. A., & Moreno, I. L. (2018). Speaker diarization with LSTM. In 2018 IEEE International Conference on acoustics, speech and signal processing (ICASSP) (pp. 5239–5243). IEEE.
Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., Saurous, R.A. (2017). Tacotron: Towards End-to-End Speech Synthesis. Proc. Interspeech 2017, pp. 4006–4010, doi: 10.21437/Interspeech.2017-1452.
Wells, J. C. (1982). Accents of English. Cambridge University Press.
Zhu, J., Zhang, C., & Jurgens, D. (2022). Phone-to-audio alignment without text: A semi-supervised approach. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8167-8171.