Watch at YouTube: https://www.youtube.com/live/EOjNpyICY0Y?si=AIm3Y87SfIp3XNsG
For decades, the interplay of phonetic forms, phonological structures and communicative functions was mainly investigated via meticulous analysis of the acoustic or multimodal performance of speakers and listeners... with the ambition of providing technology with principles, controls and constraints emerging from our human creativity. This golden age of human intelligence is now largely defeated by model-free generative AI, in particular text-to-speech systems that provide signals or videos of speaking faces that are often mistaken for natural data! In this presentation, I will argue for a positive attitude: consider these high-quality end-to-end models as a proxy to capture lawful data variability, and develop new tools to explore the internal representations (so-called latent spaces) built by these successful models. I will further detail two works started with my colleagues Olivier Perrotin and Martin Lenglet: (1) exploration and control of phonetic and phonological embeddings via causal regression (Lenglet et al., Interspeech 2022 & submitted to CSL); (2) exploration and fine control of audiovisual attitudes via verbal tags (Bailly et al., LREC/COLING 2024).
Plan:
Watch at YouTube: https://www.youtube.com/live/nB4fcnTNBC4?si=IOrIjH4nabkmh-iD
On the timescale of prosodic phrases, temporal patterns in conversational speech are not very regular. To the contrary, spurts of fluent speech tend to be highly irregular and intermittent. Hesitations and pauses are common. What is the mechanism behind this pattern? In this talk I consider and reject two possible explanations. First, I examine the possibility that a hierarchical organization of relatively long-timescale prosodic units might explain intermittency. Several predictions of hierarchical prosodic structure accounts are examined in an analysis of the Switchboard NXT corpus, but the empirical patterns are not very consistent with those predictions. Moreover, I argue that even laboratory studies that purport to find evidence for hierarchical phrase structure suffer from flawed argumentation. For these reasons, a hierarchical structure-based account of intermittency is suspect. Second, I examine the possibility that there is a phrase-timescale oscillator that governs phrase initiation. I critique recent studies that have argued that such an oscillator is involved in speech production. Through model simulations I show that in order to adequately capture empirical timing patterns, such an oscillator would need an overly powerful ability to change frequency from cycle to cycle. On top of this, the neurophysiological basis for a role of oscillation in governing phrasal timing is called into question. Instead of structural or oscillation-based mechanisms being responsible for phrasal timing, I argue that intermittency arises due to mechanisms responsible for the organization of syntactic and conceptual systems. I present a model in which phrase initiation is contingent on the achievement of a coherent state among those systems, and show how stochastic influences can generate hesitative phenomena that may be the basis for the intermittency of speech.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=YUE9pRbq9w0
This talk is about the role of prenuclear prominences and their relation to nuclear accents in German and English. The production results (German) that I will present show that the realization of the prenuclear domain depends on whether it is focal or prefocal. The prenuclear noun is characterized by larger F0 excursions, higher F0 maxima, and longer durations when it is in broad focus than when it precedes a narrow focus. Furthermore, the realization of the prenuclear domain depends on the following focus type: The prenuclear noun is produced with smaller F0 excursions, lower F0 maxima and shorter durations before a corrective focus than before a non-corrective narrow focus. The findings suggest that the phonetic manifestation of information structure is distributed over larger prosodic domains with an inverse relationship in the syntagmatic dimension. In addition, the study contributes further evidence that continuous phonetic detail is used to encode information structural categories. An important question that arises from the production data is whether this phonetic detail can be used by listeners in perception. I will present first results from a series of perception experiments (German and English) to investigate this question.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=0RuJVUaV9QQ
Conversational interfaces, in the form of voice assistants, smart speakers, and social robots, are becoming ubiquitous. This trend is partly fuelled by recent developments in large language models. While this progress is very exciting, human-machine conversation is currently limited in many ways. In this talk, I will specifically address the modelling of conversational turn-taking. As current systems lack the sophisticated coordination mechanisms found in human-human interaction, they are often plagued by interruptions or sluggish responses. I will present our recent work on predictive modelling of turn-taking, which allows the system not only to react to turn-taking cues, but also to predict upcoming turn-taking events and produce relevant cues to facilitate real-time coordination of spoken interaction. Through analysis of the model, we also learn which cues are relevant to turn-taking, including prosody and filled pauses.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=uRnR9dI_EAo
In the last decade, data-driven and machine-learning methods for speech synthesis have greatly improved its quality, so much so that the realism achievable by current neural synthesisers can rival natural speech. However, modern neural synthesis methods have not yet been adopted as tools for experimentation in the speech and language sciences. This is because modern systems still lack the ability to manipulate low-level acoustic characteristics of the signal, such as formant frequencies.
In this talk, I survey recent advances in speech synthesis and discuss their potential as experimental tools for phonetic research. I argue that speech scientists and speech engineers would benefit from working more with each other again, in particular in the pursuit of prosodic and acoustic parameter control in neural speech synthesis. I showcase several approaches to fine synthesis control that I have implemented with colleagues: the WavebenderGAN and a system that mimics the source-filter model of speech production. These systems make it possible to manipulate formant frequencies and other acoustic parameters with the same (or better) accuracy as, for example, Praat, but with far superior signal quality.
Finally, I discuss ways to improve synthesis evaluation paradigms, so that not only industry benchmarks but also those of speech science experimentation are met. My hope is to inspire more students and researchers to take up these research challenges and explore the potential of working at the intersection of speech technology and speech science.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=_GNY3VEeC78
This talk will give an overview of the issue of variability in intonation and present methodological approaches that render variability easier to handle. These methodologies are presented by means of a case study, the English pitch accents H* and L+H*, which are treated as distinct phonological entities in some accounts but as endpoints of a continuum in others. The research that will be presented sheds light on the reasons for the disagreement between analyses and the discrepancies between analyses and empirical evidence, by examining both production data from British English unscripted speech and perceptual data, which also link the processing of the two accents to the participants’ levels of empathy, musicality, and autistic-like traits.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=iHz1JC504eg
This lecture will be on an aspect of the articulatory-acoustics relationship that is rarely addressed but which is both stable and robust across places of articulation, tonal context and prominence levels, inter alia. It’s about acceleration and deceleration peaks of articulatory movements and how they coincide with specific acoustic segment landmarks. I will present findings on lips, tongue tip, tongue body and jaw movements, and present ideas on how to use these rapid movements to model articulatory prosody.
Plan:
Watch at YouTube: https://www.youtube.com/watch?v=ENteqgM8NEM
Rather than being a coherent whole, speech prosody consists of highly diverse phenomena that are best understood in terms of their communicative functions, together with specific mechanisms of articulatory encoding and perceptual decoding. The understanding of these root causes is therefore key to further advances in prosody research.