Speech Prosody Lectures | Predictive Modelling of Turn-taking in Spoken Conversation

Lecturer
Gabriel Skantze (Professor in Speech Technology at KTH Royal Institute of Technology, Sweden)
Host
Plínio Almeida Barbosa (IEL/University of Campinas, Brazil)
Organized by
Speech Prosody Special Interest Group (SProSIG)
Live transmission date
15th May, 2024, at 1:00 PM (BRT, UTC-3)

Watch on YouTube: https://www.youtube.com/watch?v=0RuJVUaV9QQ

Conversational interfaces, in the form of voice assistants, smart speakers, and social robots, are becoming ubiquitous. This development is partly fuelled by recent advances in large language models. While this progress is very exciting, human-machine conversation is currently limited in many ways. In this talk, I will specifically address the modelling of conversational turn-taking. Because current systems lack the sophisticated coordination mechanisms found in human-human interaction, they are often plagued by interruptions or sluggish responses. I will present our recent work on predictive modelling of turn-taking, which allows the system not only to react to turn-taking cues, but also to predict upcoming turn-taking events and produce relevant cues of its own, facilitating real-time coordination of spoken interaction. Through analysis of the model, we also learn which cues are relevant to turn-taking, including prosody and filled pauses.
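
To make the idea of prediction concrete, the sketch below shows the general shape of a Voice Activity Projection (VAP)-style model: instead of waiting for silence, it continuously predicts each speaker's voice activity over a window of future time bins. This is a minimal, hypothetical sketch for orientation only, not the KTH implementation; the feature dimensions, bin count, and architecture are illustrative assumptions.

    # Minimal, hypothetical sketch of a VAP-style predictive turn-taking model.
    # Not the actual KTH implementation; dimensions and architecture are assumed.
    import torch
    import torch.nn as nn

    class VAPSketch(nn.Module):
        def __init__(self, n_features=40, hidden=128, n_future_bins=4):
            super().__init__()
            self.n_future_bins = n_future_bins
            # Encode the joint two-speaker feature stream with a recurrent net.
            self.encoder = nn.GRU(2 * n_features, hidden, batch_first=True)
            # Predict voice activity for both speakers over each future bin.
            self.head = nn.Linear(hidden, 2 * n_future_bins)

        def forward(self, feats_a, feats_b):
            # feats_a, feats_b: (batch, time, n_features), one per speaker.
            x = torch.cat([feats_a, feats_b], dim=-1)
            _, h = self.encoder(x)        # final hidden state summarizes context
            return self.head(h[-1]).view(-1, 2, self.n_future_bins)

    model = VAPSketch()
    a = torch.randn(1, 100, 40)             # 100 feature frames for speaker A
    b = torch.randn(1, 100, 40)             # ... and for speaker B
    future_va = torch.sigmoid(model(a, b))  # P(speaking) per speaker per bin
    # Rising predicted activity for the listener anticipates a turn shift.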

Plan:

  1. Introduction to conversational systems and human-robot interaction
  2. Why turn-taking is problematic in current systems
  3. Voice Activity Projection: A predictive, data-driven model of turn-taking
  4. Analysis of the model (prosody and filled pauses)
  5. Towards better turn-taking in conversational systems
Speech Prosody Lectures | The speech synthesis phoneticians need is both realistic and controllable

Lecturer
Zofia Malisz (KTH Royal Institute of Technology, Sweden)
Host
Plínio Almeida Barbosa (IEL/University of Campinas, Brazil)
Organized by
Speech Prosody Special Interest Group (SProSIG)
Live transmission date
17th April, 2024, at 1:00 PM (BRT, UTC-3)

Watch on YouTube: https://www.youtube.com/watch?v=uRnR9dI_EAo

In the last decade, data-driven and machine-learning methods have greatly improved the quality of speech synthesis, so much so that the realism achievable by current neural synthesisers can rival natural speech. However, modern neural synthesis methods have not yet been taken up as experimental tools in the speech and language sciences, because modern systems still lack the ability to manipulate low-level acoustic characteristics of the signal, such as formant frequencies.

In this talk, I survey recent advances in speech synthesis and discuss their potential as experimental tools for phonetic research. I argue that speech scientists and speech engineers would benefit from working more with each other again, in particular in the pursuit of prosodic and acoustic parameter control in neural speech synthesis. I showcase several approaches to fine-grained synthesis control that I have implemented with colleagues: WavebenderGAN and a system that mimics the source-filter model of speech production. These systems make it possible to manipulate formant frequencies and other acoustic parameters with the same or better accuracy than, for example, Praat, but with far superior signal quality.
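
For reference, the classic signal-processing route to such manipulations runs through Praat. The snippet below is a hedged sketch using the parselmouth Python interface to Praat (the input file name and the 10% shift ratio are illustrative assumptions): it shifts all formant frequencies upward via Praat's built-in "Change gender" manipulation, the kind of control the neural systems discussed in the talk aim to match at much higher signal quality.

    # Hedged sketch: formant shifting with Praat via parselmouth.
    # The input file and parameter values are illustrative assumptions.
    import parselmouth
    from parselmouth.praat import call
    import soundfile as sf

    snd = parselmouth.Sound("vowel.wav")  # hypothetical input recording
    # "Change gender" arguments: pitch floor (Hz), pitch ceiling (Hz),
    # formant shift ratio, new pitch median (0 = unchanged),
    # pitch range factor, duration factor.
    shifted = call(snd, "Change gender", 75, 600, 1.1, 0, 1, 1)
    sf.write("vowel_shifted.wav", shifted.values.T,
             int(shifted.sampling_frequency))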

Finally, I discuss ways to improve synthesis evaluation paradigms so that they meet the benchmarks not only of industry but also of speech science experimentation. My hope is to inspire more students and researchers to take up these research challenges and explore the potential of working at the intersection of speech technology and speech science.

Plan:

  1. I briefly discuss the history of advances in speech synthesis, starting in the formant synthesis era, and explain where the improvements came from.
  2. I present experiments of mine showing that modern synthetic speech is processed no differently from natural speech by human listeners in a lexical decision task, as evidence that the realism (“naturalness”) goal has been largely achieved.
  3. I explain how realism came at the expense of controllability, and show why controllability is indispensable if speech synthesis is to be adopted in phonetic experimentation. I survey the current state of research on controllability in speech engineering, concentrating on prosodic and formant control.
  4. I propose a way forward, presenting the work I have done with colleagues on several systems that feature both realism and control.
  5. I sketch a roadmap for improving synthesis tools for phonetics, with a focus on benchmarking systems according to scientific criteria.
Speech Prosody Lectures | How to handle variability in the study of intonation

Lecturer
Amalia Arvaniti (Radboud University, Netherlands)
Host
Plínio Almeida Barbosa (IEL/University of Campinas, Brazil)
Organized by
Speech Prosody Special Interest Group (SProSIG)
Live transmission date
14th December, 2023, at 1:00 PM (BRT, UTC-3)

Watch on YouTube: https://www.youtube.com/watch?v=_GNY3VEeC78

This talk gives an overview of variability in intonation and presents methodological approaches that make such variability easier to handle. These methodologies are illustrated through a case study of the English pitch accents H* and L+H*, which are treated as distinct phonological entities in some accounts but as endpoints of a continuum in others. The research presented sheds light on the reasons for the disagreement between analyses, and on the discrepancies between analyses and empirical evidence, by examining both production data from unscripted British English speech and perceptual data; the latter also link the processing of the two accents to participants’ levels of empathy, musicality, and autistic-like traits.
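
As an illustration of how such gradient variability can be quantified in production data, the sketch below measures the height and timing of an f0 peak relative to the onset of the accented syllable, the kind of continuous measure that lets H* and L+H* tokens be compared without presupposing discrete categories. This is not the talk's actual method; the file name and syllable times are hypothetical.

    # Hedged sketch: continuous measures of pitch-accent scaling and alignment.
    # Not the talk's actual method; file name and syllable span are assumed.
    import numpy as np
    import parselmouth

    snd = parselmouth.Sound("utterance.wav")   # hypothetical recording
    pitch = snd.to_pitch(time_step=0.01)
    f0 = pitch.selected_array["frequency"]     # Hz; 0 where unvoiced
    times = pitch.xs()

    syl_onset, syl_offset = 0.42, 0.61         # hypothetical accented syllable
    in_syl = (times >= syl_onset) & (times <= syl_offset) & (f0 > 0)
    peak_idx = int(np.argmax(np.where(in_syl, f0, -np.inf)))
    peak_hz, peak_time = f0[peak_idx], times[peak_idx]
    print(f"f0 peak: {peak_hz:.1f} Hz, {peak_time - syl_onset:.3f} s after onset")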

Plan:

  1. Intonation essentials
  2. Introducing the test case
  3. Production methods and results
  4. Conclusions
Speech Prosody Lectures | Segmental articulations and prosody

Lecturer
Malin Svensson Lundmark (Lund University, Sweden)
Host
Plínio Almeida Barbosa (IEL/University of Campinas, Brazil)
Organized by
Speech Prosody Special Interest Group (SProSIG)
Live transmission date
23rd November, 2023, at 1:00 PM (BRT, UTC-3)

Watch on YouTube: https://www.youtube.com/watch?v=iHz1JC504eg

This lecture addresses an aspect of the articulatory-acoustic relationship that is rarely discussed but which is both stable and robust across places of articulation, tonal contexts, and prominence levels, inter alia: the acceleration and deceleration peaks of articulatory movements, and how they coincide with specific acoustic segment landmarks. I will present findings on lip, tongue-tip, tongue-body, and jaw movements, and present ideas on how to use these rapid movements to model articulatory prosody.
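
As a rough illustration of the kind of measurement involved, the sketch below locates acceleration and deceleration peaks in a one-dimensional articulatory trajectory, such as an electromagnetic articulography (EMA) lip trace; these landmarks can then be time-aligned with acoustic segment boundaries from a matched recording. The sampling rate and the toy signal are illustrative assumptions, not the lecturer's data or code.

    # Hedged sketch: finding acceleration/deceleration peaks in an articulatory
    # trajectory. The sampling rate and toy signal are illustrative assumptions.
    import numpy as np
    from scipy.signal import find_peaks, savgol_filter

    fs = 250.0                              # assumed EMA sampling rate (Hz)
    t = np.arange(0, 1.0, 1 / fs)
    position = np.sin(2 * np.pi * 3 * t)    # toy stand-in for a lip trace

    # Smooth before differentiating: numerical derivatives amplify noise.
    position = savgol_filter(position, window_length=21, polyorder=3)
    velocity = np.gradient(position, 1 / fs)
    acceleration = np.gradient(velocity, 1 / fs)

    acc_peaks, _ = find_peaks(acceleration)   # movement onsets
    dec_peaks, _ = find_peaks(-acceleration)  # movement offsets
    print("acceleration peaks (s):", t[acc_peaks])
    print("deceleration peaks (s):", t[dec_peaks])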

Plan:

  1. Movement dynamics – acceleration peaks and jerks
  2. Findings on segmental articulations
  3. Findings on effects of tonal context, prominence, syllable articulations
  4. Towards a model
  5. “Segment_ART”: a low-tech acoustic and articulatory analysis (time permitting)
Speech Prosody Lectures | Tackling prosodic phenomena at their roots

Lecturer
Yi Xu (University College London, UK)
Host
Plínio Almeida Barbosa (IEL/University of Campinas, Brazil)
Organized by
Speech Prosody Special Interest Group (SProSIG)
Live transmission date
25th October, 2023, at 1:00 PM (BRT, UTC-3)

Watch on YouTube: https://www.youtube.com/watch?v=ENteqgM8NEM

Rather than being a coherent whole, speech prosody consists of highly diverse phenomena that are best understood in terms of their communicative functions, together with the specific mechanisms of articulatory encoding and perceptual decoding. Understanding these root causes is therefore key to further advances in prosody research.