From Phonological Symbols to Articulation: A New 3-Component Model of Speech Production
Alice Turk | University of Edinburgh
Stefanie Shattuck-Hufnagel | Massachusetts Institute of Technology
Benjamin Elie | University of Edinburgh
Juraj Šimko | University of Helsinki
1. Introduction

Speakers vary the phonetic shape of the same word in different utterances, and in different contexts in the same utterance. Accounting for the phonological equivalence of these different pronunciations, as well as the dynamic details of how they are articulated, is one of the biggest challenges for models of speech production. Previous models provide partial accounts of these phenomena, but do not provide the requisite flexibility for explaining the full range of systematic variability. This paper summarises a new approach that integrates traditional symbolic phonological representations with mechanisms for planning articulations that accomplish the phonological goals for a particular utterance (Turk & Shattuck-Hufnagel 2020). This approach provides accounts of phonological contrast, the systematic context-governed nature of phonetic variants, and their phonological equivalence. It also produces realistic articulatory trajectories.

Many lines of evidence motivate the use of symbolic phonological representations of the kind proposed in traditional phonological theory (e.g. Chomsky & Halle 1968). However, traditional phonological theory provided no strategy for how these representations could lead to articulatory movement, or for how speakers plan and produce the systematic variability of the “same” sounds in different contexts in an utterance. Progress in this area was made by Henke (1966), Keating (1990), Kingston & Diehl (1994), Guenther (1995, 2016) and Fujimura (1992), who motivated and developed three-component models. All of these approaches assumed three stages: 1) Phonological Planning, for retrieving and ordering a sequence of symbolic phonological elements, 2) Phonetic Planning, for planning details of how phonological goals should be achieved, and 3) Motor-Sensory Implementation, for producing the planned utterance. But none of these proposals succeeded in providing a full account of both the dynamics of speech-related movement and systematic phonetic variability.

A two-component solution that eliminated the distinction between Phonology and Phonetics was provided by Articulatory Phonology (AP), its precursors, and subsequent models based on it (Fowler 1977; Fowler et al. 1980; Browman & Goldstein 1985, 1992 et seq.; Tilsen 2022). This proposal took the form of spatiotemporal phonological representations which contain information about their production, as well as adjustments to default activations of these representations in specific contexts. These phonological representations were modelled using a well-known tool from engineering approaches to physical dynamical systems: mass-spring oscillators (Saltzman & Munhall 1989). Later developments made further use of oscillators to model the control of contextual adjustments (e.g. Saltzman et al. 2008).

Although this two-component approach was impressive in its coverage, several lines of evidence, primarily from speech timing behavior, appear to require symbolic representations for Phonological Planning, as well as a separate Phonetic Planning component. The model sketched in Turk & Shattuck-Hufnagel (2020), called XT/3C, accounts for this evidence by leveraging a different set of theoretical mechanisms, originally proposed for non-speech motor planning and control. XT/3C uses Lee’s Tau theory (Lee 1998, 2009) for its dynamic model, rather than oscillators, and controls systematic phonetic variability using an Optimal Control Theory approach (Bellman 1957; Pontryagin et al. 1962; Jordan & Wolpert 1999; Todorov & Jordan 2002; Shadmehr & Wise 2005, among others; this approach has been adapted for speech articulation planning and control by Nelson 1983; Nelson et al. 1984; Lindblom 1990; Šimko & Cummins 2010, 2011; and Houde & Nagarajan 2016). The new model that makes use of these mechanisms is called XT/3C because it uses a phonology-eXtrinsic Timekeeping mechanism and employs 3 processing Components: 1) Phonological Planning, to set the phonological goals, 2) Phonetic Planning, to specify how those goals will be signalled acoustically via articulation, and 3) Motor-Sensory Implementation, to govern the actual production.

2. The Three Components of XT/3C

A graphical summary of the three components of the model is presented in Figure 1. As illustrated there, a significant feature of XT/3C is its use of symbolic phonological representations. Symbolic phonemes are required to account for the phonological equivalence of different "allophones" produced with different sets of articulators (e.g. Dutch /r/ variants (Scobbie et al. 2009), or cases of debuccalisation such as British English glottal stop vs. released, aspirated variants of /t/ (O’Brien 2012)). Symbolic phonemes also form an important part of an account of lower temporal variability at the goal-related part(s) of movement, for movements of the same articulators used to produce the same phonemes in the same prosodic context (e.g. Perkell & Matthies 1992, discussed in Turk & Shattuck-Hufnagel 2020). This is because a separate level of representation (e.g. for symbolic phonemes) is required to map to the part of movement whose temporal accuracy is prioritised and whose temporal variability is lowest (spatiotemporal phonological representations predict similar temporal variability at all parts of movement). The choice of symbolic representations (as opposed to the spatiotemporal phonological representations used in other approaches) has profound implications for the architecture of the system that speakers use for planning speech articulation, and for the mechanisms which relate phonology to surface phonetic form. Because the symbolic representations used in Phonological Planning do not contain details about how these sound categories are to be produced, this choice requires a separate Phonetic Planning component, with different types of representations and goals. That is, representations are symbolic, discrete, qualitative, and relational in Phonological Planning, but quantitative and continuous in Phonetic Planning. Like other three-component models, XT/3C therefore has separate Phonological and Phonetic Planning components, in contrast with spatiotemporal models, in which phonological and phonetic planning are integrated. Both two- and three-component models contain a component for implementing and monitoring speech, which XT/3C calls the Motor-Sensory Implementation component.

Figure 1. A schematic diagram illustrating the three processing components of XT/3C (grey boxes), with inputs and outputs for each (ovals). To date, the computational implementation of XT/3C has focused primarily on Phonetic Planning.

XT/3C explicitly represents surface phonetic goals, in the sense that it specifies optimal articulation patterns that signal the phonological goals. This contrasts with the approach taken in spatiotemporal models, where articulatory trajectories emerge from gestural representations in the lexicon, and adjustments to their activations, without explicit representation of the resulting articulatory movements. Such emergent phonetics is possible in spatiotemporal approaches because phonological representations contain information about how to produce the gestures. But because there are multiple lines of evidence suggesting the need for an alternative (Turk & Shattuck-Hufnagel 2020), e.g. evidence that surface phonetic goals are explicitly represented, XT/3C takes a different approach. In XT/3C, surface phonetics cannot emerge from Phonological Planning, because phonological representations are symbolic and relational, but not quantitative. As a result, a Phonetic Planning component is required to specify quantitative surface phonetic goals.

Temporal aspects of these surface phonetic representations are specified with the aid of a general-purpose timekeeper. This timekeeper is extrinsic to the phonology, and governs all motor activity, in contrast to spatiotemporal models, which employ phonology-intrinsic timing. In spatiotemporal models, timing patterns are controlled via phonology-intrinsic “clock” slowing (e.g. Saltzman et al. 2008, inspired by Byrd & Saltzman 2003; see also Tilsen 2022). Adjustments to the phonology-intrinsic “clock”, which are specific to a particular utterance and/or to specific prosodic positions within an utterance, present challenges for coordinating speech with the timing of external events, for example in synchronous speaking. When the phonology-intrinsic “clock” is slowed, phonology-specific time no longer aligns with solar time, measurable in constant milliseconds. XT/3C instead makes use of a phonology-extrinsic, non-speaker-specific, general-purpose timekeeper, used for all activities, whose units of (milli)seconds align with the invariant (non-warpable) timing units of solar time.

The following sections provide an expanded view of each of the three components of XT/3C, although to date, the computational implementation of XT/3C has focused mainly on Phonetic Planning.

2.1. Phonological Planning

The information specified in Phonological Planning in XT/3C is more extensive than that proposed in traditional phonological theory: whereas traditional phonological theory includes symbolic representations for lexical contrast in a prosodic structure, XT/3C supplements this with qualitative and relational representations for distinctive acoustic cue patterns (Stevens 1987, 2002). In addition, the phonological plan includes specifications for other goals for the utterance, such as the desire to be understood by a listener, relative speech rate (e.g. faster than normal, normal, etc.), and speech style. The phrasal prosodic structure of the utterance (Selkirk 1978; Nespor & Vogel 1986; Fujimura 1992; Shattuck-Hufnagel & Turk 1996) is planned to maximise the likelihood of being understood by a listener, by highlighting (with prosodic prominence) and demarcating (with prosodic boundary strength) the words that are likely to be less recognizable due to their context (Aylett 2000; Aylett & Turk 2004; Turk 2010). That is, prosodic structure is planned with the goal of making all the words in the utterance equally understandable, given their predictability in context and utterance length, while respecting conventions of language-specific prosodic well-formedness (e.g. the tendency to pitch accent words at phrase edges in some languages).

The Phonological Plan also includes prioritisation of the goals for the utterance. The priority for being understood by a listener (global intelligibility) will usually be high, but whether global intelligibility can be achieved will depend on its priority relative to other requirements (e.g. rate). Stylistic requirements are implicit in the choice of acoustic cues, and in the prioritisation of other requirements.

What results from Phonological Planning?

The outcome of Phonological Planning is a prosodically-structured string of wordforms (Garrett 1980; Levelt 1989), with associated acoustic cues in their symbolic, contrastive form (Stevens 1987), along with prioritized task requirements for the utterance. See Figure 1.
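
To make this output concrete, the sketch below renders one possible form of it as a data structure. This rendering is purely illustrative: XT/3C does not prescribe a format, and all class and field names here (WordForm, task_priorities, etc.) are hypothetical. The key point it illustrates is that the cue specifications are symbolic labels, not quantitative values.

```python
from dataclasses import dataclass, field

# Hypothetical rendering of the output of Phonological Planning.
# Cues are symbolic/relational labels; no quantitative detail yet.
@dataclass
class WordForm:
    phonemes: list[str]             # e.g. ["t", "aa", "p"]
    cues: dict[str, list[str]]      # phoneme -> contrastive cue labels
    prominence: int = 0             # prosodic prominence level
    boundary_after: int = 0         # strength of the following boundary

@dataclass
class PhonologicalPlan:
    words: list[WordForm]
    task_priorities: dict[str, float] = field(default_factory=dict)

plan = PhonologicalPlan(
    words=[WordForm(["t", "aa", "p"],
                    {"t": ["stop-burst", "voiceless"], "aa": ["low", "back"]},
                    prominence=2, boundary_after=1)],
    task_priorities={"global_intelligibility": 0.9, "rate": 0.4},
)
print(len(plan.words), plan.task_priorities["global_intelligibility"])
```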

2.2. Phonetic Planning

The purpose of XT/3C’s Phonetic Planning component is to plan articulations that accomplish the goals set out in Phonological Planning. This component thus provides a detailed articulatory plan for an utterance, based on the output of the Phonological Planning stage, including its prioritized task requirements (global intelligibility, rate of speech, etc.). As noted earlier, explicit representation of surface phonetic goals contrasts with the approach taken in spatiotemporal models, where phonetic patterns emerge from gestural representations in the lexicon, and adjustments to their activations, without having to be explicitly planned.

The Phonetic Planning processing component in XT/3C adopts 1) Lee’s Tau theory as a dynamic model for planning and generating articulatory movement (Lee 1998, 2009); and 2) an optimization procedure based on Optimal Control Theory (Bellman 1957; Pontryagin et al. 1962; Jordan & Wolpert 1999; Todorov & Jordan 2002; Shadmehr & Wise 2005; Nelson 1983; Nelson et al. 1984; Lindblom 1990; Šimko & Cummins 2010, 2011; Houde & Nagarajan 2011, 2016). Like the mass-spring-based systems used in many spatiotemporal approaches, Tau theory is a theory of how movements are controlled. However, it differs from these in its assumption that movements are controlled to reach a goal at an explicitly specified time. It therefore represents the time of goal achievement, and assumes that coordination and relative timing are based on goal-related movement endpoints (as opposed to movement onsets, as in many spatiotemporal models). The (relative) time of goal achievement is specified using a general-purpose timekeeper, which uses units of (milli)seconds, without clock-slowing in particular prosodic positions. Tau theory also assumes context-specific targets for each movement (as opposed to the invariant targets of many other approaches). Elie et al. (2023) show that the Tau theory dynamic model provides a better fit to real speech movement trajectories than approaches based on mass-spring oscillators.
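
To illustrate the dynamic model, the sketch below generates a movement trajectory under tauG guidance, using the standard closed-form solution of Lee’s coupling equations: coupling the tau of a gap, τ_x(t) = x(t)/ẋ(t), onto the intrinsic guide τ_G(t) = (t² − T²)/(2t) via τ_x = k·τ_G yields x(t) = x₀(1 − t²/T²)^(1/k). The parameter values are illustrative, and the code is a minimal sketch, not drawn from the XT/3C implementation.

```python
import numpy as np

def tau_guided_gap(x0, T, k, n=500):
    """Close a gap x under tauG guidance (Lee 1998).

    Coupling tau_x(t) = k * tau_G(t), with the intrinsic guide
    tau_G(t) = (t**2 - T**2) / (2 * t), has the closed-form solution
    x(t) = x0 * (1 - (t / T)**2)**(1 / k).
    """
    t = np.linspace(0.0, T, n)
    x = x0 * (1.0 - (t / T) ** 2) ** (1.0 / k)
    return t, x

# Example: a 10 mm constriction gap closed in 150 ms. The coupling
# constant k shapes the velocity profile; k around 0.5 gives the
# near-symmetric, bell-shaped profiles reported for speech movements.
t, x = tau_guided_gap(x0=10.0, T=0.150, k=0.5)
v = np.gradient(x, t)
print(f"gap closed at t = {t[-1] * 1000:.0f} ms; "
      f"peak speed {abs(v).max():.1f} mm/s at t = {t[abs(v).argmax()] * 1000:.0f} ms")
```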

The optimisation mechanism makes it possible for XT/3C to implement the influence of multiple factors on the phonetic characteristics of each sound in an utterance. This procedure finds Tau parameter values for movements that accomplish the Phonological Planning goals at minimum cost. The costs that are minimised, each discussed in turn below, include 1) effort (cf. Kirchner 1998), 2) time (Shadmehr et al. 2010), and 3) a local intelligibility cost, i.e. a cost of not being intelligible based on sensory characteristics. These cost weights can be specified differently for different positions in prosodic structure (Windmann et al. 2015), as the sketch below illustrates. See Elie et al. (2024a) for a demonstration of this approach.
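
The sketch below illustrates the logic of this optimization under simplifying assumptions of my own: effort is approximated as integrated squared velocity, the time cost as movement duration, and local unintelligibility as the squared distance of the achieved target from a canonical target. These stand-ins, and all weight values, are hypothetical; Elie et al. (2024a) define the actual cost terms. With a heavy intelligibility weight (a prosodically strong position) the optimizer hits the canonical target; with a light weight (a weak position) it undershoots, saving effort.

```python
import numpy as np
from scipy.optimize import minimize

def tau_trajectory(start, end, T, k=0.5, n=200):
    # TauG-guided movement from start to end over duration T (closed form).
    t = np.linspace(0.0, T, n)
    x = end + (start - end) * (1.0 - (t / T) ** 2) ** (1.0 / k)
    return t, x

def total_cost(params, start, canonical_target, w_effort, w_time, w_intel):
    end, T = params
    t, x = tau_trajectory(start, end, T)
    v = np.gradient(x, t)
    effort = np.sum(v ** 2) * (t[1] - t[0])   # stand-in effort cost
    intel = (end - canonical_target) ** 2     # stand-in unintelligibility cost
    return w_effort * effort + w_time * T + w_intel * intel

# Illustrative weights: a strong position penalises unintelligibility
# heavily; a weak position penalises it lightly, so the optimal target
# undershoots the canonical one (hypoarticulation).
bounds = [(1.0, 12.0), (0.05, 0.6)]           # target (mm), duration (s)
strong = minimize(total_cost, [8.0, 0.2],
                  args=(0.0, 10.0, 1e-3, 1.0, 2.0), bounds=bounds)
weak = minimize(total_cost, [8.0, 0.2],
                args=(0.0, 10.0, 1e-3, 1.0, 0.02), bounds=bounds)
print(f"strong: target {strong.x[0]:.1f} mm in {strong.x[1] * 1000:.0f} ms; "
      f"weak: target {weak.x[0]:.1f} mm in {weak.x[1] * 1000:.0f} ms")
```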

Effort

Minimization of the effort cost is required to model coarticulation in the sense of modifying movement targets according to adjacent context. In addition, minimization of this cost ensures near-symmetric velocity profiles (Nelson 1983; Elie et al. 2023), (nearly) straight movement paths (Morasso 1981), and reductions in movement distance at fast rates of speech (Nelson et al. 1984).

Time

Minimization of the time cost is needed in order to achieve coarticulation in the sense of articulatory overlap. That is, utterances with overlapped articulations can be produced in less time than utterances with non-overlapped articulations. Likewise, minimization of the time cost is required to account for the (near-)linear relationship between peak velocity and distance; otherwise, longer-distance movements would simply be produced over a longer time. Adjusting the weight of the time cost across all positions in the utterance can be used to achieve utterance-wide rate of speech effects, and local adjustments of this weight can be used to achieve local temporal effects such as phrase-final lengthening (Windmann et al. 2015).

Local Intelligibility

The local intelligibility cost is required to account for phonological contrast; if this cost (the cost of not being intelligible) is high enough, intended phonemes are likely to be produced in a way that makes it possible for listeners to recognize them based on their sensory characteristics. The interplay between the weights for the local intelligibility and other costs (e.g. effort), specified differently for different prosodic positions, can account for vowel centralisation and consonant lenition in prosodically weak positions (Elie et al. 2024a). The local intelligibility cost in Phonetic Planning is related to the global intelligibility requirement used in Phonological Planning, but differs from it in two key respects. The requirement for global intelligibility used in Phonological Planning is the speaker’s desire for the utterance to be understood by a listener. Whether global intelligibility is achieved will depend not only on the sensory characteristics of the phonemes in the to-be-produced utterance (local intelligibility), but also on linguistic and real-world knowledge (i.e. the predictability of each word in the utterance from context, and parsing possibilities relating to utterance length). In contrast, local intelligibility in Phonetic Planning refers to the recognition likelihood of each phoneme, predicted exclusively from sensory characteristics (auditory, somatosensory, perhaps visual). Thus, global intelligibility differs from local intelligibility 1) in the information taken into consideration in predicting it (global intelligibility is based on both sensory information and contextual information, such as predictability and parsing possibilities, while local intelligibility is based only on sensory characteristics); and 2) in the domain over which it is predicted: global intelligibility is the speaker’s desire for the listener to understand the entire utterance, whereas local intelligibility is the speaker’s estimate of how likely it is that the listener will recognize each phoneme (based on sensory characteristics).

Following the Smooth Signal Redundancy hypothesis (Aylett 2000; Aylett & Turk 2004; Turk 2010), XT/3C assumes that the speaker plans prosodic structure in order to increase the likelihood of achieving global intelligibility: in Phonological Planning, speakers plan to highlight and demarcate words that are less likely to be understood based on their linguistic and real-world context, as well as on parsing possibilities. The resulting prosodic boundary strength and relative prominence, along with the priority of global intelligibility, then determine the weighting of the local intelligibility cost for phonemes in each prosodic position in Phonetic Planning, relative to the weighting of the other costs, i.e. effort and time. The local intelligibility cost, which differs for different prosodic positions, will thus influence acoustic salience and distinctiveness: in prosodically strong positions (e.g. stressed syllables), with a higher cost for not being intelligible, speakers will plan hyper-articulated movements that achieve salient and distinctive acoustic characteristics. In weaker positions (e.g. unstressed syllables), with a lower local intelligibility cost, they will plan hypo-articulated movements to save effort, and will therefore produce less salient and less distinctive characteristics.

The Use of Internal Models

In order to plan minimum-cost movements, speakers need to predict the local intelligibility cost of articulations that they might produce. To do this, they need to have internalized models of a) the relationships between articulation and its sensory consequences, i.e. acoustics and somatosensation1 (Guenther 1995, 2016; Patri et al. 2019), and b) the likelihood that listeners will recognise each phoneme, based on the predicted sensory consequences of articulation (Patri et al. 2019). These models are learned and updated through experience (Guenther 1995, 2016; Patri et al. 2018, 2019; Parrell et al. 2019; Kim et al. 2023). In addition, by providing links 1) between articulation and sensory characteristics (acoustics and somatosensation), and 2) between these sensory characteristics and symbolic phonological elements, these internal models are essential for the mapping between phonology and phonetics in this framework. A toy sketch of both models is given below.
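
The sketch below illustrates the two internal models in toy form: a linear articulation-to-acoustics map standing in for model (a), and a softmax recognizer over category templates standing in for model (b), loosely in the spirit of the Bayesian formulation of Patri et al. (2019). All weights, templates, and the noise parameter are invented for illustration; in the framework itself these mappings are learned from experience.

```python
import numpy as np

# Toy forward model: maps an articulatory parameter vector to predicted
# acoustic features (here, two formant-like values in Hz). A real
# internal model would be learned from experience (Guenther 1995, 2016).
def forward_model(artic):
    W = np.array([[300.0, -80.0], [-150.0, 400.0]])  # illustrative weights
    bias = np.array([500.0, 1500.0])
    return bias + W @ artic

# Toy recognition model: likelihood that a listener recognises each
# phoneme category, as a softmax over squared distances to category
# templates in acoustic space.
TEMPLATES = {"i": np.array([300.0, 2300.0]),
             "a": np.array([700.0, 1200.0]),
             "u": np.array([320.0, 800.0])}

def recognition_probs(acoustics, noise=150.0):
    cats = list(TEMPLATES)
    d2 = np.array([np.sum((acoustics - TEMPLATES[c]) ** 2) for c in cats])
    logits = -d2 / (2.0 * noise ** 2)
    p = np.exp(logits - logits.max())
    return dict(zip(cats, p / p.sum()))

# Local intelligibility cost for an intended /a/: one minus the
# predicted recognition probability of the intended category.
acoustics = forward_model(np.array([0.9, -0.4]))
p = recognition_probs(acoustics)
print(f"predicted acoustics: {acoustics.round(0)}, "
      f"P(a) = {p['a']:.2f}, cost = {1.0 - p['a']:.2f}")
```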

2.3. Motor-Sensory Implementation

Once a detailed phonetic plan for articulating an utterance has been created, the final step is to implement it, and to monitor and adjust the ongoing articulation to ensure that, as far as possible, the phonological goals are met. This occurs in the third processing stage, the Motor-Sensory Implementation component. It is so named because it relies heavily on sensory feedback (both acoustic and somatosensory) to control the ongoing motor activity of articulation (Perkell et al. 2000; Perkell 2012; Cai et al. 2010; Niziolek et al. 2013; Mitsuya et al. 2015; Klein et al. 2019; Karlin et al. 2021, among many others). Sensory feedback is also used to update the internal models, based on feedback about the utterance that is being produced (e.g. Patri et al. 2018). This process may also rely on information from listeners about communicative success.

Updating these internal models provides an account of the compensation and adaptation behavior that occurs in response to altered feedback, either acoustic or somatosensory. For example, lowering the first formant (F1) of a vowel in the acoustic feedback that speakers hear of their own speech while articulating causes them to raise F1 in partial compensation on subsequent trials (Patri et al. 2018, 2019; Parrell et al. 2019; Kim et al. 2023).
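
The sketch below shows the logic of such adaptation with a schematic delta rule: a fixed downward F1 perturbation produces an error between heard and intended F1, the internal model is nudged by a fraction of that error on each trial, and production drifts upward in partial compensation. The learning rule and all values are illustrative assumptions, not the specific update equations of Patri et al. (2018) or Kim et al. (2023).

```python
# Schematic sensorimotor adaptation to altered auditory feedback:
# the apparatus lowers the F1 the speaker hears; updating the internal
# model on the resulting prediction error drives produced F1 upward
# (partial compensation) over trials.
target_f1 = 700.0      # intended F1 (Hz)
model_offset = 0.0     # learned correction stored in the internal model
perturbation = -100.0  # feedback alteration: F1 heard is 100 Hz too low
learning_rate = 0.25   # fraction of the error corrected per trial

for trial in range(1, 9):
    produced = target_f1 - model_offset   # plan compensates via the model
    heard = produced + perturbation       # altered auditory feedback
    error = heard - target_f1             # prediction error at the ear
    model_offset += learning_rate * error # delta-rule model update
    print(f"trial {trial}: produced F1 = {produced:.0f} Hz")
```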

The current version of the Motor-Sensory Implementation component can produce an utterance, and can update the internal models on the basis of sensory feedback. Work in progress will account for adjustments to both ongoing and subsequent articulation in response to feedback.

Using XT/3C to replicate speech behavior

XT/3C accounts for systematic phonetic variability through the interplay of cost function weights in Phonetic Planning, set on the basis of the prioritization of tasks in Phonological Planning. On this basis, XT/3C has successfully modeled near-symmetric velocity profiles (Elie et al. 2023), coarticulation (Elie et al. 2024a), rate-of-speech effects (Elie et al. 2024a), speech in noise (Lombard 1911; Elie et al. 2024b), and effects of prosodic prominence structure (Elie et al. 2024a). XT/3C provides a promising approach for accounting for dynamic characteristics of speech articulation, the influence of multiple factors on speech output, and speakers’ flexibility to adapt speech appropriately to each context, while preserving the insights of symbol-based phonology.

3. Acknowledgements

We gratefully acknowledge helpful discussions with Dave Lee, and graphical assistance from Hsi-Er Liu. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 101019847).


  1. Somatosensation includes sensations of touch and pressure from the articulators in the vocal tract, such as the tongue against the roof of the mouth. It also includes proprioceptive information about the spatial location of the articulators.↩︎


References

Aylett, M. P. (2000). Stochastic Suprasegmentals: Relationships between Redundancy, Prosodic Structure and Care of Articulation in Spontaneous Speech [Ph.D. Dissertation, University of Edinburgh].

Aylett, M., & Turk, A. (2004). The Smooth Signal Redundancy Hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47, 31-56.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.

Browman, C. P., & Goldstein, L. (1985). Dynamic modeling of phonetic structure. In V. A. Fromkin (Ed.), Phonetic Linguistics (pp. 35-53). Academic Press.

Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: an overview. Phonetica, 49(3-4), 155-180.

Byrd, D., & Saltzman, E. (2003). The elastic phrase: modeling the dynamics of boundary-adjacent lengthening. Journal of Phonetics, 31(2), 149-180. https://doi.org/10.1016/S0095-4470(02)00085-2

Cai, S., Ghosh, S. S., Guenther, F. H., & Perkell, J. S. (2010). Adaptive auditory feedback control of the production of formant trajectories in the Mandarin triphthong /iau/ and its pattern of generalization. Journal of the Acoustical Society of America, 128(4), 2033–2048.

Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. Harper & Row.

Elie, B., Lee, D. N., & Turk, A. (2023). Modeling trajectories of human speech articulators using general Tau theory. Speech Communication, 151, 24-38.

Elie, B., Šimko, J., & Turk, A. (2024a). Optimization-based planning of speech articulation using general Tau Theory. Speech Communication, 160, 103083.

Elie, B., Šimko, J., & Turk, A. (2024b). Optimization-based modeling of Lombard speech articulation: Supraglottal characteristics. JASA Express Letters, 4(1).

Fowler, C. A. (1977). Timing control in speech production (Vol. 134). Indiana University Linguistics Club.

Fowler, C. A., Rubin, P., Remez, R. E., & Turvey, M. T. (1980). Implications for speech production of a general theory of action. In B. Butterworth (Ed.), Language Production (pp. 373-420). Academic Press.

Fujimura, O. (1992). Phonology and phonetics-A syllable-based model of articulatory organization. Journal of the Acoustical Society of Japan (E), 13(1), 39-48.

Garrett, M. F. (1980). Levels of processing in sentence production. In B. Butterworth (Ed.), Language production. Vol. 1: Speech and talk (pp. 177-220). Academic Press.

Guenther, F. H. (1995). Speech Sound Acquisition, Coarticulation, and Rate Effects in a Neural-Network Model of Speech Production. Psychological Review, 102(3), 594-621.

Guenther, F. H. (2016). Neural Control of Speech. The MIT Press.

Henke, W. L. (1966). Dynamic Articulatory Model of Speech Production Using Computer Simulation [Ph.D. Dissertation, Massachusetts Institute of Technology].

Houde, J. F., & Nagarajan, S. S. (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5(Article 82), 1-14.

Houde, J. F., & Nagarajan, S. S. (2016). In Neurobiology of Language (pp. 221-238). Academic Press.

Jordan, M. I., & Wolpert, D. M. (1999). Computational motor control. In M. Gazzaniga (Ed.), The Cognitive Neurosciences. MIT Press.

Karlin, R., Naber, C., & Parrell, B. (2021). Auditory feedback is used for adaptation and compensation in speech timing. Journal of Speech, Language, and Hearing Research, 64, 3361-3381.

Keating, P. A. (1990). The window model of coarticulation: Articulatory evidence. In J. C. Kingston & M. E. Beckman (Eds.), Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech (pp. 450-469). Cambridge University Press.

Kim, K. S., Gaines, J. L., Parrell, B., Ramanarayanan, V., Nagarajan, S. S., & Houde, J. F. (2023). Mechanisms of sensorimotor adaptation in a hierarchical state feedback control model of speech. PLOS Computational Biology, 19(7), e1011244.

Kingston, J., & Diehl, R. L. (1994). Phonetic knowledge. Language, 70(3), 419-454.

Kirchner, R. M. (1998). An Effort-Based Approach to Consonant Lenition. PhD Dissertation. University of California, Los Angeles.

Klein, E., Brunner, J., & Hoole, P. (2019). The relevance of auditory feedback for consonant production: The case of fricatives. Journal of Phonetics, 77, 100931.

Lee, D. N. (1998). Guiding movement by coupling taus. Ecological Psychology, 10(3-4), 221-250.

Lee, D. N. (2009). General Tau Theory: evolution to date. Special Issue: Landmarks in Perception. Perception, 38, 837-858.

Leonard, T., & Cummins, F. (2011). The temporal relation between beat gestures and speech. Language and Cognitive Processes, 26(10), 1457-1471.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press.

Lindblom, B. (1990). Explaining Phonetic Variation: A Sketch of the H&H Theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech Production and Speech Modelling (Vol. 55, pp. 403-439). Kluwer Academic Publishers.

Lombard, E. (1911). Le signe de l'élévation de la voix. Annales des maladies de l'oreille, du larynx, du nez et du pharynx, 27, 101-119.

Mitsuya, T., MacDonald, E. N., Munhall, K. G., & Purcell, D. W. (2015). Formant compensation for auditory feedback with English vowels. Journal of the Acoustical Society of America, 138(1), 413-424.

Morasso, P. (1981). Spatial control of arm movements. Experimental Brain Research, 42, 223-227.

Nelson, W. L. (1983). Physical principles of economies of skilled movements. Biological Cybernetics, 46, 135-147.

Nelson, W. L., Perkell, J., & Westbury, J. (1984). Mandible movements during increasingly rapid articulations of single syllables: Preliminary observations. Journal of the Acoustical Society of America, 75, 945-951.

Nespor, M., & Vogel, I. (1986). Prosodic Phonology. Foris Publications.

Niziolek, C. A., Nagarajan, S. S., & Houde, J. F. (2013). What does motor efference copy represent? Evidence from speech production. The Journal of Neuroscience, 33(41), 16110 –16116.

O’Brien, J. (2012). An experimental approach to debuccalization and supplementary gestures. [PhD Dissertation, University of California Santa Cruz]. https://escholarship.org/uc/item/1cm694ff

Parrell, B., Lammert, A. C., Ciccarelli, G., & Quatieri, T. F. (2019). Current models of speech motor control: A control-theoretic overview of architectures and properties. Journal of the Acoustical Society of America, 145(3), 1456–1481.

Patri, J.-F., Perrier, P., Schwartz, J.-L., & Diard, J. (2018). What drives the perceptual change resulting from speech motor adaptation? Evaluation of hypotheses in a Bayesian modeling framework. PLOS Computational Biology, 14(1), e1005942.

Patri, J.-F., Diard, J., & Perrier, P. (2019). Modeling sensory preference in speech motor planning: A Bayesian modeling framework. Frontiers in Psychology, 10, 2339.

Perkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. Journal of Neurolinguistics, 25, 382–407.

Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Perrier, P., Vick, J., Wilhelms-Tricarico, W., & Zandipour, M. (2000). A theory of speech motor control and supporting data from speakers with normal hearing and with profound hearing loss. Journal of Phonetics, 28, 233-272.

Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., & Mishchenko, E. F. (1962). The Mathematical Theory of Optimal Processes (Russian), English translation. Interscience.

Saltzman, E., Nam, H., Krivokapić, J., & Goldstein, L. (2008). A task-dynamic toolkit for modeling the effects of prosodic structure on articulation. In P. A. Barbosa, S. Madureira, & C. Reis (Eds.), Proceedings of the Speech Prosody 2008 Conference (pp. 175-184). LBASS.

Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4), 333-382.

Scobbie, J. M., Sebregts, K., & Stuart-Smith, J. (2009). Dutch rhotic allophony, coda weakening, and the phonetics-phonology interface. QMU Speech Science Research Centre Working Paper.

Selkirk, E. O. (1978). On prosodic structure and its relation to syntactic structure. In T. Fretheim (Ed.), Nordic Prosody II (pp. 111-140). TAPIR.

Shadmehr, R., Orban de Xivry, J. J., Xu-Wilson, M., & Shih, T.-Y. (2010). Temporal discounting of reward and the cost of time in motor control. The Journal of Neuroscience, 30(31), 10507–10516.

Shadmehr, R., & Wise, S. P. (2005). The Computational Neurobiology of Reaching and Pointing: A Foundation for Motor Learning. MIT Press.

Shattuck-Hufnagel, S., & Turk, A. (1996). A prosody tutorial for investigators of auditory sentence processing. Journal of Psycholinguistic Research, 25(2), 193-247.

Šimko, J., & Cummins, F. (2010). Embodied Task Dynamics. Psychological Review, 117(4), 1229-1246.

Šimko, J., & Cummins, F. (2011). Sequencing and optimization within an embodied Task Dynamic model. Cognitive Science, 35, 527–562.

Stevens, K. N. (1987). Relational properties as perceptual correlates of phonetic features. In Proceedings of the International Congress of Phonetic Sciences.

Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111(4), 1872-1891.

Tilsen, S. (2022). An informal logic of feedback-based temporal control. Frontiers in Human Neuroscience, 16, 851991.

Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11), 1226-1235.

Turk, A. (2010). Does prosodic constituency signal relative predictability? A Smooth Signal Redundancy hypothesis. Laboratory Phonology, 1, 227-262.

Turk, A., & Shattuck-Hufnagel, S. (2020). Speech Timing: Implications for Theories of Phonology, Phonetics, and Speech Motor Control. Oxford University Press.

Windmann, A., Šimko, J., & Wagner, P. (2015). Optimization-based modeling of speech timing. Speech Communication, 74, 76-92.