Syntax studies the way in which words combine to form broader units and the relationships between words, phrases, and sentences. Typically, the study of syntax includes the analysis of phenomena such as word order, agreement, dependencies between elements, and the functional roles that various units assume in relation to others.
In the grammatical tradition, the study of syntax has been based on the description of written language. As a result, it has been shaped by the so-called written language bias in linguistics (Linell 2005), whereby theories, categories, and methodologies developed for the analysis of written language have been extended to the study of language in general, including spoken language. This bias has had a substantial scientific impact, despite speaking being a more fundamental activity than writing in linguistic behavior and development. This holds both ontogenetically, since humans acquire spoken language before learning to write, and phylogenetically, since the evolution of spoken languages in human history significantly predates the invention of writing. For these reasons, many scholars have assumed that the study of spoken syntax should begin with the fundamental characteristics of oral communication, which are essentially tied to its interactive, contextual, and multimodal nature.
The need to communicate efficiently in real time requires speakers to organize their syntax dynamically (Chafe 1982; Schegloff 1996), since long-term planning is not possible. The speaker engages simultaneously in the planning and production of the message, which corresponds, on the hearer's side, to simultaneous reception and comprehension. Spoken syntax is therefore flexible and adaptive, and tends to favor relatively short and hierarchically simple constructions, reducing the risk of overloading working memory. Syntactic relationships are primarily organized sequentially, built in an additive manner, in response to ongoing communication and through the contribution of all participants in the dialogue (Auer 2009; Thompson, Fox & Couper-Kuhlen 2015). The use of subordinating, and even coordinating, conjunctions is often minimized, with a preference for paratactic constructions and with prosodic phrasing ensuring coherence (Cresti 2014).
The spatial proximity of interlocutors also plays an important role in the syntactic organization of speech (Goodwin 1981; Mondada 2007). Since speaker and hearer are typically co-present during spoken interaction, they share contextual salience and can directly refer to objects and events located in the space around them. Eye gaze also helps orient conversation: it is employed by both speakers and listeners to coordinate the progression of the interaction. These features help to optimize the time for message production and reception, and encourage the use of deictic elements and pointing gestures.
More generally, spoken language relies on an inherently multimodal system (Levinson & Holler 2014; Hagoort & Özyürek 2024). Vocal elements interact with gestures, facial expressions, and body movements, allowing the speaker to express themselves simultaneously on multiple levels, which together contribute to the act of enunciation (Kendon 2004). It is no accident that the first combinations of symbols during linguistic development occur between single words and gestures, before word-to-word combinations (Özçalışkan & Goldin-Meadow 2005).
Furthermore, the medium of vocal transmission has inherent properties that determine the characteristics of the communicative signal: spoken productions are transient, and the acoustic trace vanishes and cannot be amended (Voghera 2017).
The syntax of spoken language unfolds linearly along the auditory signal. This signal, however, is not composed exclusively of lexical items, as it typically includes so-called disfluency phenomena (Fox Tree 1995; Ferreira & Bailey 2004), which correspond to interruptions in the verbal flow, primarily linked to moments of speech planning. Among the most characteristic instances are non-lexical vocalizations, such as filled pauses (produced through laryngealizations, nasalizations, or vocalizations) and vowel elongations. Other types of disfluencies involve proper verbal elements that are 'discarded' by the speaker during the production process, such as false starts, hesitations, repetitions, self-corrections, and reformulations.
The speech flow is also rich in discourse markers (Schiffrin 1987; Fraser 1999), consisting of single lexical items or brief expressions that lack syntactic compositionality with the discourse content and are primarily associated with regulatory functions in interaction (Chafe 1994). These markers may address the relationship with the interlocutor or serve to ensure textual cohesion. Common examples of discourse markers include expressions used to take or maintain the conversational turn, draw the interlocutor’s attention, or ensure that the communicative channel is effectively open.
Both disfluencies and discourse markers are highly pervasive in spontaneous speech, systematically segmenting the spoken stream and being perceived as entirely natural within the flow of speech. Although traditionally excluded from the domain of syntactic analysis, several authors argue that these elements are integral and inherent to the syntactic structure of spoken language (Lacheret-Dujour, Kahane & Pietrandrea 2019).
The following example illustrates these phenomena. The reported turn is extracted from a dialogue ("Hearts") in which a speaker explains the rules of a card game to her interlocutor. The dialogue is part of the Santa Barbara Corpus of Spoken American English (SBCSAE; Du Bois et al. 2000-2005), and the example is available online in the SLAC database (Gregori, Panunzi & Rocha 2020), which provides access to the corresponding audio. In the initial part of the turn, a discourse marker (underlined) occurs, followed by various instances of disfluency (mainly fragments and false starts), marked with the symbol &.
(1) | 26. *JEN: well because &I cause if I &t if I take a &tr the &k diamond trick , and somebody didn't have diamonds and they threw a heart into that pile , I was gonna take that with that ace |
One of the core issues in the study of syntax is the identification of the basic units, which have traditionally been regarded as sentences. Nevertheless, the canonical notion of sentence, derived from grammatical tradition and reflected in much of contemporary syntactic theory, does not systematically apply to the linguistic reality of spoken discourse, where many syntactic units do not correspond to complete sentences at all. This discrepancy is particularly evident in the analysis of spontaneous texts, which has characterized many of the original approaches to spoken syntax across different theoretical frameworks (Brazil 1995; Blanche-Benveniste 1997; Miller & Weinert 1998; Cresti 2000; Mithun 2008; Thompson, Fox & Couper-Kuhlen 2015).
To move beyond the traditional notion of sentence as the fundamental unit for the analysis of spoken syntax, several alternatives have been advanced. Among these are the clause (Miller & Weinert 1998), defined as a unit with different possible syntactic fillers, and the C-unit (Biber et al. 1999), which may consist of clausal elements, phrases, or other material. Considerable attention has been directed toward syntactic units that diverge from canonical sentence structures, as they are not organized around the central role of the verb. Let's consider another excerpt from the SBCSAE (Du Bois et al. 2000-2005), also available online in the SLAC database (Gregori, Panunzi & Rocha 2020; text "Navy").
(2) | 13. TOC: it was just an exciting thing to do |
14. TOB: so how many years there? |
The question in speaker TOB's turn does not contain any verb. Even though structures like this have long been regarded as deficient in comparison to their canonical counterparts, they can be recognized as having an independent syntactic status, since it is not possible to establish a unique derivational process from one to the other, nor a direct functional equivalence.
Non-verbal units exhibit considerable variation and encompass a range of structures. Besides nominal sentences proper, characterized by a subject-predicate configuration, they include autonomous nominal and prepositional phrases, greeting formulas, interjections, pro-sentences, sentence fragments, and more. Such units have been categorized differently across various English grammars, being described as "irregular sentences" (Quirk et al. 1985), "non-clausal material" (Biber et al. 1999), and "minor clause types" (Huddleston & Pullum 2002). Within the generative tradition, they are commonly identified as "nonsentential constructions" (Barton 1990; Progovac et al. 2006). Non-verbal units are very frequent in spoken language, accounting for approximately 30-40% of units across various languages (Biber et al. 1999; Cresti & Moneglia 2005).
Verbal clauses in spoken language also exhibit peculiar features. They are generally short and contain on average two to three phrases (Miller & Weinert 1998; Biber et al. 1999). Additionally, they often show lower valency saturation, since arguments are often readily accessible within the discourse or context, with a frequent occurrence of light phrases and pronominal fillers.
The structural simplicity of constituents, also called "lightness", is a general characteristic of spoken language and can be considered a common strategy for achieving concise and efficient communication. By keeping phrases brief and minimizing overly complex syntactic hierarchies, cognitive processing demands can be reduced without compromising semantic and referential content, which can easily be integrated through the shared context (Fox & Thompson 2010; Voghera 2017).
Several frameworks for the analysis of spoken syntax have adopted a modular approach, incorporating multiple levels of organization in oral discourse. At a first, strictly syntactic level, the principles of government and agreement are applied within dependency relations between a syntactic head and its associated elements. At a higher level, broader units can be identified, characterized by a syntactic cohesion that goes beyond the domain of government relations and integrates pragmatic and textual dimensions of discourse. This distinction is evident, for example, in the differentiation between C(lause)-units and T(ext)-units (Biber et al. 1999) and, similarly, in the conceptual separation of microsyntax and macrosyntax within the theoretical framework of the Approche Pronominale (Blanche-Benveniste 1990, 1997; Debaisieux 2013).
The interaction between syntax and prosody in spoken language has been highlighted since the earliest studies on spoken English and throughout the following decades (Palmer 1924; Bloch & Trager 1942; Halliday 1967), which have contributed to shaping a general perspective in which intonational phenomena are considered an integral part of grammar. First of all, prosody is essential for the segmentation of speech, which in turn is fundamental for defining the relationships between words in a given sequence. The following example, taken from Izre'el et al. (2020), shows how the same linear sequence of words can form different syntactic structures depending on how it is prosodically segmented:
(3) | a. People (Calling)! Give John the book I promised him (Order)! |
b. People give John the book I promised him (Assertion). |
c. People give John the book (Question)? I promised him (Assertion). |
d. People (Calling)! Give John the book (Order)! I promised him (Assertion). |
As evidenced by the example just provided, the prosodic characteristics of spoken language serve not only to delineate the syntactic boundaries of units but also to fulfill broader functions, such as emphasizing focal elements, indicating the presence of discourse markers, and, most significantly, signaling pragmatic values, including questions, orders, and assertions.
Although it is widely recognized that prosody, alongside syntax, plays a crucial role in the structural organization of spoken language, there remains substantial debate regarding the interplay between these two components and which one should be considered primary. Syntacticist approaches often posit that prosody operates as a reflection of syntactic structure, following a mapping process that proceeds more or less directly from an abstract representation to a fully specified phonological representation (Selkirk 1984). Some studies place greater emphasis on the bidirectional interaction between prosody and syntax (Nespor & Vogel 1986), within a framework where the hierarchy of projected structure is considered central. Conversely, various theoretical frameworks position prosody as central to the segmentation of spoken discourse into basic units (Izre'el et al. 2020). These frameworks range from models where prosody correlates with syntax in identifying basic units (e.g. Basic Discourse Units in Degand & Simon 2009) to approaches where the identification of basic units relies primarily on prosodic boundaries and correlates with semantic and pragmatic properties. For instance, intonation units as described in Chafe (1994) delineate boundaries that correspond to cognitive constructs (idea units), while in Cresti (2000) utterances, identified prosodically and independently of syntax, are assumed to accomplish an illocutionary force. These frameworks underscore the dynamic nature of spoken language, where prosodic cues often supersede syntactic constraints in spontaneous communication.
Auer, P. (2009). Online-syntax: thoughts on the temporality of spoken language. Language Sciences, 31(1), 1-13.
Barton, E. (1990). Nonsentential Constituents: A theory of grammatical structure and pragmatic interpretation. Amsterdam: John Benjamins.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. (1999). The Longman Grammar of Spoken and Written English. London: Longman.
Blanche-Benveniste, C. (1990). Le français parlé, études grammaticales. Paris: Edition du CNRS.
Blanche-Benveniste, C. (1997). Approches de la langue parlée en Français. Paris: Ophrys.
Bloch, B. & Trager, G. (1942). Outline of linguistic analysis. Baltimore (MD): Linguistic Society of America.
Brazil, D. (1995). A grammar of speech. Oxford: Oxford University Press.
Chafe, W. (1982). Integration and involvement in speaking, writing, and oral literature. In D. Tannen (ed.), Spoken and written language: Exploring orality and literacy. Norwood (NJ): Ablex.
Chafe, W. (1994). Discourse, consciousness, and time: The flow and displacement of conscious experience in speaking and writing. Chicago (IL): The University of Chicago Press.
Cresti, E. (2000). Corpus di italiano parlato. Firenze: Accademia della Crusca.
Cresti, E. (2014). La parataxe dans le parlé spontané et dans l'écrit littéraire. CHIMERA: Romance Corpora and Linguistic Studies, 1, 1-29.
Cresti, E. & Moneglia, M. (eds.) (2005). C-ORAL-ROM: Integrated reference corpora for spoken Romance languages. Amsterdam: John Benjamins.
Debaisieux, J.-M. (ed.) (2013). Analyses linguistiques sur corpus: Subordination et insubordination en français. Cachan: Hermès & Lavoisier.
Degand, L. & Simon, A. (2009). Mapping prosody and syntax as discourse strategy: How Basic Discourse Units vary across genres. In A. Wichmann, D. Barth-Weingarten & N. Dehé (eds.), Where prosody meets pragmatics: Research at the interface. Bingley: Emerald, 79-105.
Du Bois, J., Chafe, W. L., Meyer, C. & Thompson, S. (2000–2005). Santa Barbara corpus of spoken American English, Parts 1–4. Philadelphia (PA): Linguistic Data Consortium.
Ferreira, F. & Bailey, K. G. D. (2004). Disfluencies and human language comprehension. Trends in Cognitive Sciences, 8(5), 231-237.
Fox, B. A. & Thompson, S. A. (2010). Responses to Wh-Questions in English Conversation. Research on Language and Social Interaction, 43(2), 133-156.
Fox Tree, J. E. (1995). The Effects of False Starts and Repetitions on the Processing of Subsequent Words in Spontaneous Speech. Journal of Memory and Language, 709-738.
Fraser, B. (1999). What are discourse markers? Journal of Pragmatics, 31(7), 931-952.
Goodwin, Ch. (1981). Conversational organization: Interaction between speakers and hearers. New York: Academic Press.
Gregori, L., Panunzi, A. & Rocha, B. (2020). SLAC: Spoken Language Annotation Comparison. Online database, https://doi.org/10.1075/scl.94.slac
Hagoort, P. & Özyürek, A. (2024). Extending the Architecture of Language From a Multimodal Perspective. Topics in Cognitive Science, https://doi.org/10.1111/tops.12728
Halliday, M. A. K. (1967). Intonation and grammar in British English. The Hague: Mouton.
Huddleston, R. & Pullum, G. (2002). The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Izre’el, Sh., Mello, H., Panunzi, A. & Raso, T. (2020). In Search of Basic Units of Spoken Language: A corpus-driven approach. Amsterdam: John Benjamins.
Kendon, A. (2004). Gesture: Visible Action as Utterance. Cambridge: Cambridge University Press.
Lacheret-Dujour, A., Kahane, S. & Pietrandrea, P. (2019). Rhapsodie: A prosodic and syntactic treebank for spoken French. Amsterdam: John Benjamins.
Levinson, S. C. & Holler, J. (2014). The Origin of Human Multi-Modal Communication. Philosophical Transactions of the Royal Society B, 369(1651), 1-9.
Linell, P. (2005). The written language bias in linguistics: Its nature, origins and transformations. London: Routledge.
Miller, J. E. & Weinert, R. (1998). Spontaneous spoken language: Syntax and discourse. Oxford: Oxford University Press.
Mithun, M. (2008). The extension of dependency beyond the sentence. Language, 84(1), 69-119.
Mondada, L. (2007). Multimodal resources for turn-taking: pointing and the emergence of possible next speakers. Discourse Studies, 9(2), 194-225.
Nespor, M. & Vogel, I. (1986). Prosodic phonology. Dordrecht: Foris Publications.
Özçalışkan, Ş. & Goldin-Meadow, S. (2005). Gesture is at the cutting edge of early language development. Cognition, 96(3), B101-B113.
Palmer, H. E. (1924). A grammar of spoken English: On a strictly phonetic basis. Cambridge: Heffer & Sons.
Progovac, L., Paesani, K., Casielles-Suárez, E. & Barton, E. (eds.) (2006). The Syntax of Nonsententials: Multidisciplinary perspectives. Amsterdam: John Benjamins.
Quirk, R., Greenbaum, S., Leech, G. & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. London: Longman.
Schegloff, E. A. (1996). Turn organization: one intersection of grammar and interaction. In E. Ochs, E. A. Schegloff, S. A. Thompson (eds.), Interaction and Grammar. Studies in Interactional Sociolinguistics. Cambridge: Cambridge University Press, 52-133.
Schiffrin, D. (1987). Discourse Markers. Cambridge: Cambridge University Press.
Selkirk. E. O. (1984). Phonology and Syntax: The Relation between Sound and Structure. Cambridge (MA): The MIT Press.
Thompson, S. A., Fox, B. A. & Couper-Kuhlen, E. (2015). Grammar in everyday talk: Building responsive actions. Cambridge: Cambridge University Press.
Voghera, M. (2017). Dal parlato alla grammatica. Costruzione e forma dei testi spontanei. Roma: Carocci.