Gesture in Speech
Giorgina Cantalini | Civica Scuola Interpreti e Traduttori ‘Altiero Spinelli’, Milano
Massimo Moneglia | University of Florence
The multimodal utterance

Co-speech gestures are visible bodily movements that play a role in the utterance's meaning. The whole body can be involved in the bodily action, so scholars refer not only to hand and arm gestures but also to movements of the eyebrows, neck, head, torso, and other articulators. However, hands and arms are recognized «as a separable expressive system» (Kendon, 2004: 101). They can be used in conjunction with spoken expressions as complements, supplements, substitutes, or alternatives (Kendon, 2004).

Gesture and speech share a common source in terms of cognitive impulse. According to Kendon (1980: 208-209), both are outputs of the same idea unit; similarly, in McNeill’s psychological model (McNeill, 1992; McNeill & Duncan, 2000: 14), in the minimal gesture-speech unit (the Growth Point) imagistic thoughts and categorial thoughts are inseparable and packaged together (comparable to the “unit of thought” in Chafe, 1994). Given the irreducible opposition between the categorial-linguistic and the imagistic components of this minimal unit, meaning is considered to be cast simultaneously in two semiotic modes: analytically, conventionally, and combinatorically in the former; globally, idiosyncratically, and synthetically in the latter. The utterance is the product of the dialectic process of unpacking the two (McNeill, 1992; 2005; 2015; McNeill & Duncan, 2000).

In this view, gesture and speech are two distinct manifestations of a more general process that includes both, because they «arise from a single process of utterance formation» (McNeill, 1992: 30), a fundamentally embodied phenomenon (Loehr, 2014). «By virtue of idiosyncrasy, co-expressive, speech-synchronized gestures open a ‘window’ onto thinking that is otherwise curtained» (McNeill & Duncan, 2000: 3). Gesture provides cues to the possible mental simulation of concepts; it inherently involves spatial imagery, and it frequently objectifies abstract concepts through metonymy and metaphor (Cienki, 2023).

Other theories do not accept that gesture and speech spring from the same cognitive source, but they still assume that both are guided by the same controlling mechanism: they are two independent forms of thinking within a single utterance, spatio-motoric thinking and analytic thinking, tightly coupled, collaborative, and reciprocally shaping (Kita, 2000: 171).

Gesture classification

Unlike “language signs”, gestures are characterized by highly variable semiotic profiles, but the hands alone show notable semiotic properties «of a core importance for a notation of gestural forms and a further analysis of gestural meaning» (Bressem, 2013: 1082). Typologies of gestures are relevant to understanding how the two semiotic modalities work together to produce multimodal meanings, and taxonomies vary significantly according to theoretical frames and objectives.

Ekman and Friesen (1969), in a socio-psychological and intercultural perspective, classified the whole spectrum of non-verbal behavior into five categories: emblems (standardized gestures); illustrators (gestures accompanying speech); regulators (of speaker interaction); affect displays (e.g., facial expressions); and adaptors (self-touching, manipulating objects).

From a verbal perspective, McNeill arranges the range of movements along a continuum, known as “Kendon’s continuum” (McNeill, 1992; McNeill, 2005). Going from presence to absence of speech, we find gesticulation, followed by emblems, then pantomime, and finally sign language. In gesticulation, the presence of speech is obligatory; with emblems, it is optional; with pantomime and sign language, it must be absent. In gesticulation, linguistic properties are absent, while in sign language they are fully present. Considering the parameter of conventionality, moreover, the continuum varies as follows: gesticulation is not conventionalized, emblems are partly conventionalized, and sign languages are fully conventionalized. As a semiotic characterization, gesticulation is global and synthetic, pantomime is global and analytic, emblems are segmented and synthetic, and signs are segmented and analytic (McNeill, 2005).

Gesticulations, in particular, are «symbols that exhibit meaning […] freely designated by the speaker» (McNeill, 1992: 105): they are created on the spot, not retrieved. They can show different semiotic dimensions: Metaphoric gestures realize «images of the abstract»; Iconics depict «concrete entities and/or actions»; Deictics locate entities and/or actions in the space around the speaker and the listener; Beats are the hand beating time and are formally the least elaborate of the four (McNeill, 2005: 39-40).

Metaphoric, iconic, and deictic gestures are propositional, since they are linked to the ideational process of the utterance, while beats are non-propositional (McNeill, 1992: 80) and stress segments of the discourse (McNeill, 2005: 41).

Semiotically speaking, within the whole “Iconic-Metaphoric-Deictic-Beat Quartet” (McNeill, 2005: 38), a distinction can be drawn between “imagistic” types, which contribute to meaning by depicting images (of objects or actions), and non-imagistic types, which do not. Metaphorics and iconics are imagistic, while deictics and beats are not (McNeill, 1992: 78). Imagistic gestures are also referred to as “representational gestures” (Cienki, 2013: 350; Mittelberg & Evola, 2014).

Looking at the gesture/language semantic interaction, Kendon distinguishes gestures that are part of the referential content of the utterance from gestures that have pragmatic functions. The latter are divided into gestures with modal, performative, or parsing functions (Kendon, 2004: 158-159).

Gestures may also have discursive functions, connecting different parts of the discourse. McNeill referred to them as “cohesives” (McNeill, 1992), a notion later supplanted by that of “catchment”: a thread of visuospatial imagery that runs through the discourse and reveals the larger discourse units encompassing its otherwise separate parts (McNeill, 2005; McNeill et al., 2001).

Integrating Kendon's and McNeill's models, Müller (2010; 2017; 2018) recognizes three types of gestures: singular, recurrent, and emblematic, which differ in their degree of conventionalization and their communicative functions. On a continuum of increasing conventionalization (Müller, 2017: 276), singular gestures are created on the spot; recurrent gestures merge conventional and idiosyncratic elements, occupying a place between singular gestures as spontaneous creations and emblems as standardized gestural expressions; emblematic gestures, finally, are fully conventionalized. Gestural forms may stabilize through repeated usage and sometimes undergo lexicalization and grammaticalization processes, transforming them into signs within a signed language (Müller, 2018: 15).

The three kinds of gestures operate as prototype categories; that is, they are not separated by sharp boundaries, and their relations are dynamic (Müller, 2018: 2). Regarding function, singular gestures mostly have ‘lexical’ functions, while recurrent and emblematic gestures work pragmatically.

Gestures may systematically vary regarding how and which formational features participate in meaning construction. As earlier work by Kendon (2004) and Müller (2004) has shown, some kinds of gestures tend to cluster around one or several of the four kinetic features used in the description of sign languages (handshape, palm orientation, movement, and location in space), which goes along with a common “semantic theme”. These form-meaning clusters are termed “gesture families” (Fricke et al., 2014).

For instance, the recurrent Flat Open Hand gesture, when performed “palm-up”, refers to offering an object to the interlocutor and, if close to the speaker, is often linked to the introduction of a new topic (Müller, 2004); when directed “palm-down” toward the addressee, it indicates the intention to stop the addressee's line of activity (Kendon, 2004). Moreover, the Palm-Up Open Hand repeated with a downward motion embodies the listing of arguments, while with a lateral motion it embodies a wide range of entities (Müller, 2004: 252).

Iconicity conveyed by recurrent imagistic gestures, as well as by singular imagistic gestures, may be grounded in metaphors and their “underlying action scheme” (Cienki & Müller, 2008; Mittelberg, 2014). For instance, the Holding Away gesture, with the hand held vertically and moving toward the addressee, may evoke a barrier placed before the interlocutor, intended to stop their communication. Communication, in this case, is seen as an object, according to the conceptual metaphor COMMUNICATION IS AN OBJECT TRANSFER (Lakoff & Johnson, 1980).

Recurrent gestures have been gathered into repertoires, for instance: Holding Away, Brushing Away, Sweeping Away, Throwing Away, the Cyclic gesture, Back and Forth, Weighing Up, etc. (Bressem & Müller, 2014: 1580-1584).

Bavelas, in a dialogic perspective, proposes dividing gestures into those referring to semantic content (topic-related) and those referring to recipients (interactive), the latter helping to “maintain the conversation as a social system” (Bavelas et al., 1992: 469). Interactive gestures are subdivided into “delivery gestures”, which deliver information from the speaker to the recipient; “seeking gestures”, which aim to elicit a specific response from the recipient; “citing gestures”, which refer to a previous contribution of the recipient; and “turn coordination gestures”, which concern the management of dialogic turns (Bavelas, 1994: 213; Bavelas et al., 1995: 395-397).

Gullberg (1998; 2003), considering the role of gestures in second language acquisition, highlights their role as substitutes for verbalization. She proposes a scale going from the fully representational to the non-representational. Thus, in order, we have fully iconic gestures, concrete deictic gestures, metaphoric gestures, abstract deictic gestures, and beats.

Iriskhanova & Cienki (2018) propose to look at gestures as proper signs to be analyzed in terms of a multi-vector model. The multi-vector analysis considers the semiotic continuum of gestures through a set of parameters: conventionality, semanticity, arbitrariness, pragmatic transparency, autonomy, social and cultural import (symbolism), awareness, recurrence, iconicity, metaphoricity, indexicality, and salience. For instance, emblems display all these features with the exception of iconicity. A representational gesture is neither conventional nor arbitrary, cannot be autonomous, and is neither metaphoric nor recurrent, but it is iconic, pragmatically transparent, referential, and has semanticity. A pragmatic gesture is recurrent and almost conventional, metaphoric, and pragmatically transparent, but lacks autonomy, arbitrariness, and also iconicity. The multi-vector analysis does not say whether a gesture is sign-like; rather, it specifies the ways in which a gesture is a sign.
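
To make the multi-vector idea concrete, the sketch below encodes the three profiles just described as boolean vectors over the twelve parameters. The encoding is ours, not the authors' formalism, and “referential” is rendered as indexicality, which is our reading:

```python
# A minimal sketch (not Iriskhanova & Cienki's own formalism): gesture types
# encoded as boolean vectors over the twelve semiotic parameters listed above.
PARAMETERS = [
    "conventionality", "semanticity", "arbitrariness", "pragmatic_transparency",
    "autonomy", "symbolism", "awareness", "recurrence",
    "iconicity", "metaphoricity", "indexicality", "salience",
]

PROFILES = {
    # Emblems show all features except iconicity.
    "emblem": {p: p != "iconicity" for p in PARAMETERS},
    # Representational gestures: iconic, pragmatically transparent,
    # referential (rendered here as indexicality), with semanticity.
    "representational": {p: p in {"iconicity", "pragmatic_transparency",
                                  "indexicality", "semanticity"}
                         for p in PARAMETERS},
    # Pragmatic gestures: recurrent, (almost) conventional, metaphoric,
    # pragmatically transparent.
    "pragmatic": {p: p in {"recurrence", "conventionality", "metaphoricity",
                           "pragmatic_transparency"} for p in PARAMETERS},
}

def agreement(a: str, b: str) -> list[str]:
    """Parameters on which two gesture types take the same value."""
    return [p for p in PARAMETERS if PROFILES[a][p] == PROFILES[b][p]]
```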

Gesture structure

Gestures show a linear structure that can be segmented into units, which are hierarchically aligned with the units of the speech flow in the process of utterance formation (Kendon, 1972; 1980; McNeill, 1992; 2005). The minimal model presently in use distinguishes three hierarchical levels: “gesture unit,” “gesture phrase,” and “gesture phase” (McNeill, Pedelty & Levy, 1990; McNeill, 1992: 82; Kendon, 2004).

The gesture unit (GU) is the “entire excursion […] from the moment the articulators begin to depart from a position of relaxation until the moment when they finally return to one” (Kendon, 2004: 111), or «the period of time between successive rests of the limbs» (McNeill, 1992: 83). GUs may contain one gesture phrase or a sequence of them. The gesture phrase (GPHR) is «what we intuitively call a ‘gesture’» (McNeill, 2005: 31), corresponding to its semiotically active nucleus. It can be segmented into a sequence of discrete phases. The gesture phase (GPHA) is the smallest unit in the hierarchical organization of gesture, namely the «individual movement phases of gestures, considered to be potentially separable units of analysis» (Bressem et al., 2013: 1102).
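
As a working illustration of the three-level hierarchy, the sketch below represents a GU as nested records; the millisecond timestamps and field names are our assumptions:

```python
# A minimal sketch of the GU > GPHR > GPHA hierarchy; timestamps in ms.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class GesturePhase:            # GPHA: smallest unit of analysis
    kind: str                  # e.g. "preparation", "stroke", "hold", "retraction"
    start_ms: int
    end_ms: int

@dataclass
class GesturePhrase:           # GPHR: what we intuitively call a 'gesture'
    phases: list[GesturePhase] = field(default_factory=list)

    @property
    def stroke(self) -> GesturePhase | None:
        return next((p for p in self.phases if p.kind == "stroke"), None)

@dataclass
class GestureUnit:             # GU: from rest position back to rest position
    phrases: list[GesturePhrase] = field(default_factory=list)
```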

The gesture phases are preparation, various holds, stroke, retraction, and rest position (Kendon, 1980: 212; McNeill, 1992; Kita et al., 1998; Ladewig & Bressem, 2013). The stroke is the meaningful part of a gesture and the only obligatory phase in a gestural occurrence. Before and after a stroke, a gestural movement can display a preparation and a retraction. The rest position is reached «when the limb is finally at rest» (Kendon, 1980: 212).

Considering the rest position, a distinction can be made between retraction and partial retraction: in the former, the hand moves back to the rest position; in the latter, it moves toward a potential rest position but shifts into the preparation for another stroke (Kita et al., 1998: 30).

Finally, holds can be divided into independent and dependent holds (Kita et al., 1998: 27-28). The independent hold has the same semantic and functional value as a stroke, but is performed with the hand held still. In dependent holds (pre-stroke and post-stroke), the position is also held, but the function is different, because they are parasitic on their adjacent stroke. Kita and colleagues' syntagmatic model thus extends the concept of the gesture phrase: rather than simply containing at least a stroke, it contains a compulsory root, the expressive phrase, made up of at least either a stroke or an independent hold; a stroke may optionally be accompanied by a pre-stroke hold, a post-stroke hold, or both (Kita et al., 1998: 27-28). The whole expressive phrase must be considered «the semiotically active phase» (Kita et al., 1998: 28) of the gesture.
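
The syntagmatic constraints just described can be summarized as a small grammar over phase labels. The sketch below is our reading of the model as a regular expression, not Kita and colleagues' own formalization:

```python
import re

# Phase labels: p = preparation, b = pre-stroke hold, s = stroke,
# a = post-stroke hold, i = independent hold, r = retraction.
# The expressive phrase (compulsory root) is either an independent hold or a
# stroke optionally flanked by dependent holds; preparation and retraction
# are optional.
GESTURE_PHRASE = re.compile(r"^p?(i|b?sa?)r?$")

CODE = {"preparation": "p", "pre-stroke hold": "b", "stroke": "s",
        "post-stroke hold": "a", "independent hold": "i", "retraction": "r"}

def is_well_formed(phases: list[str]) -> bool:
    return bool(GESTURE_PHRASE.match("".join(CODE[p] for p in phases)))

assert is_well_formed(["preparation", "stroke", "retraction"])
assert is_well_formed(["independent hold"])
assert not is_well_formed(["preparation", "retraction"])  # no expressive root
```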

Annotation methods

Research on gesture is typically corpus-based and requires the annotation of audio-visual data sets. The task calls for criteria for the segmentation and identification of gestures grounded in the dynamics of body movement.

One of the main models (Kita et al., 1998) is perceptually based. The onset and offset of a gesture unit are identified by the departure of the hand from a rest position and by its return contact with one. A gesture phrase is perceptually identifiable by the beginning-middle-end dynamic that describes its excursion around a peak of effort (the stroke).

The segmentation of gesture phases into discrete sequences is subject to two conditions: an abrupt change of direction in the hand movement, and a discontinuity in the velocity profile of the hand before and after the direction change (Kita et al., 1998: 29-30). Preparation is the phase in which the articulator moves away from a relaxation or rest position (depending on whether the hand is on a surface or in mid-air) «to a position at which the stroke begins» (Kendon, 1980: 212). It may be composed of an optional «liberating movement, which makes the hand free from a constrained position» (Kita et al., 1998: 28), a “location preparation”, and a “hand internal preparation”, the latter two overlapping in time. Retraction is the phase «in which the limb is either moved back to its rest position or in which it is readied for another stroke» (Kendon, 1980: 212). The stroke is the phase in which more force is exerted than in the neighboring phases. A hold occurs when the hand is held still, though “rarely perfectly still” (Kita et al., 1998: 30).
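
As an illustration of these two conditions, the sketch below flags candidate phase boundaries in a 2D hand trajectory; the thresholds are illustrative, and the operationalization is ours rather than Kita and colleagues' published procedure:

```python
import numpy as np

def phase_boundaries(xy: np.ndarray, angle_thresh: float = 1.5,
                     speed_ratio: float = 2.0) -> list[int]:
    """xy: (n_frames, 2) hand positions; returns candidate boundary frames
    where direction changes abruptly AND the speed profile is discontinuous."""
    v = np.diff(xy, axis=0)                       # frame-to-frame displacement
    speed = np.linalg.norm(v, axis=1) + 1e-9      # avoid division by zero
    boundaries = []
    for t in range(1, len(v)):
        cos = np.dot(v[t - 1], v[t]) / (speed[t - 1] * speed[t])
        angle = np.arccos(np.clip(cos, -1.0, 1.0))          # direction change
        jump = max(speed[t] / speed[t - 1], speed[t - 1] / speed[t])
        if angle > angle_thresh and jump > speed_ratio:
            boundaries.append(t)                  # boundary between t-1 and t
    return boundaries
```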

Ladewig and Bressem (2013) propose a gesture phase annotation method grounded in instrumental procedures such as the frame-by-frame marking procedure of Seyfeddinipur (2006: 106), which marks transitions from one gesture phase to another based on the sharpness of the video image. In contrast to previous methods, mostly functional and related to adjacent phases and speech, this one grounds gesture annotation solely in the physical characteristics observable in execution (Ladewig & Bressem, 2013: 1063; Bressem & Ladewig, 2011).

Indeed, from a linguistic-semiotic viewpoint, this method stresses the separation of gestural forms and functions in the analytic process (Ladewig & Bressem, 2013: 1063-1064). The analysis is inspired by phonology and takes articulatory features into account, dividing them into distinctive and additional features: movement and tension for the former; possible types of movement and flow of movement for the latter (relevant only to [+movement] phases). Each gesture phase type is thus characterized by specific paired marks ascribed to each feature, for example, for “movement”: [+movement] [-movement]; [+restricted] [-restricted]; [+variable] [-variable]; and for “tension”: [+tension] [-tension]; [+constant] [-constant]; [+increase] [-increase]. The preparation phase is marked by the features [+movement] [+tension] [-constant] [-increase].
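
A possible encoding of such feature bundles is sketched below; only the preparation bundle comes from the text, and the representation itself is ours:

```python
Features = dict[str, bool]   # feature name -> +/- value

# [+movement] [+tension] [-constant] [-increase] (Ladewig & Bressem, 2013)
PREPARATION: Features = {
    "movement": True,
    "tension": True,
    "constant": False,
    "increase": False,
}

def matches(observed: Features, bundle: Features) -> bool:
    """Does an observed phase carry every mark of a feature bundle?"""
    return all(observed.get(f) == v for f, v in bundle.items())
```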

Still regarding the description of gestures, but in a functional perspective, McNeill (1992) proposed the annotation of hands, motion, and meaning. Crucially, for motion, gestures occur in a space defined in relation to the body, which can be divided into sectors using a system of concentric squares: Center-Center; Center; Periphery (upper, lower, right, left); and Extreme Periphery (upper, lower, right, left) (McNeill, 1992: 87-88).
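
The concentric-square schema lends itself to a simple classifier. In the sketch below, coordinates are body-centered and the square rings have equal width; these metric proportions are our assumption, since McNeill (1992) defines the sectors visually, and the peripheral sector labels are simplified:

```python
def gesture_space_sector(x: float, y: float) -> str:
    """Classify a body-centered point into McNeill's concentric squares.
    Ring widths (1.0 per ring) are an illustrative assumption."""
    d = max(abs(x), abs(y))           # Chebyshev distance: square-shaped rings
    if d <= 1.0:
        return "Center-Center"
    if d <= 2.0:
        return "Center"
    ring = "Periphery" if d <= 3.0 else "Extreme Periphery"
    # Simplified sector labels (the dominant axis decides).
    if abs(y) >= abs(x):
        return f"{ring} (upper)" if y > 0 else f"{ring} (lower)"
    return f"{ring} (right)" if x > 0 else f"{ring} (left)"
```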

McNeill observes that the four types of gesticulation (the singular, idiosyncratic kind of movement) are distributed differently in this space: Iconics tend to occur in the Center-Center space; Metaphorics in the lower Center space; Deictics extend to the Periphery; Beats occur in several places (McNeill, 1992: 88).

As regards shape, notably, the reference model is the set of ASL handshapes described by Friedman (1977), although gestures are deficient from an ASL point of view (McNeill, 1992: 86).

Finally, regarding the interpretation of gesture meaning, McNeill considers its relations with speech and context globally (McNeill, 1992; 2005).

Bressem (2013) instead proposes a description of the form of the gesture independent of speech, excluding paraphrases of meaning. The system is based on the four parameters currently used to describe sign languages, already mentioned (handshape, orientation, movement, and location). The crucial parameter of handshape works on the basis of four basic categories: (i) fist; (ii) flat hand; (iii) single fingers; (iv) combinations of fingers (Bressem, 2013: 1085). The location parameter relies on McNeill's schema mentioned above.

The LASG, Linguistic Annotation System for Gestures (Bressem, Ladewig & Müller, 2013), provides guidelines for the annotation of gestures (gesture units and phases, form, and motivation of form) in line with the form-based principles of Ladewig & Bressem (2013) and Bressem (2013), though the system is extremely flexible and is also used in other frameworks. It divides the annotation process into three levels: annotation of gestures (determining units; annotation of form; motivation of form); annotation of speech (annotation of speech turns; annotation of speech units); and annotation of gestures in relation to speech (prosody; syntax; semantics; pragmatics), coding each level on independent layers of powerful tools for multimodal annotation (ELAN or ANVIL).
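
As an illustration of how such independent annotation layers can be set up programmatically for ELAN, the sketch below uses the third-party pympi-ling package; the tier names are illustrative, not LASG's official labels:

```python
# A minimal sketch using pympi-ling (https://pypi.org/project/pympi-ling/).
from pympi import Eaf

eaf = Eaf()  # a new, empty ELAN document
for tier in ("GestureUnit", "GesturePhase", "GestureForm",
             "SpeechUnit", "Gesture-Speech-Relation"):
    eaf.add_tier(tier)

# Annotations are (tier, start_ms, end_ms, value).
eaf.add_annotation("GesturePhase", 1200, 1450, "preparation")
eaf.add_annotation("GesturePhase", 1450, 1900, "stroke")
eaf.to_file("example.eaf")
```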

NEUROGES (Lausberg, 2013; Lausberg & Sloetjes, 2009; 2015) is a system allowing gesture coding at multiple levels, and it is composed of three modules: kinetic coding (hand movements, trajectory, dynamics, location, and body contact); bimanual relation coding (in touch vs. separate, symmetrical vs. complementary, independent vs. dominant); and functional gesture coding (meaning based on specific combinations of kinetic gestural features).

Various initiatives are now developing gesture annotation infrastructure. Notably, the recent M3D labeling system (Rohrer et al., 2020) offers a dimensionalized approach to the annotation of communicative body movements (i.e., manual gestures, head movements, and other articulators) in terms of their form, their prosodic characteristics, and their semantic and pragmatic contributions to speech, together with a training program guiding annotators through the multiple steps of the annotation process.

The SCG gesture coding system, developed by the Speech Communication Group at MIT, provides methods for labeling gestures specifically for their movements, so that the relationship between the kinematics of gesture and prosody, intonational breaks, morphosyntax, discourse coherence, and pragmatics can be explored.

Gesture / Speech Synchrony

According to Kendon (1980), when speech and gesture co-occur, they are expected to present the same semantic information or perform the same pragmatic function, and the stroke of the gesture precedes, or ends at, but does not follow, the phonological peak syllable of the speech. In parallel, the gesture onset and the stroke may precede their lexical affiliate (Schegloff, 1984: 291).

According to more recent research on Dutch (ter Bekke et al., 2020), the majority of strokes (62%) start before their lexical affiliate, by around 215 ms on average. However, they can still be synchronized with their co-expressive speech (McNeill, 2005: 37). Affiliation, indeed, does not automatically correspond to a linguistic affiliate: if gestures are ‘windows onto thinking’ (as already seen in McNeill & Duncan, 2000), they may refer to a ‘conceptual affiliate’ rather than a ‘lexical affiliate’ (Kirchhof, 2011).

McNeill (1992: 26-29) states three synchrony rules: semantic, pragmatic, and phonological. The semantic and pragmatic synchrony rules state that if speech and gesture co-occur, they must present the same semantic information or perform the same pragmatic function. The phonological rule foresees that the gesture unit is aligned with the phrase, and that the stroke is aligned with the stressed syllable, slightly preceding or coinciding with (but not following) its onset.
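
Read as a timing predicate over annotation timestamps, the phonological rule can be checked as in the sketch below; the tolerance window is our assumption, since McNeill does not quantify “slightly preceding”:

```python
def phonological_synchrony(stroke_onset_ms: int, syllable_onset_ms: int,
                           max_lead_ms: int = 300) -> bool:
    """True if the stroke slightly precedes or coincides with, but does not
    follow, the stressed syllable onset (tolerance window is illustrative)."""
    lead = syllable_onset_ms - stroke_onset_ms
    return 0 <= lead <= max_lead_ms

assert phonological_synchrony(1450, 1500)      # stroke slightly precedes: ok
assert not phonological_synchrony(1550, 1500)  # stroke follows: violation
```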

A substantial body of research specifically on gesture/prosody synchronization has been conducted, particularly in the autosegmental framework (Ladd, 2008), showing that strokes synchronize with prosodic prominence (Swerts & Krahmer, 2010; Loehr, 2004; 2012; 2014; Esteve-Gibert & Prieto, 2013; Shattuck-Hufnagel & Ren, 2018). More specifically, a strong synchronization has been identified between the instant constituting the kinetic goal of a stroke (the apex) and pitch accents (Loehr, 2012: 77); meanwhile, gesture phrases align with intermediate phrases (Pierrehumbert & Hirschberg, 1990) and respect the boundaries of the prosodic phrase (Loehr, 2004: 164).

The importance of prosodic edges as anchoring sites for different types of gestures (i.e., referential vs. non-referential) has recently been considered in a set of studies (Shattuck-Hufnagel et al., 2010; Esteve-Gibert & Prieto, 2013; Rohrer et al., 2019; Rohrer, 2022; Rohrer et al., 2023). Rohrer et al. (2023) report that, in English academic discourse, the majority of strokes overlap with a pitch-accented syllable (85.99%), while apex alignment occurs at a relatively low rate (50.4%). At the phrasal level, strokes align with phrase-initial prenuclear pitch accents rather than with nuclear accents.
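
Overlap rates of this kind are computed from interval annotations. A minimal sketch, with illustrative data:

```python
def overlap_rate(strokes, accented_syllables) -> float:
    """Share of stroke intervals overlapping some pitch-accented syllable;
    intervals are (start_ms, end_ms) tuples."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    hits = sum(any(overlaps(s, syl) for syl in accented_syllables)
               for s in strokes)
    return hits / len(strokes) if strokes else 0.0

strokes = [(1450, 1900), (2600, 2750)]
accents = [(1500, 1700), (3000, 3150)]
print(overlap_rate(strokes, accents))  # 0.5
```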

Within the configurational framework of Language into Act Theory (Cresti, 2000; Cresti & Moneglia, 2018), a strong synchronization of gestures with utterance boundaries has been observed at the higher level of the pragmatics/prosody relation. Gesture units tend to start and end at a prosodic boundary, while the terminal prosodic boundary of the utterance is never crossed by gesture phrases (Cantalini & Moneglia, 2020).

Gesture/prosody synchronization has also been considered in connection with the pragmatic values expressed by prosody. Prosody and gesture work in an integrated manner in marking information structure (Ebert et al., 2011; Im & Baumann, 2020; Rohrer, 2022). Loehr (2012) noticed the frequent co-occurrence of the final edge tones signaling completeness and incompleteness with, respectively, the hands reaching a rest position and a raised hand. Prosodic cues signaling new information are frequently accompanied by the hand placing an abstract object in space. Focus marking may be synchronous with beats or pointing, and gestures frequently mark contrastive focus. More generally, focus marking turns out to be multimodal: the presence of gesture augments the prosodic cues for focus marking (Esteve-Gibert & Prieto, 2013; Gregori et al., 2023). Considering the marking of informative focus, contrastive focus, and corrective focus (Krifka, 2008), prosody and gestures work together (Ambrazaitis & House, 2017).

Discourse connections can also be supported through gesturing. For instance, contrast relations are expressed by moving the hands on opposing sides of the body (Hinnell, 2019), listing by counting on the fingers (Rodrigues, 2015), and exception relations by raising a finger (Bressem & Müller, 2014; Inbar, 2022).

The change of discursive level during speech segmentation may also be marked by gestures in conjunction with prosodic variation. In particular, changes in gestural pattern (movement, handshape, position, and orientation) have been noted in conjunction with the onset of parentheses and reported speech (Barros & Mello, 2023; Bicalho de Albuquerque, 2023).

Gesture in the Ontogenesis and Phylogenesis of Language

Language acquisition can be seen as a multimodal process (Morgenstern, 2024). Darwin (1877) first, focusing on the expression of emotions, noted his son's transition from uncontrolled body movements to intentional gestures: habitual movements become automatic and associated with communicative functions, such as bodily manifestations of negation.

In a set of studies in the 1970s, gestures were considered a system of communication that precedes language. Bruner (1975; 1983), Bates (1976), and Bates et al. (1979) showed that in children from 9 to 11 months the coordination of gaze, gesture, facial expression, posture, and vocalization gives rise to early productions of intentional communicative acts. Conventionalized bodily acts of pointing, reaching, and holding appear in adult/child interaction before the onset of verbal symbols, contributing to the child's construction of meaning.

The role of pointing and eye contact is crucial in constructing joint attention (Tomasello, 2003), and pointing is considered a strong predictor of lexical acquisition capacity.

At the onset of verbal symbols (12-13 months), the number of representational gestures (e.g., brushing the hair for “brush”) and routines produced by children, on the one hand, and the number of words mastered, on the other, are roughly equivalent; only after that phase, in hearing children, does the acquisition of lexical vocabulary become dominant (Capirci et al., 2005). The transition from one-word to two-word utterances is also supported, at its onset, by the combination of one word with a supplementary gesture (Capirci et al., 1996; Volterra, 1981).

The semiotic system of gesture is considered to have played a pivotal role in the evolution of language. Three basic theories link the gesture system to the evolution of the dominant language capacity: primacy (monosemiotic), polysemiotic-equipollent, and polysemiotic-pantomimic (Zywiczynski & Zlatev, 2024).

In the Primacy Hypothesis (Hewes, 1977), a protolanguage constitutes a transitional system between the signal-based communication of apes and human language. It is made up of synthetic gestures working as quasi-lexical units standing for objects and actions (the Manual Protolanguage of Corballis, 2002; 2003; 2012).

Arbib's Mirror Neuron Hypothesis (Arbib, 2012) foresees that the recognition and imitation of complex actions, coupled with communicative intentions, form a communicative system of pantomimes, which are then conventionalized and segmented, giving rise to gestural proto-signs. A mechanism of collateralization, whereby the activity of the area responsible for manual production spills over into the neighboring areas responsible for vocalization, then opens the way to the vocal modality (exaptation).

Tomasello (2008; 2009) stresses the need for pro-sociality and shared intentionality in this process; however, he also notices that pantomimes are proposition-sized rather than word-sized and cannot be segmented. The gradual transition from gesture to vocalization would have arisen from the need to move from iconicity toward arbitrariness (Levinson & Holler, 2014).

Polysemiotic-equipollent theories postulate an early integration of gesture and vocalization, assuming that gesture and speech form two equipollent sides of a single communicative system (McNeill, 2005; 2012) or process (Kendon, 2014): a polysemiotic system. Gesture and language can only have appeared simultaneously, which explains gesture-speech unity and their synchronicity. This conception fits McNeill's Growth Point idea (1992; 2005; 2012), already mentioned, which postulates a strict functional division between the two semiotic systems: speech for propositional content and gesture for imagistic content. To justify their integration, McNeill proposes a mechanism called “Mead's Loop”, a new adaptation in the evolution of humans (not unlike Chomsky's “lucky mutation”, as in Hauser, Chomsky & Fitch, 2002, but at the level of mirror neurons). This mechanism rests on a new kind of mirror neuron, a “twisted” or “inverted” one. Gesture and speech were naturally selected together, beginning to evolve in the brain through a thought-language-hand link localized, presumably but not exclusively, in Broca's area.

In fact, the “straight” (untwisted) mirror-neuron action, responding to the actions of others, is social and leads to mimicry: the action of another is repeated and becomes one's own. However, a mimicked action will lead only to pantomime and will not create gesture-speech unity. In Mead's Loop ‘twist’, instead, one's own gesture is responded to as if it were performed by another, thereby inheriting a social property. The twisted mirror neurons made gesture and its significance available in Broca's area, where complex actions are orchestrated, so that vocal movement could be integrated, opening the door to a dynamic dimension (McNeill, 2012: 62-63).

In the polysemiotic-pantomimic theory (Zlatev, 2016; Żywiczyński, 2018), human communication is framed within a general cognitive capacity shared with other semiotic systems such as music, dance, ritual, and depiction (drawing, painting). These semiotic systems consist of (representational) signs specific to humans and have all developed from bodily mimesis (Zlatev et al., 2020). Mimesis (Donald, 1998) was first used for tool production and then evolved for communication; polysemiotic pantomime is a necessary step in this process. Language realized as speech can be dominant in expressing propositions and narratives, but human communication remains polysemiotic.


References

Ambrazaitis, G. & House, D. (2017). Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. Speech Communication, 95, 100-113.

Arbib, M. A. (2012). How the Brain Got Language: The Mirror System Hypothesis. Oxford: Oxford University Press.

Barros, C., & Mello, H. (2023). The C-ORAL-BRASIL proposal for the treatment of multimodal corpora data: the BGEST corpus pilot project. In A. Grajales Ramírez, J. Molina Mejía, & P. Valdivia Martin (eds.) Digital Humanities, Corpus and Language Technology: A look from diverse case studies. University of Groningen Press.

Bates, E. (1976). Language and Context: The Acquisition of Pragmatics. New York: Academic Press.

Bates, E., Benigni, L., Bretherton, I., Camaioni, L. & Volterra, V. (1979). Cognition and communication from nine to thirteen months: Correlational findings. In E. Bates (ed.), The Emergence of Symbols: Cognition and Communication in Infancy. New York: Academic Press, pp. 69-140.

Bavelas, J. B. (1994). Gestures as part of speech: Methodological implications. Research on Language and Social Interaction, 27 (3), 201-221.

Bavelas, J. B., Chovil, N. Coates, L. & Roe, L. (1995). Gestures specialized for dialogue. Personality and Social Psychology Bulletin, 21 (4), 394-405.

Bavelas, J. B., Chovin, N., Lawrie, D. & Wade, A. (1992). Interactive gestures. Discourse Processes, 15 (4), 469-489.

ter Bekke, M., Drijvers, L. & Holler, J. (2020). The predictive potential of hand gestures during conversation: An investigation of the timing of gestures in relation to speech. In Proceedings of Gesture and Speech in Interaction (GESPIN 2020), Stockholm, Sweden, 7-9 September 2020, pp. 1-6.

Bicalho de Albuquerque, M. (2023). Gestures and multilevel discourse in spontaneous speech corpora: the case of reported speech. DILEF, 3, 243-260.

Bressem, J. (2013). A linguistic perspective on the notation of form features in gestures. In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & S. Teßendorf (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 1. Berlin: De Gruyter Mouton, pp. 1079-1098.

Bressem, J., Ladewig, S. H. & Müller, C. (2013). Linguistic Annotation System for Gestures (LASG). In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & S. Teßendorf (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 1. Berlin: De Gruyter Mouton, pp. 1098-1125.

Bressem, J. & Müller, C. (2014). A repertoire of German recurrent gestures with pragmatic function. In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & S. Teßendorf (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 2. Berlin: De Gruyter Mouton, pp. 1575-1592.

Bressem, J. & Ladewig, S. H. (2011). Rethinking gesture phases – articulatory features of gestural movement? Semiotica, 184(1/4), 53-91.

Bruner, J. S. (1975). The Ontogenesis of Speech Acts. Journal of Child Language, 2 (1) 1-19.

Bruner, J. S. (1983). Child's Talk: Learning to Use Language. New York: W. W. Norton.

Cantalini, G. & Moneglia, M. (2020). The annotation of Gesture and Gesture / Prosody synchronization in Multimodal Speech Corpora. Journal of Speech Science, 9, 1-24.

Capirci, O., Contaldo, A., Caselli, M. C. & Volterra, V. (2005). From Action to Language through gesture: A longitudinal perspective. Gesture, 5 (1-2), 155-177.

Capirci, O., Iverson, J., Pizzuto, E. & Volterra, V. (1996). Gestures and words during the transition to two-word speech. Journal of Child Language, 23, 645-673.

Chafe, W. (1994). Discourse, Consciousness, and Time: The Flow and Displacement of Conscious Experience in Speaking and Writing. Chicago, The University of Chicago Press.

Cienki, A. (2013). Cognitive Linguistics: Spoken language and gesture as expressions of conceptualization. In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & S. Teßendorf (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 1. Berlin: De Gruyter Mouton, pp. 182-202.

Cienki, A. & Müller, C. (eds.) (2008). Metaphor and Gesture. Amsterdam: Benjamins.

Cienki, A. (2023). Speakers’ Gestures and Semantic Analysis. Cognitive Semantics, 9, 167–191.

Cresti, E. (2000). Corpus di italiano parlato. Firenze, Accademia della Crusca.

Cresti, E. & Moneglia, M. (2018). The illocutionary basis of Information Structure: Language into Act Theory (L-AcT). In E. Adamou, K. Haude & M. Vanhove (eds.), Information Structure in Lesser-described Languages: Studies in Prosody and Syntax. Amsterdam: Benjamins, pp. 359-401.

Corballis, M. C. (2002). From Hand to Mouth: The Origins of Language. Princeton: Princeton University Press.

Corballis, M. C. (2003). From mouth to hand: Gesture, speech, and the evolution of right-handedness. Behavioral and Brain Sciences, 26(2), 199-208.

Corballis, M. C. (2012). How language evolved from manual gestures. Gesture, 12(2), 200-226.

Darwin, C. (1877). A biographical sketch of an infant. Mind, 2 (7), 285-294.

Donald, M. (1998). Mimesis and the executive suite: Missing links in language evolution. In J. R. Hurford, M. Studdert-Kennedy, & C. Knight (Eds.), Approaches to the evolution of Language: Social and Cognitive bases. Cambridge: Cambridge University Press, pp. 44-67.

Ebert, C., Evert, S. & Wilmes, K. (2011). Focus marking via gestures. In I. Reich, E. Horch & D. Pauly (eds.), Proceedings of Sinn und Bedeutung 15, pp. 193-208.

Ekman, P. & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1, 49-98.

Esteve-Gibert, N., & Prieto, P. (2013). Prosodic structure shapes the temporal realization of intonation and manual gesture movements. J. Speech Lang. Hear. Res. 56 (3), 850–864.

Fricke, E., Bressem, J. & Müller, C. (2014). Gesture families and gestural fields. In C.Müller, A. Cienki, E.Fricke, S. H. Ladewig, D. McNeill & J. Bressem (eds.) Body – Language – Communication. An International Handbook on Multimodality in Human Interaction Vol, 2. Berlin/Boston: De Gruyter Mouton, pp. 1630-1640.

Friedman, L. (1977). Formational properties of American Sign Language. In L. Friedman (ed.), On the Other Hand: New Perspectives on American Sign Language, New York: Academic Press, pp. 13–56.

Gregori, A., Sánchez-Ramón, P., Prieto, P. & Kügler, F. (2023). Prosodic and gestural marking of focus types in Catalan and German. In Proceedings of the 12th International Conference on Speech Prosody, July 2-5, 2024, University of Leiden, The Netherlands.

Gullberg, M. (1998). Gesture as a Communication Strategy in Second Language Discourse: A Study of Learners of French and Swedish. Lund: Lund University Press.

Gullberg, M. (2003). Handling discourse: Gestures, reference tracking, and communication strategies in early L2. Language Learning, 56(1), 155-196.

Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579. 

Hewes, G. W. (1977). A model for Language evolution. Sign Language Studies, 15, 97-168.

Hinnell, J. (2019). The verbal-kinesic enactment of contrast in North American English. The American Journal of Semiotics, 35(1-2), 55-92.

Im, S. & Baumann, S. (2020). Probabilistic relation between co-speech gestures, pitch accents and information status. In Proceedings of the LSA, 5, pp. 685–697.

Inbar, A. (2022). The raised index finger gesture in Hebrew multimodal interaction. Gesture, 21(2-3), 264-295.

Iriskhanova, O. K. & Cienki, A. (2018). The Semiotics of Gestures in Cognitive Linguistics: Contribution and Challenges. Voprosy Kognitivnoy Lingvistiki, 4, 25-36.

Kendon, A. (1972). Some relationships between body motion and speech: An analysis of an example. In A. W. Siegman & B. Pope (eds.), Studies in Dyadic Communication. New York: Elsevier, pp. 177-210.

Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In: M.R. Key, (Ed.) The relationship of verbal and nonverbal communication. Berlin, De Gruyter Mouton, pp. 207–228.

Kendon, A. (2004). Gesture: Visible Action as Utterance. Cambridge, Cambridge University Press.

Kendon, A. (2014). The ‘poly-modalic’ nature of utterances and its relevance for inquiring into language origins. In D. Dor, C. Knight & J. Lewis (eds.), The Social Origins of Language. Oxford: Oxford University Press, pp. 67-76.

Kirchhof, C. (2011). So What's Your Affiliation With Gesture? In C. Kirchhof, Z. Malisz & P. Wagner (eds.), Proceedings of GeSpIn 2011, Vol. 2. Bielefeld, pp. 1-7.

Kita, S., van Gijn, I. & van der Hulst, H. (1998). Movement phases in signs and co-speech gestures, and their transcription by human coders. In I. Wachsmuth & M. Fröhlich (eds.), Gesture and Sign Language in Human-Computer Interaction. Berlin, Springer, pp. 23-35.

Kita, S. (2000). How representational gestures help speaking. In D. McNeill (ed.) Language and Gesture, Cambridge, Cambridge University Press, pp. 162-185.

Krifka, M. (2008). Basic notions of information structure. Acta Linguistica Hungarica, 55(3-4), 243-276.

Ladd, D. R. (2008). Intonational Phonology (2nd ed.). Cambridge, Cambridge University Press.

Lausberg, H. (2013). Understanding body movement: A guide to empirical research on nonverbal behavior. With an introduction to the NEUROGES coding system. Berlin: Peter Lang.

Lausberg, H. & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, 41(3), 841-849.

Lausberg, H. & Sloetjes, H. (2015). The revised NEUROGES-ELAN system: An objective and reliable interdisciplinary analysis tool for nonverbal behavior and gesture. Behavior Research Methods, 48(3), 973-993.

Ladewig, S. H. & Bressem, J. (2013). A linguistic perspective on the notation of gesture phases. In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & S. Teßendorf (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 1. Berlin: De Gruyter Mouton, pp. 1060-1079.

Lakoff, G. & Johnson, M. (1980). Metaphors We Live By. Chicago: University of Chicago Press.

Levinson, S. C. & Holler, J. (2014). The origin of human multi-modal communication. Philosophical Transactions of the Royal Society B, 369(1651), 1-9.

Loehr, D. (2004). Gesture and Intonation. Ph. D. dissertation. Washington, DC: Georgetown University.

Loehr, D. P. (2012). Temporal, structural, and pragmatic synchrony between intonation and gesture. Laboratory Phonology, 3(1), 71-89.

Loehr, D. (2014) Gesture and prosody. In C. Müller, A. Cienki, E. Fricke, S.H. Ladewig, D. McNeill, & S. Teßendorf (Eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction (38.2). Berlin, De Gruyter Mouton. pp. 1381-1391.

McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago, University of Chicago Press.

McNeill, D. (2005). Gesture and Thought. Chicago: University of Chicago Press.

McNeill, D. (2015). Why We Gesture. Cambridge: Cambridge University Press.

McNeill, D. (2012). How Language Began: Gesture and Speech in Human Evolution. Cambridge: Cambridge University Press.

McNeill, D., Pedelty, L. & Levy E. T. (1990). Speech and gesture. Advances in Psychology 70, 203-256.

McNeill, D. & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In D. McNeill (ed.), Language and Gesture. Cambridge: Cambridge University Press, pp. 141-161.

McNeill, D., Quek, F., McCullough, K. E., Duncan, S. D., Furuyama, N., Bryll, R. & Ansari, R. (2001). Catchments, prosody and discourse. Gesture, 1(1), 9-33.

Mittelberg, I. (2014). Gesture and Iconicity. In C. Müller, A. Cienki, E. Fricke, SH. Ladewig, D. McNeill, & J. Bressem (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol 2. Berlin, De Gruyter Mouton, pp.1732-1746.

Mittelberg, I. & Evola, V. (2014). Iconic and Representational Gestures. In C. Müller, A. Cienki, E. Fricke, S. H. Ladewig, D. McNeill & J. Bressem (eds.), Body – Language – Communication: An International Handbook on Multimodality in Human Interaction, Vol. 2. Berlin: De Gruyter Mouton, pp. 1712-1732.

Morgenstern, A. (2024). Gesture and first language development: the multimodal child. In A. Cienki (ed.), Gesture Studies. Cambridge: Cambridge University Press, pp. 368-397.

Müller, C. (2004). Forms and uses of the Palm Up Open Hand: A case of a gesture family? In C. Müller & R. Posner (eds.), The Semantics and Pragmatics of Everyday Gestures. Berlin: Weidler, pp. 234-256.

Müller, C. (2010). Wie Gesten bedeuten. Eine kognitiv-linguistische und sequenzanalytische Perspektive. Sprache und Literatur, 41, 37-68.

Müller, C. (2017). How recurrent gestures mean: Conventionalized contexts-of-use and embodied motivation. Gesture, 16(2), 277-304.

Müller, C. (2018). Gesture and Sign: Cataclysmic Break or Dynamic Relations? Front. Psychol, 9, 1-20.

Rodrigues, I. G. (2015). A tool at hand: Gestures and rhythm in listing events: case studies of European and African Portuguese speakers. Oslo Studies in Language, 7(1), 253-281.

Rohrer, P. L. (2022). A temporal and pragmatic analysis of gesture-speech association: A corpus-based approach using the novel MultiModal MultiDimensional (M3D) labeling system. Ph.D. dissertation, Universitat Pompeu Fabra, Barcelona.

Rohrer, P. L., Prieto, P. & Delais-Roussarie, E. (2019). Beat gestures and prosodic domain marking in French. In S. Calhoun, P. Escudero, M. Tabain & P. Warren (eds.) Proceedings of the 19th International Congress of Phonetic Sciences. Australasian Speech Science and Technology Association Inc. pp. 1500-1504.

Rohrer, P.L., Tütüncübasi, U. Vilà-Giménez, I., Florit-Pons, J., Esteve Gibert, N., Ren, P. Shattuck-Hufnagel, S. & Prieto, P. (2020) The MultiModal MultiDimensional (M3D) labeling system <https://osf.io/ankdx/>

Rohrer, P. L., Delais-Roussarie, E. & Prieto, P. (2023). Visualizing prosodic structure: Manual gestures as highlighters of prosodic heads and edges in English academic discourse. Lingua, 293. https://doi.org/10.1016/j.lingua.2023.103583

Pierrehumbert, J. & Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In P. R. Cohen, J. L. Morgan & M. E. Pollack (eds.), Intentions in Communication. Cambridge, MA: MIT Press.

Schegloff, E. A. (1984). On some gestures’ relation to talk. In J. M. Atkinson & J. Heritage (Eds.) Structures of Social Action: Studies in Conversation Analysis, Cambridge, Cambridge University Press, pp. 266–296.

Shattuck-Hufnagel, S. & Ren, A. (2018) The Prosodic Characteristics of Non-referential Co-speech Gestures in a Sample of Academic-Lecture-Style Speech. Front. Psychol. 9.

Shattuck-Hufnagel, S., Ren, A., & Tauscher, E. (2010). Are torso movements during speech timed with intonational phrases? In Proceedings of the International Conference on Speech Prosody. ISCA Archive, pp.1–4.

Seyfeddinipur, M. (2006). Disfluency: Interrupting speech and gesture. PhD Thesis. Radboud University Nijmegen.

Swerts, M., & Krahmer, E. (2010). Visual prosody of newsreaders: Effects of information structure, emotional content and intended audience on facial expressions. Journal of Phonetics, 38, 197–206.

Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.

Tomasello, M. (2008). Origins of Human Communication. Cambridge, MA: MIT Press.

Tomasello M. (2009). Why we cooperate. Cambridge MA, MIT Press.

Volterra, V. (1981). Gestures, signs and words at two years old: When does communication become language? Sign Language Studies, 33, 351-362.

Zlatev, J. (2016). Preconditions in human embodiment for the evolution of symbolic communication. In G. Etzelmüller & C. Tewes (eds.), Embodiment in Evolution and Culture. Tübingen: Mohr Siebeck, pp. 151-174.

Żywiczyński, P. (2018). Language Origins: From Mythology to Science. Berlin: Peter Lang.

Zlatev, J., Zywiczynski, P. & Wacewicz, S. (2020). Pantomime as the original human-specific semiotic system. Journal of Language Evolution, 5(2), 156-174.

Zywiczynski, P. & Zlatev, J. (2024). The role of gesture in debates on the origins of language. In A. Cienki (ed.), Gesture Studies. Cambridge: Cambridge University Press, pp. 335-367.

ANVIL: http://www.anvil-software.de/download/index.html

ELAN: https://archive.mpi.nl/tla/elan

NEUROGES: http://neuroges.neuroges-bast.info/

SCG Gesture Coding Manual: https://speechcommunicationgroup.mit.edu/gesture/coding-manual.html