Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver, Colorado
Chilin Shih and Greg Kochanski
Section 1: What is Prosody?

What is prosody?

One way to appreciate prosody is to listen to sentences where the prosody is not quite right. For this, we'd like you to meet two robots, each with a different prosodic deficiency: R1D1 has defective speech rhythm, and R1P1 has defective pitch control.

Duration: R1D1 has two moods.
R1D1 is playful: When he is playful, he picks a random number between 10 and 400 milliseconds and uses that for the phone duration.
R1D1 is serious: When he is serious, he assigns the same duration value to each phone.

Pitch: R1P1 doesn't know how to control pitch.
R1P1 is playful: When he is playful, he creates a random melody for his sentence.
R1P1 is serious: When he is serious, he uses the same pitch value throughout, that is, a monotone.
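The two robots' behaviors can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to generate the demonstrations. The text gives just the 10 to 400 millisecond range for playful durations. The uniform distribution, the fixed 80 ms duration, the 80 to 300 Hz pitch range, the 120 Hz monotone value, the choice of one pitch target per phone, and all function names are hypothetical.

```python
import random

def playful_durations(n_phones, rng, lo_ms=10.0, hi_ms=400.0):
    """R1D1, playful: one random duration per phone.
    The 10-400 ms range is from the text; drawing uniformly is an assumption."""
    return [rng.uniform(lo_ms, hi_ms) for _ in range(n_phones)]

def serious_durations(n_phones, fixed_ms=80.0):
    """R1D1, serious: the same duration for every phone.
    The 80 ms value is a hypothetical placeholder."""
    return [fixed_ms] * n_phones

def playful_pitch(n_phones, rng, lo_hz=80.0, hi_hz=300.0):
    """R1P1, playful: a random melody, here one random f0 target per phone.
    The pitch range and the per-phone granularity are assumptions."""
    return [rng.uniform(lo_hz, hi_hz) for _ in range(n_phones)]

def serious_pitch(n_phones, fixed_hz=120.0):
    """R1P1, serious: a monotone, i.e. the same f0 value everywhere.
    The 120 Hz value is a hypothetical placeholder."""
    return [fixed_hz] * n_phones

if __name__ == "__main__":
    rng = random.Random(0)           # seeded so a run can be repeated
    phones = ["h", "e", "l", "o"]    # toy phone sequence, for illustration only
    n = len(phones)
    print("playful durations (ms):", playful_durations(n, rng))
    print("serious durations (ms):", serious_durations(n))
    print("playful pitch (Hz):", playful_pitch(n, rng))
    print("serious pitch (Hz):", serious_pitch(n))
```

In each pair, the playful setting has too much unmotivated variation and the serious setting has none. Natural prosody lies between the two, with variation that is tied to the linguistic content.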

We use the term "prosody" broadly, meaning a time series of speech-related information that's not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence. This could include pitch, duration, amplitude, and gestures.

Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker. Even in intonation, pitch has a fairly smooth relationship to the underlying muscle tensions.

While pitch is an important component of prosody, it has been known since the 1950s (Fry, 1955; Fry, 1958; Bolinger, 1958; Lieberman, 1960; Hadding-Koch, 1961) that duration and amplitude are also important components. Recent literature (Maekawa, 1998; Kehoe et al., 1995; Sluijter and van Heuven, 1996; Pollock et al., 1990; Sluijter et al., 1997; Turk and Sawusch, 1996; Erickson, 1998 and references therein) also provides support for amplitude, spectral tilt, and jaw movement as important components of prosody.

Clearly, with our broad definition of prosody, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies, and can even reverse, the meaning of the lexical channel. In this tutorial, however, we concentrate on pitch (f0) modeling.

Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement. Intonation types and para-linguistic functions may affect the global pitch setting, in addition to producing characteristic local pitch excursions near the edges of the sentence (i.e. boundary tones). The combination of these multi-channel signals presents a challenge to prosody modeling, which we will return to in Section 3.

