Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver, Colorado
Chilin Shih and Greg Kochanski
What is prosody?
One way to appreciate prosody is to listen to sentences where the prosody is not quite right. For this, we'd like you to meet two robots, each with differently deficient prosody: R1D1 has defective speech rhythm. R1P1 has defective pitch control.
|Duration||R1D1 is playful||R1D1 has two moods: when he is playful, he picks a random number between 10 and 400 milliseconds and uses it as the phone duration.|
|Duration||R1D1 is serious||When he is serious, he assigns the same duration value to each phone.|
|Pitch||R1P1 is playful||R1P1 doesn't know how to control pitch: when he is playful, he creates a random melody for his sentence.|
|Pitch||R1P1 is serious||When he is serious, he uses the same pitch value throughout, i.e. a monotone.|
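The two robots' behavior can be sketched in a few lines of Python. This is only an illustration of the rules in the table, not the code used to produce the original audio demos. The 10 to 400 millisecond range comes from the description above; the function names, the fixed duration of 100 ms, the 80 to 300 Hz pitch range, and the 120 Hz monotone value are assumptions chosen for the example.

```python
import random


def r1d1_durations(phones, mood, rng=None, fixed_ms=100.0):
    """Return one duration (in ms) per phone, following R1D1's two moods.

    fixed_ms is a hypothetical value; the text only says the serious
    robot uses the same duration for every phone.
    """
    if rng is None:
        rng = random.Random()
    if mood == "playful":
        # Random duration between 10 and 400 ms for each phone.
        return [rng.uniform(10.0, 400.0) for _ in phones]
    if mood == "serious":
        # Identical duration for every phone.
        return [fixed_ms for _ in phones]
    raise ValueError("mood must be 'playful' or 'serious'")


def r1p1_pitch(phones, mood, rng=None, low_hz=80.0, high_hz=300.0,
               monotone_hz=120.0):
    """Return one pitch target (in Hz) per phone, following R1P1's two moods.

    low_hz, high_hz and monotone_hz are hypothetical values.
    """
    if rng is None:
        rng = random.Random()
    if mood == "playful":
        # Random "melody": an unrelated pitch target for each phone.
        return [rng.uniform(low_hz, high_hz) for _ in phones]
    if mood == "serious":
        # Monotone: the same pitch everywhere.
        return [monotone_hz for _ in phones]
    raise ValueError("mood must be 'playful' or 'serious'")


if __name__ == "__main__":
    phones = ["h", "eh", "l", "ow"]  # hypothetical phone sequence
    print(r1d1_durations(phones, "playful"))
    print(r1p1_pitch(phones, "serious"))
```

In both cases the output ignores the identity and position of the phones, which is exactly why the resulting speech sounds wrong: natural duration and pitch depend on the linguistic context.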
We use the term "prosody" broadly, meaning a time series of speech-related information that's not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence. This could include pitch, duration, amplitude, and gestures.
Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker. Even in intonation, pitch has a fairly smooth relationship to the underlying muscle tensions.
While pitch is an important component of prosody, it has been known since the 1950s (Fry, 1955; Fry, 1958; Bolinger, 1958; Lieberman, 1960; Hadding-Koch, 1961) that duration and amplitude are also important components. Recent literature (Maekawa, 1998; Kehoe et al., 1995; Sluijter and van Heuven, 1996; Pollock et al., 1990; Sluijter et al., 1997; Turk and Sawusch, 1996; Erickson, 1998 and references therein) also provides support for amplitude, spectral tilt and jaw movement as important components of prosody.
Clearly, with our broad definition of prosody, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies and can even reverse the meaning of the lexical channel. In this tutorial, however, we concentrate on pitch (f0) modeling.
Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement. Intonation types and para-linguistic functions may affect the global pitch setting, in addition to producing characteristic local pitch excursions near the edges of the sentence (i.e. boundary tones). The combination of multi-channel signals presents a challenge to prosody modeling, which we will return to in Section 3.
Languages may employ prosody in different ways to differentiate declarative sentences from questions. A general trend is that questions are associated with higher pitch somewhere in the sentence, most commonly near the end. This may be manifested as a final rising contour, or as a higher or expanded pitch range near the end of the sentence. In English, declarative intonation is marked by a falling ending while yes-no question intonation is marked by a rising one, as shown on the last digit "one" in the English examples. Russian questions, on the other hand, use strong emphasis on a key word instead of a rising tail. Chinese questions are manifested by an expanded pitch range near the end of the sentence; the speaker nevertheless preserves the lexical tone shapes (Yuan, Shih, and Kochanski, 2002).
Examples of declarative and question intonation in English, Russian, and Chinese.
Topic initialization is typically associated with high pitch (Hirschberg and Pierrehumbert, 1986; Sluijter and Terken, 1993). Pitch is typically raised in the discourse initial section and lowered in the discourse final section.
Also, new information in the discourse structure is typically accented while old information is de-accented.
Most experiments studying emotional speech study stylized emotion, as delivered by actors and actresses. In these acted-out emotions, a few categories of emotions can be reliably identified by listeners, and one can find consistent acoustic correlates of these categories. For example, excitement is expressed by high pitch and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is characterized by over-articulation, fast, downward pitch movement, and overall elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range is set lower.
The study of emotion in natural speech is considerably more complicated. It is generally recognized that speakers show mixed feelings and ambiguous states of mind, and that their emotions do not fall into clear-cut categories.
There is a tendency for pitch to decline during the course of an utterance ('t Hart and Cohen, 1973; Maeda, 1976). This effect is at least partially caused by the drop of sub-glottal pressure (Lieberman, 1967; Fujisaki, 1983; Strik and Boves, 1995). Listeners compensate for this effect: when presented with two accented words of equal pitch height, listeners judge the second one to be more prominent (Pierrehumbert, 1979).
Below is an example of Mandarin Chinese (Shih, 2000) showing the pitch declination profile in a sequence of high level tones, which are marked as "H" in the figure. The pitch drops about 50 Hz from the highest "H" to the final "H".
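The declination tendency and the listener compensation described above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the declining reference line is a straight line and that perceived prominence can be approximated by the height of a peak above that line. Neither assumption is claimed by the studies cited here. The 50 Hz total drop echoes the Mandarin example; the 200 Hz starting value, the 2 second utterance duration, the 220 Hz peaks, and all function names are hypothetical.

```python
def declination_baseline(t, start_hz=200.0, drop_hz=50.0, duration_s=2.0):
    """Reference pitch (Hz) at time t (s) on a linearly declining line.

    Assumption: the line falls by drop_hz over duration_s, starting
    at start_hz. Real declination need not be linear.
    """
    return start_hz - drop_hz * (t / duration_s)


def relative_prominence(peak_hz, t, **baseline_params):
    """Height of a pitch peak above the declining reference line.

    Assumption: this difference is used as a crude stand-in for
    perceived prominence.
    """
    return peak_hz - declination_baseline(t, **baseline_params)


if __name__ == "__main__":
    # Two accents with the same physical pitch height, one early, one late.
    early = relative_prominence(220.0, t=0.5)
    late = relative_prominence(220.0, t=1.5)
    print("early accent above baseline:", early)
    print("late accent above baseline:", late)
    print("late accent judged more prominent:", late > early)
```

Because the reference line is lower late in the utterance, a late peak of the same height stands further above it, which is consistent with the finding that listeners judge the second of two equally high accents to be more prominent.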