Prosody and Prosodic Models
Tutorial, ICSLP 2002, Denver, Colorado

September 16, 2002
Chilin Shih, Greg P. Kochanski
Bell Labs, Lucent Technologies

Section 3: Modeling Example

Confirming a Word in a Question: Modeling Example

In this section we use one experiment to explore Stem-ML modeling.

The Nature of the Problem

Yes-no questions in English typically end with a rising tail. The rising gesture starts on the last stressed syllable.


When the emphasis is early, there is typically a "plateau" of high f0 values stretching from the stressed syllable to just before the end of the sentence. There is another final rise at the end of the sentence.

This intonation contour occurs frequently in number confirmation, in both human-human and human-machine interaction. A dialogue system handling transactions may need to confirm a specific digit in a credit card number or phone number. The most natural and effective way to confirm that digit is to use rising intonation on the digit in question, as in the 0 below.


It is desirable to have a good model of this intonation contour in a Text-to-Speech system to handle such dialogue acts.


We designed a small experiment to explore how to model this type of intonation contour.

The database consists of 200 digit sequences, organized in 16 blocks, with variations in phrasing, speaking speed, and the sentential position of the single digit being confirmed. We would like to see how question intonation interacts with these factors. We also recorded declarative and yes-no question intonation as references.


[Figure: Q0015.gif, audio] Phrasing, as indicated by the dash, is clearly marked in declarative sentences. Pitch rises on the phrase-initial digit.

[Figures: Q0029.gif, Q0030.gif, audio] Not surprisingly, declarative sentences end with falling pitch and questions end with rising pitch. This is a consistent difference between yes-no questions and declarative sentences.

[Figure: Q0036.gif, audio] Digit confirmation is marked with a strong rise and longer duration on the digit being confirmed. Post-confirmation pitch remains high. Pre-confirmation phrasing is similar to that of declarative and yes-no question sentences.

[Figures: Q0079.gif, Q0077.gif, audio] There is another final rise in the confirmation sentences. But when the confirmed digit is very close to the end, the confirmation rise and the final rise fuse together.

[Figures: Q0049.gif, Q0073.gif, audio] Post-confirmation pitch tends to be flatter in fast speech than in slow speech.

[Figure: Q0067.gif, audio] Post-confirmation accents return after a while. This is especially clear in slow speech.

[Figure: Q0039.gif, audio] Post-confirmation phrasing is less obvious. However, when the phrasing structure is observable, new phrases are marked by a pitch drop, in contrast to the pitch rise in declarative sentences.
The digit immediately before the confirmed digit tends to be de-accented, as in the 1 of -1 +5. This is particularly clear when this digit starts a new phrase, where it would normally be marked with a phrase-initial high pitch.

Phrasing is marked when there are at least two digits before the confirmed digit in the phrase, as in the 1 of -1 5 +0.

There seems to be a rhythmic consideration here. It appears that the speaker de-accents the phrase-initial digit to avoid putting strong accents too close together.

[Figures: Q0005.gif, Q0013.gif, audio]
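The de-accenting observation can be written down as a small rule. The Python sketch below is illustrative only and is not part of Stem-ML. It assumes that a leading "-" marks a phrase-initial digit and a leading "+" marks the confirmed digit, as the examples -1 +5 and -1 5 +0 suggest, and that the first digit of the utterance also starts a phrase. The names Digit, parse_sequence and deaccented_indices are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Digit:
    value: str
    phrase_initial: bool  # assumed: token had a leading "-", or is utterance-initial
    confirmed: bool       # assumed: token had a leading "+"


def parse_sequence(text: str) -> List[Digit]:
    """Parse a string such as "3 4 -1 +5 6" into annotated digits."""
    digits: List[Digit] = []
    for token in text.split():
        marks = token.rstrip("0123456789")   # leading marker characters
        value = token[len(marks):]           # the digit itself
        if not value or set(marks) - {"-", "+"}:
            raise ValueError(f"unrecognized token: {token!r}")
        starts_phrase = "-" in marks or not digits
        digits.append(Digit(value, starts_phrase, "+" in marks))
    return digits


def deaccented_indices(digits: List[Digit]) -> List[int]:
    """Indices of phrase-initial digits that are directly followed,
    within the same phrase, by the confirmed digit."""
    result = []
    for i in range(len(digits) - 1):
        nxt = digits[i + 1]
        if digits[i].phrase_initial and nxt.confirmed and not nxt.phrase_initial:
            result.append(i)
    return result


if __name__ == "__main__":
    for text in ("3 4 -1 +5 6", "3 4 -1 5 +0"):
        print(text, "->", deaccented_indices(parse_sequence(text)))
```

Under these assumptions the 1 in -1 +5 is flagged as de-accented, while the 1 in -1 5 +0 is not, because two digits precede the confirmed digit in that phrase. The rule describes a tendency in the recordings, so a real model would treat it as a preference rather than a hard constraint.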


Comparison with Other Modeling Approaches

This particular contour poses some modeling difficulties for various intonation frameworks.

In the English ToBI system, the final rising contour is transcribed as L*+H H- H%: a rising pitch accent aligned with the last stressed syllable, followed by a high phrase tone and a high boundary tone. The accented H of L*+H and the final boundary tone H% are local events that are bound to the accented word and the phrase-final position, respectively.

ToBI also assumes that all lexical stresses after the nucleus accent are deleted. If the nucleus accent is on the last word, the pitch accent, phrase tone and boundary tone fuse into one rising shape. If the nucleus accent is early, the phrase tone H- expands to fill the gap and creates a plateau. F0 values in between are generated by interpolation.
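To make the interpolation account concrete, here is a minimal Python sketch of a target-and-interpolation contour for an early nucleus accent. It is not taken from any ToBI-based synthesizer: the function interpolate_f0, the target times, and the f0 values are hypothetical, and the H- plateau is represented, as an assumption, by two targets of equal height.

```python
from typing import List, Tuple

Target = Tuple[float, float]  # (time in seconds, f0 in Hz)

# Hypothetical targets for L*+H H- H% with an early nucleus accent.
TARGETS: List[Target] = [
    (0.30, 120.0),  # L*: low target on the stressed syllable
    (0.45, 200.0),  # +H: end of the accent rise, start of the plateau
    (1.20, 200.0),  # H-: plateau stretched to just before the end
    (1.40, 260.0),  # H%: final boundary rise
]


def interpolate_f0(targets: List[Target], times: List[float]) -> List[float]:
    """Piecewise-linear interpolation between tonal targets.
    Target times must be distinct; values are held constant
    before the first target and after the last one."""
    pts = sorted(targets)
    out = []
    for t in times:
        if t <= pts[0][0]:
            out.append(pts[0][1])
        elif t >= pts[-1][0]:
            out.append(pts[-1][1])
        else:
            for (t0, f0), (t1, f1) in zip(pts, pts[1:]):
                if t0 <= t <= t1:
                    out.append(f0 + (f1 - f0) * (t - t0) / (t1 - t0))
                    break
    return out


if __name__ == "__main__":
    sample_times = [0.30, 0.375, 0.80, 1.30]
    for t, f in zip(sample_times, interpolate_f0(TARGETS, sample_times)):
        print(f"{t:.3f} s  {f:.1f} Hz")
```

In this sketch everything between the +H and H- targets is flat by construction, so the contour cannot show any accent after the nucleus without adding further targets.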

Our data show that the magnitude of post-nucleus accents varies with speed, a phenomenon that has an articulatory explanation and can be modeled as such.
