Prosody and Prosodic Models ICSLP 2002 - September 16, 2002, Denver Colorado Chilin Shih and Greg Kochanski

# Section 3: Modeling Example

## Confirming a Word in a Question: Modeling Example

In this section we use one experiment to explore Stem-ML modeling: how to ask for confirmation of a particular word in English.

## The Nature of the Problem

Question intonation (especially yes-no questions) in English typically has a rising tail. The rising gesture starts on the last stressed syllable.

9478-1509-7091?

When the emphasis is early, there is typically a "plateau" of high f0 values stretching from the stressed syllable to just before the end of the sentence. There is another final rise at the end of the sentence.

This intonation contour occurs frequently in number confirmation, in both human-human and human-machine interaction. A dialogue system handling transactions may need to confirm a specific digit in a credit card number or phone number. The most natural and effective way to confirm that digit is to use rising intonation on the digit in question, as in the 0 below.

9478-1509-7091?

It is desirable to have a good model of this intonation contour in a Text-to-Speech system to handle such dialogue acts.

## Data

We designed a small experiment to explore how to model this type of intonation contour.

The database consists of 200 digit sequences organized in 16 blocks, with variations in phrasing, speaking speed, and the sentential position of the single digit being confirmed. We would like to see how question intonation interacts with these factors. We also recorded declarative and yes-no question intonation as references.

• Phrasing variations
• There are two types of phrasing: 12-digit sequences simulating (shortened) credit card numbers,

9478-1509-7091

and 10-digit sequences simulating telephone numbers.

301-123-5045

There are 4 blocks of credit card numbers, two with voiced digits and two with mixed digits. There are 12 blocks of telephone numbers, two with voiced digits and 10 with mixed digits. The mixed-digit sequences are designed so that all digits occur in all positions with equal frequency.
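One way to obtain the balance described for the mixed-digit sequences can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used for the database: it assumes 10 telephone-style sequences (matching the 10 mixed-digit telephone blocks) and fills each of the 10 positions with an independently shuffled permutation of the digits 0-9, so that every digit appears exactly once in every position. The function name `balanced_sequences` and the `seed` parameter are hypothetical.

```python
import random


def balanced_sequences(seed=None):
    """Return 10 phone-style digit strings in which every digit 0-9
    occurs exactly once in each of the 10 positions (hypothetical helper)."""
    rng = random.Random(seed)
    columns = []
    for _ in range(10):
        # One independent permutation of the ten digits per position.
        column = list("0123456789")
        rng.shuffle(column)
        columns.append(column)
    # Row r takes the r-th entry of every column.
    rows = ["".join(column[r] for column in columns) for r in range(10)]
    # Insert dashes in the 3-3-4 telephone grouping used in the text.
    return [f"{s[:3]}-{s[3:6]}-{s[6:]}" for s in rows]


if __name__ == "__main__":
    for sequence in balanced_sequences(seed=1):
        print(sequence)
```

Because the columns are shuffled independently, a digit may repeat within a single sequence, as it does in the example 301-123-5045, while the per-position frequencies stay equal.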

• Speed variations
• In general, the credit card numbers were read slowly and the phone numbers fast. One set of voiced telephone sequences was read fast and the other slowly.

• Positional variations
• We shift the emphasis, that is, the digit being confirmed, to all positions in the sentence. The reading order of the digit confirmation sentences was randomized within each block.

301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
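The positional variation can be sketched in a few lines. The following is a minimal illustration, assuming the confirmed digit is marked in plain text with a leading "+" (the notation this page uses for the pitch tracks) rather than with visual highlighting. The function name `confirmation_prompts` and the `seed` parameter are hypothetical.

```python
import random


def confirmation_prompts(digit_string, seed=None):
    """Return one question prompt per digit of digit_string, with '+'
    placed in front of the digit being confirmed, in shuffled order."""
    prompts = []
    for i, ch in enumerate(digit_string):
        if ch.isdigit():  # dashes are kept but never marked
            prompts.append(digit_string[:i] + "+" + digit_string[i:] + "?")
    # Randomize the reading order within the block.
    random.Random(seed).shuffle(prompts)
    return prompts


if __name__ == "__main__":
    for prompt in confirmation_prompts("301-123-5045", seed=0):
        print(prompt)
```

For a 10-digit telephone string this yields 10 prompts, one per digit position.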

• Intonation types
• Each block consists of a digit string read once with declarative intonation and once as a yes-no question, followed by 10 to 12 sentences, each confirming one of the digits in the string.

• The speaker was presented with dash-separated digit strings written in Arabic numerals. The speaker was asked to group the digits in credit card style or phone number style, but not to pause at the dash.

Sentences in the same block contain the same digit string, but they were read with different instructions.

• To obtain declarative intonation

Instruction: This is your phone number. You are giving other people this information over the phone.

Text presentation: 901-109-9091.

• To obtain yes-no question

Instruction: You are repeating a phone number back, asking whether you've got it right.

Text presentation: 901-109-9091?

• Confirming one digit

Instruction: You know you've got most of the numbers but are not sure about the one underlined in red. You are trying to confirm whether this digit is correct.

Text presentation: 901-109-9091?

• Caveats
• Some sentences were excluded from modeling: in three cases 0 was read as "oh" rather than "zero", and in six cases the speaker paused after the emphasis. These were acceptable renditions, but we decided not to model phrasing with pauses here.

• In all pitch tracks, vertical lines mark word boundaries. A dash "-" marks phrasing as indicated in the text. There may or may not be acoustic correlates at the indicated phrasing boundaries. A leading plus sign "+" marks the confirmed digit.
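The corpus size can be checked against the block design described in this section. The short calculation below assumes that each block contains one declarative reading, one yes-no question, and exactly one confirmation sentence per digit (12 for credit card strings, 10 for telephone strings). The variable and function names are illustrative only.

```python
# (label, number of blocks, digits per sequence), as stated in the text
BLOCKS = [
    ("credit card", 4, 12),
    ("telephone", 12, 10),
]


def sentences_per_block(n_digits):
    # one declarative + one yes-no question + one confirmation per digit
    return 2 + n_digits


total_blocks = sum(count for _, count, _ in BLOCKS)
total_sentences = sum(
    count * sentences_per_block(n_digits) for _, count, n_digits in BLOCKS
)

print(total_blocks, total_sentences)  # prints: 16 200
```

The 4 credit card blocks contribute 4 × 14 = 56 sentences and the 12 telephone blocks contribute 12 × 12 = 144, which together give the 200 sequences in 16 blocks. This count is before the exclusions listed under Caveats.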

## Observations

Phrasing, as indicated by the dash, is clearly marked in declarative sentences: pitch rises on the phrase-initial digit.

Not surprisingly, declarative sentences end with falling pitch and questions end with rising pitch. This is a consistent difference between yes-no questions and declarative sentences.

Digit confirmation is marked with a strong rise and longer duration on the digit being confirmed. Post-confirmation pitch remains high. Pre-confirmation phrasing is similar to that of declarative and yes-no question sentences.

There is another, final rise in the confirmation sentences. When the confirmed digit is very close to the end, however, the confirmation rise and the final rise fuse together.

Post-confirmation pitch tends to be flatter in fast speech than in slow speech.

Post-confirmation accents return after a while; this is especially clear in slow speech.

Post-confirmation phrasing is less obvious. However, when the phrasing structure is observable, new phrases are marked by a pitch drop, in contrast to the pitch rise in declarative sentences.

The digit immediately before the confirmed digit tends to be de-accented, as in the 1 of "-1 +5". This is particularly clear when this digit starts a new phrase, where it would normally be marked with phrase-initial high pitch. Phrasing is marked when there are at least two digits before the confirmed digit in the phrase, as in the 1 of "-1 5 +0". There seems to be a rhythmic consideration here: it appears that the speaker de-accents the phrase-initial digit to avoid putting strong phrases too close together.
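The qualitative observations about accenting around the confirmed digit can be summarized as a small labeling rule. The sketch below is only a schematic restatement of those tendencies, not a Stem-ML model and not the authors' implementation. The function name `accent_plan`, the label strings, and the 0-based `confirmed_index` (counting digits only, ignoring dashes) are all hypothetical.

```python
def accent_plan(digit_string, confirmed_index):
    """Label each digit of a dash-grouped string such as '9478-1509-7091'.

    confirmed_index counts digits from 0 and ignores the dashes.
    """
    labels = []
    index = 0
    for phrase in digit_string.split("-"):
        for pos, digit in enumerate(phrase):
            if index == confirmed_index:
                label = "confirmed"          # strong rise, longer duration
            elif index > confirmed_index:
                label = "post-confirmation"  # pitch stays high, phrasing weak
            elif index == confirmed_index - 1:
                label = "de-accented"        # digit just before the confirmed one
            elif pos == 0:
                label = "phrase-initial"     # usual phrase-initial high pitch
            else:
                label = "plain"
            labels.append((digit, label))
            index += 1
    return labels


if __name__ == "__main__":
    # "-1 +5": the phrase-initial 1 directly precedes the confirmed 5.
    print(accent_plan("9478-1509-7091", 5)[4:6])
    # "-1 5 +0": two digits precede the confirmed 0, so the 1 keeps its mark.
    print(accent_plan("9478-1509-7091", 6)[4:7])
```

In the first call the phrase-initial 1 is labeled de-accented. In the second it keeps its phrase-initial label, and the 5 is the de-accented digit.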
