Prosody and Prosodic Models ICSLP 2002 - September 16, 2002, Denver, Colorado. Chilin Shih and Greg Kochanski
In this section we use one experiment to explore Stem-ML modeling: how to ask for confirmation of a particular word in English.
Question intonation (especially yes-no questions) in English typically has a rising tail. The rising gesture starts on the last stressed syllable.
9478-1509-7091? [audio] |
When the emphasis is early, there is typically a "plateau" of high f0 values stretching from the stressed syllable to just before the end of the sentence. There is another final rise at the end of the sentence.
This intonation contour occurs frequently in number confirmation, in both human-human and human-machine interaction. A dialogue system handling transactions may need to confirm a specific digit in a credit card number or phone number. The most natural and effective way to confirm that digit is to use rising intonation on the questionable digit, as in the 0 below.
9478-1509-7091? [audio] |
It is desirable to have a good model of this intonation contour in a Text-to-Speech system to handle such dialogue acts.
The database consists of 200 digit sequences organized in 16 blocks, with variations in phrasing, speaking rate, and the sentential position of the single digit being confirmed. We would like to see how question intonation interacts with these factors. We also recorded declarative and yes-no question intonation as references.
There are two types of phrasing: 12-digit sequences simulating (shortened) credit card numbers,
9478-1509-7091
and 10-digit sequences simulating telephone numbers.
301-123-5045
There are 4 blocks of credit card numbers, two with voiced digits and two with mixed digits. There are 12 blocks of telephone numbers, two with voiced digits and 10 with mixed digits. The mixed-digit sequences are designed so that all digits occur in all positions with equal frequency.
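One way to satisfy the equal-frequency constraint for the 10 mixed-digit telephone blocks is a Latin square, in which each digit appears exactly once in each position across the 10 strings. The sketch below is only an illustration under that assumption: the text does not say how the actual sequences were constructed, and the function names (`cyclic_latin_square`, `format_phone`, `is_balanced`) are hypothetical.

```python
from collections import Counter

def cyclic_latin_square(n=10):
    """Row b, column p holds digit (b + p) mod n (a hypothetical design)."""
    return [[(b + p) % n for p in range(n)] for b in range(n)]

def format_phone(digits):
    """Render 10 digits in the dash-separated 3-3-4 telephone style."""
    s = "".join(str(d) for d in digits)
    return f"{s[:3]}-{s[3:6]}-{s[6:]}"

def is_balanced(rows):
    """True if every digit 0-9 occurs equally often in every position."""
    for p in range(len(rows[0])):
        counts = Counter(row[p] for row in rows)
        if len(counts) != 10 or len(set(counts.values())) != 1:
            return False
    return True

square = cyclic_latin_square()
sequences = [format_phone(row) for row in square]
print(sequences[0])         # 012-345-6789
print(is_balanced(square))  # True
```

A cyclic square produces very regular strings; a real design would more likely permute rows and columns, which preserves the balance that `is_balanced` checks.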
In general, the credit card numbers were read slowly and the phone numbers fast. One set of voiced telephone sequences was read fast and the other slowly.
We shift the emphasis, that is, the digit being confirmed, to all positions in the sentence. The reading order of the digit confirmation sentences was randomized within each block.
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
301-123-5045
Each block consists of a digit string read once in declarative intonation and once as a yes-no question, followed by 10 to 12 sentences, each confirming one of the digits in the string.
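As a quick consistency check, the block layout described above can be enumerated and counted. The sketch assumes one confirmation sentence per digit position (10 for telephone strings, 12 for credit card strings), which is how we read "10 to 12 sentences"; the names `BLOCKS` and `sentences_in_block` are illustrative only.

```python
# Block layout taken from the text: 4 credit card blocks of 12 digits,
# 12 telephone blocks of 10 digits. Values are (number of blocks, digits).
BLOCKS = {"credit_card": (4, 12), "telephone": (12, 10)}

def sentences_in_block(n_digits):
    """One declarative, one yes-no question, one confirmation per digit
    position (the last part is an assumption about the design)."""
    items = [("declarative", None), ("yes-no question", None)]
    items += [("confirmation", pos) for pos in range(1, n_digits + 1)]
    return items

total = sum(n_blocks * len(sentences_in_block(n_digits))
            for n_blocks, n_digits in BLOCKS.values())
print(total)  # 200
```

Under this reading, 4 x 14 + 12 x 12 = 200, which matches the stated database size.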
The speaker was presented with dash-separated digit strings written in Arabic numerals. The speaker was asked to group the digits in credit card style or phone number style, but not to pause at the dashes.
Sentences in the same block contain the same digit string, but they were read with different instructions.
Instruction: This is your phone number. You are giving other people this information over the phone.
Text presentation: 901-109-9091.
Instruction: You are repeating a phone number back, asking whether you've got it right.
Text presentation: 901-109-9091?
Instruction: You know you've got most of the numbers but are not sure about the one underlined in red. You are trying to confirm whether this digit is correct.
Text presentation: 901-109-9091?
Some sentences were excluded from modeling: in three cases 0 was read as "oh" rather than "zero", and in six cases the speaker paused after the emphasis. These were acceptable renditions, but we decided not to model phrasing with pauses here.
In all pitch tracks, vertical lines mark word boundaries. A dash "-" marks phrasing as indicated in the text; there may or may not be acoustic correlates at the indicated phrasing boundaries. A leading plus sign "+" marks the confirmed digit.
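For readers who want to process such labels automatically, here is a minimal sketch of a parser for the marking convention. It assumes a plain-text form in which digits are separated by spaces and each token is a single digit with optional leading "-" and "+" markers, as in the label "-1 5 +0" used later in this section; the `Digit` class and `parse_label` function are hypothetical helpers, not part of Stem-ML.

```python
from dataclasses import dataclass

@dataclass
class Digit:
    value: str            # the digit itself, e.g. "5"
    phrase_initial: bool  # token carried a leading "-"
    confirmed: bool       # token carried a leading "+"

def parse_label(label):
    """Parse a space-separated label such as "-1 5 +0" (assumed format)."""
    digits = []
    for token in label.split():
        markers, value = token[:-1], token[-1]
        if not value.isdigit() or set(markers) - {"-", "+"}:
            raise ValueError(f"unexpected token: {token!r}")
        digits.append(Digit(value, "-" in markers, "+" in markers))
    return digits

for d in parse_label("-1 5 +0"):
    print(d)
```

Here the 1 is flagged as phrase-initial and the 0 as the confirmed digit. How a digit that is both phrase-initial and confirmed would be written is not shown in the text, so the parser simply accepts the two markers in either order.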
We include the voiced digit data set here.
Audio | Phrasing as indicated by the dash is clearly marked in declarative sentences. Pitch rises on the phrase-initial digit. |
Not surprisingly, declarative sentences end with falling pitch and questions end with rising pitch. This is a consistent difference between yes-no questions and declarative sentences. | Audio Audio |
Audio | Digit confirmation is marked with a strong rise and longer duration on the digit being confirmed. Post-confirmation pitch remains high. Pre-confirmation phrasing is similar to that of declarative and yes-no question sentences. |
There is another final rise in the confirmation sentences. However, when the confirmed digit is very close to the end, the confirmation rise and the final rise fuse together. | Audio Audio |
Audio Audio | Post-confirmation pitch tends to be flatter in fast speech than in slow speech. |
Post-confirmation accent returns after a while. This is especially clear in slow speech. | Audio |
Audio | Post-confirmation phrasing is less obvious. However, when the phrasing structure is observable, new phrases are marked by a pitch drop, in contrast to the pitch rise in declarative sentences. |
The digit immediately before the confirmed digit tends to get de-accented, as in the 1 of "-1 +5". This is particularly clear when this digit starts a new phrase, where it would normally be marked with phrase-initial high pitch.
Phrasing is marked when there are at least two digits before the confirmed digit in the phrase, as in the 1 of "-1 5 +0". There seems to be a rhythmic consideration here: it appears that the speaker de-accents the phrase-initial digit to avoid putting strong phrases too close together. |
Audio Audio |