Chilin Shih, Greg P. Kochanski
Bell Labs, Lucent Technologies
Section 3: Modeling Example
Confirming a Word in a Question: Modeling Example
In this section we use one experiment to
explore Stem-ML modeling.
The Nature of the Problem
Yes-no questions in English typically end in a rising tail. The
rising gesture starts on the last stressed syllable.
When the emphasis is early, there is
typically a "plateau" of high f0 values stretching from the
stressed syllable to just before the end of the sentence. There is
another final rise at the end of the sentence.
This intonation contour occurs frequently
in number confirmation, in both human-human and human-machine
interaction. A dialogue system handling transactions may need to
confirm a specific digit in a credit card number or phone number.
The most natural and effective way to confirm that digit is to use
rising intonation on the digit in question, as in the 0 below.
It is desirable to have a good model of
this intonation contour in a Text-to-Speech system to handle such
dialogue acts.
Data
We designed a small experiment to explore
how to model this type of intonation contour.
The database consists of 200 digit
sequences, organized in 16 blocks, with variations in phrasing,
speaking speed, and the sentential position of the single digit
being confirmed. We would like to see how question intonation
interacts with these factors. We also recorded declarative and
yes-no question intonation as references.
- Phrasing variations
-
There are two types of phrasing: 12-digit
sequences simulating (shortened) credit card numbers,
9478-1509-7091
and 10-digit sequences simulating
telephone numbers,
301-123-5045
There are 4 blocks of credit card
numbers, two with voiced digits and two with mixed digits. There are
12 blocks of telephone numbers, two with voiced digits and 10 with
mixed digits. The mixed-digit sequences are designed so that all
digits occur in all positions with equal frequency.
- Speed variations
-
In general, the credit card numbers were
read slowly and the phone numbers fast. One set of voiced telephone
sequences was read fast and the other slowly.
- Positional variations
- We shift the
emphasis, that is, the digit being confirmed, to all positions in
the sentence. The reading order of the digit confirmation sentences
was randomized within each block.
301-123-5045 (repeated with the confirmed digit highlighted in each position in turn)
- Intonation types
-
Each block consists of one digit string read
once with declarative intonation and once as a yes-no question,
followed by 10 to 12 sentences, each confirming one of the digits in
the string.
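The block design can be cross-checked against the stated total of 200 sequences. The sketch below is only an illustration: the function name `block_sentences` and the written form of the stimuli are assumptions, borrowing the leading plus sign that the pitch-track labels use to mark the confirmed digit. It assumes one confirmation sentence per digit, that is, 10 for a telephone string and 12 for a credit card string.

```python
def block_sentences(digit_string):
    """List the sentences read for one block: a declarative, a yes-no
    question, then one confirmation question per digit position."""
    sentences = [digit_string + ".", digit_string + "?"]
    digit_positions = [i for i, ch in enumerate(digit_string) if ch.isdigit()]
    for pos in digit_positions:
        # Assumed notation: "+" placed before the digit being confirmed.
        marked = digit_string[:pos] + "+" + digit_string[pos:]
        sentences.append(marked + "?")
    return sentences


phone_block = block_sentences("301-123-5045")    # 10-digit telephone phrasing
card_block = block_sentences("9478-1509-7091")   # 12-digit credit card phrasing

# 4 credit card blocks and 12 telephone blocks, as described in the text.
total = 4 * len(card_block) + 12 * len(phone_block)
print(len(phone_block), len(card_block), total)  # 12 14 200
```

Under these assumptions the 16 blocks yield 4 x 14 + 12 x 12 = 200 sentences, which agrees with the size of the database given above.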
- Reading instructions
-
The speaker was presented with dash-separated
digit strings written in Arabic numerals and was asked to
group the digits in credit card style or phone number style, but
not to pause at the dashes.
Sentences in the same block contain the
same digit string, but they were read with different
instructions.
- To obtain declarative intonation
Instruction: This is your phone number. You are giving other people this
information over the phone.
Text presentation: 901-109-9091.
- To obtain yes-no question
Instruction: You are repeating a phone number back, asking whether you've
got it right.
Text presentation: 901-109-9091?
- Confirming one digit
Instruction: You know you've got most of the numbers but are not sure about
the one underlined in red. You are trying to confirm whether this digit
is correct.
Text presentation: 901-109-9091?
- Caveats
-
Some sentences were excluded from
modeling. In three cases 0 was read as "oh" rather
than "zero". In six cases the speaker paused after the
emphasis. These were acceptable renditions, but we decided not to
model phrasing with pauses here.
- Data links
-
In all pitch tracks, vertical lines mark
word boundaries. A dash "-" marks phrasing as indicated in the text.
There may or may not be acoustic correlates at the indicated
phrasing boundaries. A leading plus sign "+" marks the confirmed
digit.
- Voiced Digits
-
- Mixed Digits
-
- Stimuli set: 9478-1509-7091, credit card phrasing, slow
- Stimuli set: 8495-2136-1076, credit card phrasing, slow
- Stimuli set: 147-205-3130, telephone phrasing, fast
- Stimuli set: 039-088-2421, telephone phrasing, fast
- Stimuli set: 410-746-8312, telephone phrasing, fast
- Stimuli set: 562-372-4583, telephone phrasing, fast
- Stimuli set: 973-634-7654, telephone phrasing, fast
- Stimuli set: 301-123-5045, telephone phrasing, fast
- Stimuli set: 695-497-9876, telephone phrasing, fast
- Stimuli set: 758-959-1207, telephone phrasing, fast
- Stimuli set: 824-861-0968, telephone phrasing, fast
- Stimuli set: 286-510-6799, telephone phrasing, fast
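The claim that the mixed-digit sequences place all digits in all positions with equal frequency can be checked against the ten telephone stimulus sets listed above. The following is a minimal sketch; the helper names `position_counts` and `is_balanced` are illustrative and not part of the original study. The check is applied to the telephone sets only, since two credit card strings cannot cover ten digits in each of twelve positions.

```python
from collections import Counter

# The ten mixed-digit telephone stimulus sets listed above.
PHONE_SETS = [
    "147-205-3130", "039-088-2421", "410-746-8312", "562-372-4583",
    "973-634-7654", "301-123-5045", "695-497-9876", "758-959-1207",
    "824-861-0968", "286-510-6799",
]


def position_counts(stimuli):
    """Return one Counter per digit position, counting each digit seen there."""
    digit_strings = [s.replace("-", "") for s in stimuli]
    return [Counter(column) for column in zip(*digit_strings)]


def is_balanced(stimuli):
    """True if every digit 0-9 occurs equally often in every position."""
    if len(stimuli) % 10 != 0:
        return False
    expected = len(stimuli) // 10
    return all(counts[d] == expected
               for counts in position_counts(stimuli)
               for d in "0123456789")


print(is_balanced(PHONE_SETS))  # True
```

Each digit occurs exactly once in each of the ten positions, so the ten telephone strings form a fully balanced design.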
Observations
Phrasing as indicated by the dash is clearly marked in
declarative sentences. Pitch rises on the phrase-initial
digit.
Not surprisingly, declarative sentences end with falling pitch
and questions end with rising pitch. This is a consistent
difference between yes-no questions and declarative sentences.
Digit confirmation is marked with a strong rise and longer
duration on the digit being confirmed. Post-confirmation pitch
remains high. Pre-confirmation phrasing is similar to that of
declarative and yes-no question sentences.
There is another final rise in the confirmation sentences, but
when the confirmed digit is very close to the end, the
confirmation rise and the final rise fuse together.
Post-confirmation pitch tends to be flatter in fast speech than
in slow speech.
The post-confirmation accent returns after a while; this is
especially clear in slow speech.
Post-confirmation phrasing is less obvious. However, when the
phrasing structure is observable, new phrases are marked by a pitch
drop, in contrast to the pitch rise in declarative sentences.
The digit immediately before the confirmed digit tends to be
de-accented, as in the 1 of -1 +5. This is particularly clear when
this digit starts a new phrase, where it would normally be marked
with phrase-initial high pitch.
Phrasing is marked when there are at least two digits before the
confirmed digit in the phrase, as in the 1 of -1 5 +0.
There seems to be a rhythmic consideration here. It appears that
the speaker de-accents the phrase-initial digit to avoid putting
strong accents too close together.
Model
Compared to other modeling explanations
This particular contour poses some
modeling difficulties for various intonation
frameworks.
In the English ToBI system, the final rising
contour is transcribed as L*+H H- H%: a rising pitch accent aligned
with the last stressed syllable, followed by a high phrase
tone and a high boundary tone. The H of the L*+H accent and the final
boundary tone H% are local events, bound to the accented
word and the phrase-final position respectively.
ToBI also assumes that all lexical stresses
after the nucleus accent are deleted. If the nucleus accent is on
the last word, the pitch accent, phrase tone and boundary tone fuse
into one rising shape. If the nucleus accent is early, the phrase
tone H- expands to fill the gap and creates a plateau. F0 values in
between are generated by interpolation.
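The target-and-interpolation account summarized above can be sketched in a few lines. This is a toy illustration only, not Stem-ML and not an implementation of ToBI: the function names, the f0 values in Hz, and the duration of the final rise are all assumptions chosen for readability, and the pre-nuclear part of the contour is not modeled.

```python
def rising_contour_targets(nucleus_start, nucleus_end, utt_end,
                           low_hz=110.0, high_hz=170.0, boundary_hz=210.0,
                           final_rise_s=0.15):
    """Return (time_s, f0_hz) targets for an L*+H H- H% style contour.

    All default values are illustrative assumptions, not measurements.
    """
    plateau_end = utt_end - final_rise_s
    if nucleus_end < plateau_end:
        # Early nucleus: H- spreads from the accent to just before the end.
        return [(nucleus_start, low_hz),   # L*
                (nucleus_end, high_hz),    # trailing H of L*+H
                (plateau_end, high_hz),    # end of the H- plateau
                (utt_end, boundary_hz)]    # H%
    # Late nucleus: accent, phrase tone and boundary tone fuse into one rise.
    return [(nucleus_start, low_hz), (utt_end, boundary_hz)]


def f0_at(targets, t):
    """Piecewise-linear interpolation between successive targets."""
    if t <= targets[0][0]:
        return targets[0][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return targets[-1][1]


early = rising_contour_targets(0.3, 0.5, 2.0)  # emphasis early in the sentence
late = rising_contour_targets(1.7, 2.0, 2.0)   # emphasis on the last word
print(f0_at(early, 1.0))  # 170.0, a point on the plateau
print(len(late))          # 2, a single fused rise
```

In this sketch everything between the accent and the final rise is a flat interpolation, so nothing returns after the nucleus. That is the property at issue when post-nucleus accents are observed in the data.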
Our data show that the magnitude of
post-nucleus accents varies with speaking speed, a phenomenon that
has an articulatory explanation and can be modeled as such.