Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver, Colorado
Chilin Shih and Greg Kochanski

Section 3: A Modeling Example

In this section, we build a model of the intonation used to confirm a word in a question. The examples take the form "One two three four - five six seven - eight nine zero?", where the subject was told to ask for confirmation of one digit (in this example, the sixth).

This use of intonation is frequently observed in natural speech when one person reads a digit string to another, and also when one person spells out an unfamiliar word to another.

The Data

The experiment design, data, and observations are available on a separate linked page. In general, pitch rises on the word for which the speaker is seeking confirmation, and rises again at the end of the sentence. There are interesting variations in accent interaction, phrasing, and speaking rate.

The Model (Stem-ML)

For a detailed explanation of the model and the Stem-ML concept, please follow the model link. Here is a brief summary of how we chose parameters to fit the observed data.

We use a total of 48 parameters to fit a subset of the data: 43 sentences composed of the voiced digits "one", "nine", and "zero". On average, that is 1.12 parameters per sentence. All parameters are global and are shared among all sentences. That is, we did not leave room in this model to capture sentence-specific variation.
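The parameters-per-sentence figure follows directly from the two counts quoted above. A minimal sketch of the arithmetic (the function name is ours, chosen for illustration only):

```python
# Counts taken from the text above.
TOTAL_PARAMETERS = 48
NUM_SENTENCES = 43


def parameters_per_sentence(total_parameters: int, num_sentences: int) -> float:
    """Average number of model parameters available per fitted sentence."""
    if num_sentences <= 0:
        raise ValueError("num_sentences must be positive")
    return total_parameters / num_sentences


ratio = parameters_per_sentence(TOTAL_PARAMETERS, NUM_SENTENCES)
print(f"{ratio:.2f} parameters per sentence")
```

Because every parameter is global, this ratio would shrink further if more sentences were fitted with the same model.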

The Results

We include all 43 plots showing the data vs. the model-generated f0:

Plot 1 Plot 2 Plot 3

The RMS deviation is 0.212 Barks, which corresponds to approximately 21 Hz or 1.7 semitones. The result is surprisingly good, especially considering how few parameters are used. If a closer fit is required, there is plenty of room to add parameters that capture sentence-specific variation.
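The sketch below illustrates how an RMS deviation is computed and how the quoted unit conversions can be roughly reproduced. It is not the authors' procedure. Two values are our assumptions and do not appear in the text: a locally linear Bark scale of about 100 Hz per Bark (a common low-frequency approximation) and a reference pitch of 200 Hz for the semitone conversion. The page does not state which Bark formula or reference pitch was actually used.

```python
import math


def rms_deviation(observed, predicted):
    """Root-mean-square deviation between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def hz_interval_to_semitones(delta_hz, reference_hz):
    """Size in semitones of an upward step of delta_hz from reference_hz."""
    return 12.0 * math.log2((reference_hz + delta_hz) / reference_hz)


# ASSUMPTION: below roughly 500 Hz the Bark scale is close to linear,
# at about 100 Hz per Bark. The text does not say which formula was used.
ASSUMED_HZ_PER_BARK = 100.0

# ASSUMPTION: hypothetical reference pitch; not stated in the text.
ASSUMED_REFERENCE_F0_HZ = 200.0

RMS_BARK = 0.212  # value quoted in the text

rms_hz = RMS_BARK * ASSUMED_HZ_PER_BARK
rms_semitones = hz_interval_to_semitones(rms_hz, ASSUMED_REFERENCE_F0_HZ)

print(f"about {rms_hz:.0f} Hz, about {rms_semitones:.1f} semitones")
```

Under these assumptions the conversion lands close to the quoted 21 Hz and 1.7 semitones. A different reference pitch or Bark formula would shift both numbers somewhat.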

The model captures the variation between slow and fast speech naturally: it needs no adjustment for speaking rate and has no parameter that addresses this aspect of the variation. Slow speech has more pitch movement, while fast speech has relatively smooth pitch. Many intonation schemes would require a categorically different set of accents to express the difference between fast and slow speech. Our model does not, which may lead to a far simpler intonation phonology.

