Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver, Colorado
Chilin Shih and Greg Kochanski

Section 3: A Modeling Example

In this section, we will build a model of the intonation used to confirm a word in a question. The examples are in the form, "One two three four - five six seven - eight nine zero?", where the subject was told to ask (in this example) for confirmation of the sixth digit.

This use of intonation is frequently observed in natural speech when one person reads a digit string to another, and also when one person spells out an unfamiliar word to another.

The Data

The experiment design, data, and observations are described on a separate page. In general, pitch rises on the word for which the speaker was seeking confirmation, and rises again at the end of the sentence. There are interesting variations in accent interaction, phrasing, and speaking rate.

The Model (Stem-ML)

For a detailed explanation of the model and the Stem-ML concept, please follow the model link. Here is a brief summary of how we chose parameters to fit the observed data.

We use a total of 48 parameters to fit a subset of the data: 43 sentences composed of the voiced digits "one", "nine" and "zero". On average, we used 1.12 parameters per sentence. All parameters are global and are shared among all sentences. That is, we did not leave room in this model to capture sentence-specific variation.

16 parameters describe accent shape: 2 different kinds of accent plus boundary tones.
3 parameters for accent placement.
2 parameters for the phrase curve, including a step.
11 parameters determine accent strength. Three of those control the boundary tones; the rest define the strength of words in terms of their position in the phrase, their position in the utterance, and their position relative to the digit in question.
4 parameters control how accents interact with their neighbors.
7 parameters define the width of the accent.
4 parameters are global Stem-ML parameters that control the baseline and muscle response speed, among others.
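
The bookkeeping above can be tallied in a few lines. This is only a sketch: the dictionary keys are shorthand labels invented here, not Stem-ML names. The groups as listed sum to 47, one fewer than the stated total of 48, so the per-sentence figure below is computed from the stated total.

```python
# Parameter counts as listed in the text; the keys are hypothetical labels.
param_groups = {
    "accent_shape": 16,
    "accent_placement": 3,
    "phrase_curve": 2,
    "accent_strength": 11,
    "neighbor_interaction": 4,
    "accent_width": 7,
    "global_stem_ml": 4,
}

STATED_TOTAL = 48    # total number of parameters stated in the text
NUM_SENTENCES = 43   # number of sentences in the fitted subset

# Sum of the groups exactly as they are listed.
listed_total = sum(param_groups.values())

# Average number of parameters per sentence, using the stated total.
params_per_sentence = STATED_TOTAL / NUM_SENTENCES

print(f"listed groups sum to {listed_total}")
print(f"parameters per sentence: {params_per_sentence:.2f}")
```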

The Results

We include all 43 plots showing the data vs. the model-generated f0:

Plot 1

Plot 2

Plot 3

The RMS deviation is 0.212 Barks, which corresponds to approximately 21 Hz or 1.7 semitones. The fit is surprisingly good, especially considering how few parameters are used. If a closer fit is required, there is plenty of room to add parameters that capture sentence-specific variation.
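
To make the units concrete, here is a minimal sketch of how an RMS deviation in Bark could be computed and related to Hz and semitones. It rests on assumptions not stated in the text: the Zwicker and Terhardt approximation is used for the Bark scale (the conversion actually used for the reported figure is not given), the reference f0 of 200 Hz is chosen only for illustration, and all function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz (Zwicker & Terhardt approximation, assumed)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def hz_to_semitones(f_hz, ref_hz):
    """Interval in semitones between f_hz and a reference frequency."""
    return 12.0 * math.log2(f_hz / ref_hz)

def rms_deviation_bark(observed_hz, model_hz):
    """RMS difference between two equal-length f0 tracks, measured in Bark."""
    if len(observed_hz) != len(model_hz) or not observed_hz:
        raise ValueError("tracks must be non-empty and of equal length")
    squared = [
        (hz_to_bark(o) - hz_to_bark(m)) ** 2
        for o, m in zip(observed_hz, model_hz)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative check at an assumed f0 of 200 Hz with a 21 Hz offset.
ref_hz = 200.0
offset_bark = hz_to_bark(ref_hz + 21.0) - hz_to_bark(ref_hz)
offset_semitones = hz_to_semitones(ref_hz + 21.0, ref_hz)
print(f"{offset_bark:.3f} Bark, {offset_semitones:.2f} semitones")
```

Under these assumptions, a 21 Hz offset near 200 Hz comes out at about 0.2 Bark and 1.7 semitones, which is in line with the figures quoted above.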

The model captures the variation between slow and fast speech naturally: it needs no adjustment for speaking rate, and none of its parameters addresses rate directly. Slow speech has more pitch movement, while fast speech has relatively smooth pitch. Many intonation schemes would require a categorically different set of accents to express the difference between fast and slow speech. Our model does not, which may lead to a far simpler intonation phonology.
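
The following toy sketch illustrates one way a rate effect of this kind can arise without any rate-specific parameter. It is not Stem-ML: it assumes a simple first-order lag that follows per-syllable pitch targets with a fixed response time, and the function names, target values, durations, and response time are all invented for illustration.

```python
def smooth_track(targets, syllable_dur, response_time=0.08, dt=0.005):
    """Follow per-syllable pitch targets with a first-order lag (toy model)."""
    pitch = targets[0]
    track = []
    steps = max(1, round(syllable_dur / dt))  # time steps per syllable
    for target in targets:
        for _ in range(steps):
            # Move a fixed fraction of the remaining distance toward the target.
            pitch += (target - pitch) * dt / response_time
            track.append(pitch)
    return track

def excursion(track):
    """Total pitch range covered by a track."""
    return max(track) - min(track)

# Alternating low/high targets in arbitrary units; identical for both rates.
targets = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

slow = smooth_track(targets, syllable_dur=0.30)  # long syllables
fast = smooth_track(targets, syllable_dur=0.10)  # short syllables

print(f"slow excursion: {excursion(slow):.2f}")
print(f"fast excursion: {excursion(fast):.2f}")
```

With the same targets and the same response time, the short syllables leave less time to reach each target, so the fast track covers a smaller pitch range and looks smoother than the slow one.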
