Prosody and Prosodic Models
ICSLP 2002, September 16, 2002, Denver, Colorado
Greg Kochanski and Chilin Shih
Prosodic models serve two purposes. On one hand, they can be scientific hypotheses that explain how we communicate with each other, and what we communicate. On the other hand, they can be engineered software systems that are part of a dialog system or speech synthesizer. To a lesser extent (and this is mostly potential), a prosodic model can serve as the basis for a system that recognizes prosody in human speech.
Intonation production is generally considered a two-step process: an accent or tone class is predicted from available information, and then the tone class is used to generate f0 as a function of time. Historically, most attention has been paid to the first, high-level step of the process.
Most TTS systems divide the task of intonation generation into two components, a linguistic modeling component and a pitch generation component (Sproat, 1998). The linguistic modeling component is carried out as part of the text analysis, where the input text stream is processed, and intonation events are deduced from the text and from high-level tags that contain non-deducible information about prosodic intent. The intonation events are then coded in abstract representations. Examples of such representations include ToBI (Silverman et al., 1992), Tilt (Taylor, 1998), and INTSINT (Hirst et al., 2000), among others. Lexical tone languages such as Chinese and Vietnamese conveniently provide some of this information from the lexicon.
The pitch generation component is the decoding process where f0 contours are generated from the linguistic representations. Traditionally, the pitch generation component is designed to support a specific abstract representation and is implemented after the representation is known. For example, given ToBI labeling, one may write a rule set to describe the f0 shapes and their pitch values (Anderson et al., 1984), or use machine learning techniques to train the target values, including linear regression models (Black et al., 1996), CART tree models (Dusterhoff et al., 1999) and dynamical system models (Ross and Ostendorf, 1999). These pitch generation models are the decoders of ToBI, and will not support concepts that are not represented in ToBI. More generally, phenomena that are not coded in the linguistic modeling component cannot receive support from the pitch generation component.
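The two-component division described above can be illustrated with a minimal sketch. It is not the rule set of any cited system. The `IntonationEvent` type, the `TARGET_RULES` table, the numeric targets, and the 100-200 Hz pitch range are all hypothetical placeholders chosen for illustration, and straight-line interpolation stands in for whatever contour shapes a real decoder would use. The labels are written in a ToBI-like style only for familiarity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntonationEvent:
    """Output of the (hypothetical) linguistic modeling component."""
    time: float   # position of the event in seconds
    label: str    # abstract accent or tone label, e.g. "H*"


# Hypothetical rule table: label -> f0 target as a fraction of the
# speaker's pitch range. These numbers are illustrative assumptions only.
TARGET_RULES = {"H*": 0.8, "L*": 0.2, "H%": 1.0, "L%": 0.0}


def generate_f0(events: List[IntonationEvent], times: List[float],
                f0_min: float = 100.0, f0_max: float = 200.0) -> List[float]:
    """Decode abstract events into an f0 value (Hz) at each sample time.

    Events whose label is missing from TARGET_RULES are silently dropped:
    the decoder cannot realize a concept the representation does not code.
    """
    known = sorted((e for e in events if e.label in TARGET_RULES),
                   key=lambda e: e.time)
    if not known:
        raise ValueError("no event carries a label the decoder understands")
    # Convert each known event into a (time, f0 target) anchor point.
    points = [(e.time, f0_min + TARGET_RULES[e.label] * (f0_max - f0_min))
              for e in known]

    contour = []
    for t in times:
        if t <= points[0][0]:        # before the first target: hold it
            contour.append(points[0][1])
        elif t >= points[-1][0]:     # after the last target: hold it
            contour.append(points[-1][1])
        else:                        # between targets: linear interpolation
            for (t0, v0), (t1, v1) in zip(points, points[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
                    contour.append(v0 + w * (v1 - v0))
                    break
    return contour
```

Calling `generate_f0` with an extra event labeled, say, `"emphasis"` returns the same contour as without it, which is the limitation noted above: a decoder built for one representation ignores anything that representation does not express.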
In this section, we review the literature in the area of intonation modeling, finding the common ground where multiple models might be interfaced to a common pitch generation component.