Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver Colorado
Chilin Shih and Greg Kochanski
The primary goal of intonation research is to model natural f0 contours of speech, preferably in relation to a transcription and a description of the prosodic intent of the speaker. The starting point of intonation research is the time series of f0. But the interpretation of the f0 information diverges widely among intonation schools. The table below represents a view of how one can classify the various intonation schools.
The shape of an accent may be fully-specified (i.e. defined without gaps) or under-specified (defined by disconnected regions or isolated points). Along another dimension, f0 values at any given time may be treated as a single component or as the combination of multiple components.
|Single component||INTSINT||ToBI, Xu||Tilt, IPO||Olive, Machine learning|
|Multiple components||-||-||-||van Santen|
INTSINT (Hirst et al., 2000) is an underspecified intonation system that defines an accent by a single point. Fitting quadratic spline curves through these points generates surface f0.
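As a rough illustration of how isolated target points can be turned into a continuous contour, the sketch below joins neighbouring targets with two parabolas that meet midway between them and have zero slope at each target. This particular construction, the function name `quadratic_spline_f0`, and the target values are assumptions chosen for illustration; they are not taken from the INTSINT papers.

```python
def quadratic_spline_f0(targets, times):
    """Interpolate f0 through (time, Hz) targets with piecewise parabolas.

    Assumption: zero slope at every target and a knot midway between
    neighbouring targets. Outside the targets the contour is held flat.
    """
    targets = sorted(targets)
    contour = []
    for t in times:
        if t <= targets[0][0]:
            contour.append(targets[0][1])
            continue
        if t >= targets[-1][0]:
            contour.append(targets[-1][1])
            continue
        for (t0, v0), (t1, v1) in zip(targets, targets[1:]):
            if t0 <= t <= t1:
                u = (t - t0) / (t1 - t0)  # position within the segment, 0..1
                if u < 0.5:
                    # first parabola: leaves the left target with zero slope
                    contour.append(v0 + 2.0 * (v1 - v0) * u * u)
                else:
                    # second parabola: arrives at the right target with zero slope
                    contour.append(v1 - 2.0 * (v1 - v0) * (1.0 - u) ** 2)
                break
    return contour


# Hypothetical targets: (time in seconds, f0 in Hz)
targets = [(0.0, 120.0), (0.4, 180.0), (1.0, 100.0)]
times = [i * 0.1 for i in range(11)]
contour = quadratic_spline_f0(targets, times)
```

The contour passes exactly through each target, and everything between targets is determined by the interpolation rule rather than by the accent description.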
The most widely used under-specified accent shape is represented by the ToBI school (Beckman and Ayers, 1997; Silverman et al., 1992), which developed from earlier work such as Pierrehumbert (1980), Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman (1988). Each accent is represented by no more than two points, which specify abstractly the relative contrast of high (H) and low (L). One goal of the ToBI system is to specify a minimal set of categorical labels for intonation. These labels are usually interpreted as phonological distinctions between accent types.
Xu et al. (1999) represent Chinese tones with under-specified, static or dynamic targets. The surface f0 contours are generated with a model that approaches these targets asymptotically within the domain of a syllable.
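The idea of approaching a target asymptotically within a syllable can be sketched as follows. The sketch assumes a simple first-order exponential decay of the distance to a linear target line (slope 0 for a static target, non-zero slope for a dynamic one). The decay form, the `rate` value, and the target numbers are illustrative assumptions, not the formulation of Xu et al.

```python
import math


def approach_target(f_start, slope, intercept, duration, rate=20.0, step=0.01):
    """f0 samples for one syllable approaching the target line slope*t + intercept.

    Assumption: the gap between f0 and the target line decays as exp(-rate*t).
    """
    n = int(round(duration / step))
    gap = f_start - intercept  # initial distance from the target line
    return [slope * (i * step) + intercept + gap * math.exp(-rate * i * step)
            for i in range(n + 1)]


def synthesize(syllables, f_start):
    """Chain syllables; each one starts where the previous one ended."""
    contour = []
    for slope, intercept, duration in syllables:
        segment = approach_target(f_start, slope, intercept, duration)
        contour.extend(segment)
        f_start = segment[-1]
    return contour


# Hypothetical targets: a static high target, then a falling dynamic target.
# Each tuple is (slope in Hz/s, intercept in Hz, duration in s).
syllables = [(0.0, 200.0, 0.2), (-150.0, 180.0, 0.2)]
contour = synthesize(syllables, f_start=150.0)
```

Because each syllable starts from the f0 value at which the previous one ended, the target may never be fully reached in a short syllable, which is one way such a model produces contextual variation.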
Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near the peak of an accent and leaves the other regions unspecified, hence its status as halfway to a fully specified system. Tilt considers all accent types to be continuous variations of a single class. Surface variations are accounted for by changes in the continuous parameters.
IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the pitch contour, and then associates the slope and height of these line segments with various types of accents.
Olive (1975) described a very early fully-specified system, following work by Levitt and Rabiner (1970). His model stored the surface pitch vs. time contour as a function of the grammatical structure of the sentence. The contour was then approximated by polynomial splines attached to words, to allow for duration variations.
Several works using machine learning techniques generate densely sampled f0 values, including Chen et al. (1992) and Malfrère et al. (1998). We classify these works as fully specified systems even though in some cases the concept of accent may not be clear.
Ross and Ostendorf (1999) described an interesting machine learning system where a discrete learning system would predict vectors attached to phonemes and syllables, and these vectors would in turn drive a (learned) dynamical system to predict f0.
Stem-ML (Kochanski and Shih) is flexible enough to cover all the cases in the table. It allows phrase curves, but does not require them. It naturally interpolates to compute prosody in between under-specified accents. It allows several accents to overlap, handling the over-specified and multiple components cases. We will use Stem-ML in Section 3 of this tutorial to build models of intonation.
The advantage of using an under-specified accent shape is that it leaves enough distance between specified accent targets for a smooth f0 transition, typically by way of interpolation. The drawback is that it ignores changes of shape between specified targets. On the other hand, a system with fully specified accents leaves little room to resolve conflicting targets. A simple concatenation of fully-specified accents will result in a pitch curve with unnatural jumps at the concatenation joints. Many systems, such as Fujisaki (1983, 1988), use filters to smooth out abrupt changes in f0. Alternatively, van Santen (1997, 2000) requires each accent to begin and end at zero to ensure smooth connections between accents.
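The concatenation problem and the two remedies can be made concrete with a small sketch. The accent shapes below are invented, and the moving-average filter is a generic stand-in for smoothing. It is not the filter used in any of the cited systems.

```python
def moving_average(values, width):
    """Smooth with a centred moving average (the window shrinks at the edges)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def max_jump(values):
    """Largest absolute difference between consecutive samples."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Two hypothetical fully specified accents (Hz), 10 samples each.
rise = [100.0 + 5.0 * i for i in range(10)]  # ends at 145 Hz
fall = [180.0 - 5.0 * i for i in range(10)]  # starts at 180 Hz

raw = rise + fall                    # plain concatenation: 35 Hz jump at the joint
smoothed = moving_average(raw, 5)    # remedy 1: filter the concatenated contour

# Remedy 2: accents defined as deviations that begin and end at zero,
# added to a (here constant) 100 Hz baseline.
bump = [0.0, 10.0, 25.0, 10.0, 0.0]
joined = [100.0 + d for d in bump + bump]
```

Here `max_jump(raw)` is dominated by the discontinuity at the joint, while the filtered contour spreads that change over several samples, and the zero-ended accents meet at the baseline with no discontinuity at all.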
Turning to the f0 dimension of the table, many intonation schools treat surface intonation contours as the superposition of a phrase component and an accent component. Grønnum (1992) and Fujisaki (1983, 1988) are representatives of this view.
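A superpositional model can be sketched in a few lines. The example below follows the general form in which Fujisaki's command-response model is usually written in the literature: phrase and accent components are added in the log-f0 domain on top of a base frequency. The parameter values (`alpha`, `beta`, `gamma`), the command timings, and the amplitudes are illustrative assumptions, not fitted values.

```python
import math


def phrase_response(t, alpha=3.0):
    """Response of the phrase component to an impulse command at t = 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Response of the accent component to a step command, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def superposed_f0(t, base_hz, phrases, accents):
    """Sum the components in the log domain and return f0 in Hz.

    phrases: list of (onset, amplitude)
    accents: list of (onset, offset, amplitude)
    """
    log_f0 = math.log(base_hz)
    log_f0 += sum(a * phrase_response(t - t0) for t0, a in phrases)
    log_f0 += sum(a * (accent_response(t - t1) - accent_response(t - t2))
                  for t1, t2, a in accents)
    return math.exp(log_f0)


# Hypothetical commands: one phrase command and one accent command.
phrases = [(0.0, 0.5)]
accents = [(0.3, 0.6, 0.4)]
contour = [superposed_f0(i * 0.01, 100.0, phrases, accents) for i in range(150)]
```

The surface contour is a single time series, but it is generated from two separately parameterized components: a slowly varying phrase curve and a local accent excursion.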
A well-defined model that fully specifies accent shape and uses multiple components is van Santen's (van Santen and Möbius, 1997, 2000; van Santen et al., 1998), where accents are represented by densely populated points, providing a mechanism to describe highly complex accent shapes in detail. We characterize van Santen's system as having multiple components, because in addition to the phrase component, each accent in the phrase also adds a phrase-length component that contributes to the surface f0 contour.
The advantage of multiple components is that they provide a mechanism to separate individual accents from long-term effects. However, if one allows multiple components, then one necessarily faces the problem that there is no unique solution in the decomposition of a single f0 time series into multiple components. Any such decomposition depends on a model of the speech process, and is only as good as the underlying model. In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a phrase curve and represent intonation contours as a single component. The advantage of representing f0 information as a single component is that the representation of accent heights will then be transparent, which lends itself to convenient automatic labeling.
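The non-uniqueness of the decomposition is easy to demonstrate numerically. In the sketch below (all values invented), part of the accent is moved into the phrase curve. The two decompositions differ, yet they sum to exactly the same surface contour, so the f0 series alone cannot choose between them.

```python
# Decomposition 1: a declining phrase curve plus a local accent (Hz).
phrase = [120.0 - 2.0 * i for i in range(10)]
accent = [0.0, 0.0, 5.0, 15.0, 20.0, 15.0, 5.0, 0.0, 0.0, 0.0]
surface = [p + a for p, a in zip(phrase, accent)]

# Decomposition 2: move half of the accent into the phrase curve.
shift = [0.5 * a for a in accent]
phrase2 = [p + s for p, s in zip(phrase, shift)]
accent2 = [a - s for a, s in zip(accent, shift)]
surface2 = [p + a for p, a in zip(phrase2, accent2)]

same_surface = surface == surface2          # identical observed f0
different_parts = phrase != phrase2         # but different components
```

Only additional modeling assumptions, such as requiring the phrase curve to be smooth or monotonically declining, would favor one decomposition over the other.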