Prosody and Prosodic Models
ICSLP 2002 - September 16, 2002, Denver Colorado
Greg Kochanski and Chilin Shih
A model of prosody is a theory of nature: it states (explicitly or implicitly) how humans communicate, and what they are communicating. It is worth thinking about the implications of a model, even if the urgencies of life sometimes force us to settle for a good engineering approximation.
Here is a (somewhat subjective) list of properties that a good prosodic model should have.
A prosodic model should predict numerical values, so that it is subject to quantitative tests. Qualitative models will ultimately have little impact on the field, because they cannot be verified or falsified. From a research point of view, it is important to have a model that bridges the gap between linguistic theories and the objective reality of a glottal oscillator with a time-varying frequency. The model needs to be general enough that it can provide a quantitative representation of many different theories of intonation, and can therefore be used to compare theories.
People speak to be heard and understood. Consequently, a theory of prosody cannot just address the production of speech: it has to consider the complete communication link between the speaker and the listener. This concept was described by Ohala (1992), who called speech a compromise between effort and communication clarity. It seems to have been first articulated by Passy (1890 or 1891) as the Functionalist Hypothesis.
Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot simply be deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them the recipient can perceive, fairly directly, the motions of the speaker. Even in intonation, pitch has a fairly smooth relationship to the underlying muscle tensions.
Because of the similarities and common mechanisms among different types of prosodic gestures, a model of one prosodic channel, such as intonation, can plausibly be extended to the others.
Another important feature of a prosodic model is that the representation should be understandable, and adjustable. While an opaque model that can compute prosody for one emotional state for one person may be valuable, one that can be adjusted and parameterized to simulate many people and many emotions is correspondingly more valuable.
Data-intensive approaches may someday be able to reproduce normal, unemotional "business" speech fairly well by collecting speech on a broad range of texts. However, the corresponding data collection for emotional speech, or for different speaking styles (enthusiastic vs. matter-of-fact), requires large databases of unusual speech, which may be hard to obtain. Even professional actors may have a hard time being (for example) uniformly sullen for several hours.
A prosodic model should cleanly separate into local (word-dependent) and global (speaker-dependent) parts. Such a separation mirrors our intuitive knowledge of language: on the one hand, we know that sentences can carry meaning that is approximately independent of the speaker, and that listeners can identify localized accents in speech. On the other hand, different speakers clearly have identifiable styles of speech.
Unless proven otherwise, one would expect that a proper description of communication should reproduce these intuitions.
Additionally, a prosodic model that separates cleanly is more flexible: it has the potential to generate multiple styles of speech by changing the speaker-dependent parts of the model while leaving the local information alone.
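The local/global separation can be illustrated with a toy sketch (this is only an illustration, not the authors' model; all function names and parameter values here are invented). An f0 contour is built additively from a speaker-dependent phrase curve and word-dependent accent bumps; changing only the global parameters re-renders the same "words" in a different voice:

```python
import math

def phrase_curve(t, f0_start=180.0, slope=-20.0):
    """Speaker-dependent global part: a simple linear declination (Hz)."""
    return f0_start + slope * t

def accent(t, center, amplitude, width=0.15):
    """Word-dependent local part: a Gaussian accent bump (Hz)."""
    return amplitude * math.exp(-((t - center) / width) ** 2)

def f0(t, accents, f0_start=180.0, slope=-20.0):
    """Additive combination of the global and local parts."""
    return phrase_curve(t, f0_start, slope) + sum(
        accent(t, c, a) for c, a in accents)

# The same word-level accents rendered by two "speakers" who differ
# only in their global parameters (higher pitch, steeper declination).
accents = [(0.3, 30.0), (0.8, 45.0)]   # (time in s, accent strength in Hz)
speaker_a = [f0(t / 100.0, accents) for t in range(100)]
speaker_b = [f0(t / 100.0, accents, f0_start=220.0, slope=-35.0)
             for t in range(100)]
```

In this sketch, generating a new speaking style means editing only `f0_start` and `slope`; the accent list, which stands in for the local, word-dependent information, is untouched.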
Any prosodic model that claims to compute prosody from unmarked text is incomplete. Why is this true? Because one can say the same sentence in many different ways, depending on the context.
One can create stories where the correct rendering of a sentence requires information from long before the target sentence, and also information about the world. For instance, imagine a story about someone trapped in the mountains. He has a dog and a cat. Snow falls. When the hero finally reaches civilization again, he says "I did not eat the dog," with a contrastive accent on "dog", implying that he ate the cat.
Now, placing that accent requires knowledge that a cat is in the story, it requires knowledge that people trapped in snowy mountains can get hungry, and it requires knowledge that cats are edible. Even the most sophisticated syntactic analysis seems unlikely to be able to compute the prosody for this sentence.
Another counter-example is the existence of different interpretations of Shakespeare's plays. All directors start from the same text, and there are many "reasonable" interpretations, each of which corresponds to a different mental state of the characters.
Humans are mechanical systems, and the muscles used to implement prosody have measurable dynamic properties. For instance, they have mass, which limits how fast they can accelerate. Likewise, the dynamics of the actin and myosin molecules in the muscle filaments limit the maximum rate of travel.
There is also a reasonable physical understanding of the oscillation of the vocal folds.
Prosodic models that violate these constraints from physics, biology, and physiology are, at best, approximations.
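One consequence of such physical constraints can be sketched in a few lines (a deliberately crude illustration, not the authors' model; the time constant and filter form are invented for the example): because muscle tension cannot change instantaneously, a stepwise pitch command must be smoothed before it can appear as an f0 contour.

```python
def smooth_targets(targets, dt=0.01, tau=0.05):
    """Pass a stepwise f0 target sequence through a first-order lag,
    a crude stand-in for muscle dynamics: tension (and hence pitch)
    cannot jump instantaneously.  dt and tau are in seconds."""
    out = []
    y = targets[0]
    for x in targets:
        y += (x - y) * (dt / tau)   # exponential approach to the target
        out.append(y)
    return out

# A commanded step from 120 Hz to 180 Hz: the physical output lags
# the command and approaches the new target gradually.
step = [120.0] * 20 + [180.0] * 30
realized = smooth_targets(step)
```

A model whose predicted contours contain discontinuities sharper than such a filter allows is, in the terms above, violating the physiology of the system that must produce them.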
For a model of prosody to be a description of a language, instead of just a scheme for efficiently coding f0 contours, we should be able to correlate the results of the fit with linguistically important features.
The demands of interactive approaches to TTS require more freedom to express prosody than current systems allow. Most current TTS systems, including the Bell Labs TTS system, were designed to operate on text with little or no mark-up information beyond the text itself. The prosody subsystem was therefore designed conservatively, because of intrinsic limits on how reliably prosodic information can be deduced from text. If some prosodic feature could not be reliably deduced, it was found better to produce a neutral prosody than the wrong one.
The next generation of TTS applications will not have this limitation, because many applications will be conducting a dialog, and will have state information corresponding to goals and intentions. The application may be "intending" to convey that a set of words is a single proper noun, that a word is especially important, or that a word needs confirmation. This state information needs to be expressed prosodically, so one should think of speech synthesis more in the context of a concept-to-speech system than a text-to-speech system. Similarly, there are applications where the simulation of emotions, subtle meanings in speech acts, and stylistic variations is desirable. This prosodic information can be supplied to the TTS system by adding mark-up tags to the text. With marked text, the TTS system does not need to deduce as much, so it need not be designed conservatively.
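The mark-up idea can be made concrete with a minimal sketch (the `<emph>` tag and the helper function here are hypothetical, invented for this example; they are not the format of any particular TTS system). The dialog application annotates the word it "intends" to stress, and the synthesizer recovers both the plain text and the emphasis information instead of having to guess it:

```python
import re

# A hypothetical, minimal tag set: <emph>word</emph> marks a word
# the dialog system intends to be especially important.
marked = 'Please confirm: the flight leaves at <emph>nine</emph>, not ten.'

def strip_tags(text):
    """Recover the plain text and the list of emphasized words."""
    emphasized = re.findall(r'<emph>(.*?)</emph>', text)
    plain = re.sub(r'</?emph>', '', text)
    return plain, emphasized

plain, emphasized = strip_tags(marked)
```

The point is that the emphasized word is supplied by the application's state, not deduced from the text, so the prosody module can act on it without the conservative defaults described above.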
A useful model should be able to predict prosody in a variety of different dialog situations.
The mark-up system is most useful if it is flexible enough to support any intonation event that a user or a future dialogue system might want to express. A pertinent question is then how to design a pitch generation system that will support linguistic models that are not yet defined.
It is thus most important that the model we define be able to represent any possible prosody.
Even though it is impossible to build a prosodic model that predicts accurately from text alone, the text and syntax clearly have some effect on the possible prosodies for an utterance. Consequently, it is important for technological applications that the model operate (as much as possible) on information that can be reliably deduced from the text.