Chinese Intonation:
Connecting Linguistics to Acoustics
Greg Kochanski (Oxford Phonetics)
Chilin Shih (University of Illinois)
Tan Lee (CUHK)
Hongyan Jing (IBM)
Jiahong Yuan (Cornell)

Is it phonetics or phonology or physiology?
How to build a mathematical model?
How *do* tone languages implement prosody?
Can we objectively assign an importance to a syllable?
How simple might English intonational phonology be?

What is the goal?
Explain intonation in a way that is:
Consistent with the most basic linguistic assumptions
Consistent with known physiology, biology, and physics.

Existing work

But F0 bumps don’t match accents…

Existing work

Existing work


Another Challenge

Basic assumptions used in modeling
People plan their utterances several syllables in advance.
People produce speech optimized to meet their needs.
A realistic model for the muscles that control f0

Speech is planned.

People talk nearly as fast as possible.

Speech could be optimal

Optimize what?
People want to minimize the chance that they will be misunderstood.
Risk = P(misinterpreted) * cost(misinterpreted)
People want to minimize effort and/or talk faster
Chairs, Cars
How to combine the two?
A weighted sum.
We allow each syllable to have a different weight
Perhaps weight matches importance.
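The weighted sum can be sketched in a few lines. This is a minimal illustration, not the authors' formulation. It assumes a sampled pitch curve, effort measured as the sum of squared first differences, and error as the squared distance from each syllable's target. The function names `effort` and `total_cost` are hypothetical.

```python
def effort(pitch):
    """Effort G (assumed measure): sum of squared first differences of the pitch curve."""
    return sum((b - a) ** 2 for a, b in zip(pitch, pitch[1:]))


def total_cost(pitch, targets):
    """Weighted sum: effort plus each target's squared error times its strength.

    targets: list of (sample_index, target_value, strength) tuples,
    one per syllable target (an illustrative layout).
    """
    weighted_error = sum(s * (pitch[i] - y) ** 2 for i, y, s in targets)
    return effort(pitch) + weighted_error


# Usage: one syllable target of 1.0 at sample 1, with strength 2.0.
targets = [(1, 1.0, 2.0)]
print(total_cost([0.0, 1.0, 0.0], targets))  # 2.0: hits the target, pays in effort
print(total_cost([0.0, 0.0, 0.0], targets))  # 2.0: no effort, pays in error
print(total_cost([0.0, 0.5, 0.0], targets))  # 1.0: the compromise is cheapest
```

With a strength of 2, going halfway to the target costs less than either hitting it exactly or ignoring it. That is the trade-off the weighted sum is meant to capture.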

Modeling math


Modeling math

Model behavior
For s ≫ 1, error (R) dominates, and the pitch matches the target.
For s ≪ 1, effort (G) dominates. Speaker and listener both accept large deviations, and the pitch interpolates smoothly between neighboring targets.
For s ≈ 1, effort and error compromise.
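The three regimes can be reproduced numerically. This is a simplified sketch under stated assumptions. Effort is taken as squared first differences only, so interpolation here is linear; a fuller model would likely also penalize curvature. There is a single target, and the endpoints are pinned to a baseline. The function names `solve_tridiagonal` and `optimal_contour` are hypothetical. Setting the gradient of G + s·R to zero gives a tridiagonal linear system, which is solved below with the Thomas algorithm.

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    """Thomas algorithm for a tridiagonal system (sub[0] and sup[-1] are ignored)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = sup[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / denom
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def optimal_contour(n_points, target_index, target_value, strength, start=0.0, end=0.0):
    """Minimize G + s*R for a single target (assumed simplified model).

    G = sum of squared first differences of the pitch curve (effort),
    R = (p[target_index] - target_value)**2 (error), s = strength.
    The endpoints p[0] = start and p[-1] = end are held fixed.
    """
    if not 1 <= target_index <= n_points - 2:
        raise ValueError("target must sit on an interior point")
    m = n_points - 2                      # free interior points p[1]..p[n-2]
    sub, diag, sup = [-1.0] * m, [2.0] * m, [-1.0] * m
    rhs = [0.0] * m
    rhs[0] += start                       # fixed neighbors move to the right-hand side
    rhs[-1] += end
    k = target_index - 1
    diag[k] += strength                   # from d/dp of s*(p - y)**2, halved
    rhs[k] += strength * target_value
    return [start] + solve_tridiagonal(sub, diag, sup, rhs) + [end]


# Usage: 11 samples, a target of 1.0 in the middle, baseline 0.0 at both ends.
for s in (0.01, 1.0, 100.0):
    peak = optimal_contour(11, 5, 1.0, s)[5]
    print(f"s = {s:>6}: peak = {peak:.3f}")
```

For this setup the peak works out to s / (s + 0.4). The 0.4 is 1/5 + 1/5, from the two five-step linear ramps. So s = 0.01, 1, and 100 give peaks of roughly 0.02, 0.71, and 0.996. Those are the interpolating, compromising, and target-matching regimes respectively.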

Q:Where did this “strength” come from?
A: What is 2 meters + 3 kilograms?
“Effort” can have energy units.
“Error” can be a pure number (error probability).
A multiplier is needed to make the units agree.
A: Strength = cost of a misinterpretation
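Writing the weighted sum out per syllable makes the units argument concrete. This is a sketch, assuming per-syllable effort G_i is measured in joules and error R_i is a dimensionless misinterpretation probability. The indexing by syllable i is illustrative.

```latex
E = \sum_i \left( G_i + s_i R_i \right), \qquad
[G_i] = \mathrm{J}, \quad [R_i] = 1
\;\Longrightarrow\; [s_i] = \mathrm{J}
```

Read this way, each term s_i R_i is a cost times a probability. That has the same shape as Risk = P(misinterpreted) * cost(misinterpreted).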

Physical implementations of prosody
Intonation (pitch) is one of the more important components of prosody.
Others include duration, loudness, and facial expressions.

Modeling math

The rest of the model.
A model is a sequence of targets.
Each target has a strength.
One target per tone.
Targets are stretched to fit syllable duration.
Only one phonological rule: 33→23 (Mandarin third-tone sandhi)
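This representation can be sketched as follows, under several assumptions:
- The symbols are Mandarin tone numbers 1 to 4.
- Made-up two-point shapes stand in for fitted per-speaker templates.
- The 33→23 rule is read simply: compare against the underlying tones and ignore prosodic grouping.

The names `TONE_TEMPLATES`, `apply_third_tone_sandhi`, and `place_targets` are hypothetical.

```python
# Hypothetical normalized shapes: (fraction of syllable duration, relative pitch).
# Real templates would be fitted per speaker or per style; these are placeholders.
TONE_TEMPLATES = {
    1: [(0.0, 1.0), (1.0, 1.0)],  # high level
    2: [(0.0, 0.5), (1.0, 1.0)],  # rising
    3: [(0.0, 0.2), (1.0, 0.0)],  # low
    4: [(0.0, 1.0), (1.0, 0.1)],  # falling
}


def apply_third_tone_sandhi(tones):
    """33 -> 23: a tone 3 followed by an underlying tone 3 surfaces as tone 2.

    Simplification: compares against the underlying sequence, so 333 -> 223,
    and ignores the prosodic grouping that can change the outcome.
    """
    return [
        2 if tone == 3 and i + 1 < len(tones) and tones[i + 1] == 3 else tone
        for i, tone in enumerate(tones)
    ]


def place_targets(tones, syllable_bounds, strengths):
    """One template per syllable, stretched to fit that syllable's (start, end) time.

    Returns (time, relative_pitch, strength) triples; every point of a
    syllable's template shares that syllable's strength.
    """
    surface = apply_third_tone_sandhi(tones)
    targets = []
    for tone, (start, end), strength in zip(surface, syllable_bounds, strengths):
        duration = end - start
        for fraction, pitch in TONE_TEMPLATES[tone]:
            targets.append((start + fraction * duration, pitch, strength))
    return targets


# Usage: "ni3 hao3" (two tone-3 syllables) surfaces as tones 2 + 3.
print(apply_third_tone_sandhi([3, 3]))  # [2, 3]
# Illustrative syllable times (seconds) and strengths.
for target in place_targets([3, 3], [(0.0, 0.2), (0.2, 0.5)], [1.0, 0.6]):
    print(target)
```

Keeping shapes in a lookup table separates the discrete symbol sequence from its realization. A different speaker or style would then swap the table without touching the symbols or strengths.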

Model fits to Mandarin Chinese

What’s the procedure?

Model fits for Mandarin Chinese

Strengths are stable under small changes in the model.

Model parameters

Model parameters

Model parameters

Metrical patterns inside words

Other nice properties

Local Conclusion
Intonation is represented as:
a sequence of symbols drawn from a small discrete set,
modulated by a variable prosodic strength,
with a per-person or per-style shape for each symbol.
One symbol per syllable seems enough
The basic mechanisms could be common across all languages.
The strength parameter seems real
Similar across languages
Matches language structure

But does it work for English?


The model for English

Model details

Model fits well over a range of speeds.

More fits: English confirming questions.

Local conclusion

Why is the model so compact?

Similar effects in other articulators?