Chinese Intonation:
Connecting Linguistics to Acoustics

Greg Kochanski (Oxford Phonetics)

Chilin Shih (University of Illinois)

Tan Lee (CUHK)

Hongyan Jing (IBM)

Jiahong Yuan (Cornell)

Questions

Is it phonetics or phonology or physiology?

How to build a mathematical model?

How *do* tone languages implement prosody?

Can we objectively assign an importance to a syllable?

How simple might English intonational phonology be?

What is the goal?

Explain intonation in a way that is:

Consistent with the most basic linguistic assumptions

Falsifiable

Reductionist

Consistent with known physiology, biology and physics.

Existing work

But F₀ bumps don’t match accents…

Existing work

The Challenge

Another Challenge

Basic assumptions used in modeling

People plan their utterances several syllables in advance.

People produce speech optimized to meet their needs.

A realistic model for the muscles that control f₀

Speech is planned.

People talk nearly as fast as possible.

Speech could be optimal

Optimize what?

People want to minimize the chance that they will be misunderstood.

Risk = P(misinterpreted) * cost(misinterpreted)

People want to minimize effort and/or talk faster

Chairs, Cars

How to combine the two?

A weighted sum.

We allow each syllable to have a different weight

Perhaps weight matches importance.
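The weighted sum described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the single scalar effort value, and the example numbers are hypothetical and do not come from the slides.

```python
def risk(p_misinterpreted, cost_misinterpreted):
    """Expected cost of being misunderstood: probability times cost."""
    return p_misinterpreted * cost_misinterpreted


def total_cost(effort, error_probs, weights):
    """Effort plus a per-syllable weighted sum of error terms.

    effort      -- one non-negative number for the whole utterance (assumed)
    error_probs -- one misinterpretation probability per syllable
    weights     -- one weight per syllable (the cost of misinterpreting it)
    """
    if len(error_probs) != len(weights):
        raise ValueError("need exactly one weight per syllable")
    return effort + sum(risk(p, w) for p, w in zip(error_probs, weights))


# Hypothetical three-syllable utterance in which the middle syllable
# is ten times more important than its neighbours.
print(total_cost(effort=1.0,
                 error_probs=[0.1, 0.1, 0.1],
                 weights=[1.0, 10.0, 1.0]))
```

With equal error probabilities, the heavily weighted syllable contributes most of the risk, so a speaker minimizing this sum would spend effort there first.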

Modeling math

“Effort”

Modeling math

Model behavior

For s>>1, Error (R) dominates, and pitch matches target.

For s<<1, Effort (G) dominates, both speaker and listener accept large deviations, and pitch smoothly interpolates.

For s~1, everything compromises.
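The three regimes can be reproduced with a toy optimizer. The sketch below assumes quadratic forms that the slides do not show: effort G is taken as the sum of squared pitch changes between neighbouring samples, and error R as the squared deviation from the target, weighted by s. The function name `fit_pitch` and the sample values are hypothetical.

```python
def fit_pitch(targets, strengths):
    """Minimize  sum_t (p[t+1] - p[t])**2  +  sum_t s[t] * (p[t] - y[t])**2.

    Setting the gradient to zero gives a tridiagonal linear system,
    solved here with the Thomas algorithm (no external libraries).
    """
    n = len(targets)
    if n < 2 or len(strengths) != n:
        raise ValueError("need at least two samples and one strength per sample")
    # Diagonal: strength plus the number of neighbours (1 at the ends, 2 inside).
    diag = [s + (1.0 if t in (0, n - 1) else 2.0) for t, s in enumerate(strengths)]
    rhs = [s * y for s, y in zip(strengths, targets)]
    # Forward elimination; both off-diagonals are -1.
    for t in range(1, n):
        diag[t] -= 1.0 / diag[t - 1]
        rhs[t] += rhs[t - 1] / diag[t - 1]
    # Back substitution.
    pitch = [0.0] * n
    pitch[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        pitch[t] = (rhs[t] + pitch[t + 1]) / diag[t]
    return pitch


# Two hypothetical syllables, five samples each: a high target, then a low one.
targets = [200.0] * 5 + [100.0] * 5
for s in (1000.0, 1.0, 0.001):
    contour = fit_pitch(targets, [s] * len(targets))
    print(s, [round(p, 1) for p in contour])
```

Giving the two syllables different strengths (for example, strong then weak) makes the strong syllable hold its target while the weak one is pulled smoothly toward it.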

Q: Where did this “strength” come from?

A: What is 2 meters + 3 kilograms?

“Effort” can have energy units.

“Error” can be a pure number (error probability).

A multiplier is needed to make the units agree.

A: Strength = cost of a misinterpretation
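The units argument can be written out explicitly. The form below is a sketch rather than the equation from the original slides: it assumes effort G is measured in joules, each per-syllable error R_i is a dimensionless probability, and the total E and the index i are notation introduced here.

```latex
E = G + \sum_i s_i R_i ,
\qquad [G] = \mathrm{J}, \quad [R_i] = 1
\quad\Longrightarrow\quad [s_i] = \mathrm{J}
```

Under these assumptions s_i carries the units of effort, which is what lets it be read as the cost of misinterpreting syllable i.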

Physical implementations of prosody

Intonation (pitch) is one of the more important components of prosody.

Also duration, loudness, facial expressions.

Modeling math

The rest of the model.

A model is a sequence of targets.

Each target has a strength.

One target per tone.

Targets are stretched to fit syllable duration.

Only one phonological rule: 33→23
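The bullets above suggest a simple data layout, sketched here. The tone shapes, the function names, and the relative pitch scale are hypothetical placeholders (in the model the shapes would be fitted per speaker or style), and the sandhi rule is applied naively left to right, which is a simplification for runs of more than two third tones.

```python
# Hypothetical target shapes on a 0..1 relative pitch scale, one per tone.
TONE_SHAPES = {
    1: [0.8, 0.8, 0.8],   # high level
    2: [0.4, 0.5, 0.8],   # rising
    3: [0.3, 0.1, 0.2],   # low
    4: [0.9, 0.5, 0.1],   # falling
}


def apply_sandhi(tones):
    """The single phonological rule: 33 -> 23, scanned left to right."""
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out


def stretch(shape, n_samples):
    """Linearly resample a target shape to fit a syllable's duration."""
    if n_samples == 1:
        return [shape[0]]
    out = []
    for k in range(n_samples):
        x = k * (len(shape) - 1) / (n_samples - 1)
        lo = int(x)
        hi = min(lo + 1, len(shape) - 1)
        frac = x - lo
        out.append(shape[lo] * (1 - frac) + shape[hi] * frac)
    return out


def build_targets(tones, durations, strengths):
    """One target per tone; returns per-sample targets and strengths."""
    targets, weights = [], []
    for tone, dur, s in zip(apply_sandhi(tones), durations, strengths):
        targets.extend(stretch(TONE_SHAPES[tone], dur))
        weights.extend([s] * dur)
    return targets, weights


# Hypothetical two-syllable word with two third tones and unequal strengths.
targets, weights = build_targets([3, 3], durations=[4, 6], strengths=[0.5, 2.0])
print(targets)
print(weights)
```

The two returned lists are sample-aligned, so they can be handed directly to whatever optimizer trades effort against error.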

Model fits to Mandarin Chinese

What’s the procedure?

Model fits for Mandarin Chinese

Strengths are stable under small changes in the model.

Model parameters

Metrical patterns inside words

Other nice properties

Local conclusion

Intonation is represented as:

a small set of discrete symbols, in sequence,

modulated by a variable prosodic strength, with

a per-person or per-style shape for each symbol

One symbol per syllable seems enough

The basic mechanisms could be common across all languages.

The strength parameter seems real

Similar across languages

Matches language structure

But does it work for English?

English

The model for English

Model details

Model fits well over a range of speeds.

More fits - English confirming questions.

Local conclusion

Why is the model so compact?

Conclusion


