Language has many interesting,
apparently complicated effects.
Linguists typically explain these by discrete rules on discrete
objects, such as “a /stop t/ should be transformed to a /flap t/ when in the
middle of an unstressed syllable”.
This is phonology – operations on discrete entities. Phonetics is the physical implementation
of these discrete entities, and is often considered relatively
unimportant. I’ll show in this talk
that with a physically reasonable model, the phonetics (i.e. the strategy that
the brain uses to control the articulators) can explain much that is normally
considered to be phonology.
I want to attach linguistics to the
reductionist stack of disciplines that forms modern science.
Linguistics is simply an example of rather
complex animal behavior, which must eventually be explained by
neurobiology. Neurobiology is, of
course, explained by cell biology, and so forth.
We’d like this explanation to preserve the basic linguistic assumptions, such
as that language is expressible as an ordered string of discrete entities
(such as syllables or words), which might have continuous properties attached
(like relative importance).
The Tilt model is a way of
parameterizing local stretches of an F0 curve. The idea is that you parameterize the bumps and steps, which
hopefully correspond to accents, and then interpolate in between.
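For concreteness, here is a minimal sketch of the Tilt parameterization of a single F0 event, following Taylor's formulation; the amplitude and duration values below are made up for illustration:

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Tilt parameters for one intonational event (Taylor's Tilt model).

    a_rise, a_fall: F0 excursions (Hz) of the rising and falling parts;
    d_rise, d_fall: their durations (s)."""
    amp = abs(a_rise) + abs(a_fall)        # total F0 excursion
    dur = d_rise + d_fall                  # total event duration
    tilt_amp = (abs(a_rise) - abs(a_fall)) / amp
    tilt_dur = (d_rise - d_fall) / dur
    tilt = 0.5 * (tilt_amp + tilt_dur)     # +1 = pure rise, -1 = pure fall
    return amp, dur, tilt

# A symmetric rise-fall accent comes out with tilt = 0.
print(tilt_parameters(30.0, 30.0, 0.12, 0.12))  # (60.0, 0.24, 0.0)
```

The point of the parameterization is that these few numbers, plus an interpolation rule, are supposed to summarize the local F0 shape around each accent.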
The problem is that the locations of
accents (i.e. the positions of prominent syllables) aren’t associated with any
particular local shape of F0. These
are data from the IViE corpus (Grabe and colleagues, Oxford, circa 2002): local polynomial fits to
prominent syllables (red) and to non-prominent places (randomly chosen between
the prominent syllables). As can be
seen, the prominent and non-prominent syllables have essentially the same F0, slope,
and curvature. Thus, a Tilt-model
analysis of speech will not be able to find prominent syllables by looking for
a characteristic local F0 shape.
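The local-shape comparison behind these plots can be sketched as follows: fit a quadratic to F0 in a short window around a syllable and read off the value, slope, and curvature at the window centre. The contour and window width here are synthetic assumptions, not the IViE data:

```python
import numpy as np

def local_f0_shape(t, f0, center, halfwidth=0.1):
    """Fit a quadratic to F0 in a window around `center` (seconds) and
    return (value, slope, curvature) at the window centre -- the three
    local-shape features compared across syllables above."""
    m = (t > center - halfwidth) & (t < center + halfwidth)
    c2, c1, c0 = np.polyfit(t[m] - center, f0[m], deg=2)
    return c0, c1, 2.0 * c2   # f, f', f'' at t = center

# Synthetic contour: slow declination plus one small bump at t = 0.5 s.
t = np.linspace(0.0, 1.0, 200)
f0 = 200.0 - 20.0 * t + 5.0 * np.exp(-((t - 0.5) / 0.05) ** 2)
print(local_f0_shape(t, f0, 0.2))  # roughly (196, -20, ~0): no local bump here
```

Comparing such triples for prominent vs. non-prominent positions is exactly the comparison that comes out looking the same in the corpus data.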
The Fujisaki model is interesting in
that it was the first model that attempted to explain intonation in
terms of biology. Unfortunately, it
has no clean explanation for rising contours (such as Mandarin tone 2), and
the accent commands that one derives from speech do not have a clean
connection with syllables or accents or words or stressed syllables. The basic problem with the Fujisaki model
is that the muscle dynamics it uses are too simple. Fujisaki assumed that muscles can be modeled as a damped
harmonic oscillator with a constant response: constant stiffness, mass,
and damping. Real muscles can adjust
their stiffness to the task at hand, and the brain makes use of this
ability. Think of an arm swinging
loosely at your side vs. an arm held stiff to stop a ball: it’s common, normal
behavior, but it’s not in Fujisaki’s math.
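To make the fixed-dynamics point concrete: Fujisaki's phrase component is the impulse response of a critically damped second-order (constant-stiffness) system, so its entire time course is fixed once by a single constant α and cannot adapt the way a stiffness-adjusting muscle can. The α value below is illustrative:

```python
import numpy as np

def fujisaki_phrase_response(t, alpha=3.0):
    """Impulse response of Fujisaki's phrase-control filter: a critically
    damped second-order system, G_p(t) = alpha^2 * t * exp(-alpha * t).
    alpha is *fixed* -- the model's stiffness and damping cannot change
    mid-utterance, which is the limitation discussed above."""
    return np.where(t >= 0.0, alpha**2 * t * np.exp(-alpha * t), 0.0)

t = np.linspace(0.0, 3.0, 301)
g = fujisaki_phrase_response(t)
# The peak always falls at t = 1/alpha, regardless of what the speaker needs.
print(round(t[np.argmax(g)], 2))  # 0.33
```

A model of real muscle control would instead let the effective α vary with the task, loose for a swinging arm, stiff for stopping a ball.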
It’s hard to take the learning out of a
machine learning system and make it understandable to a human. It’s also often hard to port the system
from one problem to another, even to a closely related problem.
How do we explain the data (red stars)
from a sequence of templates (green stars)?
There are huge changes in syllables 2 and 4, though neither of those syllables
bothers a native listener. In
syllable 2, the tone 4 is realized as something like a tone 2, but (in
context) listeners identify it as a tone 4.
The tone on the last syllable is drastically pushed down, yet is again
correctly perceived in context.
Obviously, intonation is not a simple concatenation of templates.
It is equally true that intonation is not the result of a simple smoothing operation
as assumed by Fujisaki. The amount
of smoothing that would be needed to invert the second syllable would be very
large and would oversmooth many other sections.
Likewise, the lowering of the final tone is not consistent
with a simple smoothing operation.
The clue to what is going on lies at the boundary between the first and
second syllables: the pitch should instantly hop up from the end of tone 3 to the beginning of
tone 1, but that’s physically impossible – the muscles can’t respond
that fast.
Another clue is that the
second syllable is weak, based on other evidence (phoneme substitution, …).
It looks like the brain is
deliberately sacrificing the second syllable so that its neighbors can be
produced accurately. In the second syllable, we’re quite
obviously getting ready for the upcoming tone 4. This is one of the things the Fujisaki model can’t predict.
Almost all speech is made of bits and
pieces of things you’ve said before.
We have the opportunity to optimize much of our speech as we practice
words, tones, and tone combinations.
Note especially that there are very few tone combinations in Mandarin
(64 tri-tones), so it’s quite possible to practice all the combinations in a
short time, and it’s reasonable to assume that they get practiced every
day. Thus, they could easily be
perfected, matching whatever criteria the speaker wishes to optimize.
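The arithmetic behind the "64 tri-tones" figure is just 4³, small enough to enumerate directly:

```python
from itertools import product

tones = [1, 2, 3, 4]                        # Mandarin's four lexical tones
tri_tones = list(product(tones, repeat=3))  # every 3-syllable tone sequence
print(len(tri_tones))  # 64 -- few enough that all of them get practiced daily
```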
Planned in advance, so you can adjust
for upcoming events – you can prepare for the next syllable without violating
the requirements of the current one.
People speak almost as fast as
possible. The upper graphs are the
signal and pitch contour from a maximum effort pitch warble: the speaker was
asked to warble as fast as possible.
She managed about one cycle per 220 ms. The lower graph shows (on the same time
scale) normal Mandarin speech, at a conversational speed. Here we see a pitch cycle in 400 ms. This is quite normal: most of the time,
while speaking, we run our muscles at 50% or more of their maximum speed. Since we’re pushing the muscles and the attached
neural control systems close to their limits, the dynamics of the nerves and
muscles will be important.
Plenty of opportunity to practice…
That’s the Bayesian risk. Some syllables you will really want to
execute correctly, because they are important to the conversation. Others hardly matter. You want to minimize the risk that the
conversation as a whole will be misinterpreted, which means that you execute
the important syllables correctly, even at the cost of messing up the
unimportant ones.
The effort term, G, increases as the
pitch curve gets wigglier. More
curvature (in either direction), steeper slopes (either up or down), and
larger deviations from the resting position all increase G. The curves are sorted in order of increasing G.
The yellow box shows an
approximation: the R term is a squared
deviation from the ideal target. The
more complex form shown in the first math slide allows us to split that
deviation into a shape term (wrong shape) and an offset term (too high/too
low). Those two terms can have
different weights. S ≫ 1 for important syllables,
S ≪ 1 for unimportant syllables.
Other than being part of the Bayesian
risk, you also need a scale factor just to make the units of G and R agree.
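Putting the pieces together, here is a minimal numerical sketch of this effort-vs-accuracy trade-off (not the actual Stem-ML implementation; the targets, strengths, and weights are invented for illustration). Since both G and R are quadratic in the sampled pitch curve, the optimal curve solves a linear system:

```python
import numpy as np

def plan_pitch(targets, strengths, n_per_syll=20, w_slope=5.0, w_curv=50.0):
    """Minimise  sum_i S_i * ||p - target_i||^2          (accuracy term R)
               + w_slope*||diff(p)||^2 + w_curv*||diff2(p)||^2  (effort term G)
    over a sampled pitch curve p, given a flat target and a strength S_i
    per syllable.  The objective is quadratic, so the optimum is the
    solution of a linear system."""
    n = n_per_syll * len(targets)
    tgt = np.repeat(targets, n_per_syll)
    s = np.repeat(strengths, n_per_syll)
    d1 = np.diff(np.eye(n), axis=0)         # first-difference operator
    d2 = np.diff(np.eye(n), n=2, axis=0)    # second-difference operator
    A = np.diag(s) + w_slope * d1.T @ d1 + w_curv * d2.T @ d2
    return np.linalg.solve(A, s * tgt)

# Strong first and third syllables, weak second: the planned curve hits the
# outer targets and lets the weak middle syllable drift toward its neighbours.
p = plan_pitch(targets=np.array([120.0, 200.0, 110.0]),
               strengths=np.array([10.0, 0.1, 10.0]))
print(p[10], p[30], p[50])  # middle sample of each syllable
```

This reproduces, in miniature, the behavior seen in the Mandarin example: a weak syllable's target is sacrificed so that its important neighbours come out accurately.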
More details of this model can be found
in G. P. Kochanski, C. Shih, and H. Jing, “A quantitative measurement of prosodic
strength in Mandarin,” Speech Communication 41(4), 625–645 (2003), or http://authors.elsevier.com/sd/article/S0167639303001006.
We used one phonological rule from Dr.
Shih’s thesis. While we got good results
using it, and believe it to be necessary, we haven’t formally tested it. We see no need for other phonological
rules like 22->21. They seem to be
describable by our phonetics (i.e. the math).
This is the fit to an entire
sentence. Fit=red. Data=black.
The middle red box is what we have talked about
so far; prosodic strengths are its output. You go around the loop, adjusting the
model’s parameters until it agrees well with the data.
The first tone 1 is particularly
weak; compare the other instances of tone
1 in the figure.
In fact, we fit 15 slightly different
models to the data, and all the pairs look more or less like this. The red curve shows that for most
syllables, there is a simple relationship between the strengths derived from
the two different models.
Some of the model parameters are the
prosodic strengths of syllables and words.
These strengths make linguistic sense, and they show the structure of the
utterances. Strengths are uniformly
higher at the beginning of things, and lower at the end, in two different
languages. The consistency across languages
shows the reality of the strengths.
Again, we get consistent patterns of
strengths that depend on the part of speech.
Nouns are high.
Longer words tend to be stronger.
And these metrical patterns show that
the strong-at-beginning, weak-at-end pattern continues even inside words.
These patterns may help with lexical acquisition by marking word
boundaries.
This experiment was presented at ICPhS
2003, Barcelona Spain. The idea is that
someone is trying to confirm a single uncertain digit in a credit card or phone
number, and they mark that digit prosodically. Can we model it? Can we
build a model with very simple phonology if we do the phonetics in a plausible
way?
X, A, and B are three accent types.
This shows the pattern of strengths
across the utterance. Log(strength) is
the sum of three terms (utterance decline, phrase decline, and local effects around the
focus); the resulting strengths are then compressed after the focus, so the
strength has a smaller range.
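A hedged sketch of that decomposition (the slopes, focus boost, phrase length, and compression factor below are invented for illustration, not the fitted values):

```python
import numpy as np

def strength_pattern(n_syll, focus, utt_slope=-0.05, phr_slope=-0.1,
                     focus_boost=0.6, post_focus_squeeze=0.5):
    """log(strength) = utterance-level decline + phrase-level decline
    + a local boost at the focused syllable; strengths after the focus
    are then compressed toward their mean, shrinking their range."""
    i = np.arange(n_syll)
    log_s = utt_slope * i + phr_slope * (i % 4)   # assume 4-syllable phrases
    log_s[focus] += focus_boost                   # local effect at the focus
    post = i > focus
    mean_post = log_s[post].mean()
    log_s[post] = mean_post + post_focus_squeeze * (log_s[post] - mean_post)
    return np.exp(log_s)

s = strength_pattern(8, focus=3)
print(s.argmax())  # 3: the focused syllable carries the largest strength
```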
Details, details, details. See Chilin Shih and Greg Kochanski,
“Modeling Intonation: Asking for Confirmation in English”, Proceedings of the International
Congress of Phonetic Sciences (ICPhS03), August 2003, Barcelona, Spain.
Note that you see what could easily be
interpreted as a phonological rule that suppresses phrasing (top), or a
phonological rule that merges the final accent with the boundary tone. However, these are strictly phonetic
effects in this model, and can be explained by a reduced strength before the
focus (top) or the simple inability to make two separate gestures in a short
time interval (bottom).
Again, we see an apparent phonological
rule that lengthens a phrase from 4 to 5 syllables. However, it is produced by a phonetic model that has no phonological
rules.
Stan Freberg was a comedian, and this recording is
quite funny. It’s called “John and Marsha”,
and it tells the story of a romance prosodically, using only the words “John” and
“Marsha”.
By building some knowledge of physiology
and muscle dynamics into a machine learning system, we can make it extrapolate
farther from each training point.
That leads to better accuracy and/or it allows the model to be simpler
and trained with less data.
Puddle: given vs. new. Note the more complete stop in the ‘new’
condition.