Language has many interesting, apparently complicated effects. Linguists typically explain these by discrete rules on discrete objects, such as “a /stop t/ should be transformed to a /flap t/ when in the middle of an unstressed syllable”. This is phonology: operations on discrete entities. Phonetics is the physical implementation of these discrete entities, and is often considered relatively unimportant. I’ll show in this talk that, with a physically reasonable model, the phonetics (i.e. the strategy that the brain uses to control the articulators) can explain much that is normally considered to be phonology.
I want to attach linguistics to the reductionist stack of disciplines that forms modern science.   Linguistics is simply an example of rather complex animal behavior, which must eventually be explained by neurobiology.   Neurobiology is, of course, explained by cell biology, and so forth.
We’d like this explanation to preserve the basic linguistic assumptions, such as that language is expressible as an ordered string of discrete entities (such as syllables or words), which might have continuous properties attached (like relative importance).
The Tilt model is a way of parameterizing local stretches of an F0 curve. The idea is that you parameterize the bumps and steps, which hopefully correspond to accents, and then interpolate in between.
The problem is that the locations of accents (i.e. the positions of prominent syllables) aren’t associated with any particular local shape of F0. These are data from the IViE corpus (Grabe and company, Oxford, circa 2002). They are local fits of polynomials to prominent syllables (red) and to non-prominent places (randomly chosen between the prominent syllables). As can be seen, the prominent and non-prominent places have essentially the same F0, slope, and curvature. Thus, a Tilt-model analysis of speech will not be able to find prominent syllables by looking for bumps.
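As a rough sketch of what such a local fit measures: fit a low-order polynomial to the F0 samples around a chosen time, then read off the value, slope, and curvature at that time. The quadratic order, the function names `fit_quadratic` and `local_shape`, and the synthetic F0 samples are all assumptions made for illustration; the slide does not say how the IViE fits were actually done.

```python
def fit_quadratic(times, f0):
    """Least-squares fit f0 ~ c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Normal equations (A^T A) c = A^T y for the basis 1, t, t^2.
    s = [sum(t ** k for t in times) for k in range(5)]            # power sums of t
    b = [sum(y * t ** k for t, y in zip(times, f0)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [b[i]] for i in range(3)]  # augmented matrix
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * p for a, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def local_shape(times, f0, center):
    """F0 value, slope, and curvature at `center` from a local quadratic fit."""
    shifted = [t - center for t in times]       # put the origin at the syllable
    c0, c1, c2 = fit_quadratic(shifted, f0)
    return c0, c1, 2.0 * c2                     # curvature = second derivative


# Synthetic example (made-up numbers): five F0 samples around t = 0.30 s.
times = [0.20, 0.25, 0.30, 0.35, 0.40]
f0 = [200.0 + 50.0 * (t - 0.30) - 400.0 * (t - 0.30) ** 2 for t in times]
value, slope, curvature = local_shape(times, f0, center=0.30)
```

Running the same measurement at prominent and at randomly chosen non-prominent places, and comparing the three numbers, is the kind of comparison the slide reports.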
The Fujisaki model is interesting, in that it was the first model of intonation that attempted to explain it in terms of biology. Unfortunately, it has no clean explanation for rising contours (such as Mandarin tone 2), and the accent commands that one derives from speech do not have a clean connection with syllables, accents, words, or stressed syllables. The basic problem with the Fujisaki model is that its muscle dynamics are too simple. Fujisaki assumed that muscles can be modeled as a damped harmonic oscillator with constant parameters: constant stiffness, mass and damping. Real muscles can adjust their stiffness to the task at hand, and the brain makes use of this ability. Think of an arm swinging loosely at your side vs. an arm held stiff to stop a ball: it’s common, normal behavior, but not in Fujisaki’s math.
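To see why the stiffness matters, here is a minimal sketch of a unit-mass, critically damped spring being pulled toward a new target. This is a toy illustration, not Fujisaki’s actual equations: the function name `step_response`, the stiffness values, and the 200 ms window are assumptions chosen only to show the contrast between a loose and a stiff system.

```python
import math


def step_response(stiffness, duration, dt=0.001):
    """Position after `duration` seconds of a unit-mass, critically damped
    spring that starts at rest at 0 and is pulled toward a target of 1."""
    damping = 2.0 * math.sqrt(stiffness)      # critical damping for mass = 1
    x, v = 0.0, 0.0
    for _ in range(int(round(duration / dt))):
        a = -stiffness * (x - 1.0) - damping * v
        v += a * dt                           # semi-implicit Euler step
        x += v * dt
    return x


# Same target, same time budget, two (hypothetical) stiffness settings.
loose = step_response(stiffness=100.0, duration=0.2)
stiff = step_response(stiffness=1600.0, duration=0.2)
```

With the low stiffness the system has covered only a bit more than half the distance to the target after 200 ms, while the stiff one has essentially arrived. A model with a single fixed stiffness has to pick one of these behaviors for every syllable; a speaker does not.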
It’s hard to take the learning out of a machine learning system and make it understandable to a human.   It’s also often hard to port the system from one problem to another, even to a closely related problem.
How do we explain the data (red stars) from a sequence of templates (green stars)?  There are huge changes in syllables 2 and 4, though neither of those syllables bothers a native listener.   In syllable 2, the tone 4 is realized as something like a tone 2, but (in context), listeners identify it as a tone 4.  The tone on the last syllable is drastically pushed down, yet is again correctly perceived in context.  Obviously, intonation is not a simple concatenation of templates.
It is equally true that intonation is not the result of a simple smoothing operation as assumed by Fujisaki.   The strength of a smooth that would be needed to invert the second syllable would be very large and would oversmooth many other sections.    Likewise, the lowering of the final tone is not consistent with a simple smoothing operation.
The clue to what is going on lies at the boundary between the first and second syllables. Linguistically, the pitch should instantly hop up from the end of tone 3 to the beginning of tone 1, but that’s physically impossible: the muscles can’t respond instantly. Another clue is that the second syllable is weak, based on other evidence (phoneme substitution, etc.). It looks like the brain is deliberately sacrificing the second syllable so that its neighbors can be executed properly.
In the second syllable, we’re quite obviously getting ready for the upcoming tone 4. This is one of the things the Fujisaki model can’t predict.
Almost all speech is made of bits and pieces of things you’ve said before. We have the opportunity to optimize much of our speech as we practice words, tones, and tone combinations. Note especially that there are very few tone combinations in Mandarin (four tones in each of three positions gives 4 × 4 × 4 = 64 tri-tones), so it’s quite possible to practice all the combinations in a short time, and it’s reasonable to assume that they get practiced every day. Thus, they could easily be perfected, matching whatever criteria the speaker wishes to optimize.
Planned in advance, so you can adjust for upcoming events – you can prepare for the next syllable without violating causality.
People speak almost as fast as possible. The upper graphs are the signal and pitch contour from a maximum-effort pitch warble: the speaker was asked to warble as fast as possible. She managed about one cycle per 220 ms. The lower graph shows (on the same time scale) normal Mandarin speech at a conversational speed. Here we see a pitch cycle in 400 ms, so this speech runs at 220/400 = 55% of the maximum rate. This is quite normal: most of the time, while speaking, we run our muscles at 50% or more of their maximum speed. Since we’re pushing the muscles and the attached neural control systems close to their limits, the dynamics of the nerves and muscles will be important.
Plenty of opportunity to practice…
That’s the Bayesian risk. Some syllables you will really want to execute correctly, because they are important to the conversation. Others hardly matter. You want to minimize the risk that the conversation as a whole will be misinterpreted, which means that you execute the important syllables correctly, even at the cost of messing up the unimportant ones.
The effort term, G, increases as the pitch curve gets wigglier.   More curvature (in either direction), steeper slopes (either up or down), and larger deviations from the resting position all increase G.   The curves are sorted into order of increasing G.
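A minimal sketch of an effort term with those three ingredients follows. The exact form of G and its weights are on the math slide and in the paper, not in these notes, so the function `effort`, the equal default weights, and the test curves are assumptions for illustration only.

```python
import math


def effort(pitch, dt=0.01, w_dev=1.0, w_slope=1.0, w_curv=1.0):
    """Toy effort G for a pitch curve sampled every `dt` seconds.
    `pitch` is measured relative to the resting position (rest = 0)."""
    dev = sum(p * p for p in pitch)                               # deviation from rest
    slope = sum(((b - a) / dt) ** 2
                for a, b in zip(pitch, pitch[1:]))                # up or down
    curv = sum(((c - 2 * b + a) / dt ** 2) ** 2
               for a, b, c in zip(pitch, pitch[1:], pitch[2:]))   # either direction
    return dt * (w_dev * dev + w_slope * slope + w_curv * curv)


# Four made-up curves, then sorted into order of increasing G.
curves = {
    "wiggly": [math.sin(2 * math.pi * 3 * i / 50) for i in range(50)],
    "flat": [0.0] * 50,
    "gentle": [math.sin(2 * math.pi * i / 50) for i in range(50)],
    "offset": [0.5] * 50,
}
ranked = sorted(curves, key=lambda name: effort(curves[name]))
```

Because every term is squared, rises and falls cost the same, and so do upward and downward curvature, which matches the description above.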
The yellow box shows an approximation:  The R term is a squared deviation from the ideal target.   The more complex form shown in the first math slide allows us to split that deviation into a shape term (wrong shape) and an offset term (too high/too low).   Those two terms can have different weights.
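One way to picture the shape/offset split is sketched below. It assumes the offset term is the squared difference of the means and the shape term is the mean squared deviation after both curves have had their means removed; that decomposition, and the names `split_error` and `weighted_error`, are illustrative assumptions and may differ from the form on the math slide.

```python
def split_error(pitch, target):
    """Split the mean squared deviation into (shape, offset) parts."""
    n = len(pitch)
    mean_p = sum(pitch) / n
    mean_y = sum(target) / n
    offset = (mean_p - mean_y) ** 2                       # too high / too low
    shape = sum(((p - mean_p) - (y - mean_y)) ** 2        # wrong shape
                for p, y in zip(pitch, target)) / n
    return shape, offset


def weighted_error(pitch, target, w_shape=1.0, w_offset=1.0):
    """Error R with separate (hypothetical) weights for the two parts."""
    shape, offset = split_error(pitch, target)
    return w_shape * shape + w_offset * offset


target = [0.0, 1.0, 2.0, 3.0]          # a rising target
too_high = [0.5, 1.5, 2.5, 3.5]        # right shape, wrong height
falling = [3.0, 2.0, 1.0, 0.0]         # right height, wrong shape
shape_a, offset_a = split_error(too_high, target)
shape_b, offset_b = split_error(falling, target)
```

With equal weights the two parts add up to the plain mean squared deviation, which is the approximation in the yellow box; unequal weights let the model care more about one kind of error than the other.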
s >> 1 for important syllables, s << 1 for unimportant syllables.
Besides being part of the Bayesian risk, you also need a scale factor just to make the units of G and R agree.
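Putting the pieces together, here is a deliberately simplified sketch of the trade-off between effort and strength-weighted error. It assumes a quadratic cost, keeps only the slope and deviation parts of the effort (no curvature), and weights each sample’s error by its syllable’s strength. The function `realize` and the parameters `smooth` and `rest` (which also play the role of the scale factor between G and R) are hypothetical; the full model is in the paper cited in these notes.

```python
def realize(targets, strengths, smooth=5.0, rest=0.01):
    """Pitch curve p minimizing
         sum_t strengths[t] * (p[t] - targets[t])**2   # weighted error R
       + smooth * sum_t (p[t+1] - p[t])**2             # slope part of G
       + rest   * sum_t p[t]**2                        # deviation part of G
    Setting the gradient to zero gives a tridiagonal linear system,
    solved here with the Thomas algorithm."""
    n = len(targets)
    diag = [strengths[t] + rest + smooth * ((t > 0) + (t < n - 1))
            for t in range(n)]
    rhs = [strengths[t] * targets[t] for t in range(n)]
    off = -smooth                                  # constant off-diagonal
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for t in range(1, n):                          # forward sweep
        denom = diag[t] - off * c[t - 1]
        c[t] = off / denom
        d[t] = (rhs[t] - off * d[t - 1]) / denom
    p = [0.0] * n
    p[-1] = d[-1]
    for t in range(n - 2, -1, -1):                 # back substitution
        p[t] = d[t] - c[t] * p[t + 1]
    return p


def mean_middle(p):
    """Average pitch over the middle syllable (samples 10..19)."""
    return sum(p[10:20]) / 10


# Three syllables of 10 samples each: high, low, high (made-up targets).
targets = [1.0] * 10 + [-1.0] * 10 + [1.0] * 10
all_strong = realize(targets, [10.0] * 30)
weak_middle = realize(targets, [10.0] * 10 + [0.01] * 10 + [10.0] * 10)
```

When all three syllables are strong, the middle one reaches its low target but the edges of its neighbors are pulled off theirs. When the middle syllable is weak, it gives up its own target almost entirely and follows its neighbors, which are then executed cleanly. That is the same kind of sacrifice described for the weak second syllable in the Mandarin example.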
More details of this model can be found in G. P. Kochanski, C. Shih, H. Jing, A Quantitative measurement of prosodic strength in Mandarin, 2003 Speech Communication 41(4) 625-645, or http://authors.elsevier.com/sd/article/S0167639303001006.
We used one phonological rule from Dr. Shih’s thesis. While we got good results using it, and believe it to be necessary, we haven’t formally tested it. We see no need for other phonological rules like 22->21. Those effects seem to be describable by our phonetics here (i.e. the math).
This is the fit to an entire sentence.  Fit=red.   Data=black.
Middle red box is what we talked about so far.    Prosodic strengths are the output.  You go around the loop, adjusting the model’s parameters until it agrees well with the data.
The first tone 1 is particularly weak. Compare the four instances of tone 1 in the figure.
In fact, we fit 15 slightly different models to the data, and all the pairs look more or less like this. The red curve shows that, for most syllables, there is a simple relationship between the strengths derived from the two different models.
Some of the model parameters are the prosodic strengths of syllables and words.   These strengths make linguistic sense, and they show the structure of the utterances.   Strengths are uniformly higher at the beginning of things, and lower at the end, in two different languages.     The consistency across languages shows the reality of the strengths.
Again, we get consistent patterns of strengths that depend on the part of speech.  Nouns are high.
Longer words tend to be stronger.
And, these metrical patterns show that the strong-at-beginning, weak-at-end pattern continues even inside words.
These patterns may help with lexical acquisition by marking word boundaries.
This experiment was presented at ICPhS 2003, Barcelona, Spain. The idea is that someone is trying to confirm a single uncertain digit in a credit card or phone number, and they mark that digit prosodically. Can we model it? Can we build a model with very simple phonology if we do the phonetics in a plausible manner?
X, A, B are three accent types.
This shows the pattern of strengths across the utterance.  Log(strength) is the sum of 3 terms (utterance decline, phrase decline, local effects around focus), then the resulting strengths are compressed after the focus, so the strength has a smaller range.
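The recipe above can be sketched as follows. Only the structure (three additive terms in log strength, then post-focus compression) comes from the notes; the function `syllable_strengths`, every parameter value, the choice to dip only the syllable just before the focus, and compressing toward the post-focus mean are assumptions for illustration.

```python
import math


def syllable_strengths(n_syllables, phrase_starts, focus,
                       utt_slope=-0.05, phrase_slope=-0.10,
                       focus_boost=1.0, pre_focus_dip=-0.3,
                       compression=0.3):
    """Return one strength per syllable (all parameter values are made up).
    `phrase_starts` lists the first syllable of each phrase and must include 0."""
    log_s = []
    for i in range(n_syllables):
        start = max(s for s in phrase_starts if s <= i)  # current phrase start
        utterance = utt_slope * i                # decline across the utterance
        phrase = phrase_slope * (i - start)      # decline within the phrase
        if i == focus:
            local = focus_boost                  # the focused syllable
        elif i == focus - 1:
            local = pre_focus_dip                # reduced strength before focus
        else:
            local = 0.0
        log_s.append(utterance + phrase + local)
    # Compress the post-focus values toward their mean, shrinking their range.
    post = log_s[focus + 1:]
    if post:
        pivot = sum(post) / len(post)
        log_s[focus + 1:] = [pivot + compression * (v - pivot) for v in post]
    return [math.exp(v) for v in log_s]


# Ten syllables in two phrases, focus on syllable 3 (hypothetical utterance).
plain = syllable_strengths(10, [0, 5], focus=3, compression=1.0)
squeezed = syllable_strengths(10, [0, 5], focus=3)
```

Setting `compression=1.0` switches the compression off, so comparing `plain` with `squeezed` shows that only the post-focus syllables change and that their range of log strength shrinks by the compression factor.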
Details, details, details. See Chilin Shih and Greg Kochanski, “Modeling Intonation: Asking for Confirmation in English”, Proceedings of the International Congress of Phonetic Sciences, ICPhS03, August 2003, Barcelona, Spain. http://prosodies.org/papers/2003/ICPhS03.pdf .
Note that you see what could easily be interpreted as a phonological rule that suppresses phrasing (top), or a phonological rule that merges the final accent with the boundary tone.  However, these are strictly phonetic effects in this model, and can be explained by a reduced strength before the focus (top) or the simple inability to make two separate gestures in a short time interval (bottom).
Again, we see an apparent phonological rule that lengthens a phrase from 4 to 5 syllables.  However, it is produced by a phonetic model that has no phonological complexities.
Stan Freberg was a comedian, and it’s quite funny.   It’s called “John and Marsha”, and tells the story of a romance prosodically, using only the words “John” and “Marsha”.
By building some knowledge of physiology and muscle dynamics into a machine learning system, we can make it extrapolate farther from each training point.   That leads to better accuracy and/or it allows the model to be simpler and trained with less data.
Puddle: given vs. new.  Note the more complete stop in the ‘new’ (top) version.