Greg Kochanski

Is a speech synthesizer the same as a text-to-speech (TTS) system?

H2 Question:

H2 Answer:

No. It may seem a pedantic distinction, but a speech synthesizer is anything that produces speech, while a text-to-speech system produces it from input text.

H2 Question:

So what?

H2 Answer:

They can be very different systems, because a TTS system needs to interpret the text (in the artistic, prosodic sense), while a synthesizer may have the interpretation provided for it. Interpreting the text can be the hardest part of the problem.

H3 Interpreting Shakespeare

For instance, think about performing Shakespeare. The text has been the same for the last several hundred years, but every production is different. Hamlet can be mad, or coldly pretending to be mad; Polonius can be doddering or foolish or evil. Each of these (and other) choices gives a different production, and each leads to different choices in the way the actors say their lines. The choices yield different prosodies: different patterns of emphasized words, different durations and pauses, different patterns of loud and soft words. Different intonation (patterns of pitch) will be applied to Shakespeare's classic lines.

It is not easy to produce a good interpretation of a play, so there is little hope that a machine learning system will be able to do the task in the forseeable future. Further, there are many possible good interpretations, and each one means something different, leaving the audience with a different view of the characters and the action. How should a machine learning system choose one?

H3 A More Concrete Example

Consider a single sentence: "I did not eat the dog." You can say it several ways, such as

STRONG I did not eat the dog. (Meaning: Somebody else ate the dog.)
I did STRONG not eat the dog. (Meaning: Don't suggest that I did.)
I did not STRONG eat the dog. (Meaning: I cooked it and fed it to my companions.)
I did not eat the STRONG dog. (Meaning: I ate the cat, actually.)

Now, choosing one or another of those meanings may be very difficult, because it may require a lot of knowledge about the world, and because it may require collecting information from widely separated parts of the text. For instance, suppose Our Hero has two faithful animal companions, one that purrs and one that barks. Many pages later, he/she is trapped in a cave in a snowstorm. When rescued, he/she is asked "How did you stay alive for so long?", and answers "I did not eat the STRONG dog ," with a lifted eyebrow.

To choose that particular prosody, you need to know quite a few facts about the world:

that people eat,
that it's hard to find food if you're trapped in a cave,
that dogs and cats are edible in extreme circumstances,
that eating pets is not exactly socially sanctioned so one might want to be a bit coy about admitting it,
that dogs and cats often stay with their people, and
that dogs bark and cats purr.

You also need to have connected the dogs and cats to the hero early in the story, and made a large number of deductive connections between all these facts. Overall, it's a very complex prediction which is well beyond the state of the art of artificial intelligence systems.

So, choosing an interpretation of a text can be very hard.

Concept-to-Speech systems

If the speech synthesizer is part of a larger application, like a airline reservation system, the application sets the interpretation. There is no need to try to deduce the correct interpretation from the text.

So, for instance, the system should never produce " STRONG You're going from San Francisco to Nome, Alaska?" Most likely, it would either put the emphasis on "San Francisco" or "Nome". It might emphasize the one that recieved a poor score from the speech recognizer, on the grounds that a mistake might have been made, and if the mistake is emphasized, the human user is more likely to notice it and correct it.

This is a form of a "concept-to-speech system", where the system has internal information that allows it to choose the prosody in a sensible manner. Sometimes, the system has the information needed to make the choice of prosody simple and obvious.

If the designers made the mistake of producing text, and then feeding the text into a text-to-speech system, all that information would be lost.

[ Papers | kochanski.org | Phonetics Lab | Oxford ]

Last Modified Thu Oct 13 17:27:28 2005

Greg Kochanski: [ Home ]