One of the things that happens every now and then is that someone does something neat, but then it gets blown out of all proportion. Sometimes, that may be because the scientists aren’t careful to put their work into context. A case in point came to my attention today (via slashdot).
The paper is “Uncovering Spoken Phrases in Encrypted Voice over IP Conversations” by Charles Wright (MIT), Lucas Ballard (Google), Scott Coull (UNC Chapel Hill), Fabian Monrose (UNC Chapel Hill), and Gerald Masson (Johns Hopkins). It is doi:10.1145/1880022.1880029 and an earlier version can be found here.
Let me say up front that I believe every word of the paper. It’s neat, it’s clever. It explains how (under some conditions) you can recognize what people are saying, even in an encrypted conversation, because different sounds get encoded at different bit rates. For instance, vowels are encoded at a higher bit rate, and fricatives are just noise and are therefore encoded at a low bit rate. So, by watching how many bits flow by each fraction of a second, you can gain some information about what is being said.
Thus, bit-rate patterns of “fall” and “laugh” will be slow-fast-fast and fast-fast-slow, respectively, so you can tell them apart. And there are actually four different bit rates that one commonly sees in encrypted telephony, so there is even more information than that. It’s a beautiful piece of cryptanalysis that finds an information leak in a communication system that we thought was secure.
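To make the “fall” versus “laugh” example concrete, here is a toy sketch in Python. It is only an illustration of the idea, not the paper’s method: the two-level fast/slow mapping, the sound-class labels, and the hand-written breakdown of each word are all my own assumptions (in particular, treating “l” as a fast sound is inferred from the patterns quoted above).

```python
# Toy illustration: each coarse sound class is assumed to map to one rate level.
RATE_BY_CLASS = {
    "vowel": "fast",       # vowels are encoded at a higher bit rate
    "liquid": "fast",      # assumption: "l" behaves like a voiced, vowel-like sound
    "fricative": "slow",   # fricatives are noise-like and get a low bit rate
}

# Hypothetical, hand-written sound-class breakdowns of the two example words.
WORD_CLASSES = {
    "fall": ["fricative", "vowel", "liquid"],    # f - a - ll
    "laugh": ["liquid", "vowel", "fricative"],   # l - au - gh (pronounced "f")
}


def rate_pattern(word):
    """Return the fast/slow pattern an eavesdropper would see for a word."""
    return "-".join(RATE_BY_CLASS[c] for c in WORD_CLASSES[word])


for word in WORD_CLASSES:
    print(word, rate_pattern(word))
```

The two words contain the same kinds of sounds in opposite order, so their patterns are mirror images. That is what makes them distinguishable without ever decrypting a packet.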
But is it a way to listen in on someone’s conversation? Certainly, it’s not going to be very effective.
Conventional speech recognition systems work fairly well if they are tuned to the individual. For instance, dictation systems do a pretty good job of transcribing what you are saying, as long as you speak carefully and use a headset microphone in quiet surroundings.
Alternatively, “speaker-independent” speech recognition systems work fairly well if you are making railroad reservations. These systems let anyone speak over their own phone, but there are only a limited number of places that a railroad train can go. In the USA, for instance, there are 925 Amtrak stations (thanks). When the computer asks “Where would you like to go?” it will pick from the list of active stations, so if you say anything close to “Albany”, you’ll end up in New York state. Try “Balbany”, “Albunny”, “Albanix”, “Aldani”, and a few hundred others, and the system will probably treat most of them as if you said “Albany”.
Speaker-specific systems are trained precisely to your voice and your microphone. Speaker-independent systems are designed to allow a lot of slop in the recognition, because different people say things differently, and they say them over different telephones. As a result, such a system cannot discriminate between sentences that are acoustically close together. If you’re designing a system that uses a speaker-independent recognizer, you want to allow only a few answers to every question, so the answers will be acoustically very different from each other. Like “Zanesville” and “Albany”, for instance: not much chance of confusion there.
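The closed-vocabulary behaviour described above can be sketched in a few lines of Python. This is a rough analogy only: it uses spelling similarity (the standard library’s `difflib.get_close_matches`) as a stand-in for acoustic similarity, and the two-entry station list and the `recognize` helper are hypothetical names made up for the illustration.

```python
import difflib

# Hypothetical, tiny "list of active stations"; a real system would have hundreds.
STATIONS = ["Albany", "Zanesville"]


def recognize(heard, stations=STATIONS, cutoff=0.6):
    """Snap whatever was 'heard' to the closest allowed answer, if any is close."""
    by_lower = {name.lower(): name for name in stations}
    matches = difflib.get_close_matches(
        heard.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


for attempt in ["Balbany", "Albunny", "Albanix", "Aldani"]:
    print(attempt, "->", recognize(attempt))
```

Because the allowed answers are few and far apart, a sloppy match is good enough. The same slop is what stops such a recognizer from telling apart two inputs that are close together.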
So, what do the authors say?
They say that they train the system to recognize 122 sentences using speech from one group of people. (They let the system watch the bit rate across an encrypted connection.) They then test the system on another group of people, and it has a 50-50 chance of picking the right sentence from among the 122 choices. People on slashdot got all excited about this (as one does) and said things like this:
So it’s back to Wind Talkers? Making up some new language every few years (months?), or using some dead one, to keep communication secret? Seems like a lot of work.
…as if this system would actually let someone listen in on your encrypted communication.
The reality is more pedestrian. If your conversation consisted only of the 122 sentences from the TIMIT corpus, then the system would indeed do so. (At least half the time.) But, those 122 sentences don’t give you a lot of room to express yourself. Here are some samples:
She had your dark suit in greasy wash water all year.
Don’t ask me to carry an oily rag like that.
A boring novel is a superb sleeping pill.
Call an ambulance for medical assistance.
We saw eight tiny icicles below our roof.
All of them are about 11 syllables long. The trouble is that there are a lot of 11-syllable sentences in English. Roughly speaking, there are about 1000 English syllables (there are more actually, but I’m ignoring the rare ones). So, if you could combine 11 syllables in all possible combinations, you’d get a billion trillion trillion possibilities. In reality, the number is much smaller because you cannot just stick syllables together to make a word, and you cannot just arbitrarily stick words together to make a sentence. But, even so, there are probably about 1000 trillion sentences of that length that make some sense.
That’s a big number, and from that big number, you can only use the 122 sentences that are in the TIMIT corpus. The number of possible sentences corresponds to a million Olympic-sized swimming pools filled with pennies, from which you get to use a handful of pennies. Or, maybe we bury Manhattan 10 meters deep in pennies (up over the third floor) and you get to use a dollar’s worth.
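For anyone who wants to check the arithmetic, here is a back-of-the-envelope version in Python. The syllable count and the “1000 trillion” figure are the rough estimates from above. The penny dimensions (19.05 mm across, 1.52 mm thick) and a Manhattan land area of roughly 59 square kilometers are my own inputs, and I ignore the gaps between coins, so treat the result as an order-of-magnitude sanity check and nothing more.

```python
import math

# Counting 11-syllable strings.
syllable_inventory = 1_000        # rough count of common English syllables
sentence_length = 11              # syllables per sentence
unconstrained = syllable_inventory ** sentence_length
# a billion (10**9) times a trillion (10**12) times a trillion (10**12)
assert unconstrained == 10**9 * 10**12 * 10**12

plausible_sentences = 1_000 * 10**12   # "about 1000 trillion", a rough estimate
corpus_sentences = 122
coverage = corpus_sentences / plausible_sentences
print(f"fraction of plausible sentences covered: {coverage:.2e}")

# Burying Manhattan 10 meters deep in pennies (assumed inputs, no packing gaps).
penny_volume_cm3 = math.pi * (1.905 / 2) ** 2 * 0.152
manhattan_area_m2 = 59e6               # assumption: about 59 square kilometers
depth_m = 10
pile_volume_cm3 = manhattan_area_m2 * depth_m * 1e6   # 1 cubic meter = 1e6 cubic cm
pennies = pile_volume_cm3 / penny_volume_cm3
print(f"pennies in the pile: {pennies:.1e}")
```

Both numbers land around 10 to the 15th power, so the penny picture is the right scale: 122 sentences out of that pile really is about a dollar’s worth.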
So, what’s the poor system going to do if you read a headline like “Author of novels leaves suburbs, keeping mill”? It’s going to pick the closest utterance from the 122 choices it has been taught. (Hint: it’s the one about the boring novel.) In other words, if you speak your mind instead of reading from TIMIT, the system has a remarkably small chance of capturing what you are saying. If it were really a system for super-spies, it would have to pick the right sentence from amongst trillions of possibilities, rather than from amongst only 122 choices.
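The “pick the closest of the choices it knows” behaviour is easy to see in a toy closed-set classifier. To be clear, this is not the paper’s recognizer: the three stored sequences, the four rate levels written as 0 to 3, and the crude mismatch-counting distance are all invented for the illustration.

```python
# Hypothetical stored bit-rate sequences (levels 0..3), one per known sentence.
KNOWN = {
    "sentence A": [3, 3, 1, 0, 2, 3],
    "sentence B": [0, 1, 3, 3, 2, 0],
    "sentence C": [2, 2, 2, 1, 0, 0],
}


def distance(a, b):
    """Crude distance: position-wise mismatches plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def classify(observed):
    """Always returns one of the known sentences, however poor the match is."""
    return min(KNOWN, key=lambda name: distance(KNOWN[name], observed))


print(classify([0, 1, 3, 2, 2, 1]))   # a slightly noisy version of sentence B
print(classify([1, 1, 1, 1]))         # something never seen in training
```

The second call still comes back with a confident-looking answer, because a closed-set classifier has no way to say “none of the above”. Feed it something outside its 122 sentences and it will simply name the nearest one.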
This all makes sense, because a conventional speech recognition system works off the full rich set of audio data, rather than this relatively scanty four-level complexity measurement, and even a full speech recognition system working on a full set of audio data could just barely do what some of the slashdotters seem worried about. Less data going into the recognizer means less information comes out the other side.
So, Wright et al. should have been a bit more careful to point out that their system’s surprisingly good performance really only applied to the TIMIT corpus, and not to a more realistic use case.