He studies too much for words of four syllables
-Jane Austen, Pride and Prejudice
“No”
“the”
“of”
“to”
“and”
“a”
“Yes”
“How”
“LOL”
“F**K”
What do all of these words have in common? They are all among the 500 most frequently used words in the English language (ok, two of them aren’t, but I bet that on the internet they are). Notice something about these words? They are all really SHORT. Most of the big 500 are; the majority are only one syllable. This is not surprising: the most frequently used words in many languages are short. It’s pretty obvious when you think about it, and it applies to most languages (ok, maybe not German, where the word for “speed limit” is “Geschwindigkeitsbegrenzung”. They’ve got other words of similar length. A mouthful? Sure. But super fun to say!).
When you know that short words are used more frequently, the next logical step is to hypothesize that word length is DETERMINED by how frequently a word is used. If you’re going to use a word to mean you agree with something, and you’re going to use it a lot, it doesn’t make much sense for that word to be 15 letters long, especially when you can say “yes”, “oui”, “ja”, or “si” (I’d include an example from a non-phonetic alphabet too, but I don’t think wordpress can do that in text…). The hypothesis that word length is determined by how frequently the word is used in a language was laid out by a guy named Zipf, who observed that the length of a word is inversely correlated with how often it’s used.
The idea behind this is straight up efficiency. Information is conveyed most efficiently when it’s conveyed in the shortest way possible. That means short words.
This hypothesis has stood for the past 75 years or so, but there have been some problems. And now, there’s a new hypothesis: what if the length of words is correlated with their INFORMATION content?
Piantadosi, Tily, Gibson. “Word lengths are optimized for efficient communication” PNAS, 2011.
Zipf’s idea, that the frequency of word usage determines how long the word is, does work in some respects. But the problem is that the frequency of word usage depends heavily on context. The words you choose on the internet may be very different from the words you choose in a business meeting, where longer words may be used with great frequency (especially compared to “LOL”, which many people may argue is not a word, but hey, it’s in the OED). So if context changes word choice, but you still want to keep the maximal efficiency of information communication, what determines whether words are short or long?
The scientists behind today’s paper hypothesize that the INFORMATION conveyed by a word determines its length. This differs from Zipf in that it doesn’t depend on frequency. If Zipf’s hypothesis were true, the more we used a word (like “supercalifragilisticexpialidocious”), the shorter it would become, as frequency of use would determine its eventual length. But under the hypothesis that information content determines length, the more information a word conveys, the longer it’s allowed to be. This means you can keep your long words long if the information they convey is essential.
The other big way this differs from Zipf is the rate of information communication. The hypothesis that information content determines word length keeps the rate of information flow during conversation as constant as possible. For example, if you use the word “a”, you’re conveying relatively little information, and it’s also a very fast word to say. If you use the word “criminal”, on the other hand, you convey more information, but it takes more time to do so. The hypothesis described in this paper not only keeps word length dependent on information, it also keeps the flow of information relatively constant. Long words with lots of information take a long time, and short words with little information take a short time. The net result is that the flow of information remains pretty constant over time as you use long or short words.
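To make the constant-rate idea concrete, here’s a toy sketch in Python. The bit values below are completely made up for illustration (they are not the paper’s numbers); the point is just that if length tracks information content, the bits-per-character rate stays flat:

```python
# Toy illustration of a constant information rate: if a word's length
# tracks its information content, bits per character stays roughly constant.
# The bit values here are hypothetical, chosen purely to illustrate the idea.
words = {"a": 2.0, "of": 4.0, "criminal": 16.0}  # word -> bits (made up)

for word, bits in words.items():
    print(f"{word!r}: {bits / len(word):.1f} bits per character")
```

Under these made-up numbers every word comes out to the same bits-per-character rate, which is the intuition behind keeping information flow constant.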
So the question now is: does the hypothesis WORK? Does the information contained in a word really predict its length?
Edited to add: A note on the methods. Replicated from my comment below:
They say they used “an unsmoothed N-gram model trained on data from Google”. So basically they took frequently used strings of 2, 3, or 4 words (N-grams, with N = 2, 3, or 4) from the Google datasets for each language. They compared them to the OPUS corpus (I looked it up, and it’s an archive of open source documents) to take out nonsense words. They used the most frequent words in the dataset because information content can be estimated reliably only for high-frequency words (so even the words they worked with that were used less frequently were still used more frequently than many other words). They then used a mathematical model to estimate the amount of information contained in each word.
Here’s the mathematical model:
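(The equation itself was an image in the original post and didn’t survive here. As best I can reconstruct it from the description below and the paper, the average information content of a word w is something like:)

```latex
% My reconstruction, not a verbatim copy from the paper:
% the surprisal of word w, averaged over the contexts c it occurs in.
I(w) = -\frac{1}{P(W = w)} \sum_{c} P(C = c,\, W = w)\, \log P(W = w \mid C = c)
```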
Where C is context, W is word, and the joint distribution of a word in a context is P(C, W). This corresponds to the expected information content of a random word in a large trove of words. At least, that’s how I understand it.
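This isn’t the authors’ actual pipeline, but here’s a toy Python sketch of the general idea as I understand it: train an unsmoothed bigram (N = 2) model on a tiny stand-in corpus, then average each word’s surprisal, -log2 P(word | context), over the contexts it occurs in. The corpus is made up; only the method is from the paper.

```python
import math
from collections import Counter

# Tiny stand-in for the Google N-gram data (made-up text).
corpus = "the cat sat on the mat and the cat saw the criminal on the mat".split()

# Unsmoothed bigram model: P(w | prev) estimated from raw counts.
context_counts = Counter(corpus[:-1])
pair_counts = Counter(zip(corpus[:-1], corpus[1:]))

def avg_information(word):
    """Average surprisal -log2 P(word | context) over the word's occurrences,
    an estimate of its average information content in bits."""
    surprisals = [
        -math.log2(pair_counts[(prev, w)] / context_counts[prev])
        for (prev, w) in zip(corpus[:-1], corpus[1:])
        if w == word
    ]
    return sum(surprisals) / len(surprisals)

# A frequent, predictable word like "the" should carry fewer bits
# per occurrence than a rarer, less predictable one like "criminal".
print(avg_information("the"), avg_information("criminal"))
```

In this toy corpus “the” is completely predictable from its contexts (0 bits), while “criminal” is not, which is the pattern the paper’s estimate is designed to capture at scale.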
Looks pretty good so far. Here you can see a correlation between the length of words in the English language and the amount of information the words are thought to convey, and the correlation is very nicely in favor of longer words conveying more information.
Sure that works fine for English, but what about other languages?
Here you can see correlation values for information value and word length (the solid bars), and frequency of usage and word length (the hashed bars) for 11 different languages. The n=2, n=3, and n=4 refer to the size of the N-gram model, that is, how many words of context were used to estimate information content. What you can see overall is that there is a correlation between information value and word length in all of these languages (the lowest correlation here overall appears to be Romanian, and I’m not sure why that is). There is a much lower correlation between frequency of usage and word length (except in Italian). But when they compared the two correlations to each OTHER, they found that the frequency of word usage is also correlated with information content. When you need to convey a lot of information frequently, you’re going to use the long words.
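To show what a bar like that boils down to, here’s a from-scratch correlation between word length and estimated information, in Python. The bit values are invented, not the paper’s data, and this is a plain Pearson correlation, not necessarily the exact statistic the authors used:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed without external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical average information content (bits) for a few English words --
# illustrative numbers only, not estimates from the paper.
info = {"a": 3.0, "the": 3.5, "dog": 7.0, "criminal": 11.0, "understanding": 14.0}

lengths = [len(w) for w in info]
bits = list(info.values())
print(pearson(lengths, bits))  # strongly positive: longer words, more bits
```

A strongly positive value here is the toy analogue of the tall solid bars in the figure.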
So far it looks pretty good: information value and word length are well correlated. The only place where this breaks down is in words that mean relatively little and are used extremely frequently. The top 5-20% most frequently used words have little correlation between word length and information content, meaning that some short words convey a lot of information, while long words may contain less.
They conclude that the information content of words correlates with their length, and inversely correlates with their frequency of usage. But this DOES break down for the most frequently used words, and so Zipf’s hypothesis is not entirely incorrect. I imagine that the REAL thing that determines word length is probably the informative value (promoting longer words for more information) modulated by the frequency of use (frequent use promotes shorter words for more efficiency). So it is possible to have high information and short words IF the words are used extremely frequently. With words that are used less often, there’s less pressure to shorten them up, and thus their length correlates more with their informative value. The net result: the most efficient use of language, taking into account frequency, informative value, and word length. The use of language will become less efficient as you use words that are less frequent. So for maximal communication? Small words.
But I’m still waiting for an explanation of “Geschwindigkeitsbegrenzung”.
Piantadosi ST, Tily H, & Gibson E (2011). From the Cover: Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences of the United States of America, 108 (9), 3526-9 PMID: 21278332