How many words could English have?


How many words are there in English? A popular method for counting words in a language is to use the total number in a dictionary (or some other corpus). But counting words this way doesn’t tell us anything very interesting about “English”, because not every speaker of English knows every word in the dictionary.

For example, you might know the word beech refers to a tree, but not know how to identify one in the woods. You might know that some governments are jingoistic, but not know which ones or why. Maybe you confuse yams and sweet potatoes all the time.

So does beech count as “part of English” if not every speaker of English knows it, uses it, or understands it? The words you know depend on things like education, job, dialect, how much you read, where you grew up, your hobbies, how much you’ve traveled and so on.

You can’t really say that “English” has X words. At best, you might be able to say something like “an average Canadian with an undergraduate education knows X words”. And that’s not terribly interesting.

So instead of asking how many actual words there are in use, let’s ask how many possible words there could be. What’s a possible word? Whether something counts as an English word isn’t just a matter of people using it to mean something. The word also has to conform to certain sound patterns. Have a look at these made-up words:

(1) blick, thritch, gakt
(2) *bnick, *thkich, *gatk

The words in (1) are “possible” words because they conform to sound patterns found in other existing English words. For example, the “thr” in “thritch” also appears in “three” or “throw”, and the “itch” part appears in “pitch” or “ditch”.

The words in (1) could be given meanings and put into circulation; they just haven’t been (so far as I know, and even if they have, that doesn’t really detract from the point). Recognizing that blick is a word, even though it doesn’t have a meaning, is a bit like recognizing that beech is a word even if you don’t know what it really refers to.

The made-up words in (2), on the other hand, are not possible words of English. They contain sequences of sounds which do not exist in other words. There are no words of English that start with the sounds [bn], for example, which is why *bnick sounds so odd.

These patterns of sound sequencing are known as “phonotactic rules”. Every language has them. Phonotactic rules describe which sounds can appear next to each other: phono means ‘sound’ and tactic comes from the Greek for ‘arrangement’ (cf. ‘syntax’).
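As a rough illustration, a phonotactic check on the start of a word can be sketched as a simple lookup. The sketch below works on spelling rather than on sounds, and the `ALLOWED_ONSETS` set is a tiny hand-picked sample, not a real inventory of English; both are assumptions made to keep the example short.

```python
# Toy phonotactic check on word-initial consonant clusters.
# ASSUMPTION: this works on spelling, not IPA, and ALLOWED_ONSETS is a small
# hand-picked sample rather than the full set of English onsets.
VOWEL_LETTERS = set("aeiou")
ALLOWED_ONSETS = {"", "b", "g", "th", "bl", "thr"}


def onset_of(word: str) -> str:
    """Return the run of letters before the first vowel letter."""
    for i, ch in enumerate(word):
        if ch in VOWEL_LETTERS:
            return word[:i]
    return word  # no vowel letter at all


def has_legal_onset(word: str) -> bool:
    """True if the word's onset is in the toy ALLOWED_ONSETS sample."""
    return onset_of(word.lower()) in ALLOWED_ONSETS


for w in ["blick", "thritch", "gakt", "bnick", "thkich"]:
    print(w, has_legal_onset(w))
```

This only looks at onsets, so a word like *gatk would slip through: its problem is at the end of the syllable, and catching it would need a matching check on codas.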

The existence of phonotactic rules means there is a finite number of possible syllables in English, indeed in every language, since there is a finite number of sounds and not every combination of sounds is possible. And since words are made up of syllables, we can get a rough count of the number of possible English words by first figuring out how many monosyllables are possible, and then multiplying.

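So how do you count syllables? Syllables are normally represented like this:

[Figure: syllable tree for “seeks” /siːks/, with onset s, nucleus iː and coda ks; the nucleus and coda are grouped under the rhyme]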

That’s the verb “seeks” in IPA transcription. Syllables contain at minimum a nucleus, which is generally a vowel. Consonants that come before that vowel are called the onset. Consonants that come after that vowel are called the coda. The nucleus and coda are grouped into a sub-unit called a Rhyme, but it doesn’t really matter why right now.

The number of syllables in a language reduces to the number of things that can go in each slot in the syllable tree. Multiply the number of onsets by the number of nuclei by the number of codas and you get the total number of monosyllables.
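To see why multiplying the slot sizes works, here is a minimal sketch that enumerates every syllable for a miniature inventory and compares the result with the product. The inventory itself is invented for this example and is not a claim about any real language.

```python
from itertools import product

# ASSUMPTION: a made-up miniature inventory, for illustration only.
onsets = ["", "s", "k", "sk"]   # "" stands for "no onset"
nuclei = ["i", "a"]
codas = ["", "t", "ks"]         # "" stands for "no coda"

# Build every onset + nucleus + coda combination.
syllables = ["".join(parts) for parts in product(onsets, nuclei, codas)]

# Counting by enumeration agrees with multiplying the slot sizes.
count_by_listing = len(syllables)
count_by_product = len(onsets) * len(nuclei) * len(codas)
print(count_by_listing, count_by_product)
print(syllables[:6])
```

Both counts come out the same because every choice of onset can combine with every choice of nucleus and coda. Real phonotactics is messier than that, since some combinations across slots are also banned, which is one more reason to treat the totals as rough.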

For English, the possible onsets and codas as listed in Wikipedia will do for this thought experiment. There are 20 possible onsets consisting of a single consonant, plus 59 complex onsets, which makes 79. And if we include the fact that a word can have no onset at all (e.g. ice, out, over), then there are 80 possible onsets.

Let’s say there are 12 vowels that can act as a nucleus. This varies a lot by dialect, so treat it as a made-up number. Finally, Wikipedia gives 18 simple codas + 77 complex codas + no coda = 96 codas. Given these very rough numbers, that makes 80*12*96 = 92,160 possible monosyllables.

To find out how many bisyllabic words there are, you just multiply that number by itself. Then add the two numbers together to find out how many one- and two-syllable words are possible. And so on. If S is the number of possible monosyllables, the general formula for how many possible words there are up to N syllables is:

S + S^2 + S^3 + ... + S^N
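The running total described here (monosyllables, plus two-syllable words, plus three-syllable words, and so on) can be sketched in a few lines of Python. The function name `possible_words` and its parameter names are made up for this example; Python integers are exact at any size, so there is no rounding along the way.

```python
def possible_words(monosyllables: int, max_syllables: int) -> int:
    """Add up S + S**2 + ... + S**N with exact integer arithmetic."""
    return sum(monosyllables ** n for n in range(1, max_syllables + 1))


# Rough English counts from the text: onsets * nuclei * codas.
english_monosyllables = 80 * 12 * 96

print(english_monosyllables)                     # one-syllable words only
print(possible_words(english_monosyllables, 1))  # same number, via the function
print(possible_words(english_monosyllables, 2))  # one- and two-syllable words
```

Even at two syllables the total is already in the billions, which gives a feel for how quickly the count grows with each extra syllable.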

So how big should N be? In theory, there’s no limit to how many syllables a word can have. In real life there are limits on how long things can be because of memory limits, the need to breathe, and so on. But consider a word like supercalifragilisticexpialidocious. That’s 14 syllables long. That’s way longer than just about any word you would normally use in conversation, it doesn’t really mean anything, and it still sounds totally natural, so our phonotactic intuitions scale up to some pretty big sizes.

And just for fun, let’s say that 14 syllables is the longest a word can be in English. Using the formula above with 92,160 possible monosyllables, you can find there are a massive 3.19 x 10^69 possible words in English (rounded). Treat this number with extreme caution. I’ve been pretty loose about counting possible onsets and codas, and some of the numbers change with dialect. I also didn’t include any effects of stress.
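To get a feel for how much caution is needed, here is a small sketch that reruns the calculation with a few different vowel counts. It uses the closed form of a geometric series instead of a loop. The alternative vowel counts (10 and 14) and the function name are assumptions picked for illustration, not figures from any particular dialect.

```python
def possible_words(s: int, n: int) -> int:
    """Closed form of s + s**2 + ... + s**n (a geometric series), for s > 1."""
    return s * (s ** n - 1) // (s - 1)


ONSETS, CODAS, MAX_SYLLABLES = 80, 96, 14  # rough counts used in the text

for nuclei in (10, 12, 14):  # ASSUMPTION: alternative vowel counts to try
    total = possible_words(ONSETS * nuclei * CODAS, MAX_SYLLABLES)
    print(f"{nuclei} vowels: about {total:.3e} possible words")
```

With 12 vowels the total comes to roughly 3.19 x 10^69, and with 14 vowels it is roughly 2.76 x 10^70. Two extra vowels change the answer by almost an order of magnitude, so the exact figure should not be taken too seriously.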

Now you can technically use this method to compare the “size” of different languages, since each has different phonotactic rules. Let’s take Hawaiian, for example, which has one of the smallest sound inventories in the world. This language has only 8 consonants which can serve as onsets, and there are no complex onsets. So including the fact that there can be no onset, that makes for 9. We’ll also give the most generous count for vowels and say there are 25 of them, which means 25 nuclei. Hawaiian syllables never have a coda, which means there is only one possibility there. (By the way, this is pretty normal cross-linguistically. Lots of languages ban codas, especially in the Polynesian family that Hawaiian belongs to. It’s also common for languages to ban anything except nasal consonants in coda.)

The math is a lot simpler than for English: 9*25*1 = 225 possible monosyllables. And using the formula above, there are a measly 8.56032 x 10^32 possible words (rounded) up to 14 syllables long. That is smaller than English, but certainly more than enough to have a word for just about anything you’d need to talk about day to day. Also, Hawaiian words can be very long, so 14 syllables might not be as unusual as in English, and there’s probably a little more homophony.
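The comparison can be sketched in Python using the rough slot counts from this post. The helper names (`possible_words`, `syllables_needed`) are made up for the example, and the last step, asking how long English words would need to be to match the Hawaiian total, is an extra illustration rather than part of the original calculation.

```python
def possible_words(s: int, n: int = 14) -> int:
    """Add up s + s**2 + ... + s**n with exact integers."""
    return sum(s ** k for k in range(1, n + 1))


HAWAIIAN_SYLLABLES = 9 * 25 * 1   # onsets * nuclei * codas
ENGLISH_SYLLABLES = 80 * 12 * 96

hawaiian = possible_words(HAWAIIAN_SYLLABLES)
english = possible_words(ENGLISH_SYLLABLES)
print(f"Hawaiian: {hawaiian:.5e}")
print(f"English:  {english:.5e}")


def syllables_needed(s: int, target: int) -> int:
    """Smallest n for which words of up to n syllables reach `target`."""
    n = 1
    while possible_words(s, n) < target:
        n += 1
    return n


# How long must English words get to match Hawaiian's 14-syllable total?
print(syllables_needed(ENGLISH_SYLLABLES, hawaiian))
```

On these numbers, English words of up to 7 syllables already outnumber Hawaiian words of up to 14. That fits the idea that a language with fewer possible syllables can make up the difference with longer words.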

It’s also worth mentioning that people don’t necessarily invent new words by just randomly gluing some sounds together. That happens (language is arbitrary after all) but people also coin new terms in other ways. For example, I haven’t considered compounding, which would up the word count again.

But as I said earlier, treat all these numbers with some skepticism. The final count doesn’t matter anyway. The number of words in a language isn’t nearly as interesting as the mental processes that generate those words in the first place.

Filed under Linguistics

13 responses to “How many words could English have?”

  1. An interesting thought experiment indeed.
    PS Have you talked about compound rules before? I would love to read about that in your words.

  2. Fascinating post! It got me thinking about antidisestablishmentarianism as one of the longer English words. I looked it up in Wikipedia, which mentions that it is “commonly believed to be the longest word in English found in major dictionaries […], excluding coined and technical terms.” (Does that make ‘supercalifragilisticexpialidocious’ a coined or a technical nanny term? :grin:)

    Interesting comparison with Hawaiian, too! From what I can recall from my visits, the words are definitely longer on average. The mountain on Kauaʻi is Waiʻaleʻale (5), and there is the famous Queen, Liliʻuokalani (6) [Wiki says her full birth name was Lydia Liliʻu Loloku Walania Wewehi Kamakaʻeha].

    There must be an inverse correlation between the sounds available and the average length of words in a language?

    I understand that German tends to make new words by combining existing words. From what little German I know, some of their words seem very long as well.

  3. SynedraAcus

    Nice post, and btw has this kind of stuff (i.e. the full space of phonetically correct words) ever been seriously studied? It would be interesting to see, say, whether the size of the possible word space has anything to do with the number of actual words used in a language (in all dialects and slangs and all, taken together) or the complexity of syntax or whatever else.

  4. SynedraAcus

    But you made a little error. To find the number of 2-syllable words, you should not double the number of syllables; you multiply it by itself.

    • Oh, that’s a silly mistake to make. Thanks for pointing it out. I’ve fixed the formula to reflect this, and changed the numbers in the post too. This also results in considerably larger syllable counts. The magnitude makes our fast judgements of long nonsense words, like supercalifragilisticexpialidocious, all the more impressive.
