Comparing the complexity of languages

Which language do you think is the most complex? There are a lot of different answers that people will give to this question. Some people are sure that whatever language they struggled with in high school is the most complex. Others are certain that highly influential cultures must have complex languages, so they choose Greek, Latin, Sanskrit, or Chinese. Language buffs might break out a rare one like Tlingit or Lardil. Many people insist it is their own native language that’s the most complex, though oddly, I’ve found that just as many people seem willing to say their own language is terribly simple.

But ask a linguist, and you get a really wet blanket answer: no language is any more complex than any other. Or, put another way, all languages are equally complex. That answer tends to stop conversation dead in its tracks, and no one is really satisfied by it, so I’m going to spend some time in this post explaining this answer and making it more interesting (maybe).

I should start by saying that it isn’t true that all languages are known to be equally complex. That’s just the consensus opinion among linguists. It’s a natural conclusion that you reach after enough years of looking at a lot of different languages. The reality of the situation is that linguistics lacks a coherent theory of complexity. There is no known way of getting a useful, objective measure of complexity. There is no equation for complexity that you stick a language into and get a number out.

Part of the problem is just the sheer magnitude of language diversity. There are 6,000-7,000 languages in the world, depending on how the count is done; distinguishing between dialect and language is not a trivial issue. If you aren’t familiar with linguistic diversity, a very widely used catalogue of languages is the Ethnologue, which is worth a visit if you’ve never seen it.

Another way to get a sense of diversity is to look through the World Atlas of Language Structures. Rather than being organized by language, WALS is organized by grammatical feature. Choose one, and then you get a map of the world with languages organized into groups, depending on how that particular feature works in that language.

As it stands, it’s not possible to do a comparison across the whole set of human languages because there just isn’t enough data available to do that. For most languages there exists little more than a word list or maybe a dictionary. Certainly not every language has a descriptive grammar, and even for those that do, the contents are going to be far from complete. A full description of a language can involve decades of work. Gathering original data on a language is time-consuming even when it goes well. It can easily be bogged down by technical problems, travel problems, political problems, health problems in cases where there are only elderly speakers left, and the general slowness of academic publishing.

In the meantime, it has become the consensus, some might even say the orthodox, position in linguistics that all languages are equally complex. I want to walk you through why linguists tend to believe this, and why this is such a difficult issue.

In order to decide which language is more complex, we would need a clear understanding of what it really means to be complex. One thing we could consider is learnability: if one language takes longer to learn, on average, than others, that makes it a more complex language. Unfortunately, this turns out not to be very useful as a guide to complexity.

All evidence points to the fact that infants acquire their first language in roughly the same amount of time. Children follow a normal order of acquisition, moving from babbling, to a one-word stage, on to bigger chunks of syntax, and finally on to full sentences. Children around the world do this at roughly the same age regardless of language, regardless of intelligence, regardless of whether there exists any formal grammar education. To be clear, this doesn’t mean that all children learn language in exactly the same way. The details of acquisition vary somewhat from child to child, and also depend on the specific language, but the overall trajectory and time frame for acquisition seem to be comparable around the world.

Adult learners have a different challenge. The difficulty of learning a language at a later stage in your life depends almost entirely on how similar it is to other languages you already know. Mandarin might seem complex if your native language is English, but not if your native language is Cantonese, which is related to Mandarin. Finnish would probably appear highly complex to a speaker of Vietnamese, but less complex to a speaker of Estonian.

All in all, learnability isn’t a good measure of complexity, because in the case of children, there are no differences, and in the case of adults it depends on facts about the learner, not the language itself.

It would be better if you could have some way of comparing languages more directly to each other. One thing that might come to mind is comparing vocabulary size. This is also a dead end. I have a post over here that explains why counting words is pointless. Besides, the lower bound for even small languages is hundreds of millions of words, and at that size I think anything would count as complex.

So if we can’t compare lexicons, maybe we can compare the grammars of two languages in some way. This is trickier than you might expect. I decided to choose English and Russian as examples, for no particular reason other than I happen to have enough material nearby to write something coherent about Russian for this post.

To start with, we’re going to need to pick which area of the grammar to compare. Already, this is not simple, because not every grammar has the same parts. English, for example, has words commonly called ‘articles’, including ‘the’ and ‘a’. Russian does not have an equivalent class of words.

On the other hand, Russian nouns are inflected for grammatical case. There are six cases in Russian, and each case can have a different form for masculine, feminine, neuter and plural nouns. English has no case marking on any nouns except pronouns, and there are at most four forms, e.g. subject “I”, object “me”, possessive “my/mine” and reflexive “myself” (and it would be a stretch to call them all “grammatical cases”).

How do we compare these two things? Should we say English is more complex because it has articles? Should we say that Russian is more complex because it has case marking? There’s another trade-off here too: case-marking partly indicates the role of a noun in a sentence, so words in Russian sentences can be scrambled around a little bit. English words have to go in a stricter order because there’s no case-marking to lay out who did what to whom. Is it more complex to have a case system and free word order, or is it more complex to have no case marking and strict word order rules?

Even if you can find something that both languages have in common, it’s not always easy to decide which is more complex. Let’s do some comparison of English and Russian verbs for example. This is going to get a bit grammar-y here because there’s a lot to say about verbs, but I’m going to try to keep it brief.

English has one basic past-tense suffix (usually written as -ed), which has three possible pronunciations, depending on the final sound of the stem it attaches to. For example, “kicked” has a final /t/ sound, “begged” has a final /d/ sound and “headed” has a final /əd/ sound. There are also several other ways of forming the past tense. Some verbs change their vowel (speak-spoke, ride-rode, freeze-froze), some change the final consonant to /t/ (build-built, lend-lent, spend-spent), some add a -t and also have a vowel change (catch-caught, keep-kept, think-thought), some have no change at all (hit, bet, broadcast), and some are totally unpredictable (leave-left, is-was).
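The regular part of that pattern is mechanical enough to sketch in a few lines of Python. This is only an illustration: the function name `regular_past_suffix` and the set `VOICELESS` are made up here, sounds are written as IPA strings, and the irregular verbs are left out because they simply have to be memorized.

```python
# Voiceless consonants other than /t/ that can end an English stem (IPA).
# Assumption: this inventory is enough for a sketch; it is not a full analysis.
VOICELESS = {"p", "k", "f", "θ", "s", "ʃ", "tʃ"}


def regular_past_suffix(final_sound: str) -> str:
    """Pick the pronunciation of regular -ed from the stem's final sound."""
    if final_sound in {"t", "d"}:
        return "əd"  # as in "headed"
    if final_sound in VOICELESS:
        return "t"   # as in "kicked"
    return "d"       # as in "begged"; also after vowels, as in "played"


# The three example verbs from the paragraph above, with their final sounds.
for verb, final_sound in [("kick", "k"), ("beg", "g"), ("head", "d")]:
    print(verb, "->", "/" + regular_past_suffix(final_sound) + "/")
```

Note that the function needs the final sound, not the final letter, which is exactly why spelling hides this rule from most native speakers.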

Russian past tense is indicated with a suffix, just like English. The suffix also has multiple possible pronunciations, just like English, but instead of being based on the final consonant of the verb root, it is based on the gender of the subject of the verb. Verbs take the suffix -ла in the case of feminine subjects, -ло for neuter subjects, -л for masculine subjects, and -ли for plural subjects, regardless of gender.
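For comparison, here is the same kind of sketch for the Russian rule. The names `PAST_SUFFIX` and `russian_past` are invented for this example, and it assumes the verb читать ‘to read’ can be cut down to a stem чита- before the suffix is added, which works for this verb but not for every Russian verb.

```python
# Past-tense suffix chosen by the gender/number of the subject, as described above.
PAST_SUFFIX = {
    "masculine": "л",
    "feminine": "ла",
    "neuter": "ло",
    "plural": "ли",
}


def russian_past(stem: str, subject: str) -> str:
    """Attach the past-tense suffix that agrees with the subject."""
    return stem + PAST_SUFFIX[subject]


# Assumed stem: чита- from читать 'to read'.
for subject in PAST_SUFFIX:
    print(subject, russian_past("чита", subject))
```

The lookup is keyed on the subject rather than on the sound of the stem, so the two languages are doing quite different things with what looks like the same kind of suffix.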

English has one simple present suffix (written as -s usually), and it is only used for the 3rd person singular. It has three possible pronunciations, and, like the past tense, it depends on the final consonant of the verb: it’s /s/ in “he kicks”, a /z/ in “he goes”, and /əz/ in “he touches”. Otherwise, a bare form of the verb is used, e.g. “I kick”, “We go”, “They touch”.

The present tense in Russian is also marked with a suffix, but unlike English, there is a different suffix for each of the grammatical persons. Verbs fall into two conjugation classes with slightly different suffixes, which are separated by a slash below:

1st person singular -у
2nd person singular -ешь/-ишь
3rd person singular -ет/-ит
1st person plural -ем/-им
2nd person plural -ете/-ите
3rd person plural -ут/-ат

Additionally, many Russian verbs undergo a change in their final consonant when they are conjugated. For example, the verb ‘to cry’ is плакать but when conjugated the “к” becomes “ч”, which is like English “ch”. Some verbs have an extra consonant inserted, instead of changing one. The verb ‘to live’ is жить, but when conjugated there’s a letter “в” that is added, e.g. живу ‘I live’.
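Putting the endings and the stem changes together, a rough sketch might look like this. The names `FIRST_CONJUGATION`, `PRESENT_STEM` and `conjugate` are made up for illustration, only the first of the two conjugation classes is covered, and the changed stems are simply listed by hand rather than derived by rule.

```python
# First-class present-tense endings, taken from the list above.
FIRST_CONJUGATION = {
    "1sg": "у",
    "2sg": "ешь",
    "3sg": "ет",
    "1pl": "ем",
    "2pl": "ете",
    "3pl": "ут",
}

# Present-tense stems after the consonant change or insertion described above.
# Assumption: these are stored as a lookup table, not computed.
PRESENT_STEM = {
    "плакать": "плач",  # к becomes ч
    "жить": "жив",      # в is inserted
}


def conjugate(infinitive: str, person: str) -> str:
    """Build a present-tense form from the listed stem and ending."""
    # Stress is ignored, so forms such as живёшь come out spelled with е.
    return PRESENT_STEM[infinitive] + FIRST_CONJUGATION[person]


print(conjugate("жить", "1sg"))
print([conjugate("плакать", person) for person in FIRST_CONJUGATION])
```

The fact that the stems have to be stored in a table is the point: a learner cannot predict them from the infinitive alone.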

The past and the present are examples of verb ‘tenses’, which refers to the time when an action took place. English and Russian verbs can also have their ‘aspect’ modified. The term ‘aspect’ refers to the way that the flow of an event is described, as opposed to what time it happened at. There are many different kinds of aspects, but the basic two are ‘imperfective’, for events which are incomplete or open-ended, and ‘perfective’, for events which are complete or closed-off. The difference between tense and aspect is not always easy to distinguish in a language, as the two systems are often intertwined.

In English, aspect is indicated with modal and auxiliary verbs. English has a perfective that is formed with ‘have’, e.g. “I had spoken”, which suggests that the event of speaking is complete. This can be contrasted with a progressive aspect formed with ‘be’, e.g. “I am speaking”, which suggests that the event is ongoing or unfinished. These auxiliaries can even be combined to give more nuanced meanings, e.g. “I had been speaking”.

These extra verbs also change the rules for where the past-tense or present-tense suffix goes. In general, the tense suffix goes on the verb the farthest to the left. We say “she had been eating”, for example, and put the auxiliary verb ‘have’ in the past. We don’t say *“she has been ateing” or *“she has wasen eating”. If the verb is negative, English requires an extra dummy verb ‘do’ which can carry tense, so we say “she did not write”, but not *“she not wrote” or *“she do not wrote”.
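The placement rule above can be sketched in Python as well. This is a toy under several assumptions: the table `FORMS` and the function `verb_group` are invented for this example, all forms are for a third-person singular subject like “she”, and only ‘have’ and ‘be’ are treated as auxiliaries.

```python
# Hand-listed forms for a few verbs (third-person singular subject assumed).
FORMS = {
    "have":  {"past": "had",   "present": "has",    "participle": "had",     "ing": "having"},
    "be":    {"past": "was",   "present": "is",     "participle": "been",    "ing": "being"},
    "eat":   {"past": "ate",   "present": "eats",   "participle": "eaten",   "ing": "eating"},
    "write": {"past": "wrote", "present": "writes", "participle": "written", "ing": "writing"},
}
DO = {"past": "did", "present": "does"}


def verb_group(chain, tense, negative=False):
    """Build a verb group from auxiliaries plus a main verb, leftmost first."""
    # Dummy 'do': a lone main verb (other than 'be') cannot carry negation,
    # so 'do' takes the tense and the main verb stays bare.
    if negative and len(chain) == 1 and chain[0] != "be":
        return " ".join([DO[tense], "not", chain[0]])

    words = []
    form = tense  # only the leftmost verb is marked for tense
    for position, verb in enumerate(chain):
        words.append(FORMS[verb][form])
        if negative and position == 0:
            words.append("not")
        # Each auxiliary decides the shape of the verb that follows it.
        form = "participle" if verb == "have" else "ing"
    return " ".join(words)


print("she", verb_group(["have", "be", "eat"], "past"))
print("she", verb_group(["write"], "past", negative=True))
```

Even this stripped-down version needs a special case for ‘do’ and another for ‘be’, which gives some idea of how much machinery sits behind a sentence that English speakers produce without thinking.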

In Russian, aspect distinctions are indicated with verb prefixes. Russian is commonly described as having two aspects, the perfective (complete action) and imperfective (incomplete action). For example the verb читать means ‘to read’ and has an imperfective interpretation. The verb про-читать also means ‘to read’, but the prefix про- suggests that the reading is completed. There are numerous prefixes that can indicate perfective aspect, and you need to memorize which verb takes which prefix.

Some verbs can take more than one prefix, each of which changes the aspect and adds some new meaning to the verb. The verb писать means ‘to write’. It can become perfective with the prefix на-, but also with the prefix за-, in which case it takes on a meaning closer to that of the verb ‘to record’. But what if you want to have the verb ‘to record’, but in the imperfective? In that case, Russian grammar has an “infix” -ыв- that you can add back into the middle of the verb root, to make it imperfective again.

So now, at the end of that whole discussion, which language is more complex? Keep in mind that we’ve barely scratched the surface of these languages, by focusing on very basic facts about verbs. I had to simplify in numerous places just to make this fit in a reasonable space inside a blog post. It’s not even clear that it makes sense to declare one more complex than the other, when they are both so complex to begin with. If you start to take apart any two languages and compare them piece by piece, you’ll come to the same conclusion that linguists do: all languages are equally complex, each in their own way. It is largely a matter of opinion which one you consider to be the most complex.

But it’s good to have an opinion, so share your favourite bit of complex grammar in the comments!




3 responses to “Comparing the complexity of languages”

  1. What do linguists say about the number of glyphs required to express a language? Chinese, for example, has thousands, whereas Hawaiian (as I understand it) has only 13. Does that lead to any useful measure?

    In computer science, complexity has been extensively studied. One measure involves execution time, although I can’t see how to apply that to language. What would be needed is some measure of the amount of “brain power” necessary to use the language.

    Perhaps more useful might be something along the lines of Kolmogorov complexity, which is a measure of how much is required to describe something (it’s also called algorithmic or descriptive complexity).


    • Good questions!

      The idea about glyphs is interesting, but it conflates the writing system of a language with its sound system. Hawaiian has a small number of written characters because it has a very small sound system, and it is written alphabetically. Chinese can also be written alphabetically, using a system called ‘pinyin’, in which case it takes about 30 letters plus 4 tone markers. There’s also a variety of Mandarin spoken in Kazakhstan called “Dungan” which is written in the Cyrillic alphabet (the one that Russian uses).

      Even at this level, it’s not clear that having 30-odd sounds is more complex than having only 12 because of other facts about the lexicon and phonotactic rules of these languages. For one thing, Hawaiian words are sometimes 10 syllables or longer (the state fish of Hawaii is humuhumunukunukukuaapua’a), whereas Mandarin words are almost universally monosyllabic. It may also be a more complex task to parse Hawaiian speech and find word boundaries, whereas in Mandarin you can be almost certain that a syllable edge is a word edge.

      And what about all the languages that have no standardized writing system (i.e. most of them)? How do you measure them?

      That said, I definitely agree with you that having to learn thousands of traditional Chinese characters is a more complex task than learning an alphabet. Still, this doesn’t actually make Chinese qua language any more complex. In principle, any language could be written with a system of thousands of logograms. The use of logograms doesn’t reflect anything special about the grammar of Chinese. For comparison, consider the syllabary of Japanese, which is perfectly suited to the structure of Japanese, and does not work very well for other languages.

      Kolmogorov complexity is actually a very good suggestion, and I think that people have even tried to apply it to natural languages (I’m sure a Google Scholar search will turn up something). This will give you a nice formal measure of complexity, but some of the problems that I brought up in the post still exist, including the lack of descriptive grammars for most languages, and the difficulty of finding comparable parts of a language. I should probably add that there’s a lot of interest in linguistics in (un)supervised grammar learning, which often requires dipping into the literature on computational complexity, so there definitely are people hard at work on the question of linguistic complexity.


      • I wondered about the length of Hawaiian words. I do know they can be quite long (I love how they sound). IIRC, you’ve even touched on that in this blog.
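The Kolmogorov complexity suggestion from the comments above can be roughed out in code. Kolmogorov complexity itself cannot be computed, but the size of a compressed text gives an upper bound on it, so a general-purpose compressor is a common stand-in. The sketch below uses Python’s standard `zlib` module; the function names `compressed_size` and `compression_ratio` are made up for this example, and the sample strings are placeholders rather than real language data.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the UTF-8 text after zlib compression at the highest level."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return compressed_size(text) / len(raw)


# Placeholder samples: a highly repetitive string versus ordinary prose.
repetitive = "ab" * 500
prose = "Which language do you think is the most complex? " * 5
print(compression_ratio(repetitive), compression_ratio(prose))
```

The caveats from the reply above still apply. A measure like this works on a written sample, so it mixes up the writing system with the language, and it says nothing about languages for which no sizeable text exists.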


