Saturday, August 20, 2016

Publish your research Open Access - for you, me and everyone!

Thinking of turning that MA thesis into an article, putting together an edited volume from a workshop or finally writing that big book on discourse particles that is going to solve everything? Why not consider Open Access (OA) publishing instead of the traditional publishing houses and journals? With OA, people have an easier time actually reading your work, and you won't be feeding money into a potentially shady system that exploits academics as editors and reviewers for free, and then makes the same community pay to read the products. Furthermore, by selecting OA options with a good reputation, you're not in danger of having the standing of your work lowered.

There are several traditional publishing venues that exploit the benevolence of the academic community by not paying reviewers and/or editors, not paying authors, and then expecting university libraries to pay expensive rates when buying the published research - research that taxpayers somewhere have probably already funded.

Universities like Harvard are already actively encouraging their researchers to choose OA; the Max Planck Society in Germany is an enthusiastic supporter and co-founder of the OA movement; and CERN publishes its research either itself (it already has a rigorous in-house review process) or at other OA venues. After all, why would these research institutions want to pay for things several times? They're often already providing the reviewing and editing themselves, so why mix in a middleman who charges you money for honestly not much added value?

It can be hard for a junior researcher to take the step to publishing OA: the ranking of a given OA journal might not be good enough yet, and no one might have heard of or trust the venue. That's why it's important to have a set of trusted, high-quality venues where more senior and established researchers are already publishing, and where you can too! We've compiled a helpful list of venues where you can publish your work. Go here to see it! If you want to know more about different kinds of OA (Gold, Platinum, Green, Blue etc), click here.

And before you ask, OA does not mean a lack of reviewers, editors, proof-readers etc. There are other funding schemes that can pay for such work, or sometimes academics provide it for free.

Finally, it may be that you're at a financially stable institution with a library that can afford to pay these fees, but do consider the fact that not everyone else is in such a privileged position.

Thank you for your time, go OA!

P.S. Naturally there are non-evil publishing places that are non-OA.

Friday, August 19, 2016

Some language universals are historical accidents

There are surprisingly few properties that all languages share.  Pretty much every attempt at articulating a genuine language universal tends to have at least one exception, as documented in Evans and Levinson's article 'The Myth of Language Universals'.  However, there are non-trivial properties that are found, if not in literally all languages, then in enough of them, across multiple language families and independent areas of the world, that they demand an explanation.

An example is the fact that languages have predictable word orders.  Languages differ in whether they allow the verb to come before or after the object (English has it before, Japanese after).  They also differ in whether they have pre-positions (such as English ‘on the table’) or post-positions (such as Japanese テーブルの上に teburu no ue ni ‘the table on’).  If a language has the verb before the object then it tends to have prepositions rather than postpositions, as in English; if the verb is after the object, it is a good bet that the language will have postpositions rather than prepositions (these rules hold for 926/981 languages in WALS, not controlling for relatedness).  The orderings of other elements in a sentence, such as noun and adjective or noun and possessive, are to some extent free to vary among languages, but again tend to fall into correlating types.
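
The strength of this kind of correlation is easy to check for yourself.  The sketch below builds a toy 2×2 table of verb-object order against adposition type; the individual cell counts are invented for illustration (only the 926/981 'harmonic' total matches the WALS figure quoted above), but the chi-square computation is the standard one:

```python
# Toy contingency table of verb-object order vs. adposition type.
# The cell counts are illustrative; only the totals (926 "harmonic"
# languages out of 981) come from the WALS figure quoted in the text.
table = {
    ("VO", "prep"): 456, ("VO", "postp"): 38,
    ("OV", "prep"): 17,  ("OV", "postp"): 470,
}
total = sum(table.values())
harmonic = table[("VO", "prep")] + table[("OV", "postp")]
print(f"{harmonic}/{total} languages are 'harmonic'")

# Pearson chi-square by hand (no SciPy needed for a 2x2 table).
rows, cols = {"VO", "OV"}, {"prep", "postp"}
row_sum = {r: sum(table[(r, c)] for c in cols) for r in rows}
col_sum = {c: sum(table[(r, c)] for r in rows) for c in cols}
chi2 = sum(
    (table[(r, c)] - row_sum[r] * col_sum[c] / total) ** 2
    / (row_sum[r] * col_sum[c] / total)
    for r in rows for c in cols
)
print(f"chi-square = {chi2:.1f}")  # far above 3.84, the 5% cutoff for 1 df
```

Any counts with the same marginals would tell the same story: the association is so strong that the statistical question is not whether the correlation exists, but what causes it - which is exactly where the non-independence argument below comes in.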

Why should knowing the word order of one category in a language help predict the orderings of other categories?  Many people have taken facts such as this as evidence that language is shaped by principles of harmony across grammatical categories, or evidence of Universal Grammar.  Another possible explanation is that languages which have similar word orders for different grammatical categories are somehow easier to learn, or easier to use. 

However, I would argue that at least some of these patterns are not evidence of our psychological preferences, but are accidental consequences of language history.  This post is a response to a couple of blog posts written by Martin Haspelmath this week on Diversity Linguistics Comment (here, and here with Sonia Cristofaro), which argued that historical explanations of universals still need to invoke constraints on language change, and that 'the changes are often adaptive, and that in such cases a precise understanding of the diachronic mechanisms is not necessary (though of course still desirable).'  I disagree, in the particular case of word order correlations, and I will argue specifically in this post that word order correlations are a consequence of grammaticalization.  In arguing this I'm building on work by Aristar (1991) and Givon (1976), but developing the argument to refute Haspelmath's points.  I will present the background of the argument first and then discuss his points in more detail.

Grammaticalization is the process by which new grammatical categories can be formed from other (often lexical) categories. For example, Mandarin Chinese has a class of words which might be called prepositions if they were in a European language, but which really have their historical roots in verbs.  An example is 從 cóng which in modern Mandarin is a preposition meaning ‘from’ but which in Classical Chinese was a verb meaning ‘to follow’, as these two sentences illustrate.  

我 從    倫敦       來
I  from London  come
‘I come from London’

天下                         之                   民       從       之
under-heaven  POSSESSIVE  people  follow  him
‘Everyone in the world follows him’  (孟子萬章上 Mengzi Wanzhang Shang: from the 古代漢語辭典 Gudai Hanyu Cidian)

The word 從 cóng has changed its meaning from ‘follow’ to a more abstract spatial meaning ‘from’.  It has also lost its ability to be used as a full verb, requiring another verb such as ‘come’ in the sentence, just as English requires a verb in the sentence ‘I come from London’ (*‘I from London’ and its equivalent *我從倫敦 are ungrammatical).  Other Chinese prepositions such as 跟 gēn ‘with’ also have a verbal origin, while many preposition-like words such as 給 gěi 'for' and 在 zài 'in/at' retain verbal meanings ('give' and 'to be present') and verbal syntax (such as being able to be used as the sole verb in the sentence and to take aspect marking).

Why is this relevant to word order universals?  Because if so-called prepositions in Chinese were once historically verbs which have since lost their verbal uses, this can explain why the two grammatical classes have the same word ordering: they were once the same category, and they simply haven’t changed their word orders since then.  Since the verb precedes the object in Chinese, as in the Classical Chinese sentence given above (從之 cóng zhī ‘follow him’), the preposition in modern Chinese also precedes the object (從倫敦 cóng Lúndūn ‘from London’).

It is incorrect to say that Chinese has prepositions and verb-object order because this is a combination that is easy to process or learn, or because the categories are in some other sense ‘harmonious’.  The real explanation is that verbs and prepositions in Chinese have a common ancestor, and have simply preserved their word orders since then.  This is a subtle variant of Galton’s problem, by which the historical non-independence of data points can create correlations that are not causal.  Most examples of this involve relatedness of whole languages or cultures, for example the spurious correlation between chocolate consumption and Nobel Prize winners; or the way that a maths teacher at my school was exercised by the fact that in many languages ‘eight’ and ‘night’ rhyme or are similar (e.g. German acht and nacht, French huit and nuit, Italian otto and notte) - there's nothing mystical here: the words in these languages descend from common Proto-Indo-European roots *okto and *nekwt, which happen to be similar.

Just as languages and cultures can be related, individual words in a language can be related, such as prepositions and verbs, and hence share properties such as their word order.  It turns out that the process of grammatical categories developing from other categories is extremely common and attested in every language family and part of the world (e.g. Heine and Kuteva 2008).  Verbs can change into adpositions as in the Chinese example above (also found in languages such as Thai and Japanese), while nouns also often change into adpositions (as in many Niger-Congo languages such as Dagaare, where adpositions are all also body parts such as zu 'on/head').  Other word order correlations can be explained in a similar way, such as the relationship between adjective-noun and genitive-noun order, and even verb-object order and genitive-noun order (because of grammaticalizations such as me-le e kpe dzi 'I am on his seeing' as a way of expressing 'I am seeing him' in Ewe, Claudi 1994).  I give further examples in different languages in a short article I wrote for Evolang (2012), in which I make the point that these processes should be considered a serious confound to an explanation which tries to claim that there is a causal link between word orders across grammatical categories.

But can't both explanations be correct?  This is the response I hear from every linguist that I've described this argument to, including Balthasar Bickel, Morten Christiansen, Simon Kirby and now also Martin Haspelmath in his blog post when he says '...while everyone agrees that common paths of change (or common sources) have an important role to play in our understanding of language structure, I would argue that the changes are often result-oriented, and that in such cases a precise understanding of the diachronic mechanisms is not necessary (though of course still desirable).'  In short, does grammaticalization happen in order to create correlated word orders?  

No.  Objections like that are missing the point about non-independence.  Grammaticalization happens, causing two grammatical constructions to exist where there was previously one. These two constructions are likely to have the same word order, on the reasonable assumption that constructions are more likely than not to keep the same word order over time (an assumption also vindicated by work by Dunn et al. described below).  You have to control for this common ancestry if you wish to claim that the correlation in word orders across constructions is causal.  It is as if people wanted to claim that there was a deeper ecological reason why chimpanzees and humans share 98.8% of their DNA, rather than just the primary historical reason which is that they have a common ancestor.
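
The non-independence point can be made concrete with a toy simulation.  In the sketch below, every language's adpositions are 'born' with the order of the verbs they grammaticalized from, and the two orders then drift independently; there is no harmony pressure anywhere in the model, yet the word orders come out strongly correlated.  All of the parameters are invented for illustration:

```python
import random

random.seed(0)

def simulate(n_langs=1000, p_flip=0.1):
    """Fraction of languages whose verb and adposition orders agree,
    when adpositions inherit their order from verbs at birth and the
    two orders then change independently (no harmony pressure)."""
    agree = 0
    for _ in range(n_langs):
        verb_order = random.choice(["VO", "OV"])
        adp_order = verb_order            # shared ancestry at birth
        if random.random() < p_flip:      # independent drift afterwards
            verb_order = "OV" if verb_order == "VO" else "VO"
        if random.random() < p_flip:
            adp_order = "OV" if adp_order == "VO" else "VO"
        agree += verb_order == adp_order
    return agree / n_langs

# Expected agreement is (1-p)^2 + p^2 = 0.82 for p_flip = 0.1:
# a strong 'correlation' produced entirely by common ancestry.
print(simulate())
```

The point of the sketch is that a high rate of word order 'harmony' is exactly what you should expect from grammaticalization plus word order stability alone; a functional explanation only earns its keep if it predicts something beyond this baseline.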

There are interesting ways that both explanations could be true, if this non-independence of constructions is successfully controlled for, but evidence for this is surprisingly elusive.  One possibility is that only some kinds of grammaticalization happen, namely the types which produce word orders that are easy to process.  Haspelmath makes this suggestion in his post: 'I certainly think that studying the diachronic mechanisms is interesting, and I also agree that the kind of source of a change may determine parts of the outcome, but to the extent that the outcomes are universal tendencies, I would simply deny the relevance of fully understanding the mechanisms. In many of the cases that Cristofaro discusses, I feel that the “pull force” of the preferred outcome may well have played a role in the change, though I do not see how I could show this, or how one could show that it did not play a role.'  

I agree that this might be possible ("the “pull force” of the preferred outcome may well have played a role in the change"), but there is currently no evidence for this.  The main way to test it would be to compile a database of grammaticalizations across languages and to see whether certain grammaticalizations happen only in certain languages: for example, do postpositions only develop from nouns in a genitive construction (the table's head -> the table on) if the language also places the verb after the object?  It is easy to find exceptions, such as Dagaare, which has verb-object order but postpositions, because its postpositions develop from nouns in a genitive-noun construction.  In a large database, there may be all sorts of interesting constraints on what grammaticalizations can occur, as well as geographical patterns, and it may of course turn out that word order is one constraining factor, but currently this hypothesis is unsubstantiated.

Another way that word orders might be shown to be causally related to each other is if a change in one word order can be shown to be correlated with a change in another word order in the history of a language, or in its descendants.  For example, if a language has verb-object order and prepositions but then changes to having object-verb order and postpositions, then this suggests that the two word orders are functionally linked (if this event takes place after any grammaticalization linking these verbs and postpositions).  The only solid statistical test of this so far has been an article by Dunn, Greenhill, Levinson and Gray in Nature (2011).  They tested the way that four language families have developed (Bantu, Austronesian, Indo-European and Uto-Aztecan) and tested models of word order change using a Bayesian phylogenetic method for analysing correlated evolution.  What they found was that some word orders do indeed change together.  For example, the order of verb and object seems to change simultaneously with the order of adposition and noun in Indo-European, as shown in the tree reproduced from their paper below (red square = prepositions, blue square = postpositions, red circle = verb-object, blue circle = object-verb, black = both):

A model in which these two word orders are dependent is preferred over a model in which they are independent with a Bayes factor of above 5, a conventional threshold for significance.  This seems to vindicate the idea that adpositions and verb-object order are functionally linked in Indo-European.  This also holds up in Austronesian.  It does not hold up in the smaller and younger families Uto-Aztecan and Bantu, but that may be because of the low statistical power of this test when applied to small language families.
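
For readers unfamiliar with Bayes factors, the decision rule itself is simple: the Bayes factor is the ratio of the marginal likelihoods of the two models.  The numbers below are made up for illustration, standing in for the output of a phylogenetic analysis; they are not values from Dunn et al.'s paper:

```python
import math

# Hypothetical log marginal likelihoods for two models of word order
# change: one where adposition order and verb-object order evolve
# together ("dependent") and one where they evolve separately.
log_ml_dependent = -512.3
log_ml_independent = -515.1

# The Bayes factor is the ratio of marginal likelihoods, computed on
# the log scale for numerical safety.
bayes_factor = math.exp(log_ml_dependent - log_ml_independent)
print(f"Bayes factor = {bayes_factor:.1f}")  # exp(2.8), roughly 16.4

# Using the threshold mentioned above: a Bayes factor over 5 counts
# as support for the dependent model.
print("dependent model preferred" if bayes_factor > 5 else "no clear preference")
```

Note that the hard part of such an analysis is estimating the marginal likelihoods in the first place, which requires integrating over trees and rate parameters; the comparison step shown here is the easy bit.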

Is this convincing evidence that there is a functional relationship between the two word orders after all, after factoring out grammaticalization?  It would be, except that language contact is not controlled for in this case.  What could be happening is that some Indo-European languages in India have different word orders because of the languages that they are near, such as Dravidian languages, which also have object-verb order and postpositions.  A similar point could be made about the Austronesian languages that undergo word order change, which are found in a single group of Western Oceanic languages on the coast of New Guinea, which is otherwise dominated by languages with object-verb order and postpositions.

An interesting result of their paper is that word orders are very stable, staying the same over tens of thousands of years of evolutionary time (i.e. summing the time over multiple branches of the families), supporting the assumption that I described above that word orders tend to stay the same.  The main result of this test has been that language families differ in which word order dependencies they show, and many of them are likely to reflect events of word order change due to language contact, but the test has also been acknowledged by Russell Gray and others as having low statistical power, and hence not conclusive evidence either for or against there being genuine functional links between word orders.  A promising approach in the future is to apply the same phylogenetic test to the entire world, attempting to use it on a global phylogeny - a world tree of languages that does not have to be completely accurate, but simply has to incorporate known information about language relatedness, and perhaps some geographically plausible macro-families to control for linguistic areas where languages have shared grammatical properties across families (such as Southeast Asia or Africa).

Another intriguing line of inquiry is to work out what particular predictions a theory of processing or learnability would make about word order patterns across languages, and whether these predictions are in fact different from an explanation that invokes grammaticalization.  Hawkins (2004) is an example, which shows that there are word order patterns such as 'If a prepositional language preposes genitives, then it also preposes adjectives', and argues that these are predicted on the basis of the relative length of constituents such as possessive phrases and adjectives.  These particular rules worked in Hawkins's sample of 61 languages, but fail on larger databases such as WALS (38 out of 50 languages contradict the rule just given, for example).

Whether or not these attempts to demonstrate the role of processing are successful, a large part of the story of why word order universals exist is the evolution of grammar.  When we try to explain why adpositions correlate in their ordering with other categories, we should remember to ask why languages have a separate grammatical category of adpositions at all.  Why does grammaticalization happen, forming a distinct class of adpositions, rather than languages just expressing spatial relations with nouns and verbs?  Why are English prepositions such as for, to, on and so on etymologically obscure, whereas in some languages such as Dagaare and Chinese many adpositions are homophonous with verbs and nouns, to the point that it is doubtful that these 'adpositions' really constitute a separate class (as opposed to a sub-class of verbs, and relational nouns)?  One possibility is that we store individual constructions rather than words, and these constructions once individually stored can end up being transmitted as independent units between speakers.  To take a hypothetical example, in a language which uses body-part terms to convey spatial meanings such as saying table's head to mean 'on the table', the particular use of head as a spatial word may be stored separately from the body-part use of 'head'.  Once that happens, it is possible for the body-part sense of head to be lost in a community of speakers and just the spatial sense retained (for example, the English front derives from the Latin frons 'forehead').

This process often creates a chain of intermediate cases between nouns and adpositions, such as in Tibetan, where some adpositions require genitive marking such as mdun 'front' ('the house's front'), while others used genitive marking in the Classical language but no longer allow it (nang 'inside') (DeLancey 1996:58-59).  There are similarly ambiguous cases in English where words such as regarding can be both a verb form and a preposition.  It is worth asking whether any language has ever developed adpositions any other way: it is hard to imagine a language inventing adpositions from scratch (Edward de Bono managed to get the phrase 'lateral thinking' to catch on among English speakers, but not his invented preposition po), and they instead catch on better if they are extended uses of already existing words, such as English regarding. It is likely that most words began as extensions of other words, more generally, rather than invented out of nothing, perhaps with some exceptions such as 'Quidditch', or ideophones.  It is possible that some languages may actually invent adpositions, such as sign languages (which can use iconic signs for 'up' and 'down' for example), but if the hypothesis is correct that this is not normally what happens in spoken languages, then the historical default ought to be that adpositions share the same syntactic properties, including their word order, as other categories. The real thing to explain is not why they correlate in their word orders with other categories, but why it is ever the case that they do not correlate.   

As an analogy, some languages have unusual non-correlations of word orders across constructions, such as German, which has verb-object order in main clauses and object-verb order in subordinate clauses, or Egyptian Arabic*, in which numerals precede the noun except for the numbers 'one' and 'two', which follow it.  It is true that most languages in the world have a 'correlation' between the ordering of the number 'one' and the ordering of other numerals, to the point of making this another word order universal: but a functional explanation for this fact ('A language is easier to learn if the word order is the same for all numerals') would be banal, and would miss the fact that the historical default in most languages has been for the orderings to be correlated, simply because 'one' is normally treated as a member of the same grammatical class as other numerals.  I'm arguing that the correlation between adpositions and verb-object ordering is also likely to be a historical default due to grammaticalization, rather than a situation which languages converge on for reasons of processing or learnability.

I see word order correlations, in short, mostly as an unintended consequence of the way that grammatical categories evolved in most languages, not as an adaptive solution to processing or language acquisition.  Martin Haspelmath seems to disagree with the spirit of this type of historical argument in his blog post, however, which states (to repeat): 'Quite a few people have argued in recent times that typological distributions should be explained with reference to diachronic change...but I would argue that the changes are often adaptive and result-oriented, and that in such cases a precise understanding of the diachronic mechanisms is not necessary (though of course still desirable).'  My main point in this post is that I disagree with both parts - that an understanding of the history of languages is unnecessary to understand word order correlations (it is in fact the main story behind them), and that these changes are adaptive and result-oriented (there is little evidence so far that these grammaticalizations are geared towards producing harmonic word orders).

He has some more specific objections to historical arguments, reproduced below:

"(A) Recurrent paths of change cannot explain universal tendencies; universal tendencies can only be explained by constraints on possible changes (mutational constraints).

(B) Diverse convergent changes cannot be explained without reference to preferred results.

(C) If observed universal tendencies are plausible adaptations to language users’ needs, there is no need to justify the functional explanation in diachronic terms."

I disagree slightly with objection (A), because common pathways of change are enough to be a serious confound to functional explanations of language universals, as I have tried to argue.  How common is 'common'?  In an ideal world, common enough that, for example, the number of languages predicted to have word order correlations is about 926/981, simply using a statistical model that assumes grammaticalization, inheritance of word order in language families, and language contact.  I have only listed some examples in this post, but their existence in multiple families and parts of the world, coupled with the stability of word orders in families, is enough to make the relatedness of constructions an important confound.  I acknowledge that an actual quantitative test is needed of whether they are common enough to explain the entire distribution of word orders, which would rely on a database of grammaticalizations - if there is enough data on grammaticalization to ever be able to test this.

Haspelmath is sceptical of 'common pathways of change', viewing these as unfalsifiable, and asks instead for stronger constraints: 'In syntax, one might explain adposition-noun order correlations on the basis of the source constraint that adpositions only ever arise from possessed nouns in adpossessive constructions, or from verbs in transitive constructions (Aristar 1991).'  In this post, I suggested a strong constraint, namely that new words normally develop from already existing words and are rarely invented from scratch.  Adpositions are therefore likely to develop from words that include, but are probably not limited to, nouns and verbs.  The question of what particular grammaticalizations can occur and why some are especially common is of course an interesting subject, but secondary to the main argument of this post, namely that the very existence of these processes is a serious confound to functional explanations of universals.

Point (B) effectively asks why languages converge on patterns such as word order correlations when they take different historical paths, such as Chinese grammaticalizing verbs to prepositions, while Thai grammaticalized (in some cases) possessive nouns.  Isn't it a coincidence when both processes conspire on the same result, both verb-object languages having prepositions?  Well, in both cases the expected outcome of grammaticalization was prepositions, simply based on the ordering of the source constructions (verb-object, and noun-genitive), so there isn't anything to explain.  There are also plenty of counter-examples, such as Dagaare mentioned earlier, which takes the same path as Thai and ends up with non-correlating word orders (verb-object order but postpositions), because the postpositions come from possessed nouns with a genitive-noun ordering.  Again, a database of grammaticalization would tell us how common these exceptions are; if they turn out to be rarer than expected - for example, if there really is a tendency for verb-object languages not to evolve postpositions even when they have genitive-noun order - then Haspelmath's point (B) may be vindicated.  Finally, point (C) is the main one that I disagree with, as I stated above (the history of these categories is all-important, and grammaticalization does not seem to happen with the goal of creating word order correlations).  I should add that I am only talking about word order, and may agree with Haspelmath's points in explaining other common linguistic patterns.  I am also not denying the relevance of processing to understanding why some word order combinations may be favoured over others, which can be illustrated with sentences such as 'The woman sitting next to Steven Pinker's pants are just like mine' (Pinker 1994) (illustrating the problem of a language having genitive-noun order and noun-relative clause order).

Why am I writing about a relatively minor set of disagreements on a niche question?  For me, this subject is interesting because it is about a subtle variant of Galton's problem and the possibility of erroneously inferring causation from correlation, but also because it encompasses three of the greatest discoveries of modern linguistics.  One of them is the discovery of word order universals themselves, the unexpected set of rules which allow one to make predictions about word orders in every part of the world, from Europe to the Amazon and New Guinea, with deep implications for the way that grammatical rules are represented in the mind.  Word order universals were first elucidated by Joseph Greenberg (1963) and substantiated for over 600 languages (now over 1500) by Matthew Dryer (1992).  I sometimes wonder why this discovery was not reported in Nature at the time, given that Dunn et al.'s later article attempting to refute word order universals was published there.  It is an intriguing linguistic fact that has been written about in popular accounts of language such as Pinker's The Language Instinct but which has not yet received a fully satisfactory explanation and awaits further statistical tests, such as a large-scale phylogenetic analysis.  Such tests require knowledge of how languages are related to each other, touching on the second 'great discovery' that I would suggest linguists have made: the way that we can study the history of large, ancient families such as Indo-European and Austronesian (and perhaps soon even larger macro-families).

The third great discovery, though less well-known, is grammaticalization, 'the best-kept secret of modern linguistics' (Tomasello 2005).  Languages are systems of complex grammatical categories and sometimes perverse syntactic rules.  How did all that get here?  Who 'invented' Latin verb endings, or English prepositions?  The most satisfying answer that we have is that grammatical words and morphemes tend to develop from already existing elements, and develop their grammatical meanings gradually.  The English morpheme -ing, for example, is claimed to have begun as an ending denoting nouns to do with people, such as cyning 'king' and Iduming 'Edomite', and was then extended to be used on verbs as a nominalizer (playing tennis is fun) and then as a marker of continuous aspect (I am playing tennis) (Deutscher 2008).  The change from nominalizers to verb endings is mirrored across several language families (see here), and the origin of nominalizers in some languages can be traced back further to noun endings or even full nouns (such as 化 huà 'change' in Mandarin being used as a nominalizer in 現代化 xiàndàihuà 'modernization', or sa in Tibetan coming from a noun meaning 'ground, place').  This shows in principle how complex grammar does not need to be invented, but can develop by gradual changes from simple elements such as concrete nouns.

In some cases these links are directly attested in languages with a long written record, such as Chinese.  In other cases they are inferred from polysemies or by comparison with related languages.  These links differ in how plausible or substantiated they are, and this work therefore needs some attempt at quantification, for example in objectively assessing similarity between forms, or counting instances of known semantic shifts across languages.  Above all, attested grammaticalizations need to be gathered into a database in order to test relationships with other properties such as word order.  Heine and Kuteva's The Genesis of Grammar (2008) and a well-written popular account The Unfolding of Language (Deutscher 2005) are overviews of grammaticalizations that have been documented across languages, including from nouns to adjectives, case markers, adpositions, adverbs, and complementisers; from verbs to aspect markers, case markers, adpositions, complementisers, demonstratives, and negative markers; from demonstratives to definite articles, relative clause markers, and pronouns; and from pronouns to agreement and voice markers.   

These pathways of change by which new categories can be created are the fullest account of the evolution of language that we currently have, a fraction of which are summarised in a tree below from Heine and Kuteva (2008:111).  They help us make sense of the inherent fuzziness of closely related categories, and also the formal similarities between them, including correlations in their word orders.  Word order universals may turn out to have been shaped in part by other factors such as processing and learnability, but they also tell the story of a linguistic equivalent of the Tree of Life, the history of grammatical categories.

(References: see this bibliography.  *Correction: it was pointed out to me that the Thai example that I cited from memory and without a source was wrong, which I've now replaced with the example of Egyptian Arabic from WALS.)

Saturday, July 9, 2016

A Global Tree of Languages

I was a reviewer for the Evolution of Language (Evolang) conference for the first time this year, a tedious-sounding task that turned out to be hilarious.  The conference attracts some bizarre manuscripts on the origins of language; I wanted to devote a blogpost to one particularly imaginative one, but regretfully cannot because of reviewer confidentiality.

Also in my inbox to review was the most exciting paper about language that I’d ever seen.  I recommended acceptance obviously, even though it was only tangentially related to the theme of the conference, and it was accepted as a poster and published in the conference proceedings (available here). 

The paper was by Gerhard Jäger and Søren Wichmann, about constructing a world family tree of languages using a database of basic vocabulary, the ASJP database.  Claims about how language families may be related are nothing new but are normally statistically uninformed (such as work by Merritt Ruhlen and Joseph Greenberg).  The amazing thing about this new paper is that it uses a simple statistical test of relatedness between languages, extending the methodology of a paper by Gerhard Jäger in PNAS last autumn and covered in my post here, and finds evidence for language families around the world being related to each other in a geographically coherent way, clustering into continents and even reflecting quite specific events in human history.

The resulting family tree of around 6000 languages and dialects was what they presented as a poster at Evolang. If correct, it might as well be on the cover of Nature:

I find this tree exciting because geographical patterns emerge purely out of comparing lists of words in different languages.  The four lists of words below, for example, do not seem to have much in common, but somehow the algorithm manages to correctly place A and B in Papua New Guinea, and C and D in South America.  (The languages are Yabiyufa, Dubu, Nadeb and Kukua). 

             A           B            C         D
I:           nemo        no           u*h       we*b
one:         makoko      k3rowali     SEt       bik
blood:       oladala     t3ri         yuw       be*p
fish:        lahava      ambla        hu*b      keh
skin:        upala       ser          buh       baka7

The algorithm works by calculating the distance between two word lists, namely how different two languages are.  It takes words for the same concept in the two languages, and aligns them using an alignment algorithm.  Then the difference between the words is computed using a matrix of which substitutions are most probable.  For example, 'p' often changes to 'f', as in the cognates Latin pater and English father.  This matrix of probable sound changes can be computed from the data itself using an unsupervised learning method, by looking at only the most similar languages, such as English and Dutch, or Italian and French, and counting how many times particular phonemes are aligned with each other in these related words.
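
The alignment step can be sketched with a bare-bones Needleman-Wunsch implementation.  This is only an illustration: the paper scores substitutions with a learned PMI matrix, while here a uniform match/mismatch score stands in.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Globally align two words; returns the two aligned strings."""
    n, m = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] against y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # trace back from the bottom-right corner to recover one best alignment
    ax, ay, i, j = "", "", n, m
    while i or j:
        s = match if i and j and x[i - 1] == y[j - 1] else mismatch
        if i and j and score[i][j] == score[i - 1][j - 1] + s:
            ax, ay, i, j = x[i - 1] + ax, y[j - 1] + ay, i - 1, j - 1
        elif i and score[i][j] == score[i - 1][j] + gap:
            ax, ay, i = x[i - 1] + ax, "-" + ay, i - 1
        else:
            ax, ay, j = "-" + ax, y[j - 1] + ay, j - 1
    return ax, ay

print(needleman_wunsch("pater", "father"))  # → ('pat-er', 'father')
```

Note how the gap falls out of the optimization automatically: aligning 'h' against nothing costs less than forcing a mismatch elsewhere.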

The distances between words are then summed and produce an overall measure of similarity between languages.  Languages with shorter distances to each other are then placed nearer to each other in the family tree using a neighbour-joining algorithm.  As expected, almost all language families that we already know about emerge from this method, such as Indo-European, Austronesian, and so on, with only a few exceptions corresponding to more controversial families.  The novel part is that these known families cluster into larger groups, 'macro-families', that make sense geographically.
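
The tree-building step can be sketched with a textbook neighbour-joining pass in pure Python, over four hypothetical 'languages' with made-up pairwise distances (the authors' pipeline differs in its details, and the real analysis covers thousands of doculects):

```python
def neighbor_joining(names, dists):
    """Build an unrooted tree topology (nested tuples) by neighbour joining.

    names: list of taxon labels; dists: dict-of-dicts of pairwise distances.
    """
    nodes = list(names)
    d = {a: dict(dists[a]) for a in nodes}
    while len(nodes) > 3:
        n = len(nodes)
        # total divergence of each node from all the others
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # the Q-criterion corrects raw distances for each taxon's divergence,
        # so long branches are not automatically mistaken for close relatives
        _, a, b = min(((n - 2) * d[x][y] - r[x] - r[y], x, y)
                      for i, x in enumerate(nodes) for y in nodes[i + 1:])
        u = (a, b)  # join the winning pair under a new internal node
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = (d[a][c] + d[b][c] - d[a][b]) / 2
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    return tuple(nodes)

# toy distances: A and B have similar word lists, as do C and D
toy = {"A": {"B": 2, "C": 10, "D": 10},
       "B": {"A": 2, "C": 10, "D": 10},
       "C": {"A": 10, "B": 10, "D": 2},
       "D": {"A": 10, "B": 10, "C": 2}}
print(neighbor_joining(["A", "B", "C", "D"], toy))  # groups A with B, leaving C and D
```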

The word lists contain only about 100 words.  The simplicity of the algorithm and the paucity of the data used make it even more surprising that a coherent result comes out (other phylogenetic studies require a few hundred words, such as work using the Austronesian Basic Vocabulary Database).  The full tree is available to inspect online here.  Before critiquing the methodology, I will suspend disbelief and savour some of their results.  

Perhaps the most impressive result is that indigenous languages in the Americas come out as a single group.  South America contains over a hundred language families, which linguists have been unable to relate to each other.  This algorithm demonstrates for the first time that they have something in common, both with each other and with the languages of Meso-America and North America, commonalities which are unlikely to be entirely due to recent contact given the distances involved and may therefore reflect events over the last 20,000 years.

Similarly, languages across northern Eurasia are related in this tree.  The so-called Altaic family emerges, comprising Turkic, Mongolic and Tungusic.  Interestingly, Japanese and Korean are not present, despite sometimes being placed with them in the so-called 'Transeurasian' family.  Martine Robbeets has a group at the MPI for the Science of Human History in Jena specifically devoted to demonstrating the validity of Transeurasian and investigating its linguistic, genetic and archeological history.  For what it is worth, Japanese is placed in Jäger and Wichmann's tree next to Sino-Tibetan and Hmong-Mien (probably reflecting borrowing from Chinese); Korean is placed with the language-isolate Burushaski in Pakistan and more distantly to languages of New Guinea and Australia (implausible, although not actually impossible if all closer relatives of Korean in Asia have died out).

There is a larger ‘Nostratic’ family covering the whole of Eurasia as well, linking the Altaic languages with Indo-European, Uralic and languages of the Caucasus.  The language family Afro-Asiatic is on the very outside of this family, a satisfying result because despite most branches of Afro-Asiatic being found in Africa, there is a lot of genetic and archeological evidence linking these populations with movement from Eurasia into North Africa in the last 10,000 years, as Jared Diamond among others has noted in his paper on how language families often spread with farming.  There has previously been no linguistic evidence at all for a Eurasian origin of Afro-Asiatic languages, so it is very interesting that it is supported in this analysis.  

In Asia, Austronesian and Tai-Kadai turn out to be related, a hypothesis long maintained by Laurent Sagart, Weera Ostapirat and others, but which has previously not been assessed statistically.  Austronesian is known to have originated in Taiwan, and if it is related to Tai-Kadai, it looks like we can now push back its origins further to southern China.  Further up, this Austro-Tai family turns out to be closely related to Austro-Asiatic, another southeast Asian family.

Could information about migrations over tens of thousands of years really be contained in these word lists?  The paucity of the data (between 40 and 100 words per language) and the simplicity of the algorithm make these results even more remarkable.  Some people may be sceptical at the very idea of being able to reconstruct language history that far back.  This knee-jerk reaction is disappointingly common and mostly reflects an inability to assess statistics, such as in people's reactions to Jäger's 2015 paper on the 'eurogenes' blog:

A more reasoned approach is to look at the vocabulary similarities between languages and ask how plausible they really are.  Somebody called 'Ebizur' on the same blog commented on the supposed cognates between Korean and Burushaski:

Another objection might be that language contact is partly responsible for the similarities between geographically neighbouring languages.  The authors acknowledge the problem of language contact, but also say that it often does not affect the results: known instances of contact such as between Dravidian and Indo-Iranian, for example, do not show up in their tree.  Despite it not interfering with the reconstruction of young language families that much (i.e. within the last 5000 years), it may affect the reconstruction of macro-families, as we will see.

Simon Greenhill and Russell Gray pointed out to me that it gets the structure of some language families wrong.  Austronesian has a well-known structure in which the first branches are all in Taiwan, a structure that Jäger and Wichmann's tree fails to recover.  One might then wonder why we can trust these results.  If the tree fails to accurately recover language families that are only 5000 years old, how can it get families that are tens of thousands of years old right?

There are unfortunately some more important problems.  An easy-to-miss point in their paper is that in fact the algorithm does not just use distances between word lists to construct the tree; it also uses a second distance measure between languages, namely phonological distance, based on what bigrams occur in each language (pp.3-4):

To calculate the secondary distance measure we represented each doculect as a binary vector representing the presence/absence of bigrams of the 41 ASJP sound classes in the corresponding word lists. The bigram inventory distance between two doculects is then defined as the Jaccard distance between the corresponding vectors...The relative weight of lexical to bigram inventory distances was, somewhat arbitrarily, set to 10:1. In this way it was assured that phylogenetic inference is dominated by the information in the PMI distances, and bigram inventory distances only act as a kind of tie breaker in situations where lexical distances do not provide a detectable signal.

This is important, because the macro-families that they detect are therefore not necessarily reflecting relatedness of vocabulary items, but simply reflect phonological similarity, which could easily be due to language contact.  This method was not used in Jäger's PNAS paper, and the authors give no justification for using it here.  There is in fact no good reason to use it, if the aim is to push back reconstruction of vocabulary.  
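
The bigram inventory distance they quote is easy to sketch.  In this toy illustration, ASJP's 41 sound classes are simplified to raw character bigrams, and the word lists are invented:

```python
def bigrams(word_list):
    """The set of symbol bigrams occurring anywhere in a word list."""
    return {w[i:i + 2] for w in word_list for i in range(len(w) - 1)}

def jaccard_distance(a, b):
    """1 minus the proportion of shared bigrams (Jaccard distance)."""
    return 1 - len(a & b) / len(a | b)

# two hypothetical mini word lists for neighbouring doculects
lang1 = bigrams(["hand", "nas3", "blut"])
lang2 = bigrams(["hant", "n3s", "blut"])
print(jaccard_distance(lang1, lang2))
```

Note that nothing here looks at which concept a word expresses: two languages count as similar if their words are built from the same two-sound sequences, which is exactly why contact-induced phonological convergence can leak into this measure.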

It is an interesting, but separate, question whether you can use the sound systems of languages to investigate their history.  I replicated that part of their method, taking the ASJP data and bigrams of characters in each word list (although I'm not sure why they used bigrams in particular), and then computed the Jaccard distance between each language.  I then made a neighbour-joining tree of the 7000 languages in the data based on this distance measure.  The tree is very large, so here is one part of it:

There are some intriguing patterns here, because all of these languages are found in North, South and Central America, and are grouped together purely because they have similar phoneme inventories.  The second group below shows how Afro-Asiatic languages such as Berber, Arabic, Aramaic, Amharic and so on cluster together; they are placed with some Eurasian families such as Mongolic, Tungusic and Indo-European, perhaps explaining why Afro-Asiatic was placed with Eurasian languages in Jäger and Wichmann's tree.  But there are also some African languages in this group, perhaps reflecting language contact, and some which have no explanation because they are from other continents (such as Algonquin).


The results show a few geographical patterns and a weak sign of clustering by language family, but there is also a lot of noise.  This is not surprising, partly because the implementation here is clumsy (a neighbour-joining method, and the use of bigrams rather than all n-grams), but also because nobody normally expects sound systems by themselves to tell us much about language relatedness.  I think there may be something worth exploring further - for example why languages in the Americas are similar in their phonological inventories - but the question here is why Jäger and Wichmann use this phonological distance measure, without justification, if it is not reliable for showing how languages are related.

Note too that the vocabulary distance and phonological distance are combined with an arbitrary weighting of 10:1.  This means that known language families are recovered with little interference from phonological distance (so it avoids silly results like Algonquin being placed in the Afro-Asiatic family); but beyond the point where similarity between the vocabulary of different language families is minimal, the phonological distance measure comes into effect.  This allows them to get a superficially impressive result: known language families and geographically plausible macro-families.  In reality, the result may be more banal: known language families are recovered using one distance measure, but it is unable to reconstruct anything beyond what we already know; and the other distance measure hints at large-scale geographical patterns, but is unable to produce sensible results at the more local level of known language families.  And to return to the issue of language contact, their first distance measure may be relatively immune to borrowing, and hence works over a 5000 year time scale; but this does not mean that the phonological distance measure is immune to borrowing, which comes into play at time depths beyond that.

The other main problem with Jäger and Wichmann's method, as I have raised before in my review of Jäger's PNAS paper, is the use of a neighbour-joining algorithm to construct the family tree.  Neighbour-joining is the principle that the more similar two things (languages or species) are, the more likely they are to be closely related.  To see why this is wrong, consider the three animals below.  The two animals that look like mice share a lot of morphological traits in common - they are small, have four legs and a tail, similar skulls, fur, and so on.  On a neighbour-joining principle, one would guess that the two mouse-like creatures are closely related, and they are more distantly related to dolphins.   

In fact this is wrong: the mouse and the dolphin are most closely related to each other, being placental mammals.  The animal on the far left is antechinus, a marsupial that evolved its mouse-like form independently in Australia (famous for its 'suicidal' mating habits, in which the male dies of exhaustion after copulating for fourteen hours).

We know that the mouse and dolphin are more closely related to each other than they are to antechinus despite being superficially more different, because the bodies of animals can change very rapidly, especially when under new selection pressures from the environment; relatively closely related species such as whales and hippos can differ in a large number of traits because whales adapted to life in water.  Other traits of animals are much slower-changing, such as the reproductive system, which along with a few other slow-changing traits clearly separate the marsupials from the placental mammals (and using DNA rather than morphological traits makes this clearer).  Taking different rates of change into account suggests that the true tree is more like this:

This is perhaps a reason why Jäger and Wichmann's algorithm gets the structure of Austronesian wrong.  It places some Oceanic branches on the outside of the Austronesian tree, perhaps because they have undergone a lot of evolutionary change, analogous to the way that whales underwent rapid change when their ancestors went into the water.  In a neighbour-joining method, the amount of change that languages have undergone is confounded with how distantly related they are.

It is not just that a neighbour-joining method of constructing a family tree is inaccurate.  In a sense, it gets the whole notion of relatedness wrong.  Truly related animals (or languages) are not necessarily expected to be similar overall, and one should be worried if they are: overall similarity is likely to accrue by chance between unrelated species, given enough time (as with the mouse and antechinus), whereas genuine relatedness is detectable in only a few slow-changing traits.  In the case of languages, even languages in a well-established family such as Indo-European have only a few words in common between all members, such as words for two and five (as shown in this paper), because most cognates get replaced by new words over time.  If the languages in Jäger and Wichmann's tree really are related, there should be specific words that are common to them that can be demonstrated to be slow-changing, rather than showing similarity distributed throughout the entire word list.

To test this, one should use Bayesian phylogenetic methods, rather than neighbour-joining based on distances between languages.  These methods calculate the probability of trees being true under different models of evolution, and take into account the fact that words evolve at different rates and can get lost entirely.  The family trees that you test can also incorporate branch lengths, reflecting the fact that not all species evolve at the same rate.

Bayesian phylogenetics asks what the likelihood of a family tree is, namely with what probability the data would be the way that it is if the hypothesis (the family tree being proposed) were true.  If one were to rerun the history of mammals, the probability that we would get exactly the mammals that we actually see on Earth is very, very small; the number of possible mammals that could have evolved is huge, and the species that actually exist are a tiny fraction.  

But this probability would be even smaller if their family trees had been much different.  There are certain family trees which make the evolution of these species more likely, such as a tree which places humans with chimpanzees, although crucially these do not have to be the same as the trees with the fewest changes (the most parsimonious tree, formed by neighbour-joining).

The probability of the data can be calculated for each tree (by considering every possible scenario for how traits, or DNA nucleotides, changed at each node of the tree, and summing over these possible scenarios), and also calculated using different assumptions about how fast- or slow-changing each trait is.  Ideally one would then like to check every tree and calculate its likelihood; since there are normally too many possible trees to check, a Markov Chain Monte Carlo method is used to search through trees and sample them according to how likely they are (see John Huelsenbeck's article here on Bayesian phylogenetics).  
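
The 'summing over possible scenarios' step can be made concrete with a minimal sketch for a single binary trait on a fixed tree, under a symmetric two-state model where each branch flips the state with some probability (toy data and a hypothetical flip probability; real analyses sum over many characters and estimate branch lengths):

```python
def conditional(node, states, p):
    """P(observed leaves below node | node is in state s), for s = 0, 1.

    A leaf is a string label; an internal node is a pair of children.
    Each branch flips the binary state with probability p.
    """
    if isinstance(node, str):  # leaf: its state is observed directly
        return [1.0 if s == states[node] else 0.0 for s in (0, 1)]
    left, right = (conditional(child, states, p) for child in node)
    result = []
    for s in (0, 1):
        # sum over every possible state of each child (Felsenstein's pruning)
        l = sum((1 - p if s == t else p) * left[t] for t in (0, 1))
        r = sum((1 - p if s == t else p) * right[t] for t in (0, 1))
        result.append(l * r)
    return result

def tree_likelihood(tree, states, p):
    """Likelihood of the leaf data, with an equal prior on the root state."""
    c = conditional(tree, states, p)
    return 0.5 * (c[0] + c[1])

# a trait shared by A and B but not C and D favours the tree ((A,B),(C,D))
states = {"A": 1, "B": 1, "C": 0, "D": 0}
good = tree_likelihood((("A", "B"), ("C", "D")), states, 0.1)
bad = tree_likelihood((("A", "C"), ("B", "D")), states, 0.1)
print(good > bad)  # → True
```

A Bayesian analysis then multiplies such likelihoods over all characters and lets MCMC wander through tree space, visiting trees in proportion to how well they explain the data.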

The data needed to apply this method is either DNA data, or a string of 0's and 1's which represent the presence or absence of traits (e.g. legs, placenta, small skull...).  In the case of languages, the approach taken since Russell Gray and Fiona Jordan's pioneering paper in 2000 on Austronesian is to code the presence or absence of cognate classes.  For example, main in French and mano in Italian are similar enough that they are likely to be related (or borrowed); hence one can code French and Italian as 1 (meaning they have a word which is part of this cognate class), and English as 0 because it has hand, which is not in this cognate class.  English then gets a 1 for the next cognate class (cognates of hand, which exist in Germanic languages), while French and Italian get 0.  
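
Building that character matrix from cognate classes is a one-liner once the classes are decided.  A toy sketch using the hand/main example (the class labels and language sample are invented for illustration):

```python
# hypothetical cognate classes for the concept 'hand'
cognate_classes = {
    "main/mano": ["French", "Italian"],
    "hand/hand": ["English", "German"],
}
languages = ["English", "French", "German", "Italian"]

# one binary character per cognate class: does this language
# have a word belonging to that class?
matrix = {lang: [1 if lang in members else 0
                 for members in cognate_classes.values()]
          for lang in languages}
print(matrix["French"], matrix["English"])  # → [1, 0] [0, 1]
```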

To apply this to the ASJP data, one would have to sort words in the data into classes of likely cognates.  As an initial test of this, I used the matrix of likely substitutions that Jäger supplied in his PNAS paper, and computed similarity scores between words using the same alignment algorithm as in his paper, the Needleman-Wunsch algorithm (using the Python package nwalign).  A clustering algorithm such as UPGMA (implemented in Mattis List's Python package Lingpy) can then be used to sort words into clusters, using an arbitrary cut-off point beyond which words are judged too dissimilar to be related.
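
A minimal version of this clustering step, substituting plain Levenshtein distance for the PMI-weighted scores and single-linkage union-find for UPGMA, with word forms taken from the cognate classes listed further down:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (a crude stand-in for the PMI-weighted
    Needleman-Wunsch similarity scores used in the paper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster(words, threshold):
    """Group words whose length-normalised distance falls under the
    threshold, with transitive merging (single linkage via union-find)."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if edit_distance(a, b) / max(len(a), len(b)) < threshold:
                parent[find(a)] = find(b)
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

# forms for 'I' from two of the classes below; 0.6 is the arbitrary cut-off
print(cluster(["ik", "ik3", "ix", "ako", "aku", "qaku"], 0.6))
# → [['ik', 'ik3', 'ix'], ['ako', 'aku', 'qaku']]
```

Raising the threshold merges more forms into fewer classes, which is the 'lumping' behaviour discussed below.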

As an example, here are words for 'I/me' in different Indo-European and Austronesian languages, sorted by the cognate groups that the algorithm finds:

Class 1  
ene, xina, xine, kinaN, inE, ina, i5a

Class 2
saikin, sak3n, sakon, sak, ak3n, yak3n, aken, tiaq3n, ha77on, ha77in

Class 3
ik, ik3, ix

Class 4
inau, enau, inda7u, inahu, inaku, iau, kinau, kinu, ki5u, Nanu, Nau, nau, axanau, anau, Noua, Nou, nou, inu, ino, inoi, no, nu

Class 5
Eku, ako, qaku, yoku, yaku, yako, iaku, i7aku, yaku7, yako7, aku7, akuh, agu, 53ku, ha7o, wa7u, ia7u, a7u, ga7u, Na7o, ahu, ku, uku, hu, wau, au, auw, lau, ilau, rau, iyau, eyau, yau, dyau, yao, kyau, yaw

These are not the best cognate judgements, but the sound changes in each cognate class are at least quite plausible, such as 'inaku' to 'inda7u' and 'kinau', showing that Jäger and Wichmann's similarity judgement algorithm seems to work.  I've heard that using distance measures and clustering in Lingpy in general seems to agree 90% of the time with a human's judgement, which is about as often as humans agree with each other.  

More importantly, this method is reproducible and consistent, and avoids the subjectivity of human linguists, one of whom Roger Blench once quoted as saying 'I always find more cognates after a good lunch'. It is additionally interesting that you can make the algorithm more 'lumping' or 'splitting' depending on the distance threshold you choose.  A higher threshold, the clustering algorithm's equivalent of a good lunch, causes it to merge the above cognate classes into a single cognate class, which may be useful for some purposes such as detecting cognates in large families.

Does this method perform any better than Jäger and Wichmann's method?  Unlike with their method, you cannot analyse all 6000 languages at the same time, as it is computationally intensive to analyse even one thousand languages using BEAST.  I analysed 194 Indo-European languages as an initial test, producing the following tree, where numbers on each node represent the posterior probability of each clade.

It's hard to read and preliminary, but there are some pleasing results.  English is placed closest to Frisian and slightly more distantly with Dutch, which is correct according to Glottolog and the consensus among historical linguists.  By contrast, Jäger and Wichmann's paper (and Jäger's 2015 paper) gets the position of English wrong, placing it with Scandinavian languages because of Scandinavian loan-words in English.  It even does better than Bouckaert et al.'s 2012 paper in Science on Indo-European, which places English as equally related to all West Germanic languages (i.e. equally related to German as it is to Dutch).  

The Romance languages also look correct, with Latin and Italian dialects placed as the first branches off, therefore placing the origin of the Romance languages correctly in Italy (which was one of the main success stories of Bouckaert et al.'s paper on the phylogeography of Indo-European).  There is a strongly supported clade for Germanic-Romance and then higher up for Germanic-Romance-Celtic, which geographically makes sense (all are the westernmost branches of Indo-European) and is not too different from the clade proposed by Bouckaert et al., namely Romance-Celtic with Germanic on the outside.  Greek and Balto-Slavic languages then join this group, which finally join with languages in India.  Hittite is disappointingly placed with Albanian and the Balto-Slavic languages, rather than as its own primary branch.

I then analysed a set of mostly Austronesian languages with some Indo-European languages thrown in as well.  It recognised that Indo-European and Austronesian were different families (phew), but failed to recover the structure of Austronesian.  Further tests on Austronesian languages by themselves also failed to recover a sensible structure for the family, with low posterior probabilities (<10%) on each clade, no matter how lumping or splitting the cognate threshold was.

The results are mixed so far, perhaps because of the nature of the ASJP data (maybe the Austronesian data in particular isn't good enough), and because of the inherent difficulty in cognate-classification.  Jäger and Wichmann are presumably trying ways of cognate-coding the data (or have already done so), and many other people interested in automatic cognate-coding have worked with the ASJP data (such as this paper by Hauer and Kondrak, and an ingenious recent paper by Taraka Rama on using convolutional networks for cognate-coding), and of course Mattis List who is the main author of Lingpy.  

Despite this, I have not seen discussion (published or otherwise) of automatically detected cognates in the ASJP data that support macro-families.  That is the main task in evaluating Jäger and Wichmann's paper, which I think is worth doing given the intriguing geographical patterns that they find.  There is ongoing work on building better databases of vocabulary for the world's languages, both in Tübingen (where Jäger is based) and in the Glottobank group at the MPI for the Science of Human History in Jena (a group which two other authors on this blog, Hedvig and Siva, and I are part of), with the aim of pushing our knowledge of the history of languages further back in time.  Jäger and Wichmann's paper has made me more optimistic that these new databases will yield results.  

In the meantime, it is worth continuing to analyse the ASJP data using Jäger and Wichmann's method and checking whether their global family tree stands up to Bayesian scrutiny.  My attempt at this analysis I will leave to another blog post, especially as I have not done it yet.  Is Austro-Tai a valid family?  Are the origins of Afro-Asiatic in Eurasia?  Is South America a monophyletic clade?  Find out in the next instalment.  Maybe.


Post scriptum
Gerhard Jäger kindly sent a reply to my blog post, with some criticism of my points including the antechinus example, which I post here with his permission.

Dear Jeremy,

Thanks for your thoughtful and fair comments on our paper! Here are replies to some of them. (Everything below reflects my own opinion; Søren might disagree.)

As you know, there was a strict page limit for that proceedings paper, so we couldn't discuss all issues in as much depth as we wanted to. Stay tuned; there will be a follow-up.

A remark on the ASJP data: We only used the 40-item vocabulary lists throughout. For ca. 300 doculects, ASJP has 100-item lists, but we figured using items 41-100 might introduce a bias, so we left them out. Also, ASJP contains quite a few gaps, so in total there are about 36 entries per doculect only. I keep being amazed at how much information you can squeeze out of so few data. :-)

Perhaps the central point: You write:

„There are unfortunately some more important problems. An easy-to-miss point in their paper is that in fact the algorithm does not just use distances between word lists to construct the tree; it also uses a second distance measure between languages, namely phonological distance. […] This is important, because the macro-families that they detect are therefore not necessarily reflecting relatedness of vocabulary items, but simply reflect phonological similarity, which could easily be due to language contact. This method was not used in Jäger's PNAS paper, and the authors give no justification for using it here. There is in fact no good reason to use it, if the aim is to push back reconstruction of vocabulary.“

My apologies if this was easy-to-miss; it is in fact the most important methodological innovation of the paper.

As you also point out, similarities in vocabulary items give enough information to identify language families and their internal structure (with some caveats, of course; borrowings and chance similarities occasionally kick in, but overall this works reasonably well). There is little you can say about trans-family patterns on the basis of vocabulary alone. What I did in my PNAS paper is, in my experience, pretty much the best you can do in this respect. (Longer word lists might help, but I am not very confident in this respect.) There is a lexical signal for Eurasiatic, Austro-Tai, Australian, but that's about it. In fact, if you run the method from my PNAS paper on the entire ASJP, most supra-family groupings have low confidence values and presumably reflect chance similarities more than anything else. (The languages of the Americas and the Papuan languages are lexically extremely diverse.)

This is where the information about phonetic inventories kicks in. As you also noticed, this signal is very noisy, and if you do phylogenetic inference with it alone, you get pretty poor phylogenies. The phonetic inventories do carry a deep signal though. Most macro-patterns in our tree reflect similarities in phonetic inventories.

If Atkinson was right in his out-of-Africa paper, sound inventories might carry a very deep phylogenetic signal. I do not want to rule this out a priori. It is equally conceivable though that this is all language contact. Even if so, this is a relevant finding, I think, as it points to prehistoric language contact.

I do not have a firm opinion on this, but my best guess is that the truth is somewhere in the middle, i.e., phonetic inventories carry information both about vertical and about horizontal transmission. Disentangling the two is one of the big challenges for the future.

There is no denying that several groupings in our tree reflect language contact. I commented on this in my PNAS paper in relation to the (probably non-genetic) Sino-Tibetan + Hmong-Mien grouping, but there are certainly many more instances of this kind. The placement of Japanese next to Sino-Tibetan you mention is a case in point.

I suppose (something to explore in the future) that the problems with the internal structure of Austronesian Greenhill and Gray pointed out to you also reflect contact, albeit in an indirect way. Just shooting from the hip: There are many loanwords between the Oceanic branch of Austronesian and Papuan languages. This results in a non-tree-like lexical signal. This effect is not strong enough to pull the affected Austronesian languages out of the Austronesian cluster, or to pull the Papuan languages into the Austronesian cluster, but it leads to a rotation of the Austronesian tree topology in such a way that Oceanic is moved to the periphery and the Taiwanese branches end up in the interior. (This does not explain all problems with the Austronesian tree, but perhaps the most conspicuous one.)

I do not agree with your comments on the phylogenetic algorithms. Neighbor Joining (actually we used Minimum Variance Reduction, but this is a close cousin of Neighbor Joining) is not as good as Bayesian phylogenetic inference (if your data are in the right format to do a Bayesian analysis, that is), but it is a good approximation in many cases. It certainly does not have the inherent bias you describe.

Your example with the mouse, the antechinus and the dolphin is not very well chosen, for several reasons. Neighbor joining (like Maximum Parsimony, Maximum Likelihood and most other phylogenetic inference algorithms) computes an *unrooted* tree. As there is only one unrooted tree topology over three taxa, any algorithm will give you the right result here. So let us, for the sake of the argument, add a shark to the mix. There are three different topologies over four taxa, only one of which is correct.

In your description of the algorithm, distances are calculated on the basis of morphological traits such as „small“, „has four legs“, „has fur“ etc. This would, in fact, lead to the wrong topology

((shark, dolphin),(mouse,antechinus))

But the same would happen if you arrange your morphological traits in a character matrix and do character-based inference:

            small  four_legs  tail  fur  fin  lives_in_water
mouse         1        1       1     1    0        0
antechinus    1        1       1     1    0        0
dolphin       0        0       0     0    1        1
shark         0        0       0     0    1        1

Any character-based phylogenetic inference algorithm will give you the same topology, simply because there are no mutations separating mouse and antechinus, and likewise none separating dolphin and shark.

The deeper problem here is that we have *convergent evolution*, i.e. all those characters evolved twice, due to natural selection. Standard phylogenetic algorithms are not really applicable with those data as they rely on a neutral model of evolution, i.e., the absence of selection.

If you would compare those four species on the basis of their DNA, Neighbor Joining would undoubtedly give you the correct topology, just like character-based methods.

(There is a still unpublished paper by Johann-Mattis List and me where we, among other things, discuss the pros and cons of various phylogenetic algorithms.)

As you point out, to perform Bayesian phylogenetic inference you need data in a character matrix format. As there are no expert cognacy judgments so far for most language families, doing this for data beyond well-studied families is a challenge. Distance-based phylogenetic inference (such as Neighbor Joining or Minimum Variance Reduction) is one way to circumvent this problem. Using automatic methods to bring ASJP data into character format, as you suggest, is another option – one we are currently exploring as well. In this connection you might find this manuscript (also co-authored by Mattis and me) interesting. I am happy to share more information about this sub-task off-line.

To conclude for today: With our method we come up with an automatically inferred tree for more than 6,000 doculects (representing ca. 4,000 languages with separate ISO codes – two thirds of global linguistic diversity) which has a Generalized Quartet Distance of 0.046 to the Glottolog expert tree. The challenge is on: Can we do better than this?