Royal Society Publishing

Population-level neutral model already explains linguistic patterns

R. Alexander Bentley, Paul Ormerod, Stephen Shennan

In customizing the neutral model for language transmission, Reali & Griffiths [1] have added a new, ‘Bayesian’ learning interpretation to a neutral model that has been used in cultural evolution studies for some time (e.g. [2–6]). While Reali & Griffiths [1, p. 435] dismiss previous applications of this neutral model as being used merely ‘as a metaphor’, it has been explored in quantitative detail, and applied even to word frequencies [7]. We therefore question whether this fairly complex description of Bayesian learning is necessary, as it does not change the results of the model, and runs the risk of obscuring the advances made both through careful modifications and wider applications of this powerful model.

The Reali & Griffiths [1] version of the model is mathematically equivalent to previous versions, in representing a set of N variants per time step, with the next generation of N variants constructed by repeatedly sampling variants from the previous time step, with some probability of ‘mutation’, i.e. introduction of a unique new variant. Reali & Griffiths [1] apply the model in a novel sense to individual cognition, whereas it has previously been considered in terms of a population of fixed size N, which is replaced by N new agents in each time step. In this population version, each new individual either copies an existing variant from the last time step (with probability proportionate to the previous choices of agents), or chooses a new variant. With probability 1 − μ, an incoming agent copies its variant from that of an agent within the previous time step, or else with probability μ, the agent innovates by choosing a unique new variant at random.
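A minimal sketch of the population version described above may help fix ideas. It is an illustration under stated assumptions, not the code used in any of the cited studies: the function names, the integer variant labels and the starting condition (every agent holding its own unique variant) are all hypothetical choices made here.

```python
import random
from collections import Counter
from itertools import count


def neutral_step(population, mu, new_variant_ids, rng):
    """Build the next generation, keeping the population size N fixed.

    Each incoming agent innovates with probability mu (taking a unique new
    variant id); otherwise it copies the variant of a uniformly chosen agent
    from the previous time step, so copying is proportional to popularity.
    """
    next_population = []
    for _ in range(len(population)):
        if rng.random() < mu:
            next_population.append(next(new_variant_ids))
        else:
            next_population.append(rng.choice(population))
    return next_population


def simulate(n_agents=1000, mu=0.001, steps=200, seed=0):
    """Run the neutral model and return the final list of N variants."""
    rng = random.Random(seed)
    new_variant_ids = count()
    # Assumption: every agent starts with its own unique variant.
    population = [next(new_variant_ids) for _ in range(n_agents)]
    for _ in range(steps):
        population = neutral_step(population, mu, new_variant_ids, rng)
    return population


if __name__ == "__main__":
    final = simulate()
    # The five most popular variants and their counts in the final time step.
    print(Counter(final).most_common(5))
```

Copying from a uniformly chosen agent, rather than from a uniformly chosen variant, is what makes popular variants proportionally more likely to be copied.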

This model has already been shown to yield the inverse power law in the probability of variant frequencies that Reali & Griffiths [1] report, from which the other inverse power law, in word frequency versus replacement rate, also follows. We demonstrate this briefly below, in order to show how discussion of the model has advanced beyond this point, including the incremental addition of an extra parameter, ‘memory’ [8].

Before we consider the model predictions, however, we address the question of the psychological plausibility of the assumptions that are made about agent behaviour. The Bayesian learning approach assumes agents use ‘a rational procedure for belief updating that explicitly represents the expectations of learners’ (Reali & Griffiths [1, p. 430]). This raises a very important topic in social sciences. Standard economic theory, for example, assumes rational agents capable of obtaining all relevant information and then processing this to arrive at an optimal decision. The less demanding postulate of bounded rationality restricts access to full information, but the hypothesis of optimal decisions is retained (subject to information constraints).

Individual rationality is not the only way, however, that learning can be represented (e.g. [9,10]), and alternative hypotheses have support both in particular applications and in the wider literature. At the other extreme, for instance, the assumption of literally ‘zero-intelligence’ agents, based on the particle model of physics, has provided a powerful explanation of many population-scale phenomena, such as financial asset price changes [11]. Daniel Kahneman [12], in the lecture marking his Nobel prize, awarded for his work in economic psychology, argued that ‘the central characteristic of agents is not that they reason poorly, but that they often act intuitively. And the behaviour of these agents is not guided by what they are able to compute, but by what they happen to see at a given moment’. Heuristically, this seems perfectly consistent with a population-level view of learning, rather than the standard rational-individual model.

We might, therefore, alternatively assume that agents (including language speakers) observe the relative popularity of previous choices among other agents, and copy in proportion to these popularities, with a small probability of making an innovative choice. This population view, which is able to offer a good account of a wide range of cultural phenomena, assumes that agents are indifferent—‘neutral’—to the particular agent that they copy. The result of this neutrality is that agents tend to copy variants in proportion to the previous choices made by other agents; a more popular choice is more likely to be copied than a less popular one. In the special case where μ = 0, the model is formally equivalent to a process of preferential attachment (e.g. [13]).

This population view can also provide extra flexibility for modifying the model. In the Bayesian learning representation, for instance, each individual commits previous experience to individual memory. In the population view, previous experience is copied directly from other individuals, and ‘memory’ becomes a useful added parameter. In our own version [8], this variable memory, designated as m, allows agents to copy from the m previous time steps. This effectively means that the original variants can ‘linger’, because even if replaced in the current time step, they might still be ‘remembered’ within the m subsequent time steps.
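The memory parameter can be sketched as a small change to the copying step. The following is only an illustration: the function name is hypothetical, and it assumes that an agent who copies first picks one of the (up to) m remembered time steps uniformly at random and then copies a uniformly chosen agent from it. The text above says only that agents copy from m previous time steps, so this weighting is an assumption.

```python
import random
from collections import deque
from itertools import count


def simulate_with_memory(n_agents=1000, mu=0.1, m=10, steps=100, seed=0):
    """Neutral model in which agents may copy from the last m time steps.

    Returns the remembered generations, oldest first (at most m of them).
    """
    rng = random.Random(seed)
    new_variant_ids = count()
    # Assumption: every agent starts with its own unique variant.
    first = [next(new_variant_ids) for _ in range(n_agents)]
    # The deque silently drops generations older than m time steps.
    history = deque([first], maxlen=m)
    for _ in range(steps):
        generation = []
        for _ in range(n_agents):
            if rng.random() < mu:
                # Innovation: a unique new variant.
                generation.append(next(new_variant_ids))
            else:
                # Assumption: remembered time steps are weighted equally.
                past = history[rng.randrange(len(history))]
                generation.append(rng.choice(past))
        history.append(generation)
    return list(history)
```

With m = 1 only the immediately preceding time step is remembered, and the sketch reduces to the basic model without memory.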

Adding this memory parameter adds both flexibility and power to replicate real-world data patterns, including the three that Reali & Griffiths [1] report: the inverse power law probability distribution of variant frequencies, the inverse power law relating word frequency to replacement rate, and ‘S’ curves of replacement. Furthermore, by keeping the model simple in description, it is then easier to explore different phenomena, such as the flux of variants on a ranked popularity list [8,14]. We can visit each of the three main results of Reali & Griffiths [1] to show the advantages.

Firstly, Reali & Griffiths [1] fit their results to a power law probability distribution with exponent −1.7. Among the many processes that can produce Zipf's laws [15], the simpler neutral model yields this very same power law exponent (between about −1.6 and −1.8) when variants are counted cumulatively (e.g. [16]). Notably, the exponent of best fit varies with the choice of mutation rate, and in fact as we move away from Nμ = 1 (N is the population size, μ is the mutation rate), a power law no longer fits [17]; increasing Nμ tends to reduce the highest frequencies and push the distribution towards an exponential distribution, whereas decreasing Nμ increases the highest frequency, ultimately to a ‘winner-take-all’ distribution at Nμ = 0 [3,8,18]. In addition, the added memory parameter m enables the modified neutral model to generate a much larger family of right-skew frequency distributions [8].

Secondly, Reali & Griffiths [1] report an inverse power law relating word frequency to replacement rate, comparable to real languages [19]. This has also already been shown in the population version of the model [14], albeit in a slightly different way. Consider a ranked list of the V most popularly used variants in a particular variant pool, such as the V most popularly used words in a language. Regarding this list of variants, ranked in order of popularity from 1 (most popular) to V (least popular), the turnover on the entire list (word replacement rate among the top V words) is simply proportional to the size, V, of the list, as has been shown through simulation [14,20] and analytically [21]. The turnover just at rank V is equivalent to the turnover on the entire list from rank 1 to rank V, because each new entry must, at some instant, displace the one at the bottom or Vth position (this turnover decreases moving up the rankings, because reaching rank V does not guarantee reaching rank V − 1). Zipf's law implies that a word at rank V has usage frequency proportional to V^(−a), meaning the probability distribution function (PDF) has an exponent of −(1 + 1/a). Plotting turnover at position V on the y-axis versus the word frequency V^(−a) on the x-axis therefore yields an inverse power law with slope −1/a, whose magnitude is just that of the PDF exponent, (1 + 1/a), minus 1. Reali & Griffiths ([1], fig. 2) show a word frequency PDF with exponent −1.74, so we expect a plot of turnover versus word frequency to have a power law slope of −0.74, which is quite close to their demonstrated slope of −0.8.
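The chain of proportionalities in this argument can be written out explicitly. The symbols f (usage frequency), p (the frequency PDF) and T (turnover) are labels introduced here for convenience. The second line uses the fact that the rank of a word equals the number of words at least as frequent as it, so the PDF is proportional to the derivative of rank with respect to frequency.

```latex
\begin{align*}
f(V) &\propto V^{-a} \;\Longrightarrow\; V(f) \propto f^{-1/a} \\
p(f) &\propto \left|\frac{dV}{df}\right| \propto f^{-(1 + 1/a)} \\
T(V) &\propto V \;\Longrightarrow\; T(f) \propto f^{-1/a} \\
1 + \frac{1}{a} &= 1.74 \;\Longrightarrow\; \frac{1}{a} = 0.74,
\quad T(f) \propto f^{-0.74}
\end{align*}
```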

Thirdly, Reali & Griffiths [1] demonstrate ‘S’ curves of replacement that follow naturally from their rational-individual version of the model, following Wright–Fisher. This is an important realization, and by allowing for a population view, we can (i) add the memory parameter to gain flexibility in the resulting turnover, and (ii) apply to multiple variants in such contemporary representations as the ranked popularity list. To see why, consider an original pool of V different variants at time zero, and let x(t) be the number of new variants replacing this original pool. These V variants could represent the different words in a vocabulary, for example, or perhaps different possible grammatical rules in use. This replacement rate is proportional to the constant turnover rate, z, times the diminishing proportion (1 − x/V) of original variants remaining in the pool:

\[ \frac{dx}{dt} = z\left(1 - \frac{x}{V}\right), \]

which is a separable differential equation, with a simple solution

\[ x(t) = V\left(1 - e^{-zt/V}\right). \]

This is not an S curve, because even though it asymptotically approaches V over time, it begins quickly (when x = 0, dx/dt = z), rather than beginning slowly and speeding up. Adding the memory parameter, however, for certain combinations, such as m = 10 time steps and μ = 10%, enables the modified neutral model to generate an S curve of replacement (figure 1).
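The claim that replacement without memory is not an S curve can be checked numerically. The sketch below assumes that the replacement equation reads dx/dt = z(1 − x/V) with x(0) = 0, as the description above implies, and compares a forward-Euler integration with the closed form x(t) = V(1 − exp(−zt/V)). The function names and the values of z and V are hypothetical, chosen only for illustration.

```python
import math


def replaced_closed_form(t, z, v):
    """x(t) = V * (1 - exp(-z t / V)), the solution with x(0) = 0."""
    return v * (1.0 - math.exp(-z * t / v))


def replaced_euler(t_end, z, v, dt=0.01):
    """Forward-Euler integration of dx/dt = z * (1 - x / V) from x(0) = 0."""
    x = 0.0
    for _ in range(round(t_end / dt)):
        x += dt * z * (1.0 - x / v)
    return x


if __name__ == "__main__":
    z, v = 5.0, 100.0  # hypothetical turnover rate and pool size
    for t in (0, 10, 20, 40, 80):
        exact = replaced_closed_form(t, z, v)
        approx = replaced_euler(t, z, v)
        print(t, round(exact, 2), round(approx, 2))
```

Because the rate z(1 − x/V) can only fall as x grows, equal time intervals add ever smaller increments: the curve is concave from the outset and has no slow initial phase.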

Figure 1.

The effect of the simple memory parameter on replacement rate under the simple neutral model, with N = 1000 and μ = 10%. Here we consider an original vocabulary of 100 popular words, and how those words are replaced over time under the model (note that an ‘S’ curve does not follow from the simpler version of the model, where m = 1). Dashed line, m = 1; thick grey line, m = 5; thin grey line, m = 10; thin black line, m = 15; thick black line, m = 20.

In summary, the basic versions of the two models are mathematically the same, but radically different in interpretation: one is aimed at populations of simple imitators, whereas the other is focused on rationally based individual learning through iterated sampling. We argue that there are two reasons not to abandon the population representation. Firstly, it keeps things simple, and modifications and advances from this model (e.g. adding extra parameters such as copying bias [6] or memory [8]) are easier to benefit from and to apply generally. Secondly, in many situations of collective behaviour, which includes language by its very definition, a representation of very simple social learning in a population may be more realistic than a model of individual rational actors. In any case, as vast sets of new word frequency data become available [22], the simplest, most flexible model has the best chance of generating insights across the range of social evolutionary phenomena where the same patterns occur.


  • Received November 30, 2010.
  • Accepted December 22, 2010.