Human languages differ broadly in abundance and are distributed highly unevenly on the Earth. In many qualitative and quantitative aspects, they strongly resemble biodiversity distributions. An intriguing and previously unexplored issue is the architecture of the neighbouring relationships between human linguistic groups. Here we construct and characterize these networks of contacts and show that they represent a new kind of spatial network with uncommon structural properties. Remarkably, language networks share a meaningful property with food webs: both are quasi-interval graphs. In food webs, intervality is linked to the existence of a niche space of low dimensionality; in language networks, we show that the unique relevant variable is the area occupied by the speakers of a language. By means of a range model analogous to niche models in ecology, we show that a geometric restriction of perimeter covering by neighbouring linguistic domains explains the structural patterns observed. Our findings may be of interest in the development of models for language dynamics or regarding the propagation of cultural innovations. In relation to species distribution, they pose the question of whether the spatial features of species ranges share architecture, and eventually generating mechanism, with the distribution of human linguistic groups.
Human diversity expresses itself in vastly different ways in terms of cultural traits, personal identity and relationships [1,2]. The human population is genetically quite similar, and technological advancements have led to personal mobility and communication on a global scale, but cultural diversity remains pervasive to a degree mostly comparable with biodiversity [4,5]. Most studies comparing cultural and biological diversity rely on the language spoken by individuals to define cultural groups; indeed, human languages are among the most easily quantifiable cultural traits, and display a variety that has intrigued scholars for centuries [6,7]. The analogy between biodiversity and human linguistic groups has led to the application of ecological methods to cultural data, often driven by the intuition that analogous features might arise from common generating processes [5,8,9]. Some remarkable patterns that both systems share are the latitudinal diversity gradient and the language–area dependence [9,11], which mirrors the species–area relationship in ecology. Also, the allometric dependence between the area occupied by the speakers of a language and the number of speakers of that language finds a counterpart in the allometric dependence between species ranges and their abundances [14,15]. The specific history of particular species or languages has little to no influence in the construction of these collective statistical patterns.
A common representation of the relationships between biological species is in the form of food webs, where links stand for trophic interactions. Food webs display the notable property of being graphs of high intervality, a feature that is deeply related to the existence of a niche space of low dimensionality [16–18]. A graph is perfectly interval if its nodes can be ordered in such a way that the neighbours of any node occupy positions near that node, with no gaps left in between. A quasi-interval graph has a small number of gaps compared with suitable randomizations of its links. Intervality points to an almost one-dimensional configuration space [16,17], which implies that feeding relationships in food webs can be determined using a single species property (a ‘niche’ variable), and explains the success of models of food-web structure in accounting for many of their topological properties [19–23]. The probability that a food web is interval diminishes with its size, though larger food webs maintain high intervality in comparison with appropriate random models. In humans, interactions occur at many levels, from individuals to confederations of countries, involving a hierarchy of connectivity patterns unfolding at different scales of space and time. Often, agents and their contacts can be depicted as networks embedded in space, a geometrical condition that affects their structure and evolution.
In this contribution, we construct and analyse networks of contacts between human linguistic groups, or language networks for short. Language networks are undirected, spatial networks that make explicit physical contacts between the areas in which different languages are spoken. We apply several measures commonly used in the analysis of complex networks and show that language networks are characterized by atypical topological properties, among which are a lognormal degree distribution, a one-dimensional local structure and quasi-intervality. The relevance of this latter property is assessed through the introduction of three different constructive hypotheses, which eventually allow us to conclude that the distribution of range sizes, together with a simple perimeter-covering rule among spatial neighbours, explains the patterns described. Nonetheless, we conjecture that the fundamental origin of quasi-intervality in language networks must arise from a non-trivial interaction between environmental variables and settlement of human groups, leaving an interesting question open in the area of linguistics.
2. Material and methods
(a) Language networks
Data on world languages have been obtained from the most comprehensive database currently available, the Ethnologue, which contains information on 6900 extant languages. The origin of data in the Ethnologue stems from a collection by SIL International (see http://www.ethnologue.com) and a map developed by Global Mapping International (World Language Mapping System, http://www.gmi.org/wlms/index.htm). In the Ethnologue we find a list of the spatial domains spanned by the speakers of each language and a centroid that is assigned to those domains, a point in latitude–longitude coordinates that best represents their average location. There is only one centroid per language, and centroids are the nodes of language networks. Since a language may have more than one disconnected domain where it is spoken (the sum of domain areas being the range, or total area, covered by the speakers of a language), centroids do not always fall inside speaking domains. Two centroids are connected if the two corresponding languages share boundaries in any of the areas where they are spoken. To avoid insularity effects, only languages found within the 100 largest landmasses of the Earth have been considered. Data in the Ethnologue describe the current distribution of languages, though the observed heterogeneity can be put in correspondence with different evolutionary states. In order to further address the changes in language networks caused by the disappearance of languages and recent mechanisms such as colonization, we have studied how different structural modifications in language network definition affect the topological patterns described below. In addition, we have considered a different dataset regarding the distribution of native languages in North America prior to colonization, to check the robustness of our results (see the electronic supplementary material for further information).
The linkage density of an undirected network is defined as z = 2L/N, where N is the number of nodes and L is the total number of links.
A planar network can be drawn on the plane in such a way that its edges intersect only at their endpoints. Planarity has been checked in our networks through application of Kuratowski's theorem to find the minimum number of links that have to be eliminated to obtain a planar graph (see the electronic supplementary material).
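As a quick complement to a full planarity test, Euler's formula gives a necessary condition: a simple planar graph with N ≥ 3 nodes has at most 3N − 6 links. The sketch below is not the Kuratowski-based procedure used in the text; it only returns a lower bound on the number of links that must be removed, and the function name is hypothetical.

```python
def nonplanar_links_lower_bound(n_nodes, edges):
    """Lower bound on the links to delete before the graph can be planar.

    Uses only the Euler bound L <= 3N - 6 (a necessary, not sufficient,
    condition), so the true minimum can be larger than the value returned.
    """
    # Treat the network as a simple undirected graph: drop self-loops
    # and count each pair of nodes once.
    links = {frozenset(e) for e in edges if e[0] != e[1]}
    if n_nodes < 3:
        return 0
    return max(0, len(links) - (3 * n_nodes - 6))


# The complete graph on five nodes has 10 links, one more than 3*5 - 6 = 9.
k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(nonplanar_links_lower_bound(5, k5))
```

A bound of zero does not prove planarity; it only means that the link count alone does not rule it out.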
The degree distribution p(k) of a language network is the probability that a given linguistic group is in contact with k other linguistic groups. Though languages are often spoken in more than one isolated spatial domain, each border contact counts only once for every possible pair of languages.
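A minimal sketch of this counting rule, assuming contacts are given as a plain list of language pairs (the input format and the function name are illustrative, not taken from the Ethnologue data): repeated border contacts between the same two languages collapse into a single link before degrees are tallied.

```python
from collections import Counter


def degree_distribution(contacts):
    """Return {k: p(k)} from a list of (language, language) border contacts."""
    neighbours = {}
    for a, b in contacts:
        if a == b:
            continue  # a language is not its own neighbour
        # Sets make each pair of languages count once, however many
        # separate domains of the two languages touch each other.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    counts = Counter(len(nbrs) for nbrs in neighbours.values())
    n = len(neighbours)
    return {k: c / n for k, c in sorted(counts.items())}


# "A" and "B" touch in two different domains but share a single link.
contacts = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
print(degree_distribution(contacts))
```

Languages with no contacts never appear in the input list, so they are not counted in this sketch.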
The average shortest path length for a network is the average over all possible node pairs of the minimum number of steps required to go from one node to another through existing links.
The clustering coefficient Ci of node i is defined as the number of connections between pairs of neighbours of i divided by the maximum value this quantity may take, ki(ki − 1)/2. The clustering coefficient of a network is the average over all its nodes, $C = \frac{1}{N}\sum_{i=1}^{N} C_i$. This quantity can be exactly calculated in some simple cases, as for regular networks (i.e. graphs for which linkage density z is uniform for all nodes) embedded in D dimensions, where

$$C = \lambda\,\frac{z - 2D}{z - D} \qquad (2.1)$$

and λ = 3/4.
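The three measures defined above can be sketched in a few lines of plain Python. The example assumes a connected network stored as a dictionary of neighbour sets, and checks the results on a one-dimensional ring lattice against the standard regular-lattice expression C = λ(z − 2D)/(z − D) with λ = 3/4. All function names are illustrative.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring where each node is linked to its k nearest nodes on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def linkage_density(adj):
    """z = 2L/N for an undirected network."""
    links = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * links / len(adj)


def average_shortest_path(adj):
    """Mean breadth-first distance over all reachable node pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def clustering(adj):
    """Average of C_i; nodes with fewer than two neighbours contribute 0."""
    values = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            values.append(0.0)
            continue
        closed = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        values.append(closed / (k * (k - 1) / 2))
    return sum(values) / len(values)


def lattice_clustering(z, dim=1.0, lam=0.75):
    """Regular-lattice clustering, lam * (z - 2*dim) / (z - dim)."""
    return lam * (z - 2 * dim) / (z - dim)


ring = ring_lattice(12, 2)
print(linkage_density(ring), average_shortest_path(ring), clustering(ring))
```

For this ring, z = 4 and the measured clustering coincides with the D = 1 prediction, which is the comparison later applied to language networks.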
A perfectly interval directed network admits a permutation of its nodes such that the ki directed links of any given node i point to a subset of nodes labelled with consecutive indices. This means that the corresponding adjacency matrix (aij), defined by the conditions aij = 1 if there exists a directed link from j to i and aij = 0 otherwise, has no gaps along its columns. If the network is undirected, it is perfectly interval if and only if there exists a node ordering such that the ki connections of node i are restricted to ki circle-neighbours nearby. Therefore, if the node at position i + j is connected to i, so is i + j − 1 (and similarly for node i − n and i − n + 1, respectively). This implies that the corresponding (symmetric) adjacency matrix has no gaps along its columns and rows.
The intervality of a network can be measured through the overall number of gaps G′ along its columns. For a perfectly interval network, G′ = 0. In particular, a one-dimensional regular graph is an example of a perfectly interval network. Conversely, the larger the number of gaps, the lower the intervality of the network. The overall number of gaps depends on the particular node labelling scheme. Hence there exists a node permutation σ = (σi) such that G′(σ) is minimal. This quantity is the empirical number of gaps of the network, G = minσG′(σ). We have used simulated annealing to estimate the minimum number of gaps G in language networks (see the electronic supplementary material).
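The gap count and its minimization can be sketched as follows. This is an illustrative annealing scheme, not the procedure detailed in the supplementary material: it assumes a linear (non-circular) ordering, treats the diagonal of the adjacency matrix as filled, uses pairwise swaps as moves and a geometric cooling schedule. The parameter values and function names are arbitrary choices.

```python
import math
import random


def count_gaps(adj, order):
    """Number of gaps G'(sigma) for the node ordering `order`.

    For each node, a gap is an empty position lying between its leftmost
    and rightmost neighbour (the node's own position counts as filled).
    """
    pos = {node: p for p, node in enumerate(order)}
    gaps = 0
    for node, nbrs in adj.items():
        filled = {pos[node]} | {pos[m] for m in nbrs}
        gaps += max(filled) - min(filled) + 1 - len(filled)
    return gaps


def anneal_gaps(adj, steps=20000, t0=2.0, cooling=0.9995, seed=0):
    """Estimate G = min over orderings of G' by simulated annealing."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    current = count_gaps(adj, order)
    best, best_order = current, order[:]
    temperature = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]          # propose a swap
        trial = count_gaps(adj, order)
        accept = trial <= current or (
            rng.random() < math.exp((current - trial) / temperature)
        )
        if accept:
            current = trial
            if current < best:
                best, best_order = current, order[:]
        else:
            order[i], order[j] = order[j], order[i]      # undo the swap
        temperature *= cooling
    return best, best_order


# A chain of six nodes is perfectly interval in its natural ordering.
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
print(count_gaps(chain, [0, 1, 2, 3, 4, 5]), count_gaps(chain, [0, 2, 4, 1, 3, 5]))
```

Because annealing is a heuristic, the value it returns is an upper estimate of the true minimum; several independent runs with different seeds are a sensible precaution.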
We constructed the world's network of contacts and extracted from it the subgraphs corresponding to Africa, Asia, Europe and the Americas. For each subgraph, a connected component analysis was performed. World languages can be grouped into a set of connected networks of variable size. To analyse topological properties, we have selected the 13 largest connected components, with sizes ranging from 2126 nodes (Continental Africa) to 33 (a group of languages located in the borders between Argentina, Bolivia and Paraguay—ABP borders; table 1). Figure 1a depicts a subset of the network obtained for New Guinea. Analogous results and maps of networks for all other cases studied are provided as electronic supplementary material.
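The connected component analysis mentioned above can be sketched with a simple graph traversal. The adjacency format (a dictionary of neighbour sets) and the function name are assumptions made for illustration.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:                      # depth-first traversal
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    stack.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy example: a chain of three languages, a pair, and a solitary language.
toy = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
print([len(c) for c in connected_components(toy)])
```

Selecting the largest components then amounts to taking the first entries of the returned list.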
(a) Topological properties
(i) Planarity
Despite the existence of strong spatial restrictions in our networks—a constraint that often facilitates planarity—language networks are non-planar in general. Small language networks are planar or almost planar, but the larger the network, the larger the fraction of non-planar links (table 1). Planarity is broken due to the variable number of isolated domains where a language is spoken and to multilingualism, which causes different domains to overlap (see the electronic supplementary material for details).
(ii) Degree distribution
The distribution p(k) of the number of neighbours of a given linguistic group presents a peak at value 2–4 and a fat tail that extends to high degrees (up to 125 for Mandarin Chinese in Continental Asia). In all cases, the degree distribution of language networks is compatible with a discrete lognormal distribution. This means that most languages have a similar number of neighbours, but there is a small fraction of exceptions with a large number of connections. Figure 1b shows the degree distribution for New Guinea's network. In order to assess the likelihood that empirical degrees of nodes arise from independent trials of a lognormal distribution, we have compared this model with two others: an Erdős–Rényi model, characterized by a Poisson degree distribution, and a modified Watts–Strogatz model for which an analytical expression of its p(k), based on the derivation for the original case, has been calculated (see the electronic supplementary material). We have used maximum likelihood for parameter estimation and Akaike's information criterion for model comparison. The lognormal model is rejected only in one case (for New Guinea's degree distribution) at a 5% significance level. Degree distributions for the remaining networks, together with parameter estimates from lognormal fits as well as the quantitative comparison between the models tested, can be found in the electronic supplementary material.
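The model comparison can be illustrated with a small sketch that fits a Poisson and a discretized lognormal distribution by maximum likelihood and compares their AIC values. Several choices here are assumptions made for illustration only: the lognormal is discretized by evaluating its density at the integers k ≥ 1 and normalizing up to a cut-off, parameters are found by a coarse grid search rather than a numerical optimizer, and the degree sample is synthetic.

```python
import math
from collections import Counter


def poisson_aic(degrees):
    """AIC of a Poisson model; the maximum-likelihood rate is the mean degree."""
    lam = sum(degrees) / len(degrees)
    loglik = sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in degrees)
    return 2 * 1 - 2 * loglik


def lognormal_aic(degrees, k_max=None):
    """AIC of a discretized lognormal on k = 1..k_max (grid-search fit)."""
    counts = Counter(degrees)
    k_max = k_max or 10 * max(degrees)          # assumed truncation point
    logs = [math.log(k) for k in range(1, k_max + 1)]
    best = -math.inf
    for mu in (0.1 * i for i in range(41)):
        for sigma in (0.1 * j for j in range(1, 21)):
            # Unnormalized log-weights of the lognormal density at each k.
            log_w = [-(x - mu) ** 2 / (2 * sigma ** 2) - x for x in logs]
            log_z = math.log(sum(math.exp(v) for v in log_w))
            loglik = sum(c * (log_w[k - 1] - log_z) for k, c in counts.items())
            best = max(best, loglik)
    return 2 * 2 - 2 * best


# Synthetic degrees: most nodes have 2-4 neighbours, a few have many.
sample = [2] * 30 + [3] * 30 + [4] * 20 + [6] * 10 + [10] * 5 + [20] * 3 + [60] * 2
print(poisson_aic(sample) > lognormal_aic(sample))
```

The model with the lower AIC is preferred; for a sample with a long tail such as this one, the Poisson model is heavily penalized by the few high-degree nodes.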
(iii) Average shortest path length
For each language network, we have calculated the empirical value of the average shortest path length, which has been compared with lengths rendered by different models for which the functional dependence between path length and the size of the network N is known. Language networks are mostly compatible with two-dimensional, planar networks of similar average degree (square or hexagonal lattices; see the electronic supplementary material), which indicates that nodes are ‘separated’ on average as if linguistic domains were spatially distributed yielding a perfectly planar network of contacts. Two cases that show significant deviations are Continental Africa and Continental Asia, which actually contain the largest fraction of non-planar links (0.43 and 0.46, respectively) among all networks analysed. In agreement with this fact, they present average shortest paths well below the value expected for regular, planar networks with comparable linkage density z.
(iv) Clustering
The average clustering coefficient obtained for language networks has been represented in figure 2 as a function of z. When the functional form (2.1) expected for regular networks is fitted to the data, we obtain a reasonable fit with parameters D = 0.84 (95% CI: (0.56, 1.12)) and λ = 0.68 (95% CI: (0.57, 0.80)). Therefore, language networks seem to behave locally as one-dimensional networks. This is remarkable considering that language networks are naturally embedded in the two-dimensional space, and points to a non-trivial reorganization of neighbouring relationships.
Clustering values are large when compared with random networks with the same linkage density, for which $C \simeq z/N$. Hence, though we could not discard that an Erdős–Rényi model matched the degree distribution of New Guinea language network, the random model cannot explain the high clustering measured (table 1). In general, no model without spatial correlations can account for high values of the clustering coefficient when z is low.
Contrary to what is observed for the shortest path length, the clustering analysis described above reveals that language networks exhibit local topological features compatible with those of one-dimensional regular networks. This suggests that our networks might be described using a reduced number of variables embedded in a low-dimensional space, as reported in previous work for food webs [16–18,22]. To substantiate this possibility we have quantified to what extent language networks are close to one-dimensional regular graphs by analysing their intervality.
(v) Language network intervality
The values of the empirical number of gaps G obtained for language networks as a measure of their degree of intervality are summarized in table 1. An example of a node ordering that minimizes the number of gaps for New Guinea is shown in figure 3a (other networks are provided in the electronic supplementary material).
(b) Assessment of the significance of intervality in language networks
The absolute number of gaps is not informative per se of the degree of intervality of a network, since G depends on the network size, on the number of links it has and, in general, on the precise connectivity pattern. Therefore, G has to be compared with appropriate models able to reveal whether the obtained value indeed originates from high intervality or whether it is a generic property of networks sharing some of the topological features described. In order to assess the significance of intervality levels in language networks, we have first devised two random models that conserve the degree distribution plus another a priori relevant ingredient: the spatial random model (SRM) and the planar random model (PRM). These models fail at recovering, among others, the intervality of language networks. Finally, and inspired by niche models in ecology, we introduce the range contact model (RCM), which is shown to accurately reproduce the structural patterns observed.
(i) Spatial random model
Let us hypothesize that the topological structure of language networks arises from local spatial restrictions in such a way that links can only be established between nodes (centroids) that are at a certain mutual distance on Earth's surface. For this model, we have thus chosen to preserve, in addition to the degree distribution, the empirically obtained distribution of physical distances between pairs of centroids. These empirical distributions are compatible with lognormal distributions in most cases, thus implying the existence of a typical distance for linkage but also a non-negligible probability that distant centroids are linked. The preservation of the distance distribution is a qualitative way to account for the restrictions imposed by a two-dimensional space—it seems unreasonable that links can be drawn arbitrarily between centroids regardless of their mutual separation. We performed 50L link rewirings to randomize language networks under the two previous assumptions. Then, we estimated the minimum number of gaps GSRM for the network so obtained, and repeated for 500 independent realizations. The distribution of GSRM values takes a Gaussian shape (figure 3e,f; averages are reported in table 1) that has been used to estimate the probability p that GSRM is smaller than the empirical number of gaps G. There is only one instance where we cannot reject this hypothesis at a 1–99% confidence interval: ABP borders (see the electronic supplementary material).
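The randomization and the significance test can be sketched as below. Note the simplification: the rewiring shown preserves only the degree of every node (a double-edge swap) and does not enforce the distance distribution that the SRM also conserves. The Gaussian tail probability follows the approach described above. Function names and the example network are illustrative.

```python
import math
import random
import statistics


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomize links by double-edge swaps; every node keeps its degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:            # choose one of the two pairings
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Reject swaps that create self-loops or duplicate links.
        if len(new1) < 2 or len(new2) < 2 or new1 == new2:
            continue
        if new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def gaussian_lower_tail(g_empirical, g_model_samples):
    """P(G_model < G_empirical) under a Gaussian fit to the model samples."""
    mean = statistics.mean(g_model_samples)
    sd = statistics.stdev(g_model_samples)
    return 0.5 * (1 + math.erf((g_empirical - mean) / (sd * math.sqrt(2))))


# Ten nodes on a ring with extra chords, so every node has degree 4.
links = [(i, (i + 1) % 10) for i in range(10)] + [(i, (i + 3) % 10) for i in range(10)]
shuffled = rewire_preserving_degrees(links, n_swaps=50, seed=1)
print(len(shuffled), gaussian_lower_tail(10, [9, 10, 11]))
```

In practice each rewired network would be passed to a gap-minimization routine, and the resulting values collected over many realizations before the tail probability is evaluated.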
(ii) Planar random model
Our second model corresponds to networks where the empirical degree of planarity is preserved. To this end, only links in the previously identified planar component are rewired in a way that maintains planarity and the degree of the node. The PRM assumes that the planar component of language networks is strong and should be conserved. The number of rewirings allowed in this case is significantly smaller than under distance-preserving rewiring. Therefore, we have rewired 10L links to generate random networks under the PRM conditions, and repeated the procedure 500 times. As above, the minimum number of gaps GPRM has been estimated for each PRM network. The distributions are also well approximated by Gaussian curves, again used to test the likelihood that the PRM explains the observations: in this case, this hypothesis is consistently rejected for all empirical networks (see the electronic supplementary material). The probability distribution of GPRM for New Guinea has been depicted in figure 3f. Average values of GPRM are summarized in table 1: they are systematically far from empirical values in language networks.
Figure 3a–c depicts optimal orderings obtained for networks generated through SRM and PRM together with the permutation that maximizes intervality for the empirical network corresponding to New Guinea. Both SRM and PRM qualitatively yield many more gaps (i.e. lower levels of intervality) than those in language networks. Interestingly, SRM and PRM implicitly reinforce the two-dimensional structure of contacts between the ranges of linguistic groups, a feature that seems to blur the one-dimensional structure uncovered by clustering and high intervality.
(iii) The range contact model
Neither of the two putative explanations analysed is able to account for the high intervality observed. At this point, it seems necessary to resort to different kinds of models if we wish to explain not just the high intervality measured in language networks, but also their uncommon degree distribution or the local similarity to networks embedded in a one-dimensional space. Inspired by niche models for food-web structure, which by definition entail a one-dimensional organization, we have devised a model for language networks, the RCM, where the relevant variable is the total area over which linguistic groups are spread. Our working hypothesis is that the lognormal distribution of areas and the lognormal degree distribution of language networks are somehow related through actual spatial contacts between neighbouring linguistic groups ordered along a one-dimensional ring. Group interactions, expressed as conflicts for territory, coupled to demographic growth can quantitatively account for the lognormal shape of the distribution of areas. Our expectation is that other topological properties may also follow from an effective arrangement of areas stemming from an intuitive condition on neighbouring domains: the assumption that the perimeter of any domain is covered by the sum of shared perimeters across all of its neighbours.
The RCM is defined as follows. (i) N′ random numbers are drawn from a lognormal distribution with parameters (μa,σa). Each of them represents an area ai. (ii) Areas are arranged along a one-dimensional space in no particular order. (iii) A directed link connects i to its adjacent neighbours j = i ± 1, i ± 2, … (1 ≤ j ≤ N′) until the condition $\sqrt{a_i} \leq f \sum_{j \in nn(i)} \sqrt{a_j}$ is first fulfilled. This amounts to assuming that the perimeter of domain i is covered by the sum of shared perimeters of all its neighbours. Parameter f weights the average fraction of perimeter shared by domain i with each of its neighbours. (Note that, for regular tilings, f = 1/z. In general, f is inversely correlated to the linkage density in empirical networks, but a precise functional relationship cannot be systematically derived.) The set of nodes linked to i, nn(i), is determined as follows: the initial link is established with the left or right neighbour with equal probability, and subsequent links occur with neighbours on alternating sides, not previously considered, and in order of decreasing proximity to i. The procedure is repeated for each area i; note that the order in which areas are selected is so far irrelevant. (iv) By construction, the network generated through steps (i)–(iii) is directed. Since language networks are undirected, we introduce an additional parameter q that sets the probability that a directed link is complemented by its reverse counterpart; with probability 1 − q the existing link is removed. In this symmetrization process, some nodes or small groups of nodes in the network might become disconnected from the bulk. We have checked that the final networks used are connected by discarding these small disconnected components, and accepting RCM networks only if their final size differs from N by at most 0.5%. The likely elimination of some nodes under application of the algorithm is the reason to begin with N′ > N nodes.
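Steps (i)–(iv) can be sketched as follows. This is an illustrative reading of the model, not the authors' implementation. It assumes that perimeters scale as the square root of areas, so that the covering condition reads f Σ √a_j ≥ √a_i, and that the one-dimensional arrangement is a closed ring. The final filtering of small disconnected components is left out, and all names are hypothetical.

```python
import math
import random


def range_contact_model(n_prime, mu, sigma, f, q, seed=0):
    """Return (areas, undirected links) for one realization of the sketch."""
    rng = random.Random(seed)
    # (i)-(ii) lognormal areas, placed on a ring in the order drawn.
    areas = [rng.lognormvariate(mu, sigma) for _ in range(n_prime)]

    # (iii) directed links to alternating-side neighbours until the
    # perimeter of domain i is covered (assumed perimeter ~ sqrt(area)).
    directed = set()
    for i in range(n_prime):
        target = math.sqrt(areas[i])
        covered = 0.0
        side = rng.choice((-1, 1))          # first neighbour: left or right
        steps = {-1: 0, 1: 0}
        while covered < target and steps[-1] + steps[1] < n_prime - 1:
            steps[side] += 1
            j = (i + side * steps[side]) % n_prime
            directed.add((i, j))
            covered += f * math.sqrt(areas[j])
            side = -side                    # alternate sides

    # (iv) symmetrization: reciprocal links are kept; a one-way link is
    # completed with probability q and removed otherwise.
    links = set()
    for i, j in sorted(directed):
        if (j, i) in directed or rng.random() < q:
            links.add((min(i, j), max(i, j)))
    return areas, links


# With identical areas and f = 1/4 every domain needs exactly four
# neighbours, which reproduces the regular-tiling relation f = 1/z.
_, regular = range_contact_model(20, mu=0.0, sigma=0.0, f=0.25, q=0.0, seed=1)
print(len(regular))
```

With a broad area distribution, large domains collect many one-way links from small neighbours, and the parameter q controls how many of these survive as contacts.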
Variations in parameter μa mostly cause a rescaling of the areas, leaving any other topological property of the resulting networks essentially invariant. Therefore, we fix μa to its empirical value (see the electronic supplementary material), and the RCM model is left with three relevant parameters: the dispersion σa of the lognormal distribution of areas, the fraction of shared perimeter f and the symmetrization probability q.
(iv) Comparison of the range contact model with language networks
The values of parameters that better render the empirical adjacency matrix of each of the 13 studied networks are obtained through a maximum-likelihood procedure (see the electronic supplementary material for further information).
The degree distribution yielded by the RCM is fully compatible with data in all cases. An example of the goodness of fit can be seen in figure 1b. The RCM distributions of the remaining networks also show an excellent agreement (see the electronic supplementary material). The RCM reproduces with reasonable accuracy the values of the clustering coefficient and the average shortest path length (table 1). Probably the most remarkable result concerns the distribution of the minimum number of gaps, GRCM, derived from the model. The distribution of this variable has been obtained through 500 independent RCM realizations for each of the 13 language networks analysed (figure 3e,f displays the RCM distribution for New Guinea). The hypothesis that the degree of intervality of language networks can be accounted for with RCM networks cannot be rejected in any case at a 1% significance level. In addition, the RCM accounts for the local structure of language networks measured through the distribution of the number of gaps per node (see the electronic supplementary material for details). An example of an optimal RCM network mimicking New Guinea's language network is represented in figure 3d.
The same hypothesis testing has been conducted for networks modified according to three different mechanisms: first, a procedure of domain aggregation that mimics language colonization; second, the removal of hubs (i.e. widespread languages) from language networks; and third, the use of available, high-resolution data of pre-colonial language distributions. High intervality of language networks remains a robust pattern under such modifications, akin to different processes of language network evolution. A detailed account of results can be found in the electronic supplementary material.
The topological structure of networks of contacts between linguistic groups is consistently similar in all cases analysed despite likely differences in the accuracy of language identification in different world regions. This indicates that the characteristics uncovered are generic and robust under different classifications (such as if more taxonomic levels are considered for languages or different cultural traits are used) and under modifications that mimic the natural processes affecting language networks, as we have shown. Language networks present a particular architecture previously unseen in any other networks described in the literature. A lognormal-like degree distribution has been rarely observed [28,29], and to the best of our knowledge never reported in spatial networks. Table 2 summarizes the main properties of the latter in comparison with language networks. Language networks constitute the second natural example of quasi-interval graphs, together with food webs [17,18]. This property supports that the architecture of language networks is mainly driven by a single quantitative attribute of nodes, which has been shown to be the area occupied by linguistic groups. Further support that domain area is the quantity that shapes language networks stems from the positive correlation between area and node degree, as the RCM trivially predicts (these results will be published elsewhere). In analogy with niche models, we have introduced the RCM, which successfully reproduces the structure and organization of language networks. Other network models parametrized in two dimensions have been successful in reproducing certain food-web properties, a small number of species attributes being needed to explain their global topology. Confronting language network data with those models is a future direction worth pursuing.
The number of neighbours of a given language depends on its area of spread, a quantity strongly correlated to the number of speakers. The number of contacts is also a measure of the likelihood of conflicts between different groups. It has been argued that the frequency and strength of those conflicts affect the area occupied by the group. A particular form of conflict between neighbouring languages is competition for speakers. The dynamics of extinction of languages is influenced by the attractiveness of competing languages, by geography and, plausibly, by the number of competitors, which we have shown to vary broadly.
Overall, language networks might be regarded as a first approximation to networks of contacts between cultures. As such, their topology may have implications in the way cultural innovations (e.g. farming, animal domestication or iron tools) spread in the past, and in the modelling of the spreading process. A common language is a fast vehicle for the dissemination of knowledge assuming that individuals speaking the same language experience stronger ties than those shared with other linguistic groups. The existence of a complex underlying topology of contacts may entail qualitative changes in the propagation dynamics, as compared with propagation on homogeneous media. This modification echoes how our understanding of epidemic spreading was improved upon the introduction of heterogeneous networks and calls for a deeper study of its effect in the cultural relationships that might be established between human groups.
High intervality, a property reflecting a one-dimensional underlying ordering of nodes (linguistic groups in our case), is indeed a remarkable feature considering that language networks are embedded in two-dimensional space. Although language domains are clustered together, contacts between them are such that the spatial ordering of languages resembles one-dimensional arrays. This suggests that linguistic communities interact along certain directions to a greater extent than would be expected for spatial networks with low intervality. These patterns are robust throughout different regions across the world, and could be used to further improve our understanding of language organization, change and extinction.
The placement of cultural groups is plausibly related to properties of the landscape. Mountain ranges, coastlines, rivers and fertile valleys condition the position and extension of human settlements, as well as preferred directions for movement and group interaction [36–38], which seem to partly eliminate the freedom of a two-dimensional space in favour of linear interactions among neighbouring groups. Indirect evidence of the role played by the environment arises from the significant dependence between linguistic diversity and, especially, landscape roughness and river density. Whether an explicit consideration of topography might explain the quasi-intervality of language networks is an open question that deserves additional attention.
Widespread languages play a relevant role in several of the issues tackled. Usually, they have many neighbours and are responsible for most ‘shortcuts’ in our networks and, consequently, for decreasing intervality. The elimination of those languages in the Ethnologue networks, or their progressive appearance through models that effectively consider modern evolutionary processes of language extinction and growth, shows however that they are not essential in determining the topological properties uncovered. Widespread languages are the hubs of language networks, though at the same time they percolate across continents, and cause the isolation and fragmentation of groups of minority languages. That is the case for North America, with 609 remaining languages forming 30 disconnected components located on the continental landmass. Asia also holds an astonishingly large number of solitary languages (34% of its total diversity) and many disconnected components. However, the latter are mostly due to the abundance of large islands, not to fragmentation on the mainland. The structure of language networks is in continuous transformation due to the sustained growth of widespread languages and to the disappearance of many others: 3500 languages are predicted to become extinct within the next century. Extinction dynamics are likely to be affected by variations in contacts with potentially competing languages, but also by increasing isolation and area shrinkage. These factors find their counterpart in ecology. Habitat fragmentation leads to the isolation of species, to a reduction of their home ranges, and eventually to an accelerated extinction. It would be interesting to extend our analysis to networks of contacts between species ranges. An intriguing question is whether the architecture of those networks belongs to the class here described and, in that case, whether cultural and biological diversity patterns are the final products of generic constructive processes.
As our knowledge increases, so does evidence supporting the qualitative and quantitative parallelisms between both evolutionary systems.
J.A.C. and S.M. conceived and designed the research; J.A.C. performed the research; J.A.C., J.B.A. and S.M. analysed the data; J.A.C. and J.B.A. contributed materials/analysis tools; J.A.C. and S.M. wrote the paper.
The authors acknowledge financial support from Spanish MICINN through projects FIS2011–27569 and FIS2011–22449 (J.A.C.), from Comunidad de Madrid through project MODELICO, S2009/ESP-1691 (J.A.C., S.M.) and from the Carlsberg Foundation (J.B.A.).
The authors are indebted to Jacobo Aguirre, Sara Cuenda, José A. Cuesta, Anxo Sánchez, Daniel B. Stouffer, Damián H. Zanette and two anonymous reviewers for constructive criticism of the manuscript.
- Received December 2, 2014.
- Accepted December 23, 2014.
- © 2015 The Author(s) Published by the Royal Society. All rights reserved.