## Abstract

Networks play a prominent role in the study of complex systems of interacting entities in biology, sociology, and economics. Despite this diversity, we demonstrate here that a statistical model decomposing networks into *matching* and *centrality* components provides a comprehensive and unifying quantification of their architecture. The *matching* term quantifies the assortative structure in which node makes links with which other node, whereas the *centrality* term quantifies the number of links that nodes make. We show, for a diverse set of networks, that this decomposition can provide a tight fit to observed networks. Then we provide three applications. First, we show that the model allows very accurate prediction of missing links in partially known networks. Second, when node characteristics are known, we show how the *matching–centrality* decomposition can be related to this external information. Consequently, it offers us a simple and versatile tool to explore how node characteristics explain network architecture. Finally, we demonstrate the efficiency and flexibility of the model to forecast the links that a novel node would create if it were to join an existing network.

## 1. Introduction

The modern world is an increasingly connected place, through transport, social, and economic networks, and via our knowledge of interactions at the ecological or molecular level [1–3]. It is increasingly recognized that such systems should be studied globally, and networks of interacting entities provide us a powerful representation of their structure and function. Research on network theory parallels this growth [3]. A first body of research concentrates on the fact that observed networks are often considered to be only partially known. This would be the case in a food web, for instance, in which some real interactions may have yet to be observed, or a protein interaction network where not all pairwise combinations had been tested in the laboratory. Thus, observed links are typically considered as certain, whereas an absence of a link between a pair of nodes may reflect an absence of information rather than a real absence of interaction. Models have been devised to predict these ‘missing links’ and thus correct a network dataset for this incomplete sampling or direct future research towards these candidate interactions [4,5].

A second domain aims to determine if the structure of these networks exhibits basic generalities, and to uncover the processes that may generate these patterns. This aspect has been tackled with a variety of mostly comparative approaches, such as those treating the classification of networks [6,7], motifs [8], or stochastic models [9]. Progress in this undertaking could be achieved if there were general methods to relate network structure to characteristics of the nodes. For example, body size has been related to patterns in food webs [10], or country politics and trade to the organization of military conflict networks [11].

A third potential application of research is network forecasting, which aims to predict the links made by new nodes joining a network. Many current issues facing human society would benefit from the ability to forecast networks, such as for the ecological interactions of invasive species [12], the molecular interactions of a newly discovered protein [13], or the links of a subversive social group [14]. However, there exists no general framework for this forecasting. Here we provide a general model that can be applied to all three domains of network research.

The key feature of our methodology is the development of a model for the probability of interaction between nodes based on the decomposition of network architecture into *matching* and *centrality* terms. The *matching* term aims to quantify assortative structure in who makes links with whom [15–17], whereas the *centrality* term captures variation in the number of links that nodes make. Typically, research on network structure has focused on patterns in either assortativeness or *centrality*. However, the architecture of empirical networks is usually a product of both features simultaneously [18]. Here, we take into consideration both patterns. Specifically, the decomposition is implemented at the node level, with each node characterized by latent traits of *matching* and *centrality*. Latent traits are variables whose values are unknown *a priori*, but can be estimated *a posteriori* from the network adjacency matrix itself [19,20]. The model, called the *matching–centrality* model, is implemented in such a way that the closer the *matching* traits of two nodes, the greater the probability that they are linked, and the higher the *centrality* trait of a node, the greater the probability that this node makes links. This model belongs to the general class of ‘hidden’ variables models [5,19–24], for which some variables are unknown *a priori*, but can be estimated from the data *a posteriori*. In our model, these ‘hidden’ variables are the latent traits of *matching* and *centrality*.

Based on a dataset of 86 networks from disparate fields, we show that this decomposition into *matching* and *centrality* terms can provide a precise fit to observed networks. Then, we provide three applications of the *matching–centrality* model. First, we show that the model can be used to accurately predict missing links. Second, we show that the latent traits of *matching* and *centrality* are not just abstract traits, but can be linked to external information about the nodes and thus provide a means to study network organization. For example, in a food web, the latent traits are related to the body size and phylogeny of the interacting species. Finally, by placing latent traits as intermediates between the network structure and characteristics of the nodes, the model offers the possibility to forecast the interactions made by novel nodes when joining the network. For example, in a spatial network of mammal communities on mountains, we show that we can accurately predict the mammal fauna of unsampled mountains, or in a food web the trophic links for a new incoming species.

## 2. Methods

We formulate our *matching–centrality* model for undirected bipartite networks, but it can be applied to any kind of undirected or directed network, as explained below. Bipartite networks are made of two sets of nodes ( and ) with connections only between them and not within; plant–pollinator networks provide a classical example. Let *A* be the adjacency matrix of the network, i.e. if there is a link between nodes *i* and *j*, and zero otherwise. The model characterizes each node *i* in set by a latent trait of *centrality* denoted and by latent traits of *matching* denoted [15–17], and similarly, each node *j* in set by a *centrality* trait and *matching* traits The value of *d* gives the number of *matching* space dimensions and can be tuned to improve the goodness of fit of the model. We take a statistical approach, in which the probability of existence of a link between a pair of nodes *i* and *j* (hereafter the linking probability ) is modelled through its logit [24]. Our model is given by
2.1where are positive constants that scale the relative importance of the *matching* and *centrality* terms and *m* the common intercept. Figure 1 depicts patterns in interaction networks that the model is able to capture. When only the *matching* term is present, the model is perfectly tailored to fit a modular structure (figure 1*a*) [3,25,26]. In turn, the *centrality* term can perfectly fit highly nested networks [29] (figure 1*c*), because it captures the variation in node degree [3,9,27,28]. The latent traits of *centrality* are usually highly correlated to node degree. Therefore, when both the *matching* and the *centrality* terms are present, the model can fit simultaneously the modular and nested components of network structure (figure 1*b*).

For a given dimension *d*, the model parameters and latent trait values for each node can be estimated using a simulated annealing algorithm [30]. The likelihood of the model is computed by
2.2i.e. we assume the presence/absence of links follows a multi-Bernoulli distribution and that the probabilities are conditionally independent given the parameters and the latent traits. Moreover, to make the parameters and the latent traits of the model uniquely defined, we have to impose some constraints

(1) All vectors of latent traits are orthogonal to the unit vector,

(2) All vectors of

*matching*traits are pairwise orthogonal, and similarly for the vectors and(3) The length of the vectors is set to where is the number of nodes in set Similarly for set the length of the vectors is set to

Application to other types of network requires simple modifications: for directed unipartite networks, e.g. food web or military conflict networks, the two sets of nodes reflect the function of the nodes (consumer and resource species, initiator and target nations, respectively); thus, a node can appear in both sets. For undirected unipartite networks such as social networks, the adjacency matrix is symmetric; we have to impose , and (the probability of a self-link is equal to zero). For more complex networks like directed or undirected multipartite networks, linking probabilities must be set to 0 for pairs of nodes that, by definition, cannot be linked. Note that our model is designed here for qualitative networks; its application to weighted networks would require modifying the likelihood function (equation (2.2)) to include the probability density for the weights of the links, in a similar way as in zero-inflated models [31].

## 3. Results

### (a) Performance of the model

We illustrate the ability of the *matching–centrality* model to capture network architecture using a set of 86 examples from disparate fields: social interactions in Zachary's karate club network [32], co-appearance of characters in the novel *Les Misérables* [33], United States college football (USCF) games of the USCF teams [34], social networks of long-lasting association between 62 dolphins [35], associations between the terrorists involved in the September 11 attacks [14], military conflicts between countries [11], a subset of the network of physical interactions between the nuclear proteins in *Saccharomyces cerevisiae* [2,36], the neural network of *Caenorhabditis elegans* [37], 18 food webs, 59 mutualistic ecological networks [38], and the presence/absence of data of mammal species on peaks within the southern Rocky Mountains [39] (see electronic supplementary material for details; data can be downloaded from Dryad [40]). Once fitted, the model provides a new visual representation of the network in the latent trait space (see electronic supplementary material, figures S1–S11).

We fit the model for one and two dimensions of *matching* traits, and calculate the fraction of correctly fitted links and the McFadden's pseudo-*R*-squared [41] as metrics for its performance. The fraction of correctly fitted links is defined as the true positive rate after having classified the presence/absence of the links in the following way. We choose a cut-off point in linking probability and then classify a link to be present between a given pair of nodes if its linking probability is higher than the cut-off point; otherwise, the link is classified as absent. The level of the cut-off point is chosen so that the false-positive rate is equal to the false-negative rate. The McFadden's pseudo-*R*-squared is given by where *L* is the likelihood of the model (equation (2.2)) and is the likelihood of the model without the latent traits, i.e. only with the intercept *m*.

Figure 2 shows the performance of the model as a function of the number of nodes for one and two *matching* dimensions. We observed that the performance of the model decreases with the number of links, whereas it increases with the number of matching dimensions. This behaviour is expected, as increasing the number of nodes leads usually to more complex networks, and more matching dimensions are needed to reach a high level of fit. This is somewhat akin to a principal component analysis where increasing the number of dimensions leads to a higher fraction of explained variance. Theoretically, by increasing the number of *matching* dimensions *d*, it would be possible to obtain a fraction of correctly fitted links and *R*-squared equal to 1. However, an increase in the number of *matching* dimensions has drawbacks. First, it will lengthen computation time. More fundamentally, it can lead to an overfitting of the network. In such a situation, the prediction of missing links becomes meaningless, in a way analogous to increasing the degree of the polynomial for interpolation in a regression context. Ultimately, the choice of the number of *matching* dimensions *d* must be made in accordance with the application of the model.

### (b) Relation between the *matching–centrality* model and network architecture

We explore how the relative importance of the *matching* term (for ) over the *centrality* term (i.e. ) affects the architecture of the network. We simulated bipartite networks of and nodes, with a connectance of 0.15, along a gradient of relative importance of the *matching* term. For each sampled network, we computed its level of nestedness [29] and modularity [3,26]. Figure 3 shows that, with greater importance of the *matching* term, modularity, indeed, increases, while nestedness decreases. Moreover, increasing the clustering in the matching traits (from a uniform distribution to four clusters) also increases the level of modularity (figure 3*a*). This result is qualitatively robust to change in network connectance and size (results not shown).

## 4. Applications of the *matching–centrality* model

### (a) Prediction of missing links

The *matching–centrality* model can be used for the prediction of ‘missing’ links in partially known networks, where the absence of an interaction may in fact reflect an absence of information [4,5,42]. Here, we demonstrate its performance by simulating missing links in a subset of eight networks (figure 4). We simulate missing links by removing a given percentage of links and attempting to recover them.

Specifically, we simulated networks with missing links by setting a given fraction of 1 s to 0 s in the adjacency matrix. We removed at random 2, 5, 15, 30, 50, 75, and 90% of the 1 s, replicated 100 times for each fraction. Then, the *matching–centrality* model was fitted to the incompletely observed networks, and latent traits were estimated for each node. We fitted the model with only one dimension of *matching*. These *matching* and *centrality* traits were then used to estimate linking probabilities for each pair of nodes (equation (2.1)). We judge the performance by comparing the matrix of estimated linking probabilities and the true network using the area under the receiver operating characteristic curve (AUC) criterion. Here, the AUC can be interpreted as the probability that missing or observed links are given higher linking probabilities than real absences. For comparison, we give the performance of four other methods: the stochastic block model [5] with its extension to directed and bipartite networks [44–46], the Jaccard index, the common neighbours, and the degree product [42]. Note that some methods were not designed for directed or bipartite networks, and were not applied in these specific cases. In practical applications, the absent links () with the highest predicted linking probabilities are considered to be missing links. These are the candidate interactions that should come under the scrutiny of researchers, thus serving as a guide for cost-effective analysis of complex systems.

Figure 4 shows that the *matching–centrality* model performs very well in recovering missing links. There is one notable exception for the football game network, for which the stochastic block model performs better (figure 4*c*). This is not surprising given that this specific network exhibits a strong block structure (see electronic supplementary material, figure S3). Note that we also tested the method with two *matching* dimensions. We encountered overfitting with increasing percentages of removed links, and therefore, a decrease in the ability to recover missing links (results not shown). For larger networks, however, more than one *matching* dimension may be needed.

### (b) Linking latent traits to node characteristics

As shown above, the model can be fitted and used for prediction simply based on the network itself. However, it also offers an intuitive tool to gain insights into possible processes underlying network structure, as the *matching* and *centrality* traits of nodes can be related to independent information about the nodes, using standard analyses such as linear models or Mantel tests.

For example, in the food web of Tuesday Lake [43,47], both the *matching* and *centrality* traits of predators and prey can be related to their body size and phylogeny. Specifically, we used phylogenetic regression [48] to relate the *matching* and *centrality* traits to species’ body size and phylogeny. Therefore, we assume that the latent traits follow a multivariate normal distribution (MVN), where the linear term is given by the logarithm of the body size and where the correlation structure is induced by the phylogeny, i.e.
4.1where denote the vectors of the *matching* and *centrality* traits of resources and consumers, respectively, the body sizes, is the variance–covariance matrix induced by the phylogeny, and *α*, *β*, and *λ* are the parameters of the phylogenetic regression. We use Pagel's-*λ* [49] structure for the variance–covariance matrix, i.e.
4.2where is the common variance, is the proportion of time that species *i* and *j* spent in common before speciation on the phylogenetic tree, and *λ* is the control parameter for the strength of the phylogenetic correlation (*λ* = 0 is equivalent to no correlation). The *p*-values of the parameters *α* and *β* are computed with the usual *z*-test, whereas the *p*-value associated with the correlation structure is computed using a log-likelihood ratio test between models with and without correlation. The analyses were done in R [50] with the libraries ape [51] and nlme [52].

For this specific network, the result shows that both the *matching* and *centrality* traits are related to the species’ body size and phylogeny (electronic supplementary material, table S2). Although in this specific example we used only phylogeny and body size, any other relevant ecological and behavioural traits, or environmental conditions, could be used [53–56]. The latent traits are thus not just an abstract characterization of the nodes, but provide a versatile method to unravel factors underlying the different aspects of network structure.

### (c) Forecasting the links of new nodes

Finally, a significant feature of the *matching–centrality* model is the possibility to forecast the links that new nodes would create when joining an existing network. This might be applied, for example, to forecast the interactions of an invasive species entering a food web or pollination network, the contacts of a non-surveyed individual in a terrorist network, or the biota of an unsampled mountain. The procedure is as follows: from the adjacency matrix, we first estimate the latent traits of *matching* and *centrality* for each node in the existing network, and verify that our model provides an accurate fit. Then, using appropriate statistical models, we relate the latent traits of the nodes to external information about them (as done in the previous section), and ensure that the model provides a good fit to the data. If both conditions are met, we can forecast the *matching* and *centrality* traits of the new node(s) using the external information, and finally their linking probability with each of the existing nodes.

In the following, we illustrate the method using the network describing the presence/absence of mammal species on mountains within the Rocky Mountains [39], and the food web of Tuesday Lake [43]. The model fits these two networks perfectly, with a fraction of correctly fitted links equal to 1 for two dimensions of *matching*. We first present the technique of forecasting applied to the presence/absence network of the Rocky Mountains. Then we give the modifications that have to be made to apply this technique to the food web of Tuesday Lake. The latent traits of the mountains ( , and ) can be related to their area, elevation, and geographical position using a generalized least-squares linear model with a spatial correlation structure [31]. Specifically, we assume that the *matching* and *centrality* traits follow an MVN, where the linear part is given by the longitude, latitude, area, and elevation of the mountains, and that the spatial correlation structure follows an exponential law, i.e.
4.3with the elements of the variance–covariance matrix given by
4.4

The parameters are the intercept and slope; *r* is a parameter tuning the exponential decay of the spatial correlation; is the common variance; and is the distance between mountains *i* and *j*. For the *matching* traits, we found that only the spatial correlation structure was significant, whereas for the *centrality* traits, area, elevation, and latitude were significant (electronic supplementary material, table S3). Note that the *matching* and *centrality* traits for the mammals could not be accurately related to species traits (electronic supplementary material, table S3).

Because the *matching* and *centrality* traits of the mountains are significantly related to several covariates and to the correlation structure given by the between-mountain distances, it should be possible to forecast the mammal communities in unsampled mountains for which the covariates are known, given the information provided by the covariates of the sampled mountains and the observed presence/absence network. We test the forecasting performance of the *matching–centrality* model by removing, one-by-one and by sets of four, each mountain from the dataset and attempting to recover its mammal community. This yields the following out-of-sample test

(1) We remove one or four mountains, denoted by

*k*, from the network and then estimate the*matching*and*centrality*traits of the mammals and of the remaining mountains using the*matching–centrality*model,(2) We fit the statistical model (4.3) on the estimated

*matching*and*centrality*traits of step 1,(3) Using the fitted parameters from step 2, we compute the conditional expectation for the

*matching*and*centrality*traits of mountain(s)*k*; for the*centrality*trait, this value is given by 4.5where are the fitted parameters from step 2. The conditional expectation for the*matching*trait is given by 4.6where is the*k*th column(s) without the*k*th row(s) (indicated by subscript −*k*) of the variance–covariance matrix estimated using equation (4.4); is the element(s) of the estimated variance–covariance matrix; is the row vector of residuals obtained from step 2. The last term of equation (4.6) represents the deviation from the linear prediction that is introduced by knowledge of the spatial correlation structure,(4) With the predicted

*matching*and*centrality*traits, we can predict the linking probabilities (equation (2.1)) between the removed mountain(s) and the mammals. The presence/absence of the mammals is determined from these linking probabilities and the cut-off point obtained with the fit of step 1. The cut-off point is defined as the linking probability chosen such that the number of false-positives is equal to the number of false-negatives, and(5) We repeat steps 1–4 for all mountains in turn.

We applied the same procedure for the food web of Tuesday Lake, but we used the phylogenetic regression given by equation (4.1) to relate the *matching* and *centrality* traits to the covariables. The model performs remarkably well: on average, 87% of the data were correctly forecasted for the Rocky Mountain network (figure 5*a*). For the food web of Tuesday Lake, recall that the *matching* and *centrality* traits are closely related to the body size and phylogeny of the species (electronic supplementary material, table S2). Based on these predictors, on average, 86% of the trophic links were correctly forecasted (figure 5*b*).

In a real-case situation for the forecasting of unsampled nodes, the first condition for application is that the *matching–centrality* model provides a good fit to the sampled network. Second, there must exist a strong relationship between the *matching* and *centrality* traits and the independent information on the nodes; in our first example, forecasting the mountains occupied by a new mammal would not be possible. Once these conditions are met, one can proceed to the forecasting of the links connected to the unsampled nodes. We note three final technical points. First, as validation, we recommend performing a complete out-of-sample test as applied above. Second, the new nodes have to belong to the same statistical population as the original ones (it would obviously make no sense to forecast the mammal community present in a completely different region). Third, in our examples, we related the *matching* and *centrality* traits to the characteristics of the nodes using linear models; however, models of any form can be applied at this stage.

To the best of our knowledge, only one other method has been devised to forecast the links made by a new node in a network, for host–parasitoid networks based on the phylogenies of the species [57]. Our approach has two decisive advantages that make it extremely versatile: first, it is not necessary to include information about both sets of nodes of the bipartite network to make the forecast; second, by placing latent traits as intermediates between the network adjacency matrix and node characteristics, we provide an entirely flexible way to incorporate external information about nodes, for any conceivable statistical model could be used to relate the latent traits to external variables. Both features are perfectly illustrated in the above examples.

## 5. Conclusion

By translating an adjacency matrix into a set of quantitative traits for the nodes, the *matching–centrality* model represents a powerful and general tool for network analysis. It allows the reconstruction of missing information and forecasting of the links of entirely novel nodes, and opens the door to comparative analyses to shed light on the factors underlying network structure across disciplines.

## Data accessibility

Network data: Dryad http://dx.doi.org/10.5061/dryad.5fn84.

## Authors' contributions

All authors designed the research; R.P.R. performed the simulations; R.P.R., R.E.N., and L.F.B. analysed the results; all authors wrote the paper.

## Competing interests

The authors declare no competing financial interests.

## Funding

The work was supported by the National Centre of Competence in Research ‘Plant Survival’, the Swiss National Science Foundation grant 31003A-138489 (both to L.-F. Bersier), by SystemsX.ch, the Swiss Initiative in Systems Biology (to L.-F. Bersier and C. Mazza), and by the FP7-REGPOT-2010-1 programme (project 264125 EcoGenes), and an ERC Advanced Grant (both to J. Bascompte).

## Acknowledgements

We thank J. Bascompte, A. Davison, and S. Saavedra for critical discussion.

- Received November 9, 2015.
- Accepted January 8, 2016.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.