Mathew & Perreault [1] analyse cross-cultural data from the Western North American Indian (WNAI) dataset [2] in order to compare ‘the relative effect of environment and cultural history’ on behavioural variation across 172 societies. This endeavour is inspired by many other evolutionary studies of human cultural variation [3–7]. Mathew and Perreault conclude that ‘social learning operating over multiple generations [is] the main mode by which humans acquire their behaviour’ (p. 5). Our own investigation of cultural macroevolution in the WNAI [8] motivated us to attempt to reconstruct their analyses. We found their paper to be undermined by questionable analytical choices and by computational and data-handling problems. We draw this conclusion having used the information in the Methods and electronic supplementary material S1, S3, S4 and S6 in [1] to recreate those parts of their study that we were able to. In this commentary, we present the results of our examination and detail the serious methodological flaws that lead us to conclude that a complete re-analysis by Mathew and Perreault is required. We also comment briefly on their conceptual schema, which, in trying to find ‘the main mode of human adaptation’ [1], appears to set cultural transmission (i.e. social learning) in opposition to environmental adaptation.

Mathew and Perreault use logistic regression to model 457 present/absent behavioural traits as a function of three dimensions—*E* (local ecological conditions), *P* (phylogenetic or linguistic distance to other societies) and *S* (spatial or geographical distance to other societies). To judge the relative importance of the *E*, *P* and *S* classes of predictors, Mathew and Perreault compare sums of absolute values of regression coefficients across classes, for the best model of each behavioural trait. The ‘summed absolute values’ metric is used for various purposes in model and feature selection [9,10]. The metric is problematic here, however, because it compounds statistical signal with the different sizes of the *E*, *P* and *S* classes. To demonstrate, consider a null case in which none of the predictors in *E*, *P* or *S* are related to a trait and the regression coefficients resemble stochastic noise. The analyst must understand how a statistical metric would behave in such a case and choose an inference procedure that reliably distinguishes null from non-null cases. For concreteness, assume that the coefficients share a common Gaussian distribution with mean zero and variance *σ*^{2}. The absolute value of a coefficient *β* then has expectation *σ*√(2/π), and for a class containing *M* predictors the summed absolute values have expectation *Mσ*√(2/π). The null expectation therefore scales linearly with class size *M*, and larger classes of predictors will appear to have greater relative importance based on representation alone. Although model selection criteria such as AIC (discussed below) include a penalty for the number of predictors in a model, this does not mitigate the confounding effects of class size [11].
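The linear scaling of the null expectation with *M* is easy to verify by simulation. The following sketch (in Python; the function names and parameter values are ours, chosen for illustration) draws Gaussian ‘coefficients’ and compares the simulated summed absolute values against *Mσ*√(2/π):

```python
import math
import random

def null_summed_abs(m, sigma=1.0, n_rep=20000, seed=1):
    """Average of sum(|beta_j|, j=1..m) when each beta_j ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += sum(abs(rng.gauss(0.0, sigma)) for _ in range(m))
    return total / n_rep

def expected(m, sigma=1.0):
    """Theoretical null expectation: M * sigma * sqrt(2/pi)."""
    return m * sigma * math.sqrt(2.0 / math.pi)

# Class sizes matching E, S and P in [1]: larger classes yield larger sums
# under the null, purely through representation.
for m in (3, 6, 8):
    print(m, round(null_summed_abs(m), 3), round(expected(m), 3))
```

Nothing in this simulation involves real data; it merely shows that a class with eight predictors is expected, under pure noise, to have a summed absolute value 8/3 times that of a class with three.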

To see how summed absolute values depend on the number of predictors contributing to the sum, consider fig. 3 of [1], which appears to show a time-depth effect of language. The numbers of predictors in phylogenetic classes, from the Level 8 class down to the Level 3 class, follow a descending sequence: *M* *=* 8, 8, 6, 4, 4, 3 (see electronic supplementary material S7 in [1]). This sequence is mirrored by the stair-step feature of the overall effect of phylogeny and is wholly consistent with null expectations. Thus, what Mathew and Perreault interpret as greater relative importance for predictors incorporating deeper history may reflect only the fact that the deeper-level classes contain more predictors—interpretation of the figure is precluded by confounding effects of class size. The ratios of importance values shown in their fig. 2 are also consistent with null expectations. For example, with *M* = 6, 8 or 3 (for classes *E*, *P* or *S*, respectively), the null expectation of the relative importance of cultural history over ecology, (*S* + *P*)/*E*, is approximately 11/6, yielding 1.4 after tan^{−1} scaling (see their fig. 2 caption). The value 1.4 (not 1.0) is thus a more appropriate null reference for fig. 2*a* in [1]. Boxplot medians in fig. 2*a* stay relatively close to this value. As 11 of the 17 (65%) predictors are assigned to cultural history, it is not surprising that the overall effect for this combined class is larger than that for ecology in 70% of the behavioural traits. In short, the relative strengths of *E, P* and *S* cannot be discerned as Mathew and Perreault intend without a correction for the number of predictors in each class.

The overall effects reported by Mathew and Perreault are also troubled by numerical instability. For instance, the outcome ‘salt added to food’ (r_tech_162) has absolute sums of 4.29 × 10^{15}, 3.20 × 10^{15} and 1.05 × 10^{16} for *E*, *P* and *S*, respectively (see electronic supplementary material S6 in [1]). Models for at least 29 outcomes have extreme effect sizes, even after dividing by class size *M*. Logistic regression coefficients give the change in the log-odds of a binary outcome per unit change in an independent variable, so estimates of this magnitude are not interpretable. For reasons described by Gelman *et al.* [12], large-magnitude coefficient estimates call for special scrutiny and may indicate underlying problems such as non-convergence of model-fitting algorithms, complete separation of the data or multicollinearity. In fact, in reproducing their results, we found that 24 of 457 models failed to converge; 12 of these were among the models with extreme effect sizes. Rather than addressing such problems directly, Mathew and Perreault take measures such as truncating axes (figure 1 above, and figs 1 and 3 in [1]) and rescaling coefficients by a tan^{−1} transformation (fig. 2 in [1]), the latter apparently to control the infinite relative importance values that result from division by zero.
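The connection between separation and runaway coefficient estimates can be reproduced with a toy example. The sketch below (Python, with an invented one-predictor dataset and a hand-rolled gradient-ascent fit, not any package used in [1]) shows that when the classes overlap the slope converges to a finite value, but when they are perfectly separated the estimate grows without bound as fitting continues:

```python
import math

def fit_logistic_1d(xs, ys, lr=0.5, n_iter=20000):
    """Maximum-likelihood slope for P(y = 1) = logistic(beta * x),
    fitted by plain gradient ascent on the log-likelihood."""
    beta = 0.0
    for _ in range(n_iter):
        grad = sum((y - 1.0 / (1.0 + math.exp(-beta * x))) * x
                   for x, y in zip(xs, ys))
        beta += lr * grad
    return beta

# Overlapping classes: a finite, stable estimate exists.
b_ok = fit_logistic_1d([-2, -1, 1, 2, -0.5, 0.5], [0, 0, 1, 1, 1, 0])

# Perfectly separated classes: the MLE is infinite, so the estimate
# keeps climbing for as long as the optimizer runs.
b_sep = fit_logistic_1d([-2, -1, 1, 2], [0, 0, 1, 1])

print(round(b_ok, 2), round(b_sep, 2))
```

With 457 traits and up to 17 predictors each, some separated fits are almost inevitable, which is consistent with the astronomically large coefficients in electronic supplementary material S6 of [1].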

During our reconstruction of the analyses, we also uncovered a critical flaw in Mathew and Perreault's use of principal component analyses (PCA) to compute *P* and *S* dimensions. To determine the predictors in *P* and *S*, Mathew and Perreault pass a matrix of pairwise distances (spatial or linguistic) to the *prcomp* function in the R language [13]. This function is designed to operate exclusively on datasets consisting of sampling units (here, cultures) in rows, and measured attributes (features) of these units in columns. In their columns, Mathew and Perreault use relational data—pairwise distances—where *prcomp* expects cultural attributes. In figure 2, we demonstrate the consequences of this for the spatial components. To visualize analogous consequences for the phylogenetic components is a more challenging problem we have not solved. Given the functionality of *prcomp*, however, we expect similar distortions because the components are calculated using relational distances along a language tree, not primary attributes of the individual cultures.

One alternative to PCA is principal coordinates analysis (PCoA), a related method specifically for pairwise distance matrices [14]. The first two dimensions of a PCoA applied to spatial distances closely recreate spatial configurations at moderate scales (see example 5.2 in [15]). For discrete, low-variation distances like those produced by the language tree, PCoA may not be as suitable. A second alternative to PCA makes direct use of phylogenetic and spatial distances in a parametrized variance–covariance matrix [16]. Whatever the relationships are between cultural traits, phylogenetic and spatial distances, they may be disrupted by inappropriate use of PCA.
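For readers unfamiliar with PCoA, it reduces to double-centring the squared-distance matrix and taking the leading eigenvectors. A minimal sketch (Python/NumPy; the function and variable names are ours) shows that, unlike PCA applied directly to a distance matrix, PCoA recovers a configuration whose pairwise distances reproduce the input exactly for Euclidean data:

```python
import numpy as np

def pcoa(d, k=2):
    """Classical PCoA: embed an n x n distance matrix d into k dimensions."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    b = -0.5 * j @ (d ** 2) @ j                   # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]            # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0.0))  # coordinates

# Euclidean distances among planar points are recovered exactly
# (up to rotation and reflection, which leave distances unchanged).
rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords = pcoa(d, k=2)
d_rec = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(d, d_rec))
```

In R, the same computation is available as *cmdscale*; the point is that the method expects a distance matrix as input, which is exactly what *prcomp* does not.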

Mathew and Perreault use the *glmulti* package in R to screen all 131 071 possible models for each trait and then analyse the coefficients of just the single best model—the model with the lowest AIC value (AIC_{min}). Model comparison is a powerful tool for evaluating alternative models given the data [12,17]. However, given so many predictors, the best model is often one of a large set of models with very similar AIC values (e.g. Δ_{i} < 4, where Δ_{i} is defined as AIC_{i} − AIC_{min}), and our analysis of their data suggests this is true for most traits. As detailed by Burnham & Anderson [11], model averaging combines information from a set of top models, and thus incorporates more sample information and produces more robust inferences (e.g. about the relative effects of *E*, *P* and *S*). Moreover, the coefficients associated with each predictor can be interpreted in the context of other predictors, even if the single best model leaves out a particular component. Alternatively, rather than screening all possible models, Mathew and Perreault could compare *a priori* sets of models with different combinations of components (e.g. *EPS*, *EP*, *ES*, *PS*, *E*, *P* and *S*) [18].
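The Δ_{i} values and the Akaike weights on which model averaging rests are straightforward to compute from a set of AIC values. A minimal sketch (Python; the AIC values are invented for illustration) shows how several models within Δ_{i} < 4 of the best share substantial weight, so that no single model dominates:

```python
import math

def akaike_weights(aics):
    """Akaike weights w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
    where delta_i = AIC_i - AIC_min (Burnham & Anderson)."""
    aic_min = min(aics)
    raw = [math.exp(-(a - aic_min) / 2.0) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical AICs for four candidate models: the first three lie within
# delta < 4 of the best and together carry almost all the weight.
aics = [100.0, 101.2, 102.9, 110.0]
w = akaike_weights(aics)
print([round(x, 3) for x in w])
```

Averaging coefficients across models with these weights, rather than reading them off the single AIC_{min} model, is the procedure Burnham & Anderson [11] recommend.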

The accumulation and transmission of cultural knowledge are hallmarks of the human species [19], and cross-cultural variation demonstrates strong regional and linguistic patterning [20]. Likewise, both abiotic and biotic environments clearly shape and constrain human cultural behaviour (see [21] for a review and [22] for a detailed study of resource-defence polygyny in the WNAI). We thus question the central framing of Mathew and Perreault's study—as a test for whether human behavioural variation ‘is due to variation in the ecological environment or to differences in cultural traditions’ (see abstract of [1]). Their apparent posing of non-cultural mechanisms against cultural mechanisms is problematic in two main ways. First, it fails to acknowledge that social learning is itself an adaptation to the patterning of environmental change over time [23]. Second, we do not believe that anyone has seriously proposed that the decisions humans make regarding behavioural strategies emerge de novo with each generation (Mathew and Perreault's ‘single-generation adaptive response’ caricature) [24–26], nor are we aware that anyone ever has [21]. Mathew and Perreault undoubtedly recognize the complexities here, and in places acknowledge the complementarity of perspectives. Nevertheless, we caution that framing a paper as a debate detracts from the more interesting challenge, which is to determine the relative influences of different transmission mechanisms, recognizing that they operate in addition to—not as alternatives to—adaptation [3,4,19,25]. Furthermore, careful modelling in this area indicates that, given the complexity of interacting factors, we are still far from being able to use patterns in space and time to pinpoint the mechanisms responsible for cultural diversity [27].

For over a century, anthropologists have endeavoured to understand the origins and maintenance of cultural variation across human societies [28], and an evolutionary ecological perspective such as that of Mathew and Perreault has much to contribute [3–8]. Although such investigations have proved methodologically challenging, the field has prospered as many alternative quantitative approaches have been developed, employed and compared. Unfortunately, Mathew and Perreault's analyses are hampered by numerous problems that make it impossible to draw reliable conclusions from their results. We therefore believe that the statistical and computing mistakes in Mathew and Perreault's study are serious enough to merit a complete re-analysis of the data, particularly given the importance of understanding the multiple mechanisms that generate and maintain cultural variation across human societies.

## Competing interests

We declare we have no competing interests.

## Funding

We received no funding for this study.

## Acknowledgements

We thank Bret Beheim, Barney Luttbeg, Richard McElreath and Pete Richerson for their insights, with the proviso that the views expressed here are our own.

## Footnotes

The accompanying reply can be viewed at http://dx.doi.org/10.1098/rspb.2016.0177.

- Received September 10, 2015.
- Accepted December 8, 2015.

- © 2016 The Author(s)

Published by the Royal Society. All rights reserved.