## Abstract

The Euclidean and MAX metrics have been widely used to model cue summation psychophysically and computationally. Both rules are special cases of a more general Minkowski summation rule with exponent *m*, where *m* = 2 and ∞, respectively. In vision research, Minkowski summation with power *m* = 3–4 has been shown to be a superior model of how subthreshold components sum to give an overall detection threshold. We have previously reported that Minkowski summation with power *m* = 2.84 accurately models summation of *supra*threshold visual cues in photographs. In four suprathreshold discrimination experiments, we confirm the previous findings with new visual stimuli and extend the applicability of this rule to cue combination in auditory stimuli (musical sequences and phonetic utterances, where *m* = 2.95 and 2.54, respectively) and cross-modal stimuli (*m* = 2.56). In all cases, Minkowski summation with power *m* = 2.5–3 outperforms the Euclidean and MAX operator models. We propose that this reflects the summation of neuronal responses that are not entirely independent but which show some correlation in their magnitudes. Our findings are consistent with electrophysiological research that demonstrates signal correlations (*r* = 0.1–0.2) between sensory neurons when these are presented with natural stimuli.

## 1. Introduction

Our interactions with the environment require the ability to accurately interpret and integrate the surrounding perceptual cues [1–4]. In vision, straightforward combination rules for neural channels have been proposed in detection (e.g. [5]) and visual search experiments [6–10]. These include classical models such as linear (city-block) addition, energy (Euclidean) summation and the maximum (MAX) rule. Of these, the Euclidean metric has been widely used to define psychological space (e.g. [6,7]) and has often been the benchmark against which psychological models of cue summation are tested. However, the MAX metric has also been studied computationally and biologically (e.g. [11–13]). These two rules are therefore particularly important in the modelling of perception.

However, there is evidence to suggest that cue summation may diverge systematically from these classical rules, which are special cases of a more general Minkowski summation rule:
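$$S = \left(\sum_{i=1}^{n} s_i^{\,m}\right)^{1/m} \tag{1.1}$$

where *s*_{i} are the magnitudes of the *n* individual cues, *S* is their combined magnitude, and *m* = 1, 2 and ∞ result in linear addition, Euclidean summation and the MAX rule, respectively.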
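As an illustration of how the exponent in equation (1.1) moves the combined value between the classical rules, the following minimal sketch evaluates the Minkowski sum for two arbitrary cue magnitudes. The function name `minkowski_sum` and the example values are hypothetical and are not taken from the experiments.

```python
import math


def minkowski_sum(cues, m):
    """Combine non-negative cue magnitudes with Minkowski exponent m.

    m = 1 gives linear addition, m = 2 Euclidean summation,
    and m = infinity the MAX rule.
    """
    if math.isinf(m):
        return max(cues)
    return sum(c ** m for c in cues) ** (1.0 / m)


# Illustrative magnitudes only (not experimental data).
cues = [3.0, 4.0]
for m in (1, 2, 3, math.inf):
    print(f"m = {m}: combined magnitude = {minkowski_sum(cues, m):.3f}")
```

For any pair of positive cues, the combined value falls monotonically from the linear sum towards the larger cue as *m* increases, so an exponent of about 3 lies between the Euclidean and MAX predictions.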

Visual detection experiments using compound stimuli such as pairs of sinusoidal gratings or even complex natural images have shown that a Minkowski rule with power *m* = 3–4 (rather than Euclidean or MAX) is a good model of how subthreshold components sum to give an overall detection threshold (e.g. [14–18]). Furthermore, we have recently shown in *supra*threshold discrimination experiments that perceptual combination of features such as colour changes and shape changes in photographs of *natural* visual scenes is best modelled using Minkowski summation with power *m* ≈ 3 [19,20].

We are interested in whether such perceptual summation of suprathreshold cues also occurs when an observer listens to composite natural sounds, such as music or speech. In hearing, there has been much debate on how auditory cues combine in multi-tone stimuli (e.g. [21]) and more complex sounds like music (e.g. [22]) or speech (e.g. [23]). Although there exist parallels between visual and auditory processing, research in auditory feature integration arises from a different tradition with a different point of view, so that it is difficult to draw direct comparisons between summation models used in vision and hearing. However, Green [21] did report that two subthreshold tones summed to reach threshold with less summation than expected for power (i.e. Euclidean) summation, implying that simple auditory cue summation might also follow a Minkowski rule with *m* > 2.

The present paper compares cue or feature combination in different auditory and visual natural stimuli by asking observers to give magnitude estimation ratings for pairs of stimuli that differ along a number of different dimensions. Cue combination is studied by considering how ratings to changes along single dimensions compare with ratings to changes along a combination of these dimensions. We will consider auditory cue summation in musical sequences and in phonetic utterances, and we will show the same kinds of systematic failure of the Euclidean and MAX models found with cue combination in photographs of natural scenes. We will also investigate how visual cues are summed with auditory cues in a cross-modal situation, where observers are simultaneously presented with photographs and natural sounds and where the stimulus changes can be visual, auditory or a combination of both. Again, our results will demonstrate that Minkowski summation with power *m* = 2.5–3 is most successful in describing feature integration in cross-modal natural stimuli and is a significantly better description than Euclidean summation or the MAX rule. We suggest that Minkowski summation with power *m* = 2.5–3 reflects the summation of neuronal responses that are not entirely independent but that show some correlation in their magnitudes. This would make sense since information from the world is correlated [24]. Our conclusions are supported by electrophysiological research that demonstrates consistent signal correlations between sensory neurons when they are presented with natural stimuli (see §4).

## 2. Methods

### (a) Experimental stimuli

#### (i) Construction of stimuli

In the music experiment (experiment 1), observers were presented with 160 musical sequence pairs (each lasting 2 s) that differed in one or two of the following dimensions: intensity (by changing the dynamics to *pp* or *ff*), timbre (by changing the instrument), pitch (by transposing the sequence upward or downward by various chromatic or diatonic intervals) and/or content (by changing, adding or removing one or more notes). The magnitudes of these changes varied between the different parent sequences. There were 16 reference sequences, each providing four single dimension changes and six composite changes (see §2*a*(ii)). All sequences were generated using a free evaluation copy of *Notion Demo* (Notion Music Software, v. 1.5.4.0), a piece of music composition and performance software. (Examples of sequences and differences are shown in figure 1*a* and in the electronic supplementary material, figure S1*a*,*b*.)

The phonetic experiment (experiment 2) used 320 stimulus pairs (each lasting 1–2 s) made from recordings of single spoken syllables. All phonetic utterances were recorded and modified using *Audacity* (v. 1.3.4, beta software that is freely downloadable online). The stimuli in the pair could differ in intensity (increased by 5 dB or decreased by 15 dB), low-pass (roll-off = 18 dB/octave) and high-pass (roll-off = 24 dB/octave) filtering, tempo (faster or slower) and pitch (very high, high, low or very low). The magnitudes of these changes were different for different syllables. A reference syllable could be paired with one of these variants or with a different syllable (e.g. ‘ta’ versus ‘na’), and might change in one (component) or two (composite) ways (see §2*a*(ii)). There were 20 reference syllables, each with 15 variants, giving a total of 320 phonetic utterances (see electronic supplementary material, table S1).

In the vision experiment (experiment 3), observers were presented with a total of 300 pairs of natural image scenes based on 20 parent images, each matched with 14 variants. Some variants were a second photograph of the same scene taken when, say, an object had moved. In other variants, using Matlab, the scene could be blurred or sharpened to varying degrees, the contrast could be reduced, or the hue and saturation of the whole scene could be changed while leaving the brightness relatively unaffected (see examples in figure 1*c* here and in the electronic supplementary material, figure S1*d*). In total, 20 pairs were identical, 100 pairs differed only along one of the dimensions described above and 180 pairs differed in a combination of two dimensions (see §2*a*(ii)).

The stimuli in the cross-modal experiment (experiment 4) were natural images coupled with natural sounds, i.e. a visual image was presented simultaneously with an auditory sequence. Observers were presented with a total of 648 pairs of these image-sound combinations and were asked how different the overall audio-visual ‘experience’ appeared to them. The stimuli were based on 36 reference photographs (nine each of animals, musical instruments, objects and people) and 36 appropriate reference sound effects. Seventy-two original photographs were purchased from the website *iStockphoto* (from various artist members) and then modified using Matlab. Each reference image could be changed in two ways (e.g. blur, contrast or hue) and a third variant consisted of a second photograph of the same object or scene taken when a target object had moved or changed shape. The natural sound sequences were generated from 36 reference sound effects chosen from a database called ‘InstantSoundFx’. Each reference sequence was subsequently modified using the software program *Audacity* and cropped to have a duration of 1 s. Three variants from each reference sound sequence were made by increasing or decreasing the intensity by values between 3 and 10 dB, low-pass (roll-off between 6 and 36 dB/octave) and high-pass (roll-off between 6 and 48 dB/octave) filtering, and lowering or heightening the pitch by 15–40 dB. In 324 image-sound pairs, the images and sounds were chosen so that their content was congruous (e.g. bird images–chirping sounds), but in the other 324 pairs, the images and sounds were drawn from different categories so that their content was incongruous (e.g. bird images–telephone ringing sounds). See the electronic supplementary material online for examples, and details on how congruous and incongruous sounds were matched with visual images.

The electronic supplementary material online also contains further details on the construction and presentation of all the stimuli described above.

#### (ii) Combination sets

The experiments were based around *combination sets* of three stimulus pairs. Starting from one reference stimulus, the first pair (a component pair) differed in one dimension such as intensity, the second pair (a second component pair) differed in a second dimension such as pitch, and the final pair (the composite) differed in both dimensions. For example, in the music experiment, the first component pair might differ in dynamics, the second component pair might differ in scale (transposition), and the composite pair would exhibit differences in *both* dynamics *and* scale (figure 1*a*). The magnitude of the two changes in the composite was the same as in the component pairs, and each component pair contributed to more than one combination set. There were in total 96 musical combination sets in experiment 1, 200 phonetic combination sets in experiment 2 and 180 visual scene combination sets in experiment 3. See examples of each in figure 1.

In the cross-modal experiment (experiment 4), among the 648 image-sound pairs, 216 contained only image changes (while the sound remained unchanged), 216 contained only sound changes (while the images remained unchanged) and 216 contained both image and sound changes; i.e. there were 216 image-sound combination sets. Overall, the 216 combination sets consisted of 108 congruous and 108 incongruous sets. An example of a congruous combination set is presented in the electronic supplementary material, figure S2*b*–*d*.

### (b) Experimental procedure

The procedure was the same for all experiments, and the details of training and experimental design are given in the electronic supplementary material. Human observers (naive to the purpose of the experiments) were presented with pairs of stimuli and asked to make subjective numerical ratings of the perceived difference between the items in each pair [19]. During the experiments, observers were frequently presented with the same standard stimulus pair (specific for each experiment, see electronic supplementary material, figures S1*a*,*c*, S2*a* and table S1), whose magnitude difference was defined as ‘20’. They were instructed that their ratings of the subjective difference for any other test pair should be based on this standard pair: if they perceived the difference within a test pair to be less than, equal to or greater than that of the standard pair, their rating should be less than, equal to or greater than 20, respectively. For the experiments, the presentation sequence of stimulus pairs was randomized differently for each observer.

### (c) Data collation

In each experiment, the magnitude estimation ratings of the 10–15 observers were averaged together for further analysis. The results for each observer were first divided by their median value (typically about 20). The scaled rating for each stimulus was then averaged across observers, and the average was multiplied by the grand average of all the observers' original ratings [25].
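The collation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: every observer rates the same stimuli in the same order, and the 'grand average' is taken to be the mean of all original ratings pooled across observers. The function name `collate_ratings` and the toy numbers are hypothetical.

```python
from statistics import mean, median


def collate_ratings(ratings_by_observer):
    """Average magnitude-estimation ratings across observers.

    ratings_by_observer: one list of ratings per observer, with all
    observers rating the same stimuli in the same order (assumption).
    """
    # Grand average of the original, unscaled ratings (pooled over observers).
    grand_mean = mean(r for obs in ratings_by_observer for r in obs)
    # Step 1: divide each observer's ratings by that observer's median.
    scaled = [[r / median(obs) for r in obs] for obs in ratings_by_observer]
    # Steps 2 and 3: average each stimulus across observers, then rescale.
    n_stimuli = len(ratings_by_observer[0])
    return [grand_mean * mean(obs[i] for obs in scaled) for i in range(n_stimuli)]


# Toy data: the second observer uses numbers twice as large as the first.
toy = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
print(collate_ratings(toy))
```

Dividing by each observer's median removes differences in the overall range of numbers that observers choose to use, while the final multiplication restores the averaged ratings to the original scale.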

## 3. Results

### (a) Cue combination for natural sounds and natural visual scenes

Figure 2*a*–*c* examines whether Euclidean summation can predict the measured rating (*R*3) to the composite stimulus in each combination set from the separate ratings (*R*1 and *R*2) given for its two component pairs in the first three experiments. The figure panels plot the predicted value of *R*3 against the measured value of *R*3. By analogy with equation (1.1), the predicted rating for the compound is:

$$R_3 = \left(R_1^{\,m} + R_2^{\,m}\right)^{1/m} \tag{3.1}$$

where *m* = 1 for linear summation, *m* = 2 for Euclidean summation and *m* = ∞ for the MAX rule. The predicted ratings for the composite stimuli (ordinate) are well correlated with the actual ratings (abscissa); Pearson's *r* is 0.87, 0.94 and 0.90 for figure 2*a* (musical sequences), figure 2*b* (phonetics) and figure 2*c* (visual scenes), respectively. However, there is a systematic failure of the prediction: the data points tend to lie above the line of equality (especially evident in figure 2*a*,*c*, where the summed squared-error per point (SSE) is greater than in figure 2*b*), showing that Euclidean summation of the component ratings has slightly overestimated the rating given by the observers to the compound stimuli.

The MAX rule (equation (3.1) with *m* = ∞) also performs poorly; figure 2*e*–*g* shows that data points were mostly below the line of equality, meaning that taking the maximum of the component ratings mostly underestimated the rating of the combination stimuli. Indeed, we have previously shown for cue integration in visual stimuli that the MAX operator model underestimated observers' ratings while linear addition dramatically overestimated them. The same trend was observed in the present experiments; table 1 shows that the SSE per point for the linear addition rule was very high and the model is less acceptable even than the Euclidean model. The SSE values for the MAX rule were close to but were consistently higher when compared with those for Euclidean summation.

Euclidean summation and the MAX operator are special cases of a general Minkowski summation rule (equation (3.1)), a summation rule frequently used to model threshold and suprathreshold visual performance [14–18,25] as well as elsewhere [26]. An iterative search was used to determine the value of the exponent *m* that minimized the sum of squared deviations (SSE) between the predicted value of *R*3 (ordinate, equation (3.1)) and the measured value on the abscissa; 95 per cent confidence intervals were obtained by Monte Carlo bootstrapping. Figure 2*i*–*k* shows how well Minkowski summation of the ratings to component stimuli predicts the rating given to the composite stimulus in the first three experiments. This serves to compare our present experiments (particularly with natural sounds) with our previous visual experiments [19] but, more importantly, by allowing the summation exponent *m* to be a free parameter, it illustrates that there *is* a systematic failure of the Euclidean and MAX predictions.
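A fitting procedure of this kind might be sketched as below. The search algorithm and the bootstrap scheme are not specified in the text, so this sketch assumes a bounded golden-section search over *m* and a percentile bootstrap that resamples whole combination sets. All function names are hypothetical and the data are synthetic, generated from a known exponent of 3 rather than taken from the experiments.

```python
import math
import random


def predict(r1, r2, m):
    """Equation (3.1): Minkowski sum of the two component ratings."""
    return (r1 ** m + r2 ** m) ** (1.0 / m)


def sse(sets, m):
    """Sum of squared deviations between predicted and measured R3."""
    return sum((predict(r1, r2, m) - r3) ** 2 for r1, r2, r3 in sets)


def fit_exponent(sets, lo=1.0, hi=20.0, tol=1e-4):
    """Golden-section search for the m in [lo, hi] that minimises SSE.

    Assumes SSE has a single minimum on the interval.
    """
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if sse(sets, c) < sse(sets, d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2.0


def bootstrap_ci(sets, n_boot=200, seed=0):
    """Percentile 95% interval from refits to resampled combination sets."""
    rng = random.Random(seed)
    fits = sorted(
        fit_exponent([rng.choice(sets) for _ in sets]) for _ in range(n_boot)
    )
    return fits[int(0.025 * n_boot)], fits[int(0.975 * n_boot) - 1]


# Synthetic, noise-free combination sets (R1, R2, R3) built with m = 3.
rng = random.Random(1)
pairs = [(rng.uniform(5, 40), rng.uniform(5, 40)) for _ in range(40)]
demo_sets = [(r1, r2, predict(r1, r2, 3.0)) for r1, r2 in pairs]
m_hat = fit_exponent(demo_sets)
ci_low, ci_high = bootstrap_ci(demo_sets)
print(f"fitted m = {m_hat:.3f}, bootstrap interval = ({ci_low:.3f}, {ci_high:.3f})")
```

Because the synthetic data contain no noise, the bootstrap interval here collapses onto the fitted value; with real ratings, its width would reflect the variability between combination sets.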

In the two experiments with natural sounds (figure 2*i*,*j*), the generalized Minkowski summation rule outperformed the specific Euclidean and MAX rules, in that the SSE in both cases was reduced significantly (table 1). For instance, for the phonetics experiment where the improved fit is least obvious, adding the one parameter has caused SSE to fall from 4.81 to 3.83 on average for each data point: *F*_{1,199} = 50.91, *p* ≈ 0. The Minkowski summation exponents are 2.95 (95% confidence interval, 2.62–3.36) and 2.54 (95% confidence interval, 2.31–2.78) in the music and phonetics experiments, respectively (figure 2*i*,*j*). The correlation coefficients between predicted and measured ratings are 0.86 (music) and 0.95 (phonetics).
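The reported *F* value can be checked with a short calculation. The formula for the test is not stated in the text, so the sketch below assumes the standard extra-sum-of-squares *F*-test for nested models, with the fixed-exponent (Euclidean) model as the restricted model and 200 phonetic combination sets. The function name `nested_f` is hypothetical.

```python
def nested_f(sse_restricted, sse_full, n_points, extra_params=1, full_params=1):
    """Extra-sum-of-squares F statistic for two nested models."""
    df_residual = n_points - full_params
    numerator = (sse_restricted - sse_full) / extra_params
    denominator = sse_full / df_residual
    return numerator / denominator


n_sets = 200  # phonetic combination sets (experiment 2)
# The text reports SSE per data point, so multiply by the number of points.
f_stat = nested_f(4.81 * n_sets, 3.83 * n_sets, n_sets)
print(f"F(1, {n_sets - 1}) = {f_stat:.2f}")
```

Under these assumptions the statistic comes to approximately 50.9, in agreement with the value quoted above; the small residual difference is what would be expected from rounding of the SSE values.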

To examine whether the Minkowski model was equally successful in modelling cue combination across the different feature dimensions (e.g. pitch and timbre versus intensity and tempo), we first calculated the squared errors between the averaged observers' ratings for each combination stimulus and the corresponding Minkowski predictions, and then analysed the mean-squared errors in a one-way repeated measures analysis of variance (ANOVA) for the different types of combination (listed in electronic supplementary material, table S3). In both auditory experiments, the Minkowski summation model was uniformly efficient in predicting the ratings for all the different combination types: *F*_{5,95} = 1.69, *p* = 0.14 for music and *F*_{9,190} = 0.82, *p* = 0.60 for phonetics. Post hoc Bonferroni analyses found no differences in the SSE between predicted and measured ratings among the different types of combinations.

Results from the visual experiment (figure 2*k*) also showed that Minkowski summation with a very similar exponent (*m* = 2.97; 95% confidence interval, 2.78–3.36) was a better model than Euclidean or MAX summation (table 1), confirming our previous reports on a different set of image stimuli [19]. The correlation between measured and predicted ratings was 0.91, and a one-way repeated measures ANOVA showed that the Minkowski summation model was equally accurate in modelling all nine types of combinations (electronic supplementary material, table S3): *F*_{8,171} = 1.21, *p* = 0.30. In addition, a post hoc Bonferroni analysis revealed no differences in the SSE between predicted and measured ratings among the nine types of visual combinations.

### (b) Combination of audio-visual cues in bimodal stimuli

In the previous three experiments, observers were presented either with pairs of sounds or with pairs of visual scenes so that they could rate perceived differences either in the sound or in the visual stimulus, respectively. Here, we show the results of an experiment where each stimulus in a pair was a natural image visual stimulus *coupled with* a natural sound. The stimuli encompassed 216 combination sets, composed of three stimulus pairs: in the first pair only the images changed, in the second pair only the sounds changed, and in the third pair both images and sounds changed.

Figure 2*d*,*h* shows that Euclidean and MAX summations of the auditory cue (or rating) with the visual cue failed to predict the rating given by observers to the composite stimuli, where both visual and auditory components change. The discrepancy was similar to those of figure 2*a*–*c*,*e*–*g* for the separate auditory and visual cases. Minkowski summation (with power *m* = 2.56; 95% confidence interval, 2.42–2.67) of the ratings for the separate visual and auditory changes was a superior fit to Euclidean or MAX rules (figure 2*l*; table 1 summarizes all the SSE values). The correlation coefficient between predicted and measured ratings was 0.82. When combination sets from the congruous and incongruous conditions were analysed separately (see electronic supplementary material, figure S3), the best predictions were obtained with the general Minkowski summation rule with power *m* = 2.62 and 2.50, respectively. The absence of an effect of congruency suggests a general rule rather than a stimulus-specific phenomenon.

The remaining panels of figure 2 (*m*–*p*) will be discussed below.

## 4. Discussion

There has long been debate about the arithmetic rules governing feature integration: that is, the way in which a person combines multiple sensory or cognitive cues. It has been asked whether successful demonstration of a rule would represent some universal Law of Mentation [6,7]. The present experiments have focussed on the summation of perceptual cues in *natural* auditory and visual stimuli. We have been able to look more subtly than many previous studies at the precise applicability of different summation rules, and our experiments have revealed that Minkowski summation with power *m* in the range of 2.5–3.0 provides a significantly better fit to the data than linear addition, Euclidean summation and MAX rules, where *m* would be exactly 1, 2 and ∞, respectively. We find systematic deviations from all three classical models for feature integration in natural auditory, visual and audio-visual stimuli.

### (a) Minkowski summation in feature integration

Minkowski summation with *m* = 3–4 has long been used in the study of vision (e.g. [14,15]) to describe how multiple subthreshold visual stimuli sum towards an overall detection process. This may be consistent with Green [21], who found that the summation of pure tones in auditory experiments was less than expected on a power summation rule. The same Minkowski rule (typically with *m* = 3–4) has been used to model the detection of changes in natural visual images, when multiple tiny cues are contributed across very many visual channels or model neurons [16–18]. Such modelling has been extended to the perceived magnitudes of suprathreshold differences in natural images similar to those that we have described here [18,25]. We have shown here that the same sort of Minkowski exponent describes the perception of suprathreshold changes in naturalistic auditory stimuli, as well as visual changes. This suggests that the combination of cues, whether subthreshold for detection or suprathreshold for perceived differences or similarity judgements, follows one general rule (cf. [7]).

We have also shown that the same Minkowski summation rule describes the summation of auditory cues with visual ones in cross-modal experiments, for incongruous as well as congruous pairings. There is an interesting evaluation of multi-sensory integration by Laurienti *et al*. [27], who summarize the degree of summation of auditory and visual responses in the cat superior colliculus. Here, some neurons respond only to auditory stimuli, some only to visual stimuli and some to both (even when they are not necessarily congruous visual and auditory events). Laurienti *et al*. [27] estimate the overall population response of the superior colliculus (for comparison with functional magnetic resonance imaging (fMRI) studies) and report that the response to an auditory/visual stimulus combination is greater than the response to either alone, but is less than the arithmetic sum. This degree of summation seems to be consistent with Euclidean or Minkowski summation with exponents like those we have fitted in figure 2.

Minkowski summation with exponent *m* of 2.5–3 provides a convenient numerical description of the results of our present experiments, but it does not provide a physical or neural explanation for the cognitive processes involved. Vision scientists who model detection processes have called the Minkowski summation rule the ‘probability summation model’, presuming that the Minkowski exponent is a parameter associated with the steepness of the psychometric function for detection [15]. On the other hand, for the suprathreshold integration of binocular and binaural signals, the Minkowski exponent has been suggested to reflect the strength of reciprocal inhibition between two neurons prior to summation [28]. These interpretations do not seem to be immediately applicable to an overall suprathreshold sensation, in the context described above. In the following section, we speculate about what neural mechanisms might lead to the failure of Euclidean summation.

### (b) The Mahalanobis distance

Euclidean summation has been widely discussed as a general model for cue summation (e.g. [6,7]). However, this straightforward rule, as well as the MAX operator, is contradicted by our empirical results where Minkowski summation with power *m* of 2.5–3 is clearly a better fit to the experimental data. It is possible that this value of the Minkowski exponent *m* is related to the amount of correlation between different neuronal signals responding to natural stimuli. Euclidean summation might be appropriate if activity is independent, as each neuron would convey a uniquely important signal. However, if responses were highly correlated, the information given by only one neuron would be sufficient and the MAX rule (where *m* is ∞) would apply. So, if the neuronal signals to natural stimuli are slightly correlated, then the most appropriate summating exponent should be only a little greater than 2. A Minkowski exponent between 2.5 and 3 therefore suggests some small degree of signal correlation between actual neuronal responses.

In this case, an appropriate measure of cue combination should therefore readily account for potential correlation in signals, and an adjusted Euclidean measure of distance between stimuli is needed: the Mahalanobis distance [29] with covariance parameter *ρ*. The Mahalanobis distance has one free parameter, like the Minkowski distance. For the case where we are summing just two cues, the following formulation for a measure of feature integration is based on the Mahalanobis distance [29] and its relation to the Euclidean sum (equation (3.1) with *m* = 2) is clear:
$$R_3 = \sqrt{R_1^{2} + R_2^{2} - 2\rho R_1 R_2} \tag{4.1}$$

where *ρ* is the correlation between the dimensions represented by *R*1 and *R*2; it is the covariance of the sensory messages, if the sensory dimensions each have the same overall variance. The true Mahalanobis distance would be given by equation (4.1) after division by (1 − *ρ*^{2}) but we omit this division from our measure since we would have to apply the same scaling to the *measured* values of *R*3 as well. The term 2*ρ* × *R*1 × *R*2 is the amount by which the squared Euclidean sum overestimates (figure 2*a*–*d*) the square of the rating *R*3. For our four experiments, figure 2*m*–*p* plots the value of *R*3 (ordinate) predicted (equation (4.1)) by the Mahalanobis sum of *R*1 and *R*2 against the actual values of *R*3. The value of *ρ* shown in each panel was found for each of the four experiments separately by iteratively searching for the value that gave the least SSE. The 95 per cent confidence intervals for the *ρ* values for experiments 1 (music), 2 (phonetic), 3 (visual scenes) and 4 (cross-modal) are 0.16–0.23, 0.07–0.15, 0.15–0.21 and 0.13–0.17, respectively. (Electronic supplementary material, figure S3 shows the separate Mahalanobis fits for the congruous and incongruous conditions in experiment 4.) The Mahalanobis fits are significantly superior to the Euclidean and MAX operator fits (all *F*-tests highly significant), and are about as good as the Minkowski fits.
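To make the role of *ρ* concrete, the following minimal sketch computes the Mahalanobis-style sum of two ratings as the Euclidean sum reduced by the correlation term 2*ρ* × *R*1 × *R*2, and recovers *ρ* from a set of ratings by least squares. The search method used in the experiments is not specified, so a simple grid search is assumed here. The function names are hypothetical and the data are synthetic, constructed with *ρ* = 0.15.

```python
import math


def mahalanobis_sum(r1, r2, rho):
    """Euclidean sum of two ratings, reduced by the term 2*rho*r1*r2."""
    return math.sqrt(r1 ** 2 + r2 ** 2 - 2.0 * rho * r1 * r2)


def fit_rho(sets, n_steps=500, rho_max=0.5):
    """Grid search for the rho in [0, rho_max] that gives the least SSE.

    A stand-in for the unspecified iterative search (assumption).
    """
    candidates = [rho_max * k / n_steps for k in range(n_steps + 1)]

    def sse(rho):
        return sum((mahalanobis_sum(r1, r2, rho) - r3) ** 2 for r1, r2, r3 in sets)

    return min(candidates, key=sse)


# Synthetic combination sets (R1, R2, R3) built with rho = 0.15.
pairs = [(10.0, 20.0), (15.0, 15.0), (30.0, 12.0), (8.0, 25.0)]
demo_sets = [(r1, r2, mahalanobis_sum(r1, r2, 0.15)) for r1, r2 in pairs]
rho_hat = fit_rho(demo_sets)
print(f"fitted rho = {rho_hat:.3f}")
```

With *ρ* = 0 the function reduces to Euclidean summation, and any positive *ρ* gives a smaller predicted rating, which is the direction of the discrepancy seen in figure 2*a*–*d*.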

The Mahalanobis distance is closely related to the ‘law of cosines’ where ‘*ρ*’ is replaced by ‘cos *τ*’ (e.g. [28]). In this case, *τ* is a measure of the interaction between two non-orthogonal (i.e. correlated) sensory dimensions. Therefore, the present data might also represent an interaction between two sensory dimensions within a non-orthogonal coordinate system. Although the present data fit both the Mahalanobis and Minkowski models nicely, we should bear in mind that there might exist conditions where the behaviour of the Minkowski sum and the Mahalanobis sum do not overlap; indeed, it is only for small values of *ρ* that a Minkowski exponent can give a satisfactory alternative fit.
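The restriction to small values of *ρ* can be illustrated with a simple special case that is not part of the original analysis. If the two component ratings are equal (*R*1 = *R*2 = *R*), the Minkowski sum is 2^{1/*m*}*R* and the Mahalanobis sum is √(2(1 − *ρ*))*R*, so the two agree when *m* = 2/(1 + log2(1 − *ρ*)). The sketch below evaluates this relation; the function name is hypothetical, and for unequal component ratings the correspondence would differ.

```python
import math


def equivalent_minkowski_exponent(rho):
    """Exponent m for which the Minkowski sum of two EQUAL ratings matches
    the Mahalanobis sum with correlation rho (equal-ratings assumption).
    """
    if rho >= 0.5:
        # At rho = 0.5 the Mahalanobis sum equals a single component (MAX rule);
        # above 0.5 it is smaller still, so no finite exponent can match it.
        return math.inf
    return 2.0 / (1.0 + math.log2(1.0 - rho))


for rho in (0.0, 0.1, 0.15, 0.2, 0.4, 0.5):
    print(f"rho = {rho:.2f} -> m = {equivalent_minkowski_exponent(rho):.2f}")
```

In this special case, correlations between roughly 0.1 and 0.2 map onto exponents in the neighbourhood of 2.5–3, whereas the exponent grows without bound as *ρ* approaches 0.5, beyond which the Mahalanobis sum is smaller than either component and no Minkowski exponent can reproduce it.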

### (c) Correlations in neural signals

While a Minkowski exponent *m* of 2.5–3 has little neural meaning in cross-dimensional feature integration, the values (0.11–0.19) of the covariance parameter *ρ* for the Mahalanobis sum *do* have a potential neurophysiological significance. It has long been known that stimulus-evoked responses and spontaneous activity are correlated between neurons in the cerebral cortex [30], though Ecker *et al*. [31] argue that some of this correlation could be removed in well-controlled experiments. Indeed, the idea of widespread correlation is implicit in our understanding of the origin of the electroencephalogram, and such correlation probably underlies the spontaneous fluctuations in the BOLD signal seen in the connected cortical areas during fMRI [32]. Such widespread correlations might be related to changes in overall alertness or in attention to a task [33], and changes in neuronal responsiveness might lead to trial-by-trial apparent correlation of neural messages about truly-independent sensory dimensions. If attentional or other factors operate across large areas of cerebral cortex, we might even expect such correlations to be cross-modal and, of course, we have found the same putative correlations in our experiments between natural visual changes and natural auditory changes. Moreover, such widespread correlations of neural activity might explain why the summation of cross-modal stimuli was the same for incongruous pairings as for congruous ones.

The responses of sensory neurons are also likely to be correlated for several other, less global reasons. First, sensory input signals are likely to be correlated because information from the world is correlated. For instance, when we see a small elongated element in one part of our visual field, it is very likely that, beyond it and along its long axis, we may see elongated elements of very similar orientation [24] since the small elements are all part of one collinear or slightly curved object. A sharp luminance boundary will activate multiple visual cortex neurons whose receptive fields are of different spatial scales [34]. Secondly, visual receptive-field construction is not orthogonal and the visual cortex code is redundant [35,36] so that neurons' stimulus response-spaces overlap to some extent with the spaces of other neurons. Neurons close together within cerebral cortex are likely to respond to similar stimuli because of the columnar layout of the cortex [37] and also because nearby neurons share their (noisy) inputs and modulatory controls.

Thus, there are many reasons why we should expect the neural signals to paired stimuli to be correlated. Correlations in nearby stimulus features or in the receptive-field structure of nearby neurons with shared connectivity might explain why a Mahalanobis rule governs summation *within* audition or *within* vision. It is harder to see how such correlations explain why the same rule should govern cross-modal summation between audition and vision, especially in the incongruous case.

We still have an incomplete understanding of the magnitudes of typical correlation coefficients between neural responses, especially when the systems are studied with natural stimuli [38]. There have been many studies, particularly in various cortical areas of the visual system (e.g. [38–42]) and other cortical systems (e.g. [43]) that have measured the correlation in activity of simultaneously recorded neurons. The correlation coefficients actually vary widely between pairs of neurons and are generally highest for neurons recorded very close together. However, noticeable positive correlations have been reported even for neurons recorded more than 10 mm apart in cortex and for neurons that do not respond to the same visual features [41]. Some studies have tried to distinguish correlations in the trial-by-trial noise (‘perceptual independence’) from correlations in the underlying coded signals (‘perceptual separability’) by asking whether neuronal responses show overall correlations or whether the correlations are only at a trial-by-trial level of modulation. Despite the wide variety of behaviours, all these studies have generally reported typical or average correlations in the noise and the coded signals of about 0.2 (but see [31]). A typical inter-neural correlation of about 0.2 compares remarkably well with our estimates (figure 2 and electronic supplementary material, figure S3) of the signal correlations implicit in human sensory integration.

## Acknowledgements

This research was supported by EPSRC/Dstl research grant (E037097/1 and EP/E037372/1) to D.J.T. and T.T., under the Joint Grants Scheme. M.P.S.T. was employed by that grant. We are very grateful for the helpful comments and suggestions of Prof. Nick Chater and his laboratory members, Dr Ian Cross and members of the Centre for Music and Science, Prof. Sarah Hawkins, Dr Jeroen Van Boxtel and Dr Patrick Rebuschat. We also thank Prof. Nigel Bennett and the three anonymous reviewers. The results from experiment 1 have been reported briefly in To *et al*. [20].

- Received September 2, 2010.
- Accepted October 1, 2010.

- This Journal is © 2010 The Royal Society