Abstract
Neurons that respond selectively to the orientation of visual stimuli were discovered in V1 more than 50 years ago, but it is still not fully understood how or why this is brought about. We report experiments planned to show whether human observers use crosscorrelation or autocorrelation to detect oriented streaks in arrays of randomly positioned dots, expecting that this would help us to understand what David Marr called the ‘computational goal’ of V1. The streaks were generated by two different methods: either by sinusoidal spatial modulation of the local mean dot density, or by introducing coherent pairs of dots to create moiré patterns, as Leon Glass did. A wide range of dot numbers was used in the randomly positioned arrays, because dot density affects cross and autocorrelation differently, enabling us to infer which method was used. This difference stems from the fact that the crosscorrelation task is limited by random fluctuations in the local mean density of individual dots in the noisy array, whereas the autocorrelation task is limited by fluctuations in the numbers of randomly occurring spurious pairs having the same separation and orientation as the deliberately introduced coherent pairs. After developing a new method using graded dot luminances, we were able to extend the range of dot densities that could be used by a large factor, and convincing results were obtained indicating that the streaks generated by amplitude modulation were discriminated by crosscorrelation, while those generated as moiré patterns were discriminated by autocorrelation. Though our current results only apply to orientation selectivity, it is important to know that early vision can do more than simple filtering, for evaluating autocorrelations opens the way to more interesting possibilities, such as the detection of symmetries and suspicious coincidences.
1. Introduction
In this paper, we report results from psychophysical experiments using human observers that strongly suggest there are two different mechanisms for discriminating orientation in early vision. Hubel & Wiesel [1,2] discovered that most of the neurons of V1 were highly selective, not only for the position in the visual field of a visual stimulus, but also for other properties: a particular neuron only responds to a visual stimulus whose orientation, direction of motion and spatial frequency content lie within particular ranges whose mean values and widths vary from neuron to neuron. As well as discovering the orientation and motion selectivity of neurons in primary visual cortex, Hubel & Wiesel [1,2] also noted that there were two classes, which they named simple and complex cells. Subsequent work on different species, and work using awake, behaving animals, has complicated the picture, but it is generally agreed that simple cells have smaller receptive fields than complex cells, have more direct inputs from LGN afferents and behave in a linear manner that can be explained in terms of the excitatory and inhibitory zones of their receptive fields [3,4]. It is also widely accepted that the simple cells act as tuned spatial filters having a wide range of peak sensitivities to position, orientation and spatial frequency, and that they do this by having Gabor-like receptive fields that are crosscorrelated with the patches of the image that overlie them.
Complex cells tend to have larger receptive fields and to combine information from different parts of them in more complex ways [5–8]; they are more often directionally selective to motion than simple cells, and may be divisible into many different subcategories, possibly better described as a range of hybrids rather than just two contrasting types ([9]; but [10] do not fully subscribe to this interpretation). Hubel & Wiesel introduced the idea that the complexity of receptive field properties was increased in V1 and other visual areas by repeated transformations of the type exemplified by the simple cell/complex cell transition [2,7], though they recognized that the inability to be more precise about the nature of such hierarchical transformations made it seriously incomplete. We think this is the crux of the problem, and it has added interest because the discovery of pinwheels [11,12] is leading to a wealth of new information about the connections between neurons in V1; these organized connections must surely be doing more than the very simple task allotted to them in the diagrams illustrating Hubel & Wiesel's hierarchy, but what does this extra work achieve?
A vast amount of information about the physiology and anatomy of V1 has now been acquired, and has been incorporated in neural models of early vision, such as energy models of motion analysis [13–15] and normalization models of cortical responses [6]. These are satisfactory in some ways, but not in others. We cannot embark on a fair, critical review here, but our approach is largely motivated by the failure of the above models to provide insight into early vision at the level of computational theory. David Marr [16] suggested this was the first, and clearly in his view the most important, of the three levels at which information-processing mechanisms have to be understood; he defined it as the level asking ‘What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?’.
We thought it might be possible to answer some of these questions by psychophysical experiments on human subjects, and the first question we chose to ask concerned the mechanism of selectivity for orientation. We think this is the most distinctive of the new types of pattern selectivity to appear in V1, and therefore designed psychophysical tests that we hoped would answer the following specific computational question: ‘Does V1 use crosscorrelation, or autocorrelation, to detect oriented streaks in a noisy image?’ The first possibility arises from the widely accepted suggestion, described above, that simple cells act as oriented spatial filters. For autocorrelation the situation is different, because there is no well-tested physiological example of its occurrence. Reichardt [17] introduced the radical idea that it is an important mechanism for detecting motion in the early stages of the visual pathway in the beetle Chlorophanus, but later developments of this model have tended to regard it as crosscorrelation in the spatiotemporal domain, thus obscuring the original connection with autocorrelation [13–15]. Glass [18] suggested that the moiré effects generated by coherently oriented pairs in otherwise randomly positioned dots were a direct demonstration that the human visual system also computes local autocorrelations at an early level; but for many, the absence of a well-tested physiological example remains a problem, even though this cannot be taken as proof that it does not occur.
The results of our psychophysical measurements turned out to be even more interesting than we expected, for they clearly indicate that either cross or autocorrelation can be used, depending on how the appearance of orientation had been generated.
The distinction between crosscorrelation and autocorrelation is crucial for our argument, so we repeat here the definitions from Haralick & Shapiro [19], and we shall follow this usage ourselves. Crosscorrelation γ, normalized for two-dimensional images, is given by
\[ \gamma = \frac{1}{N}\sum_{x,y}\frac{\bigl(I(x,y)-\bar{I}\bigr)\bigl(T(x,y)-\bar{T}\bigr)}{\sigma_{I}\,\sigma_{T}}, \qquad (1.1) \]
and it gives a measure of the similarity between a fixed, predetermined template T(x,y) and the pixel values in a patch of the image I(x,y) covering the same positions, each containing N pixels; Ī and T̄ denote the means and σ_{I} and σ_{T} denote the standard deviations of I(x,y) and T(x,y), respectively. This is profoundly different from autocorrelation α, given by
\[ \alpha(u,v) = \frac{\sum_{x,y} I(x,y)\,I(x+u,\,y+v)}{\sum_{x,y} I(x,y)^{2}}. \qquad (1.2) \]
This gives a measure of the similarity between the pixel values, whatever they happen to be, of one patch of the image and those of another patch shifted by u in x and v in y. Notice that autocorrelation does not specify what it is that is similar. It can thus be considered as a coincidence or symmetry detector, something that detects the abstract property of similarity or sameness, but does not discriminate between different examples of it: for these particulars, the response is invariant. These characteristics cannot be detected by crosscorrelation alone, for it will be seen from expression (1.1) that the contributions to γ from every particular location in a pattern are influenced by the value at the corresponding position in the template, but are not influenced by the concurrent contributions from all other locations in the pattern. Two input patterns that differed only in the joint input values at two different locations would necessarily have the same total value of γ, for none of the separate contributions to γ are determined by joint values at pairs of input positions.
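The contrast between the two measures can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the normalization of α by the summed squared pixel values is one common convention, and products whose shifted partner falls outside the patch are simply omitted. Two tiny images contain the same dot pair at different locations; a template matched to the first image gives very different values of γ for the two, whereas α at the pair's displacement is the same for both.

```python
import math

def cross_correlation(image, template):
    """Normalized crosscorrelation (gamma) of two equal-sized 2D patches."""
    i_vals = [v for row in image for v in row]
    t_vals = [v for row in template for v in row]
    n = len(i_vals)
    i_mean = sum(i_vals) / n
    t_mean = sum(t_vals) / n
    i_sd = math.sqrt(sum((v - i_mean) ** 2 for v in i_vals) / n)
    t_sd = math.sqrt(sum((v - t_mean) ** 2 for v in t_vals) / n)
    cov = sum((a - i_mean) * (b - t_mean) for a, b in zip(i_vals, t_vals)) / n
    return cov / (i_sd * t_sd)

def autocorrelation(image, u, v):
    """alpha(u, v): summed products of each pixel with its partner shifted
    by u in x and v in y, divided by the summed squared pixel values.
    Pixels whose partner lies outside the patch are skipped (an assumption)."""
    height, width = len(image), len(image[0])
    products = 0.0
    for y in range(height):
        for x in range(width):
            xs, ys = x + u, y + v
            if 0 <= xs < width and 0 <= ys < height:
                products += image[y][x] * image[ys][xs]
    energy = sum(val * val for row in image for val in row)
    return products / energy

# The same horizontal dot pair placed at two different locations.
image_a = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
image_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
template = image_a  # a template matched to image_a only

gamma_a = cross_correlation(image_a, template)  # 1.0: exact match
gamma_b = cross_correlation(image_b, template)  # -0.2: pair is elsewhere
alpha_a = autocorrelation(image_a, 1, 0)        # 0.5
alpha_b = autocorrelation(image_b, 1, 0)        # 0.5: invariant to location
```

The autocorrelation reports that a pair with displacement (1, 0) is present without specifying where it is, which is the invariance described above.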
The difference between these two operations (sometimes called first and second order for cross and autocorrelation, respectively) is important for understanding why changes in the mean dot density of random dot patterns result in such different effects. It is because, for crosscorrelation, only one of the variables whose products are summed has a random component, whereas for autocorrelation, both have. Thinking along these lines has its antecedents in Bela Julesz's conjecture that humans cannot distinguish between textures with identical second-order statistics, which was made in 1962. He later claimed that he had disproved it [20], but we think it set the fashion for considering the order of the statistical analysis required for perceptual discriminations, and that it is correct in the slightly modified form we shall give later.
(a) Choosing the tasks
The plan of our experiment was to measure the ability of the human visual system to discriminate regular, patterned features presented in arrays of randomly positioned dots. The threshold strength of the signal for the regular feature changes when the mean density of the randomly positioned dots in the array is changed, but the expected relationship between the two differs according to whether cross or autocorrelation is used for the discrimination. Conversely, we can infer which mechanism is being employed by observing which relationship is being followed. We therefore wanted two test tasks having the following appearances and ranges of adjustment:

(a) they should look as similar as possible, for if they were readily distinguishable by subjective appearance at threshold stimulus strengths, the change of method from cross to autocorrelation or vice versa might involve higher cognitive processes;

(b) the signal strength of the test patterns must be adjustable in order to determine thresholds;

(c) the noise that limits detection by cross or autocorrelation must also be variable; for both tasks, we achieved this by varying the initial mean numbers of dots allotted to each pixel position in the arrays, before introducing the regular feature whose threshold was to be measured.
We chose to test crosscorrelation by measuring thresholds for detecting the oriented streaks induced by low-amplitude sinusoidal spatial modulations of dot density in random dot patterns, partly because we already knew that this followed the inverse square root of unmodulated dot density over a considerable range, as expected theoretically for crosscorrelation (see eqn (8) in [21]), but also because we suspected that the streakiness would look very similar to that induced by coherent dot pairs in Glass patterns [18]. These moiré patterns would form a natural partner task for testing autocorrelation, and it has already been found that, over a limited range, the coherence threshold stays nearly constant with changes of dot density, as expected if autocorrelation is used [22,23].
Some of the test patterns generated for these tasks are shown in figure 1 and illustrate that, at signal strengths just above threshold, it is not easy to discriminate between streaks caused by dot density modulation and streaks caused by Glass pairs. It is also clear that there are no problems with grading signal strength; therefore, requirements (a) and (b) above are at least partially met. For the noise level (c), we have assumed that the standard deviation of dot density, when this is expressed as the mean number of dots per pixel, λ, is equal to the square root of this mean, and this was confirmed by our simulations for all the conditions encountered in these experiments (see §2). Note, however, that the noise when discriminating by autocorrelation is not the same as the noise when discriminating by crosscorrelation, because it depends upon the standard deviation of the number of spurious pairs having properties indistinguishable from the deliberately introduced coherent pairs, rather than upon the standard deviation of local dot density.
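The two expected relationships can be sketched with a rough calculation. This is a simplified argument under stated assumptions, not a derivation taken from the cited work: dot counts are treated as Poisson with mean λ per pixel, the modulation depth m and coherence c are assumed small, A denotes the number of pixels in the test area, coherence is taken as the fraction of dots belonging to coherent pairs, and the symbols D, S, m, c and A are introduced here only for illustration.

```latex
% Crosscorrelation: decision variable is a template-weighted sum of dot counts,
% with local mean density \lambda (1 + m\,T(x,y)) and a zero-mean template T.
D = \sum_{x,y} T(x,y)\, I(x,y), \qquad
\Delta\langle D \rangle = m\,\lambda \sum_{x,y} T(x,y)^{2}, \qquad
\sigma_{D} \approx \sqrt{\lambda \sum_{x,y} T(x,y)^{2}}
\;\Rightarrow\;
d' \approx m \sqrt{\lambda \sum_{x,y} T(x,y)^{2}}, \qquad
m_{\mathrm{threshold}} \propto \lambda^{-1/2}.

% Autocorrelation: signal is the number of coherent pairs; noise is the
% fluctuation in the number of spurious pairs at the same displacement.
S = \tfrac{1}{2}\, c\, \lambda A, \qquad
\langle \text{spurious pairs} \rangle \approx \lambda^{2} A, \qquad
\sigma_{\mathrm{spurious}} \approx \lambda \sqrt{A}
\;\Rightarrow\;
d' \approx \frac{c\,\lambda A}{2\,\lambda\sqrt{A}} = \tfrac{1}{2}\, c \sqrt{A}, \qquad
c_{\mathrm{threshold}} \text{ independent of } \lambda .
```

On log/log coordinates, these correspond to a slope of −0.5 for modulation thresholds and a slope of 0 for coherence thresholds.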
When calculating theoretical limits and doing computer simulations, we are primarily interested in how well a proposed mechanism or model collects together and uses the statistical information that the visual system has been presented with and that is needed to perform the task. What matters, then, is not simply how signal strength at the output varies with signal strength at the input under different conditions, but how signal-to-noise ratios attainable at the output compare with the signal-to-noise ratios attainable at the input under different conditions. It is the statistical efficiency of the process that matters, not just the signal strength. We have, however, kept our simulations as simple as possible, and have not introduced factors such as the intrinsic noise of neurons; we do not doubt that there are conditions where this is important, but for simplicity we chose in the first place to avoid such conditions.
The usual method of generating noisy dot arrays using randomly positioned dots has a serious limitation that we call occlusion: except at very low dot densities, a significant fraction of the dots programmed for a particular location will have no effect, because that location has already been occupied by a previously programmed dot. The way this was overcome by using graded dot intensities will be described in §2.
2. Methods
(a) Observers and apparatus
The procedures used were entirely non-invasive and were approved by the local ethics committee. The observers were the two authors plus another experienced psychophysicist, all having normal or corrected-to-normal vision. Aided by a chin rest, observers viewed the stimuli binocularly at 100 cm in a darkened, quiet room. The stimuli were circular with a diameter of 200 pixels and were presented on a 19 inch CRT monitor (CTX EX951F) (1024 × 768 pixels, 0.34 mm [H] × 0.34 mm [V] per pixel, 85 Hz frame rate). The monitor had a maximum luminance of 92.4 cd m^{−2} and was calibrated and linearized by constructing a lookup table. The programs were written in Matlab using the psychophysics toolbox extension [24].
(b) Stimulus generation
Examples of the types of patterns used in the experiments are shown in figure 1. The dots making up these patterns occupy a single pixel position within a circular region of 200 pixels diameter and had luminances controlled by an 8 bit D/A converter. Dot densities are expressed as the mean expected number of dots per pixel position. To generate the gratings, each pixel within the array was designated a probability determined by the eventual twodimensional sinusoidal pattern required, and numbers of dots determined for the required values of mean dot density were then distributed accordingly within the array. These gratings, with a spatial frequency of 7.44 cycles per degree, were oriented at 45° either to the left or right of vertical. To generate the Glass patterns, pairs of dots, with a dot separation of 8.3 arcmin, were placed at random within the circular region, with the pairs tilted at 45° either to the left or right of vertical.
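The two generation procedures can be sketched in Python as follows. This is an illustrative sketch, not the Matlab code used in the experiments: the function names, the definition of coherence as the fraction of dots belonging to coherent pairs, the redrawing of pairs that would straddle the edge of the region, and the sign convention for the tilt are all assumptions made here. With the 0.34 mm pixels viewed at 100 cm, a pair offset of (5, 5) pixels corresponds to roughly 8.3 arcmin at 45°, which is why that offset appears in the usage example.

```python
import math
import random

def disc_pixels(radius):
    """Integer pixel positions inside a circular region of the given radius."""
    return [(x, y)
            for y in range(-radius, radius + 1)
            for x in range(-radius, radius + 1)
            if x * x + y * y <= radius * radius]

def grating_counts(radius, mean_density, modulation, period_px, tilt, rng):
    """Dot counts per pixel for a sinusoidally modulated random-dot array.

    tilt is +1 or -1 and selects which way the 45-degree streaks lean.
    """
    pixels = disc_pixels(radius)
    k = 2.0 * math.pi / period_px
    # Density is constant along the streak direction (1, tilt) and varies
    # sinusoidally along the perpendicular direction.
    weights = [1.0 + modulation * math.sin(k * (x - tilt * y) / math.sqrt(2.0))
               for x, y in pixels]
    n_dots = round(mean_density * len(pixels))
    counts = dict.fromkeys(pixels, 0)
    for p in rng.choices(pixels, weights=weights, k=n_dots):
        counts[p] += 1
    return counts

def glass_counts(radius, mean_density, coherence, offset, rng):
    """Dot counts per pixel for a translational Glass pattern.

    coherence is taken here as the fraction of dots that belong to
    coherent pairs; offset is the pair displacement (dx, dy) in pixels.
    """
    pixels = disc_pixels(radius)
    inside = set(pixels)
    n_dots = round(mean_density * len(pixels))
    n_pairs = int(coherence * n_dots / 2)
    counts = dict.fromkeys(pixels, 0)
    placed = 0
    while placed < n_pairs:
        x, y = rng.choice(pixels)
        partner = (x + offset[0], y + offset[1])
        if partner in inside:  # redraw pairs that would straddle the edge
            counts[(x, y)] += 1
            counts[partner] += 1
            placed += 1
    # The remaining dots are positioned at random.
    for p in rng.choices(pixels, k=n_dots - 2 * n_pairs):
        counts[p] += 1
    return counts

rng = random.Random(0)
grating = grating_counts(20, 0.5, 0.3, 7.0, +1, rng)
glass = glass_counts(20, 0.5, 0.4, (5, 5), rng)
```

Both functions return the number of dots allotted to each pixel, so the same counts can be displayed either as ungraded white dots (any non-zero count shown as white) or as graded dots.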
In the initial experiments using only white dots, each pixel in the array held either zero or one dot, corresponding to black or white. For the main psychophysical experiments and simulations, we used pixels whose luminances could be varied, having values increasing linearly with the number of times a particular pixel position had been chosen to contain a dot, and these we called ‘graded dots’. For each value of mean dot density, M_{p}, we calculated the expected standard deviation of the number of dots per pixel, σ_{p}, and devoted the whole range of pixel luminances to cover the range (M_{p} ± 2 × σ_{p}). The output of the D/A converter was set to produce a luminance increasing linearly with the number of dots held by each particular pixel. Thus, the pixels occupied by (M_{p} + 2 × σ_{p}) or more dots were set to white, those occupied by (M_{p} − 2 × σ_{p}) or fewer dots were set to black and intermediate values were assigned the appropriate grey value. This assignment of dot luminances was chosen in order to improve the fidelity of the visual system's computations of cross and autocorrelations, on the assumption that the system responds linearly to luminances in the range covered by the graded dots.
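The graded-dot assignment can be sketched as a simple clipped linear mapping. The function name is hypothetical, σ_{p} is taken as the square root of M_{p} (the Poisson assumption stated elsewhere in the text), and the 256 output levels correspond to the 8 bit D/A converter; rounding to the nearest level is an assumption.

```python
import math

def graded_luminance(count, mean_density, levels=256):
    """Map a pixel's dot count to a grey level spanning mean +/- 2 s.d."""
    sigma = math.sqrt(mean_density)      # assumed Poisson: s.d. = sqrt(mean)
    low = mean_density - 2.0 * sigma     # this count or fewer -> black
    high = mean_density + 2.0 * sigma    # this count or more  -> white
    fraction = (count - low) / (high - low)
    fraction = min(1.0, max(0.0, fraction))  # clip to the displayable range
    return round(fraction * (levels - 1))

# With a mean of 4 dots per pixel, the range 0..8 dots spans black to white.
black = graded_luminance(0, 4.0)    # 0
white = graded_luminance(8, 4.0)    # 255
clipped = graded_luminance(10, 4.0) # 255 (counts above the range saturate)
```

Because the mapping is linear within the range, a pixel's luminance carries the dot count itself rather than only its presence or absence, which is what removes occlusion.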
(c) Threshold determination
We used a modified staircase procedure. Up to eight staircases for eight different mean dot densities were run simultaneously in random order until 20 stimuli had been presented on each staircase. A small fixation dot positioned at the centre of the stimulus area, together with a short beep, occurred 500 ms before each stimulus, which lasted 160 ms. The observer indicated on the keyboard whether he or she saw left or right tilting streaks, or was uncertain, and received auditory feedback for correct or incorrect responses, but none for signals of uncertainty. The response time was unlimited, but in practice lasted a second or two. Reversal points on each staircase were corrected by half a step value according to the direction of the reversal, and these were used to estimate thresholds and their standard errors. There were usually 6–12 reversals on a staircase, and if the false-positive rate was higher than 5 per cent, or the number of reversals on a staircase was below 6, the results were rejected. Other details of the procedure were conventional.
(d) Simulations
In the simulations presented here, we generated the dot arrays in the same way as they had been generated for the gratings and Glass patterns in the experiments. We then set the parameters for cross and autocorrelation at their optimum values for the tasks in hand. For crosscorrelation, this required the template to be matched exactly to the size, shape, spatial frequency, phase and orientation of the test patterns, and for autocorrelation, it required the complete test pattern to be multiplied by its copy shifted by one Glass pair separation in the direction of the separation. The simulation programs computed decision variables at narrowly spaced intervals of dot density for the expressions γ and α in equations (1.1) and (1.2) for samples of gratings and Glass patterns. For comparison with experimental thresholds, we determined the modulation or coherence for which the discriminability index d′, given by d′ = (M_{SN} − M_{N})/σ_{N}, is equal to 2, where M_{SN} is the mean of the distribution of values of cross or autocorrelations determined for sample populations at specified values of modulation or coherence. M_{N} and σ_{N} are the means and standard deviations of equivalent distributions of sample populations for zero modulation or coherence (i.e. noise-alone stimuli).
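The threshold criterion can be sketched in Python as follows. The function names are hypothetical, the use of the sample (n − 1) standard deviation for σ_{N} is an assumption, and the linear interpolation between tested signal strengths is one simple way of locating the point at which d′ reaches 2; the text does not say how this was done.

```python
import statistics

def d_prime(signal_plus_noise, noise_alone):
    """d' = (M_SN - M_N) / sigma_N from two samples of a decision variable."""
    m_sn = statistics.mean(signal_plus_noise)
    m_n = statistics.mean(noise_alone)
    sigma_n = statistics.stdev(noise_alone)  # sample s.d. (an assumption)
    return (m_sn - m_n) / sigma_n

def threshold_for_criterion(strengths, d_primes, criterion=2.0):
    """Signal strength at which d' first reaches the criterion, found by
    linear interpolation between adjacent tested strengths."""
    points = list(zip(strengths, d_primes))
    for (s0, d0), (s1, d1) in zip(points, points[1:]):
        if d0 < criterion <= d1:
            return s0 + (criterion - d0) * (s1 - s0) / (d1 - d0)
    return None  # the criterion was not bracketed by the tested strengths

# Toy values of a decision variable, for illustration only.
example = d_prime([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])                      # 2.0
threshold = threshold_for_criterion([0.1, 0.2, 0.3], [1.0, 1.5, 3.0])
```

In use, the two samples passed to `d_prime` would be values of γ or α computed from many simulated stimuli with and without the signal at a given dot density.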
When the dot density is low and ungraded white dots are being used, it can be assumed that the numbers of dots in single pixels are Poisson distributed, and therefore the standard deviation of dot numbers in any single pixel, or designated subset of pixels, is equal to the square root of the mean dot number, or sum of dot numbers. However, some of the test arrays are large and contain significant fractions of all the dots, and they may not be simple sums of the dots in individual pixels, but weighted sums, which are not in general Poisson distributed. We have done many of the simulations in two ways, first using the assumption that the standard deviations of the noise-alone means are equal to the square roots of the mean numbers of dots they contain, and second by generating large samples of such distributions and measuring their standard deviations directly. The assumption that they would be equal to the square root of the mean number of dots turned out to be a good approximation, provided that only low values (below about 50%) of modulation and coherence were used. Thus, the assumption that the standard deviation of dot numbers in any test array is equal to the square root of the mean number of dots it contains is a good approximation under the conditions of these experiments.
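A direct check of this kind can be sketched for the simplest case, the number of dots per pixel in an unmodulated array. The function name, array size and seed are arbitrary choices made for this illustration; the weighted sums mentioned above would need the corresponding weights applied before measuring the standard deviation.

```python
import math
import random

def pixel_count_stats(n_pixels, n_dots, rng):
    """Throw n_dots at random into n_pixels and return the mean and the
    sample standard deviation of the number of dots per pixel."""
    counts = [0] * n_pixels
    for _ in range(n_dots):
        counts[rng.randrange(n_pixels)] += 1
    mean = sum(counts) / n_pixels
    variance = sum((c - mean) ** 2 for c in counts) / (n_pixels - 1)
    return mean, math.sqrt(variance)

rng = random.Random(1)
mean, sd = pixel_count_stats(4000, 2000, rng)  # mean density of 0.5
ratio = sd / math.sqrt(mean)  # close to 1 if the square-root rule holds
```

A ratio near 1 indicates that the square-root assumption is adequate for the condition tested.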
3. Results
Figure 2 shows the result of the first experiment: log thresholds were measured for amplitude-modulated gratings and varying coherence Glass patterns as functions of log(λ), using white, ungraded dots. As generally expected from previous results (e.g. [21,23]), the results show slopes (using log/log coordinates) quite close to −0.5 for modulation thresholds at low dot densities, and quite close to zero for coherence thresholds also at low dot densities, but they both then deviate upwards from the expected values (see best-fitting lines of slopes −0.5 or 0 calculated for the first four points). The range of validity was thus limited to a 30-fold range of dot densities. At low values, it was subjectively clear to the observer that cognitive factors tended to be used for making the judgments, and this discouraged attempts to extend the test range downwards. However, the upper limit for the range of validity appears to be due to a quite different effect, that of occlusion, explained below.
In figure 2, the upward deviations from the predicted relations start when log(λ) ≈ −1, and are easily significant when log(λ) ≈ 0, corresponding to values for λ itself of 0.1 and 1, respectively. With our method of generating white dots, if two or more dots are allocated to a pixel, still only one is shown, and any excess allocations are occluded, having no effect whatever on the display. Calculation shows that the probability of two or more dots occurring in a single pixel exceeds 0.05 when λ ≥ 0.355, or log(λ) ≥ −0.45, and since occluded dots cannot help the observer to see oriented streaks, occlusion provides a satisfactory explanation for the experimental thresholds rising at this point.
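The calculation referred to here can be reproduced with a short Python sketch, assuming that the number of dots allocated to a pixel is Poisson distributed with mean λ. The function names are hypothetical, and the bisection search is simply one convenient way of finding the density at which the probability reaches 0.05.

```python
import math

def p_two_or_more(lam):
    """Probability that a pixel is allocated two or more dots when the
    number allocated is Poisson distributed with mean lam."""
    return 1.0 - math.exp(-lam) * (1.0 + lam)

def density_for_probability(target, low=0.0, high=5.0, tolerance=1e-9):
    """Dot density at which p_two_or_more reaches the target (bisection)."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if p_two_or_more(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

critical_density = density_for_probability(0.05)  # about 0.355
critical_log = math.log10(critical_density)       # about -0.45
```

Under the same assumption, the fraction of pixels shown as white is 1 − exp(−λ), so the displayed density falls increasingly short of the allocated density λ as λ grows.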
To test this explanation, figure 3a,b show threshold measurements like those of figure 2, but using graded dots. The results using ungraded white dots have been copied over from figure 2. Two points are immediately obvious: first, up to a value of log(λ) ≈ −0.5, the results using graded dots agree well with those using white dots; second, the results using graded dots demonstrate good approximations to the predicted inverse square root and constant coherence threshold relationships far above the value of log(λ) ≈ −0.5, the agreements both holding over 1000-fold ranges of dot density.
Figure 3 also shows the results of simulations for ideal detectors using crosscorrelation and autocorrelation (black lines). For graded dots, these have slopes of −0.5 and 0, accurately following the theoretical predictions. The grey lines curving upwards show the results for ungraded white dots, confirming that the simulations, like human observers, are affected by occlusion when the targets use ungraded white dots.
We think the slopes for ideal crosscorrelation and autocorrelation in figure 3 add support to the hypothesis that either method can be used for detecting the orientations of streaks. The fact that avoiding occlusion has such a dramatic effect in extending the ranges for which these relationships hold, and the fact that it works for the ideal simulations as well as for the human observers, strengthens the case.
As shown in figure 3, our current simulations have thresholds about 25 per cent of those of our observers for crosscorrelation, and about 8 per cent of those for autocorrelation. This could be taken to indicate that our ideal simulations are able to do a much better job of interpreting the information in the stimulus dots than the real visual system is able to do. This would, however, be misleading, because the ideal systems have an advantage over the natural system: their parameters are matched to the parameters of the stimulus, whereas the real system's parameters cannot be adjusted by the experimenter. An alternative would be to match the parameters of the stimulus to those of the natural system but we do not yet know their values. The main such parameter is likely to be the target area over which the information contained in the positions of the dots can be combined efficiently to give the correct impressions of the orientations of the streaks. This area is likely to be much less than the 15 deg^{2} provided by the full test area, so when making the comparison of figure 3, the ideal system has a great advantage over the real system because it uses the whole test area. This advantage should be reduced when a smaller test area is used, and this was found to be the case, but further experiments are needed to obtain proper estimates of optimal statistical efficiencies and the conditions under which they are obtained.
4. Discussion
(a) Comparisons with earlier results
The psychophysical results of figure 3 clearly indicate that early vision uses crosscorrelation to detect the oriented streaks produced in random dot arrays by sinusoidal spatial modulation of dot density, whereas it uses autocorrelation to detect the oriented streaks produced by coherent pairs, as in translational Glass patterns. The earlier studies referred to above had suggested that this was the case, but the technique for using graded dots to avoid occlusion at high dot densities leads to much more convincing experimental results. Notice that the technique for generating the graded dot displays was designed to improve the fidelity of the visual system's computations of cross and autocorrelations, so the fact that it works so well adds confidence to our belief that our psychophysical procedures genuinely test the use of these two types of computation.
We shall not understand the full significance of there being two computational methods for distinguishing orientation until we find out whether there are similar dichotomies for the other main computations thought to occur in early vision—those for motion, stereopsis and perhaps colour. It is also important to determine whether, as we believe, they both occur at a single level, or whether the two components occur at successive levels in early vision, for example V1 for crosscorrelation and V2 for autocorrelation, but a conclusive answer to this question is likely to need a neurophysiological approach.
As far as we know, there are no experimental results that conflict directly with ours, though some may at first appear to do so. Dakin [25], for example, suggests that crosscorrelation alone can detect Glass patterns, but he does not appear to have made quantitative comparisons to support his claim. It is also not clear whether his tests were aimed (as ours were) at the first of the two stages involved in detecting Glass patterns—that of detecting coherent pairs—or at the second stage, that of detecting whether the pattern of coherence is translational, radial, circular or hyperbolic. There is also a lack of positive evidence for autocorrelation in the neurophysiological analysis of responses to paired dots by Smith et al. [26], but while these results show good evidence for crosscorrelation, we do not think that any of them challenge our evidence that autocorrelations are also computed in early vision.
We are not aware of any other studies that have tested for the occurrence of autocorrelation in early vision by the method described here. It would certainly be interesting to know whether measurements of signaltonoise ratios in energy models [13–15] can mimic the results shown in figure 3.
(b) Significance of autocorrelation
If autocorrelation, or an equivalent wavelet computation, does occur, this must be important for it makes possible the discrimination of classes of patterns defined by joint properties of pairs of pixels. As pointed out in §1, the products that contribute to the value of a crosscorrelation are all between single pixel values and a prespecified template-weighting factor: the contributions of a particular pixel do not vary according to the simultaneous contribution of other pixels, and crosscorrelation alone is therefore blind to second-order statistical differences between patterns of pixels. On the other hand, it is quite different for autocorrelations; they are formed solely from the products of the values of two image pixels, so differences in the frequencies of joint events are certainly expected to cause distinguishable differences in α.
This has an enormous effect in increasing the total number of patterns that could, in principle, be distinguished from each other by autocorrelation, compared with the number distinguishable by crosscorrelation alone, and this immediately suggests an explanation for the explosive increase in the number of neurons per incoming afferent neuron that occurs when the visual pathway enters the cortex. This explosive increase has long been known to occur [27], and the ratio of cortical neurons to input afferent fibres has more recently been shown to reach the astonishing figure of 10 000 : 1 in the foveal projection of macaques [28].
Surely this explosive expansion must indicate that a radically different form of computation is being employed, and we think this is likely to be the systematic exploitation of autocorrelation in the cerebral cortex to discover second-order features in input patterns. In other words, the conjecture Julesz [20] made in 1962 and later withdrew was very nearly right: he should have said ‘The general ability to distinguish between patterns with different second-order statistics only becomes possible in the cerebral cortex, through its systematic use of autocorrelation’.
Because of the expansion, it is reasonable to expect that there is a huge number of different complex cells each connected to the same, or nearly the same, set of simple cells, but differing in the detailed pattern of their connections. If these differences amount to specifying different displacements (u,v) in equation (1.2) for a large number of different autocorrelations α, then we would have an array of elements that is, in effect, looking for prespecified autocorrelations in the filtered input from the image, and these might be the functional correlates of the geometrical arrangements that Ben-Shahar & Zucker [29] describe.
Our search for the computational goal of early vision has produced some evidence that auto as well as crosscorrelation is used in the pinwheels of V1 to detect higher order regularities in the visual input, such as symmetries and suspicious coincidences. Perhaps William James [30] was right in claiming that, in his words, ‘the sense of sameness is the very keel and backbone of our thinking’, and in that case the computational goal of V1 may turn out to be closer to that of the cerebral cortex as a whole than has been generally recognized.
Acknowledgements
The authors would like to thank the Gatsby Foundation for a generous travel and maintenance grant, and Trinity College, Cambridge, for additional support during the academic year 2008–2009. D.L.B. would like to acknowledge the support of the University of Évora for granting sabbatical leave and to thank the Fundação de Ciência e Tecnologia, Portugal, for a sabbatical grant (SFRH/BSAB/825/2008), and H.B.B. would like to thank David Cameron (Department of Psychology, University of Dundee) for help with early work on the generation of graded dot patterns. We would also like to thank the many critics of earlier versions of this paper for working so hard to improve its clarity.
 Received October 7, 2010.
 Accepted November 15, 2010.
 This Journal is © 2010 The Royal Society