## Abstract

Perception of shaded three-dimensional figures is inherently ambiguous, but this ambiguity can be resolved if the brain assumes that figures are lit from a specific direction. Under the Bayesian framework, the visual system assigns a weighting to each possible direction, and these weightings define a prior probability distribution for light-source direction. Here, we describe a non-parametric maximum-likelihood estimation method for finding the prior distribution for lighting direction. Our results suggest that each observer has a distinct prior distribution, with non-zero values in all directions, but with a peak which indicates observers are biased to expect light to come from above left. The implications of these results for estimating general perceptual priors are discussed.

## 1. Introduction

Perception consists of interpreting two-dimensional retinal images of a three-dimensional world. The process of projecting a three-dimensional scene onto a two-dimensional retina necessarily discards information about the three-dimensional structure of that scene. This makes it impossible, in principle, to deduce all of the three-dimensional structure of a scene, and perception is therefore a classic example of an *ill-posed problem* (Poggio *et al*. 1985). However, even though such problems cannot be solved by deduction, acceptable solutions can be found using statistical inference. This involves using additional information, usually based on prior experience, to interpret two-dimensional retinal images, where this additional information takes the form of heuristics (rules of thumb) or constraints (rules which exclude certain ‘illegal’ solutions).

Within the Bayesian framework, this extra information is realized in the form of prior distributions. For example, the image marked with a cross in figure 1 can be interpreted as either convex or concave. The particular perception evoked by this image depends only on the direction from which the light is assumed to originate (Rittenhouse 1786; Brewster 1847; Oppel 1856; Kleffner & Ramachandran 1992). If the light source is assumed to originate from below, then the image is interpreted as convex, but if the light source is assumed to originate from above, then the image is interpreted as concave. As the latter is the usual interpretation made by human observers, it implies that we implicitly assume light originates from above. However, such demonstrations provide only a qualitative impression of where we assume the light source to be.

In reality, it is unlikely that human observers make the simplistic assumption that light comes only from above or below. More realistically, each observer assigns a probability to each possible light-source direction, which may be based on prior experience of the directions in which light sources originate.

These probability values collectively define a prior probability density function, which can be visualized using a polar plot, where the radial distance in a given direction indicates the relative probability that the light originates from that direction (as in figure 2*c*). In this paper, we show how it is possible to estimate the overall form of this prior, which, for reasons that will become obvious, we call the light-from-above prior. For the sake of clarity, note that we do not seek the physical distribution of lighting directions in the environment, which could be obtained empirically, but the prior as used by a given observer.

Our general strategy is closely related to that described in Paninski (2006). However, in the simulated experiment described by Paninski, the observer estimates a continuous parameter, and so each trial provides an equality constraint on the prior. Here, we concentrate on the more common case in which the observer makes a forced choice, so that each trial provides a weaker, inequality constraint on the prior.

## 2. Results

The shape information in our images is a function of two parameters: the direction $\theta$ of the light source and the three-dimensional shape $c$ of the imaged surface, which is either concave ($c=c_1$) or convex ($c=c_0$). On each trial, the observer is presented with an image $\mathbf{x}$ and makes a binary response: $r=1$ if the stimulus appears concave or $r=0$ if the stimulus appears convex (see appendix A).

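We assume that the shape perceived by the observer depends on two quantities: the posterior probability density function and the loss function. First, the probability (density) that the shape has value $c$ and that the light source is in direction $\theta$, given an image $\mathbf{x}$, defines the joint posterior probability density function $p(c,\theta|\mathbf{x})$. Second, the ‘cost’ of perceiving a shape as $\hat{c}$, when it is actually $c$, is defined by the loss function $\ell(\hat{c},c)$.

The observer's perception is assumed to correspond to the shape $\hat{c}$ that minimizes the expected loss, where this expectation is taken over all possible values of $\theta$ and $c$,

$$E[\ell(\hat{c})] = \sum_{c}\int \ell(\hat{c},c)\,p(c,\theta|\mathbf{x})\,\mathrm{d}\theta \tag{2.1}$$

$$\phantom{E[\ell(\hat{c})]} = \sum_{c}\ell(\hat{c},c)\left[\int p(c,\theta|\mathbf{x})\,\mathrm{d}\theta\right]. \tag{2.2}$$

Using Bayes' rule, the posterior is given by

$$p(c,\theta|\mathbf{x}) = \frac{p(\mathbf{x}|c,\theta)\,p(c,\theta)}{p(\mathbf{x})}, \tag{2.3}$$

where the observer's prior expectations about shapes and lighting directions define the joint prior distribution $p(c,\theta)$, and where the probability of the observed image for a given three-dimensional shape and lighting direction defines the likelihood function $p(\mathbf{x}|c,\theta)$. The integral in square brackets in equation (2.2) can now be rewritten as

$$\int p(c,\theta|\mathbf{x})\,\mathrm{d}\theta = \frac{1}{p(\mathbf{x})}\int p(\mathbf{x}|c,\theta)\,p(c,\theta)\,\mathrm{d}\theta, \tag{2.4}$$

so that the expected loss is

$$E[\ell(\hat{c})] = \frac{1}{p(\mathbf{x})}\sum_{c}\ell(\hat{c},c)\int p(\mathbf{x}|c,\theta)\,p(c,\theta)\,\mathrm{d}\theta. \tag{2.5}$$

In fact, each stimulus $\mathbf{x}$ is consistent with only two lighting directions, $\theta_{\mathbf{x}}$ and $\bar{\theta}_{\mathbf{x}}$, which are 180° apart (see appendix A). This implies that the likelihood $p(\mathbf{x}|c,\theta)$ is a delta function, which is zero except at $\theta=\theta_{\mathbf{x}}$ and $\theta=\bar{\theta}_{\mathbf{x}}$,

$$p(\mathbf{x}|c_0,\theta) = \delta(\theta-\theta_{\mathbf{x}}), \tag{2.6}$$

$$p(\mathbf{x}|c_1,\theta) = \delta(\theta-\bar{\theta}_{\mathbf{x}}). \tag{2.7}$$

Substituting equation (2.6) in equation (2.4) for $c=c_0$ yields

$$p(c_0|\mathbf{x}) = \frac{1}{p(\mathbf{x})}\int \delta(\theta-\theta_{\mathbf{x}})\,p(c_0,\theta)\,\mathrm{d}\theta \tag{2.8}$$

$$\phantom{p(c_0|\mathbf{x})} = \frac{p(c_0,\theta_{\mathbf{x}})}{p(\mathbf{x})}. \tag{2.9}$$

If the observer assumes that the stimulus shape and the lighting direction are independent, then the joint prior distribution $p(c_0,\theta_{\mathbf{x}})$ factorizes to yield

$$p(c_0|\mathbf{x}) = \frac{p_\theta(\theta_{\mathbf{x}})\,p(c_0)}{p(\mathbf{x})}, \tag{2.10}$$

where $p_\theta(\theta_{\mathbf{x}})$ is the prior over lighting direction and $p(c_0)$ is the prior for the shape $c_0$. A similar calculation for $c=c_1$ yields

$$p(c_1|\mathbf{x}) = \frac{p_\theta(\bar{\theta}_{\mathbf{x}})\,p(c_1)}{p(\mathbf{x})}. \tag{2.11}$$

Regardless of the values of $\theta_{\mathbf{x}}$ and $\bar{\theta}_{\mathbf{x}}$, each observer perceives the stimulus as either convex ($c_0$) or concave ($c_1$), and responds accordingly. Thus, together, $p(c_1)$ and $p(c_0)$ are a pair of co-determined observer-specific scalar priors, such that $p(c_1)+p(c_0)=1$. We call the prior $p(c_1)$ the concavity preference for a given observer, which can be estimated using the same method (described below) as is used for estimating the prior $p_\theta(\theta)$.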

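We choose the zero/one loss function to model the forced choice task, i.e. $\ell(\hat{c},c)=0$ for a correct decision ($\hat{c}=c$) and $\ell(\hat{c},c)=1$ for an incorrect decision ($\hat{c}\neq c$) (Bishop 1996); the optimal decision rule under this loss function minimizes the number of misclassified stimuli. Substituting this loss function into equation (2.5), we find that the observer should respond $r=0$ (convex) if the log posterior ratio

$$L = 10\log_{10}\frac{p(c_0|\mathbf{x})}{p(c_1|\mathbf{x})} \tag{2.12}$$

$$\phantom{L} = 10\log_{10}\frac{p_\theta(\theta_{\mathbf{x}})\,p(c_0)}{p_\theta(\bar{\theta}_{\mathbf{x}})\,p(c_1)} \tag{2.13}$$

$$\phantom{L} > 0, \tag{2.14}$$

where $\theta_{\mathbf{x}}$ and $\bar{\theta}_{\mathbf{x}}$ are the two lighting directions consistent with $\mathbf{x}$, and the response should be $r=1$ (concave) otherwise.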
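The deterministic rule can be illustrated with a minimal Python sketch. It assumes, as in the derivation above, that the posterior for each shape is proportional to the lighting prior at the direction consistent with that shape multiplied by the shape prior, and that the log ratio is expressed in decibels. The function and argument names are illustrative and are not taken from the original analysis code.

```python
import math


def log_posterior_ratio_db(prior_at_theta, prior_at_opposite, p_concave):
    """Log posterior ratio (dB) of convex (c0) over concave (c1).

    prior_at_theta    : lighting prior at the direction consistent with convex
    prior_at_opposite : lighting prior at the direction consistent with concave
    p_concave         : concavity preference p(c1); p(c0) = 1 - p(c1)
    """
    p_convex = 1.0 - p_concave
    ratio = (prior_at_theta * p_convex) / (prior_at_opposite * p_concave)
    return 10.0 * math.log10(ratio)


def deterministic_response(prior_at_theta, prior_at_opposite, p_concave):
    """Return r = 0 (convex) if the log ratio is positive, else r = 1 (concave)."""
    L = log_posterior_ratio_db(prior_at_theta, prior_at_opposite, p_concave)
    return 0 if L > 0 else 1
```

With equal shape priors, the response is determined entirely by which of the two candidate lighting directions has the larger prior value.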

This deterministic rule would lead to the same decision for all presentations of a given stimulus. In order to model the stochastic character of human decision making, we follow a general suggestion of Paninski (2006), and assume that our rule is stochastic (see §3). Specifically, we assume that the process (e.g. the observer's criterion) that compares the log posterior probability $\log p(c_0|\mathbf{x})$ with $\log p(c_1|\mathbf{x})$ is subject to noise. In order to be clear about the implications of this, we define

$$L_0 = 10\log_{10}p(c_0|\mathbf{x}), \tag{2.15}$$

$$L_1 = 10\log_{10}p(c_1|\mathbf{x}), \tag{2.16}$$

and rewrite equation (2.14) as $L=L_0-L_1$. We assume that the distribution of $L_0$ values is Gaussian with mean $\bar{L}_0$ and standard deviation $\sigma$, and that the distribution of $L_1$ values is Gaussian with mean $\bar{L}_1$ and also with standard deviation $\sigma$. As $L_0$ and $L_1$ are both Gaussian with variance $\sigma^2$, $L$ is also Gaussian, with mean $\bar{L}=\bar{L}_0-\bar{L}_1$ and (for independent noise) variance $\sigma_L^2=2\sigma^2$. For simplicity, we assume that $\sigma$ is the same for all lighting directions.

Note that we have chosen to measure the relative log likelihood (sometimes called evidence) in decibels (dB), as suggested by Jaynes (2003). This allows easy comparison of levels of evidence. For example, evidence of 3 dB for a hypothesis means that it is about twice as likely as its alternative, and 10 dB means that it is 10 times as likely. Jaynes has suggested that an evidence threshold of approximately 1 dB is characteristic of many human judgements (Jaynes 2003).
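The decibel scale for evidence is simply ten times the base-10 logarithm of the odds. The short Python sketch below converts in both directions; the function names are illustrative.

```python
import math


def evidence_db(odds):
    """Evidence in decibels for a hypothesis with the given odds against its alternative."""
    return 10.0 * math.log10(odds)


def odds_from_db(db):
    """Odds corresponding to a given evidence level in decibels."""
    return 10.0 ** (db / 10.0)
```

On this scale, the nominal 1 dB threshold mentioned above corresponds to odds of roughly 1.26 to 1.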

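We assume that the probability $P(c_0|\mathbf{x})$ of the observer perceiving a shape $c_0$ is described by the cumulative distribution function of a Gaussian with zero mean and variance $\sigma_L^2$ (the variance of $L$), evaluated at the mean $\bar{L}$ of $L$,

$$P(c_0|\mathbf{x}) = \Phi\!\left(\frac{\bar{L}}{\sigma_L}\right) \tag{2.17}$$

$$\phantom{P(c_0|\mathbf{x})} = \frac{1}{\sigma_L\sqrt{2\pi}}\int_{-\infty}^{\bar{L}}\exp\!\left(-\frac{t^2}{2\sigma_L^2}\right)\mathrm{d}t \tag{2.18}$$

$$\phantom{P(c_0|\mathbf{x})} = 1-q, \tag{2.19}$$

where $\Phi$ is the standard Gaussian cumulative distribution function and $q$ is defined for brevity. For a given value of $q$, if the same stimulus is presented on $n$ trials and if responses are independent across trials, then the probability that the observer responds $r=1$ (concave) on $m$ of those $n$ trials is

$$p(m|q) = C_{n,m}\,q^{m}(1-q)^{n-m}, \tag{2.20}$$

where $C_{n,m}$ is a binomial coefficient. For a given light direction, $C_{n,m}$ is constant, and so it does not affect the value that maximizes $p(m|q)$, and is omitted below.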
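The stochastic choice model and the binomial probability of the response counts can be sketched in Python as follows. The sketch assumes that the probability of a ‘convex’ response is the Gaussian cumulative distribution function evaluated at the mean log posterior ratio divided by its standard deviation, so that the probability of a ‘concave’ response is its complement. Function names are illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_concave(mean_log_ratio_db, sigma_L_db):
    """Probability q of a 'concave' response, q = 1 - Phi(mean_L / sigma_L)."""
    return 1.0 - normal_cdf(mean_log_ratio_db / sigma_L_db)


def prob_m_concave(m, n, q):
    """Binomial probability of m 'concave' responses in n independent trials."""
    return math.comb(n, m) * q ** m * (1.0 - q) ** (n - m)
```

When the mean log posterior ratio is zero, the two responses are equally likely, and increasing evidence for convexity lowers the probability of a ‘concave’ response.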

We discretize the lighting direction into $N$ values, $\theta_i$: $i=1,\ldots,N$. For a given value of $\theta_i$, we present the stimulus $n_i$ times, and record the number $m_i$ of ‘concave’ responses, so that

$$p(m_i|q_i) \propto q_i^{m_i}(1-q_i)^{n_i-m_i}. \tag{2.21}$$

Thus, the $n_i$ binary responses of a single observer to repeated presentations of the same stimulus are maximally consistent with the value of $q_i$ that maximizes this expression, where $q_i$ is the probability that the observer perceives the shape as concave when the lighting direction is $\theta_i$.

When considered over all $N$ lighting directions, and assuming independent noise, the probability of the vector $\mathbf{m}=(m_1,\ldots,m_N)$ for a given vector $\mathbf{q}=(q_1,\ldots,q_N)$ is

$$p(\mathbf{m}|\mathbf{q}) \propto \prod_{i=1}^{N} q_i^{m_i}(1-q_i)^{n_i-m_i}, \tag{2.22}$$

which is the likelihood function of $\mathbf{q}$. The vector $\hat{\mathbf{q}}$ that maximizes $p(\mathbf{m}|\mathbf{q})$ is the maximum-likelihood estimate of the true value $\mathbf{q}^{*}$. Taking logs and multiplying by $-1$ transforms equation (2.22) into the negative log likelihood function of $\mathbf{q}$,

$$E_f = -\sum_{i=1}^{N}\left[m_i\log q_i + (n_i-m_i)\log(1-q_i)\right]. \tag{2.23}$$

As both the prior distribution and the concavity preference are implicit in $\hat{\mathbf{q}}$, this provides an estimate $\hat{p}_\theta(\theta)$ of the true prior distribution $p^{*}_\theta(\theta)$, and an estimate $\hat{p}(c_1)$ of the true concavity preference $p^{*}(c_1)$.
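As a minimal sketch, the negative log likelihood of the response counts can be computed as below, assuming independent binomial responses at each lighting direction and omitting the constant binomial coefficients. The clipping constant `eps` is an addition for numerical safety and is not part of the method described in the text.

```python
import math


def negative_log_likelihood(m, n, q, eps=1e-12):
    """Negative log likelihood of 'concave' counts m out of n trials, given probabilities q.

    m, n, q are sequences with one entry per lighting direction.
    """
    total = 0.0
    for m_i, n_i, q_i in zip(m, n, q):
        q_i = min(max(q_i, eps), 1.0 - eps)  # guard against log(0)
        total -= m_i * math.log(q_i) + (n_i - m_i) * math.log(1.0 - q_i)
    return total
```

Without any further constraint, this quantity is smallest when each probability equals the observed proportion of ‘concave’ responses at that direction.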

As discussed later, the unknown value of the discrimination parameter *σ*_{L} means that, in practice, the prior is not completely determined by equation (2.14) (see §3); but for the sake of brevity, we will refer to this as ‘estimating the prior’.

### (a) Smoothing the prior

Unless the dataset is very large, the prior distribution estimated by direct minimization of $E_f$ will not be very smooth. Smoothness of the prior probability for lighting direction is an important physical constraint, which we can model by regularizing the solution,

$$E = E_f + \lambda E_s, \tag{2.24}$$

where $E_s$ is a measure of the smoothness of $p_\theta(\theta)$, and $\lambda$ is proportional to the square of the expected angular scale over which the prior for lighting direction is expected to change. This regularization procedure can be thought of as specifying a ‘prior for priors’ (Paninski 2006).

Paninski suggests using the usual $L_2$ norm on the derivative of the prior to measure smoothness. A related measure that is more appropriate to this probabilistic situation (see §3) is the Fisher information, which measures the extent to which the prior $p_\theta(\theta)$ is localized, and which is a weighted version of the usual $L_2$ norm,

$$E_s = \int p_\theta(\theta)\left(\frac{\mathrm{d}\log p_\theta(\theta)}{\mathrm{d}\theta}\right)^{2}\mathrm{d}\theta \tag{2.25}$$

$$\phantom{E_s} = \int \frac{1}{p_\theta(\theta)}\left(\frac{\mathrm{d}p_\theta(\theta)}{\mathrm{d}\theta}\right)^{2}\mathrm{d}\theta. \tag{2.26}$$

In summary, for a given value of the smoothing parameter $\lambda$, the values of the $N$ elements of the discretized prior $p_\theta(\theta)$ and the concavity preference $p(c_1)$ can be estimated simultaneously as those values which minimize $E$ (equation (2.24)). The value of $\lambda$ was estimated using cross-validation (Bishop 1996; see appendix B), and the MatLab minimization procedure ‘fminsearch’ was used to find an estimate of $p^{*}_\theta(\theta)$ and $p^{*}(c_1)$.
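The Fisher-information smoothness term can be approximated on a discretized circular prior with finite differences. The Python sketch below is one possible discretization, not necessarily the one used for the reported results: it assumes density values sampled at equally spaced directions, evaluates the density at the midpoint of each pair of neighbours, and wraps around the circle. All names are illustrative.

```python
import math


def fisher_smoothness(p, eps=1e-12):
    """Finite-difference Fisher information of a density sampled on a circular grid."""
    n = len(p)
    dtheta = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        p_next = p[(i + 1) % n]      # wrap around the circle
        mid = 0.5 * (p[i] + p_next)  # density at the midpoint of the interval
        total += (p_next - p[i]) ** 2 / (max(mid, eps) * dtheta)
    return total


def regularized_objective(e_f, p, lam):
    """Data term plus lambda times the smoothness term."""
    return e_f + lam * fisher_smoothness(p)
```

A uniform prior has zero smoothness penalty, so the regularizer pulls the estimate towards the least localized density that still fits the data.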

### (b) Results for simulated observer

In order to test our methods, we first analysed data from a simulated observer with a known prior . The prior was defined as a von Mises distribution (Fisher 1995) , with location parameter *μ*=−45° and dispersion parameter *κ*=0.33. The value of the smoothing parameter *λ* has no explicit representation when generating data for the simulated observer, and cross-validation (appendix B) was used to find an estimate of (figure 2). This was then used with the known value of *σ*=1 to estimate the simulated observer's prior for lighting direction and its concavity preference . The concavity preference of this simulated observer had been defined as *p*(*c*_{1})=0.5, and was subsequently estimated as . The method also recovered an accurate estimate of the prior, as shown in figure 2*c*.
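A simulated observer of this kind can be sketched in Python. The sketch discretizes a von Mises density, proportional to exp[κ cos(θ − μ)], over 36 directions and normalizes it to sum to one. It then assumes that each direction is paired with the direction 180° away, that the direction θ itself is the one consistent with the convex interpretation, and that responses follow the Gaussian choice model of §2. The number of trials per direction, the noise value and the random seed are illustrative choices, not values taken from the original simulation.

```python
import math
import random


def von_mises_prior(n_dirs=36, mu_deg=-45.0, kappa=0.33):
    """Discretized von Mises prior over n_dirs equally spaced directions (sums to 1)."""
    thetas = [i * 360.0 / n_dirs for i in range(n_dirs)]
    weights = [math.exp(kappa * math.cos(math.radians(t - mu_deg))) for t in thetas]
    z = sum(weights)
    return thetas, [w / z for w in weights]


def concave_probabilities(prior, p_concave=0.5, sigma_L=1.0):
    """Probability of a 'concave' response at each direction (n_dirs must be even)."""
    n = len(prior)
    qs = []
    for i in range(n):
        opposite = (i + n // 2) % n  # direction 180 degrees away
        ratio = (prior[i] * (1.0 - p_concave)) / (prior[opposite] * p_concave)
        L = 10.0 * math.log10(ratio)  # log posterior ratio in dB
        p_convex = 0.5 * (1.0 + math.erf(L / (sigma_L * math.sqrt(2.0))))
        qs.append(1.0 - p_convex)
    return qs


def simulate_counts(qs, n_trials=32, seed=0):
    """Draw the number of 'concave' responses at each direction."""
    rng = random.Random(seed)
    return [sum(rng.random() < q for _ in range(n_trials)) for q in qs]
```

The resulting counts can then be passed to the estimation procedure to check whether the known prior is recovered.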

### (c) Results for human observers

Using cross-validation (appendix B), the estimated value of the smoothing parameter was (figure 3). This was then used with *σ*=2 to estimate each observer's prior *p*_{θ}(*θ*) (figure 4). In each case, the estimated prior is biased towards the upper left, in agreement with previous findings on group average data (Mamassian & Landy 2001). Thus, the left biases observed in each posterior in figure 4 and in Mamassian & Goutcher (2001), as well as the left and right biases reported in Sun & Perona (1998) and Adams *et al*. (2004), are probably due to a bias in each observer's prior, rather than a bias in the likelihood function. The estimated prior concavity preferences for all observers were within the range , compared with findings for the posterior in Adams *et al*. (2004) (0.44), which used similar stimuli. Details of the experimental procedure are given in appendix A.

## 3. Discussion

When an observer is asked to report the concavity/convexity of a shape for a range of different lighting directions, the resultant set of responses (usually depicted as a polar plot) represents a sample from their posterior probability density function for shape. It is this sample from the observer's posterior which has been used in all previous experiments to provide estimates of observers' posterior for lighting direction.

The main contribution of this paper is a method for using this sampled posterior, in combination with a likelihood function and a loss function, to estimate the prior probability density function for lighting direction and the prior for concavity preference in individual observers. In order to achieve this, we assume plausible forms for the likelihood and loss functions. For the loss function, we assume that each observer attempts to minimize the number of misclassified stimuli, an objective which corresponds to making responses consistent with the mode of the posterior probability density function. With regard to the likelihood function, each convexity/concavity response is consistent with one of two possible lighting directions, which effectively implies that the likelihood function is a delta function with non-zero values corresponding to these two lighting directions. This provides a posterior which is proportional to the prior for exactly two lighting directions and two shapes (convex/concave). An estimate of each observer's prior and concavity preference was then obtained by minimizing a regularized (smoothed) version of the negative log likelihood of the sampled posterior.

### (a) Related work

Research on motion perception explained the change in perceived speed that occurs at different levels of contrast by assuming a specific (Gaussian) form for the speed prior (Weiss *et al*. 2002). Other researchers assume that the mean of the posterior coincides with the true stimulus value in a sensorimotor task (Körding & Wolpert 2004) or that (i) the log of the prior is a straight line, (ii) the likelihood is Gaussian, and (iii) the mean of the posterior is the true mean (Stocker & Simoncelli 2006*a*). We make none of these assumptions.

A parametric estimate of the lighting prior has previously been obtained (Mamassian & Landy 2001) under the assumption that it can be described by a two-parameter von Mises distribution (see below).

The method described here is inspired by Paninski (2006). However, our method is different from Paninski's in two key respects. First, the stochastic choice model assumes that the log posterior probabilities (and not posterior probabilities) are subject to additive Gaussian noise (equation (2.17)). This has a number of advantages. (i) The chosen value of *σ* corresponds naturally to a threshold value for the evidence (in the sense of log posterior ratio) needed to obtain a given choice rate in the presence of encoding noise. (ii) There are no problems of positivity in adding an unbounded noise contribution to probability values which should be positive. (iii) The neural encoding of log probabilities has been shown to have a direct neural interpretation as an approximation to Poisson noise in neural populations (Gold & Shadlen 2001).

Second, we have replaced the *L*_{2} regularizer used in Paninski (2006) with Fisher information. This is more closely related to the probabilistic nature of the problem. Essentially, regularization using Fisher information (equation (2.26)) tries to satisfy the experimental constraints using the least localized prior density. By up-weighting the contribution for low probabilities (i.e. by 1/*p*_{θ}(*θ*)), the Fisher regularizer takes account of the fact that small ripples in low-probability regions are just as significant as larger ripples at higher probability values when the task requires a likelihood ratio judgement. There is inevitably a trade-off between the form of the estimated prior and the nature of the smoothing function. However, because the Fisher regularizer seeks that prior with the least localized density, it can be interpreted as the regularizer of least commitment.

### (b) The experimental task

The design of the concavity–convexity task was chosen for a number of reasons. First, we have chosen a forced choice task. Experiments in which the observer provides an explicit estimate of lighting direction on each trial could provide more powerful constraints on the prior. However, asking observers to estimate lighting direction is an unnatural task, and is therefore likely to yield data that are both biased and noisy. Although we have seen that the forced choice experiment leaves some aspects of the prior unconstrained, it requires far fewer modelling assumptions than parameter estimation alternatives, and so the information that is obtained is more reliable.

Second, although the question of the modification of the prior by feedback is of great interest (Adams *et al*. 2004), no feedback was given here, and there is no correct response for the chosen stimuli. This is important because, in most applications, even a small number of trials with feedback reduce the dependence of the posterior on the prior to insignificance (Mele & Rawling 2004; indeed, this ‘washing out’ property is often invoked to protect Bayesian methods from the consequences of choosing incorrect priors). In our experiment, neither the posterior nor the prior can be updated as a consequence of feedback. It is generally assumed that exposure to a biased population of stimuli (e.g. exposure to mainly concave stimuli) induces a shift in the prior. However, this appears to be the case only if feedback is given to correct the interpretation of ambiguous stimuli. Observers adapt their visual interpretation of stimuli such as those in figure 1, provided they are given haptic feedback on those stimuli (Adams *et al*. 2004). Moreover, this adaptation was found to affect performance on a different (lightness judgement) task, which required an assumption regarding light direction, indicating a shift in the mean of the light-from-above prior. From a statistical perspective, this makes sense. Decisions based on a series of measurements with corrective feedback are initially based mainly on prior expectations. However, the corrective feedback can be used to update the prior, making future decisions more reliable, as in the classic Kalman filter (Kalman 1960). By contrast, exposure to a biased population of stimuli without feedback induces after-effects in the opposite direction to that predicted by a shift in the prior. These after-effects are consistent with a change in the likelihood function and not in the prior (Stocker & Simoncelli 2006*b*). In our experiment, observers were exposed to an unbiased population of stimuli and received no feedback. Given the above considerations, this suggests that the prior and likelihood were not affected by the stimuli, and were reasonably constant throughout the experiment.

Third, we have used stimuli which are essentially noise free. Many visual tasks have unavoidable sensory noise, and when this is not the case, experimenters have added artificial noise, specifically in order to allow a Bayesian analysis. By virtually eliminating this sensory noise in a very simple task, we have ensured that any stochastic variation in responses must be a result of noise in the internal encoding of variables used in the decision process, noise which we have modelled by the parameter *σ*.

### (c) Estimating the discrimination parameter

Our estimate of the prior depends on the value of the discrimination parameter *σ*, and we have not addressed how to fix a value for *σ*. This parameter cannot be estimated directly from experimental data because, for any given value of *σ*, there is a prior which fits the observed data equally well, as shown in figure 5. This ambiguity is unavoidable for judgement tasks that depend only on likelihood ratios, which comprise the majority of choice tasks (Green & Swets 1966). For the task considered here, this dependence is made explicit in equation (2.17), where the posterior probability is seen to be a function of the ratio *L*/*σ*, so that smaller log likelihood differences *L* can always be reliably detected by using a smaller value for *σ*.
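The ambiguity can be made concrete with a short Python sketch. Assuming the Gaussian choice model of §2, an observed proportion of ‘concave’ responses fixes only the ratio of the log posterior ratio to the noise parameter, so each assumed noise value implies a different log posterior ratio (and hence a different prior). The function name is illustrative.

```python
from statistics import NormalDist


def implied_log_ratio_db(q_concave, sigma_L):
    """Log posterior ratio (dB) implied by a 'concave' response rate for a given noise level."""
    return sigma_L * NormalDist().inv_cdf(1.0 - q_concave)
```

For example, doubling the assumed noise parameter doubles the log posterior ratio needed to explain the same response rate, which is why the data alone cannot fix its value.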

We have chosen a value *σ*=2 dB to analyse human data, which is a generous approximation to the 1 dB assumed as a nominal value for the discrimination threshold for human judgements (Jaynes 2003). We anticipate that an analysis similar to that proposed by Jazayeri & Movshon (2006) based on Poisson statistics of individual model neurons would constrain the value of *σ*.

We note that this choice of discrimination parameter gives estimated lighting priors (figure 4), which are similar in shape to the von Mises distributions assumed in Mamassian & Landy (2001), but which are less localized than implied by the values of their estimated concentration parameters. This is consistent with our aim to use the prior of least commitment.

### (d) Priors for other parameters

The method described for estimating perceptual priors can, in principle, be applied to a variety of other parameters. These include priors for low-level parameters (e.g. speed, direction, line orientation, colour, spectral illuminance), but could also be extended to high-order parameters (e.g. faces, words, syllables).

### (e) More complex priors

In this study, we have just two variable parameters, light direction (*θ*) and the convexity/concavity (*c*) of a fixed shape, and there is no reason to expect these parameters to be correlated in the physical world. Hence, we were able to assume independence and factorize the joint prior *p*(*θ*,*c*)=*p*_{θ}(*θ*)*p*(*c*). This assumption was essential in order to make the estimation problem tractable, but it may not be justified in general.

A prior is just the re-scaled marginal distribution of a multivariate prior distribution. In this study, we have kept all parameters constant except light direction (*θ*) and the convexity/concavity (*c*) of a fixed shape. This implies that the prior we have estimated is the marginal of a two-dimensional joint distribution *p*(*θ*, *c*). Moreover, this joint distribution is itself a marginal distribution of a high-dimensional prior distribution with axes that include parameters such as shape, illuminance spectrum, multiple light sources, colour and stereo disparity. Had we the time and the means to find the light-from-above marginal of this high-dimensional prior distribution, it is possible that the result would be quite different.

## 4. Conclusion

If Helmholtz was correct in stating that perception is a form of ‘unconscious inference’ (von Helmholtz 1867), then this implies the existence of a posterior (which determines a perception), a likelihood function (the conditional probability of the retinal image) and a prior (the observer's expectations about the statistical structure of the visual world). Studies in computational neuroscience suggest that the visual system is adapted to the statistical structure of its physical environment (Olshausen & Field 2004). Moreover, this adaptation occurs over a range of time scales, and shapes the evolution of the visual system over generations, and the transfer functions of visual neurons over a matter of seconds (Rieke *et al*. 1996). Here, we have described a method for characterizing the prior for lighting direction. We anticipate that this method will be used to characterize many other priors used for perceptual inference.

## Acknowledgments

Thanks to Stephen Isard for his useful discussions.

## Appendix A. Experimental methods

### (a) Participants

There were eight observers, in the age range of 21–26 years (mean age=22.7). Observers all gave their informed consent and were paid £5 sterling.

### (b) Apparatus and procedure

The experiment was run in a dimly lit room. Stimuli were generated using MatLab (v. 7.3.0 R2006b) and PsychToolbox (v. 3.0.8) (Pelli 1997). The observer viewed stimuli on a 17 inch TFT monitor, at a distance of 57 cm, using a chin rest. Each observer completed 576 trials in each of two sessions, one in the morning and one in the afternoon (not on the same day), making a total of 1152 trials. Within each session, stimuli were presented in 16 blocks of 36 trials each. After each block of 36 trials, the observer was able to take a break. In each trial, a stimulus was presented with one of the discs marked with an ‘×’ in the outermost corner, as in figure 1. The observer's task was to indicate whether the marked disc appeared to be convex or concave by pressing one of two response keys. Each stimulus remained on the screen until the observer made a response, after which the screen went blank, and there was a pause of 0.5–1 s before the next stimulus appeared. Observers received no feedback.

The lighting direction adopted one of 36 directions ‘around the clock’, at intervals of 10°. For each lighting direction, the stimulus had two complementary configurations. In one configuration, the top left and bottom right discs were convex, whereas the top right and bottom left were concave, and in the complementary configuration it was the other way around. The reason for having two configurations per lighting direction was to ensure that each stimulus looked identical to its complementary configuration when lit from 180° further around the clock. Each disc position (e.g. top left) in each configuration was presented twice at each lighting direction, making a total of 1152 trials (i.e. 4 positions×2 configurations×2 repeats×36 light orientations×2 sessions).
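The trial count can be checked by enumerating the design. The Python sketch below builds the list of conditions for one session; the labels and the choice of 0° as the first direction are illustrative assumptions.

```python
from itertools import product

POSITIONS = ["top left", "top right", "bottom left", "bottom right"]
CONFIGURATIONS = [0, 1]                   # the two complementary configurations
REPEATS = [0, 1]                          # each condition is shown twice
DIRECTIONS_DEG = list(range(0, 360, 10))  # 36 lighting directions, 10 degrees apart


def session_trials():
    """All (position, configuration, repeat, direction) conditions for one session."""
    return list(product(POSITIONS, CONFIGURATIONS, REPEATS, DIRECTIONS_DEG))
```

One session contains 576 conditions, which matches 16 blocks of 36 trials, and two sessions give the stated total of 1152 trials.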

## Appendix B. Estimating lambda

Cross-validation consists of splitting each observer's data into two subsets, a training dataset $s_{\mathrm{train}}$ and a test dataset $s_{\mathrm{test}}$. For each putative value of $\lambda=\lambda_j$, the training data $s_{\mathrm{train}}$ were used to estimate $\hat{\mathbf{q}}$ (and therefore the prior) by minimizing $E$. Setting $\mathbf{q}=\hat{\mathbf{q}}$ in equation (2.23), $E_f(s_{\mathrm{test}})$ was then evaluated using the test data $s_{\mathrm{test}}$, which yields an estimate of the negative log likelihood of the test data for $\lambda_j$. This procedure is repeated over a range of values for $\lambda_j$, and the value of $\lambda_j$ that minimizes $E_f(s_{\mathrm{test}})$ is taken to be $\hat{\lambda}$. In order to obtain a robust estimate for $\hat{\lambda}$, this whole procedure was repeated using four runs, as follows. Initially, the data were split into four subsets. On each run, three subsets were combined to make the training set, and the remaining subset was used as the test set, so that each of the four subsets took its turn as the test set on exactly one run. Each run yielded a curve for $E_f(s_{\mathrm{test}})$ as a function of $\lambda$, and these four curves were averaged. The value of $\lambda$ corresponding to the minimum of this average curve was taken to be $\hat{\lambda}$ for a single observer. The value of $\sigma$ was set to $\sigma=1$ for the simulated observer and to $\sigma=2$ for human observers.
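The four-run procedure can be sketched in Python as follows. How trials were assigned to the four subsets is not stated, so the interleaved split used here is an assumption, and `fit` and `test_error` are hypothetical callables standing in for the minimization of the regularized objective and the evaluation of the data term on held-out trials.

```python
def four_fold_splits(trials, n_folds=4):
    """Yield (train, test) pairs; each subset is the test set on exactly one run."""
    folds = [trials[k::n_folds] for k in range(n_folds)]  # interleaved split (assumed)
    for k in range(n_folds):
        test = folds[k]
        train = [t for j, fold in enumerate(folds) if j != k for t in fold]
        yield train, test


def select_lambda(trials, lambdas, fit, test_error):
    """Return the lambda with the smallest test error averaged over the runs.

    fit(train, lam) -> fitted model (hypothetical callable)
    test_error(model, test) -> data term evaluated on the test set (hypothetical callable)
    """
    mean_errors = []
    for lam in lambdas:
        errors = [test_error(fit(train, lam), test)
                  for train, test in four_fold_splits(trials)]
        mean_errors.append(sum(errors) / len(errors))
    best = min(range(len(lambdas)), key=mean_errors.__getitem__)
    return lambdas[best], mean_errors
```

Averaging the four error curves before taking the minimum, rather than averaging four separate minima, follows the description above.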

## Appendix C. The mean vector

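The mean vector is the mean of the estimated prior distribution. The direction of this vector indicates the direction of the bias (anisotropy) in the prior and its length shows the amount of bias. For a discretized prior whose values $\hat{p}_\theta(\theta_i)$ sum to one, the *x* and *y* components of the mean vector are $\sum_i \hat{p}_\theta(\theta_i)\cos\theta_i$ and $\sum_i \hat{p}_\theta(\theta_i)\sin\theta_i$, respectively.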
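The mean vector of a discretized prior can be computed with a few lines of Python. The sketch assumes that the prior values sum to one and that angles are measured in degrees, anticlockwise from the positive *x* axis; this angle convention is an assumption and may differ from the one used in the figures.

```python
import math


def mean_vector(thetas_deg, prior):
    """Return (x, y, direction in degrees, length) of the mean vector of a circular prior."""
    x = sum(p * math.cos(math.radians(t)) for t, p in zip(thetas_deg, prior))
    y = sum(p * math.sin(math.radians(t)) for t, p in zip(thetas_deg, prior))
    length = math.hypot(x, y)
    direction = math.degrees(math.atan2(y, x))
    return x, y, direction, length
```

A uniform prior gives a mean vector of length zero (no bias), whereas a prior concentrated on a single direction gives a vector of length one pointing in that direction.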

## Footnotes

- Received November 10, 2008.
- Accepted January 14, 2009.

- © 2009 The Royal Society