AIML Research Seminar: Understanding Scene Understanding

How do humans represent a scene after a brief period of viewing? Research shows that the “gist” of a scene (i.e., the scene name and layout) is extracted almost immediately from the blurred visual periphery, but little is known about how scene understanding evolves with viewing fixations (i.e., what people choose to look at in a scene). We studied this question using an integrated behavioral and computational approach, with the aim of identifying scene metamers: scenes generated by a latent diffusion model that humans incorrectly believe are scenes they had just viewed (the original scenes).

To obtain these scene generations we developed Seen2Scene, an image-prompt adapter model that combines gist (blurred peripheral pixels) and fixation tokens (DINOv3 patches) to generate complete and plausible scenes from only these sparse visual inputs. Human participants were shown scenes (n = 300) and, using a gaze-contingent eye-tracking paradigm, were permitted to view each for 1, 3, 5, or 10 fixations. After the critical fixation (varied randomly from trial to trial), the scene disappeared and was replaced by a central fixation cross. Participants looked at this cross for five seconds, during which Seen2Scene generated, in real time, a scene conditioned on that participant’s own viewing fixations. Following this 5-second interval (needed to reliably complete the generation), the participant was briefly (200 ms) shown either the originally viewed scene or the Seen2Scene generation and was asked to judge whether it was the same as or different from the just-viewed original.
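
To give a concrete sense of how gist and fixation information could be combined as conditioning for a latent diffusion model, here is a minimal PyTorch sketch. The class name, dimensions, and the nearest-patch selection rule are illustrative assumptions, not the Seen2Scene implementation.

```python
# Illustrative sketch only: assembling conditioning tokens from a gist embedding
# (blurred periphery) and ViT patch tokens at fixated locations. All names and
# dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn


class GistFixationAdapter(nn.Module):
    def __init__(self, gist_dim=768, fix_dim=768, cond_dim=1024, n_gist_tokens=4):
        super().__init__()
        # Project the pooled embedding of the blurred peripheral image ("gist")
        # into a small set of conditioning tokens.
        self.gist_proj = nn.Linear(gist_dim, cond_dim * n_gist_tokens)
        self.n_gist_tokens = n_gist_tokens
        self.cond_dim = cond_dim
        # Project DINO-style patch tokens at fixated locations into the same space.
        self.fix_proj = nn.Linear(fix_dim, cond_dim)

    def forward(self, gist_emb, patch_tokens, fixations, grid_size):
        """
        gist_emb:     (B, gist_dim)   embedding of the blurred periphery
        patch_tokens: (B, N, fix_dim) ViT patch tokens for the full image
        fixations:    (B, K, 2)       fixation locations, normalized (x, y) in [0, 1]
        grid_size:    (H, W)          patch grid, with N = H * W
        returns:      (B, n_gist_tokens + K, cond_dim) tokens for cross-attention
                      conditioning of a latent diffusion U-Net
        """
        B = gist_emb.shape[0]
        H, W = grid_size
        gist_tokens = self.gist_proj(gist_emb).view(B, self.n_gist_tokens, self.cond_dim)

        # Map each fixation to the index of the nearest patch token.
        col = (fixations[..., 0] * W).long().clamp(0, W - 1)  # (B, K)
        row = (fixations[..., 1] * H).long().clamp(0, H - 1)  # (B, K)
        idx = row * W + col                                   # (B, K)
        fixated = torch.gather(
            patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
        )                                                     # (B, K, fix_dim)
        fix_tokens = self.fix_proj(fixated)

        return torch.cat([gist_tokens, fix_tokens], dim=1)


# Example shapes: a 16x16 patch grid, 5 fixations, batch of 2.
adapter = GistFixationAdapter()
cond = adapter(
    gist_emb=torch.randn(2, 768),
    patch_tokens=torch.randn(2, 256, 768),
    fixations=torch.rand(2, 5, 2),
    grid_size=(16, 16),
)
print(cond.shape)  # torch.Size([2, 9, 1024])
```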

Metameric scenes are defined as generations from Seen2Scene that elicit incorrect “same” responses from participants (i.e., confusion between the original and the generation). We found that the probability of a generated scene being a metamer was highest (~50%) when the generation was conditioned on the participant’s own fixated visual information, compared with ~45% when it used fixation locations from a different participant viewing the same image and ~40% when it used a different participant’s fixation locations from a different image (a random baseline). We interpret this pattern to mean that blurred peripheral information (mediating a gist representation), although expected to be a dominant factor in whether a generation will be a metamer, is not the only factor affecting metamerism; a generation that captures how a person attended to a scene increases the probability that it will be confused with the original.
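
This operational definition maps onto a straightforward per-condition computation. The sketch below is hypothetical, assuming a trial-level table with `test_image`, `response`, and `condition` columns; it is not the authors’ analysis code.

```python
# Illustrative sketch with assumed column names: a metamer is a trial where the
# test image was a Seen2Scene generation and the participant still responded "same".
import pandas as pd

trials = pd.read_csv("responses.csv")  # hypothetical file, one row per trial

# Keep only trials where the test image was a generation (not the original scene).
gen = trials[trials["test_image"] == "generated"]

# condition: "own_fixations", "other_participant_same_image",
#            "other_participant_different_image"
metamer_rate = (
    gen.assign(metamer=gen["response"] == "same")
       .groupby("condition")["metamer"]
       .mean()
)
print(metamer_rate)
```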

The number of fixations made during viewing produced mixed results, with the greatest increase in metamerism occurring over the first three fixations but little difference between five and ten fixations. We interpret this pattern to suggest that the objects most important to a scene’s semantic structure are fixated early during viewing, producing confusable generations over this range, while later fixations encode less essential information that is not incorporated into the scene representation. These findings matter for machine learning researchers: a generative AI model of human scene understanding would enable interpretable tools for visualizing how a user is understanding their world, with applications ranging from clinical research (e.g., modeling the visual world of people with autism) to education (e.g., visualizing what a student understood after viewing a plot or webpage about a topic).

Our work also informs next-generation AI systems that learn latent representations better aligned with those humans use to understand their visual world, enabling richer and more socially intelligent interactions.

Tagged in Artificial Intelligence, 3D, Computer Vision