Uncovering 'world models': from cognition to neurobiology

When we open our eyes, we do not see a jumble of light or colorful patterns. A great distance separates the raw inputs sensed at our retinas from what we experience as the contents of perception. The goal of our research program is to understand how our minds and brains transform raw sense inputs at our retinas into rich, discrete structures that we can think about, plan with, and manipulate. To address this, we take a primarily computational approach to a distinctly integrative program spanning the cognitive and neural levels. We develop computational theories that synthesize an especially broad technical toolkit including probabilistic programming, causal generative models, nonlinear dynamics and control, approximate Bayesian inference, and artificial neural networks. We test these models in objective, performance-based psychophysical experiments in humans, using computer graphics and computational fabrication for stimulus delivery. We also test these models in neural data from non-human primate experiments via experimental collaborators and in human imaging experiments that we design and execute.

With this multilevel program and multifaceted methodology, the lab has been developing a computational account of the mind and brain’s representations of the physical world with unprecedented depth, extent, and empirical force. We have now triangulated, via models, psychophysics, and neural data, several aspects of these internal representations of reality, from cognition to neurobiology. These include formal specifications of their formats, how they are selectively deployed during perception, how they are implemented in neural populations and neural dynamics, and how they are inferred across the sensory cortex. In many cases, our work has yielded accounts of the mind and brain that run counter to dominant streams in the field, revealing the fundamental nature of the contents of our percepts: structure-preserving, behaviorally efficacious representations of the physical world, including objects with 3D shapes and physical properties, scenes with navigable surfaces, events with temporally demarcated dynamics, and agents with coarse biomechanics and plans.


First few hundred milliseconds: Reverse-engineering the brain’s algorithms for core vision

The visual system must not only recognize and localize objects, but also perform much richer inferences about the causes in the world that underlie sense data. This research thrust aims to uncover the algorithmic basis of how we see so much so quickly, capturing, in concrete engineering terms, something we often take for granted: the breathtakingly fast and complex set of computations that occur in the brain between the moment we open our eyes and the moment a perceptual experience of a rich world appears in our minds.

Understanding human attention: Adaptive computation & goal-conditioned world models

Most scenes we encounter hold complex structure (e.g., in terms of objects, agents, events, places), but our goals render only a slice of this complexity relevant for perception. Why do we see what we see? To answer this, we have been focusing on a new account of attention. Attention is central to human cognition, with decades of research since the cognitive revolution exploring how it continually focuses visual processing in the service of our goals. But how does this work in computational terms? Our goal here is to uncover the computational underpinnings of how attention integrates goals to construct goal-conditioned, structured representations of the world. Addressing this question has been opening up a new generation of formal models that generate testable predictions, at unprecedented empirical depth, across the domains of scene perception, intuitive physics, and planning.
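
To give a concrete flavor of what adaptive, goal-conditioned computation could look like, here is a minimal toy sketch in Python. It is our own illustrative stand-in (the scene representation, relevance filter, and simulator are all hypothetical), not the lab's actual model; it only conveys the idea that a world model can hold many latent object states yet spend simulation effort only on the slice of them that the current goal makes relevant.

```python
import random

# Hypothetical toy: a scene holds many latent object states, but a query
# (the observer's goal) determines which latents are worth resolving.

def sample_scene(n_objects=20):
    """Each object has a coarse latent state; most are irrelevant to any one goal."""
    return {
        f"obj_{i}": {"x": random.uniform(0, 10), "vx": random.uniform(-1, 1)}
        for i in range(n_objects)
    }

def relevant_objects(scene, goal_region):
    """Goal-conditioned selection: only objects near the goal region matter."""
    lo, hi = goal_region
    return {
        name: state for name, state in scene.items()
        if lo - 2.0 <= state["x"] <= hi + 2.0   # crude relevance filter
    }

def simulate(state, steps=10, dt=0.1):
    """Cheap forward simulation, run only for goal-relevant objects."""
    x, vx = state["x"], state["vx"]
    for _ in range(steps):
        x += vx * dt
    return x

scene = sample_scene()
goal = (4.0, 6.0)                      # e.g., "will anything cross this doorway?"
attended = relevant_objects(scene, goal)
predictions = {name: simulate(s) for name, s in attended.items()}
print(f"simulated {len(attended)}/{len(scene)} objects for this goal")
```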

Uncovering neural mechanisms of world models by symbolically programming RNNs

How is it that, through the distributed and dynamic activity in our brain’s neural circuits, we think thoughts about objects, mentally simulate how they will move and react to forces, and plan actions toward them? This research thrust develops new multilevel modeling frameworks that, uniquely, interoperate between cognitive hypotheses (e.g., physical object representations) and neural mechanisms (e.g., distributed codes and attractors). This line of work recently led to evidence for a “mental simulation circuit” in the prefrontal populations of macaques playing the video game Pong.
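
As a toy illustration of what symbolically programming a recurrent network could mean, the sketch below hand-compiles a symbolic rule, a Pong-like ball moving at constant velocity and bouncing off walls, into a two-unit recurrent update. This is a minimal sketch under our own simplifying assumptions, not the published framework; it only shows how a single object-level hypothesis can be read simultaneously as a mental simulation and as low-dimensional population dynamics.

```python
import numpy as np

# Hypothetical sketch: "compile" a symbolic rule (constant-velocity ball that
# bounces off walls at 0 and 1) into a tiny recurrent update. The hidden state
# h = [position, velocity] plays the role of a two-unit neural population.

DT = 0.05
W = np.array([[1.0, DT],     # position <- position + velocity * DT
              [0.0, 1.0]])   # velocity <- velocity

def bounce(h):
    """Hand-coded nonlinearity implementing the symbolic 'reflect at walls' rule."""
    x, v = h
    if x < 0.0:
        x, v = -x, -v
    elif x > 1.0:
        x, v = 2.0 - x, -v
    return np.array([x, v])

def step(h):
    """One tick of the recurrent 'mental simulation circuit'."""
    return bounce(W @ h)

h = np.array([0.2, 1.3])          # start near the left wall, moving right
trajectory = [h.copy()]
for _ in range(100):
    h = step(h)
    trajectory.append(h.copy())

positions = np.array(trajectory)[:, 0]
print(f"ball stays within walls: {positions.min():.2f} to {positions.max():.2f}")
```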

Intuitive physics basis of perception

Many of the objects we encounter in everyday life are soft, from the shirt on your back to the towel on your counter. Most existing computational and neural studies of object perception, however, have focused only on rigid objects such as blocks or tools. This is an important limitation, since only soft objects can change their shape dramatically, as when you toss your shirt aside or fold your towel. This research thrust addresses this gap by exploring the ability of different kinds of models to capture human perception of cloths and liquids. We find that standard models of vision, including performant DNNs, fail to explain human performance. Capturing human performance instead requires a different kind of model, one that integrates intuitive physics, realized as probabilistic simulations of how soft objects move. A recent fMRI study provides converging evidence for this conclusion, together revealing an account of soft object perception that contrasts sharply with currently popular approaches.
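
To make probabilistic simulation concrete, the sketch below is a deliberately simplified stand-in (a 1D mass-spring chain and brute-force parameter search, neither of which is the model used in the studies above): it infers a cloth-like object's stiffness by sampling candidate values, simulating each, and scoring how well the simulated motion matches an observed trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_chain(stiffness, n=6, steps=200, dt=0.01, damping=0.5, gravity=-9.8):
    """Crude 1D mass-spring chain (a stand-in for cloth): top node pinned,
    the rest fall under gravity and are pulled back by springs."""
    y = np.zeros(n)          # vertical positions, pinned node at index 0
    v = np.zeros(n)
    rest = 0.1               # spring rest length
    traj = []
    for _ in range(steps):
        force = np.full(n, gravity)
        for i in range(n - 1):
            f = stiffness * ((y[i] - y[i + 1]) - rest)
            force[i]     -= f
            force[i + 1] += f
        v += dt * (force - damping * v)
        v[0] = 0.0           # keep the top node pinned
        y += dt * v
        traj.append(y[-1])   # track the free end, like watching a cloth corner
    return np.array(traj)

# "Observed" motion: generated with a hidden true stiffness plus sensory noise.
true_stiffness = 40.0
observed = simulate_chain(true_stiffness) + rng.normal(0, 0.01, 200)

# Simulation-based inference: sample candidate stiffnesses, score each by how
# well its simulated trajectory matches the observation.
candidates = rng.uniform(5.0, 80.0, size=50)
scores = [-np.mean((simulate_chain(k) - observed) ** 2) for k in candidates]
best = candidates[int(np.argmax(scores))]
print(f"true stiffness {true_stiffness:.1f}, inferred ~{best:.1f}")
```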

The interface between seeing and remembering

The spontaneous processing of visual information plays a significant role in shaping memory, sometimes even overshadowing voluntary efforts to encode specific details. What are the neurocomputational mechanisms that underlie the transformation of percepts to memories in the brain? This research addresses this question using computational models, behavioral experiments, and analyses of neural data. We posit that the interface of perception and memory is governed by an adaptive mechanism called depth-of-processing, which is set on the fly, image by image, by a simple computational signature: the compression-based reconstruction error of an image. We find that images whose visual representations are harder to reconstruct leave stronger memory traces; we also find that the same signature of depth-of-processing predicts the activity of single neurons in the human hippocampus and amygdala.
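
As a rough illustration of such a signature (a hypothetical sketch; the PCA compressor and synthetic images below are stand-ins, not the model used in our studies): compress each image with a low-dimensional code, reconstruct it, and treat the per-image reconstruction error as the depth-of-processing signal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: use a low-dimensional linear compressor (PCA) as a
# stand-in for compression, and treat per-image reconstruction error as the
# depth-of-processing signal that predicts memorability.

def fit_compressor(images, k=20):
    """Fit a rank-k PCA basis to a corpus of flattened images."""
    mean = images.mean(axis=0)
    _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, vt[:k]            # mean image + top-k components

def reconstruction_error(image, mean, basis):
    """Compress to k coefficients, reconstruct, and measure what was lost."""
    coeffs = basis @ (image - mean)
    recon = mean + basis.T @ coeffs
    return float(np.mean((image - recon) ** 2))

# Synthetic stand-in corpus: 500 "images" of 32x32 pixels, flattened.
corpus = rng.normal(size=(500, 32 * 32))
mean, basis = fit_compressor(corpus)

# Score new images: higher error -> deeper processing -> stronger predicted memory.
new_images = rng.normal(size=(5, 32 * 32))
errors = [reconstruction_error(img, mean, basis) for img in new_images]
ranked = np.argsort(errors)[::-1]
print("predicted memorability ranking (hardest to reconstruct first):", ranked)
```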