Elias B. Issa
Zoom link emailed to symposium registrants
Deep neural networks (DNNs) optimized for real-world object classification are the leading models for predicting neural responses in inferior temporal cortex (IT). However, further optimizing DNNs for classification accuracy yields saturating gains in IT neural predictivity, potentially because pure performance optimization favors representations that explicitly encode object class at the expense of other sources of image-by-image variance. Here, we performed a meta-analysis of current DNNs (n=47 models) to identify additional representational properties underlying neural predictivity beyond ImageNet classification accuracy. We show that representations separating scene identity from (1) viewpoint transforms (different crops of an image or camera position in a natural movie) or (2) appearance transforms (lighting and color variation) were highly predictive of which models best fit high-level visual cortex across all four datasets tested: two neural datasets in monkeys and two human fMRI datasets.
Intriguingly, factorizing scene viewpoint into a subspace separate from scene identity information (as opposed to invariance to viewpoint) was predictive of neural fits, as was complete invariance to a scene’s appearance. Furthermore, factorizing a foreground object’s motion alone was not nearly as predictive as factorizing camera motion, that is, the joint motion of all objects in a scene. These metrics retained moderate predictive power even after accounting for ImageNet classification accuracy. Thus, overall scene coding strategies, namely camera viewpoint factorization and invariance to appearance, may provide new insight into the goals of high-level visual cortex.
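
The distinction between factorization and invariance can be made concrete with a small numerical sketch. The abstract does not specify how either property was measured, so everything below is an illustrative assumption rather than the authors' method: the function names (`identity_subspace`, `factorization_score`, `invariance_score`), the use of principal components of per-scene mean responses as the "identity subspace", and the 90% variance threshold are all hypothetical choices. Under these assumptions, a factorization score near 1 means transform-induced variance lies outside the identity subspace, while an invariance score near 1 means the transform induces almost no variance at all.

```python
import numpy as np


def identity_subspace(responses, var_explained=0.9):
    """Orthonormal basis (k, n_features) spanning most between-scene variance.

    responses: array of shape (n_scenes, n_transforms, n_features).
    The 0.9 threshold is an arbitrary illustrative choice.
    """
    scene_means = responses.mean(axis=1)
    centered = scene_means - scene_means.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(ratios, var_explained)) + 1, len(s))
    return vt[:k]


def factorization_score(responses, var_explained=0.9):
    """1 minus the fraction of transform-induced variance that falls
    inside the identity subspace (assumed definition)."""
    basis = identity_subspace(responses, var_explained)
    # Deviations of each transformed view from its scene's mean response.
    deviations = responses - responses.mean(axis=1, keepdims=True)
    flat = deviations.reshape(-1, responses.shape[-1])
    total = np.sum(flat**2)
    if total == 0:  # no transform-induced variance: score is undefined
        return float("nan")
    inside = np.sum((flat @ basis.T) ** 2)
    return float(1.0 - inside / total)


def invariance_score(responses):
    """1 minus the fraction of all variance that is transform-induced
    (assumed definition)."""
    flat = responses.reshape(-1, responses.shape[-1])
    total = np.sum((flat - flat.mean(axis=0)) ** 2)
    deviations = responses - responses.mean(axis=1, keepdims=True)
    return float(1.0 - np.sum(deviations**2) / total)


# Synthetic toy data (not from any model or neural dataset).
rng = np.random.default_rng(0)
identity = 3.0 * rng.normal(size=(50, 1, 2))   # one code per scene
viewpoint = rng.normal(size=(50, 20, 2))       # varies across transforms

factorized = np.zeros((50, 20, 10))
factorized[..., 0:2] = identity                # identity in dims 0-1
factorized[..., 2:4] = viewpoint               # viewpoint in dims 2-3

entangled = np.zeros((50, 20, 10))
entangled[..., 0:2] = identity + viewpoint     # both share dims 0-1

print("factorized:", factorization_score(factorized), invariance_score(factorized))
print("entangled: ", factorization_score(entangled), invariance_score(entangled))
```

In this toy setup both representations carry the same amount of viewpoint variance, so their invariance scores are similar; only the factorization score separates them, which is the sense in which a representation can be factorized without being invariant.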