Learning from infant perspective: slow changes for development of a visual system in humans and machines
Head-mounted infant videos reveal a natural slow-to-fast developmental curriculum, and training in that order improves later visual recognition.
Vision-language models (VLMs) are increasingly used as judges across a wide range of applications. An open question is how closely their judgments align with human judgments and where they diverge. In collaboration with Dr. Robert Nosofsky and Dr. Craig Sanders, we showed that VLMs and humans use geometrically similar perceptual spaces. Specifically, we showed that VLM-generated feature ratings correlate with human perceptual judgments, especially for common perceptual dimensions, while also revealing systematic differences in how models use rating scales. We then extended this analysis from individual ratings to representational geometry and demonstrated that the VLM-derived similarity structure aligns significantly with human perceptual space.
To compare human and VLM feature ratings, we asked human participants and VLMs to rate a set of rocks along a set of perceptual features. To obtain comparable ratings, we matched the rock images, perceptual features, and instructions (prompts).

We compared several models, including GPT-4o, and found that their ratings correlate with human judgments, but the strength of the correlation and the models' use of the rating scale varied across dimensions.
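
The per-dimension comparison described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name, the toy ratings, the 1-9 scale, and the choice of Pearson correlation are all assumptions made for the example. Alongside the correlation it reports the standard deviation of each rater's scores, one simple way to see differences in rating-scale use.

```python
import numpy as np

def per_dimension_agreement(human, model, dims):
    """Correlate human and model ratings separately for each dimension.

    human, model: arrays of shape (n_stimuli, n_dims), same stimulus order.
    Returns {dimension: {"r", "human_sd", "model_sd"}}.
    """
    human = np.asarray(human, dtype=float)
    model = np.asarray(model, dtype=float)
    if human.shape != model.shape:
        raise ValueError("human and model ratings must have the same shape")
    out = {}
    for j, name in enumerate(dims):
        # Pearson r is an assumed choice; a rank correlation would also work.
        r = np.corrcoef(human[:, j], model[:, j])[0, 1]
        out[name] = {
            "r": float(r),
            # Spread of the ratings: a smaller value means a compressed scale.
            "human_sd": float(human[:, j].std(ddof=1)),
            "model_sd": float(model[:, j].std(ddof=1)),
        }
    return out

# Made-up toy data: 5 rocks rated on 2 dimensions (hypothetical 1-9 scale).
human = np.array([[1, 8], [3, 6], [5, 5], [7, 3], [9, 2]])
# The toy "model" tracks the human ordering but hugs the scale midpoint.
model = np.array([[4.0, 6], [4.5, 6], [5.0, 5], [5.5, 5], [6.0, 4]])

result = per_dimension_agreement(human, model, ["lightness", "shininess"])
for name, stats in result.items():
    print(name, stats)
```

In this toy case the correlation is high on both dimensions even though the model's ratings span a much narrower range, which is why correlation strength and scale use are worth reporting separately.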

We then moved from single-attribute ratings to the global geometry of perceptual space. We asked humans and models to judge the visual similarity of pairs of rocks, creating a large matrix of pairwise relationships over a shared stimulus set. We then investigated whether those pairwise judgments imply the same low-dimensional structure for humans and models.
We again used a similar prompt structure across models and humans.

We compared different types of models, including those trained with a contrastive loss (e.g., VLMs and CLIP) and those trained with a classification loss. We applied multidimensional scaling (MDS) to the similarity judgments and compared the aligned representational spaces.

In general, the similarity judgments of large models such as GPT-4o, Gemma-3, and Llama-4 correlated highly with human similarity judgments.
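
One common way to compute such a correlation between two sets of pairwise judgments is to compare only the unique off-diagonal entries of the two similarity matrices. The sketch below assumes symmetric matrices over the same stimuli in the same order; the function name, the toy matrices, and the use of Pearson correlation are illustrative assumptions, not details from the study.

```python
import numpy as np

def similarity_matrix_correlation(sim_a, sim_b):
    """Pearson correlation between the upper triangles of two similarity matrices."""
    sim_a = np.asarray(sim_a, dtype=float)
    sim_b = np.asarray(sim_b, dtype=float)
    if sim_a.shape != sim_b.shape or sim_a.shape[0] != sim_a.shape[1]:
        raise ValueError("expected two square matrices of the same shape")
    # k=1 skips the diagonal (self-similarity) and counts each pair once.
    iu = np.triu_indices(sim_a.shape[0], k=1)
    return float(np.corrcoef(sim_a[iu], sim_b[iu])[0, 1])

# Made-up toy data: 4 stimuli forming two visually similar pairs.
human_sim = np.array([
    [9, 7, 2, 1],
    [7, 9, 3, 2],
    [2, 3, 9, 8],
    [1, 2, 8, 9],
], dtype=float)
# A toy "model" that agrees with humans up to a linear rescaling.
model_sim = 0.5 * human_sim + 2.0

r = similarity_matrix_correlation(human_sim, model_sim)
print(f"human-model similarity correlation: {r:.3f}")
```

Restricting the comparison to the upper triangle avoids inflating the correlation with the trivially high diagonal entries and with duplicated symmetric pairs.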


After MDS and Procrustes alignment, we observed a strong correlation between large models and humans along some feature dimensions, especially lightness and shininess.
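
The MDS-then-Procrustes step can be illustrated with a small NumPy sketch. It uses classical (Torgerson) MDS and an orthogonal Procrustes fit with scaling and translation; the specific MDS variant used in the study is not stated here, so this is an assumed stand-in. The function names and toy data are hypothetical, and real similarity ratings would first need to be converted to dissimilarities (for example, by subtracting them from the maximum rating).

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues that arise from noise or rounding.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def procrustes_align(target, source):
    """Rotate/reflect, uniformly scale, and translate source onto target."""
    target = np.asarray(target, dtype=float)
    source = np.asarray(source, dtype=float)
    mu_t, mu_s = target.mean(axis=0), source.mean(axis=0)
    t0, s0 = target - mu_t, source - mu_s
    # Orthogonal Procrustes solution from the SVD of the cross-covariance.
    u, sv, vt = np.linalg.svd(s0.T @ t0)
    rotation = u @ vt
    scale = sv.sum() / (s0 ** 2).sum()
    return scale * s0 @ rotation + mu_t

# Toy check: the "model" space is a rotated, scaled, shifted copy of the
# "human" space, so alignment should recover the human coordinates.
rng = np.random.default_rng(0)
human_xy = rng.normal(size=(6, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
model_xy = 2.0 * human_xy @ rot + 1.0
model_dist = np.linalg.norm(model_xy[:, None] - model_xy[None, :], axis=-1)

model_mds = classical_mds(model_dist, n_components=2)
aligned = procrustes_align(human_xy, model_mds)
max_err = float(np.abs(aligned - human_xy).max())
print(f"max alignment error: {max_err:.2e}")
```

Because an MDS solution is only defined up to rotation, reflection, and translation, per-dimension comparisons between two spaces are meaningful only after an alignment of this kind.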

Finally, when used in a human categorization model, the features derived from large foundation models' similarity judgments yielded better predictions of human categorization performance than those derived from humans. This is most likely because the model data was less noisy, but it indicates strong potential for using model-derived features to model human memory and learning.
To further align VLMs with human perception, in ongoing work we are fine-tuning the models by distilling distributions of human feature ratings: for a given image and feature, we minimize the KL divergence between the VLM's token probabilities over the possible ratings and the distribution of ratings across a population of human participants.
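
The distillation objective can be written down compactly. The sketch below computes the loss for one or more (image, feature) pairs from human rating counts and model logits over the rating tokens. The direction KL(human || model), the five-point scale, and all names are assumptions for illustration; in actual fine-tuning the same quantity would be computed inside an autodiff framework so that gradients can flow to the model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rating_kl(human_counts, model_logits, eps=1e-12):
    """Mean KL(human || model) over rating options.

    human_counts: (n_items, n_ratings) counts of participants per rating.
    model_logits: (n_items, n_ratings) model logits for the rating tokens.
    """
    counts = np.atleast_2d(np.asarray(human_counts, dtype=float))
    logits = np.atleast_2d(np.asarray(model_logits, dtype=float))
    p = counts / counts.sum(axis=-1, keepdims=True)  # human distribution
    q = softmax(logits)                              # model distribution
    # eps guards against log(0) when a rating was never chosen.
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# Made-up example: 10 participants rate one image on a hypothetical 1-5 scale.
counts = np.array([1, 2, 4, 2, 1])

# A model whose token probabilities match the human distribution exactly.
kl_match = rating_kl(counts, np.log(counts))
# A model that spreads probability uniformly over the five ratings.
kl_uniform = rating_kl(counts, np.zeros(5))
print(f"matched: {kl_match:.4f}, uniform: {kl_uniform:.4f}")
```

Matching the full distribution, rather than only the mean rating, lets the model capture how much human participants disagree about a given image and feature.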