ModelVsBaby: Object recognition in infants and machines
ModelVsBaby compares toddlers and models on realistic, silhouette, geon, blurred, and feature-only views of common objects.
Early in life, visual input changes slowly and then becomes progressively richer as posture, reaching, and locomotion open new views of the world. We use head-mounted infant video to ask how that natural ordering shapes visual learning in both humans and machines.
Across the first year of life, infants move from long stretches of relatively stable input to increasingly object-centered, action-rich scenes. That developmental progression provides both a descriptive window into early experience and a candidate learning curriculum for artificial systems.
In collaboration with Dr. Linda Smith and Dr. Justin Wood, we measured visual experience directly by analyzing egocentric recordings from infants wearing lightweight head cameras. The recordings show a clear developmental progression. In the earliest months, scenes change slowly and often remain dominated by faces, ceilings, and other large stable structures. Later in infancy, reaching, sitting, crawling, and manual exploration bring more rapid object-centered changes and more varied visual episodes. If the earliest input is especially slow and stable, it may provide an unusually useful scaffold for learning invariant structure before the visual world becomes faster and more complex.
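One way to make "scenes change slowly" concrete is a simple frame-to-frame change score. The sketch below is an illustrative proxy, not the measure used in this project: it assumes each frame is a flat list of grayscale pixel values, and the function name `mean_frame_change` is hypothetical. A semantic measure (for example, distances between learned embeddings of consecutive frames) would follow the same pattern with embeddings in place of pixels.

```python
from typing import Sequence


def mean_frame_change(frames: Sequence[Sequence[float]]) -> float:
    """Average absolute per-pixel difference between consecutive frames.

    Assumption: each frame is a flat sequence of grayscale values and all
    frames have the same length. Lower scores mean slower-changing input.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("frames must have the same number of pixels")
        # Mean absolute difference for this pair of consecutive frames.
        total += sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    # Average over all consecutive pairs.
    return total / (len(frames) - 1)


# Toy clips (made-up values): one nearly static, one changing sharply.
slow_clip = [[0, 0], [0, 1], [0, 1]]
fast_clip = [[0, 0], [10, 10], [0, 0]]
```

Computed per recording and grouped by age, a score like this would let the slow-to-fast progression described above be compared across developmental stages.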

We then organized the videos into developmental stages and used them to pretrain self-supervised visual models stage by stage. The central question was whether the order of infant experience matters. We compared developmental training, which proceeds from earlier to later stages, with alternative curricula that reverse or disrupt that natural sequence.
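The curriculum comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's training code: the stage names, the use of a random shuffle as the "disrupted" condition, and the `train_stage` callback are all hypothetical placeholders.

```python
import random
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple


def curriculum_orders(stages: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Build three stage orderings from a list given earliest-first."""
    developmental = list(stages)
    reversed_order = developmental[::-1]
    shuffled = developmental[:]
    rng = random.Random(seed)
    # Assumption: "disrupted" is modeled as a random permutation that
    # differs from the developmental order (needs two distinct stages).
    if len(set(shuffled)) > 1:
        while shuffled == developmental:
            rng.shuffle(shuffled)
    return {
        "developmental": developmental,
        "reversed": reversed_order,
        "shuffled": shuffled,
    }


def pretrain_in_order(
    order: Sequence[str],
    train_stage: Callable[[Any, str], Any],
    model: Any = None,
) -> Iterator[Tuple[str, Any]]:
    """Train stage by stage, yielding the model after each stage."""
    for stage in order:
        # train_stage is a hypothetical callback that continues
        # self-supervised pretraining on one stage's videos.
        model = train_stage(model, stage)
        yield stage, model


# Hypothetical stage labels, earliest first.
orders = curriculum_orders(["early", "middle", "late"])
```

Because `pretrain_in_order` yields after every stage, each curriculum produces one checkpoint per stage, so the same evaluations can be run at matching points across orderings.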
After each pretraining stage, we evaluated the learned representations on three downstream settings: non-egocentric action recognition, egocentric action recognition, and domain-related object recognition. This lets us test whether developmental ordering helps both near-domain and more general visual tasks.

These results suggest that a useful inductive bias for building more human-like visual systems concerns not only what data to train on but also when the model sees it. Early infant experience may provide a naturally gentle curriculum: semantic structure changes slowly enough to support stable self-supervised learning, then gradually expands into richer object- and action-centered views.