
This project aims to develop self-supervised and unsupervised methods that learn actions directly from large-scale video data. Unlike current approaches, which rely on heavy supervision or predefined latent structures, we aim to discover an implicit latent action space that generalizes more broadly. Such a representation can transfer across domains and support robotics, modeling of human motion and interactions, customizable video generation, and planning via world models.
Specifically, the project focuses on learning atomic-level actions from human demonstration videos captured from different camera viewpoints and in diverse scene settings. We then demonstrate that the learnt representation can be used effectively for short- and long-sequence action retrieval, action classification, and VLM-based robotic policy pretraining.