Learning Human Interaction from Videos

This project aims to develop self-supervised and unsupervised methods that learn actions directly from large-scale videos. Unlike current approaches, which rely on heavy supervision or predefined latent structures, we aim to discover an implicit latent action space that generalizes more broadly. Such a representation can transfer across domains and support robotics, human motion and interaction modeling, customizable video generation, and planning tasks via world models.

Project Team

Prof. Siyu Tang
ETH Zurich
Prof. Bernt Schiele
Max Planck Institute
Dr. Thabo Beeler
Google
Dr. Vassilis Choutas
Google
Dr. Korrawe Karunratanakul
ETH Zurich
Dr. Jan Eric Lenssen
Max Planck Institute
Bahri Batuhan Bilecen
ETH Zurich

Publications

InvAct: View and Scene-invariant Atomic Action Learning from Videos

Published:
submitted to NeurIPS, 2026

This work focuses on learning atomic-level actions from human demonstration videos across different camera viewpoints and diverse scene settings. We then demonstrate that the learnt representation can be used effectively for short- and long-sequence action retrieval, action classification, and VLM-based robotic policy pretraining.