Learning Human Interaction from Videos

This project aims to develop self-supervised and unsupervised methods to learn actions directly from large-scale videos. Unlike current approaches that rely on heavy supervision or predefined latent structures, our goal is to discover an implicit latent action space with broader generalization. Such a representation can transfer across domains and support robotics, human motion and interactions, customizable video generation, and planning tasks via world models.

Project Team

Prof. Siyu Tang

ETH Zurich

Prof. Bernt Schiele

Max Planck Institute

Dr. Thabo Beeler

Google

Dr. Vassilis Choutas

Google

Dr. Korrawe Karunratanakul

ETH Zurich

Dr. Jan Eric Lenssen

Max Planck Institute

Bahri Batuhan Bilecen

ETH Zurich

Publications

InvAct: View and Scene-invariant Atomic Action Learning from Videos

Published:

submitted to NeurIPS, 2026

This project learns atomic-level actions from human demonstration videos across different camera viewpoints and diverse scene settings. We then demonstrate that the learnt representation can be used effectively for short- and long-sequence action retrieval, action classification, and VLM-based robotic policy pretraining.
