We compare our low-level motion generator with the baseline (CNet+GRIP). While the baseline fails to produce precise hand and finger motions, our method generates accurate hand-object interactions.
Intelligent agents must autonomously interact with their environments to perform daily tasks based on human-level instructions. They need a foundational understanding of the world to accurately interpret these instructions, along with precise low-level movement and interaction skills to execute the derived actions. In this work, we propose the first complete system for synthesizing physically plausible, long-horizon human-object interactions for object manipulation in contextual environments, driven by human-level instructions. We leverage large language models (LLMs) to translate the input instructions into detailed execution plans. Unlike prior work, our system generates detailed finger-object interactions in seamless coordination with full-body movements. We also train a policy via reinforcement learning (RL) to track the generated motions in physics simulation, ensuring their physical plausibility. Our experiments demonstrate the effectiveness of our system in synthesizing realistic interactions with diverse objects in complex environments, highlighting its potential for real-world applications.
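To make the LLM-based interpretation step concrete, the following is a minimal sketch of how a human-level instruction could be converted into a structured execution plan. The prompt format, the query_llm stub, and the plan schema are assumptions for illustration, not the system's actual interface.

```python
# Hypothetical sketch: turning a human-level instruction into an execution plan
# with an LLM. The prompt, the query_llm stub, and the PlanStep schema are
# illustrative assumptions, not the paper's implementation.
import json
from dataclasses import dataclass


@dataclass
class PlanStep:
    action: str           # e.g. "walk_to", "grasp", "place"
    target_object: str    # object name from the scene description
    target_location: str  # symbolic location in the scene map


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned JSON plan for illustration."""
    return json.dumps([
        {"action": "walk_to", "target_object": "mug", "target_location": "kitchen_table"},
        {"action": "grasp", "target_object": "mug", "target_location": "kitchen_table"},
        {"action": "place", "target_object": "mug", "target_location": "sink"},
    ])


def interpret_instruction(instruction: str, scene_description: str) -> list[PlanStep]:
    # Ask the LLM for a machine-readable plan and parse it into typed steps.
    prompt = (
        "Scene:\n" + scene_description + "\n"
        "Instruction: " + instruction + "\n"
        "Return a JSON list of steps with fields action, target_object, target_location."
    )
    steps = json.loads(query_llm(prompt))
    return [PlanStep(**s) for s in steps]


if __name__ == "__main__":
    plan = interpret_instruction("Put the mug in the sink.", "kitchen_table: mug; sink: empty")
    for step in plan:
        print(step)
```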
Our system takes the scene description and human-level instruction as input and uses a high-level planner to obtain the scene map and a detailed execution plan. The low-level motion generator then produces synchronized object motion, full-body human motion, and finger motion. Finally, the physics tracker uses RL to track the generated motion and output a physically plausible result.
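The sketch below outlines how the three stages described above could be composed into a single pipeline. The module names (HighLevelPlanner, MotionGenerator, PhysicsTracker) and their interfaces are illustrative assumptions; the stubs only indicate what each stage consumes and produces.

```python
# Hypothetical sketch of the three-stage pipeline: high-level planning,
# low-level motion generation, and physics-based tracking. Interfaces and
# class names are assumptions, not the system's actual API.
from dataclasses import dataclass


@dataclass
class Motion:
    object_traj: list  # per-frame object poses
    body_traj: list    # per-frame full-body poses
    finger_traj: list  # per-frame finger joint angles


class HighLevelPlanner:
    def plan(self, scene_description: str, instruction: str):
        """Query an LLM to produce a scene map and a step-by-step execution plan."""
        scene_map = {"mug": "kitchen_table", "sink": "counter"}  # placeholder
        execution_plan = ["walk_to mug", "grasp mug", "place mug in sink"]
        return scene_map, execution_plan


class MotionGenerator:
    def generate(self, scene_map, execution_plan) -> Motion:
        """Generate synchronized object, full-body, and finger motion (stubbed)."""
        return Motion(object_traj=[], body_traj=[], finger_traj=[])


class PhysicsTracker:
    def track(self, motion: Motion) -> Motion:
        """Track the kinematic motion with an RL policy in simulation (stubbed)."""
        return motion  # would return the physically simulated motion


def run_pipeline(scene_description: str, instruction: str) -> Motion:
    scene_map, plan = HighLevelPlanner().plan(scene_description, instruction)
    kinematic_motion = MotionGenerator().generate(scene_map, plan)
    return PhysicsTracker().track(kinematic_motion)
```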