OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation
OKAMI enables a humanoid robot to imitate manipulation skills from a single human video demonstration.
Tasks
Sprinkle-salt
Plush-toy-in-basket
Place-snacks-on-plate
Close-the-laptop
Close-the-drawer
Bagging
Robot Rollout
Our systematic generalization evaluation covers visual backgrounds, camera angles, spatial layouts, and new object instances. Note that camera-angle generalization is inherently entailed in our pipeline, since the camera extrinsics of the video demonstration differ from those during rollout.
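To make the camera-angle point concrete, here is a minimal, hypothetical sketch (assuming a standard pose pipeline, not OKAMI's actual code) of using the camera extrinsics to express a detected object pose in the robot base frame. Once poses live in a camera-independent frame, the downstream plan does not depend on where the camera was placed. The names `T_base_cam` and `T_cam_obj` and all numeric values are assumptions for illustration.

```python
import numpy as np

def to_base_frame(T_base_cam: np.ndarray, T_cam_obj: np.ndarray) -> np.ndarray:
    """Express an object pose (4x4 homogeneous transform) detected in the
    camera frame in the robot base frame via the camera extrinsics.

    Hypothetical helper: once poses are in the base frame, the same
    manipulation plan works regardless of the camera placement."""
    return T_base_cam @ T_cam_obj

# Example extrinsics (assumed): camera mounted 1 m above the base,
# looking along the base x-axis.
T_base_cam = np.array([
    [ 0.0,  0.0, 1.0, 0.0],
    [-1.0,  0.0, 0.0, 0.0],
    [ 0.0, -1.0, 0.0, 1.0],
    [ 0.0,  0.0, 0.0, 1.0],
])
T_cam_obj = np.eye(4)
T_cam_obj[:3, 3] = [0.1, -0.2, 0.8]  # object 0.8 m in front of the camera

print(to_base_frame(T_base_cam, T_cam_obj)[:3, 3])  # object position in base frame
```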
Generalization: Visual Backgrounds
Human Video
Deploy on a different table with a green tablecloth
Deploy near a cabinet
Human Video
Deploy on a kitchen table
Deploy on a kitchen table
Human Video
Deploy near the water sink in the kitchen
Deploy on a kitchen table
Generalization: Spatial Layouts
Human Video
Human Video
Human Video
Deploy when objects are at different heights
Generalization: New Object Instances
Human Video
Human Video
Robust to Different Demonstrators
OKAMI allows the humanoid robot to imitate videos recorded by different users, even when their ways of completing a task differ. This also shows that our method is robust across demonstrators of different demographics.
Human Video: Close Laptop Using Left Hand
Robot Rollout Video
Human Video: Close Laptop Using Right Hand
Robot Rollout Video
Failure Modes
OKAMI's policies may fail to grasp objects due to inaccuracies in the controllers and the human reconstruction model, or fail to complete tasks because of unwanted collisions, undesired upper-body rotations, or inaccuracies in solving inverse kinematics (see the sketch after the examples below).
Failed to grasp the bottle due to inaccurate reconstruction of the wrist pose
Failed to complete the task due to inaccurate inverse kinematics results
Failed to complete the task due to an unwanted collision between the index finger and the drawer, and undesired body rotation
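As one illustration of the last failure source, below is a minimal, hypothetical sketch (not OKAMI's actual solver) of damped least-squares inverse kinematics on a planar two-link arm. The damping term that keeps the solve stable near singularities also biases the solution, leaving a small residual end-effector error, which is enough to make a grasp miss. Link lengths, the damping value, and the target are assumed.

```python
import numpy as np

def fk(q, l1=0.3, l2=0.3):
    """Forward kinematics of a planar 2-link arm (link lengths assumed)."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q, l1=0.3, l2=0.3):
    """End-effector Jacobian of the same arm."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def dls_ik(target, q0, damping=0.05, iters=100):
    """Damped least-squares IK: dq = J^T (J J^T + lambda^2 I)^{-1} err.
    Damping stabilizes the solve near singularities but leaves a
    residual error at the end effector."""
    q = q0.copy()
    for _ in range(iters):
        err = target - fk(q)
        J = jacobian(q)
        q += J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
    return q, np.linalg.norm(target - fk(q))

# Target near full extension (max reach 0.6 m), where damping bias is largest.
q, residual = dls_ik(np.array([0.55, 0.1]), np.array([0.3, 0.3]))
print(f"residual end-effector error: {residual:.4f} m")
```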