LMM-3DP: Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation




Abstract

The recent advances in the visual reasoning capabilities of large multimodal models (LMMs) and the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies built on 3D feature fields. In this work, we introduce LMM-3DP, a framework that integrates LMM planners and 3D skill Policies. Our approach addresses three key aspects: high-level planning, low-level control, and their effective integration. For high-level planning, LMM-3DP supports dynamic scene understanding under environment disturbances, a critic agent with self-feedback, history policy memorization, and reattempts after failures. For low-level control, LMM-3DP uses a semantic-aware 3D feature field for accurate manipulation. To align high-level plans with low-level robot actions, language embeddings representing the high-level policy are attended jointly with the 3D feature field in a 3D transformer for seamless integration. We extensively evaluate our approach across multiple skills and long-horizon tasks in a real-world kitchen environment. Our results show a significant 1.45x increase in low-level control success rate and an approximately 1.85x improvement in high-level planning accuracy compared to LLM-based baselines.

Figure: Overview of low-level 3D policy.

The visual input consists of point clouds processed by a PointNeXt encoder, whose features are then fused with 3D feature fields through a lightweight MLP to learn a semantic-aware 3D representation. LMM-3DP samples a set of query points that attend to the learned 3D representation and the language embeddings through a 3D transformer to predict keyframe actions. A learnable query token is introduced to efficiently localize the next-best point position.
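
Below is a minimal PyTorch sketch of this pipeline, for illustration only: the class name, the linear stand-in for the PointNeXt backbone, and the 8-dimensional action head are assumptions, not the released implementation.

import torch
import torch.nn as nn

class LowLevel3DPolicy(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_queries=128):
        super().__init__()
        # Stand-in for the PointNeXt encoder (a real backbone would replace this).
        self.point_encoder = nn.Linear(3, d_model)
        # Lightweight MLP fusing encoder features with 3D feature-field features.
        self.fuse_mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, n_layers)
        # Sampled query points plus one learnable token for localizing the
        # next-best point / keyframe action.
        self.query_embed = nn.Embedding(n_queries, d_model)
        self.action_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.action_head = nn.Linear(d_model, 8)  # e.g. 3-D position + quaternion + gripper

    def forward(self, points, field_feats, lang_embed):
        # points: (B, N, 3); field_feats: (B, N, d_model); lang_embed: (B, L, d_model)
        fused = self.fuse_mlp(torch.cat([self.point_encoder(points), field_feats], dim=-1))
        context = torch.cat([fused, lang_embed], dim=1)  # joint 3D + language context
        B = points.shape[0]
        queries = torch.cat(
            [self.action_token.expand(B, -1, -1),
             self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)], dim=1)
        out = self.transformer(queries, context)  # queries attend over 3D + language
        return self.action_head(out[:, 0])        # keyframe action from the learnable token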

Results

Long-Horizon Tasks (>= 3 Skills)

LMM-3DP can perform long-horizon tasks that require different skills. Below are examples where LMM-3DP composes three or more skills to complete the task. Note that whenever a camera sticker appears in the frame, an image is captured and sent to GPT-4V to update the current plan.

Side by Side Videos
Place the pan in the sink + turn the faucet.
Place the milk in the pan + place the pan on the stove.
Put the duck in the drawer + close two drawer doors.
Place two fruits in the sink in sequence.

Retry Failed Attempts

With visual feedback and the critic agent, LMM-3DP can update the proposed plan during the robot's execution and retry previously failed attempts until it succeeds. In the following experiments, we added random noise to the end-effector actions to increase the chance of failure. As before, a camera sticker in the frame indicates that an image is captured and sent to GPT-4V to update the current plan.
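
The loop below is a minimal sketch of this plan-execute-critique cycle. The helpers (query_gpt4v, critic_gpt4v, execute_skill, capture_image) are hypothetical placeholders; the actual prompts and interfaces in LMM-3DP may differ.

def run_with_feedback(task):
    history = []                                        # memorized subgoals and verdicts
    plan = query_gpt4v(task, capture_image(), history)  # initial plan: a list of subgoals
    while plan:
        subgoal = plan[0]
        execute_skill(subgoal)                  # low-level 3D policy executes one skill
        image = capture_image()                 # the "camera sticker" moment in the videos
        verdict = critic_gpt4v(subgoal, image)  # critic agent judges success or failure
        history.append((subgoal, verdict))
        if verdict == "success":
            plan.pop(0)                         # move on to the next subgoal
        else:
            # Failure or a changed environment: re-plan from the current image,
            # conditioned on history; the new plan typically retries the subgoal.
            plan = query_gpt4v(task, image, history)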

Side by Side Videos
Retry on closing the drawer.
Retry on turning faucet.
Retry on closing the top cabinet door.
Retry on opening the top cabinet door.

Human Intervention

With visual feedback and the critic agent, LMM-3DP can also update the proposed plan when a human intervention changes the environment during execution. As before, a camera sticker in the frame indicates that an image is captured and sent to GPT-4V to update the current plan.

Side by Side Videos
Someone puts an additional starfruit on the table.
Someone helps close a drawer door.
Someone places the pineapple back on the table to trick the robot.
Someone places more objects on the table.

Skill Composition

To compose different skills in sequence, it is essential that the skills are disentangled, so that executing one skill does not affect subsequent skills. Our framework generalizes to novel long-horizon tasks through the composition of different skills: for instance, after training on putting object A at location B and putting object C at location D, the model should also be capable of putting C at B. We randomly combine two locations (drawer and sink) with five objects (milk, starfruit, pineapple, duck, and pan); a toy sketch of this enumeration follows the videos below.

Side by Side Videos
Milk to drawer.
Milk to sink.
Starfruit to drawer.
Starfruit to sink.
Pineapple to drawer.
Pineapple to sink.
Duck to drawer.
Duck to sink.
Pan to drawer.
Pan to sink.
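
The combinations above can be enumerated programmatically. The toy sketch below reuses the hypothetical execute_skill helper from the earlier loop; the two-step instruction decomposition is also an assumption.

from itertools import product

objects = ["milk", "starfruit", "pineapple", "duck", "pan"]
locations = ["drawer", "sink"]

for obj, loc in product(objects, locations):
    # The same language-conditioned 3D policy handles each instruction, even
    # for (object, location) pairs never seen together during training.
    execute_skill(f"pick up the {obj}")
    execute_skill(f"place the {obj} in the {loc}")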

Citation

@inproceedings{TODO,
  title={Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation},
  author={Yuelei Li and Ge Yan and Annabella Macaluso and Mazeyu Ji and Xueyan Zou and Xiaolong Wang},
  year={2024},
  booktitle={TODO}
}