Researchers now use pre-trained generative AI models to teach robots to perform tasks by following a set of rules, or “policy.” These models are powerful because they can handle multiple complex tasks.
During training, the models see only feasible robot actions, so they learn to generate valid movement paths, or trajectories. However, the trajectories a trained model produces don’t always match what users need in real-world situations.
Fixing these issues often means collecting new data for the task and retraining the model, a process that is costly, slow, and requires advanced machine-learning expertise.
Imagine if you could instead step in and correct the robot’s action in real time with just a simple interaction.
This scenario could become a reality thanks to a groundbreaking framework developed by MIT and NVIDIA researchers. Their new framework could make robots more adaptable and user-friendly, allowing people to correct a robot’s behavior with simple interactions in real time.
This technique eliminates the need for collecting new data and retraining the robot’s machine-learning model. Instead, it allows the robot to respond to intuitive, real-time human feedback, selecting an action sequence that closely aligns with the user’s intent.
In testing, the framework’s success rate was 21 percent higher than that of an alternative method.
Felix Yanwei Wang, an electrical engineering and computer science (EECS) graduate student and lead author of the paper, said, “We can’t expect laypeople to perform data collection and fine-tune a neural network model. The consumer will expect the robot to work right out of the box; if it doesn’t, they would want an intuitive mechanism to customize it. That is the challenge we tackled in this work.”
“We want to allow the user to interact with the robot without introducing those kinds of mistakes, so we get a behavior that is much more aligned with user intent during deployment but also valid and feasible.”

The framework provides three easy ways for users to guide the robot’s actions. They can point to the desired object in the robot’s camera view, trace a path to follow, or physically move the robot’s arm. Physically moving the robot is the most precise method, as it avoids losing information when translating from a 2D image to a 3D action.
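The paper formalizes these interaction modes; purely as an illustration, the three kinds of feedback might be represented with data structures like the hypothetical ones below. The names and fields are assumptions for this sketch, not the researchers’ actual interface; the point is that only the physical correction is captured directly in 3D.

```python
# Hypothetical representations of the three feedback modes described above.
# These class names and fields are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class PointFeedback:
    """User points at a target object in the robot's 2D camera view."""
    pixel: Tuple[int, int]  # (u, v) image coordinates of the indicated object


@dataclass
class SketchFeedback:
    """User traces a rough path on the camera image for the robot to follow."""
    pixels: List[Tuple[int, int]]  # ordered (u, v) waypoints of the sketch


@dataclass
class NudgeFeedback:
    """User physically moves the arm; the correction is recorded directly in 3D."""
    poses: np.ndarray  # (T, 3) end-effector positions, so no 2D-to-3D conversion is needed
```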
To keep the robot from making invalid moves, such as colliding with objects, the researchers developed a specific sampling technique. It lets the robot pick, from a set of valid options, the action that best aligns with the user’s intent.
Instead of directly imposing the user’s instructions, the robot combines the user’s feedback with its own learned behaviors. This balance helps the robot adapt while staying within safe limits.
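Concretely, that combination can be pictured as a rank-and-select loop over trajectories drawn from the pre-trained policy. The sketch below is a simplified illustration under assumed interfaces, namely a `policy` object with a `sample()` method, a user-supplied `is_collision_free` check, and a distance-based `alignment_score`; the researchers’ actual steering procedure is more sophisticated than this.

```python
# Minimal sketch of sampling-based steering under assumed interfaces (not the paper's code).
import numpy as np


def alignment_score(traj: np.ndarray, feedback_points: np.ndarray) -> float:
    """Higher is better: negative mean distance from each feedback point to the trajectory."""
    # traj: (T, 3) candidate end-effector path; feedback_points: (K, 3) user-indicated points.
    dists = np.linalg.norm(traj[None, :, :] - feedback_points[:, None, :], axis=-1)  # (K, T)
    return -float(dists.min(axis=1).mean())


def steer(policy, feedback_points: np.ndarray, is_collision_free, n_samples: int = 64) -> np.ndarray:
    """Return the valid policy-generated trajectory that best matches the user's feedback."""
    candidates = [policy.sample() for _ in range(n_samples)]  # draw from the learned behavior
    valid = [t for t in candidates if is_collision_free(t)]   # discard infeasible motions
    if not valid:
        raise RuntimeError("No valid trajectory was sampled; try more samples.")
    return max(valid, key=lambda t: alignment_score(t, feedback_points))
```

Because every candidate comes from the pre-trained policy, the selected trajectory stays close to behaviors the model already knows how to execute, which is what keeps the user’s steering from pushing the robot into invalid motions.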
Tests in simulations and with a real robot arm in a toy kitchen showed that this approach outperformed other methods. While it might not always complete a task immediately, it allows users to correct mistakes in real time rather than waiting until the task is done to provide new instructions.
The researchers now aim to optimize the sampling procedure while maintaining or improving its performance.
Journal Reference:
- Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sundaralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez-D’Arpino, Dieter Fox, and Julie Shah. “Inference-Time Policy Steering through Human Interactions.” arXiv:2411.16627