Engineers at MIT have developed a method that empowers robots to make intuitive, task-relevant decisions. Their approach, known as Clio, enables a robot to identify the parts of a scene that matter for the tasks at hand. With Clio, robots can process natural language descriptions of their tasks, determine the level of detail needed to interpret their surroundings, and remember only the relevant parts of a scene.
In real-world experiments conducted in diverse environments, from a cluttered cubicle to a five-story building on MIT’s campus, Clio automatically segmented a scene at various levels of granularity. Given natural language prompts such as “move rack of magazines” and “get first aid kit,” Clio showed its ability to understand and carry out such tasks.
The team also demonstrated Clio running in real time on a quadruped robot as it navigated an office building. Clio identified and mapped only the parts of the scene that related to the robot’s tasks, such as retrieving a dog toy, while ignoring unrelated office supplies. The system is named after the Greek muse of history, for its ability to identify and remember only the elements that matter for a given mission.
The researchers anticipate that Clio’s potential applications span various scenarios where robots need to efficiently interpret their surroundings in alignment with their objectives.
“Search and rescue is the motivating application for this work, but Clio can also power domestic robots and robots working on a factory floor alongside humans,” says Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. “It’s really about helping the robot understand the environment and what it has to remember in order to carry out its mission.”
Recent advances in computer vision and natural language processing have enabled robots to identify objects in real-world settings using open-set recognition. This approach, powered by deep-learning tools, allows robots to adapt and learn from diverse scenarios. The next challenge is to refine how robots interpret scenes to make their understanding more relevant and actionable for specific tasks.
“Typical methods will pick some arbitrary, fixed level of granularity for determining how to fuse segments of a scene into what you can consider as one ‘object,’” says Dominic Maggio, an MIT graduate student and lead author of the work. “However, the granularity of what you call an ‘object’ is actually related to what the robot has to do. If that granularity is fixed without considering the tasks, then the robot may end up with a map that isn’t useful for its tasks.”
With Clio, the MIT team aimed to enable robots to interpret their surroundings with a level of granularity that can be automatically tuned to the tasks at hand.
Imagine a robot tasked with moving a stack of books to a shelf. Clio identifies the entire stack as the task-relevant object. If the goal is instead to move only the green book from the stack, Clio distinguishes the green book as a single target object and disregards the rest of the scene.
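The book-stack example can be illustrated with a deliberately simplified sketch. Clio itself resolves granularity with learned vision-language embeddings, not string matching; the toy function below (all names hypothetical) only shows the core idea that the same scene yields a coarse or fine grouping depending on what the task refers to.

```python
def choose_granularity(task, scene):
    """Toy stand-in for task-driven granularity: 'scene' maps a coarse
    grouping (e.g. "stack of books") to its finer-grained parts.
    If the task mentions the group, treat the whole group as one object;
    otherwise keep only the individual parts the task mentions."""
    picks = []
    for group, parts in scene.items():
        if group in task:
            picks.append(group)  # coarse granularity: the whole stack
        else:
            picks.extend(p for p in parts if p in task)  # fine granularity
    return picks

scene = {"stack of books": ["green book", "red book", "blue book"]}
print(choose_granularity("move the stack of books to a shelf", scene))
# -> ['stack of books']
print(choose_granularity("move the green book", scene))
# -> ['green book']
```

In the real system, the decision of whether to fuse segments into one object is driven by semantic similarity to the task description rather than literal substring matches.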
The team’s approach combines state-of-the-art computer vision with large language models, using neural networks trained to draw connections among millions of open-source images and associated text. Clio also incorporates mapping tools that automatically split an image into many small segments, which are fed to the neural networks to measure how semantically similar the segments are. The researchers then apply the concept of the “information bottleneck” from classic information theory to compress the image segments in a way that isolates and stores only the segments most relevant to a given task.
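The relevance-filtering step can be sketched in a few lines. This is not Clio’s actual information-bottleneck algorithm; it is a minimal illustration of the underlying intuition, assuming hypothetical vision-language embeddings in which segments and a task description share a vector space, so that cosine similarity measures semantic relevance.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def relevant_segments(segment_embeddings, task_embedding, threshold=0.7):
    """Keep only the scene segments semantically close to the task."""
    return [name for name, emb in segment_embeddings.items()
            if cosine(emb, task_embedding) >= threshold]

# Toy 3-D embeddings standing in for real vision-language features.
segments = {
    "first_aid_kit": [0.9, 0.1, 0.0],
    "stapler":       [0.0, 0.2, 0.9],
    "bandages":      [0.8, 0.3, 0.1],
}
task = [1.0, 0.2, 0.0]  # embedding of the prompt "get first aid kit"
print(relevant_segments(segments, task))
# -> ['first_aid_kit', 'bandages']
```

The information bottleneck goes further than simple thresholding: it compresses the segment representation as aggressively as possible while preserving the information needed for the task, which is what lets Clio discard irrelevant clutter rather than store everything it sees.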
The result is a step toward robots that build maps tailored to their missions, rather than generic representations of everything they see.
The researchers demonstrated Clio in various real-world settings. “We decided to put Clio to the test in my apartment without any prior cleaning,” Maggio says. The team gave Clio practical tasks such as “move pile of clothes,” then had it analyze images of the cluttered apartment; using the information bottleneck algorithm, Clio identified the segments that made up the pile of clothes.
The team also ran Clio on Boston Dynamics’ quadruped robot, Spot, with a list of tasks to accomplish. As Spot explored and mapped the interior of an office building, Clio ran in real time on an on-board computer, identifying the relevant segments within the mapped scenes. The method generated an overlay map showing only the target objects, which the robot then used to approach each object and complete its tasks.
“Running Clio in real-time was a big accomplishment for the team,” Maggio says. “A lot of prior work can take several hours to run.”
Next, the team plans to adapt Clio to handle higher-level tasks and to build on recent advances in photorealistic visual scene representations.
“We’re still giving Clio tasks that are somewhat specific, like ‘find deck of cards,'” Maggio says. “For search and rescue, you need to give it more high-level tasks, like ‘find survivors’ or ‘get power back on.’ So, we want to get to a more human-level understanding of how to accomplish more complex tasks.”