A team of researchers has unveiled ROCKET-1, a method designed to make AI agents interact more precisely with virtual environments, particularly games like Minecraft. The approach combines object detection and tracking with advanced AI models through a technique the researchers call "visual-temporal context prompting." Unlike previous methods built on language models or diffusion models, which often struggled to convey spatial information or to accurately predict future states, ROCKET-1 has its AI models communicate with each other through visual prompts.
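To make the idea concrete, here is a minimal Python sketch of what a visual-temporal prompt might look like as a data structure. The class and field names, and the context window size, are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VisualTemporalPrompt:
    """Illustrative container for a visual-temporal prompt: instead of a
    text instruction, the policy sees recent observations plus segmentation
    masks highlighting the object it should interact with."""
    frames: list[np.ndarray]        # recent RGB game frames, each (H, W, 3)
    object_masks: list[np.ndarray]  # per-frame binary masks (H, W); all-zero
                                    # when the target is not visible
    interaction_type: int           # discrete cue for the intended interaction

def build_prompt(frame_history, mask_history, interaction_type, window=16):
    # Keep only the most recent frames as temporal context (the window size
    # here is an assumed hyperparameter, not taken from the paper).
    return VisualTemporalPrompt(
        frames=frame_history[-window:],
        object_masks=mask_history[-window:],
        interaction_type=interaction_type,
    )
```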
GPT-4o plans, ROCKET-1 executes
The system operates on multiple levels: GPT-4o serves as a high-level “planner,” deconstructing intricate tasks, such as “Obtain Obsidian,” into manageable steps. The multimodal model, Molmo, identifies relevant objects within images using coordinate points, while SAM-2 generates precise object masks from these points and tracks the objects in real-time. ROCKET-1 then executes the actions in the game world based on the generated object masks and instructions, effectively controlling keyboard and mouse inputs.
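As a rough sketch, that hierarchy could be orchestrated like the loop below. The objects and method names (planner, pointer, segmenter, policy) are placeholders standing in for GPT-4o, Molmo, SAM-2, and ROCKET-1, not the authors' actual API:

```python
def run_task(task, env, planner, pointer, segmenter, policy):
    """Placeholder orchestration of the described hierarchy."""
    steps = planner.plan(task)                 # GPT-4o: "Obtain Obsidian" -> sub-goals
    obs = env.reset()
    for step in steps:
        x, y = pointer.point(obs, step.text)   # Molmo: target named via a coordinate point
        mask = segmenter.segment(obs, (x, y))  # SAM-2: point -> precise object mask
        while not step.done(obs):
            # ROCKET-1: object mask + interaction cue -> keyboard/mouse action
            action = policy.act(obs, mask, step.interaction_type)
            obs = env.step(action)
            mask = segmenter.track(obs)        # SAM-2 keeps tracking the mask in real time
```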
The researchers draw inspiration from how humans execute tasks. They explain: "In human task execution, such as object grasping, people do not pre-imagine holding an object but maintain focus on the target object while approaching its affordance." In other words, we do not mentally render the end state of holding the object; we keep the target in view and rely on perception while acting.
In a demonstration, the team showed that a human can control ROCKET-1 directly: clicking on objects in the game world prompts the system to engage with them. The hierarchical agent structure the team proposes, which combines GPT-4o, Molmo, and SAM-2, reduces the required human input to a simple text instruction.
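A direct-control setup like this could be wired up roughly as follows, assuming SAM-2's image predictor and its point-prompt interface; the surrounding function is a hypothetical example, not code from the project:

```python
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

def mask_from_click(predictor: SAM2ImagePredictor, frame, click_xy):
    """Turn a user's mouse click on the game frame into an object mask
    that can serve as ROCKET-1's visual prompt."""
    predictor.set_image(frame)                      # frame: (H, W, 3) RGB array
    point = np.array([click_xy], dtype=np.float32)  # the clicked pixel
    label = np.array([1])                           # 1 = foreground point
    masks, scores, _ = predictor.predict(
        point_coords=point, point_labels=label, multimask_output=True
    )
    return masks[np.argmax(scores)]                 # keep the best-scoring mask
```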
Multiple AI models work together
For training, the research team used OpenAI's "Contractor" dataset, which comprises 1.6 billion frames of recorded human Minecraft gameplay. They developed a specialized method called "Backward Trajectory Relabeling" to automatically generate the necessary training data from it.
The SAM-2 AI model plays a crucial role by analyzing recorded gameplay in reverse, automatically identifying which objects players have interacted with. These objects are then marked in earlier frames, enabling ROCKET-1 to learn to recognize and engage with relevant items.
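A simplified sketch of that relabeling pass: walk each episode backwards from an interaction event and propagate the object's mask into earlier frames. The tracker.propagate_backward call stands in for running SAM-2 in reverse and is an assumption, not the paper's API:

```python
def relabel_episode(frames, interaction_events, tracker):
    """frames: list of RGB frames from one gameplay episode.
    interaction_events: (t, mask) pairs where the player interacted with
    an object at time t, segmented as `mask`.
    tracker: a video segmentation model (the paper uses SAM-2) that can
    propagate a mask to an earlier frame; the method name is assumed."""
    labels = [None] * len(frames)
    for t_event, mask in interaction_events:
        labels[t_event] = mask
        current = mask
        for t in range(t_event - 1, -1, -1):  # walk backwards in time
            current = tracker.propagate_backward(frames[t], current)
            if current is None:               # object left the field of view
                break
            labels[t] = current               # mark the object in this earlier frame
    return labels  # per-frame masks become visual prompts in the training data
```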
ROCKET-1: Increased computational effort
ROCKET-1's effectiveness is most pronounced in complex, long-horizon tasks within Minecraft. Across seven distinct tasks, ranging from crafting tools to mining resources, it achieved success rates of up to 100 percent, where other systems often failed outright. Even on harder challenges such as mining diamonds or creating obsidian, the system reached success rates of 25 and 50 percent, respectively.
However, the researchers acknowledge certain limitations of ROCKET-1: "Although ROCKET-1 significantly enhances interaction capabilities in Minecraft, it cannot engage with objects that are outside its field of view or have not been previously encountered." Working around this constraint drives up computational cost, because the higher-level models must intervene more frequently.
For those interested in exploring further, additional information and examples can be found on the project’s GitHub page.