Dreaming in Blocks — MineWorld, the Minecraft World Model

Microsoft has released MineWorld, an open-source, interactive world model for Minecraft [1]. The model supports real-time interaction and high controllability, which sets it apart from earlier systems, most notably the closed-source Oasis [2].

World models, a concept introduced by David Ha and Jürgen Schmidhuber in 2018 [3], have long been held back by computational costs that make real-time interaction difficult. MineWorld aims to overcome these limits through an efficient architecture and a faster decoding scheme. Its contributions can be distilled into three key components:

  1. Real-time Interactivity: MineWorld is designed for fast, interactive gameplay, so users can act on the environment and see the result immediately.
  2. Parallel Decoding Algorithm: Instead of producing tokens strictly one at a time, the model generates several tokens in parallel, which significantly increases the number of frames produced per second.
  3. Novel Evaluation Metric: A new metric assesses the controllability of the world model, that is, how closely the generated frames follow the player's input actions.

Paper link: https://arxiv.org/abs/2504.08388
Code: https://github.com/microsoft/mineworld
Released: 11th of April 2025


MineWorld, Simplified

To understand how MineWorld works, we can break it down into three parts:

  • Problem Formulation: This section defines the challenges and establishes the foundational rules for both training and inference.
  • Model Architecture: An overview of the models utilized for generating tokens and output images.
  • Parallel Decoding: An exploration of how the authors enhanced frame generation rates through a diagonal decoding algorithm.

Problem Formulation

MineWorld processes two types of input: video game footage and player actions. Each input type needs its own tokenization scheme before the model can use it.

For instance, given a clip of Minecraft video denoted as 𝑥, containing 𝑛 states or frames, the image tokenization can be represented mathematically:

x = (x_1, \dots, x_n)

t = (t_1, \dots, t_c, t_{c+1}, \dots, t_{2c}, t_{2c+1}, \dots, t_N)

Each frame x_i consists of c patches, with each patch represented by one token t_j. A single frame can therefore be described as a sequence of c quantized tokens, for example (t_1, t_2, …, t_c) for the first frame, where each token captures a distinct patch of pixels.

As every frame comprises c tokens, the total number of tokens across a video clip is N = n · c.
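
The bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the indexing, not the MineWorld tokenizer: the sizes (n = 3, c = 4), the helper names `flatten_frames` and `frame_slice`, and the integer token ids are all assumptions made up for the example.

```python
from typing import List


def flatten_frames(frames: List[List[int]]) -> List[int]:
    """Concatenate per-frame token lists into one sequence t = (t_1, ..., t_N)."""
    c = len(frames[0])
    if any(len(frame) != c for frame in frames):
        raise ValueError("every frame must contribute exactly c tokens")
    return [tok for frame in frames for tok in frame]


def frame_slice(i: int, c: int) -> slice:
    """Slice of the flat sequence that holds frame i (frames are 1-based)."""
    return slice((i - 1) * c, i * c)


# Hypothetical sizes: n = 3 frames, c = 4 tokens per frame.
n, c = 3, 4
# Stand-in token ids; a real visual tokenizer would produce codebook indices.
frames = [[100 * i + j for j in range(c)] for i in range(1, n + 1)]
tokens = flatten_frames(frames)

print(len(tokens))                 # N = n * c
print(tokens[frame_slice(2, c)])   # tokens t_{c+1}, ..., t_{2c} of frame 2
```

Frame i always occupies positions (i - 1) · c + 1 through i · c of the flat sequence, which is why the formula for t above is written in blocks of c.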

Table 1. Seven different classes for the 11 different possibilities of actions. Grouping taken from [1]

Player actions need tokenization as well. This step captures camera movement, keyboard inputs, and mouse inputs, and encodes each action as a sequence of 11 tokens:

  • 7 tokens for seven exclusive action groups, with related actions classified together (as shown in Table 1).
  • 2 tokens to encode camera angles.
  • 2 tokens to denote the beginning and end of the action sequence.
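
The 7 + 2 + 2 layout listed above can be sketched as follows. This is a hedged illustration of the sequence shape only: the vocabulary ids, `MAX_CHOICES`, `CAMERA_BINS`, and the function name `tokenize_action` are hypothetical, and the actual contents of the seven groups (Table 1) and the real camera quantization are not reproduced here.

```python
from typing import List, Sequence, Tuple

NUM_GROUPS = 7    # exclusive action groups (from the text)
ACTION_LEN = 11   # 1 begin + 7 group + 2 camera + 1 end

# Hypothetical vocabulary layout, chosen only for this example.
BOA, EOA = 0, 1   # begin-of-action / end-of-action markers
GROUP_BASE = 2    # first id reserved for group tokens
MAX_CHOICES = 4   # assumed upper bound on choices within one group
CAMERA_BASE = GROUP_BASE + NUM_GROUPS * MAX_CHOICES
CAMERA_BINS = 11  # assumed number of discrete bins per camera axis


def tokenize_action(group_choices: Sequence[int], camera: Tuple[int, int]) -> List[int]:
    """Encode one action as [BOA, 7 group tokens, 2 camera tokens, EOA]."""
    if len(group_choices) != NUM_GROUPS:
        raise ValueError("expected one choice per action group")
    if any(not 0 <= k < MAX_CHOICES for k in group_choices):
        raise ValueError("group choice out of range")
    if any(not 0 <= b < CAMERA_BINS for b in camera):
        raise ValueError("camera bin out of range")
    # Each group gets its own id range, so one token identifies group and choice.
    group_tokens = [GROUP_BASE + g * MAX_CHOICES + k for g, k in enumerate(group_choices)]
    # Each camera axis gets its own id range as well.
    camera_tokens = [CAMERA_BASE + axis * CAMERA_BINS + b for axis, b in enumerate(camera)]
    return [BOA, *group_tokens, *camera_tokens, EOA]


action_tokens = tokenize_action([0, 1, 0, 0, 2, 0, 0], camera=(5, 5))
print(len(action_tokens), action_tokens)
```

Giving every group its own id range is one simple way to keep the groups mutually exclusive at the token level; the paper may lay out its vocabulary differently.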

References

  • [1] J. Guo et al., MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft (2025), arXiv preprint arXiv:2504.08388v1
  • [2] R. Wachen and D. Leitersdorf, Oasis (2024), https://oasis-ai.org/
  • [3] D. Ha and J. Schmidhuber, World Models (2018), arXiv preprint arXiv:1803.10122
  • [4] J. Guo et al., MineWorld (2025), GitHub repository: https://github.com/microsoft/mineworld
  • [5] B. Baker et al., Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (2022), arXiv preprint arXiv:2206.11795
  • [6] A. van den Oord et al., Neural Discrete Representation Learning (2017), arXiv preprint arXiv:1711.00937
  • [7] H. Touvron et al., LLaMA: Open and Efficient Foundation Language Models (2023), arXiv preprint arXiv:2302.13971
  • [8] Y. Ye et al., Fast Autoregressive Video Generation with Diagonal Decoding (2025), arXiv preprint arXiv:2503.14070