Google has refreshed its “Android Bench” rankings, which evaluate how well AI models handle Android app development. The update adds several “open-weight” models and reports more detail for each entry, including latency, token usage, and cost.
Large language models have increasingly demonstrated their prowess in coding, significantly enhancing the app development process. This trend has given rise to what is now known as “vibe coding.” Earlier this year, Google released a benchmark ranking that evaluated the top AI models for Android development, focusing on common tasks and adherence to best practices.
Initially, the rankings were led by Gemini 3.1 Pro, with OpenAI’s GPT 5.4 later sharing the spotlight. However, as of the latest update on May 18, 2026, a new contender has emerged. GPT 5.5 has claimed the top position with a score of 74, ahead of GPT 5.4 and Gemini 3.1 Pro (both at 72.4) by 1.6 points.
This update also enhances clarity by presenting average latency, total tokens utilized, and the average cost associated with each AI model. Google has provided documentation detailing the methodology behind these metrics.
- Average Latency: Time taken to complete 100 tasks across 10 runs
- Average Total Tokens: Token consumption for a complete benchmark run across 10 iterations
- Average Cost: Cost per benchmark run in US dollars at the time of testing
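As a rough illustration of how such averages could be derived, here is a minimal Python sketch. It assumes each benchmark run yields one record with a latency, a total token count, and a cost in US dollars. The field names and the sample values are hypothetical placeholders, not Google’s data or its actual tooling.

```python
from statistics import mean


def average_metrics(runs):
    """Average each metric over a list of per-run records.

    Each record is a dict such as
    {"latency": ..., "total_tokens": ..., "cost_usd": ...}.
    These field names are assumptions made for this sketch.
    """
    if not runs:
        raise ValueError("at least one benchmark run is required")
    # Use the keys of the first record and average each one across all runs.
    return {key: mean(run[key] for run in runs) for key in runs[0]}


if __name__ == "__main__":
    # Placeholder values for two runs; a real evaluation would have 10.
    sample_runs = [
        {"latency": 10.0, "total_tokens": 60.0, "cost_usd": 1.0},
        {"latency": 14.0, "total_tokens": 80.0, "cost_usd": 3.0},
    ]
    print(average_metrics(sample_runs))
```

Averaging over several runs smooths out run-to-run variation, which is presumably why the published figures are based on 10 runs rather than one.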
While GPT 5.5 posts the highest score, it is also expensive: according to Google’s figures, a benchmark run costs more than twice as much as one with Gemini 3.1 Pro.
Here’s a look at the top ten models based on Google’s latest data as of May 21, 2026:
| Model | Score | Avg Latency | Avg Total Tokens | Avg Cost (USD) |
| --- | --- | --- | --- | --- |
| New: GPT 5.5 | 74 | 15.5 | 64.5 | 3.9 |
| GPT 5.4 | 72.4 | 21.2 | 64.2 | .7 |
| Gemini 3.1 Pro Preview | 72.4 | 11.5 | 75.4 | .0 |
| New: Claude Opus 4.7 | 68.7 | 11.6 | 90.0 | 4.3 |
| GPT 5.3 Codex | 67.7 | 11.2 | 71.4 | .6 |
| Claude Opus 4.6 | 66.6 | 9.9 | 69.5 | .4 |
| GPT 5.2 Codex | 62.5 | 24.3 | 124.4 | 1.9 |
| Claude Opus 4.5 | 61.9 | 12.5 | 79.8 | 2.5 |
| Gemini 3 Pro Preview | 60.4 | 9.8 | 117.0 | .7 |
| New: GLM 5.1 | 59.7 | 33.4 | 80.2 | .7 |
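To put the gap at the top of the table in perspective, the short Python sketch below compares scores in both absolute points and relative terms. The scores are copied from the table above; the `score_gap` helper is a hypothetical name introduced for this example.

```python
# Scores taken from the top three rows of the table.
scores = {
    "GPT 5.5": 74.0,
    "GPT 5.4": 72.4,
    "Gemini 3.1 Pro Preview": 72.4,
}


def score_gap(leader, other):
    """Return (gap in points, gap as a percentage of the other model's score)."""
    points = scores[leader] - scores[other]
    relative = points / scores[other] * 100
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    points, relative = score_gap("GPT 5.5", "GPT 5.4")
    # prints: GPT 5.5 leads by 1.6 points (2.2% relative)
    print(f"GPT 5.5 leads by {points} points ({relative}% relative)")
```

The distinction matters when reading the rankings: a lead of 1.6 points on the benchmark score corresponds to a relative improvement of roughly 2.2% over the runners-up.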
The rankings now feature a wider array of open-weight models, including Gemma, Qwen, DeepSeek, and MiMo, among others. GLM 5.1 has emerged as the highest scorer among these newcomers, closely followed by Kimi K2.6.
Google is committed to updating the “Android Bench” on a monthly basis. With the anticipated release of Gemini 3.5 Pro and the already available 3.5 Flash, the competitive landscape will be intriguing to watch as Google seeks to reclaim its lead against OpenAI’s advancements.