Google has refreshed its “Android Bench” rankings, unveiling a new lineup of AI models evaluated for Android app development. The update adds several “open-weight” models and gives more detail on performance, including token usage and associated costs.
Large language models have become increasingly capable at coding and now play a significant role in app development, a trend that has given rise to so-called “vibe coding.” Earlier this year, Google released a benchmark that ranks the top AI models for Android development, focusing on common tasks and adherence to best practices.
Initially, the rankings were led by Gemini 3.1 Pro, with OpenAI’s GPT 5.4 later matching it at the top. As of the latest update on May 18, 2026, a new contender has emerged: GPT 5.5 has claimed the top position, surpassing GPT 5.4 and Gemini 3.1 Pro by nearly 2 percentage points (74 versus 72.4).
The update also reports average latency, total tokens used, and average cost for each AI model. Google has published documentation detailing the methodology behind these metrics.
Average Latency: Time taken to complete 100 tasks across 10 runs
Average Total Tokens: Token consumption for a complete benchmark run across 10 iterations
Average Cost: Cost per benchmark run in US dollars at the time of testing
While GPT 5.5 delivers the best score, it comes at a price: more than twice the cost of Gemini 3.1 Pro per benchmark run.
Here’s a look at the top ten models based on Google’s latest data as of May 21, 2026:
| Model | Score | Avg Latency | Avg Total Tokens | Avg Cost (USD) |
| --- | --- | --- | --- | --- |
| New: GPT 5.5 | 74 | 15.5 | 64.5 | 3.9 |
| GPT 5.4 | 72.4 | 21.2 | 64.2 | .7 |
| Gemini 3.1 Pro Preview | 72.4 | 11.5 | 75.4 | .0 |
| New: Claude Opus 4.7 | 68.7 | 11.6 | 90.0 | 4.3 |
| GPT 5.3 Codex | 67.7 | 11.2 | 71.4 | .6 |
| Claude Opus 4.6 | 66.6 | 9.9 | 69.5 | .4 |
| GPT 5.2 Codex | 62.5 | 24.3 | 124.4 | 1.9 |
| Claude Opus 4.5 | 61.9 | 12.5 | 79.8 | 2.5 |
| Gemini 3 Pro Preview | 60.4 | 9.8 | 117.0 | .7 |
| New: GLM 5.1 | 59.7 | 33.4 | 80.2 | .7 |
The rankings now feature a wider array of open-weight models, including Gemma, Qwen, DeepSeek, and MiMo, among others. GLM 5.1 is the highest scorer among these newcomers, closely followed by Kimi K2.6.
Google has committed to updating “Android Bench” on a monthly basis. With Gemini 3.5 Flash already available and Gemini 3.5 Pro anticipated, it will be interesting to see whether Google can reclaim the lead from OpenAI.
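The "nearly 2%" lead and the token figures can be checked with a few lines of Python. The sketch below is only an illustration built on the scores, latencies, and token counts listed in the rankings. The `BenchResult` class, the `score_lead` helper, and the `tokens_per_point` ratio are hypothetical names introduced here, not part of Google's methodology, and the cost column is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """One row of the rankings (hypothetical container, not a Google API)."""
    model: str
    score: float
    avg_latency: float       # unit as reported in the rankings
    avg_total_tokens: float  # unit as reported in the rankings


# Values copied from the top-ten rankings (cost column omitted).
RESULTS = [
    BenchResult("GPT 5.5", 74.0, 15.5, 64.5),
    BenchResult("GPT 5.4", 72.4, 21.2, 64.2),
    BenchResult("Gemini 3.1 Pro Preview", 72.4, 11.5, 75.4),
    BenchResult("Claude Opus 4.7", 68.7, 11.6, 90.0),
    BenchResult("GPT 5.3 Codex", 67.7, 11.2, 71.4),
    BenchResult("Claude Opus 4.6", 66.6, 9.9, 69.5),
    BenchResult("GPT 5.2 Codex", 62.5, 24.3, 124.4),
    BenchResult("Claude Opus 4.5", 61.9, 12.5, 79.8),
    BenchResult("Gemini 3 Pro Preview", 60.4, 9.8, 117.0),
    BenchResult("GLM 5.1", 59.7, 33.4, 80.2),
]


def score_lead(results):
    """Return (leader, gap in score points, gap as % of the runner-up's score)."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    gap_points = leader.score - runner_up.score
    gap_relative = gap_points / runner_up.score * 100
    return leader.model, round(gap_points, 1), round(gap_relative, 1)


def tokens_per_point(result):
    """Assumed efficiency ratio: reported tokens divided by score (lower is leaner)."""
    return result.avg_total_tokens / result.score


if __name__ == "__main__":
    print(score_lead(RESULTS))
    leanest = min(RESULTS, key=tokens_per_point)
    print(leanest.model, round(tokens_per_point(leanest), 2))
```

With these figures, the gap between GPT 5.5 and the two models tied behind it is 1.6 score points, or about 2.2% relative to their score of 72.4. The tokens-per-point ratio is only one possible way to weigh score against token consumption, and a different weighting would rank the models differently.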