Gemini 3.5 Flash lands on Google’s Android coding rankings, but it’s 3x the cost for slower performance

Google has published a new set of benchmark results evaluating how well AI models handle Android coding, along with the average token usage and cost of each model. Notably, Gemini 3.5 Flash emerges as the most resource-intensive option for Android development, yet it fails to place in the top five.

As the initial excitement surrounding general chatbots begins to wane, tech giants such as Google, OpenAI, and Anthropic are pivoting towards agentic models that excel in coding capabilities. This shift has led users to increasingly depend on these models for what has been dubbed “vibe coding,” a practice that allows large language models (LLMs) to handle a significant portion of software development tasks.

Over the past few months, Google has tracked advances in model performance through “Android Bench,” a benchmark it maintains to compare its own models against competitors. The results are updated regularly to cover the latest releases, including the recent Gemini 3.5 Flash, and measure how effective each model is at Android coding.

The latest Android Bench results reveal a costlier landscape for developers. Gemini 3.5 Flash ranks sixth, trailing models such as GPT 5.5 and Gemini 3.1 Pro Preview, the latter of which was tested back in February.

Gemini 3.5 Flash was initially marketed as a cheaper and faster alternative to Gemini 3.1 Pro, but the new benchmark results suggest otherwise. It shows higher latency (14.2 versus 11.1) and scores about 9 points lower than its predecessor (63.7 versus 72.4), despite an anticipated performance advantage of 6.1%.

The cost gap is striking: Gemini 3.5 Flash averages 355.9 total tokens per benchmark run at an average cost of approximately 7.1, whereas Gemini 3.1 Pro Preview uses only 73.3 tokens at roughly a third of that cost.

Note that the scores for Gemini 3.1 Pro are based on its preview version, yet it still outperforms a model designed to be faster and more efficient.

GPT 5.5 has a similar cost structure per run, but Gemini 3.5 Flash consumed 5.5 times as many tokens during the Android Bench evaluations. Anthropic’s previous model, Claude Opus 4.7, ranks fourth, with a slightly lower run cost and token usage, placing it comfortably in the middle of the rankings. Notably, benchmark scores for Opus 4.8 and Fable 5 remain unpublished.

Top Eleven Models Ranked by Google:

Model	Score	Avg Latency	Avg Total Tokens	Avg Cost
GPT 5.5	74	15.7	64.7	4.2
GPT 5.4	72.4	21.2	64.2	.7
Gemini 3.1 Pro Preview	72.4	11.1	73.3	.9
Claude Opus 4.7	68.7	11.6	90.0	4.3
Claude Opus 4.6	66.6	9.9	69.5	.4
Gemini 3.5 Flash	63.7	14.2	355.9	7.1
GLM 5.1	59.7	33.4	80.2	.7
Kimi K2.6	58.6	29.9	94.3	.5
Claude Sonnet 4.6	58.4	8.2	47.9	.4
DeepSeek V4 Pro	55.4	35.8	132.7	.7
Claude Sonnet 4.5	53.7	13.1	94.2	.0
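
The token and score comparisons quoted in this article can be checked directly against the table. The snippet below is a minimal sketch, not anything Google publishes: the helper names (`token_ratio`, `score_gap`, `rank_by_score`) are made up for illustration, and the figures are simply copied from the score, latency, and token columns above. The cost column is left out because several of its values appear incomplete in the table as reproduced here.

```python
# Rows copied from the table: model -> (score, avg latency, avg total tokens).
# The cost column is deliberately omitted (some cells look incomplete).
ROWS = {
    "GPT 5.5": (74.0, 15.7, 64.7),
    "GPT 5.4": (72.4, 21.2, 64.2),
    "Gemini 3.1 Pro Preview": (72.4, 11.1, 73.3),
    "Claude Opus 4.7": (68.7, 11.6, 90.0),
    "Claude Opus 4.6": (66.6, 9.9, 69.5),
    "Gemini 3.5 Flash": (63.7, 14.2, 355.9),
    "GLM 5.1": (59.7, 33.4, 80.2),
    "Kimi K2.6": (58.6, 29.9, 94.3),
    "Claude Sonnet 4.6": (58.4, 8.2, 47.9),
    "DeepSeek V4 Pro": (55.4, 35.8, 132.7),
    "Claude Sonnet 4.5": (53.7, 13.1, 94.2),
}


def token_ratio(model: str, baseline: str) -> float:
    """How many times as many tokens `model` uses compared with `baseline`."""
    return ROWS[model][2] / ROWS[baseline][2]


def score_gap(leader: str, trailer: str) -> float:
    """Score difference in points (leader minus trailer)."""
    return ROWS[leader][0] - ROWS[trailer][0]


def rank_by_score(model: str) -> int:
    """1-based position of `model` when rows are sorted by score, highest first."""
    ordered = sorted(ROWS, key=lambda name: ROWS[name][0], reverse=True)
    return ordered.index(model) + 1


if __name__ == "__main__":
    print(round(token_ratio("Gemini 3.5 Flash", "GPT 5.5"), 1))              # 5.5
    print(round(score_gap("Gemini 3.1 Pro Preview", "Gemini 3.5 Flash"), 1))  # 8.7
    print(rank_by_score("Gemini 3.5 Flash"))                                  # 6
```

Run as a script, this reproduces the 5.5x token figure against GPT 5.5, the roughly 9-point score gap to Gemini 3.1 Pro Preview, and Gemini 3.5 Flash's sixth-place ranking.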

The rankings feature several open-weight models alongside established closed-weight options like Claude and GPT. The upper echelon of the list has remained largely consistent since the previous Android Bench release, with the notable exception of GPT 5.3 Codex, which has been removed from consideration.

For those interested in a comprehensive view of the rankings, the full details are available on Google’s official website.

Google continues to update this list as new models are tested, giving developers a consistent benchmark for assessing performance in Android development. While Gemini 3.5 Flash shows promise in other LLM and agentic tasks, it appears less effective at Android coding.

AppWizard