Android Bench Archives

AppWizard

June 13, 2026

Gemini 3.5 Flash lands on Google’s Android coding rankings, but it’s 3x the cost for slower performance

Google has released benchmark results for evaluating AI models in Android coding, showing that Gemini 3.5 Flash is the most resource-intensive model on the list yet ranks only sixth overall. Despite being marketed as a faster alternative, Gemini 3.5 Flash has higher latency than its predecessor, Gemini 3.1 Pro Preview, and sits a 9% performance gap behind it. Gemini 3.5 Flash averages 355.9 tokens per benchmark run, while Gemini 3.1 Pro Preview uses only 73.3 tokens at about a third of the cost. The top three models are GPT 5.5, GPT 5.4, and Gemini 3.1 Pro Preview, with Claude Opus 4.7 in fourth. The rankings cover both open-weight and closed-weight models, and the list is unchanged since the last release except for the removal of GPT 5.3 Codex.
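
The token and cost comparison above can be checked with a little arithmetic. The sketch below is illustrative only: the two token averages come from the summary, while the cost ratio of exactly 3 is an assumption taken from the headline's "3x the cost", and the function name is hypothetical.

```python
# Averages quoted in the summary above; the source does not state the token unit.
FLASH_TOKENS = 355.9  # Gemini 3.5 Flash, average total tokens per benchmark run
PRO_TOKENS = 73.3     # Gemini 3.1 Pro Preview, average total tokens per benchmark run

# Assumption: the headline's "3x the cost" is treated as an exact ratio.
ASSUMED_COST_RATIO = 3.0


def relative_cost_per_token(cost_ratio: float, token_ratio: float) -> float:
    """Cost per token of model A relative to model B, given two A/B ratios."""
    return cost_ratio / token_ratio


token_ratio = FLASH_TOKENS / PRO_TOKENS
per_token = relative_cost_per_token(ASSUMED_COST_RATIO, token_ratio)

print(f"Token usage ratio: {token_ratio:.2f}x")
print(f"Relative cost per token: {per_token:.2f}x")
```

Under these assumptions, Gemini 3.5 Flash uses nearly five times as many tokens per run, which would mean its higher per-run cost comes from token volume rather than from a higher price per token.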

May 26, 2026

Google ranks the best AI for building Android apps, and the winner isn’t Gemini

Google launched the Android Bench benchmarking portal in March to help software developers evaluate AI models for Android app development. The leaderboard was updated last week to include open-weight models and new metrics for latency, tokens, and cost. Matthew McCullough, Google's VP of Product for Android Development, stated that the goal is to provide a benchmark for evaluating large language models (LLMs) in Android development. As of May 18, GPT 5.5 is the top AI model for Android app development, with Gemini 3.1 Pro and GPT 5.4 tied for second place. Android Bench evaluates LLMs on real-world challenges and tasks sourced from public GitHub repositories. Other benchmarking tools in the Android ecosystem include Jetpack Microbenchmark, Jetpack Macrobenchmark, Firebase Performance Monitoring, Android Vitals, Apptim, and Android Performance Analyzer. The overall benchmark score on Android Bench is calculated using four core values: Confidence Interval Range, Average Latency Score, Average Total Tokens Score, and Average Cost. The test harness for Android Bench is publicly available on GitHub.

May 21, 2026

Google just tested a bunch of new AI models for Android app coding – here are the rankings

Google has updated its "Android Bench" rankings, adding new AI models for Android app development, including open-weight models. The rankings were initially led by Gemini 3.1 Pro, with GPT 5.4 later sharing the top spot. As of the May 18, 2026 update, GPT 5.5 holds the top position, surpassing GPT 5.4 and Gemini 3.1 Pro by nearly 2%. The update also reports three metrics for each model, and Google has published documentation describing the methodology behind them:

Average Latency: time taken to complete 100 tasks across 10 runs
Average Total Tokens: token consumption for a complete benchmark run across 10 iterations
Average Cost: cost per benchmark run in US dollars at the time of testing

GPT 5.5 has the highest score, but it costs over twice as much as Gemini 3.1 Pro for equivalent work. Gemini 3.1 Pro Preview has the same score as GPT 5.4 but different latency and token figures. The top ten models are listed below.

1. GPT 5.5 (new): score 74, average latency 15.5, average total tokens 64.5
2. GPT 5.4: score 72.4, average latency 21.2, average total tokens 64.2
3. Gemini 3.1 Pro Preview: score 72.4, average latency 11.5, average total tokens 75.4
4. Claude Opus 4.7 (new): score 68.7, average latency 11.6, average total tokens 90.0
5. GPT 5.3 Codex: score 67.7, average latency 11.2, average total tokens 71.4
6. Claude Opus 4.6: score 66.6, average latency 9.9, average total tokens 69.5
7. GPT 5.2 Codex: score 62.5, average latency 24.3, average total tokens 124.4
8. Claude Opus 4.5: score 61.9, average latency 12.5, average total tokens 79.8
9. Gemini 3 Pro Preview: score 60.4, average latency 9.8, average total tokens 117.0
10. GLM 5.1 (new): score 59.7, average latency 33.4, average total tokens 80.2

The rankings now cover a wider range of open-weight models, including Gemma, Qwen, DeepSeek, and MiMo. GLM 5.1 is the highest scorer among these newcomers, closely followed by Kimi K2.6. Google plans to update the rankings monthly.
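
The "nearly 2%" lead reported for GPT 5.5 can be read either as a gap in score points or as a relative difference. The sketch below computes both from the scores quoted above. It is a small illustration, and the helper name `lead_over` is hypothetical rather than part of Google's test harness.

```python
# Scores quoted in the summary above for the top three models.
SCORES: dict[str, float] = {
    "GPT 5.5": 74.0,
    "GPT 5.4": 72.4,
    "Gemini 3.1 Pro Preview": 72.4,
}


def lead_over(scores: dict[str, float], leader: str, rival: str) -> tuple[float, float]:
    """Return (gap in score points, gap as a percentage of the rival's score)."""
    gap = scores[leader] - scores[rival]
    return gap, 100.0 * gap / scores[rival]


points, relative = lead_over(SCORES, "GPT 5.5", "GPT 5.4")
print(f"{points:.1f} points, {relative:.1f}% relative")
# prints "1.6 points, 2.2% relative"
```

By this calculation the lead is 1.6 score points, so the "nearly 2%" figure appears to describe the difference in score rather than a relative improvement.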

May 8, 2026

Google is turning Android Studio into a policy watchdog

Google is enhancing the development experience for Android app creators by expanding Play Policy Insights within Android Studio, allowing developers to identify potential policy issues during coding. Developers will connect their Play developer accounts to access tailored insights. Google is also leveraging the SDK Index, a searchable database of Android SDKs, to provide crucial information on permissions and compliance with Play policies. To improve app security and user privacy, Google is upgrading the Play Integrity API for quicker detection of fraud and abuse, and introducing new privacy tools like a contact picker and location button. Support for post-quantum cryptography in Play App Signing will be added to protect apps from future threats. Developer verification will be implemented across Android to prevent the distribution of harmful apps. Google is revamping the Play Console to streamline the app publishing process, introducing expanded pre-review checks for common policy violations, a release status API for tracking approvals, and a feature to prevent new commits during reviews. Parallel publishing will be rolled out to avoid delays in updates. Additional enhancements include a Submission History log, secure account transfer tools, AI-powered policy recommendations, and expanded Play Academy training resources for new developers.

April 9, 2026

Google updates best AI models for coding Android apps, Gemini & GPT 5.4 at the top

The "Android Bench," Google's benchmark for evaluating AI models in Android app development, has been updated with OpenAI's GPT 5.4 and GPT 5.3 Codex. GPT 5.4 now shares the top ranking with Gemini 3.1 Pro Preview, and GPT 5.3 Codex places third. The benchmark evaluates models on criteria such as compatibility with Jetpack Compose, use of Coroutines and Flows, and integration with Room and Hilt. The latest rankings are as follows:

1. GPT 5.4: 72.4%
2. Gemini 3.1 Pro Preview: 72.4%
3. GPT 5.3 Codex: 67.7%
4. Claude Opus 4.6: 66.6%
5. GPT 5.2 Codex: 62.5%
6. Claude Opus 4.5: 61.9%
7. Gemini 3 Pro Preview: 60.4%
8. Claude Sonnet 4.6: 58.4%
9. Claude Sonnet 4.5: 54.2%
10. Gemini 3 Flash Preview: 42%
11. Gemini 2.5 Flash: 16.1%

The scores for the previously listed models have not changed since the initial assessment in late February; the newly added models were evaluated in mid-March. The findings should be interpreted cautiously, as real-world performance may vary with specific workflows and project requirements.

March 6, 2026

Google will now show which AI models are best at building Android apps

Google has introduced Android Bench, a tool for assessing AI model performance in Android app development. The top performer is Gemini 3.1 Pro, scoring 72.4%, followed by Claude Opus 4.6 at 66.6% and GPT 5.2 Codex at 62.5%. The benchmark evaluates models through real-world Android coding challenges, with task completion rates ranging from 16% to 72%. Google aims to make it easier to create Android applications from user prompts and has made the benchmark's methodology and tools available on GitHub.