Google ranks the best AI for building Android apps, and the winner isn’t Gemini

Google wants software developers to use the most effective AI models for Android application development. In March, the company launched Android Bench, a benchmarking portal designed as a dynamic leaderboard that gives developers and model creators a reliable reference point for comparing model performance.

The leaderboard was updated last week to add open-weight models and new metrics for latency, token usage, and cost.

“By establishing a clear, reliable baseline for what high-quality Android development looks like, we’re helping model creators identify gaps and accelerate improvements — which empowers developers to work more efficiently.”
—Matthew McCullough, Google.

Model students

Matthew McCullough, Google’s VP of Product for the Android Developer division, wrote in a March blog post that the company is benchmarking leading large language models (LLMs) against tests designed specifically to evaluate how well they build Android applications.

“Our goal is to provide model creators with a benchmark to evaluate LLM capabilities for Android development,” McCullough elaborated. “By establishing a clear, reliable baseline for what high-quality Android development looks like, we’re helping model creators identify gaps and accelerate improvements — which empowers developers to work more efficiently with a wider range of helpful models to choose for AI assistance — ultimately leading to higher-quality apps across the Android ecosystem.”

GPT 5.5 is currently the best AI model for Android

Android Bench does not keep a historical record of model rankings, but coverage from 9to5Google indicates that the leaderboard previously ranked Gemini 3.1 Pro and OpenAI’s GPT 5.4 as joint leaders. As of May 18, GPT 5.5 is the top-ranked AI model for Android app development.

Google has made its methodology for Android Bench publicly accessible, stating, “The service evaluates the ability of LLMs to generate code that resolves issues by presenting them with real-world challenges and pull requests from open-source software projects. This approach ensures that the tasks are representative of the challenges developers face daily.”
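
The idea of scoring a model by how many real issues its generated code resolves can be sketched in a few lines. This is a minimal illustration, not Android Bench's harness: the `Task` structure, the `passes_checks` callback, and the toy model are all hypothetical, and the article does not say how Google verifies that a task has been resolved.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Task:
    """One benchmark task (hypothetical structure, not Android Bench's schema)."""
    task_id: str
    issue_text: str                       # description of the problem to fix
    passes_checks: Callable[[str], bool]  # assumed verifier: True if the patch resolves the issue


def resolve_rate(tasks: Sequence[Task], generate_patch: Callable[[str], str]) -> float:
    """Return the percentage of tasks whose generated patch passes its checks."""
    if not tasks:
        raise ValueError("at least one task is required")
    resolved = sum(
        1 for task in tasks if task.passes_checks(generate_patch(task.issue_text))
    )
    return 100.0 * resolved / len(tasks)


def toy_model(issue_text: str) -> str:
    """Stand-in for an LLM call; it only echoes the issue text in lowercase."""
    return "fix: " + issue_text.lower()


# Toy tasks with string checks standing in for real test suites.
TOY_TASKS = [
    Task("t1", "Null check", lambda patch: "null check" in patch),
    Task("t2", "Compose migration", lambda patch: "compose" in patch),
    Task("t3", "Wear networking", lambda patch: "retry" in patch),
]

if __name__ == "__main__":
    print(f"resolved: {resolve_rate(TOY_TASKS, toy_model):.1f}%")
```

In a real harness, the verifier would more plausibly apply the patch to a checkout of the repository and run its test suite, but that detail is an assumption here.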

Why did Google build Android Bench?

Google says it built Android Bench because, although AI-assisted software engineering has produced a range of benchmarks for measuring LLM capabilities, Android developers face specific challenges that those benchmarks do not address. That gap prompted the company to create a ranking service focused on assessing high-quality Android development.

“We created a model-agnostic benchmark to accurately evaluate LLM performance on a variety of Android development tasks,” Google stated. The goals of Android Bench include encouraging improvements in LLMs for Android development, empowering developers to enhance productivity with a range of AI assistance models, and ultimately fostering higher-quality applications within the Android ecosystem.

Do software development benchmarks work?

Developers and model creators may question whether such a benchmark works. Critics may cite Goodhart’s Law, which holds that once a measure becomes a target, it ceases to be a good measure. Google appears to have tried to limit this risk by basing Android Bench on real-world public code repositories.

McCullough explained, “We created the benchmark by curating a task set against a range of common Android development areas. It consists of real challenges of varying difficulty, sourced from public GitHub Android repositories.”

The scenarios tested encompass resolving “breaking changes” across Android releases, addressing domain-specific tasks such as networking for wearable devices, and migrating to the latest version of Jetpack Compose, among others.

What other Android benchmarks exist?

In addition to Android Bench, other benchmarking tools in the Android ecosystem include:

  • Jetpack Microbenchmark: A library for benchmarking Android native code, whether written in Kotlin or Java, directly within Android Studio.
  • Jetpack Macrobenchmark: This tool tests large-scale user interactions, such as cold app startup time and the fluidity of user interface animations.
  • Firebase Performance Monitoring: A service that monitors an app’s network requests and screen rendering times in production.
  • Android Vitals: A dashboard that tracks app quality metrics, including stability, performance, battery usage, and permission issues.
  • Apptim: A generative AI mobile app profiling and testing tool.
  • Android Performance Analyzer (APA): A profiling and performance analysis tool that simplifies workflow, launched on May 19.

“Open benchmarks like Android Bench are great, and we wish there were more of them. The caveat is data contamination. Public repositories leak into training, and we have seen models that cluster within a few points on public evals spread dramatically on private benchmarks built to mimic the same workload.” – Andrew Filev, CEO, Zencoder.

Andrew Filev, CEO of Zencoder, expressed his support for these benchmarking systems, albeit with some reservations. “Open benchmarks like Android Bench are great, and we wish there were more of them,” he remarked. “In general terms, software development is too diverse for a single headline score to be universally meaningful. A Python benchmark tells you little about how a model handles Rust, embedded systems, or a mobile app.”

Filev emphasized that domain-specific benchmarks encourage model developers to focus on the environments their users operate in, commending Google for its initiative and hoping other platforms will follow suit. He also cautioned about data contamination: public repositories can leak into training data, so models that cluster within a few points of each other on public evaluations can spread widely on private benchmarks built to mimic the same workload.

How Android Bench scores are built

The overall benchmark score for each model on Android Bench is derived from a Google-developed calculation that incorporates four core values:

  1. Confidence Interval (CI) Range (%): A measure of expected performance reliability.
  2. Average Latency Score: The time taken to solve 100 tasks, averaged across 10 runs.
  3. Average Total Tokens Score: The number of tokens consumed in a full benchmark run, averaged across 10 runs.
  4. Average Cost: The cost per benchmark run at the time of testing, expressed in US dollars.
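
To make the four values concrete, here is a minimal sketch of how per-run results could be summarized. It is not Google's calculation, which the article does not describe: the `RunResult` fields, the 95% confidence level, and the normal approximation (z = 1.96) are all assumptions chosen for illustration.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RunResult:
    """Outcome of one full benchmark run (hypothetical fields)."""
    tasks_passed: int
    tasks_total: int
    latency_seconds: float
    total_tokens: int
    cost_usd: float


def summarize(runs: Sequence[RunResult]) -> Dict[str, float]:
    """Average the runs and attach a confidence interval to the pass rate."""
    if len(runs) < 2:
        raise ValueError("at least two runs are needed to estimate a spread")
    pass_rates = [100.0 * r.tasks_passed / r.tasks_total for r in runs]
    mean_rate = statistics.mean(pass_rates)
    # Assumption: 95% interval via the normal approximation. With only 10 runs,
    # a Student's t critical value would give a slightly wider interval.
    half_width = 1.96 * statistics.stdev(pass_rates) / math.sqrt(len(runs))
    return {
        "mean_pass_rate": mean_rate,
        "ci_low": mean_rate - half_width,
        "ci_high": mean_rate + half_width,
        "avg_latency_seconds": statistics.mean([r.latency_seconds for r in runs]),
        "avg_total_tokens": statistics.mean([r.total_tokens for r in runs]),
        "avg_cost_usd": statistics.mean([r.cost_usd for r in runs]),
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only; not real leaderboard data.
    sample_runs = [
        RunResult(60, 100, 100.0, 1000, 1.0),
        RunResult(70, 100, 200.0, 3000, 3.0),
    ]
    for name, value in summarize(sample_runs).items():
        print(f"{name}: {value:.2f}")
```

A narrow interval suggests the model's pass rate is stable from run to run, while a wide one means a single run says little about typical performance.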

The test harness for Android Bench is publicly available on GitHub, so developers and model creators can inspect how the benchmark is run.

AppWizard