language models

AppWizard
June 13, 2026
Google has released benchmark results for evaluating AI models on Android coding tasks, revealing that Gemini 3.5 Flash is the most resource-intensive model tested yet ranks only sixth overall. The benchmarks indicate that Gemini 3.5 Flash has higher latency and a 9% performance gap compared to its predecessor, Gemini 3.1 Pro Preview, despite being marketed as a faster alternative. In terms of cost, Gemini 3.5 Flash averages 355.9 tokens per benchmark run at approximately 7.1, while Gemini 3.1 Pro Preview uses only 73.3 tokens at about a third of that cost. The top-ranked models are GPT 5.5, GPT 5.4, and Gemini 3.1 Pro Preview, with Claude Opus 4.7 in fourth place. The rankings feature both open-weight and closed-weight models, and the list is unchanged since the last release except for the removal of GPT 5.3 Codex.
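As a rough check on the token figures quoted above, the following Python sketch computes how many times more tokens Gemini 3.5 Flash used per run. The variable names are illustrative, and reading "about a third of that cost" as exactly 1/3 is an assumption, so the result is a back-of-the-envelope estimate rather than a figure published by Google.

```python
# Average total tokens per benchmark run, as quoted in the summary above.
flash_tokens = 355.9   # Gemini 3.5 Flash
pro_tokens = 73.3      # Gemini 3.1 Pro Preview

# How many times more tokens Flash consumed per run.
token_ratio = flash_tokens / pro_tokens

# Assumption: "about a third of that cost" is read as exactly 1/3,
# i.e. a Flash run costs roughly 3x a Pro Preview run.
cost_ratio = 3.0

# Implied relative price per token (Flash vs. Pro Preview) under that assumption.
relative_price_per_token = cost_ratio / token_ratio

print(f"token ratio: {token_ratio:.2f}x")
print(f"implied per-token price ratio: {relative_price_per_token:.2f}")
```

Under that reading, Flash uses close to five times the tokens for about three times the cost, which would suggest the higher bill comes from token volume rather than a higher per-token price.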
Winsage
June 11, 2026
Microsoft is testing a new feature that allows developers to implement local language models on non-Copilot+ PCs running Windows 11. The Language Model APIs can now operate on any Windows 11 device with a compatible Nvidia GPU, specifically targeting GeForce RTX 30 series and newer models with at least 6 GB of video RAM. This initiative aims to democratize access to AI capabilities across a broader range of Windows 11 PCs, although not all PCs will gain access to exclusive Copilot+ AI functionalities.
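The hardware criteria described above (GeForce RTX 30 series or newer, at least 6 GB of video RAM) can be sketched as a simple check. This is only an illustration in Python: the function names and the name-parsing regex are hypothetical, and it does not represent how Windows or the Language Model APIs actually detect eligible hardware.

```python
import re
from typing import Optional

MIN_SERIES = 30    # GeForce RTX 30 series, per the summary above
MIN_VRAM_GB = 6    # minimum video RAM in GB, per the summary above

def rtx_series(gpu_name: str) -> Optional[int]:
    """Extract the series (30, 40, ...) from a name like 'NVIDIA GeForce RTX 3060'.

    Returns None when the name does not look like a GeForce RTX model.
    """
    match = re.search(r"RTX\s*(\d{2})\d{2}", gpu_name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def meets_stated_requirements(gpu_name: str, vram_gb: float) -> bool:
    """Check a GPU against the two criteria quoted in the summary."""
    series = rtx_series(gpu_name)
    return series is not None and series >= MIN_SERIES and vram_gb >= MIN_VRAM_GB

print(meets_stated_requirements("NVIDIA GeForce RTX 3060", 12))
print(meets_stated_requirements("NVIDIA GeForce RTX 2080", 8))
```

An RTX 20 series card fails on the series check regardless of memory, while an RTX 30 series card with less than 6 GB fails on the memory check.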
AppWizard
May 26, 2026
Google launched the Android Bench benchmarking portal in March to help software developers evaluate AI models for Android app development. The leaderboard was updated last week to include open-weight models and new metrics for latency, tokens, and cost. Matthew McCullough, Google's VP of Product for Android Development, stated that the goal is to provide a benchmark for evaluating large language models (LLMs) in Android development. As of May 18, GPT 5.5 is the top AI model for Android app development, with Gemini 3.1 Pro and GPT 5.4 tied behind it. Android Bench evaluates LLMs based on real-world challenges and tasks sourced from public GitHub repositories. Other benchmarking tools in the Android ecosystem include Jetpack Microbenchmark, Jetpack Macrobenchmark, Firebase Performance Monitoring, Android Vitals, Apptim, and Android Performance Analyzer. The overall benchmark score on Android Bench is calculated using four core values: Confidence Interval Range, Average Latency Score, Average Total Tokens Score, and Average Cost. The test harness for Android Bench is publicly available on GitHub.
Winsage
May 26, 2026
Microsoft has integrated its AI assistant, Copilot, into various products, including Bing and Windows 11, since early 2023. However, user dissatisfaction has led the company to shift its focus back to addressing core issues with Windows 11. Despite an aggressive rollout of Copilot across multiple platforms, it struggled to compete with specialized AI tools as users preferred solutions that could autonomously complete tasks. This resulted in backlash from users, earning Microsoft the nickname "Microslop." In response, Microsoft has initiated the "Windows K2" project to reallocate resources from Copilot to improve Windows 11, scaling back AI implementations and allowing users to customize their experience.
AppWizard
May 21, 2026
Google has updated its "Android Bench" rankings, adding new AI models for Android app development, including open-weight models. The latest rankings, as of May 18, 2026, show GPT 5.5 at the top, surpassing GPT 5.4 and Gemini 3.1 Pro by nearly 2%. The update provides metrics such as average latency, total tokens used, and average cost per benchmark run. GPT 5.5 has a score of 74, with an average latency of 15.5, total tokens of 64.5, and an average cost of .9. In comparison, GPT 5.4 has a score of 72.4, with an average latency of 21.2, total tokens of 64.2, and an average cost of .7. Gemini 3.1 Pro has the same score as GPT 5.4 but with different latency and token metrics. The rankings also include other models like Claude Opus 4.7, GPT 5.3 Codex, and GLM 5.1, which has emerged as the highest scorer among the newly added open-weight models. Google plans to update the rankings monthly.
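To make the size of GPT 5.5's lead concrete, the scores quoted in this entry can be compared directly. The Python sketch below uses only those three scores; the helper name `lead_over` is illustrative. It suggests that "nearly 2%" most plausibly refers to the gap in score points, though that reading is an assumption, since the entry does not say how the percentage was computed.

```python
# Scores quoted in this entry (Android Bench, May 18, 2026 update).
scores = {
    "GPT 5.5": 74.0,
    "GPT 5.4": 72.4,
    "Gemini 3.1 Pro": 72.4,
}

def lead_over(leader: str, other: str) -> tuple:
    """Return (gap in score points, gap as a percentage of the other model's score)."""
    gap = scores[leader] - scores[other]
    return round(gap, 2), round(100 * gap / scores[other], 2)

points, relative = lead_over("GPT 5.5", "GPT 5.4")
print(f"{points} points ahead, {relative}% relative to GPT 5.4's score")
```

The gap is 1.6 score points, or about 2.2% when measured relative to the runner-up score, so the two readings bracket the quoted figure.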