AI models have faced criticism for generating inaccurate bug reports, placing an undue burden on open-source maintainers with fictitious issues. However, they also hold the promise of revolutionizing application security through automation. Researchers from Nanjing University in China and The University of Sydney in Australia have introduced an innovative AI vulnerability identification system that mimics the methods employed by human bug hunters in detecting flaws.
Advancements in AI Vulnerability Detection
Ziyue Wang from Nanjing and Liyi Zhou from Sydney have built upon their previous work, known as A1, which focused on developing exploits for cryptocurrency smart contracts. Their latest creation, A2, is an AI agent designed specifically for discovering and validating vulnerabilities in Android applications.
In a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities,” the authors assert that the A2 system achieves an impressive 78.3 percent coverage on the Ghera benchmark, significantly outperforming static analyzers like APKHunt, which only reaches 30.0 percent. In testing A2 on 169 production APKs, they identified 104 true-positive zero-day vulnerabilities, 57 of which were validated through automatically generated proof-of-concept (PoC) exploits. Notably, one of these vulnerabilities was a medium-severity flaw found in an Android app boasting over 10 million installs.
“We discovered an intent redirect issue,” Zhou explained in an email to The Register. “This is not a trivial bug, and it showcases A2’s capability to uncover real, impactful flaws in the wild.” He elaborated that an intent redirect occurs when an Android app sends a message—known as an intent—to request an action, but fails to verify its destination, potentially allowing a malicious app to redirect it to a component under its control. Zhou confidently states that A2 can tackle any class of vulnerabilities.
The strength of A2 lies in its ability to provide valuable signals rather than overwhelming noise, as it can validate its findings effectively. The authors note, “Existing Android vulnerability detection tools inundate teams with thousands of low-signal warnings yet reveal few true positives.” While numerous potential vulnerabilities exist within code, only a select few can be easily exploited, and the issue of false positives is exacerbated by unreliable AI coding tools that flag insignificant problems.
“A2’s breakthrough is that it mirrors how human security experts actually work,” Zhou remarked.
The agentic system integrates several commercial AI models, including OpenAI o3, Gemini 2.5 Pro, Gemini 2.5 Flash, and gpt-oss-120b, each serving a distinct role: the planner designs the attack, the task executor carries it out, and the task validator generates test oracles to verify the results. The earlier A1 system, by contrast, handled only planning and execution; its validation was limited to a fixed oracle that judged whether an attack would make money. “The key novelty in A2 is its validator,” said Zhou.
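The planner/executor/validator division can be pictured as a simple loop. This is an illustrative sketch only: the role wiring, `Task` structure, and toy checks below are assumptions for exposition, not the authors' implementation, which binds each role to a different commercial LLM.

```python
# Illustrative planner/executor/validator loop (not A2's actual code).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    execute: Callable[[], object]      # executor: performs the step
    oracle: Callable[[object], bool]   # validator: independently checks the result

def run_plan(tasks):
    """Run each planned task in order; stop as soon as an oracle rejects a result."""
    for task in tasks:
        result = task.execute()
        if not task.oracle(result):
            print(f"failed: {task.description}")
            return False
        print(f"validated: {task.description}")
    return True

# Toy plan: two trivially checkable steps standing in for real attack stages.
plan = [
    Task("extract key", lambda: "secret", lambda r: r == "secret"),
    Task("mint token", lambda: "secret" * 2, lambda r: r.startswith("secret")),
]
print(run_plan(plan))  # True only when every oracle passes
```

The point of the structure is that a finding is reported only if every stage's oracle passes, which is what separates validated findings from the low-signal warnings of conventional scanners.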
To illustrate A2’s functionality, Zhou described a scenario based on a task from the Ghera dataset involving an app’s password reset flow. The app stored an AES key as a plain string in strings.xml. By extracting this key, an attacker could forge tokens for any email. A2 breaks this process into three distinct tasks:
Task 1: Extract the hardcoded key
- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both checks pass, confirming the key’s existence.
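Task 1 amounts to parsing an Android resource file. A minimal sketch, assuming a hypothetical resource name `aes_key` (real apps and the Ghera benchmark vary):

```python
# Sketch of Task 1: extract a hardcoded key from res/values/strings.xml.
# The resource name "aes_key" and the sample XML are hypothetical.
import xml.etree.ElementTree as ET
from typing import Optional

STRINGS_XML = """<?xml version="1.0" encoding="utf-8"?>
<resources>
    <string name="app_name">DemoApp</string>
    <string name="aes_key">0123456789abcdef</string>
</resources>"""

def extract_key(xml_text: str, name: str = "aes_key") -> Optional[str]:
    root = ET.fromstring(xml_text)
    for s in root.findall("string"):
        if s.get("name") == name:
            return s.text
    return None

key = extract_key(STRINGS_XML)
# Validator-style checks: the resource exists and has a plausible AES key length.
assert key is not None and len(key) == 16  # 16 bytes -> AES-128
print(key)
```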
Task 2: Forge a password reset token
- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
The outputs match, confirming the token’s validity.
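The validator pattern in Task 2, recompute the artifact independently and compare, can be sketched as follows. Note one deliberate substitution: the app under test uses AES-ECB, but Python's standard library has no AES, so a keyed HMAC-SHA256 stands in for the keyed transform here. The recompute-and-compare logic, not the cipher, is the point.

```python
# Sketch of Task 2's validation: mint a token, then recompute it
# independently and compare. HMAC-SHA256 stands in for the app's
# AES-ECB step (stdlib Python has no AES); the key is hypothetical.
import base64
import hashlib
import hmac

KEY = b"0123456789abcdef"  # hypothetical hardcoded key recovered in Task 1

def mint_token(email: str) -> str:
    """Derive a reset token from an email using the hardcoded key."""
    mac = hmac.new(KEY, email.encode(), hashlib.sha256).digest()
    return base64.b64encode(mac).decode()

# Executor forges a token for the victim address...
forged = mint_token("example@example.com")
# ...and the validator recomputes it independently and compares.
recomputed = mint_token("example@example.com")
assert hmac.compare_digest(forged, recomputed)
print(forged)
```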
Task 3: Prove authentication bypass
- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, demonstrating that the forged token successfully bypasses authentication.
“In summary, Task 1 confirms the key exists; Task 2 verifies the key generates a valid token; and Task 3 proves the token bypasses authentication,” Zhou explained. “All three steps are concretely validated.”
Zhou believes that AI is already outpacing traditional tools. “In Android, our A2 system outperforms existing static analysis, and in smart contracts, A1 is close to state-of-the-art,” he stated. “While traditional tools remain useful, they are often slow and complex to develop. AI offers speed and accessibility—we simply call APIs, while AI companies invest billions in training. We are leveraging their advancements.”
The financial implications for those pursuing bug bounties are significant. “Detection-only costs range from [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content using
tag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].003-0.029 per APK (o3), [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].0004-0.001 per APK (gpt-oss-120b), to [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].002-0.014 per APK (Gemini variants),” the paper reveals. “When aggregating costs, they increase to [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
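Tasks 2 and 3 hinge on the same property: with the key in hand, the attacker and the app compute the same token. The sketch below mirrors that forge/recompute/decrypt flow; a toy XOR cipher stands in for AES-ECB so the example stays stdlib-only, and every name is illustrative rather than taken from the app:

```python
# Sketch of Tasks 2 and 3: forge a token, let the validator recompute
# it independently, and show the "app" accepts it. A repeating-key XOR
# is a stand-in for AES-ECB; do not use it as real crypto.
import base64
from itertools import cycle

KEY = b"0123456789abcdef"  # the hardcoded key recovered in Task 1


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher (self-inverse): stand-in for AES-ECB."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def forge_token(email: str) -> str:
    """Task 2 executor: encrypt the email, then base64-encode it."""
    return base64.b64encode(xor_cipher(email.encode(), KEY)).decode()


def app_decrypt(token: str) -> str:
    """Stand-in for the app side: decrypt the token to the bound email."""
    return xor_cipher(base64.b64decode(token), KEY).decode()


# Task 2 validator: recompute the token independently and compare.
token = forge_token("example@example.com")
assert token == forge_token("example@example.com")

# Task 3 oracle: the app accepts the forged token and shows the email.
assert app_decrypt(token) == "example@example.com"
print("forged token accepted for", app_decrypt(token))
```

The assertions play the validator's role here: each step is confirmed by an independent recomputation rather than taken on faith, which is the pattern Zhou credits for A2's low false-positive rate.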
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59 and $4.23 per vulnerability, with a median of $1.77. Using gemini-2.5-pro for everything, costs range from $4.81 to $26.85 per vulnerability, with a median of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the arbitrage opportunity looks promising for those who can file accurate reports: a medium-severity award might run from several hundred to several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].04-0.33 per APK for gpt-oss-120b, [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].06-0.66 per APK for gemini-2.5-flash, [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].26-0.61 per APK for gemini-2.5-pro, and [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®” temperature=”0.3″ top_p=”1.0″ best_of=”1″ presence_penalty=”0.1″ ].84-3.35 per APK for o3.” The full validation pipeline, utilizing a mixed set of large language models (LLMs), incurs costs ranging from [cyberseo_openai model=”gpt-4o-mini” prompt=”Rewrite a news story for a business publication, in a calm style with creativity and flair based on text below, making sure it reads like human-written text in a natural way. The article shall NOT include a title, introduction and conclusion. The article shall NOT start from a title. Response language English. Generate HTML-formatted content usingtag for a sub-heading. You can use only
,
- ,
- , and HTML tags if necessary. Text: AI models get slammed for producing sloppy bug reports and burdening open source maintainers with hallucinated issues, but they also have the potential to transform application security through automation.
Computer scientists affiliated with Nanjing University in China and The University of Sydney in Australia say that they’ve developed an AI vulnerability identification system that emulates the way human bug hunters ferret out flaws.
Ziyue Wang (Nanjing) and Liyi Zhou (Sydney) have expanded upon prior work dubbed A1, an AI agent that can develop exploits for cryptocurrency smart contracts, with A2, an AI agent capable of vulnerability discovery and validation in Android apps.They describe A2 in a preprint paper titled “Agentic Discovery and Validation of Android App Vulnerabilities.”
The authors claim that the A2 system achieves 78.3 percent coverage on the Ghera benchmark, surpassing static analyzers like APKHunt (30.0 percent). And they say that, when they used A2 on 169 production APKs, they found “104 true-positive zero-day vulnerabilities,” 57 of which were self-validated via automatically generated proof-of-concept (PoC) exploits.
One of these included a medium-severity flaw in an Android app with over 10 million installs.“We discovered an intent redirect issue,” said Liyi Zhou, a lecturer in computer science at The University of Sydney, in an email to The Register. “This is not a trivial bug, and it shows A2’s ability to uncover real, impactful flaws in the wild.”
An intent redirect, he explained, happens when an Android app sends an intent – a message used to request an action, like opening a screen or passing data – but fails to check carefully where it is going. The vulnerability allows a malicious app to change that intent to a component it controls.
Zhou contends there’s no class of vulnerabilities that A2 cannot handle.A2’s value as a source of signal rather than noise follows from its ability to validate its findings. As the authors observe, “Existing Android vulnerability detection tools overwhelm teams with thousands of low-signal warnings yet uncover few true positives.”
There are a lot of potential vulnerabilities in code, but few of them can be exploited easily. And the problem of false positives is compounded by error-prone AI coding tools that report inconsequential issues.
“A2’s breakthrough is that it mirrors how human security experts actually work,” said Zhou.
The agentic system consists of various commercial AI models – OpenAI o3 (o3 2025-04-16), Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and GPT oss (gpt-oss-120b) – deployed in three roles: the planner that designs the attack, the task executor that carries out the attack, and the task validator that generates test oracles – systems that make decisions – and verifies the results.The researchers’ A1 system only did planning and execution, said Zhou. Its validation is limited to a fixed oracle that decides whether the attack would make money or not.
“The key novelty in A2 is its validator,” said Zhou.
As an example, he describes this setup, based on a task from the Ghera dataset. An app has a password reset flow. It stores the AES key as a plain string in strings.xml. With that key, the app creates a token from the email. Knowing the key lets an attacker forge tokens for any email.
A2, Zhou explained, breaks this into three tasks:
Task 1: Extract the hardcoded key- Planner: set the task to find the key in res/values/strings.xml.
- Executor: read the file and extract the key.
- Validator: (i) Check that the file exists. (ii) Check the key value matches.
Both pass, so the key is confirmed.
Task 2: Forge a password reset token- Input a victim email, e.g., example@example.com.
- Encrypt it with AES-ECB using the key.
- Base64 encode the ciphertext to form the token.
- Validator recomputes the token independently and compares outputs.
They match, so the token is confirmed.
Task 3: Prove authentication bypass- Launch NewPasswordActivity with the forged token.
- App decrypts the token and displays the bound email.
- Validator: (i) Confirm the activity is NewPasswordActivity. (ii) Confirm the email appears on screen.
Both checks pass, proving the forged token bypasses authentication.
“In short: Task 1 shows the key exists; Task 2 shows the key mints a valid token; Task 3 shows the token bypasses authentication,” said Zhou. “All three steps are concretely validated.”
Zhou argues that AI is already outpacing traditional tools.
“In Android, our A2 system beats existing static analysis, and in smart contracts, A1 is close to state of the art,” he said. “Tools are still useful, but they are slow and hard to build. AI is fast and accessible — we just call APIs, while the AI companies pour billions into training. We are standing on their shoulders.”
The AI capex looks like a windfall for those pursuing bug bounties.
“Detection-only costs range from $0.003-0.029 per APK (o3), $0.0004-0.001 per APK (gpt-oss-120b), to $0.002-0.014 per APK (Gemini variants),” the paper says. “Aggregation increases costs to $0.04-0.33 per APK for gpt-oss-120b, $0.06-0.66 per APK for gemini-2.5-flash, $0.26-0.61 per APK for gemini-2.5-pro, and $0.84-3.35 per APK for o3.”
The full validation pipeline with a mixed set of LLMs costs between $0.59-4.23 per vulnerability, with a median cost of $1.77. When using gemini-2.5-pro for everything, costs range from $4.81-26.85 per vulnerability, with a median cost of $8.94.
Last year, University of Illinois Urbana-Champaign computer scientists showed that OpenAI’s GPT-4 can generate exploits from security advisories at a cost of about $8.80 per exploit.
To the extent that found flaws can be monetized through bug bounty programs, the AI arbitrage opportunity looks promising for those who can make accurate reports, given that a medium severity award might be several hundred or several thousand dollars.
But Zhou observes that bug bounty programs have limited scope. “A cat-and-mouse game is inevitable,” he said. “A2 can uncover serious flaws today, but bug bounty programs only cover a fraction of them. That gap creates a strong incentive for attackers to exploit these bugs directly. How this plays out depends on how quickly defenders move.
“The field is about to explode. The success of A1 and A2 means researchers and hackers alike will rush in. Expect a surge of activity — both in defensive research and in offensive exploitation.”
Asked what a system like A2 might mean for security research, Adam Boynton, senior security strategy manager at Jamf, told The Register, “AI is moving vulnerability discovery from endless scan alerts to proof-based validation. Security teams get fewer false positives, faster fixes, and focus on real risks.”
A2 source code and artifacts have been limited to those with institutional affiliation and a declared research purpose in an effort to balance open research with responsible disclosure. ®