AI Pentesting Is Not Just Jailbreaking the Model

Most AI security conversations start with jailbreaking and data leakage. For SaaS teams, the bigger risk is the application around the model; what it can access, retrieve, trigger, and expose across every layer of the stack.

By Sherif Koussa

・

10 min read

Share on your Social Media

Table of contents

Text Link

Get security insights straight
to your inbox

For SaaS teams, the real risk is what the model can access, retrieve, trigger, and expose.

Most conversations about AI security start in one of two places. The first is model behavior. Can the model be jailbroken? Can it be manipulated? Can a user trick it into ignoring instructions or behaving in a way the product team did not intend?The second is information leakage. Are employees pasting sensitive data into AI tools? Could customer data, internal documents, source code, or business logic be exposed through an AI feature?

Those are real concerns. But they are not the whole picture. For SaaS teams, the bigger risk is often the application around the model. The AI features your team is building are not just chat boxes. They process customer data. They retrieve internal documents. They trigger workflows. They generate code. They connect to APIs. They summarize support tickets, query databases, update records, and make decisions that affect real users. That means the security problem is not simply whether the model is “well-behaved.”

The real question is: what can the model reach, what can it reveal, what can it trigger, and what happens when an attacker influences any part of that chain?

AI is more than the model. AI is a full stack.

And every layer of that stack creates a different security problem.

The Stack

Here's what a moderately complex AI-powered product looks like end-to-end:

Each layer in the diagram above introduces different security concerns; each arrow crosses a trust boundary and introduces a different set of security concerns. Let’s explore each layer

Layer 1: The LLM API

This is the layer most teams think they are testing when they say “AI security.” The model receives instructions, user input, and application context, then returns a response. Everything else in the stack either prepares that prompt or handles what comes back. The obvious risk is prompt injection. But the less obvious risk is what happens after the model responds. For example, a user may try to override the system prompt, extract hidden instructions, bypass guardrails, or force the model to reveal information it was not supposed to expose.

Article Image " AI-Powered Bing Chat spills it's secrets"

In 2023, shortly after Microsoft launched the new AI-powered Bing Chat, users were able to use prompt injection to get the system to reveal internal instructions, including the internal codename “Sydney.” Ars Technica described it as Bing Chat “spilling its secrets” through prompt injection, and The Verge later reported on Microsoft’s secret rules and the “Sydney” codename. But prompt injection is only one part of the problem.

It is not just the input that can cause problems; the output can, too.

Model output may be rendered in a web application, sent to Slack or Teams, included in an email, exported to a PDF, stored in a database, or passed into another API. Each destination needs to be tested separately. Unsafe Markdown, unsanitized HTML, generated links, or model-produced commands can create real application security issues.

What we test for at Software Secured:

Whether user-supplied input can override system or developer instructions.
Whether internal prompts, hidden business logic, API errors, or sensitive context can be exposed through model responses.
Whether model output is safely handled when rendered in the UI, sent to Slack or Teams, exported to PDF, stored, or passed into downstream functions.
Whether model output can create XSS, injection, unsafe Markdown, unsafe links, or unsafe API calls.
Whether an attacker can inflate token usage or trigger excessive model calls, creating cost-based denial of service.

Layer 2: Retrieval-Augmented Generation (RAG)

A RAG system retrieves relevant documents from your internal data sources, wikis, ticketing systems, Confluence, Sharepoint, CRM records, support history, and injects that content into the model's context before generating a response. This makes the model more useful. It also makes it way more dangerous if the retrieval layer isn't locked down. A typical SaaS usually performs regular CRUD operations; Create, Read, Update, Delete operations, most of the time through a database. Overtime, we learned how to do that securely (mostly).

RAG, is breaking that. RAG retrieves documents, executes commands and performs other functions and accesses several systems; databases, file systems, and other applications.

A real-world example is EchoLeak, a Microsoft 365 Copilot vulnerability disclosed in 2025. Researchers showed how malicious content delivered through email could manipulate Copilot’s context and lead to sensitive data disclosure without user interaction. The email itself became the attack vector.

EchoLeak - Microsoft 365 Copilot (June 2025, CVE-2025-32711)

Graphic depicting EchoLeak (Microsoft Copilot Vulnerability) — Source: https://www.trendmicro.com/en_us/research/25/g/preventing-zero-click-ai-threats-insights-from-echoleak.html

What we test for at Software Secured:

Can malicious instructions hidden inside documents, PDFs, emails, support tickets, or knowledge base articles change how the AI behaves once that content is retrieved?
Can a user from one customer account accidentally or intentionally retrieve information that belongs to another customer?
Can an attacker influence the retrieval system so their content gets treated as the most relevant answer, even when better or safer sources exist?
Does the AI reveal information it should not, such as document names, author details, internal IDs, file paths, or source references?
Does the system have enough quality control over retrieved content, or can weak, outdated, or malicious content make its way into the model’s answer?
When a document is deleted, restricted, or changed at the source, does the retrieval index update properly, or can the AI still answer using old content?

Layer 3: Model Context Protocol (MCP)

MCP is a protocol that enables AI systems to connect to external tools, data sources, and APIs via a standardized server interface. Instead of hardcoding tool integrations, an AI host discovers and invokes capabilities exposed by MCP servers, which can be local, remote, internal, or third-party.

If RAG expands the model's knowledge, MCP expands its capabilities. And it does so in ways that introduce risks most teams haven't mapped. MCP dramatically expands what an AI system can do, and therefore what it can be made to do. MCP servers expose tools. Those tools may read files, update records, query databases, create tickets, send messages, call internal APIs, or run local commands.

The question is no longer only “what can the model say?” The question becomes: what can the model cause the system to do? For example, an MCP server can expose a tool to update/delete a user’s data. If that functionality is not properly guarded, an attacker can use it to alter the integrity of the data.

Example: Anthropic Git MCP Server flaws

Security researchers at Cyata found three vulnerabilities in Anthropic’s official mcp-server-git, including unrestricted git_init, path validation bypass, and argument injection in Git operations. Reports said the flaws could be chained with the Filesystem MCP server to enable file access, file tampering, and, in some scenarios, remote code execution via prompt-injection-driven tool use.

The following thread has an actual debate. Several commenters push back on whether the issue is really “MCP is broken” or whether the real problem is unsafe implementations that execute MCP server commands without sandboxing. One useful comment notes that the real exploit surface includes not only arbitrary command execution but also the possibility that agents may treat MCP tool responses as trusted instructions by default.

What we test at Software Secured

Whether MCP servers expose dangerous tools such as file write, shell execution, database access, or privileged API calls.
Whether tool execution is tied to the user’s permissions or only to the MCP server’s credentials.
Whether prompt injection can cause the assistant to invoke destructive or high-privilege tools.
Whether tool descriptions or tool results can manipulate the model into unsafe follow-up actions.
Whether MCP servers validate file paths, command arguments, API parameters, and user-controlled inputs.
Whether rate limits, scopes, approvals, and audit logs exist for tool execution.

Layer 4: AI Agents

Agents extend beyond a single model call; AI agents combine planning, iterative reasoning, tool use, and memory. An agent typically pursues a goal across multiple steps, calling tools, evaluating results, adjusting its plan, calling more tools, often with limited or no human review between steps. This is where the risk model changes most significantly, because failures compound. A single-turn LLM call that goes wrong produces one bad output. An agent that goes wrong can take 12 sequential actions before the consequences become visible.

In November 2025, AppOmni published research showing how ServiceNow Now Assist’s agent-to-agent discovery could be abused through second-order prompt injection. A low-privileged user could place malicious instructions inside a record, such as a ticket description. Later, when a higher-privileged user or agent processed that record, the instruction could redirect the agent workflow and cause other agents to perform unintended actions. AppOmni said this could include CRUD actions on records and sending external emails containing record contents, depending on configuration.

Graphic of second order prompt injection attack — The ServiceNow example is a useful illustration of why agent risk compounds: attacker-controlled content does not just affect one response. It can redirect a workflow.

What Software Secured tests for:

Can something the agent reads, a ticket, document, email, or customer message, quietly change what the agent is trying to do?
Can content created by a low-privileged user influence actions taken by a higher-privileged agent or user?
Does the system check permissions every time the agent calls a tool, or does it assume the first authorization check is enough?
Can one agent hand work off to another agent that has more access or more powerful tools?
Are sensitive actions, such as deleting records, sending external emails, changing permissions, or exporting data, reviewed by a human before they occur?
If something goes wrong, can your team reconstruct what the agent read, what it decided, what tools it called, and why?

Layer 5: AI-Assisted Dev Tooling and Platform-Generated Code

This involves developer tooling; tools like Copilot, Cursor, Claude Code, and similar assistants that suggest and generate code for your engineering team. The security problem here is primarily about SDLC controls: whether the same review gates that apply to human-written code apply equally to AI-generated code. For developer tooling, the core risk isn't that AI generates insecure code. Human developers also generate insecure code at a comparable rate.

The core risk is that AI-assisted development increases code velocity faster than security understanding can keep up. The issue is not simply that AI writes “bad code”; it is that it produces large amounts of plausible-looking code with unknown security properties, especially around authorization, tenant isolation, secrets, payment logic, and business rules.

As our post puts it, AI can expand the attack surface, create inconsistent behavior across similar services, and give teams false confidence because the code compiles, tests pass, and scanners may miss deeper architectural flaws. The issue is not simply that AI writes insecure code. Human developers do that too. The bigger issue is speed without comprehension. AI-assisted development can generate large amounts of plausible-looking code before the team fully understands the security properties of what was built. That matters most around authorization, tenant isolation, secrets, payment logic, admin functionality, and business rules.

In May 2026, WIRED reported that researchers at RedAccess found more than 5,000 publicly accessible vibe-coded web apps built with tools like Lovable, Replit, Base44, and Netlify that had little or no authentication. Around 40% reportedly exposed sensitive data, including medical information, financial records, corporate documents, customer chat logs, and internal business data. Axios also reported that RedAccess found 380,000 publicly accessible assets, of which about 5,000 contained sensitive corporate data.

What Software Secured tests for:

Are AI-generated routes, pages, APIs, or backend functions protected by real authentication, or did the code only “look” finished?
Can unauthenticated users access private pages, files, API endpoints, dashboards, or generated admin functions?
Does the application enforce tenant isolation on the backend, or is the AI-generated code only hiding data in the frontend?
Are database queries properly scoped to the current user, account, or tenant, or can a user pull records belonging to someone else?
Did the AI generate authorization logic that looks reasonable but fails in edge cases, during role changes, or with direct API access?
Are secrets, API keys, tokens, environment variables, or internal configuration values exposed in generated code, logs, public assets, or client-side bundles?
Are generated payment flows, billing logic, admin actions, or permission checks reviewed by someone who understands the business rules?
Can user-controlled input flow into generated SQL, API calls, scripts, workflow logic, or infrastructure configuration without enough validation?
Did AI-assisted code bypass the normal review process because it “worked,” compiled, or passed basic tests?

The Common Thread

Every risk described above reduces to a small number of recurring failures:

Untrusted input reaches a privileged context. Whether that's a user prompt reaching the system instruction layer, a planted document reaching the model context, or a user-level request reaching admin-scoped tool credentials — the pattern is the same. Something crossed a boundary it shouldn't have.

Output is trusted too early. Model output is used to make decisions, trigger actions, or render in interfaces without adequate validation. Generated code executed without review. Retrieved content was incorporated into responses without authorization verification.

Authorization is checked in the wrong place. UI-layer access controls that don't cover the retrieval layer. Session-level authorization checks that don't apply at tool invocation. User-level permission assumptions that don't account for service-credential scope in tool execution.

The complexity of the stack, the scale at which fewer humans are producing far more code, and the hyperconnectivity of all the pieces make these problems harder to test and reproduce. It doesn't make them harder to understand. In most cases, the underlying risks are: broken access control, unsafe output handling, parameter injection, and over-trusted third-party components.

Test the Full Stack, Not Just the Model

Every risk described in this post (untrusted input reaching privileged contexts, output trusted too early, authorization checked in the wrong place) requires testers who understand how AI systems actually work end-to-end, not just how to jailbreak a chatbot.

Software Secured tests AI-powered SaaS products across every layer: LLM APIs, RAG pipelines, MCP servers, multi-agent workflows, and AI-generated code. Zero false positives. Findings mapped to your specific architecture with actionable remediation guidance your engineering team can act on.

Ready to get in touch? Get started by booking a consultation now.

Book Consultation

About the author

Sherif Koussa

CEO

Sherif Koussa is a cybersecurity expert and entrepreneur with a rich software building and breaking background. In 2006, he founded the OWASP Ottawa Chapter, contributed to WebGoat and OWASP Cheat Sheets, and helped launch SANS/GIAC exams. Today, as CEO of Software Secured, he helps hundreds of SaaS companies continuously ship secure code.

Get security insights straight to your inbox

Continue your reading with these value-packed posts

DevSecOps & Shift‑left Security

Attack Chains: The Hidden Weakness in Modern API & Web Application Security

AI Pentesting Is Not Just Jailbreaking the Model

Get security insights straight to your inbox

For SaaS teams, the real risk is what the model can access, retrieve, trigger, and expose.

The Stack

Layer 1: The LLM API

What we test for at Software Secured:

Layer 2: Retrieval-Augmented Generation (RAG)

What we test for at Software Secured:

Layer 3: Model Context Protocol (MCP)

Example: Anthropic Git MCP Server flaws

What we test at Software Secured

Layer 4: AI Agents

What Software Secured tests for:

Layer 5: AI-Assisted Dev Tooling and Platform-Generated Code

What Software Secured tests for:

The Common Thread

Test the Full Stack, Not Just the Model

Ready to get in touch? Get started by booking a consultation now.

About the author

Sherif Koussa

CEO

Get security insights straight to your inbox

Continue your reading with these value-packed posts

Risk of Security and Monitoring Logging Failures

Improving Communication Between Your Security and Dev Teams so Everybody Wins

Avoiding Security Theatre: When is a "Critical" Really a Critical?

Get security insights straight
to your inbox