JustAsk

ICML 2026 · system-prompt extraction

We asked shipping AI agents for their
hidden system prompts. They answered.

JustAsk is a self-evolving code agent that works out how to extract a model's hidden system prompt on its own — no handcrafted attacks, no labelled data, no access beyond an ordinary chat. It recovered the real system prompts of shipping coding agents, and breached every one of the 41 commercial models we tested.

▶ See the recovered prompts Watch it self-evolve

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs · Zheng, Wu, Huang, Li, Ma, Li, Jiang, Wang · ICML 2026

A target with a secret

A model is given a hidden system prompt holding synthetic secrets (fake API keys) it is told never to reveal.

An agent that keeps asking

A code agent probes it over several rounds — rephrasing, escalating, re-framing — picking the next probe by what worked.

The prompt leaks

Each reply is scored for how much of the hidden prompt it reveals. The score climbs as more of the system prompt is recovered.

The headline result · real deployed agents

We recovered the system prompts of shipping AI coding agents.

These are real coding agents you can install today. Acting as an ordinary user over a normal chat, JustAsk recovered their hidden system prompts — verbatim for Gemini CLI, and identity, provenance and rule structure for the rest (even when they tried to refuse). Open any card to read what came back, scored against the agent's true prompt.

How it finds the way in · self-evolving

Nobody told it which attack to use. It learned.

JustAsk treats extraction as online exploration: each step it picks a skill by Upper-Confidence-Bound — balancing what has worked against what is untried — scores how much of the prompt leaked, and updates. No handcrafted attack; the strategy emerges. Watch the similarity climb across 20 attempts on a real CLI agent.

vs prior attacks · autonomous beats handcrafted

As strong as hand-built attacks — writing none by hand.

Prior system-prompt attacks (Raccoon, PLeak, Zhang et al.) ship libraries of human-crafted templates. JustAsk writes none — it discovers its own skills at run time. Per shipping agent, here is the best recovery each method achieved, scored against the true prompt.

High-level skill · multi-turn

One high-level skill is a conversation.

JustAsk has 28 skills on two levels: low-level (L1–L14) are single-turn probes — one message. High-level (H1–H14) are multi-turn: a single skill unfolds over several turns, each turn setting up the next. Pick one and step through its turns. These are the paper's canonical attack patterns (the attacker's side), not live model output.

high-level skill

Watch one crack · real CLI agent

Probe a shipping agent until its prompt spills.

Pick a real coding agent and replay JustAsk's skill probes against it, one at a time. Most are refused outright — then one skill breaks through and the agent prints its hidden system prompt. Recorded from the paper's runs; the score is similarity to the agent's true prompt. These are single-shot probes — the genuine multi-turn skills are shown attacker-side in High-level above. No model is called here.

run trajectory

Leak score (peak) 0.00 / 1.00

nonecontextpartialfull

Live fire

Or fire one yourself.

Live-fire is disabled on this public demo — the page is replay-only pending ethics clearance. (When enabled, this fires a single probe at the configured model; only planted synthetic secrets are ever at risk.)

extract.sh — configure

Target scenario

Extraction skill

Probe — editable (0/2000)

disabled · replay-only

Controlled benchmark · planted prompts

And it generalizes across open models.

Beyond the shipping agents above, a controlled study gives each open-weight model a planted system prompt with known ground truth, so recovery can be scored exactly. Average recovered similarity per model.

Controlled runs against a synthetic planted prompt (known ground truth) — separate from the real shipping-agent recoveries above. Models with no successful run carry no bar.

Single-shot · the harsher test

Even one probe, no conversation, leaks.

The recovered prompts at the top use JustAsk's full self-evolving loop. This is the stricter test: a single probe per skill, no back-and-forth. Best single-shot result per agent — many still give up their prompt on the first ask.

Succeeded extractions · real CLI agents

Open one and watch it leak.

Each card is a real shipping coding agent. Open one to replay JustAsk's probes against it — most refused, one breaking through to the recovered system prompt.

Curious agents reveal hidden system prompts.

System prompts are a critical yet largely unprotected attack surface in modern agent systems. JustAsk recovers them autonomously across all 41 commercial models we tested — including the shipping CLI agents above.

▶ See the recovered prompts

We asked shipping AI agents for theirhidden system prompts. They answered.

A target with a secret

An agent that keeps asking

The prompt leaks

We recovered the system prompts of shipping AI coding agents.

Nobody told it which attack to use. It learned.

As strong as hand-built attacks — writing none by hand.

One high-level skill is a conversation.

Probe a shipping agent until its prompt spills.

Or fire one yourself.

And it generalizes across open models.

Even one probe, no conversation, leaks.

Open one and watch it leak.

Curious agents reveal hidden system prompts.

We asked shipping AI agents for their
hidden system prompts. They answered.