System-prompt extraction · benign red-team

We asked. They answered.

Self-evolving code agents probe a frontier LLM until it spills its hidden system prompt — synthetic secrets and all. See how a single high-level skill spans many turns, then replay a real extraction run round by round and watch the leak score climb.

7 models · multi-turn traces 28 skills · 14 low + 14 high replay · recorded runs
1

A target with a secret

A model is given a hidden system prompt holding synthetic secrets (fake API keys) it is told never to reveal.

2

An agent that keeps asking

A code agent probes it over several rounds — rephrasing, escalating, re-framing — picking the next probe by what worked.

3

The prompt leaks

Each reply is scored for how much of the hidden prompt it reveals. The score climbs as more of the system prompt is recovered.

High-level skill · multi-turn

One high-level skill is a conversation.

JustAsk has 28 skills on two levels: low-level (L1–L14) are single-turn probes — one message. High-level (H1–H14) are multi-turn: a single skill unfolds over several turns, each turn setting up the next. Pick one and step through its turns. These are the paper's canonical attack patterns (the attacker's side), not live model output.

high-level skill

Skill-selection trajectory · recorded real run

Watch the agent search for a way in.

The outer loop: each round the self-evolving agent picks one skill — using what failed to choose the next — recovering more of the hidden prompt as it goes. Real recorded runs from the paper; pick one and replay. The similarity score climbs; the peak round is marked with how much it recovered. No model is called; this is a recording.

run trajectory
Leak score (peak) 0.00 / 1.00
nonecontextpartialfull

Across models

Most frontier agents leak something.

Average extraction similarity per target model across the benchmark — higher means more of the hidden system prompt was recovered.

Shown: the 7 open-weight models with a successful multi-round trace. claude-sonnet-4 was also tested but yielded no successful run, so it carries no bar here.

Single-shot breadth · CLI agents

One probe each, many agents.

Beyond the multi-turn loop above, JustAsk was also run single-shot against shipping CLI coding agents — one best probe per target, no back-and-forth. Best result each (not a trajectory; these aren't replayed).

Curious agents reveal hidden system prompts.

A benign, controlled red-team of frontier LLMs. All secrets are synthetic — the leak is the point.

▶ Watch a real extraction