System-prompt extraction · benign red-team
Self-evolving code agents probe a frontier LLM until it spills its hidden system prompt — synthetic secrets and all. See how a single high-level skill spans many turns, then replay a real extraction run round by round and watch the leak score climb.
A model is given a hidden system prompt holding synthetic secrets (fake API keys) it is told never to reveal.
A code agent probes it over several rounds — rephrasing, escalating, re-framing — picking the next probe by what worked.
Each reply is scored for how much of the hidden prompt it reveals. The score climbs as more of the system prompt is recovered.
High-level skill · multi-turn
JustAsk has 28 skills on two levels: low-level (L1–L14) are single-turn probes — one message. High-level (H1–H14) are multi-turn: a single skill unfolds over several turns, each turn setting up the next. Pick one and step through its turns. These are the paper's canonical attack patterns (the attacker's side), not live model output.
Skill-selection trajectory · recorded real run
The outer loop: each round the self-evolving agent picks one skill — using what failed to choose the next — recovering more of the hidden prompt as it goes. Real recorded runs from the paper; pick one and replay. The similarity score climbs; the peak round is marked with how much it recovered. No model is called; this is a recording.
Live fire
Live-fire is disabled on this public demo — the page is replay-only pending ethics clearance. (When enabled, this fires a single probe at the configured model; only planted synthetic secrets are ever at risk.)
Across models
Average extraction similarity per target model across the benchmark — higher means more of the hidden system prompt was recovered.
Shown: the 7 open-weight models with a successful multi-round trace. claude-sonnet-4 was also tested but yielded no successful run, so it carries no bar here.
Single-shot breadth · CLI agents
Beyond the multi-turn loop above, JustAsk was also run single-shot against shipping CLI coding agents — one best probe per target, no back-and-forth. Best result each (not a trajectory; these aren't replayed).
Succeeded extractions · real traces
Each card is a real multi-round trace from the paper. Open one to replay the full trajectory.
A benign, controlled red-team of frontier LLMs. All secrets are synthetic — the leak is the point.
▶ Watch a real extraction