Align to Misalign: Automatic LLM Jailbreak
with Meta-Optimized LLM Judges

Yonsei University, Microsoft Research

Abstract

Disclaimer: This paper contains potentially harmful or offensive content.
Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty into the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bilevel structure. In the inner loop, prompts are refined with fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.

Motivation

1 LLM-Based Optimization for Jailbreak

  • Prompt Generation is an Optimization Problem: Jailbreak prompts are improved through iterative search.
  • Judge Feedback Guides the Search: An attacker LLM refines jailbreak prompts based on feedback from a judge model.
[Figure: LLM-based jailbreak framework illustration]

2 Limitation of ASR as an Optimization Signal

  • Binary Signal (0/1) Is Insufficient: ASR only indicates success or failure.
  • Sparse Feedback Provides Little Guidance: It does not reveal how close a prompt is to succeeding, making optimization inefficient.
[Figure: Evaluation metric in LLM jailbreak]
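To make the sparsity argument concrete, here is a toy sketch (not the paper's code; all names are hypothetical) of why a binary ASR signal gives a prompt search nothing to climb, while a dense judge score ranks candidates usefully:

```python
# Toy illustration: binary vs. dense feedback in a prompt search.
# All functions and scores below are hypothetical stand-ins.

def binary_asr(score: float, threshold: float = 0.9) -> int:
    """Collapse a dense judge score into the 0/1 attack-success signal."""
    return 1 if score >= threshold else 0

def hill_climb(candidates, feedback):
    """Pick the best candidate under a given feedback signal."""
    return max(candidates, key=feedback)

# Suppose three refined prompts receive these (hypothetical) judge scores,
# none of which has fully succeeded yet:
scores = {"prompt_a": 0.2, "prompt_b": 0.7, "prompt_c": 0.4}

# Binary ASR feedback: every candidate maps to 0, so the search cannot
# distinguish them and receives no direction for the next refinement.
binary_signals = {p: binary_asr(s) for p, s in scores.items()}

# Dense feedback: prompt_b is visibly closest to succeeding and is
# selected for the next refinement round.
best_dense = hill_climb(scores, lambda p: scores[p])
```

Under binary feedback all three prompts are indistinguishable failures; under dense feedback the search immediately prefers the near-miss, which is the gap the scoring template is meant to fill.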

3 Effect of Scoring Template During Optimization

  • Highly Sensitive to Scoring Template: Optimization results vary significantly depending on the judge’s scoring template.
  • Dense Signals Help, but Design Still Matters: Richer templates improve ASR, yet performance remains highly sensitive to template design.
[Figure: Effect of optimization signals]

😈 AMIS: Align to MISalign

We introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates to produce stronger attacks through better evaluation signals.

(a) Inner loop: Query-level prompt optimization The attacker iteratively generates jailbreak prompts guided by a fixed scoring template.

(b) Outer loop: Dataset-level scoring template optimization. The scoring template is updated using ASR alignment score from inner-loop logs.
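The two loops can be sketched as follows. This is a minimal illustration under assumed attacker/target/judge interfaces (`refine`, `respond`, `score`, `is_jailbroken`, and `rewrite_template` are all hypothetical), and `alignment` is one simple proxy for an ASR alignment score, not the paper's exact definition:

```python
# Minimal sketch of the AMIS bilevel structure (not the authors' implementation).

def alignment(logs):
    """Proxy ASR alignment score: mean dense judge score on successful
    attacks minus mean score on failed ones. A well-calibrated template
    should score successes high and failures low."""
    succ = [s for s, ok in logs if ok]
    fail = [s for s, ok in logs if not ok]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(succ) - mean(fail)

def amis(queries, template, attacker, target, judge,
         outer_steps=3, inner_steps=5):
    for _ in range(outer_steps):
        logs = []
        # (a) Inner loop: query-level prompt optimization.
        # With the scoring template fixed, the attacker refines each
        # jailbreak prompt using the judge's dense score as feedback.
        for q in queries:
            prompt, best = q, float("-inf")
            for _ in range(inner_steps):
                candidate = attacker.refine(prompt, feedback=best)
                response = target.respond(candidate)
                score = judge.score(template, q, response)    # dense signal
                success = judge.is_jailbroken(q, response)    # binary ASR outcome
                logs.append((score, success))
                if score > best:
                    prompt, best = candidate, score
        # (b) Outer loop: dataset-level template optimization.
        # The template is rewritten to improve how well its dense scores
        # agree with the binary attack outcomes logged in the inner loop.
        template = attacker.rewrite_template(template, alignment(logs))
    return template
```

The key design point is the asymmetry of signals: the inner loop never sees raw ASR (only dense scores), while the outer loop uses ASR solely to recalibrate the template that produces those scores.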

Experiments

Table 1: Main Result on AdvBench. ASR and StR (StrongREJECT) scores with jailbreaking methods across five target LLMs. The best and second best scores are highlighted in bold and underline, respectively.

Target →         Llama-3.1-8B    GPT-4o-mini     GPT-4o          Haiku-3.5       Sonnet-4
Attacks ↓        ASR     StR     ASR     StR     ASR     StR     ASR     StR     ASR     StR
Vanilla          30.0    0.15    4.0     0.03    0.0     0.00    0.0     0.00    0.0     0.00
PAIR             90.0    0.30    82.0    0.21    84.0    0.13    46.0    0.14    28.0    0.04
TAP              98.0    0.35    90.0    0.33    74.0    0.13    46.0    0.13    22.0    0.07
PAP              76.0    0.42    48.0    0.22    44.0    0.26    6.0     0.04    6.0     0.02
SeqAR            90.0    0.82    38.0    0.10    0.0     0.00    14.0    0.00    8.0     0.01
AutoDAN-Turbo    84.0    0.61    54.0    0.31    38.0    0.16    42.0    0.05    38.0    0.04
AMIS (Ours)      100.0   0.84    98.0    0.87    100.0   0.87    88.0    0.42    100.0   0.70

Table 2: Main Result on JBB Behaviors. ASR and StR (StrongReject) scores with jailbreaking methods across five target LLMs. The best and second best scores are highlighted in bold and underline, respectively.

Target →         Llama-3.1-8B    GPT-4o-mini     GPT-4o          Haiku-3.5       Sonnet-4
Attacks ↓        ASR     StR     ASR     StR     ASR     StR     ASR     StR     ASR     StR
Vanilla          41.0    0.19    3.0     0.09    2.0     0.07    1.0     0.04    3.0     0.05
PAIR             91.0    0.32    83.0    0.24    77.0    0.20    61.0    0.13    29.0    0.08
TAP              91.0    0.39    80.0    0.24    72.0    0.17    53.0    0.21    37.0    0.07
PAP              97.0    0.22    84.0    0.23    69.0    0.23    67.0    0.16    20.0    0.09
SeqAR            89.0    0.74    0.0     0.00    0.0     0.00    9.0     0.12    16.0    0.15
AutoDAN-Turbo    85.0    0.61    60.0    0.38    45.0    0.28    33.0    0.12    31.0    0.15
AMIS (Ours)      100.0   0.95    100.0   0.85    97.0    0.85    78.0    0.48    88.0    0.67

AMIS achieves state-of-the-art jailbreak performance on both AdvBench and JBB Behaviors, reaching up to 100% ASR on multiple target LLMs while consistently outperforming six strong baselines. Across five open- and closed-source models, it significantly improves both ASR and StR, demonstrating strong robustness and transferability.

Analysis

[ Ablation studies ]


Ablation results show that every component of AMIS (e.g., prompt inheritance) contributes significantly to performance: removing any of them degrades both ASR and StR. We also find that highly safety-aligned attacker models perform worse, while the AuT scoring template achieves the best overall results.

[ Transferability ]


We evaluate whether optimized jailbreak prompts transfer across different LLMs. Prompts from strongly safety-aligned models (e.g., Claude) generalize better across models, while those from weaker ones transfer poorly, suggesting that stronger safety alignment improves cross-model robustness.


BibTeX

          @inproceedings{
            koo2025align,
            title={Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges},
            author={Koo, Hamin and Kim, Minseon and Kim, Jaehyung},
            booktitle={International Conference on Learning Representations (ICLR)},
            year={2026}
          }