Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? - Yin et al. (2025), ICLR 2026 Withdrawn Submission


📎 arXiv:2510.06036 (ICLR 2026 Withdrawn Submission)

Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, et al.

Relation to our paper: the most direct point of comparison on “temporal safety”. They analyze temporal dynamics at the reasoning-chain level; we analyze them at the token-generation level. Our safety-activation onset point and our sign reversal are the token-level counterparts of their “cliff” phenomenon.


I. Introduction


Summary of the paper's Introduction

Large Reasoning Models (LRMs), recent models with multi-step reasoning ability such as DeepSeek-R1, QwQ, and Phi-4-Reasoning, show strong problem-solving ability but also have serious safety vulnerabilities. Why these vulnerabilities arise, however, has not been well understood.

The paper's key finding is the “Refusal Cliff”: many poorly-aligned reasoning models correctly identify harmful prompts and maintain a strong refusal intention throughout the thinking process, yet their refusal score drops sharply at the last few tokens just before output generation.

In other words, these models are not “inherently unsafe”. They hold a refusal intention internally, but that intention is systematically suppressed.

Why this finding matters: safety alignment fails not “because the model does not recognize harm” but “because it recognizes harm and that recognition does not carry through to the output”.

→ This suggests that the direction of defense strategies needs to change fundamentally.


II. Proposed Method


Summary of the techniques and algorithms used or proposed

Tracking Refusal Intention with Linear Probing

Refusal Prober: a logistic regression model is trained to predict the probability of refusal from the model's hidden state $\mathbf{h}$:

$$P(\text{refusal} \mid \mathbf{h}) = \sigma(\mathbf{w}^\top \mathbf{h} + b)$$

where $\sigma$ is the logistic sigmoid and $\mathbf{w}$, $b$ are the probe's learned weights and bias.

  • Training data: AdvBench (refusal examples) + UltraChat (non-refusal examples)

  • The probe is trained on hidden states extracted at the last token position of the last layer

  • The trained prober is then applied at every token position to track the refusal intention trajectory across the entire reasoning chain
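The probing procedure above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the hidden states are synthetic 4-dimensional vectors standing in for AdvBench and UltraChat activations, and the names `train_probe`, `refusal_trajectory`, and `fake_state` are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, lr=0.5, epochs=300):
    """Fit logistic regression p = sigmoid(w.h + b) by batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, label in zip(X, y):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - label
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def refusal_trajectory(hidden_states, w, b):
    """Apply the trained probe at every token position of one chain."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden_states]

# Assumption: synthetic stand-ins for real hidden states.
rng = random.Random(0)

def fake_state(refusal):
    center = 1.0 if refusal else -1.0
    return [center + rng.gauss(0, 0.3) for _ in range(4)]

X = [fake_state(True) for _ in range(50)] + [fake_state(False) for _ in range(50)]
y = [1] * 50 + [0] * 50
w, b = train_probe(X, y)

# Toy chain: 8 refusal-like "thinking" positions, then 2 compliant positions.
chain = [fake_state(True) for _ in range(8)] + [fake_state(False) for _ in range(2)]
trajectory = refusal_trajectory(chain, w, b)
print([round(p, 2) for p in trajectory])
```

With real models, the rows of `X` would be last-layer, last-token activations, and `chain` would hold the activations at each position of a single reasoning trace.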

Quantifying the Refusal Cliff

Misalignment Score (MS): for each training example $x$, the difference between the maximum of the internal refusal intention (the plateau score $s_{\text{plateau}}(x)$) and the refusal score just before the actual output ($s_{\text{final}}(x)$):

$$\mathrm{MS}(x) = s_{\text{plateau}}(x) - s_{\text{final}}(x)$$

The higher $\mathrm{MS}(x)$ → the more the model internally “wanted” to refuse but was suppressed at the output.
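As a rough sketch of the score described above, assuming a probe trajectory is already available as a list of per-token refusal probabilities: the function name `misalignment_score` and the `n_final` parameter (how many trailing tokens count as "just before output") are illustrative assumptions, and the two trajectories are made-up numbers.

```python
def misalignment_score(trajectory, n_final=1):
    """Plateau score (max over thinking positions) minus the score just before output."""
    if len(trajectory) <= n_final:
        raise ValueError("trajectory must be longer than n_final")
    thinking, final = trajectory[:-n_final], trajectory[-n_final:]
    plateau = max(thinking)                  # highest internal refusal intention
    final_score = sum(final) / len(final)    # refusal score right before the output
    return plateau - final_score

# Made-up trajectories: one with a cliff, one that stays on the plateau.
cliff = [0.91, 0.93, 0.95, 0.94, 0.92, 0.30, 0.08]
aligned = [0.90, 0.93, 0.95, 0.94, 0.93, 0.92, 0.94]

ms_cliff = misalignment_score(cliff)
ms_aligned = misalignment_score(aligned)
print(round(ms_cliff, 2), round(ms_aligned, 2))
```

A trajectory that falls off the cliff gets a score near 1, while one that holds its plateau gets a score near 0.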

Causal Intervention Analysis

Identify which attention heads suppress refusal intention:

  • Ablate or activate individual attention heads and measure the resulting change in refusal score

  • Finding: a small number of attention heads contribute negatively to refusal behavior → “safety-suppressing heads”
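The ablate-and-measure logic can be illustrated on a toy model. This sketch assumes each head adds a fixed scalar to a refusal logit. That is a deliberate simplification, since a real intervention would zero a head's output inside the transformer and re-run the probe. The head indices and contribution values are invented.

```python
import math

def refusal_score(head_outputs, ablated=frozenset()):
    """Toy readout: sigmoid of the summed per-head contributions to the refusal logit."""
    logit = sum(v for head, v in head_outputs.items() if head not in ablated)
    return 1.0 / (1.0 + math.exp(-logit))

def ablation_effects(head_outputs):
    """Change in refusal score when each head is ablated on its own."""
    base = refusal_score(head_outputs)
    return {
        head: refusal_score(head_outputs, frozenset([head])) - base
        for head in head_outputs
    }

# Hypothetical (layer, head) -> contribution to the refusal logit.
head_outputs = {(10, 3): 0.8, (12, 7): 0.5, (20, 1): -2.4, (22, 5): -1.1, (5, 0): 0.1}

effects = ablation_effects(head_outputs)
# Heads whose removal raises the refusal score act as suppressors.
suppressors = sorted((h for h, d in effects.items() if d > 0), key=lambda h: -effects[h])
print(suppressors)
print(round(refusal_score(head_outputs, frozenset(suppressors)), 2))
```

In this toy setting, removing the two negative-contribution heads moves the refusal score from well below 0.5 to well above it, which mirrors the qualitative claim that suppression is concentrated in a few heads.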

Cliff-as-a-Judge (data selection)

A method for selecting safety training data using MS:

  • Examples with the highest MS = the cases the model “suppresses” most = the most informative safety training data

  • Select the optimal subset: choose the $k$ examples with the highest MS and fine-tune on them
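The selection step amounts to a top-k ranking by score. A minimal sketch, assuming the misalignment scores have already been computed. `select_top_k` is a hypothetical helper, and the example IDs and scores are made up.

```python
def select_top_k(examples, scores, k):
    """Return the k examples with the highest misalignment score, best first."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in ranked[:k]]

# Made-up candidate pool with precomputed scores.
examples = ["ex_a", "ex_b", "ex_c", "ex_d", "ex_e"]
scores = [0.10, 0.80, 0.05, 0.60, 0.30]

subset = select_top_k(examples, scores, k=2)
print(subset)
```

The resulting subset would then be used as the fine-tuning set. How `k` is chosen is left open in this sketch.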


III. Results and Discussion


Summary of the results and discussion presented in the paper

i) Results

Confirming the Refusal Cliff phenomenon:

  • Observed consistently across several reasoning models, including QwQ, Skywork-OR1, and Hermes4.

  • During the thinking process, the refusal score stays on a high plateau (the model internally recognizes “this is dangerous”).

  • At the tokens just before output generation, the refusal score drops sharply (the cliff).

  • In well-aligned models (e.g., Qwen3-Thinking), no cliff is observed → the cliff is an indicator of misalignment.

Causal Intervention results:

  • A sparse set of attention heads is responsible for refusal suppression.

  • Ablating these heads restores refusal → the suppression mechanism is localizable.

Effect of Cliff-as-a-Judge data selection:

  • Comparable safety performance is reached with only 700 examples (1.7%) of the full 40K dataset.

  • Rule-based selection needs 21,566 examples (-46.1% relative to the full set), and LLM-as-a-judge needs 5,616 (-86.0%).

  • ASR of 5% or lower on JailbreakBench, WildJailbreak, and other benchmarks.

  • Reasoning capability is preserved on MMLU-Pro and ARC-C, keeping the safety-reasoning trade-off to a minimum.
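The percentages in the data-selection figures above can be checked with simple arithmetic. The sketch below assumes the “40K” pool is exactly 40,000 examples, which the summary does not state. Under that assumption 700 examples is 1.75% of the pool, which the summary reports as 1.7%.

```python
POOL_SIZE = 40_000  # assumption: "40K" taken as exactly 40,000 examples

def percent_of_pool(n_selected):
    """Share of the full pool that a selection keeps, in percent."""
    return 100.0 * n_selected / POOL_SIZE

def percent_reduction(n_selected):
    """How much smaller the selection is than the full pool, in percent."""
    return 100.0 - percent_of_pool(n_selected)

selected = {"Cliff-as-a-Judge": 700, "LLM-as-a-judge": 5_616, "Rule-based": 21_566}
for method, n in selected.items():
    print(f"{method}: keeps {percent_of_pool(n):.2f}%, reduction {percent_reduction(n):.2f}%")
```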

ii) Discussion

“The model knows, but does not say”: this is the most important implication of the refusal cliff. Safety alignment fails not because “the model does not recognize harm”, but because suppression occurs while that recognition is being translated into output. This suggests shifting the defense approach from “more safety training” to “removing the suppression mechanism”.

Temporal dynamics at the reasoning-chain level: this is a temporal observation at a different granularity from existing token-level temporal analyses (including our paper). Because reasoning models produce thinking chains of hundreds to thousands of tokens, the evolution of refusal intention can be followed across the whole chain.

Less-is-more effect: effective safety alignment is possible with only a small number of the most informative examples. This agrees with the general observation that data quality matters more than quantity, and the paper makes it quantitative with a concrete metric, MS.

Limitations:

  • Applies only to reasoning models (LRMs). General chat models have no thinking chain, so the cliff phenomenon itself may take a different form there.

  • It is uncertain whether linear probing fully captures refusal intention.

  • Withdrawn from ICLR 2026; the specific limitations raised during review have not been made public.


IV. Summary


Final summary

Key contributions of the paper:

  1. Discovery of the Refusal Cliff: reasoning models maintain a refusal intention internally, yet that intention is sharply suppressed just before the output. This presents a new mechanism of safety failure.

  2. Identification of a causal mechanism: a small number of attention heads are responsible for refusal suppression. Removing these heads can restore refusal.

  3. Cliff-as-a-Judge: with MS-based data selection, effective safety alignment is achieved using only 1.7% of the data.

Implications for our paper:

  • Difference in granularity: they work at the reasoning-chain level (hundreds of tokens), while we work at the token-generation level (~60 tokens). Both make the same “safety is temporal” observation, but at different scales.

  • Onset vs. cliff: our onset point (when safety activation turns on) and their cliff (when safety turns off) are opposite directions of the same phenomenon. We ask “when does safety turn on?”, and they ask “when does safety turn off?”.

  • Sign reversal and the cliff: our sign reversal (44.3% of failed jailbreaks) corresponds to a mini-cliff at the token level.

  • Positioning: our paper is a token-level temporal diagnostic for general chat models, and this paper is a chain-level temporal analysis for reasoning models. The two are complementary and make distinct contributions to the “safety as a temporal process” line of research.


© Written by 2betforyou