Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models, Li & Liu (2025), arXiv


📎 arXiv:2506.24056

Authors: Tung-Ling Li, Hongliang Liu

Relation to our paper: They use a logit-gap definition nearly identical to our $S_t = \mu_{cmp} - \mu_{ref}$, but for attack, whereas we use it for diagnosis. Same metric, opposite purpose. A key example of the "diagnostic vs. interventional" distinction.


I. Introduction


Summary of the paper's Introduction

RLHF-aligned LLMs are trained to refuse unsafe requests, yet in theory a very short suffix (a few tokens appended to the prompt) is enough to switch the model into compliance mode. Wolf et al. (2024) argued that alignment does not delete unsafe continuations but merely suppresses them, and that they reactivate once a narrow "energy gap" is crossed.

The question is how to find such short, effective suffixes quickly and reliably. Existing methods:

  • GCG (Greedy Coordinate Gradient): requires many forward-backward iterations, produces long suffixes, and often drifts off-topic.

  • AutoPrompt: similar problems.

  • Beam-search-based methods: high computational cost.

This paper's solution: logit-gap steering, a framework that computes the logit gap between refusal and affirmation tokens in a single forward pass and finds a suffix in under one second with a "sort-sum-stop" sweep. It needs two orders of magnitude fewer model calls than GCG.


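II. Proposed Method


Summary of the techniques and algorithms used or proposed

Refusal-Affirmation Logit Gap ($\Delta_0$)

Core metric: given a prompt, the difference at the next-token position between the logit of the refusal token group and the logit of the affirmation token group.

$\Delta_0 = \ell_{\text{refusal}} - \ell_{\text{affirm}}$

where $\ell_{\text{refusal}}$ and $\ell_{\text{affirm}}$ denote the logits of the refusal and affirmation token groups at the next-token position.

$\Delta_0 > 0$ means the model is inclined to refuse; $\Delta_0 < 0$ means it is inclined to comply.

❗ This has the opposite sign from our $S_t$ but is essentially the same metric: our $S_t > 0$ indicates that compliance dominates, while their $\Delta_0 > 0$ indicates that refusal dominates.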

Forward-Computable Score

For each vocabulary token, the effect of appending that token as a suffix is approximated with a single forward pass. The score combines three terms:

  • Gap reduction: how much the token shrinks the refusal-affirmation gap.

  • KL penalty: keeps the result from drifting too far from the original distribution (preserving natural-looking text).

  • Reward proxy: uses the rise in affirmation-token logits as a reward signal (drawing on the correlation between reward and affirmation logits reported in InstructGPT / Anthropic work).

Sort-Sum-Stop Sweep

  1. Compute the score for the entire vocabulary (single forward pass).

  2. Sort tokens by score in descending order.

  3. Append the top tokens to the suffix in order, accumulating the reduction in $\Delta_0$.

  4. Stop once the remaining gap is 0 or below (= the refusal → compliance switch is achieved).

The whole process takes under one second, which is extremely efficient compared with the thousands of forward-backward passes GCG needs.
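The four steps above can be sketched on synthetic numbers as follows. This is only an illustration of the sort, sum, and stop control flow, not the authors' implementation: it assumes per-token gap reductions add up linearly, takes the scores and reductions as given arrays rather than computing them from a model, and the names `sort_sum_stop`, `gap_reduction`, and `max_len` are hypothetical.

```python
import numpy as np

def sort_sum_stop(scores, gap_reduction, delta0, max_len=5):
    """Greedy sweep over pre-computed per-token scores (toy illustration).

    scores:        one score per vocabulary index (higher is better)
    gap_reduction: assumed additive reduction of the gap per index
    delta0:        initial refusal-affirmation gap
    max_len:       hypothetical cap on how many indices are selected
    """
    scores = np.asarray(scores, dtype=float)
    gap_reduction = np.asarray(gap_reduction, dtype=float)

    order = np.argsort(scores)[::-1]      # sort: descending by score
    remaining = float(delta0)
    chosen = []
    for idx in order[:max_len]:
        if remaining <= 0:                # stop: gap already closed
            break
        chosen.append(int(idx))
        remaining -= float(gap_reduction[idx])   # sum: accumulate reduction
    return chosen, remaining

# Synthetic values for a 4-entry vocabulary.
toy_scores = [0.2, 1.5, 0.9, 0.1]
toy_reduction = [0.3, 1.8, 1.4, 0.1]

chosen, remaining = sort_sum_stop(toy_scores, toy_reduction, delta0=3.0)
print(chosen)          # [1, 2]
print(remaining <= 0)  # True
```

Here the two highest-scoring entries are enough to bring the remaining gap below zero, so the sweep stops after two selections.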

์Šค์ผ€์ผ๋ง ๊ด€์ฐฐ

๋ชจ๋ธ ํฌ๊ธฐ๊ฐ€ ์ปค์งˆ์ˆ˜๋ก โ๊ฐ€ ์ปค์ง€๋Š” ๊ฒฝํ–ฅ (๋” ๊ฐ•ํ•œ alignment). ๊ทธ๋Ÿฌ๋‚˜ ๋” ํฐ ๋ชจ๋ธ์€ heavy-tailed โ ๋ถ„ํฌ๋ฅผ ๋ณด์—ฌ, ์†Œ์ˆ˜์˜ โ€œ๊ณ ํšจ์œจโ€ ํ† ํฐ์ด ์กด์žฌ. ๊ฒฐ๊ณผ์ ์œผ๋กœ greedy sweep์ด ํฐ ๋ชจ๋ธ์—์„œ๋„ ํšจ์œจ์ ์œผ๋กœ ์ž‘๋™.


III. Results and Discussion


Summary of the results and discussion presented in the paper

i) Results

Attack Success Rate (ASR):

  • On models from 0.5B to 70B, ASR rises from the baseline to 80-100%.

  • Suffix length: mostly 2-5 tokens, very short compared with GCG's 20+ tokens.

  • Computational cost: a single forward pass, two orders of magnitude less than GCG's thousands of iterations.

Generalization:

  • The same suffix generalizes to unseen prompts.

  • Works across model families (Llama, Qwen, and others).

Topical coherence is preserved:

  • Because the suffix is short, it does not significantly damage the meaning of the prompt.

  • Avoids GCG's "gibberish suffix" problem.

$\Delta_0$ analysis:

  • Model layer size vs. $\Delta_0$: the larger the model, the larger the gap → stronger alignment.

  • However, the heavy tail of the per-token score distribution compensates for this → the gap can be closed with a few high-efficiency tokens.

ii) Discussion

Key comparison with our paper:

| Dimension | Our paper (Logit-Margin Score) | This paper (Logit-Gap Steering) |
| --- | --- | --- |
| Purpose | Diagnosis (diagnostic) | Attack (interventional) |
| Metric | $S_t = \mu_{cmp} - \mu_{ref}$ | $\Delta_0$ (refusal logit minus affirmation logit) |
| Temporal scope | Tracks the full trajectory | Single time point (first token) |
| Output | Type A/B classification, temporal metrics | Adversarial suffix |
| Adversarial robustness | Out of scope (diagnostic work) | Central concern |

What matters is that this work independently confirms that the same logit-level signal carries safety-relevant information. The very fact that they can exploit this gap for attacks is evidence that the signal is tied to the actual safety mechanism.

"Same metric, opposite purpose": this paper manipulates the logit gap to jailbreak, while we observe the logit gap to diagnose. This distinction is the main support for the "measurement ≠ intervention" framing in our paper.

Limitations:

  • Requires white-box access (logit access).

  • The suffix looks natural but may still be detectable.

  • Robustness against defenses (such as SafeDecoding) is not evaluated.


IV. Summary


Final summary

Key contributions of this paper:

  1. Logit-gap-based jailbreak framework: computes the refusal-affirmation logit gap in a single forward pass and generates a suffix in under one second with a sort-sum-stop sweep.

  2. Extreme efficiency: two orders of magnitude less computation than GCG, with short suffixes of 2-5 tokens.

  3. Scaling from 0.5B to 70B: reaches 80-100% ASR regardless of model size.

Implications for our paper:

  • Independent confirmation that the logit gap is safety-relevant. If the same signal can be used for attack, that is indirect evidence that it is also valid for diagnosis.

  • Our paper extends this gap to a temporal trajectory and adds a failure-mode decomposition. This paper does not address temporal dynamics or the Type A/B distinction.

  • "Diagnostic vs. interventional": a clean contrasting case showing that the same metric supports both uses. We emphasize this distinction in Related Works.


© Written by 2betforyou