SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding - Xu et al. (2024), ACL 2024

📎 ACL 2024 · arXiv:2402.08983 Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran Relation to our paper: a concrete example of how our logit-difference safety signal could serve as the trigger of a real defense system. SafeDecoding = "how to intervene", ours = "when and where to intervene".


I. Introduction


Summary of the paper's Introduction

As LLMs are increasingly integrated into real applications such as code generation and chatbots, safety alignment is becoming more important. Jailbreak attacks remain a major threat.

The paper's key observations (token-level analysis):

  1. Safety disclaimer tokens are still in the top-k under a jailbreak: even when the attack succeeds and a harmful token has the highest probability, safety disclaimer tokens such as "I cannot" and "I'm sorry" remain highly ranked.

  2. The safety signal does not vanish completely: alignment is suppressed, not deleted.

Based on these two observations, the paper proposes a decoding strategy that amplifies the probability of safety disclaimer tokens and attenuates the probability of harmful tokens. It intervenes only at decoding time, without further training of the original model.
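The first observation can be illustrated with a small check, sketched here under stated assumptions: the next-token distribution, the set of disclaimer start tokens, and the helper names `top_k_tokens` and `disclaimers_in_top_k` are all hypothetical and not taken from the paper.

```python
def top_k_tokens(probs, k):
    """Return the k most probable tokens, most probable first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def disclaimers_in_top_k(probs, disclaimer_starts, k):
    """List the disclaimer start tokens that are still ranked in the top k."""
    return [tok for tok in top_k_tokens(probs, k) if tok in disclaimer_starts]


# Hypothetical first-token distribution under a successful jailbreak:
# the harmful start "Sure" is ranked first, but refusal starts are close behind.
jailbroken_probs = {"Sure": 0.55, "I": 0.20, "Sorry": 0.10, "Here": 0.08, "The": 0.07}
DISCLAIMER_STARTS = {"I", "Sorry"}  # illustrative set, not the paper's list

survivors = disclaimers_in_top_k(jailbroken_probs, DISCLAIMER_STARTS, k=3)
print(survivors)  # ['I', 'Sorry']
```

In this toy case the attack wins the argmax, yet both refusal starts stay within the top 3, which is the "suppressed, not deleted" situation the decoding strategy relies on.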


II. Proposed Method


Summary of the techniques and algorithms used or proposed

Building the Expert Model

An expert model $\theta'$ is built by fine-tuning a copy of the original model $\theta$ on a small set of safety exemplars:

  • Training data: pairs of harmful instructions and refusal responses, trained with SFT (Supervised Fine-Tuning).

  • Training time: under 1 minute (small dataset, short fine-tuning).

  • The expert model always begins its response to a harmful prompt with a safety disclaimer.

SafeDecoding: Probability Modification Strategy

Given a token sequence $x_{1:n-1}$, the next-token probability is modified as follows:

$$P_n(x \mid x_{1:n-1}) = p_\theta(x \mid x_{1:n-1}) + \alpha \left( p_{\theta'}(x \mid x_{1:n-1}) - p_\theta(x \mid x_{1:n-1}) \right)$$

Intuition:

  • Tokens to which the expert model $\theta'$ assigns high probability (= safety disclaimers) → probability amplified

  • Tokens on which the original model $\theta$ and the expert model differ most → likely to be safety-related

  • $\alpha$: controls the strength of the expert model's influence
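A minimal sketch of the probability modification, assuming next-token distributions are given as plain dictionaries. The function name `safedecoding_probs`, the clipping of negative values to zero, the renormalization, and the toy numbers are assumptions of this sketch rather than details stated above.

```python
def safedecoding_probs(p_orig, p_expert, alpha):
    """Blend original and expert next-token probabilities.

    For each token x: score(x) = p_orig(x) + alpha * (p_expert(x) - p_orig(x)).
    Negative scores are clipped to 0 and the result is renormalized
    (both steps are assumptions of this sketch).
    """
    raw = {}
    for tok in set(p_orig) | set(p_expert):
        po = p_orig.get(tok, 0.0)
        pe = p_expert.get(tok, 0.0)
        raw[tok] = max(0.0, po + alpha * (pe - po))
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("all modified probabilities are zero")
    return {tok: score / total for tok, score in raw.items()}


# Hypothetical distributions: the jailbroken original prefers "Sure",
# the expert prefers the refusal start "I".
p_orig = {"Sure": 0.6, "I": 0.3, "The": 0.1}
p_expert = {"Sure": 0.1, "I": 0.8, "The": 0.1}

modified = safedecoding_probs(p_orig, p_expert, alpha=3)
print(max(modified, key=modified.get))  # I
```

With these toy numbers, the token the expert favors overtakes the harmful start, while a token on which both models agree ("The") keeps its unnormalized score. Setting `alpha=0` recovers the original distribution and `alpha=1` recovers the expert's.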

Applied Only to the Initial Tokens

SafeDecoding is applied only to the first $m$ tokens, after which generation switches back to normal decoding:

  • Reason: safety behavior is mostly decided in the initial tokens (consistent with shallow alignment)

  • Once a safety disclaimer has started, the following tokens naturally continue in a safe direction

  • $m = 2$ is sufficient in most cases
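The switch from modified to normal decoding after the first tokens can be sketched as a greedy loop. Everything here is illustrative: `toy_original` and `toy_expert` are hypothetical stand-ins for real models, greedy selection is chosen for simplicity, and negative scores are clipped to zero as a sketch assumption.

```python
def modify(p_orig, p_expert, alpha):
    """Unnormalized modified scores; negatives clipped to 0 (sketch assumption)."""
    return {
        tok: max(0.0, p + alpha * (p_expert.get(tok, 0.0) - p))
        for tok, p in p_orig.items()
    }


def generate(orig_model, expert_model, prompt, alpha, m, max_new_tokens):
    """Greedy decoding that uses the modified scores for the first m steps only."""
    tokens = list(prompt)
    for step in range(max_new_tokens):
        p_orig = orig_model(tokens)
        if step < m:
            scores = modify(p_orig, expert_model(tokens), alpha)
        else:
            scores = p_orig  # normal decoding from here on
        next_tok = max(scores, key=scores.get)
        tokens.append(next_tok)
        if next_tok == "<eos>":
            break
    return tokens[len(prompt):]


def toy_original(tokens):
    """Hypothetical jailbroken model: prefers to start with 'Sure'."""
    last = tokens[-1]
    if last == "<prompt>":
        return {"Sure": 0.6, "I": 0.3, "<eos>": 0.1}
    if last == "Sure":
        return {"here": 0.8, "cannot": 0.1, "<eos>": 0.1}
    if last == "I":
        return {"cannot": 0.7, "here": 0.2, "<eos>": 0.1}
    return {"<eos>": 0.9, "here": 0.05, "cannot": 0.05}


def toy_expert(tokens):
    """Hypothetical expert: always pushes toward a refusal."""
    if tokens[-1] == "<prompt>":
        return {"Sure": 0.05, "I": 0.9, "<eos>": 0.05}
    return {"cannot": 0.9, "here": 0.05, "<eos>": 0.05}


defended = generate(toy_original, toy_expert, ["<prompt>"], alpha=3, m=2, max_new_tokens=5)
undefended = generate(toy_original, toy_expert, ["<prompt>"], alpha=3, m=0, max_new_tokens=5)
print(defended)    # ['I', 'cannot', '<eos>']
print(undefended)  # ['Sure', 'here', '<eos>']
```

In the toy run, intervening on two tokens is enough to put the response on the refusal path. The third token is produced by the original model alone.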

Hyperparameters

  • $\alpha$: strength of the expert model's influence

  • $m$: number of tokens to which SafeDecoding is applied

  • $c$: probabilities are modified only for the top-c tokens (computational efficiency)
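One way to restrict the modification to a small candidate set is sketched below. It assumes, beyond what the summary above states, that the set is built by growing k until the top-k lists of the two models share at least c tokens. The helper name `candidate_tokens` and the toy distributions are hypothetical.

```python
def candidate_tokens(p_orig, p_expert, c):
    """Smallest shared top-k set with at least c tokens, ordered by original rank.

    Assumption of this sketch: k grows until the two top-k lists intersect
    in at least c tokens; if that never happens, all shared tokens are returned.
    """
    rank_orig = sorted(p_orig, key=p_orig.get, reverse=True)
    rank_expert = sorted(p_expert, key=p_expert.get, reverse=True)
    for k in range(1, max(len(rank_orig), len(rank_expert)) + 1):
        shared = set(rank_orig[:k]) & set(rank_expert[:k])
        if len(shared) >= c:
            return [tok for tok in rank_orig[:k] if tok in shared]
    shared = set(rank_orig) & set(rank_expert)
    return [tok for tok in rank_orig if tok in shared]


# Hypothetical distributions over a five-token vocabulary.
p_orig = {"Sure": 0.5, "Here": 0.2, "I": 0.15, "Sorry": 0.1, "The": 0.05}
p_expert = {"I": 0.6, "Sorry": 0.25, "Sure": 0.08, "The": 0.04, "Here": 0.03}

print(candidate_tokens(p_orig, p_expert, c=2))  # ['Sure', 'I']
print(candidate_tokens(p_orig, p_expert, c=3))  # ['Sure', 'I', 'Sorry']
```

Only the tokens in this set would then need their probabilities recomputed, which keeps the per-step cost small compared with touching the whole vocabulary.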


III. Results and Discussion


Summary of the results and discussion presented in the paper

i) Results

Defense performance against six state-of-the-art attacks:

  • Attacks: GCG, AutoDAN, PAIR, DeepInception, SAP30, Template-based.

  • Models: Vicuna, Llama2, Guanaco, Falcon, Dolphin (5 models).

  • Benchmarks: AdvBench, HEx-PHI, MT-Bench, Just-Eval.

Attack Success Rate (ASR) reduction:

  • ASR drops substantially for most attack-model combinations.

  • Outperforms all six existing defenses (Perplexity filter, Paraphrase, Retokenization, Self-Reminder, ICD, Self-Examination).

Utility preserved (key advantage):

  • MT-Bench: deviation within 1% on Vicuna and within 5% on Llama2, in contrast to existing defenses, which degrade utility substantially.

  • Just-Eval: similar performance across helpfulness, clarity, factuality, depth, and engagement.

  • Existing defenses (especially on Llama2) severely damage utility, whereas SafeDecoding preserves it.

Efficiency (ATGR: Average Token Generation Time Ratio):

  • SafeDecoding has the lowest overhead among the existing defenses.

  • Because it is applied only to the first $m$ tokens, its effect on total generation time is limited.
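A rough sketch of how such a ratio can be computed, assuming ATGR is the average per-token generation time with the defense divided by the average per-token time without it. The timing numbers are hypothetical. `estimated_atgr` further assumes that each of the first m tokens costs one extra forward pass of comparable cost, which is a simplification rather than a measured result.

```python
def atgr(time_with, tokens_with, time_without, tokens_without):
    """Ratio of average per-token generation time with vs. without a defense."""
    avg_with = time_with / tokens_with
    avg_without = time_without / tokens_without
    return avg_with / avg_without


def estimated_atgr(n_tokens, m, expert_cost_ratio=1.0):
    """Back-of-the-envelope estimate: only the first m tokens pay for an
    extra expert forward pass (cost relative to the original model)."""
    extra = min(m, n_tokens) * expert_cost_ratio
    return (n_tokens + extra) / n_tokens


# Hypothetical measurements: 100 tokens in 10.3 s with the defense, 10.0 s without.
print(round(atgr(10.3, 100, 10.0, 100), 3))  # 1.03
print(round(estimated_atgr(100, m=2), 3))    # 1.02
```

Under these assumptions the overhead shrinks as responses get longer, since the extra cost is paid for a fixed number of initial tokens.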

Ablation Study:

  • $\alpha$: stable at 3 or above. Too high a value can make safety disclaimers excessive, but performance is insensitive over a wide range.

  • $m$: 2 or more is sufficient. Increasing it further has little effect → confirms that safety is decided in the first 2 tokens.

  • $c$: stable at 7 or above.

ii) Discussion

"The safety signal is already there": SafeDecoding's core insight is that safety disclaimer tokens remain in the top-k even under a jailbreak. This suggests that alignment has been "suppressed" rather than "deleted". It agrees with the theoretical argument of Wolf et al. (2024) and with the refusal direction finding of Arditi et al. (2024).

Focus on initial tokens ($m = 2$): experimental confirmation that safety behavior is mostly decided in the first 2 tokens. This is independent evidence consistent with the shallow alignment of Qi et al. (2024) and with our early-k analysis.

Difference between the expert model and the original model = safety signal: this idea is conceptually similar to our logit-difference signal. We define the safety signal as the logit difference between a compliance lexicon and a refusal lexicon, whereas SafeDecoding identifies safety-relevant tokens by the probability difference between the expert model and the original model. In both approaches, the key is to "separate safety from non-safety at the token level".

Limitations:

  • Building the expert model requires additional data and fine-tuning, although only a small amount.

  • Applied only to text-only LLMs. Extension to multimodal LLMs is not evaluated.

  • Adversarial robustness: analysis of adaptive attacks that target SafeDecoding itself is limited.

  • Not equally effective against all attack types (for some attacks, the baselines also defend well).


IV. Summary


Final summary

Key contributions of this paper:

  1. Token-level observation: the finding that safety disclaimer tokens remain in the top-k even under a jailbreak. Alignment is suppressed, not deleted.

  2. SafeDecoding: a defense strategy that uses expert-model-based probability modification to strengthen safety only at decoding time, without further training of the original model. It balances ASR reduction with preserved utility.

  3. Focus on initial tokens: experimental confirmation that $m = 2$ is sufficient. Safety behavior is decided in the initial tokens.

Implications for our paper:

  • SafeDecoding = "how to intervene", ours = "when/where to intervene": our logit-difference signal or its sign reversal can be used as a trigger for SafeDecoding. That is, instead of SafeDecoding intervening on every token, our diagnostic signals that "intervention is needed now".

  • SafeDecoding is especially effective under our Type B condition: Type B is the case where safety is pushed aside during decoding, which is exactly the situation SafeDecoding targets. For Type A (already broken at the context stage), SafeDecoding alone may be insufficient → an additional input-stage defense is needed.

  • $m = 2$ and early-k: SafeDecoding finding that 2 tokens are sufficient and our finding that early_5_mean is the best metric are different observations of the same phenomenon. Safety information is concentrated in the initial tokens.


© Written by 2betforyou