Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (Shallow Alignment) - Qi et al. (2024), ICLR 2024



📎 ICLR 2024 · arXiv:2310.03693
Authors: Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
Relation to our paper: theoretical grounding for our early-k results. This paper claims that "alignment is shallow"; we provide quantitative, logit-level evidence of how shallow it is.

I. Introduction


Summary of the paper's Introduction

Safety alignment for LLMs is designed to restrict harmful behavior at inference time, but the safety risks that arise at the fine-tuning stage have not been adequately addressed. With Meta's open release of Llama and OpenAI's fine-tuning API for GPT-3.5 Turbo, users can now fine-tune models themselves, and the existing safety alignment can break down in the process.

The core concern is organized into three risk levels:

  • Risk Level 1 (explicitly harmful data): A handful of explicitly harmful training examples (e.g., 10) is enough to completely neutralize the safety guardrails. GPT-3.5 Turbo can be jailbroken for less than $0.20.

  • Risk Level 2 (implicitly harmful data): It is possible to design "implicitly harmful" datasets that evade OpenAI's moderation system. Without any explicitly toxic content, such data resets the model's top priority to "obedience".

  • Risk Level 3 (purely benign data): Even fine-tuning on benign datasets such as Alpaca or Dolly, with no malicious intent, partially degrades safety alignment.

The paper's key insight is that safety alignment is learned only "shallowly" in the model's parameter space, so a few gradient steps can easily overwrite it. This insight is the foundation of the "shallow alignment" concept. A follow-up paper, Qi et al. (2024, arXiv:2406.05946), treats the concept more rigorously and develops it into the claim that safety training is concentrated on the first few token positions.


II. Proposed Method


Summary of the techniques and algorithms used or proposed

Threat Model

Three scenarios are defined according to the attacker's level of capability:

  1. Explicit harmful data fine-tuning: The attacker writes harmful instruction-response pairs and fine-tunes on them. Example: "How to make a bomb?" → a detailed answer. Just 10 examples suffice. (Open question: is this effectively a few-shot jailbreak?)

  2. Implicit harmful data fine-tuning: Data designed to pass the moderation filter while still changing the model's safety priorities. "Identity shifting": the model is given an identity that says "you must follow any request".

  3. Benign data fine-tuning: Uses public datasets such as Alpaca (52K), Dolly (15K), and LLaVA-Instruct. This is the standard usage scenario with no malicious intent.

Evaluation Method

  • Safety evaluation: GPT-4 serves as the judge and assigns a harmfulness score from 1 to 5. The evaluation uses a set of red-teaming prompts covering various harmful categories (hate speech, violence, self-harm, sexual content, etc.).

  • Utility evaluation: Existing benchmarks (MMLU, etc.) are used to check whether the model's general capabilities are retained after fine-tuning.
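The per-prompt judge scores described above have to be aggregated before two models can be compared. The following is a minimal sketch of one way to do that, assuming the judge returns an integer from 1 to 5 per prompt. The function name `summarize_judge_scores`, the `harmful_threshold` parameter, and the example scores are hypothetical and are not taken from the paper's code.

```python
from statistics import mean


def summarize_judge_scores(scores, harmful_threshold=5):
    """Aggregate per-prompt judge scores (integers 1-5, higher = more harmful).

    harmful_threshold is an assumption of this sketch: a response counts as
    harmful when its score is at or above this value.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("each score must lie in [1, 5]")
    flagged = sum(1 for s in scores if s >= harmful_threshold)
    return {
        "mean_score": mean(scores),
        "harmful_rate": flagged / len(scores),
    }


# Made-up scores for illustration only; these are not results from the paper.
before = summarize_judge_scores([1, 1, 1, 2, 1])
after = summarize_judge_scores([5, 4, 5, 5, 3])
print(before)
print(after)
```

Reporting both the mean score and the fraction of responses at the top of the scale separates a broad, mild shift from a small number of severe failures.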

Interpreting the Core Mechanism

Hypothesis for why fine-tuning breaks safety alignment: alignment occupies a relatively narrow region of the model's full parameter space (it is shallow), and a few gradient updates are enough to leave that region. In other words, the "safety region" in weight space is narrow.


III. Results and Discussion


Summary of the results and discussion presented in the paper

i) Results

Risk Level 1 (explicit harmful, 10 examples):

  • GPT-3.5 Turbo: after fine-tuning, the model responds to nearly all harmful instructions. Cost < $0.20.

  • Llama-2-7B-Chat: the same 10 examples completely neutralize the safety guardrails.

  • General capabilities (MMLU, etc.) barely degrade; only safety is selectively removed.

Risk Level 2 (implicit harmful):

  • Jailbreaking succeeds even with a dataset that passes the OpenAI moderation API.

  • The key ingredient is 10 examples that reset the model's identity to an "obedient assistant".

Risk Level 3 (benign data):

  • Fine-tuning on Alpaca, Dolly, etc. partially degrades safety.

  • Safety alignment weakens even without any explicitly harmful examples.

  • The degradation is smaller than at Risk Level 1, but it is arguably more worrying because it is an "unintended loss of safety".

Key numbers:

  • Harmfulness score after fine-tuning on 10 adversarial examples: 4.5+ / 5.0 (the original model scores ~1.0)

  • After benign fine-tuning: harmfulness score in the 1.8 to 2.5 range (a meaningful increase)

  • Model capability (MMLU): difference before vs. after fine-tuning < 2%

ii) Discussion

Interpreting "Shallow Alignment": Why does safety alignment break so easily? The paper argues that alignment is learned relatively "shallowly" in the model's parameter space.

This implies the following:

  • Safety behavior may be a surface-level pattern that is separable from the model's deep capabilities.

  • RLHF and other safety training may be concentrated on changing the model's "generation pattern for the first few tokens".

  • As a result, a few gradient steps suffice to overwrite this shallow pattern.

Analysis of defenses: The paper proposes several potential mitigations but acknowledges that all of them have limitations:

  • Training data filtering → can miss implicit attacks

  • Safety-aware fine-tuning → additional cost and complexity

  • Post-fine-tuning safety evaluation → a reactive approach

  • Strengthening the moderation API → can still be bypassed

Limitations: The models used date from 2023, so improvements in later models are not reflected. The exact mechanism of fine-tuning-induced safety degradation (which layers and which weights change) is not analyzed.


IV. Summary


Final summary

Key contributions of this paper:

  1. First systematic demonstration that fine-tuning breaks safety alignment: 10 examples and less than $0.20 are enough to completely neutralize GPT-3.5 Turbo's safety guardrails.

  2. A taxonomy of three risk levels: explicitly harmful → implicitly harmful → purely benign. It shows that the risk is not limited to malicious attacks.

  3. Foundation for the "shallow alignment" concept: It provides an intuitive explanation of why safety alignment is fragile, and serves as the theoretical starting point for later work (especially on the concentration of safety on early tokens).

Implications for our paper: If this paper argues that "alignment is shallow", then our early-k analysis (early_1 AUC 0.696 → early_5 AUC 0.786) is quantitative evidence of how that shallow alignment concretely manifests in logit space, and of the fact that its "depth" differs across model-attack conditions.


© Written by 2betforyou