Paper Reviews

2026

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (Shallow Alignment) - Qi et al. (2024), ICLR 2024

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 ICLR 2024 · arXiv:2310.03693 Authors: Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson. Relation to our paper: theoretical grounding for the early-k results. The claim that "alignment is shallow" → we provide quantitative, logit-level evidence of "how shallow" it is.


SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding - Xu et al. (2024), ACL 2024

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 ACL 2024 · arXiv:2402.08983 Authors: Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran. Relation to our paper: a concrete example of how our […] could serve as a trigger in a real defense system. SafeDecoding = "how to intervene"; ours = "when and where to intervene".


Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? - Yin et al. (2025), ICLR 2026 Withdrawn Submission

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 arXiv:2510.06036 (ICLR 2026 Withdrawn Submission) Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, et al. Relation to our paper: the most direct "temporal safety" point of comparison. They work at the reasoning-chain level, we at the token-generation level, on temporal dynamics…


Refusal in Language Models Is Mediated by a Single Direction - Arditi et al. (2024), NeurIPS 2024

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 NeurIPS 2024 · arXiv:2406.11717 Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda. Relation to our paper: safety analysis at the representation level. In our temporal construct validity experiment, the per-step correlation between this refusal direction and $S_t$…


Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models - Li & Liu (2025), arXiv

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 arXiv:2506.24056 Authors: Tung-Ling Li, Hongliang Liu. Relation to our paper: they use a logit-gap definition nearly identical to our $S_t = \mu_{cmp} - \mu_{ref}$ for attacks, whereas we use it for diagnosis. Same metric, opposite purpose. A key example of the "diagnostic vs. interventional" distinction.
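
To make the logit-gap idea concrete, here is a minimal sketch of computing $S_t = \mu_{cmp} - \mu_{ref}$ at a single decoding step. The excerpt does not define $\mu_{cmp}$ and $\mu_{ref}$, so this sketch assumes they are the mean logits over a set of compliance-leaning and refusal-leaning token ids. The function name `logit_gap`, the token id sets, and the toy logits are all hypothetical and are not taken from either paper.

```python
from statistics import mean


def logit_gap(logits, compliance_ids, refusal_ids):
    """Return S_t for one decoding step.

    Assumption: mu_cmp and mu_ref are mean logits over two token-id sets.
    """
    mu_cmp = mean(logits[i] for i in compliance_ids)
    mu_ref = mean(logits[i] for i in refusal_ids)
    return mu_cmp - mu_ref


# Toy 6-token vocabulary; the ids chosen for each set are hypothetical.
step_logits = [2.0, 0.5, -1.0, 3.0, 1.0, 0.0]
gap = logit_gap(step_logits, compliance_ids=[0, 3], refusal_ids=[1, 4])
print(gap)  # positive: compliance tokens outscore refusal tokens at this step
```

Under this assumed definition, a positive gap means the model leans toward complying at that step and a negative gap means it leans toward refusing, which is why the same quantity can serve either an attacker or a diagnostic tool.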


Jailbroken: How Does LLM Safety Training Fail? - Wei et al. (2023), NeurIPS 2023 (Oral)

4 minute read

Reviewed:

importance-high llm-safety paper-review status-in-progress

📎 NeurIPS 2023 · arXiv:2307.02483 Authors: Alexander Wei, Nika Haghtalab, Jacob Steinhardt (UC Berkeley). Relation to our paper: theoretical foundation for the Type A/B classification. Competing objectives ↔ Type B and mismatched generalization ↔ Type A can be put in correspondence.


Data-Free Knowledge Distillation for Heterogeneous Federated Learning, ICML 2021 PMLR 139

4 minute read

Reviewed:

federated-learning paper-review status-in-progress

Data heterogeneity in FL: client data are typically non-IID (not independent and identically distributed), which inherently leads to biased local optima.


Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023

4 minute read

Reviewed:

federated-learning paper-review status-in-progress

Reason for citation. Mentioned as an existing, popular approach that uses LLM-as-judge to assess the difficulty and quality of synthetic data and to filter the data.


Do Generated Data Always Help Contrastive Learning?, ICLR 2024

6 minute read

Reviewed:

federated-learning paper-review status-in-progress

Reason for citation. Cited in Federated Balanced Learning as an example of work exploring the ratio or balance between synthetic and real data. It illustrates existing research trends.


Federated Balanced Learning, CVPR 2026

3 minute read

Reviewed:

federated-learning paper-review

Prior work tries to handle Non-IID data at the optimization stage (by modifying gradients or loss functions) → this is an attempt to correct model drift after it has occurred, not a fix for the root problem (sample imbalance).




Class-Balanced Loss Based on Effective Number of Samples, CVPR 2019

4 minute read

Reviewed:

federated-learning paper-review status-in-progress

long-tail: a skewed distribution in which a few dominant classes account for most of the examples while most other classes have relatively few, i.e. data imbalance.
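
As a rough illustration of how the paper named in the title counters this imbalance, the sketch below computes per-class weights from the effective number of samples, $E_n = (1 - \beta^n) / (1 - \beta)$, and weights each class by $1 / E_n$. The function name `class_balanced_weights`, the toy class counts, and the choice to rescale the weights so they sum to the number of classes are assumptions made for this example, not details stated in the excerpt.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class weights proportional to 1 / effective number of samples.

    counts: number of training examples per class (each must be > 0).
    beta:   hyperparameter in [0, 1); values near 1 approach inverse-frequency weighting.
    """
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    raw = [1.0 / e for e in effective]
    # Assumed convention: rescale so the weights sum to the number of classes.
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical long-tailed dataset: one head class and two tail classes.
counts = [5000, 500, 50]
weights = class_balanced_weights(counts)
print(weights)  # the rarest class receives the largest weight
```

These weights would typically multiply the per-example loss of the corresponding class, so tail classes contribute more to the gradient than their raw frequency would allow.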



Synthetic Data from Diffusion Models Improves ImageNet Classification, Shekoofeh Azizi et al., TMLR 2023

6 minute read

Reviewed:

federated-learning paper-review status-in-progress

Recently, denoising diffusion probabilistic models (DDPMs) have produced images comparable in quality to those of GANs while offering greater stability during training.



2025


GPT-4 Technical Report

less than 1 minute read

Reviewed:

jailbreak-attacks paper-review status-done

As with earlier GPT models, reinforcement learning from human feedback (RLHF) is used to produce responses better aligned with the user's intent.


Claude 3.7 Sonnet System Card

1 minute read

Reviewed:

jailbreak-attacks paper-review status-done

Responses are evaluated not only with refusal and policy-violation classifiers but also with a "helpfulness" classifier that measures how useful a response is.


Attention is All You Need

2 minute read

Reviewed:

jailbreak-attacks paper-review status-in-progress

Existing approaches are hard to parallelize. Even when parallel processing is possible, word position information is lost.
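
The Transformer addresses the lost position information by adding a positional encoding to each token embedding. The sketch below implements the sinusoidal form from the paper, $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$, in plain Python. The function name and the small `d_model` are illustrative choices, not part of the review.

```python
import math


def sinusoidal_position_encoding(position, d_model):
    """Return the sinusoidal encoding vector for one token position."""
    encoding = []
    # even_dim plays the role of 2i in the formula: each sin/cos pair shares one frequency.
    for even_dim in range(0, d_model, 2):
        angle = position / (10000 ** (even_dim / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding[:d_model]  # trim the extra value when d_model is odd


# Every position gets a distinct vector, so order survives parallel processing.
enc0 = sinusoidal_position_encoding(0, d_model=4)
enc3 = sinusoidal_position_encoding(3, d_model=4)
print(enc0)
print(enc3)
```

Because the encoding depends only on the position index, it can be computed for all tokens at once and added to their embeddings before the attention layers.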