DPO vs. RLHF for Model Fine-Tuning
Alice in AI-land

Published Jan 19, 2024

A comparison of DPO fine-tuning with RLHF (https://arxiv.org/abs/2305.18290). DPO has recently been very successful as a fine-tuning method on the Hugging Face leaderboard, and several improved variants have already branched off the DPO family tree, such as IPO and cDPO. It feels like a DPO variant may soon replace RLHF. 😊

DPO fine-tuning learns end-to-end directly from human preference data without training a separate reward model; compared with RLHF (PPO), it is simpler, more stable, performs strongly, and is cheaper to compute.
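
To make that concrete, here is a minimal PyTorch sketch of the DPO loss from the paper; the function name and tensor arguments are illustrative assumptions, and it expects per-sequence log-probabilities of the chosen (preferred) and rejected completions computed separately under the policy being tuned and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: widen the margin between chosen and rejected
    completions, measured as log-ratios against the reference model."""
    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(z) == softplus(-z); average over the preference batch
    return F.softplus(-logits).mean()

Because the gradient only needs these log-probabilities, no reward model and no on-policy sampling loop (as in PPO) are required, which is where the simplicity and lower compute cost come from.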

IPO – A General Theoretical Paradigm to Understand Learning from Human Preferences (https://arxiv.org/abs/2310.12036, Nov 22, 2023)

cDPO – Eric Mitchell, A Note on DPO with Noisy Preferences & Relationship to IPO (https://ericmitchell.ai/cdpo.pdf, Nov 25, 2023)

Telegram AI study group: https://t.me/+20i_E3O0lHEzNDZh


00:00 Introduction
00:44 Andrew Ng
01:52 Abstract
04:04 RLHF
07:22 RLHF reward model
09:13 RLHF reward model formula
13:30 RL Fine-tuning phase
15:31 DPO
21:03 DPO Objective
22:44 Theoretical Analysis of DPO
29:29 Experiments
32:50 Results
38:42 Discussion
