DPO vs. RLHF for Model Fine-Tuning
Alice in AI-land

Published Jan 19, 2024

A comparison of DPO fine-tuning with RLHF (https://arxiv.org/abs/2305.18290). DPO has recently been very successful as a fine-tuning method on the Hugging Face leaderboard, and several improved variants have already branched off the DPO family tree, such as IPO and cDPO. It feels like a DPO variant may soon replace RLHF. 😊

DPO fine-tuning learns end-to-end directly from human preference data without training a separate reward model; compared with RLHF (PPO), it is simpler, more stable, performs strongly, and is cheaper to compute.
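
To make that concrete, here is a minimal PyTorch sketch of the DPO loss from the paper; the function name and tensor arguments are illustrative assumptions, and it expects per-sequence log-probabilities of the chosen (preferred) and rejected completions computed separately under the policy being tuned and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: widen the margin between chosen and rejected
    completions, measured as log-ratios against the reference model."""
    # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(z) == softplus(-z); average over the preference batch
    return F.softplus(-logits).mean()

Because the gradient only needs these log-probabilities, no reward model and no on-policy sampling loop (as in PPO) are required, which is where the simplicity and lower compute cost come from.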

IPO – A General Theoretical Paradigm to Understand Learning from Human Preferences (https://arxiv.org/abs/2310.12036, Nov 22, 2023)

cDPO – Eric Mitchell, A Note on DPO with Noisy Preferences & Relationship to IPO (https://ericmitchell.ai/cdpo.pdf, Nov 25, 2023)

Telegram AI study group: https://t.me/+20i_E3O0lHEzNDZh


00:00 Introduction
00:44 Andrew Ng
01:52 Abstract
04:04 RLHF
07:22 RLHF reward model
09:13 RLHF reward model formula
13:30 RL Fine-tuning phase
15:31 DPO
21:03 DPO Objective
22:44 Theoretical Analysis of DPO
29:29 Experiments
32:50 Results
38:42 Discussion
