From the course: LLaMa for Developers

Reinforcement learning with RLHF and DPO

- [Instructor] One of the key insights from the LLaMA 2 paper was the use of reinforcement learning with human feedback, or RLHF. This method has humans rate different model responses and uses those ratings to train a model on top of LLaMA to fine-tune it. In this video, we're going to cover two preference techniques: RLHF and DPO. DPO stands for direct preference optimization. So let's learn a little more about RLHF. RLHF works by collecting human preference data and training a reward model, which serves as the reward function. That reward could be helpfulness or harmlessness, as we've seen in the LLaMA 2 paper. Now, reinforcement learning is quite complex, so for the purposes of this video, we're going to use another technique called DPO, or direct preference optimization. Let's head over to our Colab. I'm in 03_04 and I'm going to be using an A100 GPU. For this video, we're going to do something very similar to our LoRA training. We're going to train a LoRA model by using…
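
A minimal sketch of the kind of DPO-plus-LoRA setup described above, using Hugging Face's trl and peft libraries. The base model name, the toy preference examples, and every hyperparameter below are assumptions for illustration, not values from the course notebook, and exact argument names differ across trl versions (newer releases move beta into a DPOConfig and rename tokenizer to processing_class):

```python
# Hypothetical sketch of DPO fine-tuning with a LoRA adapter (trl 0.7-style API).
# Model name, dataset rows, and hyperparameters are placeholders, not course values.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# A toy preference dataset: each row pairs a prompt with a preferred ("chosen")
# and a dispreferred ("rejected") response.
pref_data = Dataset.from_dict({
    "prompt": ["What should I do if I smell gas in my house?"],
    "chosen": ["Leave the house immediately and call your gas company from outside."],
    "rejected": ["Light a match to find the leak."],
})

# LoRA adapter so only a small set of low-rank weights is trained.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="llama2-dpo-lora",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=5e-5,
    remove_unused_columns=False,  # keep the prompt/chosen/rejected columns
)

trainer = DPOTrainer(
    model,
    ref_model=None,    # with a PEFT adapter, trl reuses the frozen base model as the reference
    args=training_args,
    beta=0.1,          # DPO temperature: how far the policy may drift from the reference
    train_dataset=pref_data,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Because DPO optimizes the policy directly on preference pairs, there is no separate reward model or PPO loop to run, which is what makes it a simpler alternative to full RLHF and a natural fit for a LoRA-style fine-tune.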
