Zero-Shot Learning
GPT-2 is able to do many tasks with no examples + no gradient updates, given only a prompt.
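A minimal sketch of zero-shot prompting, assuming the Hugging Face `transformers` pipeline; the "TL;DR:" summarization prompt is the trick from the GPT-2 paper, and the generation settings here are just illustrative:

```python
# Zero-shot prompting sketch (assumes the `transformers` package is installed;
# prompt and generation settings are illustrative, not from the notes).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# No demonstrations, no gradient updates: the task is specified purely in the prompt.
article = "A long news article goes here ..."
prompt = article + "\nTL;DR:"
out = generator(prompt, max_new_tokens=30, do_sample=False)
print(out[0]["generated_text"])
```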
Instruction Fine-Tuning
Language models, by default, are not aligned with user intent.
- collect paired examples of instruction + output across many tasks
- then, evaluate on unseen tasks
~3 million instruction examples \(\ll\) the billions of examples seen during pretraining, so this stage is comparatively cheap.
evaluation benchmark: MMLU (Massive Multitask Language Understanding)
You can generate an Instruction Fine-Tuning dataset by asking a larger model for it (see Alpaca).
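A rough sketch of how an instruction + output pair becomes a training example, assuming a Hugging Face-style tokenizer and a simple prompt template (both are illustrative choices, not specified above); only the response tokens are supervised:

```python
# Instruction fine-tuning data formatting sketch (prompt template is an assumption).
import torch
import torch.nn.functional as F

def build_example(tokenizer, instruction, output):
    prompt = f"Instruction: {instruction}\nResponse: "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(output, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(prompt_ids + output_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100        # mask the prompt: supervise only the response
    return input_ids, labels

def ift_loss(logits, labels):
    # logits: (seq_len, vocab_size) for one example; standard next-token cross-entropy,
    # ignoring the masked prompt positions
    return F.cross_entropy(
        logits[:-1].reshape(-1, logits.size(-1)),
        labels[1:].reshape(-1),
        ignore_index=-100,
    )
```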
Pros + Cons
- simple and straightforward + generalizes to unseen tasks
- but, it's EXPENSIVE to collect ground-truth data
- ground truths may be wrong
- creative tasks may not have a correct answer
- the LM objective penalizes all token-level mistakes equally, but some mistakes are worse than others
- humans may generate suboptimal answers
Human Preference Modeling
Imagine we have some input \(x\) and two output trajectories, \(y_{1}\) and \(y_{2}\).
Suppose we also have a reward function \(R(x, y)\) that scores how good an output \(y\) is for input \(x\).
We want to maximize the expected reward:
\begin{equation} \mathbb{E}_{\hat{y} \sim p_{\theta}(\hat{y} | x)} R(x, \hat{y}) \end{equation}
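A tiny Monte Carlo sketch of that objective: sample a few outputs from the policy and average their rewards. `policy.sample` and `reward_fn` are hypothetical stand-ins:

```python
# Monte Carlo estimate of E_{y_hat ~ p_theta(y|x)}[R(x, y_hat)].
# `policy.sample` and `reward_fn` are hypothetical; reward_fn returns a float.
def expected_reward(policy, reward_fn, x, n_samples=8):
    total = 0.0
    for _ in range(n_samples):
        y_hat = policy.sample(x)        # y_hat ~ p_theta(y | x)
        total += reward_fn(x, y_hat)    # R(x, y_hat)
    return total / n_samples
```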
RLHF, in broad strokes
- do Instruction Fine-Tuning
- estimate a reward model \(R(x,y)\)
- maximize the expected reward under that reward model
Model Preferences as an NLP Problem
Train:
\begin{equation} RM_{\phi}(x, y) \end{equation}
which models human preference scores.
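One common way to realize \(RM_{\phi}\) is a scalar head on top of a language model's final hidden state; the sketch below assumes a backbone that returns hidden states of shape (batch, seq_len, hidden_size), which is an architectural assumption, not something specified above:

```python
# Reward model sketch: LM backbone + scalar scoring head.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone              # assumed to return hidden states
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):             # input_ids encodes (x, y) together
        hidden = self.backbone(input_ids)     # (batch, seq_len, hidden_size)
        last = hidden[:, -1, :]                # hidden state at the final token
        return self.score(last).squeeze(-1)    # scalar RM_phi(x, y) per example
```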
Get preference data
To actually get the preference data, ask humans to RANK the candidate outputs.
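A small sketch of turning one human ranking of \(K\) candidate outputs into \((x, y^{w}, y^{l})\) training pairs; the all-pairs scheme here is one simple choice, not necessarily the one used in practice:

```python
# Convert a ranking of K outputs into (x, y_w, y_l) preference pairs.
from itertools import combinations

def ranking_to_pairs(x, ranked_outputs):
    # ranked_outputs[0] is most preferred, ranked_outputs[-1] is least preferred
    pairs = []
    for i, j in combinations(range(len(ranked_outputs)), 2):
        pairs.append((x, ranked_outputs[i], ranked_outputs[j]))  # (x, y_w, y_l)
    return pairs
```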
Bradley-Terry Preference Model
Suppose a human chose \(y^{w}\) over \(y^{l}\). Then, the Bradley-Terry Preference Model tells us that a good reward model \(R\) will minimize:
\begin{equation} J(\phi) = -\mathbb{E}_{(x, y^{w}, y^{l})} \qty[\log \sigma \qty(R_{\phi}(x, y^{w}) - R_{\phi}(x, y^{l}))] \end{equation}
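A sketch of this loss in code, reusing the reward-model head sketched earlier (an assumption for illustration); each input encodes an \((x, y)\) pair and the loss is averaged over a batch of triples:

```python
# Bradley-Terry loss: -log sigma(R(x, y_w) - R(x, y_l)), averaged over the batch.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, winner_ids, loser_ids):
    r_w = reward_model(winner_ids)    # R_phi(x, y_w), shape (batch,)
    r_l = reward_model(loser_ids)     # R_phi(x, y_l), shape (batch,)
    return -F.logsigmoid(r_w - r_l).mean()
```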
PPO
Then, we optimize this:
\begin{equation} \max_{\theta} \mathbb{E}\qty[ RM_{\phi}(x, \hat{y}) - \beta \log \qty( \frac{P_{\theta}^{RL} (\hat{y} | x)}{P_{\theta}^{orig} (\hat{y} | x)})] \end{equation}
The second term penalizes the policy for drifting too far from the original model (in expectation it is a KL penalty, weighted by \(\beta\)).
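A sketch of the per-sample quantity being maximized, assuming per-sequence log-probabilities (summed over the tokens of \(\hat{y}\)) under both models are already available; the actual optimization of this quantity is then done with PPO:

```python
# KL-penalized reward for one sampled y_hat:
#   RM(x, y_hat) - beta * log(p_RL(y_hat|x) / p_orig(y_hat|x))
def penalized_reward(rm_score, logp_rl, logp_orig, beta=0.1):
    kl_penalty = beta * (logp_rl - logp_orig)   # log-ratio of policy vs original model
    return rm_score - kl_penalty                 # quantity maximized (e.g. via PPO)
```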
DPO
What if there is a way to write \(R_{\phi}(x,y)\) directly in terms of \(p_{\theta}^{RL}(\hat{y}|x)\)?
Our goal is to solve this problem:
\begin{equation} \max_{\theta} \mathbb{E}\qty[ RM(x, \hat{y}) - \beta \log \qty( \frac{P_{\theta}^{RL} (\hat{y} | x)}{P_{\theta}^{orig} (\hat{y} | x)})] \end{equation}
There’s actually a closed-form solution to this!
\begin{equation} p^{*} (\hat{y} | x) = \frac{1}{Z(x)} p^{orig} (\hat{y}|x) \exp \qty(\frac{1}{\beta} RM(x, \hat{y})) \end{equation}
where \(Z(x) = \sum_{\hat{y}}^{} p^{orig}(\hat{y}|x) \exp \qty(\frac{1}{\beta} RM(x, \hat{y}))\) sums over all possible outputs \(\hat{y}\).
Notice! Computing the normalization term \(Z\) is intractable, since it sums over every possible output! But set that aside for now and rearrange this equation to get:
\begin{equation} RM(x, \hat{y}) = \beta \log \frac{p^{*}(\hat{y}|x)}{p^{orig}(\hat{y}|x)} + \beta \log Z(x) \end{equation}
Now, we want to solve for \(p^{*}\) given a reward signal, so let's parametrize it:
\begin{equation} RM_{\theta}(x, \hat{y}) = \beta \log \frac{p_{\theta}(\hat{y}|x)}{p^{orig}(\hat{y}|x)} + \beta \log Z(x) \end{equation}
(Issue: at initialization \(p_{\theta} = p^{orig}\), so the log-ratio is \(\log(1) = 0\) and the implied reward is \(0\) everywhere… we will get to that. Also, \(Z\) is still intractable.)
Now! Recall the Bradley-Terry Preference Model: a good \(RM_{\theta}\) should minimize:
\begin{equation} \min_{\theta} -\mathbb{E}_{(x, y^{w}, y^{l})} \qty[\log \sigma \qty(RM_{\theta}(x, y^{w}) - RM_{\theta}(x, y^{l}))] \end{equation}
Plugging our expression for \(RM_{\theta}\) from the previous equation in here, you'll notice the \(Z\) CANCELS OUT! This gives:
\begin{equation} \min_{\theta} -\mathbb{E}_{(x, y^{w}, y^{l})} \qty[\log \sigma \qty(\beta \log \frac{p_{\theta}(y^{w}|x)}{p^{orig}(y^{w}|x)} - \beta \log \frac{p_{\theta}(y^{l}|x)}{p^{orig}(y^{l}|x)})] \end{equation}
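This is the DPO loss. A sketch of it in code, assuming per-sequence log-probabilities (summed over tokens) of \(y^{w}\) and \(y^{l}\) under the trained policy \(p_{\theta}\) and the frozen \(p^{orig}\) are already computed:

```python
# DPO loss from per-sequence log-probs; Z(x) has already cancelled out.
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_w, logp_theta_l, logp_orig_w, logp_orig_l, beta=0.1):
    # beta * log(p_theta / p_orig) is the implicit reward for each output
    implicit_reward_w = beta * (logp_theta_w - logp_orig_w)
    implicit_reward_l = beta * (logp_theta_l - logp_orig_l)
    return -F.logsigmoid(implicit_reward_w - implicit_reward_l).mean()
```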