What we have been building since ChatGPT at H4.
- No pretraining in any way
Basic Three Steps
Goal: “helpful, harmless, honest, and huggy” bots.
- 1) Pretraining step: large-scale next-token prediction
- 2) In-context learning: few-shot learning without updating parameters (a toy prompt is sketched below)
- “Helpful” step
  - 3) Use supervised data to perform supervised fine-tuning (SFT)
- “Harmless” steps
  - 4) Train a classifier for ranking results (a reward model)
  - 5) RLHF
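To make step 2) concrete, here is a toy few-shot prompt; the translation example is illustrative and not from the talk.

```python
# A toy illustration of in-context learning: behavior is steered by examples
# placed in the prompt alone, with no parameter updates.
few_shot_prompt = (
    "Translate English to French.\n"
    "English: cheese => French: fromage\n"
    "English: bread  => French: pain\n"
    "English: water  => French:"
)
# Feeding this prompt to a pretrained LM typically completes it with "eau".
```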
Benchmarking
Before we started to train, we had a problem: most benchmarks target generic reasoning, which only evaluates steps 1) and 2). Therefore, we need new metrics for steps 4) and 5).
So:
Evaluating Instruction-Following and “Chatty-ness”
Pairwise Elo ratings leaderboard from 🤗 + AlpacaEval. Both use GPT-4 as the automated evaluator, as well as humans. MT-Bench from LMSYS is a newer benchmark for the same thing, but it supports multi-turn evaluation.
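As a rough sketch of how such a pairwise leaderboard works, the standard Elo update can be applied to each head-to-head judgment; the K-factor and starting ratings below are illustrative assumptions.

```python
# Minimal Elo update from a single pairwise judgment (judge can be GPT-4 or a human).
def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a_new = rating_a + k * (score_a - expected_a)
    rating_b_new = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a_new, rating_b_new

# Two models starting at 1000, model A judged better:
print(update_elo(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```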
Three main effects observed:
- results improve slightly the longer the prompt is
- GPT-4 as the MT-Bench judge assigns worse scores to GPT-4-like data
- adding more data to fine-tuning had diminishing returns after thousands of samples
TruthfulQA is the most differentiating benchmark; on most others, models score about the same.
Evaluating Reward Models
There are no open-source reward models, and there is not much on evaluation or datasets for red teaming; the only dataset out there is Anthropic’s red-teaming data.
https://huggingface.co/blog/red-teaming
Wackiness of GPT-4 as an Evaluator
Why is everybody using GPT-4 as a proxy for humans?
- GPT-4 has a left positional bias (if you admonish it about this, it will prefer the second option instead :/), while humans rate positions roughly uniformly; a swap-and-check mitigation is sketched below
- “Doping”: GPT-4 prefers models trained on data that it itself generated
- GPT-4 prefers a large variance in unique tokens
- GPT-4 correlates poorly with humans on “low-entropy” factual tasks (QA, summarization, code); the correlation is better for brainstorming and creative generation
arXiv:2306.05685
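Given the positional bias above, a common mitigation is to ask the judge twice with the answer order swapped and only accept consistent verdicts. This sketch assumes a hypothetical `judge` callable returning "A" or "B"; it is not a specific API.

```python
# Swap-and-check debiasing for an LLM judge with positional bias.
def debiased_verdict(judge, prompt, answer_1, answer_2):
    first = judge(prompt, answer_1, answer_2)   # answer_1 shown first
    second = judge(prompt, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # inconsistent verdicts are treated as a tie
```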
Supervised Fine-Tuning
Data
“Self-Instruct” dataset, Wang et al. 2022 => “Surge Instruct”, Hugging Face 2023. Each example has three fields (a toy record is sketched below):
- Instruction (what to do)
- Input (what to do it on)
- Output (what you are supposed to produce)
Goal: “helpful and chatty”
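A toy record in this three-field format might look as follows; the field names mirror the list above, and the content is invented for illustration.

```python
# Hypothetical Self-Instruct / Surge Instruct style record.
example = {
    "instruction": "Summarize the text in one sentence.",
    "input": "The H4 team fine-tunes open models with supervised data and RLHF.",
    "output": "H4 fine-tunes open models using supervised data and RLHF.",
}
```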
Bootstrapping Data Generation
- 175 seed tasks: 1 instruction + 1 input/output pair
- Give them to a language model to generate more instructions
- The language model then generates inputs and outputs for the new instructions; low-quality and near-duplicate candidates are filtered out and the rest are added back to the pool (loop sketched below)
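A rough sketch of that bootstrapping loop, assuming a hypothetical `generate_instructions` wrapper around the language model; the real Self-Instruct pipeline filters with a ROUGE-based similarity check, for which a simple exact-match test stands in here.

```python
import random

def bootstrap(seed_tasks, generate_instructions, target_size=1000, max_rounds=500):
    pool = list(seed_tasks)  # e.g. the 175 human-written seed tasks
    for _ in range(max_rounds):
        if len(pool) >= target_size:
            break
        # sample a handful of existing tasks as few-shot examples for the model
        examples = random.sample(pool, k=min(8, len(pool)))
        for candidate in generate_instructions(examples):
            # keep only candidates that are not trivial duplicates of the pool
            if all(candidate.strip().lower() != t.strip().lower() for t in pool):
                pool.append(candidate)
    return pool
```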
Human-in-the-Loop Data Generation
UltraChat, Ding et al. 2023
- a human does some research on the topic and creates a prompt
- ask an LLM to generate the output
- if the result is not good, rephrase the prompt
- repeat until it is good (see the loop sketch below)
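A bare-bones sketch of that human-in-the-loop cycle; `ask_llm` is a hypothetical stand-in for whatever model generates the outputs, and the human acts as the quality filter.

```python
def ask_llm(prompt: str) -> str:
    # placeholder: swap in a real model call (API or local pipeline)
    return f"<model output for: {prompt}>"

prompt = input("Write a prompt based on your research: ")
while True:
    print(ask_llm(prompt))
    if input("Good enough? [y/n] ").strip().lower() == "y":
        break
    prompt = input("Rephrase the prompt: ")
```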
Roleplaying Data Generation
Have two models role-play to generate and correct data.
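A sketch of what such role-play generation could look like, assuming the two models are wrapped as plain callables; `user_model` and `assistant_model` are hypothetical names.

```python
def simulate_dialogue(user_model, assistant_model, topic, turns=3):
    dialogue = []
    user_msg = user_model(f"You are a curious user. Ask a question about {topic}.")
    for _ in range(turns):
        assistant_msg = assistant_model(user_msg)
        dialogue.append({"user": user_msg, "assistant": assistant_msg})
        # the user model reacts to the answer and asks a follow-up question
        user_msg = user_model(
            f"You are the user. Given the answer '{assistant_msg}', ask a follow-up question."
        )
    return dialogue
```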
Hugging Face Surge-Instruct
Humans write everything from scratch. With a pretrained model, diminishing returns are seen after a few thousand high-quality examples.
Task Distribution
What should the topics be?
Use InstructGPT as guidance: the largest section is generation (45.6%), with OpenQA the second largest (12.4%).
HF replaced the “Other” section (3.5%) of the InstructGPT distribution with code work.
Length Distribution
How long should the prompts be? We collected length distributions, and Surge Instruct seems to be closest to InstructGPT.
Both Anthropic and InstructGPT used a US-based taskforce, and so did 🤗:
- US-based taskforce
- roughly even gender split
- 19 to 62 years old
- primarily white
- technical degree to PhD
Only single-turn data was used; multi-turn fine-tuning wasn’t a thing a few months ago.
Training
- StarCoder, Falcon, Llama 2
- Full (“true”) fine-tuning + PEFT (LoRA); a minimal LoRA setup is sketched below
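For reference, a minimal LoRA setup with the `peft` library might look like this; the model name, target modules, and hyperparameters are illustrative assumptions, not necessarily what H4 used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "tiiuae/falcon-7b"  # StarCoder or Llama 2 work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```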
The HF part of RLHF
Scale’s labels agreed with Surge’s and H4’s a lot more, but mostly no one agreed with anyone.
Goal: “safe and factual”
Task Distribution
The distribution is much more about factual things (because we want to encourage factualness), so there is much more math and code than general generation. These are also easier to score.
Ranking Guidelines
- OpenAI has guidelines about how to rate
- Rate every turn of the dialogue
- Keep the total length small (<2048 tokens)
- Helpfulness OVER honesty; this is the opposite of OpenAI’s guidance, because the model wasn’t large enough to be very honest
Two-step selection: “which one is better” => “how much better is it”
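The “which one is better” labels are typically turned into a reward model with a pairwise (Bradley-Terry style) loss; the sketch below shows that generic formulation, not necessarily H4’s exact objective.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    # push the scalar reward of the preferred completion above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# rewards produced by the reward model for three preference pairs
loss = reward_ranking_loss(torch.tensor([1.2, 0.3, 0.8]),
                           torch.tensor([0.4, 0.5, -0.1]))
```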