Problems with pre-training data
- pre-training data influences downstream capabilities
- …and therefore can escape into model generations
- real-world users expect novelty
Changes in Distribution
Big Pretraining Data
GPT2
- deduplicated data
- Removed Wikipedia (to prevent data leakage into evaluation sets)
- Heuristic-based cleaning (sketch below)
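A minimal sketch of what heuristic-based cleaning can look like; the rules and thresholds below are illustrative assumptions in the spirit of WebText/Gopher-style filters, not GPT-2's actual pipeline:

```python
# Illustrative heuristic cleaning filter. Rules and thresholds are assumptions,
# not the real GPT-2 cleaning recipe.

def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if not (50 <= len(words) <= 100_000):             # drop very short / very long docs
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                     # drop gibberish or run-on tokens
        return False
    alpha_frac = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_frac < 0.7:                              # drop mostly-symbol / markup pages
        return False
    return True

docs = ["a scraped page with some prose ...", "@@@ ### $$$ 12345"]
clean = [d for d in docs if passes_heuristics(d)]     # keep only docs passing all rules
```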
GPT3
- Deduplicated
- Filtered overlaps with benchmark data to limit leakage
Llama
the usual spiel:
- removed high-perplexity data using a Wikipedia n-gram model (sketch below)
- removed non-English
- deduplicated
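A toy illustration of perplexity filtering. A small add-one-smoothed bigram model stands in for the Wikipedia n-gram model (in practice a KenLM model trained on Wikipedia text); the corpus, documents, and cutoff are purely illustrative:

```python
import math
from collections import Counter

# Toy stand-in for a Wikipedia-trained n-gram LM: a bigram model with
# add-one smoothing, trained on a tiny "reference" corpus.

def train_bigram(corpus_tokens):
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    def logprob(prev, word):
        # add-one smoothed P(word | prev)
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return logprob

def perplexity(tokens, logprob):
    lp = sum(logprob(p, w) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-lp / max(len(tokens) - 1, 1))

reference = "the cat sat on the mat the dog sat on the rug".split()
logprob = train_bigram(reference)

docs = ["the cat sat on the rug", "zzz qqq xxx vvv"]
PPL_THRESHOLD = 5.0   # toy cutoff; real pipelines tune this per language
kept = [d for d in docs if perplexity(d.split(), logprob) < PPL_THRESHOLD]
# with this tiny reference corpus, the English-like doc is kept and the gibberish dropped
```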
Llama 2
- removed data from sites with high volumes of PII (sketch below)
- removed non-English
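A rough sketch of pattern-based PII detection. The regexes and the drop-if-any-hit rule are illustrative assumptions, not Llama 2's actual site-level procedure:

```python
import re

# Illustrative PII detection: flag documents containing common identifier
# patterns (emails, US-style phone numbers). Real pipelines use far broader
# pattern sets and site-level statistics; these regexes are assumptions.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def pii_hits(doc: str) -> int:
    return len(EMAIL.findall(doc)) + len(PHONE.findall(doc))

docs = ["contact me at jane.doe@example.com or 555-867-5309", "no pii here"]
kept = [d for d in docs if pii_hits(d) == 0]   # drop docs (or whole sites) with PII hits
```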
Pretraining Curation Decisions
- what to include
- what timestamp / crawl snapshot is being scraped
- heuristic-based cleaning, data cleaning, etc.
- language filtering (only take English?)
- PII removal
- dedup
- Toxicity + SafeURL filtering
- “quality filtering”
- sampling distributions across sources (sketch below)
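As a sketch of source-level sampling distributions, the snippet below draws training documents according to per-source mixture weights rather than raw corpus size; the sources and weights are placeholders, not any particular model's mixture:

```python
import random

# Illustrative source-level sampling: choose a source by mixture weight,
# then a document from that source. Weights are placeholders.

sources = {
    "common_crawl": ["cc doc 1", "cc doc 2", "cc doc 3"],
    "wikipedia":    ["wiki doc 1", "wiki doc 2"],
    "books":        ["book doc 1"],
}
weights = {"common_crawl": 0.6, "wikipedia": 0.25, "books": 0.15}

def sample_batch(n: int):
    names = list(sources)
    probs = [weights[s] for s in names]
    batch = []
    for _ in range(n):
        src = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(sources[src]))   # small sources can repeat: upsampling
    return batch

print(sample_batch(5))
```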
Change in Model Age
Performance is best when the evaluation year aligns with the pre-training year, even when older data is mixed in.
Implication: “a fine-tuned T5 may still be worse than a fine-tuned Llama, because T5 was pre-trained on older data, even if the fine-tuning data is newer”
Change in Toxicity
Filtering toxic content out of pre-training data made the model worse at identifying toxicity.
Change in Data Distribution
Models do worse on evaluations whose domain falls outside the pre-training data distribution.
Reduce Memorization
- de-duplication using approximate matching (MinHash sketch below)
- think carefully about multi-epoch training (what is OK to memorize?)
- remove sensitive memorization from pre-training data
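A minimal MinHash sketch of approximate-matching de-duplication over word 5-gram shingles. Real pipelines add LSH banding to avoid pairwise comparison at corpus scale; the shingle size, number of hashes, and similarity threshold here are illustrative:

```python
import hashlib

# MinHash signatures over word 5-gram shingles; similar signatures imply
# high Jaccard similarity, i.e. near-duplicate documents.

NUM_HASHES = 64
SHINGLE = 5

def shingles(text: str):
    words = text.lower().split()
    return {" ".join(words[i:i + SHINGLE])
            for i in range(max(len(words) - SHINGLE + 1, 1))}

def minhash(text: str):
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a, sig_b) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river bend"
print(est_jaccard(minhash(a), minhash(b)))   # high score => near-duplicate, keep only one
```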
Two iffy strategies:
Check for memorization
Trivial style transfers can get around such checks: “do the [copyrighted thing] in French”; “do the [copyrighted thing] with double the spaces”.
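A sketch of a naive verbatim-memorization check and of why trivial transforms slip past it; the span length and example strings are illustrative:

```python
# Naive verbatim-memorization check: does a generation contain a long character
# span copied verbatim from the training corpus? Exact matching like this is
# easy to evade with trivial transforms (translation, doubled spaces).

SPAN = 40  # flag any 40-character substring copied verbatim (illustrative)

def verbatim_overlap(generation: str, corpus: str, span: int = SPAN) -> bool:
    return any(generation[i:i + span] in corpus
               for i in range(len(generation) - span + 1))

corpus = "it was the best of times, it was the worst of times, it was the age of wisdom"
copied  = "it was the best of times, it was the worst of times"
evasion = copied.replace(" ", "  ")          # "double the spaces"

print(verbatim_overlap(copied, corpus))      # True  -> flagged
print(verbatim_overlap(evasion, corpus))     # False -> trivial transform slips through
```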
Use RLHF or something
RLHF tends to “hide flaws, and not eliminate them”: patching edge cases doesn’t eliminate the underlying vulnerability.