Problems with pre-training data
- pre-training data influences downstream capabilities
- …and therefore can escape into model generations
- real-world users expect novelty
Changes in Distribution
Big Pretraining Data
GPT2
- deduplicated data
- Removed Wikipedia (to prevent data leakage into evaluation sets)
- Heuristic-based cleaning (sketch below)
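A minimal sketch of what heuristic-based cleaning can look like; the rules and thresholds below are illustrative assumptions in the spirit of WebText/Gopher-style filters, not GPT-2's actual pipeline:

```python
# Illustrative heuristic cleaning filter. Rules and thresholds are assumptions,
# not the real GPT-2 cleaning recipe.

def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if not (50 <= len(words) <= 100_000):             # drop very short / very long docs
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                     # drop gibberish or run-on tokens
        return False
    alpha_frac = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_frac < 0.7:                              # drop mostly-symbol / markup pages
        return False
    return True

docs = ["a scraped page with some prose ...", "@@@ ### $$$ 12345"]
clean = [d for d in docs if passes_heuristics(d)]     # keep only docs passing all rules
```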
GPT3
- Deduplicated
- Filtered overlaps with benchmark data to limit leakage
Llama
the usual spiel:
- removed high-perplexity data using a Wikipedia n-gram model (sketch below)
- removed non-English
- deduplicated
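A toy illustration of perplexity filtering. A small add-one-smoothed bigram model stands in for the Wikipedia n-gram model (in practice a KenLM model trained on Wikipedia text); the corpus, documents, and cutoff are purely illustrative:

```python
import math
from collections import Counter

# Toy stand-in for a Wikipedia-trained n-gram LM: a bigram model with
# add-one smoothing, trained on a tiny "reference" corpus.

def train_bigram(corpus_tokens):
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    def logprob(prev, word):
        # add-one smoothed P(word | prev)
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
    return logprob

def perplexity(tokens, logprob):
    lp = sum(logprob(p, w) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-lp / max(len(tokens) - 1, 1))

reference = "the cat sat on the mat the dog sat on the rug".split()
logprob = train_bigram(reference)

docs = ["the cat sat on the rug", "zzz qqq xxx vvv"]
PPL_THRESHOLD = 5.0   # toy cutoff; real pipelines tune this per language
kept = [d for d in docs if perplexity(d.split(), logprob) < PPL_THRESHOLD]
# with this tiny reference corpus, the English-like doc is kept and the gibberish dropped
```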
Llama 2
- removed data from sites with high volumes of PII (sketch below)
- removed non-English
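A rough sketch of pattern-based PII detection. The regexes and the drop-if-any-hit rule are illustrative assumptions, not Llama 2's actual site-level procedure:

```python
import re

# Illustrative PII detection: flag documents containing common identifier
# patterns (emails, US-style phone numbers). Real pipelines use far broader
# pattern sets and site-level statistics; these regexes are assumptions.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def pii_hits(doc: str) -> int:
    return len(EMAIL.findall(doc)) + len(PHONE.findall(doc))

docs = ["contact me at jane.doe@example.com or 555-867-5309", "no pii here"]
kept = [d for d in docs if pii_hits(d) == 0]   # drop docs (or whole sites) with PII hits
```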
Pretraining Curation Decisions
- what to include
- what timestamp / crawl snapshot is being scraped
- heuristic-based cleaning, data cleaning, etc.
- language filtering (only take English?)
- PII removal
- dedup
- Toxicity + SafeURL filtering
- “quality filtering”
- sampling distributions across sources (sketch below)
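As a sketch of source-level sampling distributions, the snippet below draws training documents according to per-source mixture weights rather than raw corpus size; the sources and weights are placeholders, not any particular model's mixture:

```python
import random

# Illustrative source-level sampling: choose a source by mixture weight,
# then a document from that source. Weights are placeholders.

sources = {
    "common_crawl": ["cc doc 1", "cc doc 2", "cc doc 3"],
    "wikipedia":    ["wiki doc 1", "wiki doc 2"],
    "books":        ["book doc 1"],
}
weights = {"common_crawl": 0.6, "wikipedia": 0.25, "books": 0.15}

def sample_batch(n: int):
    names = list(sources)
    probs = [weights[s] for s in names]
    batch = []
    for _ in range(n):
        src = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(sources[src]))   # small sources can repeat: upsampling
    return batch

print(sample_batch(5))
```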
Change in Model Age
Performance is best when the evaluation year aligns with the pre-training year, even when older data is mixed in.
Implication: “a fine-tuned T5 may still be worse than a fine-tuned Llama, because T5 was pre-trained on older data, even if the fine-tuning data is newer”
Change in Toxicity
Filtering toxic content out of pre-training data made the model worse at identifying toxicity.
Change in Data Distribution
Models do worse on evaluations whose domain falls outside the pre-training data distribution.
Reduce Memorization
- de-duplication using approximate matching (MinHash sketch below)
- think carefully about multi-epoch training (what is OK to memorize?)
- remove sensitive memorization from pre-training data
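A minimal MinHash sketch of approximate-matching de-duplication over word 5-gram shingles. Real pipelines add LSH banding to avoid pairwise comparison at corpus scale; the shingle size, number of hashes, and similarity threshold here are illustrative:

```python
import hashlib

# MinHash signatures over word 5-gram shingles; similar signatures imply
# high Jaccard similarity, i.e. near-duplicate documents.

NUM_HASHES = 64
SHINGLE = 5

def shingles(text: str):
    words = text.lower().split()
    return {" ".join(words[i:i + SHINGLE])
            for i in range(max(len(words) - SHINGLE + 1, 1))}

def minhash(text: str):
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def est_jaccard(sig_a, sig_b) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river bend"
print(est_jaccard(minhash(a), minhash(b)))   # high score => near-duplicate, keep only one
```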
Two iffy strategies:
Check for memorization
Trivial style transfers can get around such checks: “do the [copyrighted thing] in French”; “do the [copyrighted thing] with double the spaces”.
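A sketch of a naive verbatim-memorization check and of why trivial transforms slip past it; the span length and example strings are illustrative:

```python
# Naive verbatim-memorization check: does a generation contain a long character
# span copied verbatim from the training corpus? Exact matching like this is
# easy to evade with trivial transforms (translation, doubled spaces).

SPAN = 40  # flag any 40-character substring copied verbatim (illustrative)

def verbatim_overlap(generation: str, corpus: str, span: int = SPAN) -> bool:
    return any(generation[i:i + span] in corpus
               for i in range(len(generation) - span + 1))

corpus = "it was the best of times, it was the worst of times, it was the age of wisdom"
copied  = "it was the best of times, it was the worst of times"
evasion = copied.replace(" ", "  ")          # "double the spaces"

print(verbatim_overlap(copied, corpus))      # True  -> flagged
print(verbatim_overlap(evasion, corpus))     # False -> trivial transform slips through
```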
Use RLHF or something
RLHF tends to “hide flaws, and not eliminate them”: patching edge cases doesn’t eliminate the underlying vulnerability.