Subword
We use SUBWORD modeling to deal with:
- combinatorial morphology (resolving word forms, infinitives, etc.): “a single word has a million forms in Finnish”; also novel coinages like “transformify”
- misspelling
- extensions/emphasis (“gooooood vibessssss”)
You mark each actual word ending (or each continuation piece) with some kind of boundary marker, e.g. an end-of-word symbol or BERT's ## prefix, so the original words can be recovered; see the sketch below.
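A minimal sketch of how such markers work, assuming BERT-style “##” continuation markers (the segmentation below is made up for illustration, not the output of a real tokenizer):

```python
# Hypothetical segmentation with "##" continuation markers; pieces without
# "##" start a new word, so word boundaries stay recoverable.
def detokenize(pieces):
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # glue continuation onto the previous word
        else:
            words.append(piece)      # start a new word
    return " ".join(words)

print(detokenize(["goo", "##ooo", "##od", "vibes", "##sssss"]))
# -> "gooooood vibessssss"
```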
To build the subword vocabulary in the first place:
Byte-Pair Encoding
“find pieces of words that are common and treat them as a vocabulary”
- start with a vocab containing only the individual characters and an end-of-word symbol
- look at the corpus and find the most common pair of adjacent tokens
- merge that pair into a new subword, add it to the vocab, and replace all instances of the pair with it
- repeat the find-and-merge steps until the vocab is big enough (see the sketch after this list)
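A minimal sketch of that learning loop in the textbook (Sennrich-style) formulation; the toy corpus and merge count are arbitrary:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged subword."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(corpus_words, num_merges):
    # Start from characters plus an end-of-word symbol.
    vocab = Counter(" ".join(word) + " </w>" for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most common adjacent pair
        vocab = merge_pair(best, vocab)    # merge it everywhere in the corpus
        merges.append(best)                # the merge list defines the subword vocab
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], num_merges=6))
```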
Writing Systems
- phonemic (directly translating sounds, see Spanish)
- fossilized phonemic (English, where sounds are whack)
- syllabic/moraic (each syllable or mora gets its own symbol)
- ideographic (symbols carry meaning rather than sound)
- a combination of the above (Japanese)
Whole-Model Pretraining
- all parameters are initialized via pretraining
- don’t even bother training word vectors
MLM and NTP are “Universal Tasks”
Because in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.
Why Pretraining
- maybe local minima near pretraining weights generalize well
- or maybe, because the pretrained outputs are already sensible, gradients are well-modulated and propagate nicely
Types of Architecture
Encoders
- bidirectional context
- can condition on the future
BERT
- of the ~15% of input tokens selected for prediction, replace the word with [MASK] 80% of the time
- replace the word with a RANDOM WORD 10% of the time
- leave the word unchanged 10% of the time
i.e. BERT then has to recover a proper sentence representation from lots of noise (sketched below)
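A minimal sketch of that corruption step, operating on plain token strings for readability (real BERT works on WordPiece ids; the `mask_for_mlm` helper and `vocab` list are assumptions for illustration):

```python
import random

def mask_for_mlm(tokens, vocab, select_rate=0.15):
    """Corrupt a token sequence the way BERT's MLM pretraining does."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= select_rate:
            continue                          # position not selected: no loss here
        labels[i] = tok                       # BERT must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the word unchanged
    return inputs, labels
```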
Original BERT was also pretrained with a next-sentence-prediction loss in addition to MLM, but that ended up being unnecessary.
BERTish
- RoBERTa: train for longer on more data (and drop the next-sentence-prediction loss)
- SpanBERT: mask contiguous spans instead of individual tokens
Encoder/Decoder
- do both
- pretraining is maybe harder (the right objective is less obvious)
T5
Encoder/Decoder model. Pretraining task: span corruption, i.e. mask out spans and have the model fill in the blanks:
“Thank you for inviting me to your party last week”
“Thank you <x> to your <y> last week” => “<x> for inviting me <y> party <z>”
This actually is BETTER than the LM training objective.
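A minimal sketch of the span-corruption format from the example above; real T5 samples span locations randomly and uses sentinel ids like `<extra_id_0>`, whereas here the spans and the `<x>/<y>/<z>` names are hard-coded to match the notes:

```python
def span_corrupt(tokens, spans):
    """Replace each masked span in the input with a sentinel; the target lists
    each sentinel followed by the tokens it hid, ending with a final sentinel."""
    sentinels = iter(["<x>", "<y>", "<z>", "<w>"])
    inp, tgt, i = [], [], 0
    for start, end in spans:               # spans = [(start, end), ...], end exclusive
        inp.extend(tokens[i:start])
        s = next(sentinels)
        inp.append(s)                      # sentinel stands in for the span
        tgt.append(s)
        tgt.extend(tokens[start:end])      # target reveals the hidden tokens
        i = end
    inp.extend(tokens[i:])
    tgt.append(next(sentinels))            # closing sentinel marks end of target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens, spans=[(2, 5), (7, 8)]))
# -> ('Thank you <x> to your <y> last week', '<x> for inviting me <y> party <z>')
```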
Decoder
- general LMs use this
- nice to generate from + cannot condition on future words (mask sketch below)
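A minimal sketch of why, using a NumPy causal attention mask (an encoder would simply use an all-ones mask, which is what gives it bidirectional context):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions j <= i,
    which is exactly why a decoder cannot condition on future words."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```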
In-Context Learning
- only really becomes very capable at hundreds of billions of parameters
- uses no gradient steps: just repeat the examples in the prompt and let the model attend to them (sketch below)
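A minimal sketch of few-shot prompt construction; the sentiment task and prompt format are invented for illustration:

```python
def few_shot_prompt(examples, query):
    """Pack labeled demonstrations directly into the prompt; the frozen LM
    conditions on them with no gradient updates."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    examples=[("Loved every minute of it.", "positive"),
              ("A dull, lifeless slog.", "negative")],
    query="Surprisingly charming and fun.",
)
# Feed `prompt` to a large decoder-only LM; ideally it continues with
# " positive" purely by attending to the two in-context demonstrations.
```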