Subword
We use SUBWORD modeling to deal with:
- combinatorial morphology (resolving word forms, infinitives, etc.): “a single word has a million forms in Finnish”; also novel coinages like “transformify”
- misspelling
- extensions/emphasis (“gooooood vibessssss”)
You mark each actual word ending (or each continuation piece) with some kind of boundary marker, e.g. an end-of-word symbol or BERT's ## prefix, so the original words can be recovered; see the sketch below.
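A minimal sketch of how such markers work, assuming BERT-style “##” continuation markers (the segmentation below is made up for illustration, not the output of a real tokenizer):

```python
# Hypothetical segmentation with "##" continuation markers; pieces without
# "##" start a new word, so word boundaries stay recoverable.
def detokenize(pieces):
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # glue continuation onto the previous word
        else:
            words.append(piece)      # start a new word
    return " ".join(words)

print(detokenize(["goo", "##ooo", "##od", "vibes", "##sssss"]))
# -> "gooooood vibessssss"
```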
To build the subword vocabulary in the first place:
Byte-Pair Encoding
“find pieces of words that are common and treat them as a vocabulary”
- start with a vocab containing only the individual characters and an end-of-word symbol
- look at the corpus and find the most common pair of adjacent tokens
- merge that pair into a new subword, add it to the vocab, and replace all instances of the pair with it
- repeat the find-and-merge steps until the vocab is big enough (see the sketch after this list)
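A minimal sketch of that learning loop in the textbook (Sennrich-style) formulation; the toy corpus and merge count are arbitrary:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged subword."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(corpus_words, num_merges):
    # Start from characters plus an end-of-word symbol.
    vocab = Counter(" ".join(word) + " </w>" for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most common adjacent pair
        vocab = merge_pair(best, vocab)    # merge it everywhere in the corpus
        merges.append(best)                # the merge list defines the subword vocab
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], num_merges=6))
```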
Writing Systems
- phonemic (directly translating sounds, see Spanish)
- fossilized phonemic (English, where sounds are whack)
- syllabic/moraic (each syllable or mora gets its own symbol)
- ideographic (symbols carry meaning rather than sound)
- a combination of the above (Japanese)
Whole-Model Pretraining
- all parameters are initialized via pretraining
- don’t even bother training word vectors
MLM and NTP are “Universal Tasks”
Because in different circumstances, performing well at MLM and NTP requires {local knowledge, scene representations, language, etc.}.
Why Pretraining
- maybe local minima near pretraining weights generalize well
- or maybe, because the pretrained outputs are already sensible, gradients are well-modulated and propagate nicely
Types of Architecture
Encoders
- bidirectional context
- can condition on the future
BERT
- of the ~15% of input tokens selected for prediction, replace the word with [MASK] 80% of the time
- replace the word with a RANDOM WORD 10% of the time
- leave the word unchanged 10% of the time
i.e. BERT then has to recover a proper sentence representation from lots of noise (sketched below)
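A minimal sketch of that corruption step, operating on plain token strings for readability (real BERT works on WordPiece ids; the `mask_for_mlm` helper and `vocab` list are assumptions for illustration):

```python
import random

def mask_for_mlm(tokens, vocab, select_rate=0.15):
    """Corrupt a token sequence the way BERT's MLM pretraining does."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= select_rate:
            continue                          # position not selected: no loss here
        labels[i] = tok                       # BERT must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the word unchanged
    return inputs, labels
```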
Original BERT was also pretrained with a next-sentence-prediction loss in addition to MLM, but that ended up being unnecessary.
BERTish
- RoBERTa: train for longer on more data (and drop the next-sentence-prediction loss)
- SpanBERT: mask contiguous spans instead of individual tokens
Encoder/Decoder
- do both
- pretraining is maybe harder (the right objective is less obvious)
T5
Encoder/Decoder model. Pretraining task: span corruption, i.e. mask out spans and have the model fill in the blanks:
“Thank you for inviting me to your party last week”
“Thank you <x> to your <y> last week” => “<x> for inviting me <y> party <z>”
This actually is BETTER than the LM training objective.
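A minimal sketch of the span-corruption format from the example above; real T5 samples span locations randomly and uses sentinel ids like `<extra_id_0>`, whereas here the spans and the `<x>/<y>/<z>` names are hard-coded to match the notes:

```python
def span_corrupt(tokens, spans):
    """Replace each masked span in the input with a sentinel; the target lists
    each sentinel followed by the tokens it hid, ending with a final sentinel."""
    sentinels = iter(["<x>", "<y>", "<z>", "<w>"])
    inp, tgt, i = [], [], 0
    for start, end in spans:               # spans = [(start, end), ...], end exclusive
        inp.extend(tokens[i:start])
        s = next(sentinels)
        inp.append(s)                      # sentinel stands in for the span
        tgt.append(s)
        tgt.extend(tokens[start:end])      # target reveals the hidden tokens
        i = end
    inp.extend(tokens[i:])
    tgt.append(next(sentinels))            # closing sentinel marks end of target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens, spans=[(2, 5), (7, 8)]))
# -> ('Thank you <x> to your <y> last week', '<x> for inviting me <y> party <z>')
```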
Decoder
- general LMs use this
- nice to generate from + cannot condition on future words (mask sketch below)
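A minimal sketch of why, using a NumPy causal attention mask (an encoder would simply use an all-ones mask, which is what gives it bidirectional context):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may only attend to positions j <= i,
    which is exactly why a decoder cannot condition on future words."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```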
In-Context Learning
- only really becomes very capable at hundreds of billions of parameters
- uses no gradient steps: just repeat the examples in the prompt and let the model attend to them (sketch below)
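A minimal sketch of few-shot prompt construction; the sentiment task and prompt format are invented for illustration:

```python
def few_shot_prompt(examples, query):
    """Pack labeled demonstrations directly into the prompt; the frozen LM
    conditions on them with no gradient updates."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    examples=[("Loved every minute of it.", "positive"),
              ("A dull, lifeless slog.", "negative")],
    query="Surprisingly charming and fun.",
)
# Feed `prompt` to a large decoder-only LM; ideally it continues with
# " positive" purely by attending to the two in-context demonstrations.
```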