What makes language modeling hard: resolving ambiguity.
“the chef made her duck” (did the chef cook a duck for her, or cause her to duck?)
Contents
Basic Text Processing
- regex
- ELIZA
- tokenization and corpus
- text normalization
- tokenization + Subword Tokenization
- Word Normalization
- lemmatization through morphological parsing
- stemming, keeping only the stem of each word: Porter stemmer
- sentence segmentation
- N-Grams
Edit Distance
For strings of lengths \(n\) and \(m\), filling the DP table costs \(O(nm)\) time and space; the backtrace costs \(O(n+m)\).
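A minimal sketch of where those two costs come from. The function name `edit_distance`, the `sub_cost` parameter, and the operation labels are illustrative choices, not something these notes define; substitution cost defaults to 1 (plain Levenshtein), and some textbook variants use 2 instead.

```python
def edit_distance(source, target, sub_cost=1):
    """Return (distance, operations) for turning source into target."""
    n, m = len(source), len(target)

    # dist[i][j] = cheapest cost of turning source[:i] into target[:j].
    # Filling this (n+1) x (m+1) table is the O(nm) part.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                               # deletion
                dist[i][j - 1] + 1,                               # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # match / substitution
            )

    # Backtrace: every step decreases i, j, or both, so the walk from
    # (n, m) back to (0, 0) takes at most n + m steps.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = source[i - 1] == target[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else sub_cost):
                ops.append("match" if same else "substitute")
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()
    return dist[n][m], ops


distance, ops = edit_distance("intention", "execution")
print(distance)  # 5 with substitution cost 1
print(edit_distance("intention", "execution", sub_cost=2)[0])  # 8
```

Keeping the whole table is what makes the backtrace possible; if only the distance is needed, two rows suffice.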
Ngrams
Text Classification
Logistic Regression
- Generative Classifier vs Discriminative Classifier
- Logistic Regression Text Classification
- cross entropy loss
- stochastic gradient descent
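How the last three items fit together can be sketched as below. This is a toy illustration, not a reference implementation: the feature choice (counts of positive and negative words), the data, the learning rate, and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """P(y = 1 | x) under the logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy loss for one example; eps guards log(0)."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def train_sgd(examples, n_features, lr=0.1, epochs=200, seed=0):
    """Stochastic gradient descent: update after every single example."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = predict_proba(w, b, x)
            # Gradient of the cross-entropy loss w.r.t. w_i is (p - y) * x_i,
            # and w.r.t. b is (p - y).
            error = p - y
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b


# Hypothetical features: [count of positive words, count of negative words].
# Label 1 = positive review, 0 = negative review.
toy_data = [
    ([2, 0], 1), ([1, 0], 1), ([3, 1], 1),
    ([0, 2], 0), ([0, 1], 0), ([1, 3], 0),
]

w, b = train_sgd(toy_data, n_features=2)
mean_loss = sum(cross_entropy(y, predict_proba(w, b, x)) for x, y in toy_data) / len(toy_data)
predictions = [int(predict_proba(w, b, x) >= 0.5) for x, _ in toy_data]
```

The model is discriminative in the sense that it learns \(P(y \mid x)\) directly rather than modeling how the features are generated for each class.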