Houjun Liu
text normalization
two main parts:
tokenization
lemmatization
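
A minimal sketch of the two parts above, assuming a regex-based tokenizer and a tiny hand-written lemma table. The names `LEMMAS`, `tokenize`, `lemmatize`, and `normalize` are hypothetical and exist only for illustration. Lowercasing the input is an extra assumption, not something these notes specify.

```python
import re

# Hypothetical toy lemma table (an assumption for illustration only);
# a real lemmatizer would rely on a full dictionary or morphological analysis.
LEMMAS = {
    "is": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "cats": "cat",
}


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def normalize(text):
    """Tokenize first, then lemmatize the resulting tokens."""
    return lemmatize(tokenize(text))


if __name__ == "__main__":
    print(normalize("The mice were running."))
```

Tokenization runs first because the lemma lookup operates on individual tokens, not on raw text.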