Every NLP task involves some kind of text normalization.
- tokenizing words
- normalizing word formats (e.g., lemmatization)
- sentence and paragraph segmentation
For Latin, Arabic, Cyrillic, and Greek writing systems, spaces can usually be used for tokenization. Other writing systems don't separate words with spaces, so this doesn't work there. See morpheme
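For the space-delimited case, the naive first pass is just splitting on whitespace. A tiny sketch (the sentence is made up); note that punctuation stays attached to words, which is the problem the rest of these notes deal with:

```python
text = "The cat sat on the mat, didn't it?"   # made-up example sentence
print(text.split())
# ['The', 'cat', 'sat', 'on', 'the', 'mat,', "didn't", 'it?']
```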
Subword Tokenization
Algorithms that use corpus statistics to break text into tokens below the word level.
- BPE
- Unigram Language Modeling tokenization
- WordPiece
They all work in two parts:
- a token learner: takes a training corpus and derives a vocabulary (a set of tokens)
- a token segmenter: tokenizes new text according to that vocabulary
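As a rough illustration of the learner/segmenter split, here is a minimal BPE sketch in Python. The function names, the {word: count} corpus format, and the "_" end-of-word marker are choices made for this example, not part of any particular library:

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_counts, num_merges):
    """Token learner: derive an ordered list of merge rules from a {word: count} corpus."""
    vocab = {tuple(w) + ("_",): c for w, c in word_counts.items()}  # "_" marks end of word
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count        # adjacent-pair counts, weighted by word frequency
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

def segment(word, merges):
    """Token segmenter: apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, 8)
print(segment("lowest", merges))
```

Because the segmenter replays the merges in the order they were learned, an unseen word like "lowest" breaks into subword pieces that appeared in the training corpus.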
tr
For space-delimited languages, simple Unix tools like tr can do a rough tokenization:
tr -sc "A-Za-z" "\n" < input.txt
This takes every character that is not a letter (-c is the complement option) and replaces it with a newline. -s squeezes runs of repeated newlines down to one.
This turns the text into one word per line.
Sorting the output (because uniq requires sorted input) and piping it into uniq -c gives a count for each word:
tr -sc "A-Za-z" "\n" < input.txt | sort | uniq -c
We can then do a reverse numerical sort:
tr -sc "A-Za-z" "\n" < input.txt | sort | uniq -c | sort -r -n
which lists the words from most to least frequent.
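The same counts can be reproduced without Unix tools; a small Python sketch (the input filename is just a placeholder, as above):

```python
import re
from collections import Counter

with open("input.txt") as f:
    text = f.read()

# like tr -sc "A-Za-z" "\n": keep only maximal runs of letters
words = re.findall(r"[A-Za-z]+", text)

# like sort | uniq -c | sort -r -n: print counts, most frequent first
for word, count in Counter(words).most_common():
    print(count, word)
```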
This is a bad result much of the time: some tokens contain punctuation that carries meaning and shouldn't be stripped away during tokenization: m.p.h., AT&T, John's, or 1/1/12.
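One common fix is a regular-expression tokenizer whose pattern whitelists these shapes. The pattern below is only a sketch I've put together for these examples, not a complete tokenizer (for instance, it doesn't split off clitics like the 's in John's, which comes up below):

```python
import re

pattern = r"""(?x)            # verbose mode: whitespace and comments in the pattern are ignored
      (?:[A-Za-z]\.)+         # abbreviations such as m.p.h. or U.S.A.
    | \d+(?:/\d+)+            # dates such as 1/1/12
    | \$?\d+(?:\.\d+)?%?      # currency and percentages
    | \w+(?:[&-]\w+)*         # ordinary words, keeping AT&T and hyphenated forms together
    | \S                      # anything else becomes a one-character token
"""

print(re.findall(pattern, "The m.p.h. sign at AT&T fell on 1/1/12."))
# ['The', 'm.p.h.', 'sign', 'at', 'AT&T', 'fell', 'on', '1/1/12', '.']
```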
What to Tokenize
“I do uh main- mainly business data processing”
- uh: filled pause
- main-: fragment
Consider:
“Seuss’s cat in the hat is different from other cats!”
A token is an occurrence of a wordform in running text, so duplicates are counted separately; a word type is a unique, distinct wordform, counted once no matter how often it appears.
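A quick made-up example of the distinction:

```python
words = "the cat sat on the mat".split()   # made-up example sentence
print(len(words))       # 6 tokens: every running occurrence counts, including the repeated "the"
print(len(set(words)))  # 5 types: "the" is only counted once
```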
clitics
- John's: the 's is a clitic, a part of a word that can't stand on its own.
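Splitting the clitic off into its own token is a tokenizer decision; a minimal regex sketch:

```python
import re
print(re.findall(r"'\w+|\w+", "John's book"))
# ['John', "'s", 'book']
```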