Huffman coding constructs an optimal prefix-free code for a source with any given distribution: no other prefix-free code achieves a smaller average codeword length.
Huffman Coding is Bounded
Huffman coding has the property that:
\begin{equation} \bar{\ell} \leq H(X) + 1 \end{equation}
Recall that any prefix-free code for a source has an average codeword length of at least the entropy; combining the two bounds gives:
\begin{equation} H(X) \leq \bar{\ell} \leq H(X) + 1 \end{equation}
for a source \(X\) with entropy \(H(X)\), where \(\bar{\ell}\) is the average codeword length of the Huffman code.
Example
Suppose we have a distribution of a source sequence:
X | P(x) |
---|---|
A | 1/2 |
T | 1/4 |
C | 1/8 |
G | 1/8 |
We iteratively build a tree by merging the two nodes with the smallest probabilities into a single node whose probability is their sum.
So after one merge:
X | P(x) |
---|---|
A | 1/2 |
T | 1/4 |
C, G | 1/8 + 1/8 = 1/4 |
Doing that again:
X | P(x) |
---|---|
A | 1/2 |
T, (C, G) | 1/4 + 1/4 = 1/2 |
Doing that again:
X | P(x) |
---|---|
A, (T, (C, G)) | 1/2 + 1/2 = 1 |
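The merge procedure above can be sketched with a priority queue. This is an illustrative implementation (the function name `huffman_codes` and the tie-breaking rule are my own choices, not from the text); ties between equal probabilities can produce different but equally optimal trees, though with this insertion order it reproduces the example.

```python
import heapq

def huffman_codes(probs):
    """Build Huffman codes by repeatedly merging the two least-probable nodes."""
    # Each heap entry: (probability, tiebreak, tree), where tree is either
    # a symbol or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # smallest probability
        p2, _, t2 = heapq.heappop(heap)  # second smallest
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))  # merged node
        count += 1
    # Walk the tree: 0 on the left edge, 1 on the right edge.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol source edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125})
print(codes)
```

For this source the code lengths come out as 1, 2, 3, 3, matching the tree built by hand.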
This now creates a binary tree where each symbol is a leaf!

```
      .
     / \
    A   .
       / \
      T   .
         / \
        C   G
```
We now assign \(0\) to the left edge and \(1\) to the right edge:
```
      .
    0/ \1
    A   .
      0/ \1
      T   .
        0/ \1
        C   G
```
To get the actual codes, we traverse from the root to each leaf, concatenating the edge labels along the path:
X | Code |
---|---|
A | 0 |
T | 10 |
C | 110 |
G | 111 |
Notice that this code is prefix-free: no codeword is a prefix of any other!
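We can verify both the prefix-free property and the entropy bound for this example. The sketch below hard-codes the table above; note that because every probability here is a power of \(1/2\) (a dyadic distribution), the average codeword length hits the entropy exactly, so the bound \(H(X) \leq \bar{\ell} \leq H(X) + 1\) holds with room to spare.

```python
import math

codes = {"A": "0", "T": "10", "C": "110", "G": "111"}
probs = {"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125}

# Prefix-free check: no codeword is a prefix of another.
words = list(codes.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

# Average codeword length and entropy H(X).
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)  # both are 1.75 bits for this dyadic source
```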