Huffman coding constructs an optimal prefix-free code for a source with any given distribution: no other prefix-free code achieves a smaller average codeword length.
Huffman Coding is Bounded
Huffman coding has the property that:
\begin{equation} \bar{\ell} \leq H(X) + 1 \end{equation}
Recall that any prefix-free code for a source has an average codeword length of at least the entropy; combining the two bounds gives:
\begin{equation} H(X) \leq \bar{\ell} \leq H(X) + 1 \end{equation}
for a source \(X\) with entropy \(H(X)\), where \(\bar{\ell}\) is the average codeword length of the Huffman code.
Example
Suppose we have a distribution of a source sequence:
X | P(x) |
---|---|
A | 1/2 |
T | 1/4 |
C | 1/8 |
G | 1/8 |
We iteratively build a tree by merging the two nodes with the smallest probabilities into a single node whose probability is their sum.
So after one merge:
X | P(x) |
---|---|
A | 1/2 |
T | 1/4 |
C, G | 1/8 + 1/8 = 1/4 |
Doing that again:
X | P(x) |
---|---|
A | 1/2 |
T, (C, G) | 1/4 + 1/4 = 1/2 |
Doing that again:
X | P(x) |
---|---|
A, (T, (C, G)) | 1/2 + 1/2 = 1 |
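The merge procedure above can be sketched with a priority queue. This is an illustrative implementation (the function name `huffman_codes` and the tie-breaking rule are my own choices, not from the text); ties between equal probabilities can produce different but equally optimal trees, though with this insertion order it reproduces the example.

```python
import heapq

def huffman_codes(probs):
    """Build Huffman codes by repeatedly merging the two least-probable nodes."""
    # Each heap entry: (probability, tiebreak, tree), where tree is either
    # a symbol or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)  # smallest probability
        p2, _, t2 = heapq.heappop(heap)  # second smallest
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))  # merged node
        count += 1
    # Walk the tree: 0 on the left edge, 1 on the right edge.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol source edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125})
print(codes)
```

For this source the code lengths come out as 1, 2, 3, 3, matching the tree built by hand.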
This now creates a binary tree where each symbol is a leaf!

```
      .
     / \
    A   .
       / \
      T   .
         / \
        C   G
```
We now assign \(0\) to the left edge and \(1\) to the right edge:
```
      .
    0/ \1
    A   .
      0/ \1
      T   .
        0/ \1
        C   G
```
To get the actual codes, we traverse from the root to each leaf, concatenating the edge labels along the path:
X | Code |
---|---|
A | 0 |
T | 10 |
C | 110 |
G | 111 |
Notice that this code is prefix-free: no codeword is a prefix of any other!
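We can verify both the prefix-free property and the entropy bound for this example. The sketch below hard-codes the table above; note that because every probability here is a power of \(1/2\) (a dyadic distribution), the average codeword length hits the entropy exactly, so the bound \(H(X) \leq \bar{\ell} \leq H(X) + 1\) holds with room to spare.

```python
import math

codes = {"A": "0", "T": "10", "C": "110", "G": "111"}
probs = {"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125}

# Prefix-free check: no codeword is a prefix of another.
words = list(codes.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

# Average codeword length and entropy H(X).
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg_len, entropy)  # both are 1.75 bits for this dyadic source
```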