Methods for compressing data. The Huffman greedy algorithm
uses character frequencies.
Code words are not prefixes of other code words. Can show that optimal
compression can always be achieved with a prefix code, so only consider
them.
a b c = 0 101 100 = 0101100 so
|
aaa |
b |
e |
0001011101 = |
000 |
101 |
1101 |
binary tee whose leaves are the code words is a convenient tool to aid
coding.
An optimal code for a file is always represented as a full binary
tree. Full means every non-leaf child has two children. Note: Fixed-length
code tree doesn't. C: Alphabet ® |C| leaves
|C| - 1 interval.
where f(c) is the frequency of c and dt is the level down the
tree (bits required) of c.
Cost of the T tree. Huffman came up with a greedy algorithm. Q:
priority queue keyed on f implemented via a heap. Take two least frequent
objects and merge them together.
Huffman (C)
- n ¬ |C|
- Q ¬ C
- for i ¬ 1 to n - 1
- do z ¬ Allocate-Node ()
- x left[z] ¬ Extract-Min
(Q) // least frequent
- y right[z] ¬ Extract-Min
(Q) // next least
- f[z] ¬ f[x]
+ f[y] // update frequency
- Insert ( Q, z )
- return Extract-Min (Q)
merge into z
Why?
- Want a full binary tree
- Want least frequent objects on the bottom
What happens if x and y are replaced in Q with their merged values?
Running Time: Q into heap = O (n), n = |C|, n times through
loop, min extract O ( lg n ) so O ( n lg n )
Lemma: Let x and y be in C and have the lowest frequencies. Then $
optimal prefix code where x and y have the same length and their
prefix codes differ in the last bit. Why:
- Must be on the bottom (least frequent)
- Full tree, so they must be siblings, and so differ in one bit.
See the book for full proof.
Lemma: Let T be a full binary tree representing an optimal prefix code
over C. x, y are sibling leaves in T with f (x), f (y) and z is their
parent. Then T' = T - { x, y } with f (z) = f (x) + f (y) is also a
representation of an optimal prefix code.
Proof: B (T) in terms of B (T') " c Î
C - { x, Y } dT (c) = dT' (c) so
f (c) dT (c) = f (c) dT' (c)
dT (x) = dT (y) = 1 + dT' (z)
f (x) dT (x) + f (y) dT (y) = ( f (x) + f (y) ) (
1 + dT' (z) )
= f (z) dT' (z) + f (x) + f (y)
B (T) = B (T') + f (x) + f (y)
if T' is non-optimal then T", optimal and can get B (T) = B
(T") + f (x) + f (y) contradicting T's optimality. T' optimal for C'
= C - { x, y }