Special

Introducing the “Welcome to Xojo” Bundle!

New to Xojo and looking for guidance? We've put together a terrific bundle to welcome you!

This bundle includes six back issues of the magazine -- all of year 21 in printed book and digital formats -- plus a one-year subscription (beginning with 22.1) so you'll be learning all about Xojo for the next year. It's the perfect way to get started programming with Xojo. And you save as much as $35 over the non-bundle price!

This offer is only available for a limited time and supplies are limited, so order this special bundle before the offer goes away!

Article Preview



Issue 3.3

COLUMN

Huffman Compression

An application of binary trees

Issue: 3.3 (January/February 2005)
Author: Thomas Reed
Author Bio: Thomas Reed has been programming as a hobbyist for more than 20 years, and fell in love with the Mac in 1984.
Article Description: No description available.
Article Length (in bytes): 7,860
Starting Page Number: 34
Article Number: 3316
Related Link(s): None

Excerpt of article text...

In the last issue, this column discussed binary trees as a searching tool. This month, let's look at another application of binary trees. Huffman compression, described in a paper by David Huffman in 1952, is a general purpose compression algorithm. Huffman compression is, in fact, so useful that even modern compression formats such as JPEG and MP3 employ the Huffman algorithm as a part of their specification.

The Huffman algorithm compresses data by replacing the bytes of the input string with codes of shorter bit length. The most frequently occurring bytes get the shortest codes, often as few as one or two bits long. A binary tree is an integral part of this process: it is used during compression to generate these codes, and again during expansion to convert the codes back into the original bytes.
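
The excerpt stops short of code. Purely as an illustration of the idea (not the author's implementation, and the function name huffman_codes and the tuple-based tree nodes are choices made here), a minimal Python sketch of code generation might look like this: count byte frequencies, repeatedly merge the two lightest nodes into a binary tree, then read each byte's code off its root-to-leaf path.

# Minimal sketch of Huffman code generation (illustrative only).
import heapq
from collections import Counter
from itertools import count

def huffman_codes(data):
    freq = Counter(data)
    tiebreak = count()  # unique counter keeps heap comparisons off the nodes
    # Each heap entry is (weight, tiebreak, node); a node is either a byte
    # value (leaf) or a (left, right) pair (internal node).
    heap = [(weight, next(tiebreak), byte) for byte, weight in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:          # degenerate case: only one distinct byte
        return {heap[0][2]: "0"}
    while len(heap) > 1:        # merge the two lightest nodes until one tree remains
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    root = heap[0][2]
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse left/right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record this byte's code
            codes[node] = prefix
    walk(root, "")
    return codes

print(huffman_codes(b"apple"))  # the most frequent byte ('p') gets the shortest code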

Let's take a look at a very simple example. The string "apple" could be compressed using the codes p=0, a=10, l=110, e=111. Replacing the letters with these codes results in the binary string 1000110111. That's only ten bits, compared with 40 in the original string, which means the compressed data is only a quarter the size of the original. This is an impressive savings!
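
As a quick check of that arithmetic, the same substitution can be written out directly using exactly the code table the excerpt gives (this is illustrative only, not code from the article):

# Encode "apple" with the fixed code table from the example above.
codes = {"p": "0", "a": "10", "l": "110", "e": "111"}
text = "apple"
bits = "".join(codes[ch] for ch in text)
print(bits)                                                       # 1000110111
print(len(bits), "bits vs", len(text) * 8, "bits uncompressed")   # 10 vs 40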

There is a catch, though. We have to somehow remember which codes correspond to which bytes. This typically comes in the form of header data that maps each byte to its bit code. This overhead means that the shorter the data being compressed, the less actual savings you will see. For very small amounts of data, in fact, the compressed version can actually take up more space than the original. This means that Huffman compression, like most other forms of compression, is most effective on larger quantities of data.
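
To put a rough number on that overhead: the article does not describe a header format, so the two-bytes-per-table-entry layout assumed below is purely hypothetical, but it shows why a tiny input like "apple" ends up larger after "compression".

# Back-of-the-envelope overhead estimate (hypothetical naive header format:
# one byte for the symbol plus one byte for its code, per table entry).
original_bits = 5 * 8          # "apple" is 5 bytes
payload_bits = 10              # the encoded bits from the example above
header_bits = 4 * 2 * 8        # 4 table entries x 2 bytes each
total = payload_bits + header_bits
print(total, "bits compressed vs", original_bits, "bits original")  # 74 vs 40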

Another problem with Huffman compression is that it relies on an uneven distribution of byte values. Remember that bytes that recur frequently get shorter codes. In our "apple" example, the letter 'p' occurs twice as often as any other character, so we can give it a code that is only one bit long. However, if the distribution were more even, as in the string "abcde", the compression would be far less efficient. Of course, in this simplistic example, Huffman compression is still effective, since only five bytes out of the possible range of 256 are present. However, for data that contains a significant percentage of all byte values in an even distribution, Huffman compression becomes quite inefficient. Fortunately, most data has an uneven byte distribution, making the Huffman algorithm a good general-purpose choice.
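
To see the difference concretely, the huffman_codes sketch from above can be run on both strings (assuming that function is in scope; this comparison is added here, not taken from the article):

# Compare the two strings discussed in the excerpt. The exact codes depend
# on tie-breaking, but the total bit count of an optimal Huffman coding does not.
for text in ("apple", "abcde"):
    codes = huffman_codes(text.encode())
    total_bits = sum(len(codes[b]) for b in text.encode())
    print(text, total_bits, "bits vs", len(text) * 8, "uncompressed")
# "apple" encodes in 10 bits, "abcde" in 12: still a saving here, but only
# because a mere 5 of the 256 possible byte values appear at all.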

One last issue with the Huffman algorithm is apparent when you consider that it relies on byte frequencies. The algorithm requires two passes through the data during the compression process: one to generate the frequency table, and another to perform the actual compression. For this reason, memory use and speed become a concern with large amounts of data. If all the data cannot easily be buffered in RAM, it must be read twice from disk, which can be a problem. On modern systems, this is not as significant a problem as it once was.
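
The two-pass structure might look something like the sketch below, reading the input in fixed-size chunks so it never has to be held in RAM all at once. The chunk size, file handling, and helper name are choices made here, not anything specified in the article, and the second pass is only outlined.

# Sketch of the two-pass approach: pass 1 builds the frequency table.
from collections import Counter

def count_frequencies(path, chunk_size=1 << 16):
    freq = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):   # read in chunks, never all at once
            freq.update(chunk)               # count occurrences of each byte value
    return freq

# Pass 2 would reopen the file, walk it chunk by chunk again, and emit the
# bit codes derived from the frequency table -- the second read from disk
# that becomes a concern when the data cannot be buffered in memory.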

...End of Excerpt. Please purchase the magazine to read the full article.