Tech Talk: Huffman Compression

A data compression technique which varies the length of the encoded symbol in proportion to its information content,that is the more often a symbol or token is used, the shorter the binary string used to represent it in the compressed stream.Huffman codes can be properly decoded because they obey the prefix property, which means that no code can be a prefix of another code, and so the complete set of codes can be represented as a binary tree, known as a Huffman tree.
The least length codes correspond to the most often occurring symbols allowing to perform compression. For example, the sequence “aaaaaaaa” contains only one symbol ‘a' that corresponds to the Huffman code with size 1 bit; after encoding the output sequence will have size 1 byte - thus compression ratio will be 8.Because generally a character uses 8 bits.Huffman encoding is based on the principle that letters that appear more frequently should have a smaller code than the ones that are used less often. So, in the English language, vowels would be used more than the letter 'z', and would get shorter codes.

Huffman program in python Huffman.py

Huffman program in javaScript Huffman.js

For example: Huffman Tree for String "malayalam"

Generating Huffman Tree is also simple.We can just check in Huffman Tree Generation

So Generally The idea behind Huffman coding is to find a way to compress the storage of data using variable length codes. Our standard model of storing data uses fixed length codes. For example, each character in a text file is stored using 8 bits. There are certain advantages to this system. When reading a file, we know to ALWAYS read 8 bits at a time to read a single character. But as you might imagine, this coding scheme is inefficient. The reason for this is that some characters are more frequently used than other characters. Let's say that the character 'e' is used 10 times more frequently than the character 'q'. It would then be advantageous for us to use a 7 bit code for e and a 9 bit code for q instead because that could shorten our overall message length.
Huffman coding finds the optimal way to take advantage of varying character frequencies in a particular file. On average, using Huffman coding on standard files can shrink them anywhere from 10% to 30% depending to the character distribution.

Monday, September 23, 2013

Huffman Compression

No comments:

Post a Comment