Wednesday, November 14, 2012

Algorithm used by BZip, JPeg compression

We all use winzip, jpeg and other compression utilities. The space gained by compression is really high. We will discuss the underlying implementation behind them.

In Java char primitive type is stored as 2 bytes (or 16 bits). The ascii values however are 7 bits. So the total number of ascii characters are 128. So in ascii 'a' is 97 and 'A' is 65.
Refer to the link for the complete list

So in Java if I want to store a string "cat" it takes 2*3 = 6 bytes (or 48 bits).

Imagine you are writing an utility which loads an English novel into memory. If we need to store this huge data then we will most probably run into memory issues. So what is the alternative?
Huffman has proposed an algorithm for compression. It has support for encoding and decoding as well.
Lets say our text is "bcbbbbbbaacaabbcade". This contains 19 letters, so we need 19*16 = 304 Bits to represent this String.

Consider the input string. Count how many times each character appears in the string. Store it somewhere in the ascending order of frequency. So i would get the below list-


What is the most efficient way to do this? Lets see what is the size of our character set. Its 256. So initialize an integer array of size 256. Read the input string character by character and increase the count at the index of the character. So if our integer array is arr[256], then we would have arr[97] = 5, arr[98] = 9 and so on.
So the time-complexity is O(n) where n is the length of the input string.

Now take the first 2 characters (e,1) and (d,1). We will store these 2 as child nodes of a binary-tree like structure. The parent of these 2 nodes will be a consolidated node (ed, 2). Here we added the 2 characters and also their frequencies.

           (ed, 2)
(e,1)                 (d, 1)

So the frequency for 'ed' is 2. This is the minimum currently, the next minimum is (c, 3). So our new node (edc, 5) would have left child (ed, 2) and right child (c, 3). Similarly we add 5 and 9. Finally we would get (bedca, 19). This is the final root node. The frequency of this node is 19 which is the length of the input string. The following is the structure of the tree.
Mark all left nodes 0 and right nodes 1.
Double-circled nodes are leaves.

Start from root node and do a DFS(Depth First Search) till you encounter leaf node. We get

b = 0 
a = 11
c = 101
d = 1001
e = 1000

Input string will translate to "010100000011111011111001011110011000". So the length of the string is 36. That means we only need 36 bits to store and not 304 bits. Woow that's a great saving of 88.15%.

Now how to decode any string. Lets say we want to decode "0001101011000". Read the numbers from left to right. So we get 0, read from the tree till we hit a leaf node. So we hit b for 0. So we get bbb for 000. Now we get 1, we goto (edca,10). Continue, since we still didn't encounter leaf node. Next number is 1, so we encountered (a, 5). Hence it is 'a'. So we get "bbbabce". This is the decrypted value.

In the solution, I have built the Huffman's tree in a most-efficient way. I don't use any costly recursion, I have used plain arrays. Have fun coding!!


No comments:

Post a Comment