
2. Data compression

This chapter presents the historical development of data compression and describes the most important methods it has produced, since wavelet compression makes use of some of them. Some authors use two different terms for compression, one denoting lossless and the other lossy compression; here no such distinction is made, and the terms are used synonymously to avoid confusion.

2.1 Degree of Compression

In order to compare compression methods, one needs a criterion that reflects the quality of the compression. The easiest such criterion is the degree of compression. It indicates the size ratio of the original file to the compressed file:

Degree of compression = (1 - compressed size / original size) · 100 %
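
As a small illustration, the degree of compression can be computed as follows (a minimal sketch; the function name and the example sizes are my own):

```python
def degree_of_compression(original_size: int, compressed_size: int) -> float:
    """Degree of compression in percent: (1 - compressed size / original size) * 100."""
    return (1 - compressed_size / original_size) * 100

# a 1000-byte file compressed to 400 bytes is reduced by 60%:
print(round(degree_of_compression(1000, 400), 1))  # -> 60.0
```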

2.2 Information content, entropy and redundancy

At the end of the 1940s a new branch of mathematics emerged: information theory. It was created through the work of Claude Shannon, who worked at Bell Laboratories and dealt with various questions of information processing, including procedures for storing and transmitting information.

In information theory, the information content I of the i-th character of a set M with n elements (p_i = probability of occurrence of that character) is generally measured in bits and is defined as follows ([HILBERG87]):

I(p_i) = -ld p_i = ld (1/p_i)

The term entropy comes from thermodynamics and is used in a similar way in information processing. The higher the entropy of a message (or a signal), the higher its information content. Entropy is a measure of the information content of a message or a news source. It determines the average information content per character and is defined as follows ([HILBERG87]):

H = Ī = ∑_i p_i · I(p_i) = -∑_i p_i · ld p_i
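
For a concrete message, H can be computed directly from the relative character frequencies (a small sketch; math.log2 plays the role of ld):

```python
import math
from collections import Counter

def entropy(message: str) -> float:
    """Mean information content per character in bits: H = -sum(p_i * ld p_i)."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# four equally frequent characters carry ld 4 = 2 bits each:
print(entropy("abcd"))  # -> 2.0
```

A message over a single character has entropy 0: it carries no information and could be compressed to almost nothing.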

If the entropy of a message deviates from its mean character length (it will generally be smaller, since characters are usually stored 8 bits wide), then part of the message, or part of the characters that make it up, is redundant. This means it contains superfluous data segments that contribute no information. For example, if you remove every 'e' from a text, 99% of it is still understandable; the 'e' thus carries largely redundant information. Compression tries to reduce this redundancy or to remove it completely.

In contrast to thermodynamics, where the entropy can be calculated precisely, in information theory no absolute number can be given for the information content of a message. This is because the probability with which a character occurs depends very much on the underlying model. Among other things, a distinction is made between models of different orders. In a model of order 0, the individual characters are considered without the characters surrounding them. In a text examined with a model of order 0, the letter 'u' might appear with a probability of 1%. If the same text is examined with a model of order 1, the preceding character is also taken into account, i.e. the probability now depends on the character read before: the probability of the letter 'u' occurring after a 'q' can then be 95%.

In order to compress messages that contain redundancy, they have to be recoded so that the mean number of stored bits per character approaches the entropy, since the entropy indicates the information content. To achieve this, a code with variable length is used instead of the ASCII code, whereby the code length of a character should correspond to its information content.

2.3 Redundancy Reduction (Retaining Entropy)

Different compression models

Lossless data compression can mainly be divided into two groups: one group uses statistical models, the other works on the basis of tables.

In statistical modeling, the characters are coded one after the other based on their probability of occurrence. The achievable compression depends on the accuracy of the statistical model and the coding derived from it. Statistical models are distinguished, among other things, by their order. Models of order 0 use the probability of occurrence of the characters regardless of their context; only a single list of occurrence probabilities is maintained. Models of order 1 manage a list for each character in which the probabilities of occurrence of the following character are stored. In the example above, the probability of occurrence of the letter 'u' in the 'q' list could be 95%, but in the 'e' list only 30%. Since the prediction of the next character is more accurate with a higher-order model, higher degrees of compression can be expected; however, such models also require more memory and computing time. One thing should not be forgotten with statistical models: if all characters are almost equally distributed in all orders, compression is not possible, because all characters then have the same information content and thus the same bit length!

With table-based modeling, character segments that appear frequently in the text, and that should be as long as possible, are managed in a table. During compression these segments are replaced by a single code, a token. The token is shorter than the associated character segment, so compression is achieved. Here the focus is less on the coding of the tokens and more on the structure and management of the tables.

Statistical modeling

In order to reduce the redundancy, i.e. to compress the message, the information content of the occurring characters is first calculated, and they are then encoded with the bit lengths obtained. In doing so, however, one encounters the problem that these bit lengths are usually not integers, while the coding only provides for integer bit lengths. It follows that compromises must be made and the lengths rounded up or down. This approach is followed both by the Shannon-Fano coding, which was developed almost simultaneously by Claude Shannon and R.M. Fano at M.I.T., and by the Huffman coding, which D.A. Huffman presented in 1952 in his article "A Method for the Construction of Minimum-Redundancy Codes".

The Shannon-Fano algorithm


Coding:

  • determine the (relative) frequency of each occurring character
  • output the character frequency list (characters sorted in ascending order)
  • sort the list (according to descending frequency values)
  • form the Shannon-Fano tree (a binary tree):
    • divide the list into two parts so that the sums of the (relative) character frequencies on both sides are as equal as possible
    • mark the subtrees with 0 and 1
    • split the subtrees recursively until no further splits are possible
  • encode the individual characters until the end of the file is reached:
    • read off the bits along the path from the root to the leaf that represents the character to be encoded
    • output the code


Decoding:

  • read in the character frequency list (characters sorted in ascending order)
  • sort the list (according to descending frequency values)
  • form the Shannon-Fano tree (see above)
  • decode the individual codes until the end of the file is reached:
    • follow the read bits along the path from the root to the leaf that represents the character to be decoded
    • output the character
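
The steps above can be sketched in a few lines; this recursive split chooses the cut where the two frequency sums are as equal as possible (the frequencies are hypothetical, and the function name is my own):

```python
def shannon_fano(freqs):
    """Assign each character a bit string by recursively splitting the
    frequency-sorted character list into two halves of (nearly) equal weight."""
    symbols = sorted(freqs, key=freqs.get, reverse=True)  # descending frequency
    codes = {}

    def split(syms, prefix):
        if len(syms) == 1:
            codes[syms[0]] = prefix or "0"
            return
        total = sum(freqs[s] for s in syms)
        best_cut, best_diff, acc = 1, float("inf"), 0
        for i in range(len(syms) - 1):
            acc += freqs[syms[i]]
            diff = abs(total - 2 * acc)   # imbalance of the two sides
            if diff < best_diff:
                best_cut, best_diff = i + 1, diff
        split(syms[:best_cut], prefix + "0")   # left subtree gets a 0,
        split(syms[best_cut:], prefix + "1")   # right subtree a 1

    split(symbols, "")
    return codes

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# -> {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```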

The most important properties are:

  • Different codes have different bit lengths.
  • Codes for characters with smaller probabilities of occurrence have more bits than codes for characters with higher probabilities.
  • Although the codes have different bit lengths, they can be decoded unambiguously, since no code is the prefix of another.

Instead of the Shannon-Fano tree, a look-up table (LUT) calculated from it can also be used for coding/decoding. So that the decoder can rebuild this table, the character frequency list, sorted by character value (usually the ASCII value) rather than by frequency, must be transferred in the compressed file. With higher-order statistical models this becomes a serious problem (see below).

The Huffman algorithm

Huffman coding works broadly like Shannon-Fano coding. The difference lies in the structure of the tree. It is also a binary tree, but it is constructed differently: the construction of the Huffman tree does not start at the root, as with the Shannon-Fano tree, but at the leaves.

Construction algorithm:

  1. mark all leaves (and later also the inner nodes) that have no parent as free nodes
  2. find the two free nodes (or leaves) with the lowest (relative) frequencies
  3. construct a parent node and combine the two as its children into a new subtree
  4. store the sum of the frequencies of the two child nodes in the parent node
  5. mark the parent as a free node
  6. delete the children's free-node markers
  7. mark one child node with a 1 and the other with a 0
  8. repeat from point 2 until only one free node is left
  9. mark the last free node as the root of the tree
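
The construction can be sketched with a priority queue that always yields the two free nodes with the lowest frequencies (a sketch with hypothetical frequencies; the counter only serves as a tie-breaker):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build the Huffman tree bottom-up and read the codes off root-to-leaf."""
    tick = count()  # tie-breaker so equal frequencies never compare the nodes
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two free nodes with
        f2, _, right = heapq.heappop(heap)   # the lowest frequencies
        # the new parent stores the sum of its children's frequencies
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    root = heap[0][2]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # inner node: 0 and 1 for the children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            codes[node] = prefix or "0"
    walk(root, "")
    return codes

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
codes = huffman_codes(freqs)
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # -> 87 bits in total
```

For these frequencies the code lengths are 1, 3, 3, 3, 3 bits, for a total cost of 15·1 + 24·3 = 87 bits.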

The results of the Shannon-Fano and the Huffman coding usually differ only insignificantly from one another, but Huffman's method is always at least as good as that of Shannon and Fano. Huffman proved that his method cannot be surpassed by any other method that works with integer bit lengths.

Arithmetic coding

Arithmetic coding has been around since the early 1980s; however, only more recently has it become practical to implement this method on computers with their fixed-length registers. In contrast to the two codings mentioned above, arithmetic coding is also able to code characters with non-integer bit lengths. A completely different strategy is necessary for this: no characters are replaced by codes of different bit lengths; instead, the entire character stream, i.e. the message, is replaced by a single floating point number between 0 and 1. This floating point number represents the probability of occurrence of the entire text.


Coding:

  • determine the (relative) frequency of each occurring character
  • output the character frequency list (characters sorted in ascending order)
  • form the range table:
    • assign each character a section of the probability range (0 to 1) corresponding to its frequency
  • for all characters:
    • narrow the current interval to the section of it that the range table assigns to the character read
  • output a number from the final interval


Decoding:

  • read in the character frequency list (characters sorted in ascending order)
  • create the range table (see above)
  • decode the floating point number:
    • read in the coded number
    • for each character to be output:
      • output the character in whose range the number lies
      • subtract the lower limit of that range from the number and divide the result by the character's probability
During the implementation, the order of the characters in the table doesn't really matter, as long as it is always the same. The floating point number is of course not saved in plain text but in coded form. The calculation of the decimal places is also implemented differently. The algorithm only reflects the principle.

Example: the compression of "BILL_GATES"
Range table:

Character   Probability p   Range (p cumulated)
_           0.1             0.00 ≤ r < 0.10
A           0.1             0.10 ≤ r < 0.20
B           0.1             0.20 ≤ r < 0.30
E           0.1             0.30 ≤ r < 0.40
G           0.1             0.40 ≤ r < 0.50
I           0.1             0.50 ≤ r < 0.60
L           0.2             0.60 ≤ r < 0.80
S           0.1             0.80 ≤ r < 0.90
T           0.1             0.90 ≤ r < 1.00


Encoding (interval after each character):

Character   Lower limit     Upper limit
(start)     0.0             1.0
B           0.2             0.3
I           0.25            0.26
L           0.256           0.258
L           0.2572          0.2576
_           0.2572          0.25724
G           0.257216        0.25722
A           0.2572164       0.2572168
T           0.25721676      0.25721680
E           0.257216772     0.257216776
S           0.2572167752    0.2572167756


Decoding:

Coded number    Output character
0.2572167752    B
0.572167752     I
0.72167752      L
0.6083876       L
0.041938        _
0.41938         G
0.1938          A
0.938           T
0.38            E
0.8             S
0.0             (end)
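
The example can be reproduced exactly if the interval arithmetic is carried out with rational numbers instead of floating point (a sketch of the principle only; as noted above, real implementations work with fixed-length integer registers, and the function names are my own):

```python
from fractions import Fraction as F

# range table for "BILL_GATES" (cumulated probabilities, as above)
ranges, low = {}, F(0)
for ch, p in [("_", "0.1"), ("A", "0.1"), ("B", "0.1"), ("E", "0.1"),
              ("G", "0.1"), ("I", "0.1"), ("L", "0.2"), ("S", "0.1"), ("T", "0.1")]:
    ranges[ch] = (low, low + F(p))
    low += F(p)

def encode(message):
    low, high = F(0), F(1)
    for ch in message:
        width = high - low
        lo, hi = ranges[ch]
        low, high = low + width * lo, low + width * hi  # narrow the interval
    return low  # any number in [low, high) identifies the message

def decode(number, length):
    out = []
    for _ in range(length):
        for ch, (lo, hi) in ranges.items():
            if lo <= number < hi:
                out.append(ch)
                number = (number - lo) / (hi - lo)  # undo this character's range
                break
    return "".join(out)

r = encode("BILL_GATES")
print(r == F("0.2572167752"))  # -> True, matching the table above
print(decode(r, 10))           # -> BILL_GATES
```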

The fact that characters can be coded with less than 1 bit makes arithmetic coding, as will be shown below, interesting for wavelet compression. In a somewhat extreme example, the arithmetic coding of a file with 100,000 zeros and an end-of-file character (EOF) requires only 3 bytes, while Huffman coding requires at least 12,501 bytes.

Table-based modeling

The most important part of these models is the table and its management. Various strategies are pursued, but most of them stem from the work of Jacob Ziv and Abraham Lempel, two Israeli researchers. Their first article, published in 1977, was groundbreaking: instead of optimizing a statistical model, the idea was to store strings (phrases) that occur several times in the text only once. If a phrase occurs a second time, only a pointer to this phrase is stored. If the memory requirement of the token is less than that of the phrase, data compression is achieved; otherwise the file is inflated (see below). This can easily happen when trying to recompress files that have already been compressed. A table is generally used to reference a phrase. In the method described in the first publication, the table has the form of a text window that slides over the data to be encoded, i.e. old phrases that are pushed out of the window, and thus out of the table, are forgotten and can no longer be referenced. In their following article, the authors described a mechanism for managing a table that is independent of a window; in theory, every phrase that occurs in the text can be stored in it. The simple run-length coding (RLC) can also be understood as a table-based model: the number of repetitions and the character are transferred in a token, and an offset is not necessary because the table only contains one character.

The LZ77 procedure and the LZSS improvement

LZ77 is the abbreviation for the compression method that Jacob Ziv and Abraham Lempel published in 1977 in the article "A Universal Algorithm for Sequential Data Compression" in the IEEE Transactions on Information Theory. (That the procedure is called LZ77 and not ZL77 is due to a misprint.) The table used here is implemented in the form of a window with two parts: the first part of the window is the table, i.e. the most recently coded text; the second, much smaller part is the character string that is to be coded next.

| 1. table (already encoded text) | 2. text to be encoded |

A token consists of:

  • an offset to the phrase in the text window
  • the phrase length
  • the next character following the encoded phrase

Example: 12, 6, "j"

Coding (LZ77):

  • look for a character string in the table that is as long as possible and that matches the next n characters to be encoded
  • make a token and save it
  • move the window (n + 1) characters
  • repeat until all characters are encoded


Decoding (LZ77):

  • read a token
  • copy the phrase of the given length from the offset position into the 2nd part of the window
  • output the character transferred in the token into the 2nd part of the window
  • move the text window by (n + 1) characters
  • repeat until all tokens are decoded

If no matching phrase is found, the token "0, 0, character" is output. While this is not efficient, it ensures that any string can be encoded. If matching phrases are only rarely found, the file can even grow. With a window size of 4096 bytes, 12 bits are generally used for the offset, 4 bits for the length and 8 bits for the next character; this means that 3 bytes are needed per non-referenceable character. This is where the first optimization starts.
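
A naive version of this scheme can be sketched as follows (illustration only: the window search is linear, offset 0 marks the "no match" token, and a real implementation would pack each token into the 3 bytes described above):

```python
def lz77_encode(data, window=4096, max_len=15):
    """Emit (offset, length, next character) tokens; offset 0 means no match."""
    i, tokens = 0, []
    while i < len(data):
        start = max(0, i - window)
        best_off, best_len = 0, 0
        for j in range(start, i):  # search the table part of the window
            length = 0
            while (length < max_len and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1  # move the window past the phrase and the next character
    return tokens

def lz77_decode(tokens):
    out = []
    for off, length, ch in tokens:
        pos = len(out) - off
        for k in range(length):   # character-wise copy allows overlapping matches
            out.append(out[pos + k])
        out.append(ch)
    return "".join(out)

tokens = lz77_encode("abcabcabcx")
print(tokens)               # -> [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 6, 'x')]
print(lz77_decode(tokens))  # -> abcabcabcx
```

Note the last token: its length (6) exceeds its offset (3), i.e. the phrase overlaps the text still being produced, which the character-wise copy in the decoder handles correctly.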

With the LZSS compression method, pointers and characters can be mixed as desired. The 0-0 pointer is avoided by prefixing each token with a single bit that indicates whether a pointer or a plain character follows.

A second improvement is made in the window management (see below).

The LZ78 procedure and the LZW improvement

In September 1978, Jacob Ziv and Abraham Lempel described another table-based procedure in their follow-up article "Compression of Individual Sequences via Variable-Rate Coding" in the IEEE Transactions on Information Theory. It was improved by Terry Welch in the article "A Technique for High-Performance Data Compression", published in the June 1984 issue of IEEE Computer. The Sperry Research Center (now part of Unisys) received a US patent for parts of this algorithm. The LZ78 algorithm is an adaptive method: at the beginning the phrase table is empty, and every time a token is issued, the phrase table is expanded. The new phrase is made up of the old one plus the transferred character. The tokens are similar to those of LZ77; only the length specification of the phrase is missing, because it is contained in the structure of the phrase list.


Coding (LZ78):

  • read a string
  • look for this string in the phrase table
  • form the token and output it
  • form the new phrase and insert it into the table
  • repeat until all characters are encoded


Decoding (LZ78):

  • read a token
  • output the referenced string
  • form the new phrase and insert it into the table
  • repeat until all tokens are decoded

A simple improvement of the algorithm is possible if one does not start with an empty table. With the LZW procedure, a table with all possible single characters is used as the start table. This avoids the transfer of single characters, since they are all already available as a phrase. The other improvements are mainly in the management of the structures.
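
A compact LZW sketch with a start table of all 256 single characters (the classic textbook formulation, not Welch's optimized implementation; the function names are my own):

```python
def lzw_encode(data):
    """Token = table index of the longest known phrase; afterwards the table
    is extended by (that phrase + the next character)."""
    table = {chr(i): i for i in range(256)}
    phrase, out = "", []
    for ch in data:
        if phrase + ch in table:
            phrase += ch                     # keep extending the current phrase
        else:
            out.append(table[phrase])
            table[phrase + ch] = len(table)  # new phrase = old + next character
            phrase = ch
    out.append(table[phrase])
    return out

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # a code may refer to the phrase that is only just being entered
        entry = table[code] if code in table else prev + prev[0]
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return "".join(out)

msg = "TOBEORNOTTOBEORTOBEORNOT"
print(lzw_decode(lzw_encode(msg)) == msg)  # -> True
```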

Opportunities for improvement

In general, all parameters can be experimented with, but the effort involved should never be lost sight of. This applies especially to runtime and memory consumption. Larger tables usually mean more bits in the offset and longer management times; on the other hand, they hold more and longer phrases. For larger tables to pay off on average, larger files must also be compressed. If the size of the table is limited, one must watch for the table becoming obsolete once it is full: if the degree of compression drops, it can be assumed that the data has changed structurally, e.g. the image data of an executable program now follows. Then the table has to be emptied and rebuilt from scratch, which must be communicated to the decompressor with a special character.

Adaptive methods in modeling

Both statistical and table-based methods can achieve higher degrees of compression through adaptive modeling. Especially with higher-order statistical models or with large tables, it makes no sense to write the model data (statistics or table) that the program needs to expand the files into the file itself; that would just inflate the files unnecessarily. With a statistical model of order 0 this is only 256 values, i.e. 256 bytes for the data type char; with a model of order 1 it is already 256² = 65,536 values, and with a model of order 2 even 256³ ≈ 16.7 million.

Adaptive methods usually start with an empty or a small, predefined data set. This is expanded as required during the coding. The same routine must of course be used in decoding. Since the model data are usually adapted after each new character, i.e. the model is adapted to the new situation, these methods require considerably more computing time.

Data structures

The use of statistical methods has another disadvantage: experience shows that the matrices of higher-order statistical models are very sparsely populated and therefore waste storage space. Optimized programs use either graphs in the form of linked lists or trees, or hash functions. These improvements can come at the expense of speed; with a hash function, however, an improvement is more likely. With the LZSS method, a binary search tree is used instead of the first part of the text window, which brings a clear speed advantage when searching. Since the program is mainly busy searching, the effort involved in managing the tree is negligible.

2.4 Entropy Reduction


With the various entropy-reducing compression methods, the information in the original data is reduced by the compression; not only superfluous data carrying merely redundant information is removed. This lost information cannot be recovered when decompressing! However, an attempt is made to delete only information that does not falsify the data too much. It follows that these methods cannot be applied to arbitrary data. Texts, databases and similar data must of course not differ from the original data after expansion. Sound or image data are different. The ear cannot hear whether a fly is buzzing next to a jackhammer; a recording made with good equipment could nevertheless capture this hum. It is not redundant, but it is unusable information, because the ear does not process it. It is similar with the eye. Most gray value images are 8 bits deep, i.e. they have 256 gray values. However, humans can only distinguish approx. 160 gray values, so it does not matter if a pixel deviates by one gray value; even deviations of several gray values are very difficult to detect. If the images are processed digitally, however, this already becomes noticeable, e.g. in object recognition, in histogram equalization or when using false colors, since a computer can clearly distinguish a deviation of just one gray value. These methods are therefore mainly used for data that represent analog values and are to be output again in analog form. The first losses usually arise when digitizing the analog data.

After removing the less relevant information, a lossless compression usually follows. This is heavily dependent on the previously used procedure and the associated structure of the data.

Compression of sound data

AT&T made the first investigations. They established the first commercial digital telephone connection in 1962 and wanted to make better use of their digital transmission channels. With 50% data compression, you could transmit twice as many connections over one channel. Similar problems arise today with digital television. Procedures are required that guarantee a certain degree of compression. However, this is not possible with the methods described above. For the sound data, the statistical methods gave the best results. This is noteworthy because for most of the other data, the table-based methods give the better results.

Since the data is digitized analog data, a value almost always represents an amplitude. At first the amplitudes were sampled linearly, later logarithmically, which suits the sensitivity of the ear. One uses a so-called companding codec; the name stands for compressing/expanding coder/decoder. With it, samplers working with 8-bit companding are as effective as samplers that work linearly with 13 bits. Similar conversions can also be made retrospectively with 8-bit linear data, which can be logarithmically compressed to 4 bits without excessive distortion. In addition, differential methods were tried, in which not the amplitudes themselves but the differences between two consecutive values are stored with a certain number of bits. The CCITT (International Telegraph and Telephone Consultative Committee) approved the G.721 algorithm in one of its letters of recommendation; it works with adaptive differential pulse code modulation (ADPCM). A 1-bit ADPCM method is currently used by some modems for voice recording (answering machine function). Finally, Linear Predictive Coding (LPC) should be mentioned: a method that models human speech generation and therefore only has to store a few parameters in order to generate the speech again (vocoder). This process needs only approx. 2400 bits per second to encode speech in a recognizable manner.
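
The differential idea is easy to illustrate (a sketch of plain DPCM without the adaptive quantization of G.721; the signal values are invented):

```python
def dpcm_encode(samples):
    """Store the first sample and then only the differences between neighbors."""
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs

def dpcm_decode(diffs):
    samples = [diffs[0]]
    for d in diffs[1:]:
        samples.append(samples[-1] + d)
    return samples

signal = [100, 102, 105, 104, 101]
print(dpcm_encode(signal))                         # -> [100, 2, 3, -1, -3]
print(dpcm_decode(dpcm_encode(signal)) == signal)  # -> True
```

The differences are much smaller than the amplitudes themselves and can therefore be stored with fewer bits.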

Image compression

In the beginning, several lossless methods were used to compress images; these mainly include PCX, BMP, IFF and GIF. Since their degrees of compression were no longer sufficient to store films and animations, attempts were made to develop lossy methods, similar to sound compression. In doing so, the developers found that image data are structurally completely different from sound data and that the sound compression methods did not produce the desired results.

If you look at the histogram of an image, you will see that it is usually not peaked enough for the image to be compressed well using statistical methods. This is even worse with moving images. Better degrees of compression were only achieved with higher-order models. Adaptive coding (usually combined with a differential method) considers the pixel to the left and the three neighboring pixels above in order to make a good prediction for the next pixel. The results were good, but the degrees of compression were still too low.

A completely different approach is used by fractal compression. As with the LPC method, parameters are sought that describe the data on a higher level of abstraction: parameters that define an attractor corresponding to the image. This is achieved through affine transformations of individual picture elements. It does not matter with which values the reconstruction is started; the result is always an attractor that is very similar to the original image.

Another idea is to convert the images into a different form of representation (a transformation) before compressing them. The JPEG and MPEG formats also use this approach. JPEG stands for Joint Photographic Experts Group, which was founded by the CCITT and the ISO. The format includes both lossless and lossy coding; in the following, only the lossy part is described.


JPEG uses the discrete cosine transformation (DCT). It converts the spatial information of the image into frequency or spectral information. This has the advantage that the parts carrying little information can be separated more easily from those that carry a lot: the high frequencies contribute less to the visual image information than the low frequencies, and most of the information is in the constant component, which corresponds to the mean gray value. The transformation is applied in blocks. Blocks larger than 16 × 16 pixels do not bring any further improvement; to save computing time, JPEG applies the transformation to 8 × 8 areas. The transformed values become smaller and smaller with increasing frequency. It therefore makes sense to quantize the already small, unimportant values or to set them to zero entirely. As a result, there are a large number of zeros in the transformed image, which are then compressed using a lossless process: among other things, a run-length coding (RLC) is used for the zeros and a differential method for the other values, i.e. the difference between two values is saved and not the values themselves. This result is then post-compressed using a statistical method. The JPEG process has become the standard in image compression and is supported by almost all programs and also by hardware. The MPEG process for moving pictures (Motion Picture Experts Group) was derived from it.
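
The central transformation step can be sketched with a plain (slow) 2D DCT-II on one 8 × 8 block; real JPEG additionally uses fast integer DCTs, zig-zag ordering and standard quantization tables:

```python
import math

N = 8  # JPEG block size

def dct2(block):
    """2D DCT-II of an N x N block: spatial values -> frequency coefficients."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# a uniform gray block: all information ends up in the constant component (DC)
flat = [[128.0] * N for _ in range(N)]
coeffs = dct2(flat)
print(round(coeffs[0][0]))  # -> 1024, i.e. 8 times the mean gray value
print(all(abs(coeffs[u][v]) < 1e-9
          for u in range(N) for v in range(N) if (u, v) != (0, 0)))  # -> True
```

For a uniform block every AC coefficient vanishes, which is exactly why the subsequent quantization and run-length coding of zeros pays off.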

Transforming the data into another space makes it easier to separate the important information from the rather unimportant. Depending on how much information is left out, the degree of compression increases, but the image deteriorates. At high degrees of compression, the JPEG method tends to form blocks, which is due to the block-wise transformation: almost exclusively the constant component is saved, and thus an evenly gray image area (8 × 8 pixels) is reconstructed. In order to compress color images, the algorithm is applied to each color separation. Better results can be achieved if the RGB color model is converted into the HSV or LUV model, because more information is contained in the brightness than in the hue; this conversion, too, is a transformation. A relatively new transformation in image processing is the wavelet transformation. What a wavelet is, how the wavelet transformation works and how it is used for image compression will be explained in more detail in the next chapter.


A more detailed description of the algorithms, implementations (source codes), a critical discussion of the performance features, suggestions for improvement and further literature suggestions can be found in [NELSON93].