Tips for domain name registration and picking the right domain. Web hosting basics.
Godaddy Website Builder & Domains Promo Codes: Save money at Godaddy using this special
p (more)
Tips for domain name registration and picking the right domain. Web hosting basics.
Godaddy Website Builder & Domains Promo Codes: Save money at Godaddy using this special
promo link for Discount Domains and Hosting coupon codes. Create your own professionallooking
website with a few clicks. No programming required. Get started now:
http://www.godaddy.com-specialmessage.info/ezdiscount (less)
Slide 1: The Data Compression Book 2nd edition
By Mark Nelson and Jean-loup Gailly
Slide 2: Content
Afterword Why This Book Is For You Chapter 1—Introduction to Data Compression
The Audience Why C? Which C? Issues in Writing Portable C Keeping Score The Structure
Chapter 2—The Data-Compression Lexicon, with a History
The Two Kingdoms Data Compression = Modeling + Coding The Dawn Age Coding An Improvement Modeling Statistical Modeling Dictionary Schemes Ziv and Lempel
Slide 3: LZ77 LZ78 Lossy Compression Programs to Know
Chapter 3—The Dawn Age: Minimum Redundancy Coding
The Shannon-Fano Algorithm The Huffman Algorithm Huffman in C BITIO.C A Reminder about Prototypes MAIN-C.C AND MAIN-E.C MAIN-C.C ERRHAND.C Into the Huffman Code Counting the Symbols Saving the Counts Building the Tree Using the Tree The Compression Code Putting It All Together Performance
Chapter 4—A Significant Improvement: Adaptive Huffman Coding
Adaptive Coding Updating the Huffman Tree What Swapping Does The Algorithm An Enhancement The Escape Code The Overflow Problem A Rescaling Bonus The Code Initialization of the Array The Compress Main Program The Expand Main Program Encoding the Symbol Updating the Tree Decoding the Symbol The Code
Chapter 5—Huffman One Better: Arithmetic Coding
Difficulties Arithmetic Coding: A Step Forward Practical Matters A Complication
Slide 4: Decoding Where’s the Beef? The Code The Compression Program The Expansion Program Initializing the Model Reading the Model Initializing the Encoder The Encoding Process Flushing the Encoder The Decoding Process Summary Code
Chapter 6—Statistical Modeling
Higher-Order Modeling Finite Context Modeling Adaptive Modeling A Simple Example Using the Escape Code as a Fallback Improvements Highest-Order Modeling Updating the Model Escape Probabilities Scoreboarding Data Structures The Finishing Touches: Tables –1 and –2 Model Flushing Implementation Conclusions Enhancement ARITH-N Listing
Chapter 7—Dictionary-Based Compression
An Example Static vs. Adaptive Adaptive Methods A Representative Example Israeli Roots History ARC: The Father of MS-DOS Dictionary Compression Dictionary Compression: Where It Shows Up Danger Ahead—Patents Conclusion
Chapter 8—Sliding Window Compression
Slide 5: The Algorithm Problems with LZ77 An Encoding Problem LZSS Compression Data Structures A Balancing Act Greedy vs. Best Possible The Code Constants and Macros Global Variables The Compression Code Initialization The Main Loop The Exit Code AddString() DeleteString() Binary Tree Support Routines The Expansion Routine Improvements The Code
Chapter 9—LZ78 Compression
Can LZ77 Improve? Enter LZ78 LZ78 Details LZ78 Implementation An Effective Variant Decompression The Catch LZW Implementation Tree Maintenance and Navigation Compression Decompression The Code Improvements Patents
Chapter 10—Speech Compression
Digital Audio Concepts Fundamentals Sampling Variables PC-Based Sound Lossless Compression of Sound Problems and Results Lossy Compression Silence Compression
Slide 6: Companding Other Techniques
Chapter 11—Lossy Graphics Compression
Enter Compression Statistical and Dictionary Compression Methods Lossy Compression Differential Modulation Adaptive Coding A Standard That Works: JPEG JPEG Compression The Discrete Cosine Transform DCT Specifics Why Bother? Implementing the DCT Matrix Multiplication Continued Improvements Output of the DCT Quantization Selecting a Quantization Matrix Coding The Zig-Zag Sequence Entropy Encoding What About Color? The Sample Program Input Format The Code Initialization The Forward DCT Routine WriteDCTData() OutputCode() File Expansion ReadDCTData() Input DCT Codes The Inverse DCT The Complete Code Listing Support Programs Some Compression Results
Chapter 12—An Archiving Package
CAR and CARMAN The CARMAN Command Set The CAR File The Header Storing the Header The Header CRC
Slide 7: Command-Line Processing Generating the File List Opening the Archive Files The Main Processing Loop Skipping/Copying Input File File Insertion File Extraction Cleanup The Code
Chapter 13—Fractal Image Compression
A brief history of fractal image compression What is an Iterated Function System? Basic IFS mathematics Image compression with Iterated Function Systems Image compression with Partitioned Iterated Function Systems Fractal image decoding Resolution independence The sample program The main compression module Initialization Domain classification Image partitioning Finding optimal affine maps The decompression module The complete code listing Some Compression Results Patents
Bibliography Appendix A Appendix B Glossary
Slide 8: Afterword
When writing about data compression, I am haunted by the idea that many of the techniques discussed in this book have been patented by their inventors or others. The knowledge that a data compression algorithm can effectively be taken out of the hands of programmers through the use of so-called “intellectual property” law seems contrary to the basic principles that led me and many others into this profession. I have yet to see any evidence that applying patents to software advances that art or protects the rights of inventors. Several companies continue to collect royalties on patents long after their inventors have moved onto bigger and better thing with other companies. Have the patent-holders done anything notable other than collect royalties? Have they advanced the art of computer science? Making a software product into a commercial success requires innovation, good design, high-quality documentation, and listening to customers. These are things that nobody can steal from you. On the other hand, a mountain of patents can’t keep you from letting these things slip away through inattention or complacency. This lesson seems to be lost on those who traffic in intellectual property “portfolios.” What can you do? First, don’t patent your own work, and discourage your peers from doing so. Work on improving your products, not erecting legal obstacles to competition. Secondly, lobby for change. This means change within your company, those you do business with, and most importantly, within the federal government. Write to your congressman and your senator. Write to the ACM. Write to the House Subcommittee on Intellectual Property. And finally, you can join me by becoming a member of the League for Programming Freedom. Write for more information:
Slide 9: League For Programming Freedom 1 Kendall Square #143 P.O. Box 9171 Cambridge, MA 02139
I concluded, we kinotropists must be numbered among Britain's most adept programmers of Enginery of any sort, and virtually all advances on the compression of data have originated as kinotropic applications. At this point, he interrupted again, asking if I had indeed said "the compression of data," and was I familiar with the term "algorithmic compression"? I assured him I was. The Difference Engine William Gibson and Bruce Sterling
Why This Book Is For You
If you want to learn how programs like PKZIP and LHarc work, this book is for you. The compression techniques used in these programs are described in detail, accompanied by working code. After reading this book, even the novice C programmer will be able to write a complete compression/archiving program that can be ported to virtually any operating system or hardware platform. If you want to include data compression in other programs you write, this book will become an invaluable tool. It contains dozens of working programs with C code that can easily be added to your applications. In-depth discussions of various compression methods will help you make intelligent decisions when creating programs that use data compression. If you want to learn why lossy compression of graphics is the key factor in enabling the multimedia revolution, you need this book. DCT-based compression like that used by the
Slide 10: JPEG algorithm is described in detail. The cutting edge technology of fractal compression is explained in useful terms, instead of the purly theoretical. Working programs let you experiment with these fascinating new technologies. The Data Compression Book provides you with a comprehensive reference to this important field. No other book available has the detailed description of compression algorithms or working C implementations for those algorithms. If you are planning to work in this field, The Data Compression Book is indispensable.
Chapter 1 Introduction to Data Compression
The primary purpose of this book is to explain various data-compression techniques using the C programming language. Data compression seeks to reduce the number of bits used to store or transmit information. It encompasses a wide variety of software and hardware compression techniques which can be so unlike one another that they have little in common except that they compress data. The LZW algorithm used in the Compuserve GIF specification, for example, has virtually nothing in common with the CCITT G.721 specification used to compress digitized voice over phone lines. This book will not take a comprehensive look at every variety of data compression. The field has grown in the last 25 years to a point where this is simply not possible. What this book will cover are the various types of data compression commonly used on personal and midsized computers, including compression of binary programs, data, sound, and graphics. Furthermore, this book will either ignore or only lightly cover data-compression techniques that rely on hardware for practical use or that require hardware applications. Many of today’s voice-compression schemes were designed for the worldwide fixedbandwidth digital telecommunications networks. These compression schemes are intellectually interesting, but they require a specific type of hardware tuned to the fixed bandwidth of the communications channel. Different algorithms that don’t have to meet this requirement are used to compress digitized voice on a PC, and these algorithms generally offer better performance.
Slide 11: Some of the most interesting areas in data compression today, however, do concern compression techniques just becoming possible with new and more powerful hardware. Lossy image compression, like that used in multimedia systems, for example, can now be implemented on standard desktop platforms. This book will cover practical ways to both experiment with and implement some of the algorithms used in these techniques.
The Audience
You will need basic programming skills to adequately discuss data-compression code. The ability to follow block-structured code, such as C or Pascal, is a requirement. In addition, understanding computer architecture well enough to follow bit-oriented operations, such as shifting, logical ORing and ANDing, and so on, will be essential. This does not mean that you need to be a C guru for this book to be worthwhile. You don’t even have to be a programmer. But the ability to follow code will be essential, because the concepts discussed here will be illustrated with portable C programs. The C code in this book has been written with an eye toward simplicity in the hopes that C novices will still be able to follow the programs. We will avoid the more esoteric constructs of C, but the code will be working tested C—no pseudocode or English.
Why C?
The use of C to illustrate data-compression algorithms may raise some hackles, although less so these days than when the first edition of this book came out. A more traditional way to write this book would have been to use pseudocode to sketch out the algorithms. But the lack of rigor in a pseudocode “program” often leads to hazy or incomplete definitions full of lines like “PROCESS FILE UNTIL OUT OF DATA.” The result is that pseudocode is easy to read, but not so easy to translate into a working program. If pseudocode is unsatisfactory, the next best choice is to use a conventional programming language. Though hundreds of choices are available, C seems the best choice for this type of book for several good reasons. First, in many respects C has become the lingua franca of programmers. That C compilers support computers ranging from a lowly 8051 microcontroller to supercomputers capable of 100 million instructions per second (MIPS) has had much to do with this. It doesn’t mean that C is the language of choice for all programmers. What it does mean is that most programmers should have a C compiler available for their machines, and most are probably regularly exposed to C code. Because of this, many programmers who use other languages can still manage to code in C, and even more can at least read C. A second reason for using C is that it is a language without too many surprises. The few constructs it uses as basic language elements are easily translated to other languages. So a data-compression program that is illustrated using C can be converted to a working Pascal program through a relatively straightforward translation procedure. Even assembly-language programmers should find the process relatively painless.
Slide 12: Perhaps the most important reason for using C is simply one of efficiency. C is often thought of as a high-level assembly language, since it allows programmers to get close to the hardware. Despite the increasing optimization found in recent C compilers, it is not likely that C will ever exceed the speed or size possible in hand-coded assembly language. That flaw is offset, however, by the ability to easily port C code to other machines. So for a book of this type, C is probably the most efficient choice.
Which C?
Despite being advertised as a “portable” language, a C program that compiles and executes on a given machine is not guaranteed to run on any other. It may not even compile using a different compiler on the same machine. The important thing to remember is not that C is portable, but that it can be portable. The code for this book has been written to be portable, and it compiles and runs cleanly using several compilers and environments. The compilers/environments used here include: • • • • • • Microsoft Visual C++ 1.5, MS-DOS 5.0/6.22 Borland C++ 4.0-4.5, MS-DOS 5.0/6.22 Symantec C++ 6.0-7.0, MS-DOS 5.0/6.22 Interactive Unix System 3.2 with the portable C compiler Solaris 2.4 with SunSoft compiler Linux 1.1 with the GNU C compiler
Issues in Writing Portable C
One important portability issue is library function calls. Though the C programming language was fairly well defined by the original K&R book (Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language [Englewood Cliffs, NJ.: Prentice-Hall, 1978]), the run-time library implementation was left totally up to the whims of the implementor. Fortunately, the American National Standards Institute was able to complete the C language specification in 1990, and the result was published as ANSI standard XJ11.34. This standard not only expanded and pinned down the original K&R language specification, but it also took on the definition of a standard C run-time library. This makes it much easier to write code that works the same way from machine to machine. The code in this book will be written with the intention of using only ANSI C library calls. Compiler-dependent extensions to either the language or the library will be avoided wherever possible. Given the standardization of the libraries, the remaining portability issues center around two things: sizes of the basic data types and dealing with noncompliant compilers. The majority of data-type conflicts arise when switching between 16- and 32-bit machines. Fortunately, it is fairly easy to manage the change between 16- and 32-bit machines. Though the basic integer data type switches between 16- and 32-bits, both machines have a 16-bit “short int” data type. Once again, a “long int” is generally 32 bits on both
Slide 13: machines. So in cases where the size of an integer clearly matters, it can be pinned down to either 16-or 32-bits with the appropriate declaration. On the vast majority of machines used in the world today, the C compiler implementation of the “char” data type is 8 bits wide. In this book, we will gloss over the possibility that any other size exists and stick with 8-bit characters. In general, porting a program shown here to a machine with an unusual char size is not too difficult, but spending too much time on it will obscure the important point of the programs here, which is data compression. The final issue to deal with when writing portable code is the problem of noncompliant compilers. In the MS-DOS world, most C compilers undergo major releases and upgrades every two years or so. This means that most compiler vendors have been able to release new versions of their compilers that now conform closely to the ANSI C standard. But this is not the case for users of many other operating systems. In particular, UNIX users will frequently be using a C compiler which came with their system and which conforms to the older K&R language definition. While the ANSI C committee went to great lengths to make ANSI C upwardly compatible from K&R C, we need to watch out for a few problems. The first problem lies in the use of function prototypes. Under K&R C, function prototypes were generally used only when necessary. The compiler assumed that any unseen function returned an integer, and it accepted this without complaint. If a function returned something unusual—a pointer or a long, for instance—the programmer would write a function prototype to inform the compiler.
long locate_string();
Here, the prototype told the compiler to generate code that assumes that the function returned a long instead of an int. Function prototypes didn’t have much more use than that. Because of this, many C programmers working under a K&R regime made little or no use of function prototypes, and their appearance in a program was something of an oddity. While the ANSI C committee tried not to alter the basic nature of C, they were unable to pass up the potential improvements to the language that were possible through the expansion of the prototyping facility. Under ANSI C, a function prototype defines not only the return type of a function, but also the type of all the arguments as well. The function shown earlier, for example, might have the following prototype with an ANSI C compiler:
long locate_string( FILE *input_file, char *string );
This lets the compiler generate the correct code for the return type and check for the correct type and number of arguments as well. Since passing the wrong type or number of arguments to a function is a major source of programmer error in C, the committee
Slide 14: correctly assumed that allowing this form of type checking constituted a step forward for C. Under many ANSI C compilers, use of full ANSI function prototypes is strongly encouraged. In fact, many compilers will generate warning messages when a function is used without previously encountering a prototype. This is well and good, but the same function prototypes will not work on a trusty portable C compiler under UNIX. The solution to this dilemma is not pretty, but it works. Under ANSI C, the predefined macro ___STDC___ is always defined to indicate that the code is being compiled through a presumably ANSI-compliant compiler. We can let the preprocessor turn certain sections of our header files on or off, depending on whether we are using a noncompliant compiler or not. A header file containing the prototypes for a bit-oriented package, for example, might look something like this:
#ifdef ___STDC___ FILE *open_bitstream( char *file_name, char *mode ); void close_bitstream( FILE *bitstream ); int read_bit( FILE*bitstream ); int write_bit( FILE *bitstream, int bit ); #else FILE *open_bitstream(); void close_bitstream(); int read_bit(); int write_bit(); #endif
The preprocessor directives don’t contribute much to the look of the code, but they are a necessary part of writing portable programs. Since the programs in this book are supposed to be compiled with the compiler set to its maximum possible warning level, a few “#ifdef” statements will be part of the package. A second problem with the K&R family of C compilers lies in the actual function body. Under K&R C, a particular function might have a definition like the one below.
int foo( c ) char c; { /* Function body */ }
The same function written using an ANSI C function body would look like this:
int foo( char c ) { /* Function body */ }
Slide 15: These two functions may look the same, but ANSI C rules require that they be treated differently. The K&R function body will have the compiler “promote” the character argument to an integer before using it in the function body, but the ANSI C function body will leave it as a character. Promoting one integral type to another lets lots of sneaky problems slip into seemingly well-written code, and the stricter compilers will issue warnings when they detect a problem of this nature. Since K&R compilers will not accept the second form of a function body, be careful when defining character arguments to functions. Unfortunately, the solutions are once again either to not use character arguments or to resort to more of the ugly “#ifdef” preprocessor baggage.
Keeping Score
Throughout this book, there will be references to “compression ratios” and compression statistics. To keep the various forms of compression on a level playing field, compression statistics will always be in relationship to the sample compression files used in the February 1991 Dr. Dobb’s Journal compression contest. These files consist of about 6 megabytes of data broken down into three roughly equal categories. The first category is text, consisting of manuscripts, programs, memos, and other readable files. The second category consists of binary data, including database files, executable files, and spreadsheet data. The third category consists of graphics files stored in raw screen-dump formats. The programs created and discussed in this book will be judged by three rough measures of performance. The first will be the amount of memory consumed by the program during compression; this number will be approximated as well as it can be. The second will be the amount of time the program takes to compress the entire Dr. Dobb’s dataset. The third will be the compression ratio of the entire set. Different people use different formulas to calculate compression ratios. Some prefer bits/bytes. Other use ratios, such as 2:1 or 3:1 (advertising people seem to like this format). In this book, we will use a simple compression-percentage formula:
( 1 - ( compressed_size / raw_size ) ) * 100
This means that a file that doesn’t change at all when compressed will have a compression ratio of 0 percent. A file compressed down to one-third of its original size will have a compression ratio of 67 percent. A file that shrinks down to 0 bytes (!) will have a compression ratio of 100 percent. This way of measuring compression may not be perfect, but it shows perfection at 100 percent and total failure at 0 percent. In fact, a file that goes through a compression program and comes out larger will show a negative compression ratio.
Slide 16: The Structure
This book consists of thirteen chapters and a floppy disk. The organization roughly parallels the historical progression of data compression, starting in the “dawn age” around 1950 and working up to the present. Chapter 2 is a reference chapter which attempts to establish the fundamental datacompression lexicon. It discusses the birth of information theory, and it introduces a series of concepts, terms, buzzwords, and theories used over and over in the rest of the book. Even if you are a data-compression novice, mastery of chapter 2 will bring you up to the “cocktail party” level of information, meaning that you will be able to carry on an intelligent-sounding conversation about data compression even if you don’t fully understand its intricacies. Chapter 3 discusses the birth of data compression, starting with variable-length bit coding. The development of Shannon-Fano coding and Huffman coding represented the birth of both data compression and information theory. These coding methods are still in wide use today. In addition, chapter 3 discusses the difference between modeling and coding—the two faces of the data-compression coin. Standard Huffman coding suffers from a significant problem when used for highperformance data compression. The compression program has to pass a complete copy of the Huffman coding statistics to the expansion program. As the compression program collects more statistics and tries to increase its compression ratio, the statistics take up more space and work against the increased compression. Chapter 4 discusses a way to solve this dilemma: adaptive Huffman coding. This is a relatively recent innovation, due to CPU and memory requirements. Adaptive coding greatly expands the horizons of Huffman coding, leading to vastly improved compression ratios. Huffman coding has to use an integral number of bits for each code, which is usually slightly less than optimal. A more recent innovation, arithmetic coding, uses a fractional number of bits per code, allowing it to incrementally improve compression performance. Chapter 5 explains how this recent innovation works, and it shows how to integrate an arithmetic coder with a statistical model. Chapter 6 discusses statistical modeling. Whether using Huffman coding, adaptive Huffman coding, or arithmetic coding, it is still necessary to have a statistical model to drive the coder. This chapter shows some of the interesting techniques used to implement powerful models using limited memory resources. Dictionary compression methods take a completely different approach to compression from the techniques discussed in the previous four chapters. Chapter 7 provides an overview of these compression methods, which represent strings of characters with single codes. Dictionary methods have become the de facto standard for general-purpose data compression on small computers due to their high-performance compression combined with reasonable memory requirements.
Slide 17: The fathers of dictionary-based compression, Ziv and Lempel published a paper in 1977 proposing a sliding dictionary methods of data compression which has become very popular. Chapter 8 looks at recent adaptations of LZ77 compression used in popular archiving programs such as PKZIP. Chapter 9 takes detailed look at one of the first widely popular dictionary-based compression methods: LZW compression. LZW is the compression method used in the UNIX COMPRESS program and in earlier versions of the MS-DOS ARC program. This chapter also takes a look at the foundation of LZW compression, published in 1978 by Ziv and Lempel. All of the compression techniques discussed through chapter 9 are “lossless.” Lossy methods can be used on speech and graphics, and they are capable of achieving dramatically higher compression ratios. Chapter 10 shows how lossy compression can be used on digitized sound data which techniques like linear predictive coding and adaptive PCM. Chapter 11 discusses lossy compression techniques applied to computer graphics. The industry is standardizing rapidly on the JPEG standard for compressing graphical images. The techniques used in the JPEG standard will be presented in this chapter. Chapter 12 describes how to put it all together into an archive program. A generalpurpose archiving program should be able to compress and decompress files while keeping track of files names, dates, attributes, compression ratios, and compression methods. An archive format should ideally be portable to different types of computers. A sample archive program is developed, which applies the techniques used in previous chapters to put together a complete program. Chapter 13 is a detailed look at fractal compression techniques. The world of fractal compression offers some exciting methods of achieving maximum compression for your data.
Slide 18: Chapter 2 The Data-Compression Lexicon, with a History
Like any other scientific or engineering discipline, data compression has a vocabulary that at first seem overwhelmingly strange to an outsider. Terms like Lempel-Ziv compression, arithmetic coding, and statistical modeling get tossed around with reckless abandon. While the list of buzzwords is long enough to merit a glossary, mastering them is not as daunting a project as it may first seem. With a bit of study and a few notes, any programmer should hold his or her own at a cocktail-party argument over datacompression techniques.
The Two Kingdoms
Data-compression techniques can be divided into two major families; lossy and lossless. Lossy data compression concedes a certain loss of accuracy in exchange for greatly increased compression. Lossy compression proves effective when applied to graphics images and digitized voice. By their very nature, these digitized representations of analog phenomena are not perfect to begin with, so the idea of output and input not matching exactly is a little more acceptable. Most lossy compression techniques can be adjusted to different quality levels, gaining higher accuracy in exchange for less effective compression. Until recently, lossy compression has been primarily implemented using dedicated hardware. In the past few years, powerful lossy-compression programs have been moved to desktop CPUs, but even so the field is still dominated by hardware implementations.
Slide 19: Lossless compression consists of those techniques guaranteed to generate an exact duplicate of the input data stream after a compress/expand cycle. This is the type of compression used when storing database records, spreadsheets, or word processing files. In these applications, the loss of even a single bit could be catastrophic. Most techniques discussed in this book will be lossless.
Data Compression = Modeling + Coding
In general, data compression consists of taking a stream of symbols and transforming them into codes. If the compression is effective, the resulting stream of codes will be smaller than the original symbols. The decision to output a certain code for a certain symbol or set of symbols is based on a model. The model is simply a collection of data and rules used to process input symbols and determine which code(s) to output. A program uses the model to accurately define the probabilities for each symbol and the coder to produce an appropriate code based on those probabilities. Modeling and coding are two distinctly different things. People frequently use the term coding to refer to the entire data-compression process instead of just a single component of that process. You will hear the phrases “Huffman coding” or “Run-Length Encoding,” for example, to describe a data-compression technique, when in fact they are just coding methods used in conjunction with a model to compress data. Using the example of Huffman coding, a breakdown of the compression process looks something like this:
Figure 2.1 A Statistical Model with a Huffman Encoder. In the case of Huffman coding, the actual output of the encoder is determined by a set of probabilities. When using this type of coding, a symbol that has a very high probability of occurrence generates a code with very few bits. A symbol with a low probability generates a code with a larger number of bits. We think of the model and the program’s coding process as different because of the countless ways to model data, all of which can use the same coding process to produce their output. A simple program using Huffman coding, for example, would use a model that gave the raw probability of each symbol occurring anywhere in the input stream. A more sophisticated program might calculate the probability based on the last 10 symbols in the input stream. Even though both programs use Huffman coding to produce their output, their compression ratios would probably be radically different.
Slide 20: So when the topic of coding methods comes up at your next cocktail party, be alert for statements like “Huffman coding in general doesn’t produce very good compression ratios.” This would be your perfect opportunity to respond with “That’s like saying Converse sneakers don’t go very fast. I always thought the leg power of the runner had a lot to do with it.” If the conversation has already dropped to the point where you are discussing data compression, this might even go over as a real demonstration of wit.
The Dawn Age
Data compression is perhaps the fundamental expression of Information Theory. Information Theory is a branch of mathematics that had its genesis in the late 1940s with the work of Claude Shannon at Bell Labs. It concerns itself with various questions about information, including different ways of storing and communicating messages. Data compression enters into the field of Information Theory because of its concern with redundancy. Redundant information in a message takes extra bit to encode, and if we can get rid of that extra information, we will have reduced the size of the message. Information Theory uses the term entropy as a measure of how much information is encoded in a message. The word entropy was borrowed from thermodynamics, and it has a similar meaning. The higher the entropy of a message, the more information it contains. The entropy of a symbol is defined as the negative logarithm of its probability. To determine the information content of a message in bits, we express the entropy using the base 2 logarithm:
Number of bits = - Log base 2 (probability)
The entropy of an entire message is simply the sum of the entropy of all individual symbols. Entropy fits with data compression in its determination of how many bits of information are actually present in a message. If the probability of the character ‘e’ appearing in this manuscript is 1/16, for example, the information content of the character is four bits. So the character string “eeeee” has a total content of 20 bits. If we are using standard 8-bit ASCII characters to encode this message, we are actually using 40 bits. The difference between the 20 bits of entropy and the 40 bits used to encode the message is where the potential for data compression arises. One important fact to note about entropy is that, unlike the thermodynamic measure of entropy, we can use no absolute number for the information content of a given message. The problem is that when we calculate entropy, we use a number that gives us the probability of a given symbol. The probability figure we use is actually the probability for a given model, not an absolute number. If we change the model, the probability will change with it.
Slide 21: How probabilities change can be seen clearly when using different orders with a statistical model. A statistical model tracks the probability of a symbol based on what symbols appeared previously in the input stream. The order of the model determines how many previous symbols are taken into account. An order-0 model, for example, won’t look at previous characters. An order-1 model looks at the one previous character, and so on. The different order models can yield drastically different probabilities for a character. The letter ‘u’ under an order-0 model, for example, may have only a 1 percent probability of occurrence. But under an order-1 model, if the previous character was ‘q,’ the ‘u’ may have a 95 percent probability. This seemingly unstable notion of a character’s probability proves troublesome for many people. They prefer that a character have a fixed “true” probability that told what the chances of its “really” occurring are. Claude Shannon attempted to determine the true information content of the English language with a “party game” experiment. He would uncover a message concealed from his audience a single character at a time. The audience guessed what the next character would be, one guess at a time, until they got it right. Shannon could then determine the entropy of the message as a whole by taking the logarithm of the guess count. Other researchers have done more experiments using similar techniques. While these experiments are useful, they don’t circumvent the notion that a symbol’s probability depends on the model. The difference with these experiments is that the model is the one kept inside the human brain. This may be one of the best models available, but it is still a model, not an absolute truth. In order to compress data well, we need to select models that predict symbols with high probabilities. A symbol that has a high probability has a low information content and will need fewer bits to encode. Once the model is producing high probabilities, the next step is to encode the symbols using an appropriate number of bits.
Coding
Once Information Theory had advanced to where the number of bits of information in a symbol could be determined, the next step was to develop new methods for encoding information. To compress data, we need to encode symbols with exactly the number of bits of information the symbol contains. If the character ‘e’ only gives us four bits of information, then it should be coded with exactly four bits. If ‘x’ contains twelve bits, it should be coded with twelve bits. By encoding characters using EBCDIC or ASCII, we clearly aren’t going to be very close to an optimum method. Since every character is encoded using the same number of bits, we introduce lots of error in both directions, with most of the codes in a message being too long and some being too short.
Slide 22: Solving this coding problem in a reasonable manner was one of the first problems tackled by practitioners of Information Theory. Two approaches that worked well were ShannonFano coding and Huffman coding—two different ways of generating variable-length codes when given a probability table for a given set of symbols. Huffman coding, named for its inventor D.A. Huffman, achieves the minimum amount of redundancy possible in a fixed set of variable-length codes. This doesn’t mean that Huffman coding is an optimal coding method. It means that it provides the best approximation for coding symbols when using fixed-width codes. The problem with Huffman or Shannon-Fano coding is that they use an integral number of bits in each code. If the entropy of a given character is 2.5 bits, the Huffman code for that character must be either 2 or 3 bits, not 2.5. Because of this, Huffman coding can’t be considered an optimal coding method, but it is the best approximation that uses fixed codes with an integral number of bits. Here is a sample of Huffman codes:
Symbol E T A I … X Q Z
Huffman Code 100 101 1100 11010 01101111 01101110001 01101110000
An Improvement
Though Huffman coding is inefficient due to using an integral number of bits per code, it is relatively easy to implement and very economical for both coding and decoding. Huffman first published his paper on coding in 1952, and it instantly became the mostcited paper in Information Theory. It probably still is. Huffman’s original work spawned numerous minor variations, and it dominated the coding world till the early 1980s. As the cost of CPU cycles went down, new possibilities for more efficient coding techniques emerged. One in particular, arithmetic coding, is a viable successor to Huffman coding. Arithmetic coding is somewhat more complicated in both concept and implementation than standard variable-width codes. It does not produce a single code for each symbol. Instead, it produces a code for an entire message. Each symbol added to the message
Slide 23: incrementally modifies the output code. This is an improvement because the net effect of each input symbol on the output code can be a fractional number of bits instead of an integral number. So if the entropy for character ‘e’ is 2.5 bits, it is possible to add exactly 2.5 bits to the output code. An example of why this can be more effective is shown in the following table, the analysis of an imaginary message. In it, Huffman coding would yield a total message length of 89 bits, but arithmetic coding would approach the true information content of the message, or 83.56 bits. The difference in the two messages works out to approximately 6 percent. Here are some sample message probabilities:
Symbol
Number of Occurrences
Information Content
Huffman Code Bit Count 1 bits 2 bits 3 bits 4 bits 4 bits
Total Bits Huffman Coding 20 40 9 12 8 89
Total Bits Arithmetic Coding 25.2 25.2 12.0 12.0 9.16 83.56
E A X Y Z
20 20 3 3 2
1.26 bits 1.26 bits 4.00 bits 4.00 bits 4.58 bits
The problem with Huffman coding in the above message is that it can’t create codes with the exact information content required. In most cases it is a little above or a little below, leading to deviations from the optimum. But arithmetic coding gets to within a fraction of a percent of the actual information content, resulting in more accurate coding. Arithmetic coding requires more CPU power than was available until recently. Even now it will generally suffer from a significant speed disadvantage when compared to older coding methods. But the gains from switching to this method are significant enough to ensure that arithmetic coding will be the coding method of choice when the cost of storing or sending information is high enough.
Modeling
If we use a an automotive metaphor for data compression, coding would be the wheels, but modeling would be the engine. Regardless of the efficiency of the coder, if it doesn’t have a model feeding it good probabilities, it won’t compress data. Lossless data compression is generally implemented using one of two different types of modeling: statistical or dictionary-based. Statistical modeling reads in and encodes a single symbol at a time using the probability of that character’s appearance. Dictionarybased modeling uses a single code to replace strings of symbols. In dictionary-based
Slide 24: modeling, the coding problem is reduced in significance, leaving the model supremely important.
Statistical Modeling
The simplest forms of statistical modeling use a static table of probabilities. In the earliest days of information theory, the CPU cost of analyzing data and building a Huffman tree was considered significant, so it wasn’t frequently performed. Instead, representative blocks of data were analyzed once, giving a table of character-frequency counts. Huffman encoding/decoding trees were then built and stored. Compression programs had access to this static model and would compress data using it. But using a universal static model has limitations. If an input stream doesn’t match well with the previously accumulated statistics, the compression ratio will be degraded— possibly to the point where the output stream becomes larger than the input stream. The next obvious enhancement is to build a statistics table for every unique input stream. Building a static Huffman table for each file to be compressed has its advantages. The table is uniquely adapted to that particular file, so it should give better compression than a universal table. But there is additional overhead since the table (or the statistics used to build the table) has to be passed to the decoder ahead of the compressed code stream. For an order-0 compression table, the actual statistics used to create the table may take up as little as 256 bytes—not a very large amount of overhead. But trying to achieve better compression through use of a higher order table will make the statistics that need to be passed to the decoder grow at an alarming rate. Just moving to an order 1 model can boost the statistics table from 256 to 65,536 bytes. Though compression ratios will undoubtedly improve when moving to order-1, the overhead of passing the statistics table will probably wipe out any gains. For this reason, compression research in the last 10 years has concentrated on adaptive models. When using an adaptive model, data does not have to be scanned once before coding in order to generate statistics. Instead, the statistics are continually modified as new characters are read in and coded. The general flow of a program using an adaptive model looks something like that shown in Figures 2.2 and 2.3.
Figure 2.2 General Adaptive Compression.
Slide 25: Figure 2.3 General Adaptive Decompression. The important point in making this system work is that the box labeled “Update Model” has to work exactly the same way for both the compression and decompression programs. After each character (or group of characters) is read in, it is encoded or decoded. Only after the encoding or decoding is complete can the model be updated to take into account the most recent symbol or group of symbols. One problem with adaptive models is that they start knowing essentially nothing about the data. So when the program first starts, it doesn’t do a very good job of compression. Most adaptive algorithms tend to adjust quickly to the data stream and will begin turning in respectable compression ratios after only a few thousand bytes. Likewise, it doesn’t take long for the compression-ratio curve to flatten out so that reading in more data doesn’t improve the compression ratio. One advantage that adaptive models have over static models is the ability to adapt to local conditions. When compressing executable files, for example, the character of the input data may change drastically as the program file changes from binary program code to binary data. A well-written adaptive program will weight the most recent data higher than old data, so it will modify its statistics to better suit changed data.
Dictionary Schemes
Statistical models generally encode a single symbol at a time— reading it in, calculating a probability, then outputting a single code. A dictionary-based compression scheme uses a different concept. It reads in input data and looks for groups of symbols that appear in a dictionary. If a string match is found, a pointer or index into the dictionary can be output instead of the code for the symbol. The longer the match, the better the compression ratio. This method of encoding changes the focus of dictionary compression. Simple coding methods are generally used, and the focus of the program is on the modeling. In LZW compression, for example, simple codes of uniform width are used for all substitutions. A static dictionary is used like the list of references in an academic paper. Through the text of a paper, the author may simply substitute a number that points to a list of references instead of writing out the full title of a referenced work. The dictionary is static because it is built up and transmitted with the text of work—the reader does not
Slide 26: have to build it on the fly. The first time I see a number in the text like this—[2]—I know it points to the static dictionary. The problem with a static dictionary is identical to the problem the user of a statistical model faces: The dictionary needs to be transmitted along with the text, resulting in a certain amount of overhead added to the compressed text. An adaptive dictionary scheme helps avoid this problem. Mentally, we are used to a type of adaptive dictionary when performing acronym replacements in technical literature. The standard way to use this adaptive dictionary is to spell out the acronym, then put its abbreviated substitution in parentheses. So the first time I mention the Massachusetts Institute of Technology (MIT), I define both the dictionary string and its substitution. From then on, referring to MIT in the text should automatically invoke a mental substitution.
Ziv and Lempel
Until 1980, most general-compression schemes used statistical modeling. But in 1977 and 1978, Jacob Ziv and Abraham Lempel described a pair of compression methods using an adaptive dictionary. These two algorithms sparked a flood of new techniques that used dictionary-based methods to achieve impressive new compression ratios.
LZ77
The first compression algorithm described by Ziv and Lempel is commonly referred to as LZ77. It is relatively simple. The dictionary consists of all the strings in a window into the previously read input stream. A file-compression program, for example, could use a 4K-byte window as a dictionary. While new groups of symbols are being read in, the algorithm looks for matches with strings found in the previous 4K bytes of data already read in. Any matches are encoded as pointers sent to the output stream. LZ77 and its variants make attractive compression algorithms. Maintaining the model is simple; encoding the output is simple; and programs that work very quickly can be written using LZ77. Popular programs such as PKZIP and LHarc use variants of the LZ77 algorithm, and they have proven very popular.
LZ78
The LZ78 program takes a different approach to building and maintaining the dictionary. Instead of having a limited-size window into the preceding text, LZ78 builds its dictionary out of all of the previously seen symbols in the input text. But instead of having carte blanche access to all the symbol strings in the preceding text, a dictionary of strings is built a single character at a time. The first time the string “Mark” is seen, for example, the string “Ma” is added to the dictionary. The next time, “Mar” is added. If “Mark” is seen again, it is added to the dictionary.
Slide 27: This incremental procedure works very well at isolating frequently used strings and adding them to the table. Unlike LZ77 methods, strings in LZ78 can be extremely long, which allows for high-compression ratios. LZ78 was the first of the two Ziv-Lempel algorithms to achieve popular success, due to the LZW adaptation by Terry Welch, which forms the core of the UNIX compress program.
Lossy Compression
Until recently, lossy compression has been primarily performed on special-purpose hardware. The advent of inexpensive Digital Signal Processor (DSP) chips began lossy compression’s move off the circuit board and onto the desktop. CPU prices have now dropped to where it is becoming practical to perform lossy compression on generalpurpose desktop PCs. Lossy compression is fundamentally different from lossless compression in one respect: it accepts a slight loss of data to facilitate compression. Lossy compression is generally done on analog data stored digitally, with the primary applications being graphics and sound files. This type of compression frequently makes two passes. A first pass over the data performs a high-level, signal-processing function. This frequently consists of transforming the data into the frequency domain, using algorithms similar to the wellknown Fast Fourier Transform (FFT). Once the data has been transformed, it is “smoothed,” rounding off high and low points. Loss of signal occurs here. Finally, the frequency points are compressed using conventional lossless techniques. The smoothing function that operates on the frequency-domain data generally has a “quality factor” built into it that determines just how much smoothing occurs. The more the data is massaged, the greater the signal loss—and more compression will occur. In the small systems world, a tremendous amount of work is being done on graphical image compression, both for still and moving pictures. The International Standards Organization (ISO) and the Consultive Committee for International Telegraph and Telephone (CCITT) have banded together to form two committees: The Joint Photographic Experts Group (JPEG) and the Moving Pictures Expert Group (MPEG). The JPEG committee has published its compression standard, and many vendors are now shipping hardware and software that are JPEG compliant. The MPEG committee completed an intial moving picture compression standard, and is finalizing a second, MPEG-II. The JPEG standard uses the Discrete Cosine Transform (DCT) algorithm to convert a graphics image to the frequency domain. The DCT algorithm has been used for graphics transforms for many years, so efficient implementations are readily available. JPEG specifies a quality factor of 0 to 100, and it lets the compressor determine what factor to select.
Slide 28: Using the JPEG algorithm on images can result in dramatic compression ratios. With little or no degradation, compression ratios of 90–95 percent are routine. Accepting minor degradation achieves ratios as high as 98–99 percent. Software implementations of the JPEG and MPEG algorithms are still struggling to achieve real-time performance. Most multimedia development software that uses this type of compression still depends on the use of a coprocessor board to make the compression take place in a reasonable amount of time. We are probably only a few years away from software-only real-time compression capabilities.
Programs to Know
General-purpose data-compression programs have been available only for the past ten years or so. It wasn’t until around 1980 that machines with the power to do the analysis needed for effective compression started to become commonplace. In the Unix world, one of the first general-purpose compression programs was COMPACT. COMPACT is a relatively straightforward implementation of an order-0 compression program that uses adaptive Huffman coding. COMPACT produced good enough compression to make it useful, but it was slow. COMPACT was also a proprietary product, so it was not available to all Unix users. Compress, a somewhat improved program, became available to Unix users a few years later. It is a straightforward implementation of the LZW dictionary-based compression scheme. compress gave significantly better compression than COMPACT, and it ran faster. Even better, the source code to a compress was readily available as a publicdomain program, and it proved quite portable. compress is still in wide use among UNIX users, though its continued use is questionable due to the LZW patent held by Unisys. In the early 1980s, desktop users of CP/M and MS-DOS systems were first exposed to data compression through the SQ program. SQ performed order-0 compression using a static Huffman tree passed in the file. SQ gave compression comparable to that of the COMPACT program, and it was widely used by early pioneers in desktop telecommunications. As in the Unix world, Huffman coding soon gave way to LZW compression with the advent of ARC. ARC is a general-purpose program that performs both file compression and archiving, two features that often go hand in hand. (Unix users typically archive files first using TAR, then they compress the entire archive.) ARC could originally compress files using run-length encoding, order-0 static Huffman coding, or LZW compression. The original LZW code for ARC appears to be a derivative of the Unix compress code. Due to the rapid distribution possible using shareware and telecommunications, ARC quickly became a de facto standard and began spawning imitators right and left. ARC underwent many revisions but has faded in popularity in recent years. Today, if there is a
Slide 29: compression standard in the DOS world, it is the shareware program PKZIP, written by Phil Katz. PKZIP is a relatively inexpensive program that offers both superior compression ratios and compression speed. At this writing, the current shareware version is PKZip V2.04g and can be found on many bulletin boards and online forums. Katz’s company, PKWare, also sells a commercial version. Note that V2.04g of PKZIP can create ZIP files that are not backward compatible with previous versions. On Compuserve, many forums have switched to the new format for files kept in the forum libraries. Usually, a copy of the distribution PKZ204.EXE is also found in the forum library. For example, you can find this file on 23 different forums on Compuserve. Because Phil Katz has placed the file format in the public domain, there are many other archiving/compression utilities that support the ZIP format. A search on Compuserve, using the File Finder facility on the keyword "PKZIP" resulted in 580 files found, most of which were utilities rather than data files. Programs like WinZIP, that integrate with the Windows File Manager, provide a modern interface to a venerable file format. In DOS, two strong alternatives to PKZIP are LHArc and ARJ. LHARC comes from Japan, and has several advantages over other archiving/compression programs. First, the source to LHArc is freely available and has been ported to numerous operating systems and hardware platforms. Second, the author of LHarc, Haruyasu Yoshizaki (Yoshi), has explicitly granted the right to use his program for any purpose, personal or commercial. ARJ is a program written by Robert Jung (robjung@world.std.com) and is free for noncommercial use. It has managed to achieve compression ratios slightly better than the best LHArc can offer. It is available for DOS, Windows, Amiga, MAC, OS/2, and includes source code. On the Macintosh platform, there are also many archiving/compression programs which support file formats found on DOS and Unix. In addition to LHArc and ARJ, there are programs like ZipIt V1.2 lets you work with ZIP files. However, the predominant archiving/compression program is StuffIt, a shareware program written by Raymond Lau. On bulletin boards and online services that are geared to Macintosh users, you will find more SIT files (StuffIt files) than any other format. Another popular Macintosh format is CPT (created by Compact-Pro program) but it is not as widespread as StuffIt. In general, the trend is toward greater interoperability among platforms and formats. Jeff Gilchrist (jeffg@mi.net) distributes a monthly Archive Comparison Test (ACT) that compares sixty different DOS programs for speed and efficiency, working on a variety of files (text, binary executables, graphics). If you have Internet access, you can view the current copy of ACT by fingering: s0b8@jupiter.sun.csd.unb.ca. You can also view ACT using the World-Wide Web at http://www.mi.net/act/act.html. At this writing, one promising new archiver on Gilchrist’s ACT list is X1, written by Stig Valentini (sv@id.dtu.dk). The current version is 0.90, still in beta stage. This program supports thirteen different archive formats, include: ZIP, LHA, ARJ, HA, PUT, TAR+GZIP(TGZ), and ZOO.
Slide 30: As mentioned earlier, you can find archive programs on Compuserve, America Online and other online services and bulletin boards. On the Internet, there are several ftp repositories. One is at oak.oakland.edu (in the directory /SimTel/msdos/archiver). Another is garbo.uwasa.fi, in the directory /pc/arcers.
Chapter 3 The Dawn Age: Minimum Redundancy Coding
In the late 1940s, the early years of Information Theory, the idea of developing efficient new coding techniques was just starting to be fleshed out. Researchers were exploring the ideas of entropy, information content, and redundancy. One popular notion held that if the probability of symbols in a message were known, there ought to be a way to code the symbols so that the message would take up less space. Remarkably, this early work in data compression was being done before the advent of the modern digital computer. Today it seems natural that information theory goes hand in hand with computer programming, but just after World War II, for all practical purposes, there were no digital computers. So the idea of developing algorithms using base 2 arithmetic for coding symbols was really a great leap forward. The first well-known method for effectively coding symbols is now known as ShannonFano coding. Claude Shannon at Bell Labs and R.M. Fano at MIT developed this method nearly simultaneously. It depended on simply knowing the probability of each symbol’s appearance in a message. Given the probabilities, a table of codes could be constructed that has several important properties: • Different codes have different numbers of bits. • Codes for symbols with low probabilities have more bits, and codes for symbols with high probabilities have fewer bits. • Though the codes are of different bit lengths, they can be uniquely decoded.
Slide 31: The first two properties go hand in hand. Developing codes that vary in length according to the probability of the symbol they are encoding makes data compression possible. And arranging the codes as a binary tree solves the problem of decoding these variable-length codes. An example of the type of decoding tree used in Shannon-Fano coding is shown below. Decoding an incoming code consists of starting at the root, then turning left or right at each node after reading an incoming bit from the data stream. Eventually a leaf of the tree is reached, and the appropriate symbol is decoded. Figure 3.1 is a Shannon-Fano tree designed to encode or decode a simple five-symbol alphabet consisting of the letters A through E. Walking through the tree yields the code table:
Symbol A B C D E
Code 00 01 10 110 111
Figure 3.1 A simple Shannon-Fano tree. The tree structure shows how codes are uniquely defined though they have different numbers of bits. The tree structure seems designed for computer implementations, but it is also well suited for machines made of relays and switches, like the teletype machines of the 1950s. While the table shows one of the three properties discussed earlier, that of having variable numbers of bits, more information is needed to talk about the other two properties. After all, code trees look interesting, but do they actually perform a valuable service?
Slide 32: The Shannon-Fano Algorithm
A Shannon-Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple: 1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol’s relative frequency of occurrence is known. 2. Sort the lists of symbols according to frequency, with the most frequently occuring symbols at the top and the least common at the bottom. 3. Divide the list into two parts, with the total frequency counts of the upper half being as close to the total of the bottom half as possible. 4. The upper half of the list is assigned the binary digit 0, and the lower half is assigned the digit 1. This means that the codes for the symbols in the first half will all start with 0, and the codes in the second half will all start with 1. 5. Recursively apply the steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree. The Shannon-Fano tree shown in Figure 3.1 was developed from the table of symbol frequencies shown next.
Symbol A B C D E
Count 15 7 6 6 5
Putting the dividing line between symbols B and C assigns a count of 22 to the upper group and 17 to the lower, the closest to exactly half. This means that A and B will each have a code that starts with a 0 bit, and C, D, and E are all going to start with a 1 as shown:
Symbol A B C D
Count 15 7 6 6
0 0 First division 1 1
Slide 33: E
5
1
Subsequently, the upper half of the table gets a new division between A and B, which puts A on a leaf with code 00 and B on a leaf with code 01. After four division procedures, a table of codes results. In the final table, the three symbols with the highest frequencies have all been assigned 2-bit codes, and two symbols with lower counts have 3-bit codes as shown next.
Symbol A B C D E
Count 15 7 6 6 5
0 0 First division 1 1 1
0 Second division 1 0 Third division 1 1
0 Fourth division 1
That symbols with the higher probability of occurence have fewer bits in their codes indicates we are on the right track. The formula for information content for a given symbol is the negative of the base two logarithm of the symbol’s probability. For our theoretical message, the information content of each symbol, along with the total number of bits for that symbol in the message, are found in the following table.
Symbol A B C D E
Count 15 7 6 6 5
Info Cont. 1.38 2.48 2.70 2.70 2.96
Info Bits 20.68 17.35 16.20 16.20 14.82
The information for this message adds up to about 85.25 bits. If we code the characters using 8-bit ASCII characters, we would use 39 × 8 bits, or 312 bits. Obviously there is room for improvement.
Slide 34: When we encode the same data using Shannon-Fano codes, we come up with some pretty good numbers, as shown below.
Symbol A B C D E
Count 15 7 6 6 5
Info Cont. 1.38 2.48 2.70 2.70 2.96
Info Bits 20.68 17.35 16.20 16.20 14.82
SF Size 2 2 2 3 3
SF Bits 30 14 12 18 15
With the Shannon-Fano coding system, it takes only 89 bits to encode 85.25 bits of information. Clearly we have come a long way in our quest for efficient coding methods. And while Shannon-Fano coding was a great leap forward, it had the unfortunate luck to be quickly superseded by an even more efficient coding system: Huffman coding.
The Huffman Algorithm
Huffman coding shares most characteristics of Shannon-Fano coding. It creates variablelength codes that are an integral number of bits. Symbols with higher probabilities get shorter codes. Huffman codes have the unique prefix attribute, which means they can be correctly decoded despite being variable length. Decoding a stream of Huffman codes is generally done by following a binary decoder tree. Building the Huffman decoding tree is done using a completely different algorithm from that of the Shannon-Fano method. The Shannon-Fano tree is built from the top down, starting by assigning the most significant bits to each code and working down the tree until finished. Huffman codes are built from the bottom up, starting with the leaves of the tree and working progressively closer to the root. The procedure for building the tree is simple and elegant. The individual symbols are laid out as a string of leaf nodes that are going to be connected by a binary tree. Each node has a weight, which is simply the frequency or probability of the symbol’s appearance. The tree is then built with the following steps: • The two free nodes with the lowest weights are located. • A parent node for these two nodes is created. It is assigned a weight equal to the sum of the two child nodes. • The parent node is added to the list of free nodes, and the two child nodes are removed from the list. • One of the child nodes is designated as the path taken from the parent node when decoding a 0 bit. The other is arbitrarily set to the 1 bit.
Slide 35: • The previous steps are repeated until only one free node is left. This free node is designated the root of the tree. This algorithm can be applied to the symbols used in the previous example. The five symbols in our message are laid out, along with their frequencies, as shown: 15 A 7 B 6 C 6 D 5 E
These five nodes are going to end up as the leaves of the decoding tree. When the process first starts, they make up the entire list of free nodes. The first pass through the tree identifies the two free nodes with the lowest weights: D and E, with weights of 6 and 5. (The tie between C and D was broken arbitrarily. While the way that ties are broken affects the final value of the codes, it will not affect the compression ratio achieved.) These two nodes are joined to a parent node, which is assigned a weight of 11. Nodes D and E are then removed from the free list. Once this step is complete, we know what the least significant bits in the codes for D and E are going to be. D is assigned to the 0 branch of the parent node, and E is assigned to the 1 branch. These two bits will be the LSBs of the resulting codes. On the next pass through the list of free nodes, the B and C nodes are picked as the two with the lowest weight. These are then attached to a new parent node. The parent node is assigned a weight of 13, and B and C are removed from the free node list. At this point, the tree looks like that shown in Figure 3.2.
Figure 3.2 The Huffman tree after two passes. On the next pass, the two nodes with the lowest weights are the parent nodes for the B/C and D/E pairs. These are tied together with a new parent node, which is assigned a weight of 24, and the children are removed from the free list. At this point, we have assigned two bits each to the Huffman codes for B, C, D, and E, and we have yet to assign a single bit to the code for A. Finally, on the last pass, only two free nodes are left. The parent with a weight of 24 is tied with the A node to create a new parent with a weight of 39. After removing the two child nodes from the free list, we are left with just one parent, meaning the tree is complete. The final result looks like that shown in Figure 3.3.
Slide 36: Figure 3.3 The Huffman tree. To determine the code for a given symbol, we have to walk from the leaf node to the root of the Huffman tree, accumulating new bits as we pass through each parent node. Unfortunately, the bits are returned to us in the reverse order that we want them, which means we have to push the bits onto a stack, then pop them off to generate the code. This strategy gives our message the code structure shown in the following table. The Huffman Code Table A 0 B 100 C 101 D 110 E 111 As you can see, the codes have the unique prefix property. Since no code is a prefix to another code, Huffman codes can be unambiguously decoded as they arrive in a stream. The symbol with the highest probability, A, has been assigned the fewest bits, and the symbol with the lowest probability, E, has been assigned the most bits. Note, however, that the Huffman codes differ in length from Shannon-Fano codes. The code length for A is only a single bit, instead of two, and the B and C symbols have 3-bit codes instead of two bits. The following table shows what effect this has on the total number of bits produced by the message.
Symbol
Count
Shannon-Fano Size 2 2 2 3 3
Shannon-Fano Bits 30 14 12 18 15
Huffman Size
Huffman Bits 15 21 18 18 15
A B C D E
15 7 6 6 5
1 3 3 3 3
Slide 37: This adjustment in code size adds 13 bits to the number needed to encode the B and C symbols, but it saves 15 bits when coding the A symbol, for a net savings of 2 bits. Thus, for a message with an information content of 85.25 bits, Shannon-Fano coding requires 89 bits, but Huffman coding requires only 87. In general, Shannon-Fano and Huffman coding are close in performance. But Huffman coding will always at least equal the efficiency of Shannon-Fano coding, so it has become the predominant coding method of its type. Since both algorithms take a similar amount of processing power, it seems sensible to take the one that gives slightly better performance. And Huffman was able to prove that this coding method cannot be improved on with any other integral bit-width coding stream. Since D. A. Huffman first published his 1952 paper, “A Method for the Construction of Minimum Redundancy Codes,” his coding algorithm has been the subject of an overwhelming amount of additional research. Information theory journals to this day carry numerous papers on the implementation of various esoteric flavors of Huffman codes, searching for ever better ways to use this coding method. Huffman coding is used in commercial compression programs, FAX machines, and even the JPEG algorithm. The next logical step in this book is to outline the C code needed to implement the Huffman coding scheme.
Huffman in C
A Huffman coding tree is built as a binary tree, from the leaf nodes up. Huffman may or may not have had digital computers in mind when he developed his code, but programmers use the tree data structure all the time. Two programs used here illustrate Huffman coding. The compressor, HUFF-C, implements a simple order-0 model and a single Huffman tree to encode it. HUFF-E expands files compressed using HUFF-C. Both programs use a few pieces of utility code that will be seen throughout this book. Before we go on the actual Huffman code, here is a quick overview of what some of the utility modules do.
BITIO.C
Data-compression programs perform lots of input/output (I/O) that does reads or writes of unconventional numbers of bits. Huffman coding, for example, reads and writes bits one at a time. LZW programs read and write codes that can range in size from 9 to 16 bits. The standard C I/O library defined in STDIO.H only accommodates I/O on even byte boundaries. Routines like putc() and getc() read and write single bytes, while fread() and fwrite() read and write whole blocks of bytes at a time. The library offers no help for programmers needing a routine to write a single bit at a time. To support this conventional I/O in a conventional way, bit-oriented I/O routines are confined to a single source module, BITIO.C. Access to these routines is provided via a
Slide 38: header file called BITIO.H, which contains a structure definition and several function prototypes. Two routines open files for bit I/O, one for input and one for output. As defined in BITIO.H, they are
BIT_FILE *OpenInputBitFile( char *name ); BIT_FILE *OpenOutputBitFile ( char *name );
These two routines return a pointer to a new structure, BIT_FILE. BIT_FILE is also defined in BITIO.H as shown:
typedef struct bit_file { FILE *file; unsigned char mask; int rack; int pacifier_counter; } BIT_FILE:
OpenInputBitFile() or OpenOutputBitFile() perform a conventional fopen() call and store the returned FILE structure pointer in the BIT_FILE structure. The other two structure elements are initialized to their startup values, and a pointer to the resulting BIT_FILE structure is returned. In BITIO.H, rack contains the current byte of data either read in from the file or waiting to be written out to the file. mask contains a single bit mask used either to set or clear the current output bit or to mask in the current input bit. The two new structure elements, rack and mask, manage the bit-oriented aspect of a most significant bit in the I/O byte gets or returns the first bit, and the least significant bit in the I/O byte gets or returns the last bit. This means that the mask element of the structure is initialized to 0x80 when the BIT_FILE is first opened. During output, the first write to the BIT_FILE will set or clear that bit, then the mask element will shift to the next. Once the mask has shifted to the point at which all the bits in the output rack have been set or cleared, the rack is written out to the file, and a new rack byte is started. Performing input from a BIT_FILE is done in a similar fashion. The mask is first set to 0x80, and a single byte from the file is read into the rack element. Each call to read a bit from the file masks in a new bit, then shifts the mask over to the next lower significant bit. Eventually, all bits in the input rack have been returned, and the input routine can read in a new byte from the input file. Two types of I/O routines are defined in BITIO.C. The first two routines read or write a single bit at a time. The second two read or write multiple bits, up to the size of an unsigned long. These four routines have the following ANSI prototype in BITIO.H:
void void OutputBit( BIT_FILE *bit_file, int bit ); OutputBits( BIT_FILE *bit_file,
Slide 39: int unsigned long
unsigned long code, int count); InputBit( BIT_FILE *bit_file ); InputBits( BIT_FILE *bit_file, int bit_count );
Specialized routines open a BIT_FILE, and two specialized routines close a BIT_FILE. The output routine makes sure that the last byte gets written out to the file. Both the input and output routines need to close their files, then free up the BIT_FILE structure allocated when the file was opened. The BIT_FILE routines used to close a file are defined in BITIO.H with these ANSI prototypes:
void void CloseInputBitFile( BIT_FILE *bit_file ); CloseOutputBitFile( BIT_FILE *bit_file );
The input and output routines in BITIO.H also have a pacifier feature that can be useful in testing compression code. Every BIT_FILE structure has a pacifier_counter that gets incremented every time a new byte is read in or written out to the corresponding file. Once every 2,048 bytes, a single character is written to stdout. This helps assure the impatient user that real work is being done. On MS-DOS systems, it also helps ensure that the user can break out of the program if it does not appear to be working correctly. The header file and code for BITIO.H is shown next:.
/******************** Start of BITIO.H **********************/ #ifndef _BITIO_H #define _BITIO_H #include <stdio.h> typedef struct bit_file { FILE *file; unsigned char mask; int rack; int pacifier_counter; } BIT_FILE; #ifdef __STDC__ BIT_FILE* BIT_FILE* void void OpenInputBitFile( char *name ); OpenOutputBitFile( char *name ); OutputBit( BIT_FILE *bit_file, int bit ); OutputBits( BIT_FILE *bit_file, unsigned long code, int count ); int InputBit( BIT_FILE *bit_file ); unsigned long InputBits( BIT_FILE *bit_file, int bit_count ); void CloseInputBitFile( BIT_FILE *bit_file ); void CloseOutputBitFile( BIT_FILE *bit_file ); void FilePrintBinary( FILE *file, unsigned int code, int bits); #else /* __STDC__ */ BIT_FILE* BIT_FILE* void OpenInputBitFile(); OpenOutputBitFile(); OutputBit();
Slide 40: void OutputBits(); int InputBit(); unsigned long InputBits(); void CloseInputBitFile(); void CloseOutputBitFile(); void FilePrintBinary(); #endif /* __STDC__ */ #endif /* _BITIO_H */ /********************** End of BITIO.H *********************/ /******************** Start of BITIO.C ********************/ /* * This utility file contains all of the routines needed to implement * bit oriented routines under either ANSI or K&R C. It needs to be * linked with every program used in the book. */ #include <stdio.h> #include <stdlib.h> #include "bitio.h" #include "errhand.h" BIT_FILE *OpenOutputBitFile( name ) char *name; { BIT_FILE *bit_file; bit_file = (BIT_FILE *) calloc( 1, sizeof( BIT_FILE ) ); if ( bit_file == NULL ) return( bit_file ); bit_file->file = fopen( name, "rb" ); bit_file->rack = 0; bit_file->mask = 0x80; bit_file->pacifier_counter = 0; return( bit_file );
}
BIT_FILE *OpenInputBitFile( name ) char *name; { BIT_FILE *bit_file; bit_file = (BIT_FILE *) calloc( 1, sizeof( BIT_FILE ) ); if ( bit_file == NULL ) return( bit_file ); bit_file->file = fopen( name, "rb" ); bit_file->rack = 0; bit_file->mask = 0x80; bit_file->pacifier_counter = 0; return( bit_file );
}
void CloseOutputBitFile( bit_file ) BIT_FILE *bit_file;
Slide 41: { if ( bit_file->mask != 0x80 ) if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack ) fatal_error( "Fatal error in CloseBitFile!\n" ); fclose( bit_file->file ); free( (char *) bit_file );
}
void CloseInputBitFile( bit_file ) BIT_FILE *bit_file; { fclose( bit_file->file ); free( (char*) bit_file ); } void OutputBit( bit_file, bit ) BIT_FILE *bit_file; int bit; { if ( bit ) bit_file->rack | = bit_file->mask; bit_file->mask >>= 1; if ( bit_file->mask == 0 ) { if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack ) fatal_error( "Fatal error in OutputBit!\n" ); else if ( ( bit_file->pacifier_counter++ & 4095 ) == 0 ) putc( '.', stdout ); bit_file->rack = 0; bit_file->mask = 0x80; } } void OutputBits( bit_file, code, count ) BIT_FILE *bit_file; unsigned long code; int count; { unsigned long mask; mask = 1L << ( count - 1 ); while ( mask != 0) { if ( mask & code ) bit_file->rack | = bit_file->mask; bit_file->mask >>= 1; if ( bit_file->mask == 0 ) { if ( putc( bit_file->rack, bit_file->file ) != bit_file->rack ) fatal_error( "Fatal error in OutputBit!\n" ); else if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 ) putc( '.', stdout ); bit_file->rack = 0; bit_file->mask = 0x80; } mask >>= 1; }
}
int InputBit( bit_file )
Slide 42: BIT_FILE *bit_file; { int value; if ( bit_file->mask == 0x80 ) { bit_file->rack = getc( bit_file->file ); if ( bit_file->rack == EOF ) fatal_error( "Fatal error in InputBit!\n" ); if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 ) putc( '.', stdout ); } value = bit_file->rack & bit_file->mask; bit_file->mask >>= 1; if ( bit_file->mask = 0 ) bit_file->mask = 0x80; return ( value ? 1 : 0 ); } unsigned long InputBits( bit_file, bit_count ) BIT_FILE *bit_file; int bit_count; { unsigned long mask; unsigned long return_value; mask = 1L << ( bit_count - 1 ); return_value = 0; while ( mask != 0) { if ( bit_file->mask == 0x80 ) { bit_file->rack = getc( bit_file->file ); if ( bit_file->rack == EOF ) fatal_error( "Fatal error in InputBit!\n" ); if ( ( bit_file->pacifier_counter++ & 2047 ) == 0 ) putc( '.', stdout ); } if ( bit_file->rack & bit_file->mask ) return_value |=mask; mask >>= 1; bit_file->mask >>= 1; if ( bit_file->mask = 00 ) bit_file->mask = 0x80;
} void FilePrintBinary( file, code, bits ) FILE *file; unsigned int code; int bits; { unsigned int mask; mask = 1 << ( bits - 1 ): while ( mask != 0 ){ if ( code & mask ) fputc( '1', file ); else
} return( return_value );
Slide 43: }
}
fputc( '0', file); mask >>= 1;
/********************** End of BITIO.C **********************/
A Reminder about Prototypes
The code in this book works on both Unix K&R and the more modern MS-DOS compilers. This affects the code in this book mainly in the area of function parameters in both prototypes and the function body itself. For the function body, all code in this book will use old-fashioned parameter specifications like this:
int main( argc, argv ) int argc; char *argv[]; { ...
This is the only method of parameter declaration acceptable to K&R compilers, and as such it has the blessing of the ANSI standard. A few compilers (Microsoft C 6.0 at Warning Level 4, for example) will issue a warning when it encounters this type of function declaration, so be prepared to ignore those warnings. Declaring function parameters in this method will generally have no effect on code reliability or readability, so using the K&R style should be considered a benign anachronism. Parameters in function declarations present a little more of a problem. The ANSI C specification will accept old style K&R function declarations (such as int main();), but there are good reasons to specify all function arguments in the declaration. When using full prototyping—as in int main( int argc, char *argv[] );—the compiler checks for correct parameter passing when it encounters a call to a function. This helps avoid one of the most commonplace C coding mistakes; incorrect parameter types. To use this prototyping, and at the same time to stay compatible with K&R compilers, all function prototypes are given in two forms: a K&R-compatible prototype and a full ANSI C prototype. The ANSI C prototypes are selected through a check for __STDC__, a predefined macro defined when a compiler conforms to the ANSI C standard. So the prototype for a set of functions in a header file will look something like this:
#ifdef __STDC__ int main( int argc, char *argv[] ); FOO *open_foo( char *name ); #else /* __STDC__ */ int main();
Slide 44: FOO *open_foo(); #endif /* __STDC__ */
This compromise approach definitely hurts readability, and it is probably not the way to go during code development. But once a set of routines is working properly and not likely to be changed, this type of header file will work fine. ANSI C compiler users will find that a problem with this header file crops up with numerous MS-DOS compilers. Compilers such as Microsoft C or Borland C++ are ANSI C compilers, but by default they include a number of language extensions, such as far pointers, alternate calling conventions, and so on. When these language extensions are enabled (as they are by default), __STDC__ is not defined, since the compiler is not operating strictly as an ANSI C compiler. This means that the correct function prototypes will not be invoked. The solution to this problem is to compile the code in this book with the compiler in ANSI C mode. Put the compiler in this mode generally by disabling extensions. Microsoft C accomplishes this from the command line with the /Za switch. Borland C++ uses the -A switch to disable C extensions. To adapt this code for a specific use on a specific compiler, you may want to eliminate the “#ifdef __STDC__” lines in the header file and code. As more and more compilers use ANSI C prototypes and parameter definitions, this portability machinery will become less and less useful.
MAIN-C.C AND MAIN-E.C
Another piece of utility code used throughout this book is the “main()” program for the compression and expansion programs. Any piece of compression code needs to be plugged into a main program that accepts command-line arguments, opens files, calls the compression routines, then closes the files. For simplicity, I have created two versions of this code: one for the compression program (MAIN-C.C) and one for the expansion program (MAIN-E.C). Both MAIN-C.C and MAIN-E.C expect to find a compression or expansion routine in another file, a help routine to explain command-line parameters, and an external string with the name of the compression technique being used. The declarations for the functions and name are found in MAIN.H. MAIN.H should be included in the compression module to ensure that the routines are properly typed. MAIN.H is shown next. The idea behind these two routines is that the infrastructure of a compression test program should not have to be rewritten every time a new compression module is coded. A new routine should just have to interface with the existing compression code.
/********************** Start of MAIN.H ***********************/
Slide 45: #ifndef _MAIN_H #define _MAIN_H #ifdef _STDC_ void CompressFile( FILE *input, BIT_FILE *output, int argc, char *argv[] ); void ExpandFile( BIT_FILE *input, FILE *output, int argc, char *argv[] ); #else /* __STDC__ */ void CompressFile(); void ExpandFile(); #endif /* __STDC__ */ extern char *Usage; extern char *CompressionName; #endif /* _MAIN_H */ /************************* End of MAIN.H ************************/
In MAIN-C.C, a compression module supplies three things: a Usage string, which can print out a list of parameters, etc.; a CompressionName string, which lets the MAIN-C.C program print out the compression method; and a CompressFile() routine, which actually compresses the file. In this chapter, these routines are in a file called HUFF.C, which implements an order 0 model with a Huffman coder. MAIN-C.C is shown below.
/*********************** Start of MAIN-C.C **********************/ /* * This is the driver program used when testing compression algorithms. * In order to cut back on repetitive code, this version of main is * used with all of the compression routines. In order to turn it into * a real program, it needs to have another module that supplies one * routine and two strings, namely: * * void CompressFile( FILE *input, BIT_FILE *output, * int argc, char *argv ); * char *Usage; * char *CompressionName; * * The main() routine supplied here has the job of checking for valid * input and output files, opening them, and then calling the * compression routine. If the files are not present, or no arguments * are supplied, it prints out an error message, which includes the * Usage string supplied by the compression module. All of the * routines and strings needed by this routine are defined in the * main.h header file. * * After this is built into a compression program of any sort, the * program can be called like this: * * main-c infile outfile [ options ] *
Slide 46: */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "bitio.h" #include "errhand.h" #include "main.h" #ifdef ___STDC___ void usage_exit( char *prog_name ); void print_ratios( char *input, char *output ); long file_size( char *name ); #else void usage_exit(); void print_ratios(); long file_size(); #endif int main( argc, argv ) int argc; char *argv[]; { BIT_FILE *ouput; FILE *input; setbuf( stdout, NULL ); if ( argc < 3 ) usage_exit( argv[ 0 ] ); input = fopen(argv [ 1 ], "rb" ); if ( input == NULL ) fatal_error( "Error opening %s for input/n", argv[ 1 ] ); output = OpenOutputBitFile( argv[ 2 ] ); if ( output == NULL ) fatal error( "Error opening %s for input/n", argv[ 2 ] ); printf( "\nCompressing %s to %s\n", argv[ 1 ], argv[ 2 ] ); printf( "Using %s\n, CompressionName ); argc -= 3; argv += 3; CompressFile( input, output, argc, argv ); CloseOutputBitFile( output ); fclose( input ); print_ratios( argv[ 1 ], argv[ 2 ] ); return( 0 );
}
/* * This routine just wants to print out the usage message that is * called for when the program is run with no parameters. The first * part of the Usage statement is supposed to be just the program * name. argv[ 0 ] generally holds the fully qualified path name * of the program being run. I make a half-hearted attempt to strip * out that path info and file extension before printing it. It should * get the general idea across. */
Slide 47: void usage_exit( prog_name ) char *prog_name; { char *short_name; char *extension; short_name = strrchr( prog_name, '\\' ); if (short_name == NULL ) short_name = strrchr( prog_name, '/' ); if (short_name == NULL ) short_name = strrchr( prog_name, ':' ); if (short_name != NULL ) short_name++; else short_name = prog_name; extension = strrchr( short_name, '.' ); if ( extension != NULL ) *extension = '\0'; printf( "\nUsage: %s %s\n", short_name, Usage ); exit( 0 );
}
/* * This routine is used by main to get the size of a file after it has * been closed. It does all the work, and returns a long. The main * program gets the file size for the plain text, and the size of the * compressed file, and prints the ratio. */ #ifndef SEEK_END #define SEEK_END 2 #endif long file_size( name ) char *name; { long eof ftell; FILE *file; file = fopen( name, "r"); if ( file == NULL ) return( OL ); fseek( file, OL, SEEK_END ); eof_ftell = ftell( file ); fclose( file ); return( eof_ftell );
}
/* * This routine prints out the compression ratios after the input and * output files have been closed. */ void print_ratios( input, output ) char *input; char *output; { long input_size; long output_size;
Slide 48: int ratio; input_size = file_size( input ); if ( input_size == 0 ) input_size = 1; output_size = file_size * 100L / input_size ); ratio = 100 - (int) ( output_size * 100L / input_size ); printf( "\nInput bytes: %ld\n", input_size ); printf( "Output bytes: %ld/n", output_size ); if ( output_size == 0 ) output_size = 1; printf( "Compression ratio: %d%%\n", ratio ); /*********************** End of MAIN-C.C *************************/
MAIN-C.C
There are a few expectations about how MAIN-C.C will run. MAIN-C.C is called to compress an input file supplied on the command line and send the compressed output to another file, also supplied on the command line. Thus, the basic command-line invocation of MAIN-C.C is MAIN-C input-file output-file. If the user invokes MAINC.C without any arguments, a simple usage statement prints out. The usage statement includes the usage string supplied by the compression module. If two likely looking file names are on the command line, MAIN-C.C tries to open them both. The input file is opened as a standard file specified in STDIO.H, using fopen(). The output file is opened as a BIT_FILE, as defined in BITIO.H. If either file doesn’t open, an error message is printed out and the program exits. If both files open, the next step is to call the compression routine. MAIN-C.C expects the compression routine to be named CompressFile(). This routine is called with four arguments. The first two are pointers to the file structure for the input file and a pointer to the BIT_FILE structure for the output file. Finally, the updated values for argc and argv are passed to the compression routine. The values for argc and argv will have been adjusted to go past argv[0], which should be the program name, as well as argv[1] and argv[2], the names of the input and output files. The compression program can then scan the remaining arguments for any arguments specific to that particular compression routine. After the compression routine has finished, it returns to MAIN-C.C, which closes down the files and exits. MAIN-E.C is the converse program to MAIN-C.C. It takes two arguments as well, but this time the input file is the compressed file and the output file is destined to be the uncompressed clear text file. Just like MAIN-C.C, it checks to be sure there are at least two arguments, then tries to open the two files. If there aren’t two arguments, a usage message is printed. If either of the files fails to open, an error message is printed. MAINE.C is listed below.
/***********************Start of MAIN-E.C***********************/ * This driver program tests compression algorithms. To cut back on * repetitive code, this version of main is used with all the expansion
Slide 49: * routines. The main() routine supplied here checks for valid input and * output files, opens them, then calls the compression routine. * */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "bitio.h" #include "errhand.h" #include "main.h" #ifdef__STDC__ void usage_exit( char *prog_name ); #else void usage_exit(); #endif int main( argc, argv ) int argc; char *argv[]; { FILE *output; BIT_FILE *input; setbuf( stdout, NULL ); if ( argc < 3 ) usage_exit( argv[ 0 ] ); input = OpenInputBitFile( argv[ 1 ] ); if ( input == NULL ) fatal_error( "Error opening %s for input\n", argv[ 1 ] output = fopen( argv[ 2 ], "wb" ); if ( output == NULL ) fatal_error( "Error opening %s for output\n", argv[ 2 ] printf( "\nExpanding %s to %s for output\n", argv[ 2 ] ); printf( "Using %\n", CompressionName ); argc -= 3; argv += 3; ExpandFile( input, output, argc, argv ); CloseInputBitFile( input ); fclose( output ); putc( '\n', stdout ); return( 0 );
}
/* * This routine wants to print out the usage message called for when the * program is run with no parameters. The first part of the Usage state * ment is supposed to be just the programname. argv[ 0 ] generally holds * the fully qualified path name of the program being run. */ void usage_exit( prog_name ) char *prog_name; } char *short_name; char *extension;
Slide 50: } /********************** End of MAIN-E.C**************************/
short_name = strrchr( prog_name, '\\' ); if ( short_name = = NULL ) short_name = = strrchr( prog_name, '/' ); if ( short_name = = NULL ) short_name = strrchr( prog_name, ':' ); if ( short_name != NULL ) short_name++; else short_name = prog_name; extension = strrchr( short_name, '.' ); if ( extension != NULL ) *extension = '\0'; printf( "\nUsage: %s %s\n", short_name, Usage ); exit( 0 );
ERRHAND.C
One additional routine helps simplify the code. A production version of a program generally needs a somewhat sophisticated error-handling mechanism. In particular, it is important to let the end user know what is happening, clean up any files that may have been only partially processed, and restore any system settings that may have been changed. In this book, our interest in compression concentrates on testing for accuracy, speed, and compression ratios. Because of this, we have created a simple universal fatal-error handler. The error handler is defined in ERRHAND.H:
/********************** Start of ERRHAND.H **********************/ #ifndef _ERRHAND_H #define _ERRHAND_H #ifdef ___STDC___ void fatal_error( char *fmt, ... ); #else /* ___STDC___ */ void fatal_error(); #endif /* ___STDC___ */ #endif /* _ERRHAND_H */ /********************** End of ERRHAND.H *************************/
The fatal-error handler is called when an unrecoverable error occurs in the program. It has the same syntax as printf, which means it can be passed a format string and arguments to format and print out.
Slide 51: /************************ Start of ERRHAND.C ***********************/ #include #include #include #include <stdio.h> <stdlib.h> <stdarg.h> "errhand.h"
#ifdef __STDC__ void fatal_error( char *fmt, ... ) #else #ifdef __UNIX__ void fatal_error( fmt ) char *fmt; va_dcl #else void fatal_error( fmt ) #endif #endif { va_list argptr; va_start( argptr, fmt ); printf( "Fatal error: " ); vprintf( fmt, argptr ); va_end( argptr ); exit( -1 );
}
/************************ End of ERRHAND.C ***********************/
Into the Huffman Code
With the infrastructure code in place, all we need to do to create a program that demonstrates Huffman coding is to write two routines, CompressFile() and ExpandFile(), and a couple of strings that describe the name of the compression method and program usage. The code for this is found in HUFF.C. To build the Huffman decoding tree, we need to create a data structure that models the tree on the computer. In our previous examples, each node on the tree had several pieces of information: first, the weight associated with it; second, pointers to two child nodes, one associated with the 0 bit and one associated with the 1 bit. Finally, leaf nodes had the value of the symbol associated with the leaf. The data structure used in this program to model the Huffman tree was built around the node structure:
typedef struct tree_node { unsigned int count; unsigned int saved_count; int child_0; int child_1; } NODE;
Slide 52: The first thing to notice about this structure is that there is no information about the value of a leaf node. This is because the node structures are allocated as an array of 514 nodes. The lower nodes are all assigned to be leaf nodes, and the upper nodes become internal nodes. The information about the value of a leaf is encoded based on the position of the node in the array. Instead of having 256 symbols in our alphabet for this program, we actually have 257. Values 0 through 255 are reserved for the normal range of bytes that fit into a character. The remaining symbol value of 256 is reserved for the end-of-stream indicator. This is the last code written out to the stream, and it indicates that no more data will be arriving. Because of the bit-oriented nature of compressed data, it is not ordinarily a simple matter to determine when you have reached an end-of-file state. Handling it with a special code for end-of-stream is one method for getting around this. Another would be to encode the length of the file as a prefix to the compressed data. With 257 symbols to deal with, we know in advance the largest possible size of the Huffman tree. If all 257 symbols are in use, we will have 256 internal nodes, meaning that we have to allocate an array of 513 node structures. In the program, I actually allocate 514 and use the last one as a dummy value for comparisons when building the tree.
Counting the Symbols
To build the tree, I first calculate the relative frequencies of the symbols. In HUFF.C, I set up an array of 256 longs and count the occurrences of every character in the file, from the start to the end. The position of the file input pointer is saved when the count starts and is restored when it is done. All this takes place in function count_bytes(). Though I start with 32-bit unsigned long counts, I scale the counts back significantly in module scale_counts. Scale_counts() finds the maximum count for any symbol in the file, then develops a scaling factor to make that count and all the rest of the counts fit in a single unsigned character. These counts are then copied into the weight elements of the first 257 node elements. There are several good reasons for scaling back the counts. First, by limiting any symbol’s weight to an 8-bit unsigned character, I can confine all of the math I perform when building the tree to 16-bit unsigned integers. This helps the program run a little faster, and it cuts back on the amount of storage required for the node array. It also limits the maximum size of a Huffman code as well, ensuring that it will fit in a 16-bit unsigned integer.
Saving the Counts
For the expansion program to correctly expand the Huffman encoded bit stream it will be receiving, it needs a copy of the Huffman tree identical to the one used by the encoder.
Slide 53: This means that the tree, or its equivalent, must be passed as a header to the file so the expander can read it in before it starts to read Huffman codes. The easiest way for the expansion program to get this data would probably be to store the entire node array as a preamble to the compressed data. This would work well and would not be too hard for the compressor to do. An alternative method that occupies far less space in the compressed file, however, is to transmit the symbol counts to the expander. Since the Huffman tree is built up in an unambiguous manner from the symbol counts, it stands to reason that the expansion program doesn’t need more to do its job. And since the scaled count array will be only 256 bytes, compared to the Huffman tree’s 4K bytes, there is good reason to choose this. I elected to try to cut down on the amount of data to be passed even further. Under many circumstances, the number of counts that stay at zero is considerable. With ASCII text files, such as program listings, there will generally be only around 100 symbols in use out of the possible 256. It seems a waste to transmit all those zero counts when they aren’t necessary. To make this happen, I use a slightly more complicated format for the header. The header used in HUFF.C that contains the symbol counts consists of a series of “count run” definitions, followed by a 0 terminator. A count-run definition consists of the value of the first symbol in the run, followed by the value of the last symbol in the run, followed by the counts for all of the symbols in the run from first to last. This is repeated until each run has been stored in the output file. When there is no more data to store, a first value of zero is written out to the file. Note that a value of zero for the very first run is not treated as an end of data. For a typical ASCII file, the start of the compressed file might look something like Figure 3.4.
Slide 54: Figure 3.4 The start of a typical compressed ASCII file. This symbol count format takes a fair amount of work to generate, performed in output_counts() in HUFF.C. Reading in the symbols counts is much simpler, since the work has been done in advance. Reading the counts in from the compressed file during expansion is done in the input_counts() routine.
Building the Tree
Whether compressing or expanding, once the counts have been loaded, it is time to build the Huffman tree. In HUFF.C, this is done in a function called build_tree(). Because some care was taken when creating the data structure, the actual process of creating the tree is the simple matter of sitting in a loop and combining the two free nodes with the lowest weight into a new internal node with the combined weight of the nodes. Once only one free node is left, the tree is done, and the free node is the root of the tree. The logic of the build_tree() routine is fairly simple. When the routine is first entered, all nodes below 257 have a count value set to their frequency in the file. A nonzero value here means that this is an active node. build_tree() also sets up a special node used as a straw man for comparison purposes. Node 513, which will never be used, is set to have a count value of 65535, which no normal node can ever exceed. When searching for the two minimum nodes, I will start by setting the minimum node to 513, knowing that any valid active node will fall below its value. Finally, before the comparisons start, an index to the next free node’s initialized. The node array is in use from 0 to 256, so the next free node will be at 257.
Slide 55: After things have been set up, build_tree() goes into an infinite loop. On each pass through the loop, build_tree tries to find the two active nodes with the lowest weights. If only one node is found, the tree is complete and the loop is exited. If there are two good minimum values, a new node to the tree can be created. This new node is set up using the next_free node index. Its two child pointers are set to point to the two minimum nodes found before, and its weight is their sum. The two minimum nodes are now marked as being inactive by setting their weights to 0. Nodes with a weight of 0 are considered to be unused and will never again be selected to represent a minimum. One piece of inefficient code is deliberately left in build_tree(). There is an extra member in the node structure called saved_count. When a node is taken off the active list by having its count set to zero, the previous count is stored in saved_count. Later, if the user has selected the -d option in order to print out the model, the saved_count can be printed. This helps when debugging the program and when trying to understand how the tree works.
Using the Tree
During the expansion phase, it is easy to see how to use the Huffman tree. Starting at the root node, a single bit at a time is read in by the decoder. If the bit is a 0, the next node is the one pointed to by the child_0 index. If the bit is a 1, the next node is the one pointed to by the child_1 index. If the new node is 256 or less, we have reached a leaf of the tree and can output the corresponding symbol. If the symbol was the special end-of-stream symbol, we can exit instead of sending it out. This is what is done in the expand_node() function. It is just a few lines of code, and it decodes a compressed Huffman code file with relative ease. Compressing the same file is a bit harder. Essentially, we want to work down the tree, outputting a 1 or a 0 bit at each node, till we get to the appropriate leaf node. Unfortunately, the tree structure makes this impossible. When we start at the root node, we have no idea whether to take the 0 or the 1 branch to arrive at a particular symbol. One way to solve this problem when building the tree would be to add a parent member to the node structure. When combining the two minimum nodes to form a new internal node, each minimum node would have its parent structure set to point to the new node. With this new node, we could start at the leaf node and work our way up through the tree toward the root. The only problem with this procedure is that we would accumulate bits in reverse order as we went up the tree. We would have to rack them up till we reached the root node, then put them out in reverse order. Fortunately, there is a better way to do this. Rather than trying to use the tree to code our symbols when compressing a file, we could build a code table by recursively traversing the entire tree one time only. This creates a table of codes, one for each symbol, along with the length of each code. Once the table is built, the file can be encoded by simply outputting the appropriate code for every character in the input file.
Slide 56: The code to convert the tree data structures into a table of codes is very simple, thanks to a recursive algorithm. We start at the root node of the tree with a zero. Then we begin working down the individual branches of the tree, adding a one or a zero to the code each time we travel down a branch. Whenever we reach a leaf, we store the code values for that leaf in the code array and back up to the previous node, where we can start searching down the other side of the tree. The code to accomplish this is in function convert_tree_to_code(). This routine takes a fair amount of work to create the code table, but once it is done the actual file compression is very easy.
The Compression Code
The code for Huffman compression and decompression is shown in the listing below. This single file, HUFF.C, is about 500 lines long, of which probably 30 percent is comments. So we are able to implement a static dictionary Huffman compressor in only about 300 lines of code. The actual amount of code could easily be crunched down to a number much less than that. The small code and storage requirements make Huffman coding ideal for applications where both memory and CPU storage are at a premium.
/********************** Start of HUFF.C #include #include #include #include #include #include #include <stdio.h> <stdlib.h> <string.h> <ctype.h> "bitio.h" "errhand.h" "main.h" *************************/
/* * The NODE structure is a node in the Huffman decoding tree. It has a * count, which is its weight in the tree, and the node numbers of its * two children. The saved_count member of the structure is only * there for debugging purposes, and can be safely taken out at any * time. It just holds the intial count for each of the symbols, since * the count member is continually being modified as the tree grows. */ typedef struct tree_node { unsigned int count; unsigned int saved_count; int child_0; int child_1 } NODE; /* * A Huffman tree is set up for decoding, not encoding. When encoding, * I first walk through the tree and build up a table of codes for * each symbol. The codes are stored in this CODE structure. /* typedef struct code {
Slide 57: unsigned int code; int code_bits; } CODE; /* * The special EOS symbol is 256, the first available symbol after all * of the possible bytes. When decoding, reading this symbol * indicates that all of the data has been read in. */ #define END_OF_STREAM 256 /* * Local function prototypes, defined with or without ANSI prototypes. */ #ifdef __STDC__ void count_bytes( FILE *input, unsigned long *long_counts ); void scale_counts( unsigned long *long_counts, NODE *nodes ); int build_tree( NODE *nodes ); void convert_tree_to_code( NODE *nodes, CODE *codes, unsigned int code_so_far, int bits, int node ); void output_counts( BIT_FILE *output, NODE *nodes ); void input_counts( BIT_FILE *input, NODE *nodes ); void print_model( NODE *nodes, CODE *codes ); void compress_data( FILE *input, BIT_FILE *output, CODE *codes ); void expand_data( BIT_FIle *input, FILE *output, NODE *nodes, int root_node ); void print_char( int c ); #else /* __STDC__ */ void count_bytes(); void scale_counts(); int build_tree(); void convert_tree_to_code(); void output_counts(); void input_counts(); void print_model(); void compress_data(); void expand_data(); void print_char(); #endif /* __STDC__ */ /* * These two strings are used by MAIN-C.C and MAIN-E.C to print * messages of importance to the use of the program. */ char *CompressionName = "static order 0 model with Huffman coding"; char *Usage = "infile outfile [-d]\n\n\ Specifying -d will dump the modeling\ data\n"; /* * CompressFile is the compression routine called by MAIN-C.C. It
Slide 58: * looks for a single additional argument to be passed to it from * the command line: "-d". If a "-d" is present, it means the * user wants to see the model data dumped out for debugging * purposes. * * This routine works in a fairly straightforward manner. First, * it has to allocate storage for three different arrays of data. * Next, it counts all the bytes in the input file. The counts * are all stored in long int, so the next step is to scale them down * to single byte counts in the NODE array. After the counts are * scaled, the Huffman decoding tree is built on top of the NODE * array. Another routine walks through the tree to build a table * of codes, one per symbol. Finally, when the codes are all ready, * compressing the file is a simple matter. After the file is * compressed, the storage is freed up, and the routine returns. * */ void CompressFile( input, output, argc, argv ) FILE *input; BIT_FILE *output; int argc; char *argv[]; { unsigned long *counts; NODE *nodes; CODE *codes; int root_node; counts = ( unsigned long *) calloc( 256, sizeof( unsigned long ) ); if ( counts == NULL ) fatal_error( "Error allocating counts array\n" ); if ( ( nodes = (NODE *) calloc( 514, sizeof( NODE ) ) ) == NULL ) fatal_error( "Error allocating nodes array\n" ); if ( ( codes = (CODE *) calloc( 257, sizeof( CODE ) ) ) == NULL ) fatal_error( "Error allocating codes array\n" ); count_bytes( input, counts ); scale_counts( counts, nodes ); output_counts( output, nodes ); root_node = build_tree( nodes ); convert_tree_to_code( nodes, codes, 0, 0, root_node ); if ( argc > 0 && strcmp( argv[ 0 ], "-d" ) == 0 ) print_model( nodes, codes ); compress_data( input, output, codes ); free( (char *) counts ); free( (char *) nodes ); free( (char *) codes );
}
/* * ExpandFile is the routine called by MAIN-E.C to expand a file that * has been compressed with order 0 Huffman coding. This routine has * a simpler job than that of the Compression routine. All it has to * do is read in the counts that have been stored in the compressed
Slide 59: * file, then build the Huffman tree. The data can then be expanded * by reading in a bit at a time from the compressed file. Finally, * the node array is freed and the routine returns. * */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { NODE *nodes; int root_node; if ( ( nodes = (NODE *) calloc( 514, sizeof( NODE ) ) ) == NULL ) fatal_error( "Error allocating nodes array\n" ); input_counts( input, nodes ); root_node = build_tree( nodes ); if ( argc > 0 && strcmp( argv[ 0 ], "-d" ) == 0 ) print_model( nodes, 0 ); expand_data( input, output, nodes, root_node ); free( (char *) nodes );
}
/* * In order for the compressor to build the same model, I have to * store the symbol counts in the compressed file so the expander can * read them in. In order to save space, I don't save all 256 symbols * unconditionally. The format used to store counts looks like this: * * start, stop, counts, start, stop, counts, ... 0 * * This means that I store runs of counts, until all the non-zero * counts have been stored. At this time the list is terminated by * storing a start value of 0. Note that at least 1 run of counts has * to be stored, so even if the first start value is 0, I read it in. * It also means that even in an empty file that has no counts, I have * to pass at least one count, which will have a value of 0. * * In order to efficiently use this format, I have to identify runs of * non-zero counts. Because of the format used, I don't want to stop a * run because of just one or two zeros in the count stream. So I have * to sit in a loop looking for strings of three or more zero values * in a row. * * This is simple in concept, but it ends up being one of the most * complicated routines in the whole program. A routine that just * writes out 256 values without attempting to optimize would be much * simpler, but would hurt compression quite a bit on small files. * */ void output_counts ( output, nodes ) BIT_FILE *output; NODE *nodes; {
Slide 60: /* * Each time I hit the start of the loop, I assume that first is the * start of a run of non-zero values. The rest of the loop is * concerned with finding the value for last, which is the end of the * run, and the value of next, which is the start of the next run. * At the end of the loop, I assign next to first, so it starts in on * the next run. */ for ( ; first < 256 ; first = next) { last = first + 1; for ( ; ; ) { for ( ; last < 256 ; last ++ ) if ( nodes[ last ].count == 0 ) break; last--; for ( next = last + 1; next < 256 ; next++ ) if ( nodes[ next ]. count ! = 0 ) break; if ( next > 255 ) break; if ( ( next - last ) > 3 ) break; last = next;
int first; int last; int next; int i; first = 0; while ( first < 255 && nodes[ first ].count == 0 ) first++;
}; /* * Here is where I output first, last, and all the counts in between. */ if ( putc( first, output->file ) != first) fatal_error( "Error writing byte counts\n" ); if ( putc( last, output->file ) != last) fatal_error( "Error writing byte counts\n" ); for ( i = first ; i <= last ; i++ ) { if ( putc( nodes[ i ]. count, output->file ) != (int) nodes[ i ]. count) fatal_error( "Error writing byte counts\n" ); } } if ( putc( 0, output->file ) != 0 fatal_error( "Error writing byte counts\n" ); } /* * When expanding, I have to read in the same set of counts. This is * quite a bit easier that the process of writing them out, since no * decision making needs to be done. All I do is read in first, check * to see if I am all done, and if not, read in last and a string of * counts. */ void input_counts( input, nodes) BIT_FILE *input;
Slide 61: NODE *nodes; { int first; int last; int i; int c; for ( i = 0 ; i < 256 ; i++ ) nodes[ i ]. count = 0; if ( ( first = getc( input->file ) ) == EOF) fatal_error( "Error reading byte counts\n" ); if ( ( last = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n); for ( ; ; ) { for ( i = first ; i <= last ; i++ ) if ( ( c = getc( input->file ) ) == EOF) fatal_error( "Error reading byte counts\n" ); else nodes[ i ]. count = (unsigned int) c; if ( ( first = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); if ( first == 0) break; if ( ( last = getc( input->file ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); } nodes[ END_OF_STREAM ].count = 1;
}
/* * This routine counts the frequency of occurence of every byte in * the input file. It marks the place in the input stream where it * started, counts up all the bytes, then returns to the place where * it started. In most C implementations, the length of a file * cannot exceed an unsigned long, so this routine should always * work. */ #ifndef SEEK_SET #define SEEK_SET 0 #endif void count_bytes( input, counts) FILE *input; unsigned long *counts; { long input_marker; int c; input_marker = ftell( input ); while ( ( c = getc( input ) ) != EOF ) counts[ c ]++; fseek( input, input_marker, SEEK_SET );
}
/* * In order to limit the size of my Huffman codes to 16 bits, I scale * my counts down so they fit in an unsigned char, and then store them
Slide 62: * all as initial weights in my NODE array. The only thing to be * careful of is to make sure that a node with a non-zero count doesn't * get scaled down to 0. Nodes with values of 0 don't get codes. */ void scale_counts( counts, nodes ) unsigned long *counts; NODE *nodes; { unsigned long max_count; int i; max_count = 0; for ( i = 0 ; i < 256 ; i++ ) if ( counts[ i ] > max_count ) max_count = counts[ i ] ; if ( max_count == 0 ) { counts[ 0 ] = 1; max_count = 1; } max_count = max_count / 255; max_count = max_count + 1; for ( i = 0 ; i < 256 ; i++ ) { nodes[ i ].count = (unsigned int) ( counts[ i ] / max_count ); if ( nodes[ i ].count == 0 && counts[ i ] !=0 ); nodes[ i ].count = 1; } nodes[ END_OF_STREAM ]. count = 1;
} /* * Building the Huffman tree is fairly simple. All of the active nodes * are scanned in order to locate the two nodes with the minimum * weights. These two weights are added together and assigned to a new * node. The new node makes the two minimum nodes into its 0 child * and 1 child. The two minimum nodes are then marked as inactive. * This process repeats until there is only one node left, which is * the root node. The tree is done, and the root node is passed back * to the calling routine. * * Node 513 is used here to arbitratily provide a node with a guaran * teed maximum value. It starts off being min_1 and min_2. After all * active nodes have been scanned, I can tell if there is only one * active node left by checking to see if min_1 is still 513. */ int build_tree( nodes ) NODE *nodes; { int next_free; int i; int min_1; int min_2; nodes[ 513 ].count = Oxffff; for ( next_free = END_OF_STREAM + 1 ; ; next_free++ ) { min_1 = 513; min_2 = 513; for ( i = 0 ; i < next_free; i++ )
Slide 63: } /* * Since the Huffman tree is built as a decoding tree, there is * no simple way to get the encoding values for each symbol out of * it. This routine recursively walks through the tree, adding the * child bits to each code until it gets to a leaf. When it gets * to a leaf, it stores the code value in the CODE element, and * returns. */ void convert_tree_to_code( nodes, codes, code_so_far, bits, node ) NODE *nodes; CODE *codes; unsigned int code_so_far; int bits; int node; { if ( node <= END_OF_STREAM ) { codes[ node ].code = code_so_far; codes[ node ].code_bits = bits; return; } code_so_far <<= 1; bits++; convert_tree_to_code( nodes, codes, code_so_far, bits, nodes[ node ]. child_0 ); convert_tree_to_code( nodes, codes, code_so_far | 1, bits, nodes[ node ].child_1 );
if ( nodes[ i ].count != 0) { if ( nodes[ i ].count < nodes[ min_1 ].count ) { min_2 + min_1; min_1 = i; } else if ( nodes[ i ].count < nodes[ min_2 ].count) min_2 = i; } if ( min_2 == 513 ) break; nodes[ next_free ].count = nodes[ min_1 ].count + nodes[ min_2 ].count; nodes[ min_1 ].saved_count = nodes[ min_1 ].count; nodes[ min_1 ].count = 0; nodes[ min_2 ].saved_count = nodes[ min_2 ].count; nodes[ min_2 ].count = 0; nodes[ next_free ].child_0 = min_1; nodes[ next_free ].child_1 = min_2; } next_free--; nodes[ next_free ].saved_count = nodes[ next_free ].count; return( next_free );
} /* * If the -d command line option is specified, this routine is called * to print out some of the model information after the tree is built. * Note that this is the only place that the saved_count NODE element * is used for anything at all, and in this case it is just for * diagnostic information. By the time I get here, and the tree has
Slide 64: * been built, every active element will have 0 in its count. */ void print_mode1( nodes, codes ) NODE *nodes; CODE *codes; { int i; for ( i = 0 ; i < 513 ; i++ ) { if ( nodes[ i ].saved_count != 0 ) { printf( "node=" ); print_char( i ); printf( " count=%3d", nodes[ i ].saved_count ); printf( " child_0=" ); print_char( nodes[ i ]. child_0 ); printf( " child_1=" ); print_char( nodes[ i ].child_1 ); if ( codes && i <= END_OF_STREAM ) { printf( " Huffman code=" ); FilePrintBinary( stdout, codes[ i ].code, codes[ i ].code_bits ); } printf( "\n" ); } }
}
/* * The print_model routine uses this function to print out node num * bers. The catch is if it is a printable character, it gets printed * out as a character. This makes the debug output a little easier to * read. */ void print_char( c ) int c; { if ( c >= 0x20 && c < 127 ) printf( "`%c'", c ); else printf( "%3d", c ); } /* * Once the tree gets built, and the CODE table is built, compressing * the data is a breeze. Each byte is read in, and its corresponding * Huffman code is sent out. */ void compress_data( input, output, codes ) FILE *input; BIT_FILE *output; CODE *codes; { int c; while ( ( c = getc( input ) ) != EOF ) OutputBits( output, (unsigned long) codes[ c ].code, codes[ c ].code_bits );
Slide 65: } /*
OutputBits( output, (unsigned long) codes[ END_OF_STREAM ].code, codes[ END_OF_STREAM ].code_bits );
* Expanding compressed data is a little harder than the compression * phase. As each new symbol is decoded, the tree is traversed, * starting at the root node, reading a bit in, and taking either the * child_0 or child_1 path. Eventually, the tree winds down to a * leaf node, and the corresponding symbol is output. If the symbol * is the END_OF_STREAM symbol, it doesn't get written out, and * instead the whole process terminates. */ void expand_data( input, output, nodes, root_node ) BIT_FILE *input; FILE *output; NODE *nodes; int root_node; { int node; for ( ; ; ) { node = root_node; do { if ( InputBit( input ) ) node = nodes[ node ].child_1; else node = nodes[ node ].child_0; } while ( node . END_OF_STREAM ); if ( node == END_OF_STREAM ) break; if ( ( putc( node, output ) ) != node ) fatal_error( "Error trying to write byte to output" ); }
} /*****************************End of HUFF.C***************************/
Putting It All Together
The actual commands to build the compression and expansion programs will differ depending on which compiler and operating system you are using. Assuming you name the compression program HUFF-C and the expansion program HUFF-E, here are the command lines to compile the programs with various compilers:
Microsoft C: ERRHAND.C ERRHAND.C Borland C++: ERRHAND.C ERRHAND.C UNIX pcc: cl /W3 /Za /FeHUFF-C MAIN-C.C HUFF.C BITIO.C cl /W3 /Za /FeHUFF-E MAIN-E.C HUFF.C BITIO.C bcc -Ax -w -eHUFF-C MAIN-C.C HUFF.C BITIO.C bcc -Ax -w -eHUFF-E MAIN-E.C HUFF.C BITIO.C
cc -ohuff-c main-c.c huff.c bitio.c errhand.c cc -ohuff-e main-e.c huff.c bitio.c errhand.c
Slide 66: Remember that ANSI-compatible C compilers must have their extensions turned off on the command line to enable the __STDC__ macro. The __STDC__macro is necessary to turn on the ANSI prototypes. If you don’t want to continually have to add this unfamiliar command-line switch when you compile, simply strip out the “#ifdef __STDC__” line and always pull in the ANSI C prototypes. The only reason for doing this is to have code that will compile cleanly on K&R compilers. If you aren’t using a K&R compiler, keeping in the K&R prototypes is of dubious value. The module ERRHAND.C needs the __UNIX__ definition in order to use old-style variable arguments. Fully compliant ANSI C compilers may not have to turn this option on. If you are going to only be using your source code on your UNIX system, it would probably be simpler to put a “#define__UNIX__” in your ERRHAND.C file.
Performance
Order 0 Huffman coding is not going to take any prizes for compression ratios. But it does fairly well in terms of program size, memory requirements, and processing speed. To see how HUFF.C does overall, see the scorecards in Appendix A.
Chapter 4 A Significant Improvement: Adaptive Huffman Coding
In Chapter 3, we saw how Huffman coding could perform effective data compression by reducing the amount of redundancy in the coding of symbols. Huffman coding does not in itself tell how to reduce the information content of each symbol by developing an accurate model. But any model that can calculate the probability of a symbol with any accuracy should be able to use Huffman coding to compress data. The examples in Chapter 3 all used order 0 models, which are essentially context free. This means that the probability of a given character is calculated without taking into account the characters that preceded it in a message. The programs used in Chapter 3 just analyzed the entire input file and created a table of probabilities for each symbol. As long
Slide 67: as these probabilities deviated from a purely uniform distribution, we were able to compress the data. A minor drawback to Huffman coding programs is the requirement that they transmit a copy of the probability table with the compressed data. The expansion program would have no way of correctly decoding the data without the probability table. The table requires at most the addition of an extra 250 or so bytes to the output table, and consequently it usually doesn’t make much difference in the compression ratio. Even small files won’t be greatly affected, since the probability table should also be small for these files. The problem with this “minor drawback” is that as we attempt to improve the compression ability of our program, the penalty becomes more and more significant. If we move from order-0 to order-1 modeling, for example, we now have to transmit 257 probability tables instead of just one. So by using a technique that enables us to predict characters more accurately, we incur a penalty in terms of added overhead. Unless the files we are going to compress are very large, this added penalty will frequently wipe out any improvements made by increasing the order.
Adaptive Coding
This seems to lead to an impasse. To compress better, we need to accumulate more statistics. When we get more statistics, we achieve better compression but we wipe out any gains by having to send more modeling data. Fortunately, there is a way out of this dilemma. Adaptive coding lets us use higher-order modeling without paying any penalty for added statistics. It does this by adjusting the Huffman tree on the fly, based on data previously seen and having no knowledge about future statistics. Adaptive coding is not something that can just be used with Huffman coding. In principle, almost any form of coding can be converted to use an adaptive method. The high-level C program required to do adaptive compression is shown below.
initialize_model(); do { c = getc( input ); encode( c, output ); update_model( c ); } while ( c != EOF );
The decompressor works in a nearly identical fashion, as shown here:
initialize_model(); while ( ( c = decode( input ) ) ! = EOF ) { putc( c, output ); update_model( c ); }
Slide 68: Adaptive coding works since two of the routines used in these two algorithms are identical: initialize_model() and update_model(). If these routines differed even slightly between the compression and decompression programs, the whole system would fall apart. This sort of coding is fairly simple. The compressor and decompressor start off with identical models to encode and decode. So when the compressor puts out its very first encoded symbol, the decompressor will be able to interpret it. After the compressor emits the first symbol, it proceeds to the update_model() function. This is where the adaptive nature of the program begins. The update model takes into account the character that has just been seen and updates the frequency and encoding data used to encode that character. In a Huffman tree, it means incrementing the count for the particular symbol, then updating the Huffman coding tree.
Updating the Huffman Tree
The algorithm for constructing a Huffman coding tree is fairly simple, but it is not something we would want to do after every character is encoded. It would be relatively simple to implement adaptive Huffman coding with the following update function:
update_model( int c ) { counts[ c ]++; construct_tree( counts ); }
Unfortunately, what we would end up with would probably be the world’s slowest datacompression program. Building the tree takes too much work to reasonably expect to do it after every character. Fortunately, there is a way to take an existing Huffman coding tree and modify it to account for a new character. All it takes is a slightly different approach to building the tree in the first place. This approach introduces a concept known as the sibling property. A Huffman tree is simply a binary tree that has a weight assigned to every node, whether an internal node or a leaf node. Each node (except for the root) has a sibling, the other node that shares the same parent. The tree exhibits the sibling property if the nodes can be listed in order of increasing weight and if every node appears adjacent to its sibling in the list. A binary tree is a Huffman tree if and only if it obeys the sibling property. Figure 4.1 shows a Huffman tree that illustrates how this works. In this tree, the nodes have been assigned numbers, with the numbers assigned from left to right starting at the lowest row of nodes and working up. This tree was created using a conventional Huffman algorithm given the weights A=1, B=2, C=2, D=2, and E=10.
Slide 69: Figure 4.1 A Huffman tree. In Figure 4.1, the A, B, C, and D nodes at the bottom of the tree are numbered in increasing order starting at 1. Nodes 5 and 6 are the first two internal nodes, with weights of 3 and 4. The node numbers work their way up to node 9, the root. This arrangement shows that this tree obeys the sibling property. The nodes have been numbered in order of increasing weight, and each node is adjacent to its sibling in the list. The sibling property is important in adaptive Huffman coding since it helps show what we need to do to a Huffman tree when it is time to update the counts. Maintaining the sibling property during the update assures that we have a Huffman tree before and after the counts are adjusted. Updating the tree consists of two basic types of operations. The first, incrementing the count, is easy to follow conceptually. To increment the count for symbol ‘c,’ start at the leaf node for the symbol and increment the count for the leaf node. Then move up to the parent node. Since the weight of the parent node is the sum of the weight of its children, incrementing its weight by one will adjust it to its correct value. This process continues all the way up the tree till we reach the root node. Figure 4.2 shows how the increment operation affects the tree. Starting at the leaf, the increment works its way up the tree till it reaches the parent node. Implementing this portion of the code is relatively simple. Be sure that each node has a parent pointer and that an index points to the leaf node for each symbol. This can be done using conventional data structures at a low cost. The average number of increment operations required will correspond to the average number of bits needed to encode a symbol.
Slide 70: Figure 4.2 The increment process. The second operation required in the update procedure arises when the node increment causes a violation of the sibling property. This occurs when the node being incremented has the same weight as the next highest node in the list. If the increment were to proceed as normal, we would no longer have a Huffman tree. When we have an increment that violates the sibling property, we need to move the affected node to a higher point in the list. This means that the node is detached from its present position in the tree and swapped with a node farther up the list. Figure 4.3 shows the same Huffman tree from Figure 4.2 after the A node has been incremented again, then switched with the D node. How was the D node selected as the one to be switched? To minimize the amount of work during the shuffle, we want to swap just two nodes. If the newly incremented node has a weight of W + 1, the next higher node will have a weight of W. There may be more nodes after the next higher one that have a value of W as well. The swap procedure moves up the node list till it finds the last node with a weight of W. That node is swapped with the node with weight W + 1. The new node list will then have a string of 1 or more weight W nodes, followed by the newly incremented node with weight W + 1.
Slide 71: Figure 4.3 After a node switch (only the A node has been incremented). In Figure 4.3, the A node was incremented from a weight of 2 to 3. Since the next node in the list, the B node, had a weight of 2, the tree no longer obeyed the sibling property. This meant it was time to swap. We worked our way up the list of nodes till we found the last node with a weight of 2, the D node. The A and D nodes were then swapped, yielding a correctly ordered tree. After the swap is completed, the update can continue. The next node to be incremented will be the new parent of the incremented node. In Figure 4.3, this would be internal node #6. As each node is incremented, a check is performed for correct ordering. A swap is performed if necessary.
What Swapping Does
The swap shown in Figure 4.3 doesn’t have a noticeable effect on the coding of the symbols. The A and D nodes were swapped, but the length of their codes did not change. They were both three bits long before the swap and three bits long after. Figure 4.4 shows what happens to the three after the A symbol has been incremented two more times. After the second increment, the A node has increased enough to swap positions with an internal node on a higher level of the tree. This reshapes the tree, impacting the length of the codes. When A had a count of two like three other symbols, it was encoded using three bits. Now, when its count has increased to five, it is encoded using only 2 bits. Symbols C is still encoded using 3 bits, but B and D have slipped down to 4 bits.
Slide 72: Figure 4.4 After another node switch.
The Algorithm
In summary, the algorithm for incrementing the count of a node goes something like what’s shown below:
for ( ; ; ) { increment nodes[ node ].count; if ( node == ROOT ) break; if ( nodes[ node ].count > nodes[ node + 1 ].count ) swap_nodes(); node = nodes[ node ].parent; }
The swap_nodes() routine has to move up through the list of nodes until it finds the right node to swap with. It then performs the swap. This routine looks something like that shown below:
swap_node = node + 1; while ( nodes[ swap_node + 1 ].count < nodes[ node ].count ) swap_node++; temp = nodes[ swap_node ].parent;
Slide 73: nodes[ swap_node ].parent = nodes[ node ].parent; nodes[ node ].parent = temp;
An Enhancement
One way to make coding more efficient is to make sure your coder doesn’t waste coding space for symbols not used in the message. With the standard Huffman coding in the previous chapter, this was easy. Since we made a pass over the data to collect statistics before building the tree, we knew in advance which symbols weren’t used. So when we built the Huffman tree we didn’t have to include symbols with a count of 0. With an adaptive process, we don’t know in advance which symbols will show up in the message. The simplest way to handle this problem is to initialize the Huffman tree to have all 256 possible bytes (for conventional 8-bit data messages) predefined with a count of 1. When the encoding first starts, each message will have a length of eight bits. As statistics accumulate, frequently seen characters will start to use fewer and fewer bits. This method of encoding works, but in many cases it wastes coding capacity. Particularly in shorter messages, the extra unused codes tend to blunt the effect of compression by skewing the statistics of the message. A better way to handle this aspect of coding is to start the encoding process with an empty table and add symbols only as they are seen in the incoming message. But this presents us with a seeming contradiction. The first time a symbol appears, it can’t be encoded since it doesn’t appear in the table. So how do we get around this problem?
The Escape Code
The answer to this puzzle is the escape code. The escape code is a special symbol sent out of the encoder to signify that we are going to `escape’ from the current context. The decoder know that the next symbol will be encoded in a different context. We can use this mechanism to encode symbols that don’t appear in the currently defined Huffman tree. In the example program in this chapter, the escape code signifies that the next symbol to be encoded will be sent as a plain 8-bit character. The symbol is added to the table, and regular encoding resumes. The C code to implement the encoder for this algorithm looks something like this:
encode( char c ) { if ( in_tree( c ) ) transmit_huffman_code( c, out_file ); else { transmit_huffman_code( ESCAPE, out_file ); putc( c, out_file ); add_code_to_tree( c ); }
Slide 74: }
update_tree( c );
This example shows that the escape code is transmitted like any other symbol from the Huffman tree, so it has to appear in the Huffman tree to be properly transmitted. When the encoder first starts up, it needs to be initialized with the escape code already present. In the implementation used in the example code for this chapter, the Huffman tree is actually initialized with two values: the escape code and the end of file code. Since both will appear in the file, we start off with them in a very small Huffman tree:
Figure 4.5 A Huffman tree initialized with two values. As the encoding process goes on, the table fills up and the tree fleshes out. The end of file code will always have a weight of one, and in this implementation, so will the escape code. As the tree grows, these two codes will always be stuck down at the remotest branches of the tree and have the longest codes.
The Overflow Problem
As the compression program progresses, the counts in the table increase. At some point, the counts become large enough to cause trouble for the program. There are two possible areas of concern. The first occurs when the weight of the root exceeds the capacity of the counters in the tree. For most of the programs used here, that will be 16 bits. Another possible problem can occur even sooner. It happens when the length of the longest possible Huffman code exceeds the range of the integer used to transmit it. The decoding process doesn’t care how long a code is, since it works its way down through the tree a bit at a time. The transmitter has a different problem though. It has to start at the leaf node of the tree and work up towards the root. It accumulates bits to be transmitted in reverse order, so it has to stack them up. This is conventionally done in an integer variable, so this means that when a Huffman code exceeds the size of that integer, there is a problem. The maximum length of a Huffman code is related to the maximum count via a Fibonacci sequence. A Fibonacci function is defined as follows:
int fib( int n ) { if ( n <= 1 )
Slide 75: }
return( 1 ); else return( fib( n - 1 ) + fib( n -2 ) );
The sequence of Fibonacci numbers looks something like this: 1, 1, 2, 3, 5, 8, 13, 21, 34, etc. These numbers show up in the worst-case, most lopsided Huffman tree:
Figure 4.6 A lopsided Huffman tree produced through a sequence of Fibonacci numbers. From this we can deduce that if the weight at the root node of a Huffman tree equals fib(i), then the longest code for that tree is i - 1. This means that if the integers used with our Huffman codes are only 16 bits long, a root value of 4181 could potentially introduce an overflow. (This low value is frequently overlooked in simple Huffman implementations. Setting up a file with Fibonacci counts up to fib[18] is a good way to test a Huffman program). When we update the tree, we ought to check for a maximum value. Once we reach that value, we need to rescale all the counts, typically dividing them by a fixed factor, often two. One problem with dividing all the counts by two is that it can possibly reshape the tree. Since we are dealing with integers, dividing by two truncates the fractional part of the result, which can lead to imbalances. Consider the Huffman tree shown in Figure 4.7.
Slide 76: Figure 4.7 A Huffman tree created for four symbols. This is a tree created for four symbols: A, B, C, and D, with weights of 3, 3, 6, and 6. The nodes of the tree are numbered in this diagram, and the diagram clearly shows that the tree is a Huffman tree, since it obeys the sibling property. The problem with this tree occurs if we try a rescaling operation. The simple version of the rescaling algorithm would go through the tree, dividing every leaf node weight by two, then rebuilding upwards from the leaf nodes. The resulting tree would look like what follows.
Figure 4.8 The rescaling problem after the nodes are divided by two. The problem with the resulting tree is that it is no longer a Huffman tree. Because of the vagaries of truncation that follow integer division, we need to end up with a tree that has a slightly different shape:
Slide 77: Figure 4.9 What the tree should look like after integer division. The properly organized Huffman tree has a drastically different shape from what it had before being rescaled. This happens because rescaling loses some of the precision in our statistical base, introducing errors sometimes reflected in the rescaled tree. As time goes on, we accumulate more statistics, and the effect of the errors gradually fades away. Unfortunately, there is no simple way to compensate for the necessary reshaping the tree after rescaling. The sample code in this chapter merely does a brute-force rebuilding after rescaling. Rebuilding the entire tree is a relatively expensive operation, but since the rescaling operation is rare, we can live with the cost.
A Rescaling Bonus
An interesting side effect comes out of rescaling our tree at periodic intervals. Though we lose accuracy by scaling our counts, testing reveals that rescaling generally results in better compression ratios than if rescaling is postponed. This occurs because data streams frequently have a “decaying recency” effect, or the statistics for recently seen symbols are generally more valid than those accumulated farther back in the data stream. To put it simply, current symbols are more like recent symbols than older symbols. The rescaling operation tends to discount the effect of older symbols, while increasing the importance of recent symbols. Though difficult to quantify, this seems to have a good effect on compression. Experimenting with various rescaling points will yield different results at differing values, but it doesn’t seem possible to pin down an optimal strategy. There may be an optimal value for rescaling, but it moves around with different types of data streams.
Slide 78: The Code
The sample code for this chapter is a simple order-0 adaptive Huffman compression routine. It is linked with the standard I/O and user interface routines from the previous chapter to create a standalone compression program and a decompression program. The key to understanding how this sample code operates lies in understanding the data structures in the program. The data structure that describes the tree is shown next.
struct tree { int leaf[ SYMBOL_COUNT ]; int next_free_node; struct node { unsigned int weight; int parent; int child_is_leaf; int child; } nodes[ NODE_TABLE_COUNT ]; } Tree;
Two arrays describe the tree. The tree itself is entirely represented by the nodes[] array. This array is a set of structures with the following elements: unsigned int weight: int parent: This weight element is the weight of individual node, just as it has been described previously in this chapter. This int is the index of the parent node. The parent node information is necessary both when encoding a symbol, and when updating the model. The child of a given node can either be a leaf or a pair of nodes. This flag is used to indicate which it is. If the child is a leaf, this int holds the value of the symbol encoded at the leaf. If the child is a pair of nodes, this value is the index to the first node. Because of the sibling property, the two nodes will always be adjacent to one another, so we know the first node will be child, and the second node will be child+1.
int child_is_leaf: int child:
As described earlier in the chapter, every node in the tree is kept in a number list. When discussing the list before, we had the nodes with the lowest weight starting at 1 and working up to higher numbers until reaching the root. The implementation in this program is backwards from that, though the same principles apply. The list of nodes is the nodes[] array, with the highest number on the list appearing at nodes[0]. As we work our way down through the lower weights, we go to higher indices in the nodes list. When the tree is first initialized, nodes[0] is the root node, nodes [1] is set to the end-ofstream symbol, and nodes[2] is set to the escape symbol. The next_free_node element in
Slide 79: the tree is then set to 3, and the next time a character is added to the tree, it will be placed in nodes[3]. The leaf[] array in the tree data structure is used to find the leaf node for a particular symbol. To encode a symbol, start at the leaf node and work up to the root node of the tree, accumulating bits on the way (in reverse order). Without a leaf[] array to keep track of the leaf nodes, we would have to do a search through the entire tree every time we wanted to encode a character.
Initialization of the Array
Regardless of whether we perform compression or expansion, we initialize the Huffman tree using the same routine. When performing adaptive compression, it is extremely important to use an identical algorithm for both initialization and updating of the compression model. In this case, we use the same code to ensure that it happens. The initialization routine, InitializeTree(tree), is the first thing called by both the compression and expansion code. It uses the following code to initialize nodes 0, 1, and 2:
tree->nodes[ tree->nodes[ tree->nodes[ tree->nodes[ ROOT_NODE].child ROOT_NODE].child_is_leaf ROOT_NODE].weight ROOT_NODE].parent = = = = = = = = = = = = = = ROOT_NODE + 1; FALSE; 2; -1; END_OF_STREAM; TRUE; 1; ROOT_NODE; ROOT_NODE + 1; ESCAPE; TRUE; 1; ROOT_NODE; ROOT_NODE + 2;
tree->nodes[ ROOT_NODE + 1 ].child tree->nodes[ ROOT_NODE + 1 ].child_is_leaf tree->nodes[ ROOT_NODE + 1 ].weight tree->nodes[ ROOT_NODE + 1 ].parent tree->leaf[ END_OF_STREAM ] tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->leaf[ ESCAPE ] tree->next_free_node for ( i = 0 ; i < END_OF_STREAM ; i++ ) tree->leaf[ i ] = -1; + + + + 2 2 2 2 ].child ].child_is_leaf ].weight ].parent
= ROOT_NODE + 3;
The initialization of the tree->nodes[] elements is fairly direct. We assign the escape and end-of-stream nodes a weight of 1, which gives the root a weight of 2. The escape and end-of stream elements in the tree->leaf[] array are initialized to point to the appropriate nodes, and the parent and child pointers for each of the three nodes are initialized. The final details required during initialization are to set up the tree->next_free_node index and to initialize the remaining elements of the tree->leaf[] array. Since none of the leaf[] elements for our conventional symbols have been initialized, they are all set to
Slide 80: values of -1. During the encoding process, we will compare the tree->leaf[] value for a given symbol to -1 to see if it has already been defined.
The Compress Main Program
The code for the compression program is short:
InitializeTree( &Tree ); while ( ( c = getc( input ) ) != EOF ) { EncodeSymbol( &Tree, c, output ); UpdateModel( &Tree, c ); } EncodeSymbol( &Tree, END_OF_STREAM, output );
Once the tree has been initialized, the program sits in a loop encoding characters and updating the model. When there are no more characters left to encode, it encodes the endof-stream symbol, and it is done. Complexities are hidden in these functions. The EncodeSymbol function needs to see if the symbol is already defined. If it isn’t, EncodeSymbol needs to output the escape code and the unencoded symbol. EncodeSymbol then needs to add the symbol to the tree, with a count of 0. The UpdateModel function also hides some complexity. It performs the update discussed previously in the chapter, which is fairly complex. Before doing the update, it checks to see if the root node has reached the maximum allowable weight. If it has, the tree is scaled by a factor of two and rebuilt.
The Expand Main Program
Like the compress main program, the expand program is short and to the point. After initializing the tree, it reads in symbols via the DecodeSymbol routine, then writes them to the output file. After each symbol is decoded, it is written to the output file, and the model is updated. As in the compress program, a certain amount of complexity is concealed in the higherlevel functions. The DecodeSymbol routine has to see if the symbol it decodes is an escape code. If it is, DecodeSymbol throws away the escape code and reads in an “unencoded” 8-bit symbol. The symbol is then added to the Huffman tree, with an initial count of 0. As previously seen, the UpdateModel() routine has to see if the root node has reached the maximum allowable count. If it has, the Huffman tree is rebuilt. After that, the normal increment/test/swap routine ensues.
InitializeTree( &Tree ); while ( ( c = DecodeSymbol( &Tree, input ) ) !=END_OF_STREAM ) { if ( putc( c, output ) == EOF )
Slide 81: }
fatal_error( "Error writing character" ); UpdateModel( &Tree, c );
Encoding the Symbol
After initializing the tree, the compress routine repeatedly calls the EncodeSymbol routine. The EncodeSymbol routine (shown below) first identifies the leaf node for the symbol to be encoded. If the leaf table returns a-1, it means that this symbol is not presently found in the Huffman tree. In that case, the symbol to be encoded is switched to the escape code, and its root node is located. The encoding process for a Huffman tree works by starting at the leaf node and moving up through the parent nodes one at a time, until the root is reached. In a conventional Huffman tree, there will usually be two child nodes for each parent, one that encodes a 0 bit and another that encodes a 1 bit. The data structures used in this program take advantage of the sibling property by always grouping the two children of a parent node next to one another in the node list. Grouping children together saves space by requiring only a single child pointer instead of two. It also means that a child automatically knows whether it is the child that encodes a 1 or 0 by whether it is odd or even. In this program, the odd child is arbitrarily designated the 1, and the even is always the 0. Given this information, the encoding process is accomplished without too many lines of C code. Starting at the leaf node, each bit is added to the cumulative Huffman code. Whether the bit is a 1 or a 0 determines whether the bit is odd or even. As each bit is encoded, the current_bit mask is shifted by one so the next bit encoded will appear in the next most significant position. A counter is also incremented that keeps track of how many bits have been accumulated in the output word so far.
code = 0; current_bit = 1; code_size = 0; current_node = tree->leaf[ c ]; if ( current_node == -1 ) current_node = tree->leaf[ ESCAPE ]; while ( current_node != ROOT_NODE ) { if ( ( current_node & 1 ) == 0 ) code | = current_bit; current_bit <<= 1; code_size++; current_node = tree->nodes[ current_node ].parent; }; OutputBits( output, code, code_size ); if ( tree->leaf[ c ] == -1 ) { OutputBits( output, c, 8 ); add_new_node( tree, c ); }
Slide 82: After the bits of the Huffman code have been accumulated in the code word, the utility routine OutputBits (found in BITIO.C) is called to send out the code. There is still one piece of work left to do, however, before returning. If the code that was just output was the escape code, we have to handle the special condition created when a previously unreferenced symbol is encountered. The first step taken after the escape code is sent is simple. The new symbol is output in an unencoded fashion, just as it was read in from the file. This lets the decoder know what symbol to add to the table. The second step is to add the new node to the Huffman tree. When the new node is added to the tree, it is inserted with a weight of 0. The 0weight node will be incremented later when the model is updated, so it will not be 0 for long. The advantage to adding a node with a weight of zero is that it can be done without having to worry about updating any other nodes or, worse, changing the shape of the tree. Using the sibling property definitions, we know that if the new node has a weight of 0, it will be the very first node in the list, since nodes are ranked in order of weight. We add the node to the table by finding the presently lowest-weight node and break it out into two new nodes. The old lowest-weight node will be one of the two new nodes, and the new 0-weight node will be the other one. Figure 4.10 illustrates how this process modifies a working tree. The Huffman tree has already had the A, B, C, and D nodes defined with nonzero weights. The A node has the lowest count and consequently is at the start of the list (remember that the list in this program has the highest weights at 0). When symbol E is going to be added to the tree, we first identify A as the node at the end of the list. The position A used to occupy is converted to an internal node, and two new nodes are created. Since E is guaranteed to have the lowest weight in the tree, it is set to be the first node in the list, and A is set to be the second node.
Figure 4.10 The Huffman tree before and after addition of a zero weight node.
Slide 83: The code needed to add the node is listed below. The first step is to find the lowestweight node, which in the example was A. The new_node variable is the node which A will occupy, and the 0_weight_node is where E will go. Since these are two new nodes, they are set to be the next_free_node and next_free_node+1. After this is done, next_free_node is incremented by two so it will be ready for the next operation.
lightest_node = tree->next_free_node - 1; new_node = tree->next_free_node; zero_weight_node = tree->next_free_node + 1; tree->next_free_node += 2; tree->nodes[ new_node ] = tree->nodes[ lightest_node ]; tree->nodes[ new_node ].parent = lightest_node; tree->leaf[ tree->nodes[ new_node ].child] = new_node; tree->nodes[ lightest_node ].child = new_node; tree->nodes[ lightest_node ].child_is_leaf = FALSE; tree->nodes[ zero_weight_node ].child tree->nodes[ zero_weight_node ].child_is_leaf tree->nodes[ zero_weight_node ].weight tree->nodes[ zero_weight_node ].parent tree->leaf[ c ] = zero_weight_node; = = = = c; TRUE; 0; lightest_node;
After the two new nodes are created, a few more bookkeeping details are needed to link the two new children to their parent, to make sure their weights are correct, and to make sure the leaf array points to the correct nodes. After that, the routine is done, and the tree is ready to be updated.
Updating the Tree
The most complicated part of the Adaptive Huffman program is in the routines called to update the tree. Updating the model is simply a matter of incrementing a symbol weight by one and taking care of all the side effects of that action. Taking care of the side effects, however, involves some strenuous effort.
if ( tree->nodes[ ROOT_NODE].weight == MAX_WEIGHT ) RebuildTree( tree ); current_node = tree->leaf[ c ]; while ( current_node != -1 ) { tree->nodes[ current_node ].weight++; for ( new_node = current_node ; new_node > ROOT_NODE ; new_node—– ) if ( tree->nodes[ new_node - 1 ].weight >= tree->nodes[ current_node ].weight ) break; if ( current_node != new_node ) { swap_nodes( tree, current_node, new_node ); current_node = new_node; } current_node = tree->nodes[ current_node ].parent; }
Slide 84: This code performs all the work needed to update the tree. The first thing it checks for is to see if the tree has reached its maximum weight. If it has, the routine invokes the RebuildTree routine to scale down all the counts. After getting past the tree-rebuilding step, the loop that increments the node weights is entered. The first node to be incremented is the leaf node associated with the symbol. The loop increments the weight of that symbol, then moves up to the parent to increment that node. This process continues till the root is reached, at which point the update is done. The single symbol added to statistical base forming the tree has been accounted for. There is one additional step inside the loop, however, that takes place after every node has its weight incremented. The loop immediately following the increment operation works its way back through the list of nodes to see if the increased weight of the current node means it has to move up in the list. After the loop exits, new_node has the proper new location for the current node in the node list. If new_node is the same as current_node, the incremented node is fine where it is, and we can move on to the parent node. But if new_node and current_node aren’t the same, they have to be swapped. The process of swapping nodes involves lifting two entire subtrees out of their present positions in the list and exchanging them. The use of a tree data structure makes this easier than it may first appear. The swapping process is straightforward, complicated only by the fact that each node being swapped has links to both its parent and child, and the parent and child each have links to the node. This takes no great conceptual leaps. The swapping code is illustrated:
struct node temp; if ( tree->nodes[ i ].child_is_leaf ) tree->leaf[ tree->nodes [ i ].child ] = j; else { tree->nodes[ tree->nodes [ i ].child ].parent = j; tree->nodes[ tree->nodes [ i ].child + 1 ].parent = j; } if ( tree->nodes[ j ].child_is_leaf ) tree->leaf[ tree->nodes [ j ].child ] = i; else { tree->nodes[ tree->nodes[ j ].child ].parent = i; tree->nodes[ tree->nodes[ j ].child + 1 ].parent = i; } temp = tree->nodes[ i ]; tree->nodes[ i ] = tree->nodes[ j ]; tree->nodes[ i ].parent = temp.parent; temp.parent = tree->nodes[ j ].parent; tree->nodes[ j ] = temp;
An update can also force the rebuilding of the tree. Rebuilding takes a fair amount of work, unfortunately, since it amounts to building a new Huffman tree.
Slide 85: The rebuilding process proceeds in three discrete steps. The first step is to collect all the leaf nodes, throw away all the internal nodes, and divide the leaf-node weights by 2. The node list is compacted so that the new leaf nodes are all at the start of the list.
j = tree->next_free_node - 1; for ( i = j ; i >= ROOT_NODE ; i–– ) { if ( tree->nodes[ i ].child_is_leaf ) { tree->nodes[ j ] = tree->nodes[ i ]; tree->nodes[ j ].weight = (tree->nodes[ j ].weight + 1 ) / 2; j––; } }
The code to do this is shown above. Note that in this implementation, none of the nodes scale down to zero. This is accomplished by adding 1 to the node before dividing it by two. It may be desirable to throw away infrequently seen symbols by reducing their counts to zero and deleting them from the list, but we don’t do that here. What we end up with in the above code is a list of leaf nodes that are at the start of the list, terminating at the next_free_node index. The internal nodes which start at 0 and end at the current value of j, will now be rebuilt in the next step. In Chapter 3, we built a Huffman tree by repeatedly scanning the node list and finding the two nodes with the lowest weight. Those two nodes would be combined to form a new internal node. The tree-rebuilding phase here takes a different approach based on the sibling property. The loop that creates internal nodes starts with j pointing to the next node that needs to be defined; i is an index that points to the next pair of nodes to be combined into an internal node. The loop progressively steps through the nodes, combining and adding new internal nodes until reaching the root node at location 0. The process of creating the new internal node is simple. The new node, located at index j, is assigned a weight. The weight is simply the sum of the two nodes at location i, as would be expected. The hard work comes next. After node j is created, we have to decide where it belongs in the node list. The decision on where the new node belongs is made based on the weights of the nodes in the list. Recall that the sibling property says that the nodes have to be listed based on increasing weight. We have to step through the list till we find the first node that is less that the new node j, then place the new node immediately adjacent to that node in the list. Here is the procedure for the correct location for j:
for ( i = tree->next_free_node - 2 ; j >= ROOT_NODE ; i -= 2, j–– ) { k = i + 1; tree->nodes[ j ].weight = tree->nodes[ i ].weight + tree->nodes[ k ].weight; weight = tree->nodes[ j ].weight; tree->nodes[ j ].child_is_leaf = FALSE;
Slide 86: }
for ( k = j + 1 ; weight < tree->nodes[ k ].weight ; k++) ; k––; memmove( &tree->nodes[ j ]. &tree->nodes[ j + 1 ], ( k - j ) * sizeof( struct node ) ); tree->nodes[ k ].weight = weight; tree->nodes[ k ].child = i; tree->nodes[ k ].child_is_leaf = FALSE;
The variable ‘weight’ is assigned the weight of the new internal node. The routine then loops back through the list until it finds the first node with a weight less than or equal to the weight of j. Node j will be placed immediately after that node in the list. Before that node can be positioned, we need to make room in the list by moving all the nodes that have higher weights up by one position. This is done with the memmove() function. After that, the new node has its child and weight assigned, and the loop continues. This process can be seen in the short list of nodes about to be rebuilt in Figure 4.11. After having been rescaled, symbols A, B, C, and D have weights of 1, 3, 5, and 7 respectively. After the internal nodes have been stripped out, the list of nodes looks like Figure 4.11.
Figure 4.11 The list of nodes after the internal nodes have been stripped. The first two nodes to be combined will be B and A, creating a new node, j, at location 2 in the table. By stepping back through the list from location 2, we see that the resulting internal node belongs between locations 4 and 5, right before B. After the memory move function is executed and the node is connected, the partial Huffman tree looks like this:
Figure 4.12 A partial Huffman tree after the memory move function is executed. Because we are moving nodes around so freely, we do not assign parent pointers during this process. Once a node has been assigned as a child to another node, it is not going to change position in the list. But parent nodes that have yet to be combined together as children of another internal node may be moved farther up in the list as other nodes are combined. If we assigned parent nodes earlier in the process, every time we moved a
Slide 87: node we would have to go through a backtracking procedure to locate its children and update their parent indices. We bypass this costly procedure by deferring parent node building to the third and final step in the Rebuild procedure. Assigning parent nodes is fairly simple. We start at the root node, and locate the children of each node in the tree. The children then have their parent node index set. If the child is a leaf instead of another node, the leaf[] array node index is set.
for ( i = tree->next_free_node - 1 ; i >= ROOT_NODE ; i-- ) { if ( tree->nodes[ i ].child_is_leaf ) { k = tree->nodes[ i ].child; tree->leaf[ k ] = i; } else { k = tree->nodes[ i ].child; tree->nodes[ k ].parent = tree->nodes[ k + 1 ].parent = i; } }
Decoding the Symbol
The final high-level procedure to be discussed here is the routine used to decode an incoming symbol. Like the symbol encoder, this routine is short and simple. It starts at the root node of the tree and reads in a single bit at a time. As each bit is read in, one of the two children of the node is selected based on whether the input bit is a one or a zero. Eventually, this leads the routine to a leaf node. When the leaf node is reached, we have decoded a symbol. The only possible complication at this point is if the decoded symbol is an escape code. If so, the symbol encoded by the encoder did not yet appear in the Huffman tree. This means that the next eight bits in the stream will contain an unencoded version of the symbol. If this is the case, the input routine is called to get the plain-text version of the symbol.
current_node = ROOT_NODE; while ( !tree->nodes[ current_node ].child_is_leaf ) { current_node = tree->nodes[ current_node ].child; current_node += InputBit( input ); } c = tree->nodes[ current_node ].child; if ( c == ESCAPE ) { c = (int) InputBits( input, 8 ); add_new_node( tree, c ); } return( c );
Either the decoded symbol or the escaped plain version of the symbol is passed back to the calling program where it can be placed in the output file. The hard work will come after the symbol is decoded, when the tree has to be updated to reflect the newly arrived symbol.
Slide 88: The Code
The code for the Adaptive Huffman compressor is listed next. This single module is contained on the source disk in the file AHUFF.C. Building the compression routine requires that you compile this file and link it with the utility routines discussed in the last chapter: BITIO.C, ERRHAND.C, and MAIN-C.C. To build the decompression routine you substitute MAIN-E.C. The two arguments for both programs are simply an input file followed by an output file.
/*************************** Start of AHUFF.C ************************/ #include #include #include #include #include #include <stdio.h> <stdlib.h> <string.h> <ctype.h> "bitio.h" "errhand.h"
char *CompressionName = "Adaptive Huffman coding, with escape codes"; char *Usage = "infile outfile"; #define #define #define #define #define #define #define #define END_OF_STREAM ESCAPE SYMBOL_COUNT NODE_TABLE_COUNT ROOT_NODE MAX_WEIGHT TRUE FALSE 256 257 258 ( ( SYMBOL_COUNT * 2 ) - 1 ) 0 0X8000 1 0
*/ * This data structure is all that is needed to maintain an adaptive * Huffman tree for both encoding and decoding. The leaf array is a * set of indices into the nodes that indicate which node is the * parent of a symbol. For example, to encode 'A', we would find the * leaf node by way of leaf[ 'A' ]. The next_free_node index is used * to tell which node is the next one in the array that can be used. * Since nodes are allocated when characters are read in for the first * time, this pointer keeps track of where we are in the node array. * Finally, the array of nodes is the actual Huffman tree. The child * index is either an index pointing to a pair of children, or an * actual symbol value, depending on whether 'child_is_leaf' is true * or false. */ typedef struct tree { int leaf[ SYMBOL_COUNT ]; int next_free_node; struct node { unsigned int weight; int parent; int child_is_leaf; int child; } nodes [ NODE_TABLE_COUNT ]; } TREE;
Slide 89: /* * The Tree used in this program is a global structure. Under other * circumstances it could just as well be a dynamically allocated * structure built when needed, since all routines here take a TREE * pointer as an argument. */ TREE Tree; /* * Function prototypes for both ANSI C compilers and their K&R breth* ren. */ #ifdef __STDC__ void CompressFile( FILE *input, BIT_FILE *output, int argc, char *argv[] ); void ExpandFile( BIT_FILE *input, FILE *output, int argc, char *argv[] ); void InitializeTree( TREE *tree ); void EncodeSymbol( TREE *tree, unsigned int c, BIT_FILE *output ); int DecodeSymbol( TREE *tree, BIT_FILE *input ); void UpdateModel( TREE *tree, int c ); void RebuildTree( TREE *tree ); void swap_nodes( TREE *tree, int i, int j ); void add_new_node( TREE *tree, int c ); void PrintTree( TREE *tree ); void print_codes( TREE *tree ); void print_code( TREE *tree, int c ); void calculate_rows( TREE *tree, int node, int level ); int calculate_columns( TREE *tree, int node, int starting_guess ); int find_minimum_column( TREE *tree, int node, int max_row ); void rescale_columns( int factor ); #else void CompressFile(); void ExpandFile(); void InitializeTree(); void EncodeSymbol(); int DecodeSymbol(); void UpdateModel(); void RebuildTree(); void swap_nodes(); void add_new_node(); #endif /* * The high level view of the compression routine is very simple. * First, we initialize the Huffman tree, with just the ESCAPE and * END_OF_STREAM symbols. Then, we sit in a loop, encoding symbols, * and adding them to the model. When there are no more characters * to send, the special END_OF_STREAM symbol is encoded. The decoder * will later be able to use this symbol to know when to quit. */
Slide 90: void CompressFile( input, output, argc, argv ) FILE *input; BIT_FILE *output; int argc; char *argv[]; { int c; InitializeTree( &Tree ); while ( ( c = getc( input ) ) != EOF ) { EncodeSymbol( &Tree, c, output ); UpdateModel( &Tree, c ); } EncodeSymbol( &Tree, END_OF_STREAM, output ); while ( argc— > 0 ) { printf( "Unused argument: %s\n", *argv ); argv++; } }
/* * The Expansion routine looks very much like the compression routine. * It first initializes the Huffman tree, using the same routine as * the compressor did. It then sits in a loop, decoding characters and * updating the model until it reads in an END_OF_STREAM symbol. At * that point, it is time to quit. */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { int c; while ( argc–– > 0 ) printf( "Unused argument: %s\n", *argv++ ); InitializeTree( &Tree ); while ( ( c = DecodeSymbol( &Tree, input ) ) != END OF STREAM ) { if ( putc( c, output ) == EOF ) fatal_error( "Error writing character" ); UpdateModel( & Tree, c ); } } /* * When performing adaptive compression, the Huffman tree starts out * very nearly empty. The only two symbols present initially are the * ESCAPE symbol and the END_OF_STREAM symbol. The ESCAPE symbol has to * be included so we can tell the expansion program that we are * transmitting a previously unseen symbol. The END_OF_STREAM symbol * is here because it is greater than eight bits, and our ESCAPE * sequence only allows for eight bit symbols following the ESCAPE * code. * * In addition to setting up the root node and its two children, this
Slide 91: * routine also initializes the leaf array. The ESCAPE and * END_OF_STREAM leaves are the only ones initially defined, the rest * of the leaf elements are set to -1 to show that they aren't present * in the Huffman tree yet. */ void InitializeTree( tree ) TREE *tree; { int i; tree->nodes[ tree->nodes[ tree->nodes[ tree->nodes[ ROOT_NODE ROOT_NODE ROOT_NODE ROOT_NODE ].child ].child_is_leaf ].weight ].parent = = = = ROOT_NODE + 1; FALSE; 2; -1; = = = = = = = = = = END_OF_STREAM; TRUE; 1; ROOT_NODE; = ROOT_NODE + 1; ESCAPE; TRUE; 1; ROOT_NODE; ROOT_NODE + 2; ROOT_NODE + 3;
tree->nodes[ ROOT_NODE + 1 ].child tree->nodes[ ROOT_NODE + 1 ].child_is_leaf tree->nodes[ ROOT_NODE + 1 ].weight tree->nodes[ ROOT_NODE + 1 ].parent tree->leaf[ END_OF_STREAM ] tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->nodes[ ROOT_NODE tree->leaf[ ESCAPE ] tree->next_free_node + + + + 2 2 2 2 ].child ].child_is_leaf ].weight ].parent
for ( i = 0 ; i < END_OF_STREAM ; i++ ) tree->leaf[ i ] = -1; } /* * This routine is responsible for taking a symbol, and converting * it into the sequence of bits dictated by the Huffman tree. The * only complication is that we are working our way up from the leaf * to the root, and hence are getting the bits in reverse order. This * means we have to rack up the bits in an integer and then send them * out after they are all accumulated. In this version of the program, * we keep our codes in a long integer, so the maximum count is set * to an arbitrary limit of 0x8000. It could be set as high as 65535 * if desired. */ void EncodeSymbol( tree, c, output ) TREE *tree; unsigned int c; BIT_FILE *output; { unsigned long code; unsigned long current_bit; int code_size; int current_node; code = 0; current_bit = 1; code_size = 0;
Slide 92: }
current node = tree->leaf[ c ]; if ( current node == -1 ) current_node = tree->leaf[ ESCAPE ]; while ( current_node != ROOT_NODE ) { if ( ( current_node & 1 ) == 0 ) code | = current_bit; current_bit <<= 1; code_size++; current_node = tree->nodes[ current_node ].parent; }; OutputBits( output, code, code_size ); if ( tree->leaf[ c ] == -1 ) { OutputBits( output, (unsigned long) c, 8 ); add_new_node( tree, c ); }
/* * Decoding symbols is easy. We start at the root node, then go down * the tree until we reach a leaf. At each node, we decide which * child to take based on the next input bit. After getting to the * leaf, we check to see if we read in the ESCAPE code. If we did, * it means that the next symbol is going to come through in the next * eight bits, unencoded. If that is the case, we read it in here, * and add the new symbol to the table. */ int DecodeSymbol( tree, input ) TREE *tree; BIT_FILE *input; { int current_node; int c; current_node = ROOT_NODE; while ( !tree->nodes[ current_node ].child_is_leaf ) { current_node = tree->nodes[ current_node ].child; current_node += InputBit( input ); } c = tree->nodes[ current_node ].child; if ( c == ESCAPE ) { c = (int) InputBits( input, 8 ); add_new_node( tree, c ); } return( c );
}
/* * UpdateModel is called to increment the count for a given symbol. * After incrementing the symbol, this code has to work its way up * through the parent nodes, incrementing each one of them. That is * the easy part. The hard part is that after incrementing each * parent node, we have to check to see if it is now out of the proper * order. If it is, it has to be moved up the tree into its proper * place. */ void UpdateModel( tree, c )
Slide 93: TREE *tree; int c; { int current_node; int new_node; if ( tree->nodes[ ROOT_NODE].weight == MAX_WEIGHT ) RebuildTree( tree ); current_node = tree->leaf[ c ]; while ( current_node != -1 ) { tree->nodes[ current_node ].weight++; for ( new_node = current_node ; new_node > ROOT_NODE ; new_node–– ) if ( tree->nodes[ new_node - 1 ].weight >= tree->nodes[ current_node ].weight ) break; if ( current_node != new_node ) { swap_nodes( tree, current_node, new_node ); current_node = new_node; } current_node = tree->nodes[ current_node ].parent; }
}
/* * Rebuilding the tree takes place when the counts have gone too * high. From a simple point of view, rebuilding the tree just means * that we divide every count by two. Unfortunately, due to truncation * effects, this means that the tree's shape might change. Some nodes * might move up due to cumulative increases, while others may move * down. */ void RebuildTree( tree ) TREE *tree; { int i; int j; int k; unsigned int weight; /* * To start rebuilding the table, I collect all the leaves of the * Huffman tree and put them in the end of the tree. While I am doing * that, I scale the counts down by a factor of 2. */ printf( "R" ); j = tree->next_free_node - 1; for ( i = j ; i >= ROOT_NODE ; i–– ) { if ( tree->nodes[ i ].child_is_leaf ) { tree->nodes[ j ] = tree->nodes[ i ]; tree->nodes[ j ].weight = ( tree->nodes[ j ].weight + 1 ) / 2; j––; } } /* * At this point, j points to the first free node. I now have all the
Slide 94: * leaves defined, and need to start building the higher nodes on the * tree. I will start adding the new internal nodes at j. Every time * I add a new internal node to the top of the tree, I have to check * to see where it really belongs in the tree. It might stay at the * top, but there is a good chance I might have to move it back down. * If it does have to go down, I use the memmove() function to scoot * everyone bigger up by one node. */ for ( i = tree->next_free_node - 2 ; j >= ROOT_NODE; i -= 2, j–– ) { k = i + 1; tree->nodes[ j ].weight = tree->nodes[ i ].weight + tree->nodes[ k ].weight; weight = tree->nodes[ j ].weight; tree->nodes[ j ].child_is_leaf = FALSE; for ( k = j + 1 ; weight < tree->nodes[ k ].weight ; k++ ) ; k––; memmove( &tree->nodes[ j ], &tree->nodes[ j + 1 ], ( k - j ) * sizeof( struct node ) ); tree->nodes[ k ].weight = weight; tree->nodes[ k ].child = i; tree->nodes[ k ].child_is_leaf = FALSE; }
/* * The final step in tree reconstruction is to go through and set up * all of the leaf and parent members. This can be safely done now * that every node is in its final position in the tree. */ for ( i = tree->next_free_node - 1 ; i >= ROOT_NODE ; i–– ) { if ( tree->nodes[ i ].child_is_leaf ) { k = tree->nodes[ i ].child; tree->leaf[ k ] = i; } else { k = tree->nodes[ i ].child; tree->nodes[ k ].parent = tree->nodes[ k + 1 ].parent = i; } } } /* * Swapping nodes takes place when a node has grown too big for its * spot in the tree. When swapping nodes i and j, we rearrange the * tree by exchanging the children under i with the children under j. void swap_nodes( tree, i, j ) TREE *tree; int i; int j; { struct node temp; if ( tree->nodes [ i ].child_is_leaf ) tree->leaf[ tree->nodes[ i ].child ] = j; else { tree->nodes[ tree->nodes[ i ].child ].parent = j;
Slide 95: } /* * Adding a new node to the tree is pretty simple. It is just a matter * of splitting the lightest-weight node in the tree, which is the * highest valued node. We split it off into two new nodes, one of * which is the one being added to the tree. We assign the new node a * weight of 0, so the tree doesn't have to be adjusted. It will be * updated later when the normal update process occurs. Note that this * code assumes that the lightest node has a leaf as a child. If this * is not the case, the tree would be broken. */ void add_new_node( tree, c ) TREE *tree; int c; { int lightest_node; int new_node; int zero_weight_node; lightest_node = tree->next_free_node - 1; new_node = tree->next_free_node; zero_weight_node = tree->next_free_node + 1; tree->next_free_node += 2; tree->nodes[ new_node ] = tree->nodes[ lightest_node ]; tree->nodes[ new_node ].parent = lightest_node; tree->leaf[ tree->nodes[ new_node ].child ] = new_node; tree->nodes[ lightest_node ].child tree->nodes[ lightest_node ].child_is_leaf tree->nodes[ tree->nodes[ tree->nodes[ tree->nodes[ zero_weight_node zero_weight_node zero_weight_node zero_weight_node ].child ].child_is_leaf ].weight ].parent = new_node; = FALSE; = = = = c; TRUE; 0; lightest_node;
tree->nodes[ tree->nodes[ i ].child + 1 ].parent = j; } if ( tree->nodes[ j ].child_is_leaf ) tree->leaf[ tree->nodes[ j ].child ] = i; else { tree->nodes[ tree->nodes[ j ].child ].parent = i; tree->nodes[ tree->nodes[ j ].child + 1 ].parent = i; } temp = tree->nodes[ i ]; tree->nodes[ i ] = tree->nodes[ j ]; tree->nodes[ i ].parent = temp.parent; temp.parent = tree->nodes[ j ].parent; tree->nodes[ j ] = temp;
/************************* End of AHUFF.C ****************************/
Slide 96: Chapter 5 Huffman One Better: Arithmetic Coding
The last two chapters show that Huffman coding uses knowledge about information content to efficiently encode symbols. If the probability of a symbol’s appearance in a message is known, Huffman techniques can encode that symbol using a minimal number of bits. Huffman coding has been proven the best fixed-length coding method available.
Difficulties
Huffman codes have to be an integral number of bits long, and this can sometimes be a problem. If the probability of a character is 1/3, for example, the optimum number of bits to code that character is around 1.6 bits. Huffman coding has to assign either one or two bits to the code, and either choice leads to a longer compressed message than is theoretically possible. This non optimal coding becomes a noticeable problem when the probability of a character is very high. If a statistical method could assign a 90 percent probability to a given character, the optimal code size would be 0.15 bits. The Huffman coding system would probably assign a 1-bit code to the symbol, which is six times longer than necessary. This would be a problem when compressing two-color images, like those coming from a fax machine. Since there are only two colors, an ordinary coding method would assign the 1 bit to one color and the 0 bit to the other. Since both codes have only a single bit, Huffman coding is not going to be able to compress this data at all. No matter how high the probability of one of the bits, we are still going to have to encode it using one bit. The conventional solution to this problem is to group the bits into packets and apply Huffman coding. But this weakness prevents Huffman coding from being a universal compressor.
Arithmetic Coding: A Step Forward
Only in the last fifteen years has a respectable candidate to replace Huffman coding been successfully demonstrated: arithmetic coding. Arithmetic coding bypasses the idea of replacing an input symbol with a specific code. It replaces a stream of input symbols with a single floating-point output number. More bits are needed in the output number for longer, complex messages. This concept has been known for some time, but only recently were practical methods found to implement arithmetic coding on computers with fixedsized registers. The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to create the exact stream of symbols that went into its construction. To construct the output number, the symbols
Slide 97: are assigned a set probabilities. The message “BILL GATES,” for example, would have a probability distribution like this:
Character SPACE A B E G I L S T
Probability 1/10 1/10 1/10 1/10 1/10 1/10 2/10 1/10 1/10
Once character probabilities are known, individual symbols need to be assigned a range along a “probability line,” nominally 0 to 1. It doesn’t matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The nine-character symbol set used here would look like the following:
Character SPACE A B E G I L S T
Probability 1/10 1/10 1/10 1/10 1/10 1/10 2/10 1/10 1/10
Range 0.00 [gte] r > 0.10 0.10 [gte] r > 0.20 0.20 [gte] r > 0.30 0.30 [gte] r > 0.40 0.40 [gte] r > 0.50 0.50 [gte] r > 0.60 0.60 [gte] r > 0.80 0.80 [gte] r > 0.90 0.90 [gte] r > 1.00
Each character is assigned the portion of the 0 to 1 range that corresponds to its probability of appearance. Note that the character “owns” everything up to, but not including, the higher number. So the letter T in fact has the range .90 to .9999…
Slide 98: The most significant portion of an arithmetic-coded message belongs to the first symbols—or B, in the message “BILL GATES.” To decode the first character properly, the final coded message has to be a number greater than or equal to .20 and less than .30. To encode this number, track the range it could fall in. After the first character is encoded, the low end for this range is .20 and the high end is .30. During the rest of the encoding process, each new symbol will further restrict the possible range of the output number. The next character to be encoded, the letter I, owns the range .50 to .60 in the new subrange of .2 to .3. So the new encoded number will fall somewhere in the 50th to 60th percentile of the currently established range. Applying this logic will further restrict our number to .25 to .26. The algorithm to accomplish this for a message of any length is shown here:
low = 0.0; high = 1.0; while ( ( c = getc( input ) ) != EOF ) { range = high - low; high = low + range * high_range( c ); low = low + range * low_range( c ); } output ( low );
Following this process to its natural conclusion with our message results in the following table:
New Character
Low value 0.0 0.2 0.25 0.256 0.2572 0.25720 0.257216 0.2572164 0.25721676 0.257216772 0.2572167752
High Value 1.0 0.3 0.26 0.258 0.2576 0.25724 0.257220 0.2572168 0.2572168 0.257216776 0.2572167756
B I L L SPACE G A T E S
So the final low value, .2572167752, will uniquely encode the message “BILL GATES” using our present coding scheme.
Slide 99: Given this encoding scheme, it is relatively easy to see how the decoding process operates. Find the first symbol in the message by seeing which symbol owns the space our encoded message falls in. Since .2572167752 falls between .2 and .3, the first character must be B. Then remove B from the encoded number. Since we know the low and high ranges of B, remove their effects by reversing the process that put them in. First, subtract the low value of B, giving .0572167752. Then divide by the width of the range of B, or .1. This gives a value of .572167752. Then calculate where that lands, which is in the range of the next letter, I. The algorithm for decoding the incoming number is shown next:
number = input_code(); for ( ; ; ) { symbol = find_symbol_straddling_this_range( number ); putc( symbol ); range = high_range( symbol ) - low_range( symbol ); number = number - low_range( symbol ); number = number / range; }
I have conveniently ignored the problem of how to decide when there are no more symbols left to decode. This can be handled either by encoding a special end-of-file symbol or by carrying the stream length with the encoded message. In the example in this chapter, as elsewhere in the book, I carry along a special end-of-stream symbol that tells the decoder when it is out of symbols. The decoding algorithm for the “BILL GATES” message will proceed as shown:
Encoded Number 0.2572167752 0.572167752 0.72167752 0.6083876 0.041938 0.41938 0.1938 0.938 0.38 0.8 0.0
Output Symbol B I L L SPACE G A T E S
Low 0.2 0.5 0.6 0.6 0.0 0.4 0.2 0.9 0.3 0.8
High 0.3 0.6 0.8 0.8 .1 0.5 0.3 1.0 0.4 0.9
Range 0.1 0.1 0.2 0.2 0.1 0.1 0.1 0.1 0.1 0.1
In summary, the encoding process is simply one of narrowing the range of possible numbers with every new symbol. The new range is proportional to the predefined
Slide 100: probability attached to that symbol. Decoding is the inverse procedure, in which the range is expanded in proportion to the probability of each symbol as it is extracted.
Practical Matters
Encoding and decoding a stream of symbols using arithmetic coding is not too complicated. But at first glance it seems completely impractical. Most computers support floating-point numbers of around 80 bits. So do you have to start over every time you encode ten or fifteen symbols? Do you need a floating-point processor? Can machines with different floating-point formats communicate through arithmetic coding? As it turns out, arithmetic coding is best accomplished using standard 16-bit and 32-bit integer math. Floating-point math is neither required nor helpful. What is required is an incremental transmission scheme in which fixed-size integer state variables receive new bits at the low end and shift them out at the high end, forming a single number that can be as long as necessary, conceivably millions or billions of bits. Earlier, we saw that the algorithm works by keeping track of a high and low number that brackets the range of the possible output number. When the algorithm first starts, the low number is set to 0 and the high number is set to 1. The first simplification made to work with integer math is to change the 1 to .999 …, or .111… in binary. Mathematicians agree that .111… binary is exactly the same as 1 binary, and we take their assurance at face value. It simplifies encoding and decoding. To store these numbers in integer registers, first justify them so the implied decimal point is on the left side of the word. Then load as much of the initial high and low values as will fit into the word size we are working with. My implementation uses 16-bit unsigned math, so the initial value of high is 0xFFFF, and low is 0. We know that the high value continues with Fs forever, and the low continues with zeros forever; so we can shift those extra bits in with impunity when needed. If you imagine our “BILL GATES” example in a five-decimal digit register (I use decimal digits in this example for clarity), the decimal equivalent of our setup would look like what follows on the next page.
HIGH: 99999 implied digits => 999999999... LOW: 00000 implied digits => 000000000...
To find the new range of numbers, apply the encoding algorithm from the previous section. First, calculate the range between the low and high values. The difference between the two registers will be 100000, not 99999. This is because we assume the high register has an infinite number of 9s added to it, so we need to increment the calculated difference. We then compute the new high value using the formula from the previous section:
high = low + high_range(symbol)
Slide 101: In this case, the high range was .30, which gives a new value for high of 30000. Before storing the new value of high, we need to decrement it, once again because of the implied digits appended to the integer value. So the new value of high is 29999. The calculation of low follows the same procedure, with a resulting new value of 20000. So now high and low look like this:
high: 29999 (999...) low: 20000 (000...)
At this point, the most significant digits of high and low match. Due to the nature of our algorithm, high and low can continue to grow closer together without quite ever matching. So once they match in the most significant digit, that digit will never change. We can now output that digit as the first digit of our encoded number. This is done by shifting both high and low left by one digit and shifting in a 9 in the least significant digit of high. The equivalent operations are performed in binary in the C implementation of this algorithm. As this process continues, high and low continually grow closer together, shifting digits out into the coded word. The process for our “BILL GATES” message is shown next.
High Initial state Encode B (0.2—0.3) Shift out 2 Encode I (0.5—0.6) Shift out 5 Encode L (0.6—0.8) Encode L (0.6—0.8) Shift out 7 Encode SPACE (0.0—0.1) Shift out 2 Encode G (0.4—0.5) Shift out 1 Encode A (0.1—0.2) Shift out 6 Encode T (0.9—1.0) Shift out 7 Encode E (0.3—0.4) Shift out 7 Encode S (0.8—0.9) 99999 29999 99999 59999 99999 79999 75999 59999 23999 39999 19999 99999 67999 79999 79999 99999 75999 59999 55999
Low 00000 20000 00000 50000 00000 60000 72000 20000 20000 00000 16000 60000 64000 40000 76000 60000 72000 20000 52000
Range 100000 10000 100000 20000 40000 40000 40000 40000 40000 40000
Cumulative Output
.2 .2 .25 .25 .25 .257 .257 .2572 .2572 .25721 .25721 .257216 .257216 .2572167 .2572167 .25721677 .25721677
Slide 102: Shift out 5 Shift out 2 Shift out 0
59999
20000
.257216775 .2572167752 .25721677520
After all the letters are accounted for, two extra digits need to be shifted out of either the high or low value to finish the output word. This is so the decoder can properly track the input data. Part of the information about the data stream is still in the high and low registers, and we need to shift that information to the file for the decoder to use later.
A Complication
This scheme works well for incrementally encoding a message. Enough accuracy is retained during the double-precision integer calculations to ensure that the message is accurately encoded. But there is potential for a loss of precision under certain circumstances. If the encoded word has a string of 0s or 9s in it, the high and low values will slowly converge on a value, but they may not see their most significant digits match immediately. High may be 700004, and low may be 699995. At this point, the calculated range will be only a single digit long, which means the output word will not have enough precision to be accurately encoded. Worse, after a few more iterations, high could be 70000, and low could be 69999. At this point, the values are permanently stuck. The range between high and low has become so small that any iteration through another symbol will leave high and low at their same values. But since the most significant digits of both words are not equal, the algorithm can’t output the digit and shift. It seems to have reached an impasse. You can defeat this underflow problem by preventing things from ever getting that bad. The original algorithm said something like, “If the most significant digit of high and low match, shift it out.” If the two digits don’t match, but are now on adjacent numbers, a second test needs to be applied. If high and low are one apart, we test to see if the second most significant digit in high is a 0 and if the second digit in low is a 9. If so, it means we are on the road to underflow and need to take action. Head off underflow with a slightly different shift operation. Instead of shifting the most significant digit out of the word, delete the second digits from high and low and shift the rest of the digits left to fill the space. The most significant digit stays in place. Then set an underflow counter to remember that we threw away a digit and aren’t quite sure whether it was going to be a 0 or a 9. This process is shown here:
Before
After
Slide 103: High: Low: Underflow:
40344 39810 0
43449 38100 1
After every recalculation, check for underflow digits again if the most significant digit don’t match. If underflow digits are present, we shift them out and increment the counter. When the most significant digits do finally converge to a single value, output that value. Then output the underflow digits previously discarded. The underflow digits will all be 9s or 0s, depending on whether high and low converged on the higher or lower value. In the C implementation of this algorithm, the underflow counter keeps track of how many ones or zeros to output.
Decoding
In the “ideal” decoding process, we had the entire input number to work with, and the algorithm had to do things like “divide the encoded number by the symbol probability.” In practice, we can’t perform an operation like that on a number that could be billions of bytes long. As in the encoding process, however, the decoder can operate using 16- and 32-bit integers for calculations. Instead of using just two numbers, high and low, the decoder has to use three numbers. The first two, high and low, correspond exactly to the high and low values maintained by the encoder. The third number, code, contains the current bits being read in from the input bit stream. The code value always falls between the high and low values. As they come closer and closer to it, new shift operations will take place, and high and low will move back away from code. The high and low values in the decoder will be updated after every symbol, just as they were in the encoder, and they should have exactly the same values. By performing the same comparison test on the upper digit of high and low, the decoder knows when it is time to shift a new digit into the incoming code. The same underflow tests are performed as well. In the ideal algorithm, it was possible to determine what the current encoded symbol was just by finding the symbol whose probabilities enclosed the present value of the code. In the integer math algorithm, things are somewhat more complicated. In this case, the probability scale is determined by the difference between high and low. So instead of the range being between .0 and 1.0, the range will be between two positive 16-bit integer counts. Where the present code value falls along that range determines current probability. Divide (value - low) by (high - low + 1) to get the actual probability for the present symbol.
Slide 104: Where’s the Beef?
It is not immediately obvious why this encoding process is an improvement over Huffman coding. It becomes clear when we examine a case in which the probabilities are a little different. If we have to encode the stream “AAAAAAA,” and the probability of A is known to be .9, there is a 90 percent chance that any incoming character will be the letter A. We set up our probability table so that A occupies the .0 to . 9 range, and the end-of-message symbol occupies the .9 to 1 range. The encoding process is shown next:
New Character
Low value 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.43046721
High value 1.0 0.9 0.81 0.729 0.6561 0.59049 0.531441 0.4782969 0.4782969
A A A A A A A END
Now that we know the range of high and low values, all that remains is to pick a number to encode this message. The number .45 will make this message uniquely decode to “AAAAAAA.” Those two decimal digits take slightly less than seven bits to specify, which means we have encoded eight symbols in less than eight bits! An optimal Huffman message would have taken a minimum of nine bits. To take this point to an even further extreme, I set up a test that had only two symbols. In it, 0 had a 16382/16383 probability, and an end-of-file symbol had a 1/16383 probability. I then created a file filled with 100,000 0s. After compression using arithmetic coding, the output file was only three bytes long! The minimum size using Huffman coding would have been 12,501 bytes. This is obviously a contrived example, but it shows that arithmetic coding compresses data at rates much better than one bit per byte when the symbol probabilities are right.
The Code
The code supplied with this chapter in ARITH.C is a simple module that performs arithmetic compression and decompression using a simple order 0 model. It works exactly like the non-adaptive Huffman coding program in Chapter 3. It first makes a single pass over the data, counting the symbols. The data is then scaled down to make the
Slide 105: counts fit into a single, unsigned character. The scaled counts are saved to the output file for the decompressor to get at later, then the arithmetic coding table is built. Finally, the compressor passes through the data, compressing each symbol as it appears. When done, the end-of-stream character is sent out, the arithmetic coder is flushed, and the program exits.
The Compression Program
The compression portion of this program is shown shortly. The main module is called by the utility version of MAIN-E.C., which will have already taken care of opening files, parsing arguments, etc. Once we get to the compression phase of the program, things are ready to go. The compressor code breaks down neatly into three sections. The first two lines initialize the model and the encoder. The while loop consists of two lines, which together with the line following the loop perform the compression, and the last three lines shut things down.
build_model( input, output->file ); initialize_arithmetic_encoder(); while ( ( c = getc( input ) ) ! = EOF ) { convert_int_to_symbol( c, &s ); encode_symbol( output, &s ); } convert_int_to_symbol( END_OF_STREAM, &s ); encode_symbol( output, &s ); flush_arithmetic_encoder( output ); OutputBits( output, OL, 16 );
The build_model() routine has several responsibilities. It makes the first pass over the input data to count all the characters. It scales down the counts to fit in unsigned characters, then it takes those counts and builds the range table used by the coder. Finally, it writes the counts to the output file so the decompressor will have access to them later. The initialize arithmetic encoder routine is fairly simple. It just sets up the high- and lowinteger variables used during the encoding. The encoding loop calls two different routines to encode the symbol. The first, convert_int_to_symbol(), takes the character read in from the file and looks up the range for the given symbol. The range is then stored in the symbol object, which has the structure shown:
typedef struct { unsigned short int low_count; unsigned short int high_count; unsigned short int scale; } SYMBOL;
These three values are all that are needed for the symbol to be encoded using the arithmetic encoder. The low-count and high-count values uniquely define where on the 0 to 1 range the symbol lies, and the scale value tells what the total span of the 0 to 1 scale
Slide 106: is. If 1,000 characters had been counted in a text file, for example, the low_count and high_count for A might be 178 and 199, and the scale would be 1,000. This would indicate that on the 0 to 1 number scale, A would own the range .178 to .199. Once the symbol object has been defined, it can be passed to the encoder. The arithmetic encoder needs only those three items to process the symbol and update the output number. It has the high- and low-range values and the underflow count stored internally as static variables, and it doesn’t need anything else. The way we detached the modeling data from the coding data gives us a convenient mechanism for experimenting with new ways of modeling. We just have to come up with the numbers that get plugged into the symbol. The encoder doesn’t care how we got those numbers as long as they were derived consistently so we can decode the file later. When we reach the end of the input file, we encode and send the end-of-stream symbol. The decompression program will use it to determine when it is done. To finish, call a routine to flush the arithmetic encoder, which takes care of any underflow bits. Finally, we have to output an extra sixteen bits. The reason for this is simple. When decoding symbols from the input bit stream, the effective bits are in the most significant bit position of the input bit registers. To get the bits there, we have to load other bits into the lower positions and shift them over. At least 15 insignificant bits are needed to decode the last symbol. Outputting 16 bits at the end of the program ensures that the decoder won’t get a premature end of file while decoding the input file.
The Expansion Program
The main part of the expansion program follows the same pattern. First, the model is set up and the arithmetic coder is initialized. In this program, initializing the model means reading in the counts from the input file where the compressor put them. Initializing the arithmetic decoder means loading the low and high registers with 0000 and FFFF, then reading the first 16 bits from the input file into the current code.
input_counts( input->file ); initialize_arithmetic_decoder( for ( ; ; ) { get_symbol_scale( &s ); count = get_current_count( c = convert_symbol_to_int( if ( c == END_OF_STREAM ) break; remove_symbol_from_stream( putc( (char) c, output ); } input ); &s ); count, &s ); input, &s );
The decoding loop is a little more complicated in this routine to keep the modeling and decoding separate. First, get the scale for the current model to pass back to the arithmetic decoder. The decoder then converts its current input code into a count in the routine
Slide 107: get_current_count. With the count, we can determine which symbol is the correct one to decode. This is done in the routine convert_symbol_to_int(). Though it seems strange, we don’t remove the encoded symbol from the input bit stream till after we actually decode it. The process of removing it involves standard modifications of high and low and, if necessary, shifting in new bits from the input stream. Finally, the decoded character is written to the output file.
Initializing the Model
The model for a program using arithmetic coding needs to have three pieces of information for each symbol: the low end of its range, the high end of its range, and the scale of the entire alphabet’s range (this is the same for all symbols in the alphabet). Since the top of a given symbol’s range is the bottom of the next, we only need to keep track of N + 1 numbers for N symbols in the alphabet. An example of how the range information would be created is shown for symbols A, B, C, D, and E. These symbols have the counts 5, 2, 3, 3, and 8 respectively. The numbers can be arranged in any order along the range, if done consistently. Range E D C B A : : : : : : 21 13 10 7 5 0
In this case, the alphabet has five symbols, and the number of counts to keep track of is six. The array is formed by creating the cumulative total at each symbol, starting at zero and working up to the range. Given a structure like this, it is simple to derive the three items of information that define the symbol for the arithmetic encoder. For symbol x in the array, the low count can be found at totals[ x ], the high count at totals[ x + 1 ], and the range of scale at totals[ N ], N being the number of symbols in the alphabet. The routines that do compression create this array. In this program, the array is named totals[], and it has 258 elements. The number of symbols in the alphabet is 257, the normal 256 plus one for the end-of-stream symbol. One additional constraint is placed on these counts. Two things determine the highest possible count: the number of bits available in the high and low registers and the number of bits in the code values. Since floating-point calculations are performed in fixed-length registers, we have to minimize the amount of precision in our calculations so errors do not occur.
Slide 108: As it happens, there are two restrictions on the number of bits used in arithmetic coding: (1) the number of bits in the frequency values must be at least two less than the number of bits in the high and low registers; and (2) the number of bits in the frequency values plus the bits used in either the high and low registers must not exceed the number of bits used in arithmetic calculations during the coding process. During calculations on the arithmetic registers, the code in this chapter uses unsigned long values, which give 32 bits to work with. Since our high and low registers and our frequency counts are limited to 16 bits in unsigned short ints, we meet restriction 2 implicitly. Restriction 1, however, requires more modifications. Since we are only using 16-bit registers for our high and low values, we have to restrict the highest cumulative counts in the totals[] array to no more than 14 bits, or 16,384. The code to make sure that our total count of symbols is less than 16,384 is in the build_model routine called on initialization of the model. It takes the process a step further, scaling the counts down so they all fit in a single byte. This is done so that the count data stored in the output file takes a minimal amount of space. During the compression phase of the program, the build_model() routine is called to perform all the necessary chores associated with the order-0 modeling used in this program. The four lines of code from build_model() are shown here:
count_bytes( input, counts ); scale_counts( counts, scaled_counts ); output_counts( output, scaled_counts ); build_totals( scaled_counts );
As you can see above, the build_model routine does no work on its own. It calls a series of four worker routines to handle the data for it. The first routine is count_bytes(). It does just that, counting all the occurrences of each symbol in the file and maintaining the total in an array, like so:
input_marker = ftell( input ); while ( ( c = getc( input )) != EOF ) counts[ c ]++; fseek( input, input_marker, SEEK_SET );
The code for count_bytes scans the entire input file, then returns the input pointer to where it was when called. We assume that the number of counts of a given symbol in the file cannot exceed the span of an unsigned long. If this is not true, other measures will need to be taken to avoid overflow of the counts[] array elements. After the array has been counted, the counts have to be scaled down. This is done in the scale_counts() routine. The first step here is to scale down the unsigned long counts[] array to fit in an array of unsigned characters. This stores the counts in less space in the output file. The code for this is listed here.
max_count = 0;
Slide 109: for ( i = 0 ; i < 256 ; i++ ) if (counts[ i ] > max_count ) max_count = counts[ i ]; scale = max_count / 256; scale = scale + 1; for ( i = 0 ; i < 256 ; i++ ) { scaled_counts[ i ] = (unsigned char ) ( counts[ i ] /scale ); if ( scaled_counts[ i ] == 0 && counts[ i ] != 0 ) scaled_counts[ i ] = 1; }
After this is complete, one more scaling may need to be done. As part of the limitations on performing arithmetic coding using fixed-length registers, we have to restrict the total of our counts to less than 16,384, or fourteen bits. The second part of scale_counts does this with brute force, checking the total, then dividing by either two or four to get the correct results. An additional count has to be added because of the single count used by the end-of-stream character.
total = 1; for ( i = 0 ; i < 256 ; i++ ) total += scaled_counts[ i ]; if ( total > ( 32767 - 256 ) ) scale = 4; else if ( total > 16383 ) scale = 2; else return; for ( i = 0 ; i < 256 ; i ++ ) scaled_counts[ i ] /= scale;
There is certainly room in the scale_counts() routine for more sophistication. Every time we lose some of the accuracy in our counts by scaling, we also lose a certain amount of compression. An ideal scaling routine would scale down the counts to fit in the correct range while doing as little scaling as possible. The routines listed here don’t work very hard at doing that. Once the counts have been scaled down to fit in the unsigned character array, the output_counts() routine is called to save them to the output file. This program employs the same method to store the counts as used in Chapter 3 for the Huffman coding example. Instead of storing the entire array of counts, we only store runs on nonzero values. For details on this, refer to Chapter 3. The last step in building the model is to set up the cumulative totals array in totals[]. This is the array used when actually performing the arithmetic coding. The code shown below builds that array. Remember that after the totals[] array has been built, the range for symbol x is found in totals[ x ] and totals[ x + 1 ]. The range used for the entire alphabet is found in totals[ END_OF_STREAM +1 ].
totals[ 0 ] = 0; for ( i = 0 ; i < END_OF_STREAM ; i++ ) totals[ i + 1 ] = totals[ i ] + scaled_counts[ i ];
Slide 110: totals[ END_OF_STREAM + 1 ] = totals[ END_OF_STREAM ] + 1;
Reading the Model
For expansion, the code needs to build the same model array in totals[] that was used in the compression routine. Since the original file is not available to scan for counts, the program reads in the scaled_counts[] array stored in the compressed file. The code that accomplishes this is identical to the Huffman expansion code in Chapter 3. Refer to Chapter 3 for details on how this code works. After the scaled_counts[] array has been read in, the same routine used by the compression code can be invoked to build the totals[] array. Calling build_totals() in both the compression and expansion routines helps ensure that we are working with the same array.
Initializing the Encoder
Before compression can begin, we have to initialize the variables that constitute the arithmetic encoder. Three 16-bit variables define the arithmetic encoder: low, high, and underflow_bits. When the encoder first starts to run, the range of the output floating-point number is anywhere between 0 and 1. The low variable is initialized to 0 and the high to 0xFFFF. These two variables have an implied decimal point just ahead of the most significant bit and an infinite trail of ones or zeros. The ones will be shifted into the high variable and the zeros into the low.
low = 0; high = 0xffff; underflow_bits = 0;
The Encoding Process
At this point, the program is ready to begin the actual encoding process. This consists of looping through the entire file, reading in a character, determining its range variables, then encoding it. After the file has been scanned, the final step is to encode the end-ofstream symbol.
while ( ( c = getc( input ) ) !=EOF ) { convert_int_to_symbol( c, &s ); encode_symbol( output, &s ); } convert_int_to_symbol( END_OF_STREAM, &s ); encode_symbol( output, &s );
Two routines encode a symbol. The convert_int_to_symbol() routine looks up the modeling information for the symbol and retrieves the numbers needed to perform the arithmetic coding. This consists of stuffing numbers into the three elements of the symbol’s structure, as shown here:
Slide 111: s->scale = totals[ END_OF_STREAM + 1 ]; s->low_count = totals[ c ]; s->high_count = totals[ c + 1 ];
After the symbol information is present, we are ready to encode the symbol using the arithmetic encoding routine. The C code to do this, listed in encode_symbol(), has two distinct steps. The first is to adjust the high and low variables based on the symbol data passed to the encoder.
range = (long) ( high-low ) + 1; high = low + (unsigned short int) (( range * s->high_count ) / s->scale - 1 ); low = low + (unsigned short int) (( range * s->low_count ) / s->scale );
The code shown below restricts the range of high and low by the amount indicated in the symbol information. The range of the symbol is defined by s->low_count and s>high_count. These two counts are measured relative to the s->scale variable. After the adjustments are made, low and high will both be greater than or equal to their previous values. The range, or the distance between the two, will be less than or equal to its previous value.
for ( ; ; ) { if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) { OutputBit( stream, high & 0x8000 ); while ( underflow_bits > 0 ) { OutputBit( stream, ~high & 0x8000 ); underflow_bits--; } } else if ( ( low & 0x4000 ) && !( high & 0x4000 ) ) { underflow_bits += 1; low &= 0x3fff; high |= 0x4000; } else return low <<= 1; high <<= 1; high |= 1; }
After high and low have been adjusted, the routine needs to shift out any bits available for shifting. After a given arithmetic adjustment, it is never certain how many bits will need to be shifted out. If the encoded symbol has a very high probability, the number of bits will be low. If the encoded symbol has a low probability, the number of bits may be high. Since the number isn’t known in advance, the encoding routine sits in a loop shifting bits until there are no more shifts possible. The routine tests for two conditions to see if shifting is necessary. The first occurs when the most significant bits of the low and high word are the same. Because of the math being used, once the two bits are identical, they
Slide 112: will never change. When this occurs, the bit is sent to the output file, and the high and low values are shifted. Before shifting out the bit found when the most MSBs match, however, we have to transmit any underflow bits previously saved up. The underflow bits will be a sequence of bits set to the opposite value of the MSB. When we have an underflow, we have a binary sequence that ends up looking like that shown above. The number of underflow bits is found in the underflow_bits variable.
high = .100000... low = .011111...
Which leads to the second condition under which high and low variables require shifting: underflow. This occurs when the high and low words are growing dangerously close together but have not yet had their most significant bits match, a situation similar to that shown above. When words begin growing close together like this, the dynamic range becomes dangerously low. Test for this condition after determining that the two most significant bits don’t match. If they don’t, check to see if the next bit is 0 in the high word and 1 in the low word. If they are, perform an underflow shift. The underflow shift operation consists of throwing away the underflow bit (the one next to the most significant digit), shifting the remaining bits over one by one to fill the gap, and incrementing the underflow counter. The code to do this is somewhat opaque, but it performs this operation. The underflow code takes advantage of the fact that when in danger of underflow, we know two things. First, we know that the most significant bit in the high word is 1 and in the low word 0. Second, the bit we throw away from the high word is 0 and from the low word 1. Since we know the value of the highest two bits, we can simplify the shift operation. The code used in this chapter toggles the second most significant bit in both the high and low registers, then performs the normal shift operation. It looks as though the lower 14 bits were shifted left and the MSB was left alone. If we check for both possible shift conditions and don’t flag either one, we are done shifting bits out and can end the encoding operation. If either of the tests passed, the actual shift operation can take place. Both the high and low words are shifted left one bit position. The high word has a 1 shifted in to the LSB, and the low word has a 0 shifted in. The loop then repeats, outputting and shifting additional bits as necessary.
Slide 113: Flushing the Encoder
After encoding, it is necessary to flush the arithmetic encoder. The code for this is in the flush_arithmetic_encoder() routine. It outputs two bits and any additional underflow bits added along the way.
The Decoding Process
Before arithmetic decoding can start, we need to initialize the arithmetic decoder variables. While encoding, we had just a high and low variable. Both are maintained by the decoder with a code variable, which contains the current bit stream read in from the input file. During arithmetic decoding, the high and low variables should track exactly the same values they had during the encoding process, and the code variable should reflect the bit stream as it is read in from the input file. The only exception to this is that the code variable has underflow bits taken from it and thrown away, as with the high and low variables.
code = 0; for ( i = 0 ; i < 16 ; i++ ) { code <<= 1; code += InputBit( stream ); } low = 0; high = Oxffff;
This implementation of the arithmetic decoding process requires four separate steps to decode each character. The first is to get the current scale for the symbol. This is simply a matter of looking in the current model data. In this implementation, the scale is found at totals[ END_OF_STREAM + 1 ]. The reason for breaking this out as a separate routine rather than coding it in-line is that future expansions to the basic compression program may use different modeling techniques. If a different model is used, determining the scale of the model could end up being more complex. This happens in the program used in the next chapter. Once the current scale for the model is known, a second call is made to get the count for the current arithmetic code. This involves translating the decoders range, expressed by the high and low variables, into the range used by the model, which is in the scale variable.
range = (long) ( high - low ) + 1; count = (short int) ((((long) ( code - low ) + 1 ) * s->scale-1 ) / range ); return( count );
The count returned to the expansion program is in essence a simple translation of the code variable from one scale to another. Once the count has been determined, we can
Slide 114: determine what symbol has been encoded. Since we know the low and high range of the count for every symbol in the alphabet, determining which symbol goes with which count is just a matter of looking through the counts listed in the totals[] array.
for ( c = END_OF_STREAM ; count < totals[ c ] ; c–– ) ; s->high_count = totals[ c + 1 ]; s->low_count = totals[ c ]; return( c );
The implementation of the convert_symbol_to_int() used here determines the correct symbol with brute force. It simply starts looking at the top of the totals[] array for a count that fits with the current symbol, and it works down until it finds one that does. This is not optimal, since it could take 257 comparisons to locate the correct symbol. An improved method of decoding would keep the symbols sorted so that the most frequently used symbols would be checked first. This would reduce the average time required to locate a symbol, though with random data we would not see much improvement. For simplicity, this was not the method used here. Once convert_symbol_to_int() locates the correct symbol in the totals[] array, it takes the high and low counts and stores them in the symbol variable. They will be used in the next step of the decoding process. The correct value of the symbol is then returned to the calling program as an int. After the correct symbol value is set up in the symbol structure, remove_symbol_from_stream() is called. Arithmetic coding is unusual in that it determines what the symbol is before it removes the bits associated with it. Then it calls the routine that actually removes those bits from the code and sets up the input code for the next symbol.
range = (long)( high - low ) + 1; high = low + (unsigned short int) (( range * s->high_count ) / s->scale - 1 ); low = low + (unsigned short int) (( range * s->low_count ) / s->scale ); for ( ; ; ) { if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) { } else if ((low & 0x4000) == 0x4000 && (high & 0x4000) == 0 ) { code ^= 0x4000; low &= 0x3fff; high |= 0x4000; } else return; low <<= 1; high <<= 1; high |= 1; code <<= 1; code += InputBit( stream ); }
Slide 115: The code that removes the symbol from the stream is listed above. It operates almost as a mirror image of the encoding routine. It first rescales the high and low variables to their new ranges as dictated by the range of the symbol being removed. Then the shifting process begins. As before, there are two possible reasons to shift in new bits. First, if high and low have the same most significant bit, they will be discarded and a new bit will be shifted in as a replacement. Second, if high and low don’t have the same MSB, and the second most significant bits are threatening underflow, we will discard the second most significant bits and shift in new bits. If neither of the possible shift criteria are met, we can return, since the effects of the symbol have been entirely removed from the input stream. Otherwise, we shift high, low, and code. The lowest bit of high gets a 1, the lowest bit of low gets a 0, and the lowest bit of code gets a new bit from the input stream. This process continues indefinitely until all shifting is complete, at which point we return to the calling routine.
Summary
Arithmetic coding seems more complicated than Huffman coding, but the size of the program required to implement it is not significantly different. Runtime performance is significantly slower than Huffman coding, however, due to the computational burden imposed on the encoder and decoder. If squeezing the last bit of compression capability out of the coder is important, arithmetic coding will always do as good a job or better, than Huffman coding. But careful optimization is needed to get performance up to acceptable levels.
Code
/************************** Start of ARITH.C *************************/ #include <stdio.h> #include <stdlib.h> #include "errhand.h" #include "bitio.h" /* * The SYMBOL structure is what is used to define a symbol in * arithmetic coding terms. A symbol is defined as a range between * 0 and 1. Since we are using integer math, instead of using 0 and 1 * as our end points, we have an integer scale. The low_count and * high_count define where the symbol falls in the range. */ typedef struct { unsigned short int low_count; unsigned short int high_count; unsigned short int scale; } SYMBOL; /* * Internal function prototypes, with or without ANSI prototypes.
Slide 116: */ #ifdef __STDC__ void build_model( FILE *input, FILE *output ); void scale_counts( unsigned long counts[], unsigned char scaled_counts[] ); void build_totals( unsigned char scaled_counts[] ); void count_bytes( FILE *input, unsigned long counts[] ); void output_counts( FILE *output, unsigned char scaled_counts[] ); void input_counts( FILE *stream ); void convert_int_to_symbol( int symbol, SYMBOL *s ); void get_symbol_scale( SYMBOL *s ); int convert_symbol_to_int( int count, SYMBOL *s ); void initialize_arithmetic_encoder( void ); void encode_symbol( BIT_FILE *stream, SYMBOL *s ); void flush_arithmetic_encoder( BIT_FILE *stream ); short int get_curret_count( SYMBOL *s ); void initialize_arithmetic_decoder( BIT_FILE *stream ): void remove_symbol_from_stream( BIT_FILE *stream, SYMBOL * s ); #else void build_model(); void scale_counts(); void build_totals(); void count_bytes(); void output_counts(); void input_counts(); void convert_int_to_symbol(); void get_symbol_scale(); int convert_symbol_to_int(); void initialize_arithmetic_encoder(); void encode_symbol(); void flush_arithmetic_encoder(); short int get_current_count(); void initialize_arithmetic_decoder(); void remove_symbol_from_stream(); #endif #define END_OF_STREAM 256 short int totals[ 258 ]; /* The cumulative totals */
char *CompressionName = "Adaptive order 0 model with arithmetic coding"; char *Usage = "in-file out-file\n\n\"; /* * This compress file routine is a fairly orthodox compress routine. * It first gathers statistics, and initializes the arithmetic * encoder. It then encodes all the characters in the file, followed * by the EOF character. The output stream is then flushed, and we * exit. Note that an extra two bytes are output. When decoding an * arithmetic stream, we have to read in extra bits. The decoding process * takes place in the msb of the low and high range ints, so when we are * decoding our last bit we will still have to have at least 15 junk * bits loaded into the registers. The extra two bytes account for
Slide 117: * that. */ void CompressFile( input, output, argc, argv ) FILE * input; BIT_FILE *output; int argc; char *argv[]; { int c; SYMBOL s; build_model( input, output->file ); initialize_arithmetic_encoder(); while ( ( c = getc( input ) ) != EOF ) { convert_int_to_symbol( c, &s ); encode_symbol( output, &s ); } convert_int_to_symbol( END_OF_STREAM, &s ); encode_symbol( output, &s ); flush_arithmetic_encoder( output ); OutputBits( output, 0L, 16 ); while ( argc–– > 0 ) { printf( "Unused argument: %s\n", *argv ); argv++; }
}
/* * This expand routine is also very conventional. It reads in the * model, initializes the decoder, then sits in a loop reading in * characters. When we decode an END_OF_STREAM, it means we can close * up the files and exit. Note decoding a single character is a three * step process: first we determine what the scale is for the current * symbol by looking at the difference between the high and low values. * We then see where the current input values fall in that range. * Finally, we look in our totals array to find out what symbol is * a match. After that is done, we still have to remove that symbol * from the decoder. Lots of work. */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { SYMBOL s; int c; int count; input_counts( input->file ); initialize_arithmetic_decoder( input ); for ( ; ; ) { get_symbol_scale( &s ); count = get_current_count( &s );
Slide 118: }
} while ( argc–– > 0 ) { printf( "Unused argument: argv++; }
c = convert_symbol_to_int( count, &s ); if ( c == END_OF_STREAM ) break; remove_symbol_from_stream( input, &s ); putc( (char) c, output ): %s\n", *argv );
/* * This is the routine that is called to scan the input file, scale * the counts, build the totals array, the output the scaled counts * to the output file. */ void build_model( input, output ) FILE *input; FILE *output; { unsigned long counts[ 256 ]; unsigned char scaled_counts[ 256 ]; count_bytes( input, counts ); scale_counts( counts, scaled_counts ); output_counts( output, scaled_counts ); build_totals( scaled_counts ); } /* * This routine runs through the file and counts the appearances of * each character. */ #ifndef SEEK_SET #define SEEK_SET 0 #endif void count_bytes( input, counts ) FILE *input; unsigned long counts[]; { long input_maker; int i; int c; for ( i = 0 ; i < 256; i++ ) counts[ i ] = 0; input_marker = ftell( input ); while ( ( c = getc( input ) ) != EOF ) counts[ c ]++; fseek( input, input_marker, SEEK_SET );
}
/* * This routine is called to scale the counts down. There are two * types of scaling that must be done. First, the counts need to be
Slide 119: * scaled down so that the individual counts fit into a single unsigned * char. Then, the counts need to be rescaled so that the total of all * counts is less than 16384. */ void scale_counts( counts, scaled_counts ) unsigned long counts[]; unsigned char scaled_counts[]; { int i; unsigned long max_count; unsigned int total; unsigned long scale; /* * The first section of code makes sure each count fits into a single * byte. */ max_count = 0; for ( i = 0 ; i < 256 ; i++ ) if ( counts[ i ] > max_count ) max_count = counts[ i ]; scale = max_count / 256; scale = scale + 1; for ( i = 0 ; i < 256 ; i++ ) { scaled_counts[ i ] = (unsigned char ) ( counts[ i ] / scale ); if ( scaled_counts[ i ] == 0 && counts[ i ] != 0 ) scaled_counts[ i ] = 1; }
/* * This next section makes sure the total is less than 16384. * I initialize the total to 1 instead of 0 because there will be an * additional 1 added in for the END_OF_STREAM symbol; */ total = 1; for ( i = 0 ; i < 256 ; i++ ) total += scaled_counts[ i ]; if ( total > ( 32767 - 256 ) ) scale = 4; else if ( total > 16383 ) scale = 2; else return; for ( i = 0 ; i < 256 ; i++ ) scaled_counts[ i ] /= scale; } /* * This routine is used by both the encoder and decoder to build the * table of cumulative totals. The counts for the characters in the * file are in the counts array, and we know that there will be a * single instance of the EOF symbol. */ void build_totals( scaled_counts ) unsigned char scaled_counts[]; {
Slide 120: int i; totals[ 0 ] = 0; for ( i = 0 ; i < END_OF_STREAM ; i++ ) totals[ i + 1 ] = totals[ i ] + scaled_counts[ i ]; totals[ END_OF_STREAM + 1 ] = totals[ END_OF_STREAM ] + 1;
}
/* * In order for the compressor to build the same model, I have to * store the symbol counts in the compressed file so the expander can * read them in. In order to save space, I don't save all 256 symbols * unconditionally. The format used to store counts looks like this: * * start, stop, counts, start, stop, counts, … 0 * * This means that I store runs of counts, until all the non-zero * counts have been stored. At this time the list is terminated by * storing a start value of 0. Note that at least 1 run of counts has * to be stored, so even if the first start value is 0, I read it in. * It also means that even in an empty file that has no counts, I have * to pass at least one count. * * In order to efficiently use this format, I have to identify runs of * non-zero counts. Because of the format used, I don't want to stop a * run because of just one or two zeros in the count stream. So I have * to sit in a loop looking for strings of three or more zero values * in a row. * * This is simple in concept, but it ends up being one of the most * complicated routines in the whole program. A routine that just * writes out 256 values without attempting to optimize would be much * simpler, but would hurt compression quite a bit on small files. * */ void output_counts( output, scaled_counts ) FILE *output; unsigned char scaled_counts[]; { int first; int last; int next; int i; first = 0; while ( first < 255 && scaled_counts[ first ] == 0 ) first++;
/* * Each time I hit the start of the loop, I assume that first is the * number for a run of non-zero values. The rest of the loop is * concerned with finding the value for last, which is the end of the * run, and the value of next, which is the start of the next run. * At the end of the loop, I assign next to first, so it starts in on * the next run. */ for ( ; first < 256 ; first = next ) { last = first + 1;
Slide 121: /* * Here is where I output first, last, and all the counts in between. */ if ( putc( first, output ) != first ) fatal_error( "Error writing byte counts\n" ); if ( putc( last, output ) != last ) fatal_error( "Error writing byte counts\n" ); for ( i = first ; i <= last ; i++ ) { if ( putc( scaled_counts[ i ], output ) != (int) scaled_counts[ i ] ) fatal_error( "Error writing byte counts\n" ); } } if ( putc( 0, output ) != 0 ) fatal_error( "Error writing byte counts\n" ); } /* * When expanding, I have to read in the same set of counts. This is * quite a bit easier that the process of writing them out, since no * decision making needs to be done. All I do is read in first, check * to see if I am all done, and if not, read in last and a string of * counts. */ void input_counts( input ) FILE *input; { int first; int last; int i; int c; unsigned char scaled_counts[ 256 ]; for ( i = 0 ; i < 256 ; i++ ) scaled_counts[ i ] = 0; if ( ( first = getc( input ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); if ( ( last = getc( input ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); for ( ; ; ) { for ( i = first ; i <= last ; i++ ) if ( ( c = getc( input ) ) == EOF ) fatal_error( "Error reading byte counts\n" );
for ( ; ; ) { for ( ; last < 256 ; last++ ) if ( scaled_counts[ last ] == 0 ) break; last ––; for ( next = last + 1; next < 256 ; next++ ) if ( scaled_counts[ next ] != 0 ) break; if ( next > 255 ) break; if ( ( next - last ) > 3 ) break; last = next; };
Slide 122: }
} build_totals( scaled_counts );
else scaled_counts[ i ] = (unsigned int) c; if ( ( first = getc( input ) ) == EOF ) fatal_error( "Error reading byte counts\n" ); if ( first == 0 ) break; if ( ( last = getc( input ) ) == EOF ) fatal_error( "Error reading byte counts\n" );
/* * Everything from here down defines the arithmetic coder section * of the program. */ /* * These four variables define the current state of the arithmetic * coder/decoder. They are assumed to be 16 bits long. Note that * by declaring them as short ints, they will actually be 16 bits * on most 80X86 and 680X0 machines, as well as VAXen. */ static unsigned short int code;/* The present input code value static unsigned short int low; /* Start of the current code range static unsigned short int high;/* End of the current code range long underflow_bits: /* Number of underflow bits pending
*/ */ */ */
/* * This routine must be called to initialize the encoding process. * The high register is initialized to all 1s, and it is assumed that * it has an infinite string of 1s to be shifted into the lower bit * positions when needed. */ void initialize_arithmetic_encoder() { low = 0; high = 0xffff; underflow_bits = 0; } /* * At the end of the encoding process, there are still significant * bits left in the high and low registers. We output two bits, * plus as many underflow bits as are necessary. */ void flush_arithmetic_encoder( stream ) BIT_FILE *stream; { OutputBit( stream, low & 0x4000 ); underflow_bits++; while ( underflow_bits–– > 0 ) OutputBit( stream, ~low & 0x4000 ); } /* * Finding the low count, high count, and scale for a symbol * is really easy, because of the way the totals are stored.
Slide 123: * This is the one redeeming feature of the data structure used * in this implementation. */ void convert_int_to_symbol( c, s ) int c; SYMBOL *s; } s->scale = totals[ END_OF_STREAM + 1]; s->low_count = totals[ c ]; s->high_count = totals[ c + 1 ]; } /* * Getting the scale for the current context is easy. */ void get_symbol_scale( s ) SYMBOL *s; { s->scale = totals[ END_OF_STREAM + 1 ]; } /* * During decompression, we have to search through the table until * we find the symbol that straddles the "count" parameter. When * it is found, it is returned. The reason for also setting the * high count and low count is so that symbol can be properly removed * from the arithmetic coded input. */ int convert_symbol_to_int( count, s ) int count; SYMBOL *s; { int c; for ( c = END_OF_STREAM ; count < totals[ c ] ; c–– ) ; s->high_count = totals[ c + 1 ]; s->low_count = totals[ c ]; return( c ); } /* * This routine is called to encode a symbol. The symbol is passed * in the SYMBOL structure as a low count, a high count, and a range, * instead of the more conventional probability ranges. The encoding * process takes two steps. First, the values of high and low are * updated to take into account the range restriction created by the * new symbol. Then, as many bits as possible are shifted out to * the output stream. Finally, high and low are stable again and * the routine returns. */ void encode_symbol( stream, s ) BIT_FILE *stream; SYMBOL *s; { long range; /* * These three lines rescale high and low for the new symbol.
Slide 124: */
/* * This loop turns out new bits until high and low are far enough * apart to have stabilized. */ for ( ; ; ) { /* * If this test passes, it means that the MSDigits match, and can * be sent to the output stream. */ if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) { OutputBit( stream, high & 0x8000 ); while ( underflow_bits > 0 ) { OutputBit( stream, ~high & 0x8000 ); underflow_bits––; } } /* * If this test passes, the numbers are in danger of underflow, because * the MSDigits don't match, and the 2nd digits are just one apart. */ else if ( ( low & 0x4000 ) && !( high & 0x4000 )) { underflow_bits += 1; low &= 0x3fff; high |= 0x4000; } else return ; low <<= 1; high <<= 1; high |= 1; } } /* * When decoding, this routine is called to figure out which symbol * is presently waiting to be decoded. This routine expects to get * the current model scale in the s->scale parameter, and it returns * a count that corresponds to the present floating point code; * * code = count / s->scale */ short int get_current_count( s ) SYMBOL *s; { long range; short int count; range = (long) ( high - low ) + l; count = (short int) ((((long) ( code - low ) + 1 ) * s->scale-1 ) / range ) ; return( count );
range = (long) ( high-low ) + 1; high = low + (unsigned short int) (( range * s->high_count ) / s->scale - 1 ); low = low + (unsigned short int) (( range * s->low_count ) / s->scale );
Slide 125: } /* * This routine is called to initialize the state of the arithmetic * decoder. This involves initializing the high and low registers * to their conventional starting values, plus reading the first * 16 bits from the input stream into the code value. */ void initialize_arithmetic_decoder( stream ) BIT_FILE *stream; { int i; code = 0; for ( i = 0 ; i < 16 ; i++ ) { code <<= 1; code += InputBit( stream ); } low = 0; high = 0xffff; } /* * Just figuring out what the present symbol is doesn't remove * it from the input bit stream. After the character has been * decoded, this routine has to be called to remove it from the * input stream. */ void remove_symbol_from_stream( stream, s ) BIT_FILE *stream; SYMBOL *s; { long range; /* * First, the range is expanded to account for the symbol removal. */ range = (long)( high - low ) + l; high = low + (unsigned short int) (( range * s->high_count ) / s->scale - 1); low = low + (unsigned short int) (( range * s->low_count ) / s->scale ); */ * Next, any possible bits are shipped out. */ for ( ; ; ) { /* * If the MSDigits match, the bits will be shifted out. */ if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) { } /* * Else, if underflow is threatening, shift out the 2nd MSDigit. */ else if ((low & 0x4000) == 0x4000 && (high & 0x4000) == 0 ) { code ^= 0x4000; low &= 0x3ffff; high |= 0x4000;
Slide 126: } else /* * Otherwise, nothing can be shifted out, so I return. */ return; low <<= 1; high <<= 1; high |= 1; code <<= 1; code += InputBit( stream ); } } /************************** End of ARITH.C ***************************/
Slide 127: Chapter 6 Statistical Modeling
The previous three chapters have shown several coding techniques used to compress data. The two coding methods disscussed, Huffman and arithmetic coding, can be implemented using either the fixed or adaptive approaches, but in all cases a statistical model needs to drive them. The chapters which discuss these coding techniques all used a simple order-0 model, which provides fairly good compression. This chapter discusses how to combine more sophisticated modeling techniques with standard coding methods to acheive better compression.
Higher-Order Modeling
To compress data using arithmetic or Huffman coding, we need a model of the data stream. The model needs to do two things to achieve compression: (1) it needs to accurately predict the frequency/probability of symbols in the input data stream, and (2) the symbol probabilities generated by the model need to deviate from a uniform distribution. Accurately predicting the probability of symbols in the input data is an inherent need in arithmetic or Huffman coding. This type of coding reduces the number of bits needed to encode a character as its probability of appearance increases. If the letter E represents 25 percent of the input data, it should take only two bits to code. If the letter Z represents only .1 percent of the input data, it might take ten bits to code. If the model is not generating probabilities accurately, it might take ten bits to code E and two bits to code Z, causing data expansion instead of compression. The model also needs to make predictions that deviate from a uniform distribution. The better the model is at making these predictions, the better the compression ratios will be. A model could be created, for example, that assigned all 256 possible symbols a uniform probability of 1/256. This model would create an output file exactly the same size as the input file, since every symbol would take exactly eight bits to encode. Only by correctly finding probabilities that deviate from a normal distribution can the number of bits be reduced, leading to compression. The increased probabilities have to accurately reflect reality, of course, as prescribed by the first condition. It may seem that the probability of a given symbol occurring in a data stream is fixed, but this is not quite true. The probability of a character can change quite a bit, depending on the model. When compressing a C program, for example, the probability of a new-line character in the text might be 1/40, a probability determined by scanning the entire text and dividing the number of occurrences of the character by the total number of characters. But if a modeling technique looks at a single previous character, the probabilities change. In that case, if the previous character were a ‘}’, the probability of a new-line character goes up to 1/2. This improved modeling technique leads to better compression, though both models were generating accurate probabilities.
Slide 128: Finite Context Modeling
The modeling discussed in this chapter is called “finite-context” modeling. It is based on a simple idea; calculate the probabilities for each incoming symbol based on the context in which the symbol appears. In all of the examples shown here, the context consists of nothing more than symbols previously encountered. The “order” of the model refers to the number of previous symbols that make up the context. The simplest finite-context model would be an order-0 model that calculates the probability of each symbol independently of any previous symbols. To implement this model, all that is needed is a single table containing the frequency counts for each symbol that might be encountered in the input stream. An order-1 model keeps track of 256 different tables of frequencies, since it needs a separate set of counts for each possible context. Similarly, an order-2 model needs to handle 65,536 different tables of contexts. The models used in chapters 3, 4, and 5 were all order-0 models. They didn’t take up much storage space, and they were simple to manipulate. By confining ourselves to order-0 modeling, however, we ensured that our data-compression ratios were relatively modest.
Adaptive Modeling
It seems logical that as the order of the model increases, compression ratios ought to improve as well. The probability of the letter u appearing in the text of this book may only be 5 percent, for example, but if the previous context character is q, the probability goes up to 95 percent. Predicting characters with high probability lowers the number of bits needed, and larger contexts ought to let us make better predictions. Unfortunately, as the order of the model increases linearly, the memory consumed by the model increases exponentially. With an order 0 model, the space consumed by the statistics could be as small as 256 bytes. Once the order of the model increases to 2 or 3, even the most cleverly designed models will consume hundreds of kilobytes. The conventional way of compressing data is to make a pass over the symbols to gather statistics for the model. Then a second pass is made to actually encode the data. The statistics are usually carried with the compressed data so the decoder will have a copy. This approach obviously has serious problems if the statistics for the model take more space than the data to be compressed. Adaptive compression is the solution to this problem. In adaptive data compression, both the compressor and the decompressor start with the same model. The compressor encodes a symbol using the existing model, then it updates the model to account for the new symbol using the existing model, then it updates the model to account for the new symbol. The decompressor likewise decodes a symbol using the existing model, then it updates the model. As long as the algorithm to update the model operates identically for the
Slide 129: compressor and the decompressor, the process can operate perfectly without needing to pass a statistics table from the compressor to the decompressor. Adaptive data compression has a slight disadvantage in that it starts compressing with less than optimal statistics. By subtracting the cost of transmitting the statistics with the compressed data, however, an adaptive algorithm will usually perform better than a fixed statistical model. Adaptive compression also suffers in the cost of updating the model. When updating the count for a particular symbol using arithmetic coding, for example, the update code has the potential cost of updating the cumulative counts for all other symbols as well, leading to code that on the average performs 128 arithmetic operations for every symbol encoded or decoded, using the modeling techniques needed for arithmetic coding. Because of high cost in both memory and CPU time, higher-order adaptive models have only become practical in perhaps the last ten years. It is ironic that as the cost of disk space and memory goes down, the cost of compressing the data stored there also goes down. As these costs continue to decline, we will be able to implement even more effective programs than are practical today.
A Simple Example
The sample program in Chapter 4 used Huffman coding to demonstrate adaptive compression. In this chapter, the sample program will use adaptive arithmetic coding. When performing finite-context modeling, we need a data structure to describe each context used while compressing the data. If we move up from an order to an order-1, for example, we will use the previous symbol as a context for encoding the current symbol. An array of 256 context arrays is probably the simplest way to create the data structures for an order-1 model. As we saw in the last chapter, a simple context model for an arithmetic encoder can be created using an array of cumulative counts for each symbol. If we have 256 symbols in our alphabet, an array of pointers to 256 different context arrays can be created like this:
int *totals[ 256 ]; void initialize_model() { int context; int i; for (context= 0 ; context < END_OF_STREAM ; context++ ) { totals[ context ] = (int *) calloc( END_OF_STREAM + 2, sizeof( int ) ); if ( totals[ context ] == NULL ) fatal_error( "Failure allocating context %d", context ); for ( i = 0 ; i <= ( END_OF_STREAM + 1 ) ; i++ ) totals[ context ][ i ] = 1; }
Slide 130: }
This code not only creates the 256 context arrays, it also initializes each symbol’s count to 1. At this point, we can begin encoding symbols as they come in. The loop for encoding the symbols looks similar to the one used for other adaptive programs. Here is an order 1 arithmetic compression loop:
for ( ; ; ) { c = getc( input ); if (c == EOF ) c = END_OF_STREAM; convert_int_to_symbol( c, context, &s ); encode_symbol( output, &s ); if ( c == END_OF_STREAM ) break; update_model( c, context ); context = c; }
This works fairly simply. Instead of just having a single context table, like the code in chapter 5, we now have a set of 256 context tables. Every symbol is encoded using the context table from the previously seen symbol, and only the statistics for the selected context get updated after the symbol is seen. This means we can now more accurately predict the probability of a character’s appearance. The decoding process for this order 1 code is also very simple, and it looks similar to the decoding example from chapter 5. Here is the order 1 expansion loop:
for ( ; ; ) { get_symbol_scale( context, &s ); count = get_current_count( &s ); c = convert_symbol_to_int( count, context, &s ); remove_symbol_from_stream( input, &s ); if (c == END_OF_STREAM ) break; putc( (char) c, output ); update_model( c, context ); context = c; }
The only difference between this and conventional order-0 code is the addition of the context variable, both within the loop and as a parameter to other functions. The remaining routines that differ from the code in Chapter 5 are are shown next. The C source for this module is included on the program disk.
void update_model( int symbol, int context ) int i; for ( i = symbol + 1 ; i <= ( END_OF_STREAM + 1 ) ; i++ ) totals[ context ][ i ]++; if ( totals[ context ][ END_OF_STREAM + 1 ] < MAXIMUM_SCALE ) return;
Slide 131: }
for ( i = 1 ; i <= ( END_OF_STREAM + 1 ) ; i++ ) { totals[ context ][ i ] /= 2; if ( totals[ context ][ i ] <= totals[ context ][ i - 1 ] ) totals[ context ][ i ] = totals[ context ][ i - 1 ] + 1; }
void convert_int_to_symbol( int c, int context, SYMBOL *s ) { s->scale = totals[ context ][ END_OF_STREAM + ]; s->low_count = totals[ context ][ c ]; s->high_count = totals[ context ][ c + 1 ]; } void get_symbol_scale( int context, SYMBOL *s ) { s->scale = totals[ context][ END_OF_STREAM + 1 ]; } int convert_symbol_to_int( int count, int context, SYMBOL *s) { int c; for ( c = 0; count >= totals[ context ][ c + 1 ] ; c++ ) ; s->high_count = totals[ context ][ c + 1 ]; s->low_count = totals[ context ][ c ]; return( c );
}
Using the Escape Code as a Fallback
The simple order-1 program does in fact do a creditable job of compression, but it has a couple of problems to address. First, the model for this program makes it a slow starter. Every context starts off with 257 symbols initialized to a single count, meaning every symbol starts off being encoded in roughly eight bits. As new symbols are added to the table, they will gradually begin to be encoded in fewer bits. This process, however, will not happen very quickly. For the context table for the letter q, for example, we will probably see a a very high number of u symbols. The very first u will have a probability of 1/257, and will accordingly be encoded in eight bits. The second u will have a probability of 2/258, but will still require over seven bits to encode. In fact, it will take sixteen consecutive u symbols with no other appearances before the entropy of the symbol is reduced to even four bits. The reason for this slow reduction in bit count is obvious. The probability of the u symbol is being weighted down by the other 256 symbols in the table. Though they may never appear in the message, they need a nonzero count. If their count were reduced to zero, we would not be able to encode them if and when they appeared in the message.
Slide 132: There is a solution to this problem, however, and it is relatively painless. Instead of having every symbol appear automatically in every table, start off with a nearly empty table and add symbols to the table only as they appear. The q table would have zero counts for all the other symbols, giving the first u that appears a low bit count. But there is a catch here. If a symbol doesn’t appear in a context table, how will it be encoded when it appears in a message? The easiest way to accomplish this is to use an escape code. The escape code is a special symbol (much like the end-of-stream symbol) that indicates we need to “escape” from the current context. When a context issues an escape symbol, we generally fall back to a lower-order context. In our next sample program, we escape to the escape context, a context that never gets updated. It contains 258 symbols, each of which has a count of 1. This guarantees that any symbol encountered in the message can be encoded by outputting an escape code from the current context and by encoding the symbol using the escape context. How does this affect the example used for the letter u? As it turns out, it makes an enormous difference. The first u symbol that took eight bits in the previous example will take about eight bits here as well. The escape code takes no bits to encode, and in the escape context the u has a 1/257 probability. After that, however, the u is added to the table and given a count of 1. The next appearance of u will require only one bit to encode, since it has a probability of 1/2. By the time 16 u’s have appeared, and while the previous model is still taking four bits to encode it, the escape-driven model will take .06 bits! The escape code frees us from burdening our models with characters that may never appear. This lets the model adjust rapidly to changing probabilities and quickly reduces the number of bits needed to encode high- probability symbols. The encoding process for this particular implementation of a multi-order model requires only a few modifications to the previous program. The convert_int_to_symbol() routine now has to check whether a symbol is present in the given context. If not, the escape code is encoded instead, and the function returns the appropriate result to the main encoding loop, as shown:
context = 0; initialize_model(); initialize_arithmetic_encoder(); for ( ; ; ) { c = getc( input ); if ( c == EOF ) c = END_OF_STREAM; escaped = convert_int_to_symbol( c, context, &s ); encode_symbol( output, &s ); if ( escaped ) { convert_int_to_symbol( c, ESCAPE, &s ); encode_symbol( output, &s ); } if ( c == END_OF_STREAM ) break;
Slide 133: }
update_model( c, context ); context = c;
In the main compression loop shown, the compressor first tries to send the original symbol. If the convert_int_to_symbol() routine returns a true, the symbol did not appear in the current context, and the routine resends the symbol using the escape context. We update just the current context model with the symbol just sent, not the escape model. The decompression loop for this program follows a similar pattern. The code shown next makes one or two possible passes through the loop, depending on whether an escape code is detected. The program for this order-1 context-switching program is on the program diskette that accompanies this book.
context = 0; initialize_model(); initialize_arithmetic_decoder( input ); for ( ; ; ) { last_context = context; do { get_symbol_scale( context, &s ); count = get_current_count( &s ); c = convert_symbol_to_int( count, context, &s ); remove_symbol_from_stream( input, &s ); context = c; } while ( c == ESCAPE ); if ( c == END_OF_STREAM ) break; putc( (char) c, output ); update_model( c, last_context ); context = c; }
Improvements
Some problems with the method of encoding in ARITH-1.C are the high-cost operations associated with the model. Each time we update the counts for symbol c, every count in totals[context][] from c up to 256 has to be incremented. An average of 128 increment operations have to be performed for every character encoded or decoded. For a simple demonstration program like the one shown here, this may not be a major problem, but a production program should be modified to be more efficient. One way to reduce the number of increment operations is to move the counts for the most frequently accessed symbols to the top of the array. This makes the model keep track of each symbol’s position in the totals[context] array, but it reduces the number of increment operations by an order of magnitude. This is a relatively simple enhancement to make to this program. A very good example of a program that uses this technique has been published as part of the paper by Ian H. Witten, Neal Radford, and John Cleary, “Arithmetic Coding for Data Compression,” Communications of the ACM (June 1987).
Slide 134: This paper is an excellent source of information regarding arithmetic coding, with some sample C source code illustrating the text.
Highest-Order Modeling
The previous sample program used order-1 statistics to compress data. It seems logical that if we move to higher orders, we should achieve better compression. The importance of the escape code becomes even more clear here. When using an order-3 model, we potentially have 16 million context tables to work with (actually 256*256*256, or 16,777,216). We would have to read in an incredible amount of text before those 16 million tables would have enough statistics to start compressing data, and many of those 16 million tables will never be used—which means they take up space in our computer’s memory for no good reason. When compressing English text, for example, it does no good to allocate space for the table QQW. It will never appear. The solution to this is, again, to set the initial probabilities of all of the symbols to 0 for a given context and to fall back to a different context when a previously unseen symbol occurs. So the obvious question is: what do we use as the fallback context after emitting an escape code? In the last example, we fell back to a default context called the escape context. The escape context was never updated, which meant that using it generally would not provide any compression. In the higher-order models, there is a better way to compress than just automatically falling back to a default context. If an existing context can’t encode a symbol, fall back to the next smaller-order context. If our existing context was REQ, for example, and U needs to be encoded for the first time, an escape code will be generated. Following that, we drop back to an order-2 model to try to encode the character U using the context EQ. This continues down through the order-0 context. If the escape code is still generated at order-0, we fall back to a special order(-1) context that is never updated and is set up at initialization to have a count of 1 for every possible symbol—so it is guaranteed to encode every symbol. Using this escape-code technique means only a slight modification to the driver program. The program (see the code found in ARITH-N.C) now sits in a loop trying to encode its characters, as shown here:
do { escaped = convert_int_to_symbol( c, &s ); encode_symbol( compressed_file, &s ); } while ( escaped );
The modeling code keeps track of what the current order is, decrementing it whenever an escape is emitted. Even more complicated is the modeling module’s job of keeping track of which context table needs to be used for the current order.
Slide 135: Updating the Model
ARITH1E.C does a good job of compressing data. But quite a few improvements can still be made to this simple statistical method without changing the fundamental nature of its algorithm. The rest of this chapter is devoted to discussing those improvements, along with a look at a sample compression module, ARITH-N.C, that implements most of them. Using the highest-order modeling algorithm requires that instead of keeping just one set of context tables for the highest order, we keep a full set of context tables for every order up to the highest order. If we are doing order-2 modeling, for example, there will be a single order-0 table, 256 order-1 tables, and 65,536 order-2 tables. When a new character is encoded or decoded, the modeling code updates one of these tables for each order. In the example of U following REQ, the modeling code would update the U counters in the order-3 REQ table, the order-2 EQ table, the order-1 Q table, and the order-0 table. The code to update all of these tables is shown next:
for ( order = 0 ; order <= max_order ; order++ ) update_model( order, symbol );
A slight modification to this algorithm results in both faster updates and better compression. Instead of updating all the different order models for the current context, we can instead update only those models actually used to encode the symbol. This is called “update exclusion,” since it excludes unused lower-order models from being updated. It will generally give a small but noticeable improvement in the compression ratio. Update exclusion works since symbols showing up frequently in the higher-order models won’t be seen as often in the lower-order models, which means we shouldn’t increment the counters in the lower-order models. The modified code for update exclusion will look like this:
for ( order = encoding_order ; order <= max_order ; order++ ) update_model( order, symbol );
Escape Probabilities
When the program first starts encoding a text stream, it will emit quite a few escape codes. The number of bits used to encode escape characters will probably have a large effect on the compression ratio, particularly in small files. In our first attempts to use escape codes, we set the escape count to 1 and left it there, regardless of the state of the rest of the context table. Bell, Cleary, and Witten call this “Method A.” Method B sets the count for the escape character at the number of symbols presently defined for the context table. If eleven different characters have been seen so far, for example, the escape symbol count will be set at eleven, regardless of what the counts are. Both methods seem to work fairly well. The code in our previous program can easily be modified to support either one. Probably the best thing about methods A and B is that they are not computationally intensive. Adding the escape symbol to the Method A table
Slide 136: can be done so that it takes almost no more work to update the table with the symbol than without it. The next sample program, ARITH-N.C, implements a slightly more complicated escapecount calculation algorithm. It tries to take into account three different factors when calculating the escape probability. First, as the number of symbol defined in the context table increases, the escape probability naturally decreases. This reaches its minimum when the table has all 256 symbols defined, making the escape probability 0. Second, it takes into account a measure of randomness in the table. It calculates this by dividing the maximum count in the table by the average count. The higher the ratio, the less random the table. The REQ table, for example, may have only three symbols defined: U, with a count of 50; u, with a count of 10; and e, with a count of 3. The ratio of U’s count, 50, to the average, 21, is fairly high. The U is thus predicted with a relatively high amount of accuracy, and the escape probability ought to be lower. In a table where the high count was 10 and the average was 8, things would seem a little more random, and the escape probability should be higher. The third factor taken into account is simply the raw count of how many symbols have been seen for the particular table. As the number of symbols increases, the predictability of the table should go up, making the escape probability go down. The formula I use for calculating the number of counts for the escape symbol is below.
count = (256 - number of symbols seen)*number of symbols seen count = count /(256 * the highest symbol count) if count is less than 1 count = 1
The missing variable in this equation is the raw symbol count. This is implicit in the calculation, because the escape probability is the escape count divided by the raw count. The raw count will automatically scale the escape count to a probability.
Scoreboarding
When using highest-order modeling techniques, an additional enhancement, scoreboarding, can improve compression efficiency. When we first try to compress a symbol, we can generate either the code for the symbol or an escape code. If we generate an escape code, the symbol had not previously occurred in that context, so we had a count of 0. But we do gain some information about the symbol just by generating an escape. We can now generate a list of symbols that did not match the symbol to be encoded. These symbols can temporarily have their counts set to 0 when we calculate the probabilities for lower-order models. The counts will be reset back to their permanent values after the encoding for the particular character is complete. This process is called scoreboarding.
Slide 137: An example of this is shown below. If the present context is HAC and the next symbol is K, we will use the tables shown next to encode the K. Without scoreboarding, the HAC context generates an escape with a probability of 1/6. The AC context generates an escape with a probability of 1/8. The C context generates an escape with a probability of 1/40, and the “” context finally generates a K with a probability of 1/73.
“” ESC 1 ‘K’ 1 ‘E 40 ‘I’ 22 ‘A 9
“C” ESC 1 ‘H’ 20 ‘T’ 11 ‘L’ 5 ‘A’ 3
“AC” ESC 1 ‘C’ 5 ‘H’ 2
“HAC” ESC 1 ‘E’ 3 ‘L’ 1 ‘C’ 1
If we use scoreboarding to exclude counts of previously seen characters, we can make a significant improvement in these probabilities. The first encoding of an escape from HAC isn’t affected, since no characters were seen before. But the AC escape code eliminates the C from its calculations, resulting in a probability of 1/3. The C escape code excludes the H and the A counts, increasing the probability from 1/40 to 1/17. And finally, the “” context excludes the E and A counts, reducing that probability from 1/73 to 1/24. This reduces the number of bits required to encode the symbol from 14.9 to 12.9, a significant savings. Keeping a symbol scoreboard will almost always result in some improvement in compression, and it will never make things worse. The major problem with scoreboarding is that the probability tables for all of the lower-order contexts have to be recalculated every time the table is accessed. This results in a big increase in the CPU time required to encode text. Scoreboarding is left in ARITH-N.C. to demonstrate the gains possible when compressing text using it.
Data Structures
All improvements to the basic statistical modeling assume that higher-order modeling can actually be accomplished on the target computer. The problem with increasing the order is one of memory. The cumulative totals table in the order-0 model in Chapter 5 occupied 516 bytes of memory. If we used the same data structures for an order-1 model, the memory used would shoot up to 133K, which is still probably acceptable. But going to order-2 will increase the RAM requirements for the statistics unit to thirty-four megabytes! Since we would like to try orders even higher than 2, we need to redesign the data structures that hold the counts. To save memory space, we have to redesign the context statistics tables. In Chapter 5, the table is about as simple as it can be, with each symbol being used as an index directly
Slide 138: into a pair of counts. In the order-1 model, the appropriate context tables would be found by indexing once into an array of context tables, then indexing again to the symbol in question, a procedure like that shown here:
low_count = totals[ last_char ][ current_char ]; high_count = totals[ last_char ][ current_char + 1 ]; range = totals[ last_char ][ 256 ];
This is convenient, but enormously wasteful. Full context space is allocated even for unused tables, and within the tables space is allocated for all symbols, seen or not. Both factors waste enormous amounts of memory in higher-order models. The solution to the first problem, reserving space for unused contexts, is to organize the context tables as a tree. Place the order-0 context table at a known location and use it to contain pointers to order-1 context tables. The order-1 context tables will hold their own statistics and pointers to order-2 context tables. This continues until reaching the “leaves” of the context tree, which contain order_n tables but no pointers to higher orders. Using a tree structure can keep the unused pointer nodes set to null pointers until a context is seen. Once the context is seen, a table is created and added to the parent node of the tree. The second problem is creating a table of 256 counts every time a new context is created. In reality, the highest-order contexts will frequently have only a few symbols, so we can save a lot of space by only allocating space for symbols seen for a particular context. After implementing these changes, we have a set of data structures that look like this:
typedef struct { unsigned char symbol; unsigned char counts; } STATS; typedef struct { struct context *next; } LINKS; typedef struct context { int max_index; STATS *stats; LINKS *links; struct context *lesser_context; } CONTEXT;
The new context structure has four major elements. The first is the counter, max_index, which tells how many symbols are presently defined for this particular context table. When a table is first created, it has no defined symbols, and max_index is -1. A completely defined table will have a max_index of 255. The max_index variable tells how many elements are allocated for the arrays pointed to by stats and links. Stats is an array of structures, each containing a symbol and a count for that symbol. If the context table is not one of the highest-order tables, it will also have a links array. Each symbol
Slide 139: defined in the stats array will have a pointer to the next higher-order context table in the links table. A sample of the context table tree is shown in figure 6.1. The table shown is one that will have just been created after the input text “ABCABDABE” when keeping maximum order-3 statistics. Just nine input symbols have already generated a fairly complicated data structure, but it is orders of magnitude smaller than one consisting of arrays of arrays.
Figure 6.1 A context table tree: “ABCABDABE.” One element in this structure that hasn’t been explained is the lesser_context pointer. This pointer is needed when using higher-order models. If the modeling code is trying to locate an order-3 context table, it first has to scan through the order-0 symbol list looking for the first symbol, the match, the order-1 symbol list, and so on. If the symbol lists in the lower orders are relatively full, this can be a lengthy process. Even worse, every time an escape is generated, the process has to be repeated when looking up the lower-order context. These searches can consume an inordinate amount of CPU time.
Slide 140: The solution to this is to maintain a pointer for each table that points to the table for the context one order less. The context table ABC should have its back pointer point to BC, for example, which should have a back pointer to C, which should have a pointer to “”, the null table. Then the modeling code only needs to keep a pointer to the current highest order context. Given that, finding the order-1 context table is simply a matter of performing (n-1) pointer operations. With the table shown in Figure 6.1, for example, suppose the next incoming text symbol is X and the current context is ABE. Without the benefit of the lesser context pointers, I have to check the order-3, 2, 1, and 0 tables for X. This takes 15 symbol comparisons and three table lookups. Using reverse pointers eliminates all the symbols comparisons and performs just three table lookups. To update the figure 6.1 context tree to contain an X after ABE, the modeling code has to perform a single set of lookups for each order/context. This code is shown in ARITH-N.C in the add_character_to_model() routine. Every time a new table is created, it needs to have its back pointer created properly at the same time, which requires a certain amount of care in the design of update procedures.
The Finishing Touches: Tables –1 and –2
The final touch to the context tree in ARITH-N.C is the addition of two special tables. The order(-1) table has been discussed previously. This is a table with a fixed probability for every symbol. If a symbol is not found in any of the higher-order models, it will show up in the order(-1) model. This is the table of last resort. Since it guarantees that it will always provide a code for every symbol in the alphabet, we don’t update this table, which means it uses a fixed probability for every symbol. ARITH-N.C also has a special order(-2) table used for passing control information from the encoder to the decoder. The encoder can pass a-1 to the decoder to indicate end-offile. Since normal symbols are always defined as unsigned values ranging from 0 to 255, the modeling code recognizes a negative number as a special symbol that will generate escapes all the way back to the order(-2) table. The modeling code can also determine that since -1 is a negative number, the symbol should just be ignored when the update_model() code is called.
Model Flushing
The creation of the order(-2) model allows us to pass a second control code from the encoder to the expander—the flush code, which tells the decoder to flush statistics out of the model. The compressor does this when the performance of the model starts to slip. The ratio is adjustable and is set in this implementation to 10 percent. When compression falls belows this ratio, the model is “flushed” by dividing all counts by two. This gives more weight to newer statistics, which should improve the compression.
Slide 141: In reality the model should probably be flushed whenever the input symbol stream drastically changes character. If the program is compressing an executable file, for example, the statistics accumulated during the compression of the executable code are probably of no value when compressing the program’s data. Unfortunately, it isn’t easy to define an algorithm that detects a “change in the nature” of the input.
Implementation
Even with the Byzantine data structures used here, the compression and expansion programs built around ARITH-N.C have prodigious memory requirements. When running on DOS machines limited to 640K, these programs have to be limited to order-1, or perhaps order-2 for text that has a higher redundancy ratio. To examine compression ratios for higher orders on binary files, there are a couple of choices for these programs. First, they can be built using a DOS Extender, such as Rational Systems/16M. Or they can be built on a machine that has either a larger address space or support for virtual memory, such as Windows 95, VMS, or UNIX. The code distributed here was written in an attempt to be portable across all these options. Testing shows that with an extra megabyte of extended memory and a DOS Extender, virtually any ASCII file can be compressed on a PC using order -3 compression. Some binary files require more memory. A UNIX system had no problem with order -3 compression and turned in the best performance overall in terms of speed.
Conclusions
Compression-ratio test show that statistical modeling can perform at least as well as dictionary-based methods. But these programs are at present somewhat impractical because of their high resource requirements. ARITH-N is fairly slow, compressing data with speeds in the range of 1K per second and needing huge amounts of memory to use higher-order modeling. As memory becomes cheaper and processors become more powerful, however, schemes such as the ones shown here may become practical. They could be applied today to circumstances in which either storage or transmission costs are extremely high. Order-0 adaptive modeling using arithmetic coding could be useful today in situations requiring extremely low consumption of memory. The compression ratios might not be as good as those gained with sophisticated models, but memory consumption is minimized.
Enhancement
The performance of these algorithms could be improved significantly beyond the implementation discussed here. The first area for improvement would be in memory management. Right now, when the programs run out of memory, they abort. A more sensible approach would be to have the programs start with fixed amounts of memory available for statistics. When the statistics fill the space, the program should then stop
Slide 142: updating the tables and just use what it had. This would mean implementing internal memory-management routines instead of using the C run-time library routines. Another potential improvement could come in the tree structure for the context tables. Locating tables through the use of hashing could be quite a bit faster and might require less memory. The context tables themselves could also be improved. When a table has over 50 percent of the potential symbols defined for it, an alternate data structure could be used with the counts stored in a linear array. This would allow for faster indexing and would reduce memory requirements. Finally, it might be worth trying ways to adaptively modify the order of the model being used. When compressing using order-3 statistics, early portions of the input text generate a lot of escapes while the statistics tables fill up. It ought to be possible to start encoding using order-0 statistics while keeping order-3 statistics. As the table fills up, the order used for encoding could be incremented until it reaches the maximum.
ARITH-N Listing
/*********************** Start of ARITH-N.C ************************/ * * This is the order-n arithmetic coding module used in the final * part of chapter 6. * * Compile with BITIO.C. ERRHAND.C, and either MAIN-C.C OR MAIN-E.C. This * program should be compiled in large model. An even better alternative * is a DOS extender. * */ #include #include #include #include #include <stdio.h> <stdlib.h> <string.h> "errhand.h" "bitio.h"
/* * The SYMBOL structure is what is used to defined a symbol in * arithmetic coding terms. A symbol is defined as a range between * 0 and 1. Since we are using integer math, instead of using 0 and 1 * as our end points, we have an integer scale. The low_count and * high_count define where the symbol falls in the range. */ typedef struct { unsigned short int low_count; unsigned short int high_count; unsigned short int scale; } SYMBOL; #define #define #define #define MAXIMUM_SCALE 16383 ESCAPE 256 DONE (-1) FLUSH (-2) /* /* /* /* Maximum allowed frequency count The escape symbol The output stream empty symbol The symbol to flush the model */ */ */ */
Slide 143: /* * Function prototypes. */ #ifdef __STDC__ void initialize_options( int argc, char **argv ); int check_compression( FILE *input, BIT_FILE *output ); void initialize_model( void ); void update_model( int symbol ); int convert_int_to_symbol( int symbol, SYMBOL *s ); void get_symbol_scale( SYMBOL *s ); int convert_symbol_to_int( int count, SYMBOL *s ); void add_character_to_model( int c ); void flush_model( void ); void initialize_arithmetic_decoder( BIT_FILE *stream ) ; void remove_symbol_from_stream( BIT_FILE *stream, SYMBOL *s ); void initialize_arithmetic_encoder( void ); void encode_symbol( BIT_FILE *stream, SYMBOL *s ); void flush_arithmetic_encoder( BIT_FILE *stream ); short int get_current_count( SYMBOL *s ); #else void initialize_options(); int check_compression(); void initialize_model(); void update_model(); int convert_int_to_symbol(); void get_symbol_scale(); int convert_symbol_to_int(); void add_character_to_model(); void flush_model(); void initialize_arithmetic_decoder(); void remove_symbol_from_stream(); void initialize_arithmetic_encoder(); void encode_symbol(); void flush_arithmetic_encoder(); short int get_current_count(); #endif char *CompressionName = "Adaptive order n model with arithmetic coding"; char *Usage = "in-file out-file [ -o order ]\n\n"; int max_order = 3; /* * The main procedure is similar to the main found in ARITH1E.C. It has * to initialize the coder and the model. It then sits in a loop reading * input symbols and encoding them. One difference is that every 256 * symbols a compression check is performed. If the compression ratio * falls below 10%, a flush character is encoded. This flushes the encod * ing model, and will cause the decoder to flush its model when the * file is being expanded. The second difference is that each symbol is
Slide 144: * repeatedly encoded until a successful encoding occurs. When trying to * encode a character in a particular order, the model may have to * transmit an ESCAPE character. If this is the case, the character has * to be retransmitted using a lower order. This process repeats until a * successful match is found of the symbol in a particular context. * Usually this means going down no further than the order -1 model. * However, the FLUSH and DONE symbols drop back to the order -2 model. * */ void CompressFile( input, output, argc, argv ) FILE *input; BIT_FILE *output; int argc; char *argv[]; { SYMBOL s; int c; int escaped; int flush = 0; long int text_count = 0; initialize_options( argc, argv ); initialize_model(); initialize-arithmetic_encoder(); for ( ; ; ) { if ( ( ++text_count & 0x0ff ) == 0 ) flush = check_compression( input, output ); if ( !flush ) c = getc( input ); else c = FLUSH; if ( c == EOF ) c = DONE; do { escaped = convert_int_to_symbol( c, &s); encode_symbol( output, &s ); } while ( escaped ); if ( c == DONE ) break; if ( c == FLUSH ) { flush_model(); flush = 0; } update_model( c ); add_character_to_model( c ); } flush_arithmetic_encoder( output ); } /* * The main loop for expansion is very similar to the expansion * routine used in the simpler compression program, ARITH1E.C. The * routine first has to initialize the the arithmetic coder and the * model. The decompression loop differs in a couple of respect. * First of all, it handles the special ESCAPE character, by
Slide 145: * removing them from the input bit stream but just throwing them * away otherwise. Secondly, it handles the special FLUSH character. * Once the main decoding loop is done, the cleanup code is called, * and the program exits. * */ void ExpandFile( input, output, argc, argv ) BIT_FILE *input; FILE *output; int argc; char *argv[]; { SYMBOL s; int c; int count; initialize_options( argc, argv initialize_model(); initialize_arithmetic_decoder( for ( ; ; ) { do { get_symbol_scale( &s ); count = get_current_count( c = convert_symbol_to_int( remove_symbol_from_stream( } while ( c == ESCAPE ); if ( c == DONE ) break; if ( c != FLUSH ) putc( (char) c, output ); else flush_model(); update_model( c ); add_character_to_model( c ); } ); input );
&s ); count, &s ); input, &s );
}
/* * This routine checks for command line options. At present, the only * option being passed on the command line is the order. */ void initialize_options( argc, argv ) int argc; char *argv[]; { while ( argc–– > 0 ) { if ( strcmp( *argv, "-o" ) == 0 ) { argc––; max_order = atoi( *++argv ); } else printf( "Uknown argument on command line: %s\n", *argv ); argc––; argv++; } }
Slide 146: /* * This routine is called once every 256 input symbols. Its job is to * check to see if the compression ratio falls below 10%. If the * output size is 90% of the input size, it means not much compression * is taking place, so we probably ought to flush the statistics in the * model to allow for more current statistics to have greater impact. * This heuristic approach does seem to have some effect. */ int check_compression( input, output ) FILE *input; BIT_FILE *output; { static long local_input_marker = 0L; static long local_output_marker = 0L; long total_input_bytes; long total_output_bytes; int local_ratio; total_input_bytes = ftell( input ) - local_input_marker; total_output_bytes = ftell( output->file ); total_output_bytes -= local_output_marker; if ( total_output_bytes == 0 ) total_output_bytes = 1; local_ratio = (int)( ( total_output_bytes * 100 ) / total_input_bytes ); local_input_marker = ftell( input ); local_output_marker = ftell( output->file ); return( local_ratio > 90 );
}
/* * * The next few routines contain all of the code and data used to * perform modeling for this program. This modeling unit keeps track * of all contexts from 0 up to max_order, which defaults to 3. In * addition, there is a special context -1 which is a fixed model used * to encode previously unseen characters, and a context -2 which is * used to encode EOF and FLUSH messages. * * Each context is stored in a special CONTEXT structure, which is * documented below. Context tables are not created until the context * is seen, and they are never destroyed. * */ /* * A context table contains a list of the counts for all symbols * that have been seen in the defined context. For example, a * context of "Zor" might have only had 2 different characters * appear. 't' might have appeared 10 times, and '1' might have * appeared once. These two counts are stored in the context * table. The counts are stored in the STATS structure. All of * the counts for a given context are stored in and array of STATS. * As new characters are added to a particular contexts, the STATS * array will grow. Sometimes the STATS array will shrink * after flushing the model.
Slide 147: */ typedef struct { unsigned char symbol; unsigned char counts; } STATS; /* * Each context has to have links to higher order contexts. These * links are used to navigate through the context tables. For example, * to find the context table for "ABC", I start at the order 0 table, * then find the pointer to the "A" context table by looking through * the LINKS array. At that table, we find the "B" link and go to * that table. The process continues until the destination table is * found. The table pointed to by the LINKS array corresponds to the * symbol found at the same offset in the STATS table. The reason that * LINKS is in a separate structure instead of being combined with * STATS is to save space. All of the leaf context nodes don't need * next pointers, since they are in the highest order context. In the * leaf nodes, the LINKS array is a NULL pointer. */ typedef struct { struct context *next; } LINKS; /* * The CONTEXT structure holds all of the known information about * a particular context. The links and stats pointers are discussed * immediately above here. The max_index element gives the maximum * index that can be applied to the stats or link array. When the * table is first created, and stats is set to NULL, max_index is set * to -1. As soon as single element is added to stats, max_index is * incremented to 0. * * The lesser context pointer is a navigational aid. It points to * the context that is one less than the current order. For example, * if the current context is "ABC", the lesser_context pointer will * point to "BC". The reason for maintaining this pointer is that * this particular bit of table searching is done frequently, but * the pointer only needs to be built once, when the context is * created. */ typedef struct context { int max_index; LINKS *links; STATS *stats; struct context *lesser_context; } CONTEXT; /* * *contexts[] is an array of current contexts. If I want to find * the order 0 context for the current state of the model. I just * look at contexts[0]. This array of context pointers is set up * every time the model is updated. */ CONTEXT **contexts; /* * current_order contains the current order of the model. It starts
Slide 148: * at max_order, and is decremented every time an ESCAPE is sent. It * will only go down to -1 for normal symbols, but can go to -2 for * EOF and FLUSH. */ int current_order; /* * This table contains the cumulative totals for the current context. * Because this program is using exclusion, totals has to be calculated * every time a context is used. The scoreboard array keeps track of * symbols that have appeared in higher order models, so that they * can be excluded from lower order context total calculations. */ short int totals[ 258 ]; char scoreboard[ 256 ]; /* * Local procedure declarations for modeling routines. */ #ifdef __STDC__ void update_table( CONTEXT *table, int symbol ); void rescale_table( CONTEXT *table ); void totalize_table( CONTEXT *table ); CONTEXT *shift_to_next_context( CONTEXT *table, int c, int order ); CONTEXT *allocate_next_order_table( CONTEXT *table, int symbol, CONTEXT *lesser_context ); void recursive_flush( CONTEXT *table ); #else void update_table(); void rescale_table(); void totalize_table(); CONTEXT *shift_to_next_context(); CONTEXT *allocate_next_order_table(); void recursive_flush(); #endif /* * This routine has to get everything set up properly so that * the model can be maintained properly. The first step is to create * the *contexts[] array used later to find current context tables. * The *contexts[] array indices go from -2 up to max_order, so * the table needs to be fiddled with a little. This routine then * has to create the special order -2 and order -1 tables by hand, * since they aren't quite like other tables. Then the current * context is set to \0, \0, \0, ... and the appropriate tables * are built to support that context. The current order is set * to max_order, the scoreboard is cleared, and the system is * ready to go. */ void initialize_model() { int i; CONTEXT *null_table; CONTEXT *control_table;
Slide 149: current_order = max_order; contexts = (CONTEXT **) calloc( sizeof( CONTEXT * ), 10 ); if ( contexts == NULL ) fatal_error( "Failure #1: allocating context table!" ); context += 2; null_table = (CONTEXT *) calloc( sizeof( CONTEXT ), 1 ); if ( null_table == NULL ) fatal_error( "Failure #2: allocating null table!" ); null_table->max_index = -1; contexts[ -1 ] = null_table; for ( i = 0 ; i <= max_order ; 1++ ) contexts[ i ] = allocate_next_order_table( contexts[ i-1 ], 0, contexts[ i-1 ] ); free( (char *) null_table->stats ); null_table->stats = (STATS &) calloc( sizeof( STATS ), 256 ); if ( null_table->stats == NULL ) fatal_error( "Failure #3: allocating null table!" ); null_table->max_index = 255; for ( i=0 ; i < 256 ; i++ ) { null_table->stats[ i ].symbol = (unsigned char) i; null_table->stats[ i ].counts = 1; } control_table = (CONTEXT *) calloc( sizeof(CONTEXT), 1 ); if ( control_table == NULL ) fatal_error( "Failure #4: allocating null table!" ); control_table->stats = (STATS *) calloc( sizeof( STATS ), 2 ); if ( control_table->stats == NULL ) fatal_error( "Failure #5: allocating null table!" ); contexts[ -2 ] = control_table; control_table->max_index = 1; control_table->stats[ 0 ].symbol = -FLUSH; control_table->stats[ 0 ].counts = 1; control_table->stats[ 1 ].symbol =– DONE; control_table->stats[ 1 ].counts = 1; for ( i = 0 ; i < 256 ; i++ ) scoreboard[ i ] = 0;
}
/* * This is a utility routine used to create new tables when a new * context is created. It gets a pointer to the current context, * and gets the symbol that needs to be added to it. It also needs * a pointer to the lesser context for the table that is to be * created. For example, if the current context was "ABC", and the * symbol 'D' was read in, add_character_to_model would need to * create the new context "BCD". This routine would get called * with a pointer to "BC", the symbol 'D', and a pointer to context * "CD". This routine then creates a new table for "BCD", adds it * to the link table for "BC", and gives "BCD" a back pointer to * "CD". Note that finding the lesser context is a difficult * task, and isn't done here. This routine mainly worries about
Slide 150: * modifying the stats and links fields in the current context. */ CONTEXT *allocate_next_order_table( table, symbol, lesser_context ) CONTEXT *table; int symbol; CONTEXT *lesser_context; { CONTEXT *new_table; int i; unsigned int new_size; for ( i = 0 ; i <= table->max_index ; i++ ) if (table->stats[ i ].symbol == (unsigned char) symbol ) break; if ( i > table->max_index ) { table->max_index++; new_size = sizeof( LINKS ); new_size *= table->max_index + 1; if ( table->links == NULL ) table->links = (LINKS *) calloc( new_size, 1 ); else table->links = (LINKS *) realloc( (char *) table->links, new_size ); new_size = sizeof( STATS ); new_size *= table->max_index + 1; if ( table->stats == NULL ) table->stats = (STATS *) calloc( new_size, 1 ); else table->stats = (STATS *) realloc( (char *) table->stats, new_size ); if ( table->links == NULL ) fatal_error( "Failure #6: allocating new table" ); if ( table->stats == NULL ) fatal_error( "Failure #7: allocating new table" ); table->stats[ i ].symbol = (unsigned char) symbol; table->stats[ i ].counts = 0; } new_table = (CONTEXT *) calloc(sizeof( CONTEXT ), 1 ); if ( new_table == NULL ) fatal_error( "Failure #8: allocating new table" ); new_table->max_index = -1; table->links[ i ].next = new_table; new_table->lesser_context = lesser_context; return( new_table );
}
/* * This routine is called to increment the counts for the current * contexts. It is called after a character has been encoded or * decoded. All it does is call update_table for each of the * current contexts, which does the work of incrementing the count. * This particular version of update_model() practices update exclusion, * which means that if lower order models weren't used to encode * or decode the character, they don't get their counts updated. * This seems to improve compression performance quite a bit. * To disable update exclusion, the loop would be changed to run