灯下 · SICP · 2.3.4 Example: Huffman Encoding Trees

内容 2.3 Symbolic Data · 151

底本文段

This section provides practice in the use of list structure and data
abstraction to manipulate sets and trees. The application is to methods for
representing data as sequences of ones and zeros (bits). For example, the
ASCII standard code used to represent text in computers encodes each character
as a sequence of seven bits. Using seven bits allows us to distinguish 2
7,
or 128, possible different characters. In general, if we want to distinguish
n different symbols, we will need to use log
2
本节将通过练习来巩固表结构与数据抽象在操作集合和树方面的运用。具体应用是将数据表示为由零和一（位）组成的序列的方法。例如，计算机中用于表示文本的 ASCII 标准编码将每个字符编码为七位的序列。使用七位可以区分 2
7，即 128 种不同的字符。一般而言，若要区分 n 种不同的符号，每个符号需要使用 log
2
⁡
n bits per
symbol. If all our messages are made up of the eight symbols A, B, C, D, E, F,
G, and H, we can choose a code with three bits per character, for example
⁡
n 位。如果我们的所有消息都由 A、B、C、D、E、F、G、H 这八个符号构成，可以选择每个字符用三位编码，例如
With this code, the message
使用这种编码方案，消息
is encoded as the string of 54 bits
被编码为 54 位的位串
Codes such as ASCII and the A-through-H code above are known as
像 ASCII 以及上述 A 到 H 这样的编码方案被称为
fixed-length codes, because they represent each symbol in the message
with the same number of bits. It is sometimes advantageous to use
定长编码（fixed-length codes），因为它们用相同的位数表示消息中的每个符号。有时使用
variable-length codes, in which different symbols may be represented
by different numbers of bits. For example, Morse code does not use the same
number of dots and dashes for each letter of the alphabet. In particular, E,
the most frequent letter, is represented by a single dot. In general, if our
messages are such that some symbols appear very frequently and some very
rarely, we can encode data more efficiently (i.e., using fewer bits per
message) if we assign shorter codes to the frequent symbols. Consider the
following alternative code for the letters A through H:
变长编码（variable-length codes）更为有利，不同的符号可以用不同的位数来表示。例如，摩尔斯电码对字母表中的每个字母并不使用相同数量的点和划；特别地，出现频率最高的字母 E 仅用一个点来表示。一般来说，如果消息中某些符号出现非常频繁而另一些极为罕见，那么为频繁出现的符号分配较短的编码，就能更高效地对数据进行编码（即每条消息所用的位数更少）。下面是 A 到 H 各字母的另一种可选编码方案：

底本文段 2

With this code, the same message as above is encoded as the string
使用这种编码方案，上述同一条消息被编码为位串
This string contains 42 bits, so it saves more than 20% in space in comparison
with the fixed-length code shown above.
该位串包含 42 位，与上面的定长编码相比节省了 20% 以上的空间。
One of the difficulties of using a variable-length code is knowing when you
have reached the end of a symbol in reading a sequence of zeros and ones.
Morse code solves this problem by using a special
separator code (in
this case, a pause) after the sequence of dots and dashes for each letter.
Another solution is to design the code in such a way that no complete code for
any symbol is the beginning (or
prefix) of the code for another
symbol. Such a code is called a
prefix code. In the example above,
A is encoded by 0 and B is encoded by 100, so no other symbol can have a code
that begins with 0 or with 100.
使用变长编码的困难之一在于：在读取一串零和一的序列时，如何判断一个符号已经结束。摩尔斯电码的解决方案是在每个字母的点划序列之后使用特殊的分隔码（即停顿）。另一种解决方案是这样设计编码：任何符号的完整编码都不是其他符号编码的开头（即前缀 (prefix)）。满足此条件的编码称为前缀码 (prefix code)。在上面的例子中，A 编码为 0，B 编码为 100，因此没有其他符号的编码可以以 0 或 100 开头。
In general, we can attain significant savings if we use variable-length prefix
codes that take advantage of the relative frequencies of the symbols in the
messages to be encoded. One particular scheme for doing this is called the
Huffman encoding method, after its discoverer, David Huffman. A Huffman code
can be represented as a binary tree whose leaves are the symbols that are
encoded. At each non-leaf node of the tree there is a set containing all the
symbols in the leaves that lie below the node. In addition, each symbol at a
leaf is assigned a weight (which is its relative frequency), and each non-leaf
node contains a weight that is the sum of all the weights of the leaves lying
below it. The weights are not used in the encoding or the decoding process.
We will see below how they are used to help construct the tree.
一般来说，若能利用待编码消息中各符号的相对频率，使用变长前缀码可以获得可观的节省。其中一种具体方案以其发现者 David Huffman 的名字命名，称为哈夫曼编码法 (Huffman encoding method)。哈夫曼编码可以表示为一棵二叉树，其叶节点即为被编码的符号。树中每个非叶节点都包含一个集合，其中存放该节点以下所有叶节点中的符号。此外，每个叶节点上的符号都被赋予一个权重（即其相对频率），每个非叶节点也含有一个权重，其值为其下所有叶节点权重之和。这些权重在编码和解码过程中并不使用，我们将在下文看到它们如何辅助构造这棵树。

底本文段 3

Figure 2.18 shows the Huffman tree for the A-through-H code given above.
The weights at the leaves indicate that the tree was designed for messages in
which A appears with relative frequency 8, B with relative frequency 3, and the
other letters each with relative frequency 1.
Figure 2.18: A Huffman encoding tree.
图 2.18 展示了上述 A 到 H 编码方案对应的哈夫曼树。叶节点上的权重表明，该树是针对 A 以相对频率 8、B 以相对频率 3、其余各字母各以相对频率 1 出现的消息而构造的。
图 2.18：一棵哈夫曼编码树。
Given a Huffman tree, we can find the encoding of any symbol by starting at the
root and moving down until we reach the leaf that holds the symbol. Each time
we move down a left branch we add a 0 to the code, and each time we move down a
right branch we add a 1. (We decide which branch to follow by testing to see
which branch either is the leaf node for the symbol or contains the symbol in
its set.) For example, starting from the root of the tree in Figure 2.18,
we arrive at the leaf for D by following a right branch, then a left
branch, then a right branch, then a right branch; hence, the code for D is
1011.
给定一棵哈夫曼树，可以从根节点出发向下移动，直到到达持有该符号的叶节点，从而求得任意符号的编码。每次向左子树移动时在编码中添加 0，每次向右子树移动时添加 1。（通过检验某一分支本身是否为该符号的叶节点、或其符号集合中是否包含该符号，来决定沿哪条分支前进。）例如，从图 2.18 中的树根出发，依次沿右、左、右、右四条分支移动，即可到达 D 的叶节点；因此 D 的编码为 1011。
To decode a bit sequence using a Huffman tree, we begin at the root and use the
successive zeros and ones of the bit sequence to determine whether to move down
the left or the right branch. Each time we come to a leaf, we have generated a
new symbol in the message, at which point we start over from the root of the
tree to find the next symbol. For example, suppose we are given the tree above
and the sequence 10001010. Starting at the root, we move down the right
branch, (since the first bit of the string is 1), then down the left branch
(since the second bit is 0), then down the left branch (since the third bit is
also 0). This brings us to the leaf for B, so the first symbol of the decoded
message is B. Now we start again at the root, and we make a left move because
the next bit in the string is 0. This brings us to the leaf for A. Then we
start again at the root with the rest of the string 1010, so we move right,
left, right, left and reach C. Thus, the entire message is BAC.
Subheading: Generating Huffman trees
用哈夫曼树解码一个位序列时，从根节点开始，根据位序列中连续的零和一来决定沿左子树还是右子树向下移动。每当到达一个叶节点，便确定了消息中的一个新符号，然后再次从树根出发寻找下一个符号。例如，假设给定上述哈夫曼树和序列 10001010。从根节点出发，先沿右子树移动（因为位串第一位是 1），再沿左子树移动（第二位为 0），再沿左子树移动（第三位也是 0），到达 B 的叶节点，故解码消息的第一个符号为 B。接着从根节点重新出发，因位串下一位为 0 而向左移动，到达 A 的叶节点。然后再从根节点出发，处理剩余位串 1010，依次右、左、右、左到达 C 的叶节点。因此，整条消息为 BAC。
小节标题：生成哈夫曼树

底本文段 4

Given an “alphabet” of symbols and their relative frequencies, how do we
construct the “best” code? (In other words, which tree will encode messages
with the fewest bits?) Huffman gave an algorithm for doing this and showed
that the resulting code is indeed the best variable-length code for messages
where the relative frequency of the symbols matches the frequencies with which
the code was constructed. We will not prove this optimality of Huffman codes
here, but we will show how Huffman trees are constructed.
给定一个符号“字母表”及其相对频率，如何构造“最优”编码方案？（换言之，哪棵树能以最少的位数对消息进行编码？）哈夫曼给出了一种算法，并证明了当消息中各符号的相对频率与构造编码时所用的频率吻合时，所得编码确实是最优的变长编码。我们在此不证明哈夫曼编码的最优性，但将说明哈夫曼树是如何构造的。
The algorithm for generating a Huffman tree is very simple. The idea is to
arrange the tree so that the symbols with the lowest frequency appear farthest
away from the root. Begin with the set of leaf nodes, containing symbols and
their frequencies, as determined by the initial data from which the code is to
be constructed. Now find two leaves with the lowest weights and merge them to
produce a node that has these two nodes as its left and right branches. The
weight of the new node is the sum of the two weights. Remove the two leaves
from the original set and replace them by this new node. Now continue this
process. At each step, merge two nodes with the smallest weights, removing them
from the set and replacing them with a node that has these two as its left and
right branches. The process stops when there is only one node left, which is
the root of the entire tree. Here is how the Huffman tree of Figure 2.18
was generated:
生成哈夫曼树的算法非常简单。其基本思想是：让频率最低的符号距根节点最远。从包含各符号及其频率的叶节点集合出发，这些数据由待构造编码的初始数据确定。找出权重最小的两个叶节点，将它们合并为一个新节点，以这两个节点分别作为新节点的左、右子树。新节点的权重为两个权重之和。将这两个叶节点从原集合中移除，代之以新节点。如此反复：每一步都找出权重最小的两个节点，将它们从集合中移除，并以一个以这两个节点为左、右子树的新节点取代它们。当集合中只剩一个节点时，过程结束，该节点即为整棵树的根节点。图 2.18 中的哈夫曼树就是按以下步骤生成的：
The algorithm does not always specify a unique tree, because there may not be
unique smallest-weight nodes at each step. Also, the choice of the order in
which the two nodes are merged (i.e., which will be the right branch and which
will be the left branch) is arbitrary.
Subheading: Representing Huffman trees
该算法并不总是给出唯一的树，因为在某些步骤中权重最小的节点可能不唯一。此外，合并两个节点的顺序（即哪个作为左子树、哪个作为右子树）也是任意的。
小节标题：哈夫曼树的表示

底本文段 5

In the exercises below we will work with a system that uses Huffman trees to
encode and decode messages and generates Huffman trees according to the
algorithm outlined above. We will begin by discussing how trees are
represented.
在下面的练习中，我们将使用一个利用哈夫曼树对消息进行编码和解码、并按上述算法生成哈夫曼树的系统。我们先来讨论树的表示方式。
Leaves of the tree are represented by a list consisting of the symbol
leaf, the symbol at the leaf, and the weight:
树的叶节点用一个表来表示，该表由符号 leaf、叶节点处的符号以及权重三部分组成：
A general tree will be a list of a left branch, a right branch, a set of
symbols, and a weight. The set of symbols will be simply a list of the
symbols, rather than some more sophisticated set representation. When we make
a tree by merging two nodes, we obtain the weight of the tree as the sum of the
weights of the nodes, and the set of symbols as the union of the sets of
symbols for the nodes. Since our symbol sets are represented as lists, we can
form the union by using the append procedure we defined in
2.2.1:
一般树由左分支、右分支、一个符号集合以及一个权重组成，以表来表示。符号的集合直接用符号的表来表示，而非采用某种更复杂的集合表示形式。当我们通过合并两个节点来构造一棵树时，新树的权重等于两个节点权重之和，新树的符号集合则等于两个节点符号集合的并集。由于我们的符号集合以表来表示，可以利用 2.2.1 节中定义的 append 过程来求并集：
If we make a tree in this way, we have the following selectors:
如果我们以这种方式构造一棵树，则可使用以下选择函数：

底本文段 6

The procedures symbols and weight must do something slightly
different depending on whether they are called with a leaf or a general tree.
These are simple examples of
generic procedures (procedures that can
handle more than one kind of data), which we will have much more to say about
in 2.4 and 2.5.
Subheading: The decoding procedure
过程 symbols 和 weight 的行为需根据它们所接收的参数是叶节点还是一般树而有所不同。这是通用操作 (generic procedures) 的简单示例——通用操作能够处理多种不同类型的数据——我们将在 2.4 节和 2.5 节中对此作进一步说明。
小节标题：解码过程
The following procedure implements the decoding algorithm. It takes as
arguments a list of zeros and ones, together with a Huffman tree.
以下过程实现了解码算法，它以一个由零和一组成的表以及一棵 Huffman 树作为参数：
The procedure decode-1 takes two arguments: the list of remaining bits
and the current position in the tree. It keeps moving “down” the tree,
choosing a left or a right branch according to whether the next bit in the list
is a zero or a one. (This is done with the procedure choose-branch.)
When it reaches a leaf, it returns the symbol at that leaf as the next symbol
in the message by consing it onto the result of decoding the rest of the
message, starting at the root of the tree. Note the error check in the final
clause of choose-branch, which complains if the procedure finds
something other than a zero or a one in the input data.
Subheading: Sets of weighted elements
过程 decode-1 接收两个参数：剩余比特位的表，以及当前在树中的位置。它沿树不断向“下”移动，根据表中下一个比特位是零还是一来选择左分支或右分支（这由过程 choose-branch 完成）。当它到达一片叶子时，便将该叶节点处的符号与对消息其余部分解码的结果用 cons 拼接，作为消息中的下一个符号返回，解码从树的根节点重新开始。注意 choose-branch 最后一个子句中的错误检查——当过程在输入数据中发现除零和一以外的内容时，它会报告错误。
小节标题：加权元素的集合
In our representation of trees, each non-leaf node contains a set of symbols,
which we have represented as a simple list. However, the tree-generating
algorithm discussed above requires that we also work with sets of leaves and
trees, successively merging the two smallest items. Since we will be required
to repeatedly find the smallest item in a set, it is convenient to use an
ordered representation for this kind of set.
在我们对树的表示中，每个非叶节点都包含一个符号集合，我们将其表示为简单的表。然而，上述生成树的算法还要求我们处理叶节点和树组成的集合，并依次合并其中权重最小的两项。由于需要反复查找集合中的最小元素，对这类集合采用有序的表示方式会更为方便。

底本文段 7

We will represent a set of leaves and trees as a list of elements, arranged in
increasing order of weight. The following adjoin-set procedure for
constructing sets is similar to the one described in Exercise 2.61;
however, items are compared by their weights, and the element being added to
the set is never already in it.
我们将叶节点和树的集合表示为一个按权重递增顺序排列的元素表。下面用于构造集合的 adjoin-set 过程与练习 2.61 中描述的类似，但这里按权重比较元素，且被加入集合的元素一定尚不存在于该集合中。
The following procedure takes a list of symbol-frequency pairs such as
((A 4) (B 2) (C 1) (D 1)) and constructs an initial ordered set of
leaves, ready to be merged according to the Huffman algorithm:
以下过程接受一个由符号-频率序对组成的表（例如 `((A 4) (B 2) (C 1) (D 1))`），并据此构造一个按 Huffman 算法排好序的初始叶节点集合，以备合并之用：

SICP source code scheme

Racket #lang sicp

A 000 C 010 E 100 G 110
B 001 D 011 F 101 H 111

SICP source code scheme

Racket #lang sicp

BACADAEAFABBAAAGAH

SICP source code scheme

Racket #lang sicp

001000010000011000100000101
000001001000000000110000111

SICP source code scheme

Racket #lang sicp

A 0 C 1010 E 1100 G 1110
B 100 D 1011 F 1101 H 1111

SICP source code scheme

Racket #lang sicp

100010100101101100011
010100100000111001111

SICP source code scheme

Racket #lang sicp

Initial {(A 8) (B 3) (C 1) (D 1)
leaves (E 1) (F 1) (G 1) (H 1)}

Merge {(A 8) (B 3) ({C D} 2)
 (E 1) (F 1) (G 1) (H 1)}

Merge {(A 8) (B 3) ({C D} 2)
 ({E F} 2) (G 1) (H 1)}

Merge {(A 8) (B 3) ({C D} 2)
 ({E F} 2) ({G H} 2)}

Merge {(A 8) (B 3) ({C D} 2)
 ({E F G H} 4)}

Merge {(A 8) ({B C D} 5)
 ({E F G H} 4)}

Merge {(A 8) ({B C D E F G H} 9)}

Final {({A B C D E F G H} 17)}
merge

SICP source code scheme

Racket #lang sicp

(define (make-leaf symbol weight)
 (list 'leaf symbol weight))
(define (leaf? object)
 (eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))

SICP source code scheme

Racket #lang sicp

(define (make-code-tree left right)
 (list left
 right
 (append (symbols left)
 (symbols right))
 (+ (weight left) (weight right))))

SICP source code scheme

Racket #lang sicp

(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))

(define (symbols tree)
 (if (leaf? tree)
 (list (symbol-leaf tree))
 (caddr tree)))

(define (weight tree)
 (if (leaf? tree)
 (weight-leaf tree)
 (cadddr tree)))

SICP source code scheme

Racket #lang sicp

(define (decode bits tree)
 (define (decode-1 bits current-branch)
 (if (null? bits)
 '()
 (let ((next-branch
 (choose-branch
 (car bits)
 current-branch)))
 (if (leaf? next-branch)
 (cons
 (symbol-leaf next-branch)
 (decode-1 (cdr bits) tree))
 (decode-1 (cdr bits)
 next-branch)))))
 (decode-1 bits tree))

(define (choose-branch bit branch)
 (cond ((= bit 0) (left-branch branch))
 ((= bit 1) (right-branch branch))
 (else (error "bad bit:
 CHOOSE-BRANCH" bit))))

SICP source code scheme

Racket #lang sicp

(define (adjoin-set x set)
 (cond ((null? set) (list x))
 (( (weight x) (weight (car set)))
 (cons x set))
 (else
 (cons (car set)
 (adjoin-set x (cdr set))))))

SICP source code scheme

Racket #lang sicp

(define (make-leaf-set pairs)
 (if (null? pairs)
 '()
 (let ((pair (car pairs)))
 (adjoin-set
 (make-leaf (car pair) ; symbol
 (cadr pair)) ; frequency
 (make-leaf-set (cdr pairs))))))

我的笔记自动保存