Initial run-length encoding@bzip2を読んでみた

Wikipediaのbzip2圧縮アルゴリズムの一部を読んだので意訳して紹介します。
参考にしたソースコードは bzip2-1.0.6 です。

Wikipedia
https://en.wikipedia.org/wiki/Bzip2#Initial_run-length_encoding

bzip2 ソースコード
https://github.com/junkawa/bzip2/tree/master/bzip2-1.0.6

概要

bzip2では、入力データを読み込んだ時(copy_input_until_stop()@bzlib.c → ADD_CHAR_TO_BLOCK() → add_pair_to_block()) にランレングス符号化を行う。

Wikipedia

Any sequence of 4 to 255 consecutive duplicate symbols is replaced by the first four symbols and a repeat length between 0 and 251.

4〜255回、同じシンボルが連続した場合、「最初の4シンボル+残りの繰り返し回数」に置き換わる。

Thus the sequence "AAAAAAABBBBCCCD" is replaced with "AAAA\3BBBB\0CCCD", where "\3" and "\0" represent byte values 3 and 0 respectively.

"AAAAAAABBBBCCCD"は、"AAAA\3BBBB\0CCCD"に置き換わる。
AAAAAAA (Aが7回)は、最初の4シンボル(AAAA) + 残りの繰り返し回数(3)となる。
BBBB (Bが4回)は、最初の4シンボル(BBBB) + 残りの繰り返し回数(0)となる。
C,Dは連続回数が4回に達しないのでそのままとなる。

ここでは分かりやすさのため、A,B,Cというシンボルを使っているが、実際のbzip2では1シンボルは1バイト(0〜255の値)。したがって、シンボル7が8回連続する場合、下記となる。 77777777 → 77774

Runs of symbols are always transformed after four consecutive symbols, even if the run-length is set to zero, to keep the transformation reversible.

4つの連続するシンボルの後は、常に変換される。
前述のBBBB0のように繰り返し回数が0になった場合でも。
これは、復号時の処理を楽にするため。
復号時は、同じシンボルが4回続いたら、その次の値(シンボル)を読み、その値分、シンボルを生成すればよい。
unRLE_obuf_to_output_FAST() @ decompress.c#608 でこの復号処理を行っている。

In the worst case, it can cause an expansion of 1.25 and best case a reduction to <0.02 . While the specification theoretically allows for runs of length 256–259 to be encoded, the reference encoder will not produce such output.

最悪の場合、1.25倍になり、最良の場合、0.02以下に短縮される。
最悪の場合とは、4回同じシンボルが連続する場合。4バイト(BBBB)が5バイト(BBBB0)になる。5/4 = 1.25。
最良の場合とは、255回同じシンボルが連続する場合。255バイトが5バイト(4バイト+長さ1バイト)になる。5/255 = 0.0196...。

仕様上、理論上は256〜259回連続するシンボルを符号化できるが、実際の実装ではそのような出力を生成しないだろう。
理論上というのは、4個のシンボル+回数、で回数は255までの数値を表せるので、最高259回の連続を表せられる。しかし、bzip2の実装ではシンボルが255回連続した場合にランレングス符号化を行っているので、4個のシンボル+回数、で回数の最大値は251となる。

The author of bzip2 has stated that the RLE step was a historical mistake and was only intended to protect the original BWT implementation from pathological cases.[6]

bzip2の作者は、「RLEのステップは歴史的な誤りだ、異常なケースからオリジナルのBWT実装を守るためだけに意図された」という立場。

The run-length encoder, which is the first of the compression transformations, is entirely irrelevant. The original purpose was to protect the sorting algorithm from the very worst case input: a string of repeated symbols. But algorithm steps Q6a and Q6b in the original Burrows-Wheeler technical report (SRC-124) show how repeats can be handled without difficulty in block sorting.

[6] http://www.bzip.org/1.0.3/html/misc.html

最初の圧縮変換であるRLEは全く無関係だ。元来の目的は最悪のケースであるシンボルの連続する入力からソートアルゴリズムを保護することだった。しかし、BWTのオリジナル論文 Q6a、Q6bはブロックソートが連続する入力を難なく扱えることを示している。

A block sorting lossless data compression algorithm p.10

As described, the algorithm performs poorly when S contains many repeated substrings.

Algorithm Q: Fast qucksorting on suffixes のQ6では、連続する入力に弱い。

We deal with this problem with two mechanisms.
The first mechanism handles strings of a single repeated character by replacing step Q6 with the following steps.

接尾辞配列のソート時に、接尾辞先頭の2シンボルが異なる接尾辞をQ6aでソートし、その後に先頭2シンボルが同じになる接尾辞をソートすることで連続するシンボルからなる接尾辞を高速にソートできる。
この処理は面白いので別途記事書きたいです。

Q6aは、https://github.com/junkawa/bzip2/blob/master/bzip2-1.0.6/blocksort.c#L878
Q6bは、https://github.com/junkawa/bzip2/blob/master/bzip2-1.0.6/blocksort.c#L911

Zephyr build (7)

この記事で関係するターゲット linker.cmd ターゲット [zephyr_base]/Makefile linker.cmd: $(zephyr-deps) $(Q)$(CC) -x assembler-with-cpp -nostdinc -undef -E -P \ $(LDFLAG_LINKERCMD) $(LD_TOOLCHAIN) -I$(srctree)/include \ -I$(objtree)/include/generated $(EXTRA_LINKER_CMD_OPT) $(KBUILD_LDS) -o $@ $(zephyr-deps) TODO $(Q) [zephyr_base]/Makefile ifeq ("$(origin V)", "command line") KBUILD_VERBOSE = $(V) endif ifndef KBUILD_VERBOSE KBUILD_VERBOSE = 0 endif ifeq ($(KBUILD_VERBOSE),1) quiet = Q = else quiet=quiet_ Q = @ endif ... export quiet Q KBUILD_VERBOSE make V=1 などとすると、環境変数 Q は"" となる(or 未定義?)。それ以外では、Q は @ となる。 $(Q)$(CC) が gcc となる場合、実行する gcc ... というコマンドが表示されるが、@gcc となる場合、コマンドは表示されない。 5.2 Recipe Echoing 参照。 $(CC) 前の記事参照。 $(LDFLAG_LINKERCMD) 未定義 $(LD_TOOLCHAIN) [zephyr_base]/Makefile LD_TOOLCHAIN ?= -D__GCC_LINKER_CMD__ LD_TOOLCHAIN が未定義の場合、-D__GCC_LINKER_CMD__ とする。 $(srctree) [zephyr_base]/Makefile ifeq ($(KBUILD_...

思考の足跡

このブログを検索