1. LZW基础概念
之前提到的算术编码、霍夫曼编码等技术集中在消除编码的冗余上,而本文要讲的LZW编码是一种针对空间冗余的无误差压缩方法。
LZW算法o又叫“串表压缩算法”,就是通过建立一个将字符串和其对应的记号构成的表(把已经出现过的字符串映射到记号上),用较短的代码来表示较长的字符串来实现压缩。
需要注意的是,LZW算法中字符串和记号的对应关系是在压缩的过程中动态生成的,并且隐含在压缩数据中,解压的时候也是一步一步还原编码并动态生成字典的过程。
2. LZW算法详解
2.1 LZW编码算法
编码器从字符串中不断地读入新的字符,并且将字符或字符串映射到新的记号,以动态地构建编码字典。在编码过程中,我们维护两个变量,即previous_char
(表示目前已有的,但未被编码的字符)和current_char
(表示刚刚读入的字符)。编码算法流程如下
- (1)初始状态,
previous_char
和current_char
都是空的。 - (2)读入新的字符current_char,并于previous合并成新的字符串
p_and_c
(previous_char+current_car) - (3)在编码字典里查找
p_and_c
,如果:p_and_c
在字典里,previous_char = p_and_cp_and_c
不在字典里,将previous_char
的标记输出;在字典中建立p_and_c
的映射;更新previous_char = current_char
。
- (4)返回步骤(2),重复直至读完字符串。
2.2 Example
ababcababac
初始状态字典里有三个默认的映射
String | Symbol |
---|---|
a | 0 |
b | 1 |
c | 2 |
开始编码
previous_char | current_char | p_and_c | p_and_c in dict | Action | output |
---|---|---|---|---|---|
"" | a | a | Y | previous_char = a | - |
a | b | ab | N | 添加映射{'ab', 3} previous_char = b | 0 |
b | a | ba | N | 添加映射{'ba', 4} | 1 |
a | b | ab | Y | previous_char = ab | - |
ab | c | abc | N | 添加映射{'abc', 5} previous_char = c | 3 |
c | a | ca | N | 添加映射{'ca', 6} previous_char = a | 2 |
a | b | ab | Y | previous_char = ab | - |
ab | a | aba | N | 添加映射{'aba', 7} previsou_char = a | 3 |
a | b | ab | Y | previous_char = ab | - |
ab | a | aba | Y | previous_char = aba | - |
aba | c | abac | N | 添加映射{'abac', 8} previous_char = c | 7 |
c | - | - | - | - | 2 |
编码结束,输出结果为0132372
,编码过程中生成的字典为:
String | Symbol |
---|---|
a | 0 |
b | 1 |
c | 2 |
ab | 3 |
ba | 4 |
abc | 5 |
ca | 6 |
aba | 7 |
abac | 8 |
2.2 LZW解码算法
解码器输入的是压缩后的数据。同时,我们将维护两个变量previous_symbol
(目前已有的但未被处理的记号),current_symbol
(当前读入的记号)。算法流程如下
- (1)初始状态,
previous_symbol
和current_symbol
都为空。 - (2)读入第一个记号,并赋给
current_symbol
,并解码直接输出。因为第一个字符为单个字符,一定可以直接解码。 - (3)赋值,
previous_symbol = current_symbol
-
(4)在字典里查找
current_symbol
,如果:current_symbol
在字典里:- a. 将
current_symbol
解码的结果输出 - b.
previous_char = dict.get(previous_symbol)
,current_char = dict.get(current_symbol)[0]
。注意,这里current_char
为current_symbol
解码的第一个字符。 -
c. 将
p_and_c
添加到新的映射 current_symbol
不在字典里:- a.
previous_char = dict.get(previous_symbol)
,current_char = dict.get(previous_symbol)[0]
。注意,这里current_char
为previous_symbol
解码的第一个字符。 - b. 将
p_and_c
添加到新的映射 - c. 输出
p_and_c
- 返回步骤(3),直至读完所有记号。
2.4 Example
0 1 3 2 3 7 2
初始状态字典里有三个默认的映射
Symbol | String |
---|---|
0 | a |
1 | b |
2 | c |
开始解码:
首先读取第一个记号,current_symbol = 0
,输出解码结果dict.get(current_symbol)
,即a
赋值previous_symbol = current_symbol
,然后循环读取之后的记号。
previous_symbol | current_symbol | current_symbol in dict | Action | output |
---|---|---|---|---|
0 | 1 | Y | previous_char = a current_char = b p_and_c = 'ab' 添加映射{3, 'ab'} | b |
1 | 3 | Y | previous_char = b current_char = a p_and_c = 'ba' 添加映射{4. 'ba'} | ab |
3 | 2 | Y | previous_char = ab current_char = c p_and_c = 'abc' 添加映射{5, 'abc'} | c |
2 | 3 | Y | previous_char = c current_char = a p_and_c = ca 添加映射{6. 'ca'} | ab |
3 | 7 | N | previous_char = ab current_char = a p_and_c = aba 添加映射{7, 'aba'} | aba |
7 | 2 | Y | previous_char = aba current_char = c P_and_c = abac 添加映射{8, 'abac'} | c |
解码结束,输出结果为ababcababac
。解码过程中生成的字典为
Symbol | String |
---|---|
0 | a |
1 | b |
2 | c |
3 | ab |
4 | ba |
5 | abc |
6 | ca |
7 | aba |
8 | abac |
2.5 代码实现(python)
def encode_init():
k = 0
char_num_dict = {}
for i in range(97,123):
char_num_dict[chr(i)] = k
k += 1
# print(char_num_dict.items())
return char_num_dict
def decode_init():
k = 0
num_char_dict = {}
for i in range(97,123):
num_char_dict[str(k)] = chr(i)
k += 1
# print(char_num_dict.items())
return num_char_dict
def LZW_encode(strList, char_num_dict):
symbol = 26 # 标记
previous_char = "" # 已有的,但未被编码的字符
current_char = "" # 当前读入的字符
p_and_c = ""
output = "" # 输出的编码序列
for s in strList:
current_char = s
p_and_c = previous_char + current_char
# print(previous_char, s)
if char_num_dict.get(p_and_c) != None:
previous_char = p_and_c
else:
output += str(char_num_dict.get(previous_char)) + '-'
char_num_dict[p_and_c] = symbol
symbol += 1
previous_char = current_char
output += str(char_num_dict[s[-1]])
print(char_num_dict.items())
return output
def LZW_decode(output, num_char_dict):
output_list = output.split('-')
print(output_list)
symbol = 26
previous_symbol = ""
current_symbol = ""
previous_char = ""
current_char = ""
decodeList = ""
decodeList += num_char_dict.get(output_list[0])
previous_symbol = output_list[0]
for o in output_list[1:]:
current_symbol = o
# print(previous_symbol, current_symbol)
if num_char_dict.get(current_symbol) != None:
decodeList += num_char_dict.get(current_symbol)
previous_char = num_char_dict.get(previous_symbol)
current_char = num_char_dict.get(current_symbol)[0]
num_char_dict[str(symbol)] = previous_char + current_char
symbol += 1
else:
previous_char = num_char_dict.get(previous_symbol)
current_char = num_char_dict.get(previous_symbol)[0]
num_char_dict[str(symbol)] = previous_char + current_char
symbol += 1
decodeList += previous_char + current_char
previous_symbol = current_symbol
# print(num_char_dict.items())
return decodeList
if __name__ == '__main__':
strList = input("请输入字符序列:").lower()
print('字符序列长度:', len(strList))
char_num_dict = encode_init()
output = LZW_encode(strList, char_num_dict)
print(output)
num_char_dict = decode_init()
# print(num_char_dict.items())
decodeList = LZW_decode(output, num_char_dict)
print(decodeList)
print("编码长度:", len(output) / 2)