
[Text processing] A base64 encoding script written in awk; please help me improve it!

bikkuri

#11

Posted 2013-12-25 17:33

Plain text seems to be handled without problems:

root@unknown:/tmp/test# echo Hello |./base64encode.sh
Source: Hello
byte1=72 byte2=101 byte3=108
base1=18 base2=6 base3=21 base4=44
Result=SGVs
Source: lo
byte1=108 byte2=111 byte3=0
base1=27 base2=6 base3=60 base4=0
Result=SGVsbG8=
SGVsbG8=root@unknown:/tmp/test#
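
For readers following the debug output above, the base1..base4 values can be reproduced by hand. The sketch below is not taken from base64encode.sh (its source is not shown on this page), and sextets is a hypothetical helper name. It packs three bytes into one 24-bit number and splits it into four 6-bit indexes using only division and modulo, since POSIX awk has no bitwise operators.

```shell
# sextets B1 B2 B3: print the four 6-bit base64 indexes for three byte values.
# Hypothetical helper, for illustration only.
sextets() {
    awk -v b1="$1" -v b2="$2" -v b3="$3" 'BEGIN {
        n = b1 * 65536 + b2 * 256 + b3          # 24-bit group
        printf "%d %d %d %d\n",
            int(n / 262144),                    # top 6 bits
            int(n / 4096) % 64,
            int(n / 64) % 64,
            n % 64                              # low 6 bits
    }'
}

sextets 72 101 108   # "Hel" -> 18 6 21 44
sextets 108 111 0    # "lo" plus one zero pad byte -> 27 6 60 0
```

The two calls match the base1..base4 lines in the debug output. In the second group the last index comes from padding only, which is why the encoded text ends in "=" rather than "A".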


bikkuri

#12

Posted 2013-12-25 18:27

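I see now: to handle binary bytes, the range of char values in the asc function should be 0-255.
That is a fairly obvious mistake. I will change it when I get back and try again.
Another question is why the output contains several times more bytes than it should.
Judging from the debug output above, the same content was printed three times when result was finally output.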


bikkuri

#13

Posted 2013-12-25 23:04

Last edited by bikkuri on 2013-12-25 23:15

After changing the range of char in the asc function to 0-255,

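# Return the byte value (0-255) of a single character.
# l_found and i are extra parameters, which awk treats as local variables;
# without i in this list the loop would overwrite a global i in the caller.
function asc(char, l_found, i)
{
    for (i = 0; i <= 255; i++) {
        if (sprintf("%c", i) == char) l_found = i;
    }
    return l_found;
}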


high bytes above 127 can now be processed, but the program still runs into trouble when it meets a 0 byte.

Source: ^_‹^H Length: 3
byte1=31 byte2=139 byte3=8
base1=7 base2=56 base3=44 base4=8
Result=H4sI
H4sISource: þ^M»R^B^CóHÍÉÉç^B Length: 13
byte1=254 byte2=13 byte3=187
base1=63 base2=32 base3=54 base4=59
Result=H4sI/g27
Source: R^B^CóHÍÉÉç^B Length: 10
byte1=82 byte2=2 byte3=3
base1=20 base2=32 base3=8 base4=3
Result=H4sI/g27UgID
Source: óHÍÉÉç^B Length: 7
byte1=243 byte2=72 byte3=205
base1=60 base2=52 base3=35 base4=13
Result=H4sI/g27UgID80jN
Source: ÉÉç^B Length: 4
byte1=201 byte2=201 byte3=231
base1=50 base2=28 base3=39 base4=39
Result=H4sI/g27UgID80jNycnn
Source: ^B Length: 1
byte1=2 byte2=0 byte3=0
base1=0 base2=32 base3=0 base4=0
Result=H4sI/g27UgID80jNycnnAg==
H4sI/g27UgID80jNycnnAg==Source: ^V5–1^F Length: 5
byte1=22 byte2=53 byte3=150
base1=5 base2=35 base3=22 base4=22
Result=H4sI/g27UgID80jNycnnAg==FjWW
Source: 1^F Length: 2
byte1=49 byte2=6 byte3=0
base1=12 base2=16 base3=24 base4=0
Result=H4sI/g27UgID80jNycnnAg==FjWWMQY=
H4sI/g27UgID80jNycnnAg==FjWWMQY=H4sI/g27UgID80jNycnnAg==FjWWMQY=H4sI/g27UgID80jNycnnAg==FjWWMQY=


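Comparing the bytes that should have been processed with the bytes that actually were processed shows that the program goes wrong when handling 0: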

[Origin] -> [Should be] | [Result]
1F 8B 08 -> 31 139 8 | 31 139 8
00 FE 0D -> 0 254 13 | 254 13 187
BB 52 02 -> 187 82 2 | 82 2 3
03 F3 48 -> 3 243 72 | 243 72 205
CD C9 C9 -> 205 201 201 | 201 201 231
E7 02 00 -> 231 2 0 | 2 0 0
16 35 96 -> 22 53 150 | 22 53 150
31 06 00 -> 49 6 0 | 49 6 0
00 00 -> 0 0 |
[26 bytes] [24 bytes]


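I suspect the sprintf call in this line of the asc function: if (sprintf("%c",i)==char) l_found=i; Under Busybox, sprintf("%c", 0) apparently cannot produce a proper character, so asc returns an empty value.

I tried initializing l_found to 0 inside asc, so that even if sprintf("%c",0) produces nothing when it meets 0, the function still returns the initial value 0.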

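# Same lookup, but l_found starts at 0 so that a character matching no
# code still yields 0. i is declared as a local variable.
function asc(char, l_found, i)
{
    l_found = 0;
    for (i = 0; i <= 255; i++) {
        if (sprintf("%c", i) == char) l_found = i;
    }
    return l_found;
}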


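But after this change I ran it again and the result was exactly the same. I do not know why.

The reason the final result value is shown 3 times is also still unclear.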
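
As an aside on the asc function above: it runs a 256-step loop for every input character. A common alternative is to build a lookup table once in BEGIN. This is only a sketch. The names ord and ord_demo are made up for illustration, and the table starts at 1 on the assumption that the awk in use cannot hold a NUL inside a string. LC_ALL=C is set so that awks with multibyte support treat each byte as one character.

```shell
# ord_demo: print the byte values of each input line, separated by spaces.
# Hypothetical helper; ord[] is a lookup table from character to byte value.
ord_demo() {
    LC_ALL=C awk '
    BEGIN {
        # Build the table once. NUL (0) is skipped on purpose: many awks
        # cannot store it in a string, so it could never be looked up anyway.
        for (i = 1; i <= 255; i++) ord[sprintf("%c", i)] = i
    }
    {
        out = ""
        n = length($0)
        for (k = 1; k <= n; k++)
            out = out (k > 1 ? " " : "") ord[substr($0, k, 1)]
        print out
    }'
}

printf 'Hel\n' | ord_demo   # prints: 72 101 108
```

The table removes the per-character loop but does nothing for the 0 byte itself, because the byte never reaches the lookup if awk has already dropped it from $0.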


bikkuri

#14

Posted 2013-12-25 23:39

Last edited by bikkuri on 2013-12-26 01:32

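So it seems the asc function is not the problem.
Looking at how the program handles the length of $0 shows that awk under Busybox considers the string finished as soon as it meets a 0 byte!
In other words, the 0 byte is being given a special meaning!
The source string 1F 8B 08 00 FE 0D BB 52 02 03 F3 48 CD C9 C9 E7 02 00 16 35 96 31 06 00 00 00 was treated by awk as 3 strings: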

1F 8B 08 - Length: 3
FE 0D BB 52 02 03 F3 48 CD C9 C9 E7 02 - Length: 13
16 35 96 31 06 - Length: 5


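Is there any way to stop awk from treating 0 as the end of a string? Put differently, how can I stop awk from interpreting a 0 in the middle of a string specially? Is there a way to turn off awk's separator setting?

The same content being printed 3 times when the program outputs result at the end is very likely related: awk regards the input as 3 strings, and when it prints the encoded result for each of them it prints the same result 3 times.
So the key problem now is to make awk understand that this is one string, not 3.
Do any awk experts have good suggestions?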


bikkuri

#15

Posted 2013-12-26 00:32

Last edited by bikkuri on 2013-12-26 00:38

It really does look like a separator problem.

root@unknown:/tmp/test# echo -e "Hello \n World!"|./base64encode.sh
Source: Hello Length: 6
byte1=72 byte2=101 byte3=108
base1=18 base2=6 base3=21 base4=44
Result=SGVs
Source: lo Length: 3
byte1=108 byte2=111 byte3=32
base1=27 base2=6 base3=60 base4=32
Result=SGVsbG8g
SGVsbG8gSource: World! Length: 7
byte1=32 byte2=87 byte3=111
base1=8 base2=5 base3=29 base4=47
Result=SGVsbG8gIFdv
Source: rld! Length: 4
byte1=114 byte2=108 byte3=100
base1=28 base2=38 base3=49 base4=36
Result=SGVsbG8gIFdvcmxk
Source: ! Length: 1
byte1=33 byte2=0 byte3=0
base1=8 base2=16 base3=0 base4=0
Result=SGVsbG8gIFdvcmxkIQ==
SGVsbG8gIFdvcmxkIQ==root@unknown:/tmp/test#


Could some awk expert tell me how to turn off awk's separators? Many thanks!

root@unknown:/tmp/test# echo -e "Hello \0 World!"|./base64encode.sh
Source: Hello Length: 6
byte1=72 byte2=101 byte3=108
base1=18 base2=6 base3=21 base4=44
Result=SGVs
Source: lo Length: 3
byte1=108 byte2=111 byte3=32
base1=27 base2=6 base3=60 base4=32
Result=SGVsbG8g
SGVsbG8gSource: World! Length: 7
byte1=32 byte2=87 byte3=111
base1=8 base2=5 base3=29 base4=47
Result=SGVsbG8gIFdv
Source: rld! Length: 4
byte1=114 byte2=108 byte3=100
base1=28 base2=38 base3=49 base4=36
Result=SGVsbG8gIFdvcmxk
Source: ! Length: 1
byte1=33 byte2=0 byte3=0
base1=8 base2=16 base3=0 base4=0
Result=SGVsbG8gIFdvcmxkIQ==
SGVsbG8gIFdvcmxkIQ==root@unknown:/tmp/test#


jason680

#16

Posted 2013-12-26 10:26

Last edited by jason680 on 2013-12-26 10:33

Reply to #15 bikkuri

Change RS and FS in awk; otherwise you need a newer version of awk, or you need to rebuild your busybox.
$ echo -e "Hello \0 World"  | awk 'BEGIN{RS="\x00"}{print $0}'
Hello
World

$ echo -e "Hello \0 World"  | awk 'BEGIN{RS="\x55\xaa\x66\xbb"}{print $0}'
Hello  World

-------------------------------
I don't have a busybox system like yours, so I tried it on Solaris, which also has an old version of awk.

On Solaris, awk shows the issue:
$ echo -e "Hello \0 World"  | /usr/xpg4/bin/awk '{print $0}' | od -t x1
0000000 48 65 6c 6c 6f 20 0a
0000007

$ echo -e "Hello \0 World"  |  od -t x1
0000000 48 65 6c 6c 6f 20 00 20 57 6f 72 6c 64 0a
0000016

On Linux, there is no issue:
$ echo -e "Hello \0 World"  | awk '{print $0}' | od -t x1
0000000 48 65 6c 6c 6f 20 00 20 57 6f 72 6c 64 0a
0000016
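
A quick way to see which of these behaviours a given awk has is to feed it one line with an embedded NUL and print the record count and the total record length. The name nul_probe is hypothetical and the reading of the output in the comments is simple arithmetic on a 5-byte line, so treat it as a rough diagnostic rather than a definitive test.

```shell
# nul_probe: report "<records> <total length>" for the input "ab\0cd\n".
# Hypothetical helper. Reading the result:
#   1 5  the NUL stayed inside a single record
#   1 2  the record was cut off at the NUL
#   2 4  the NUL acted as a record separator
nul_probe() {
    printf 'ab\000cd\n' |
        LC_ALL=C awk '{ total += length($0) } END { print NR, total }'
}

nul_probe
```

No expected output is given here because the result depends on the awk implementation being probed.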


Herowinter

#17

Posted 2013-12-26 10:44

Reply to #15 bikkuri
This problem does seem to exist.
CentOS 6.4

echo -e "Hello\0World"|awk '1'
HelloWorld


Solaris 5.10

echo -e "Hello\0World"|/usr/xpg4/bin/awk '1'
Hello


Is this a difference between gawk and nawk?


Herowinter

#18

Posted 2013-12-26 12:00

Reply to #14 bikkuri

I searched all morning and could not find a way to ignore \0 in the input.
Take a look at the passage below: every awk other than gawk
treats \0 as the end of a string.

Advanced Notes: RS = "\0" Is Not Portable

There are times when you might want to treat an entire data file as a single record. The only way to make this happen is to give RS a value that you know doesn't occur in the input file. This is hard to do in a general way, such that a program always works for arbitrary input files.

You might think that for text files, the NUL character, which consists of a character with all bits equal to zero, is a good value to use for RS in this case:

BEGIN { RS = "\0" } # whole file becomes one record?

gawk in fact accepts this, and uses the NUL character for the record separator. However, this usage is not portable to other awk implementations.

All other awk implementations store strings internally as C-style strings. C strings use the NUL character as the string terminator. In effect, this means that `RS = "\0"' is the same as `RS = ""'. (d.c.)

The best way to treat a whole file as a single record is to simply read the file in, one record at a time, concatenating each record onto the end of the previous ones.
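
The last paragraph of the quoted note can be sketched as follows; slurp_length is a hypothetical name. Each record is appended to a buffer together with the newline that awk stripped, so the whole input is available in END. This rebuilds text input only: an awk that cuts strings at NUL will still lose those bytes, and a final newline is added even if the input lacked one.

```shell
# slurp_length: read all of stdin into one awk string and print its length.
# Hypothetical helper showing the "concatenate every record" approach.
slurp_length() {
    LC_ALL=C awk '
    { buf = buf $0 "\n" }                 # put back the stripped newline
    END { printf "%d\n", length(buf) }    # whole input is now in buf
    '
}

printf 'Hello\nWorld\n' | slurp_length   # prints: 12
```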


鬼剑士之王

#19

Posted 2013-12-26 13:28

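After countless painful experiences with keyword conflicts causing baffling errors, I have one piece of advice.
When naming a variable, give it a suitable prefix.

Take a variable that represents a time. The natural choice is
time=xxx
But time is a shell command and is not a good variable name. Instead you could write
varTime=xxx
Here var, short for variable, is used as the prefix, together with camel case. The name is clear and readable, and it avoids any possible keyword conflict.
For functions I usually use func, short for function, as the prefix.

This is just my humble opinion, offered for the original poster's reference.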


bikkuri

#20

Posted 2013-12-26 18:55

I just tried the commands from the posts above on busybox. No matter how FS and RS are changed, awk seems to split the string into two at "\0".

root@unknown:/tmp/test# echo -e "Hello \0 World" |hexdump -C
00000000 48 65 6c 6c 6f 20 00 20 57 6f 72 6c 64 0a |Hello . World.|
0000000e
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk '{print $0}'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a 20 57 6f 72 6c 64 0a |Hello . World.|
0000000e
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk 'BEGIN{FS="\0"}{print $0}'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a 20 57 6f 72 6c 64 0a |Hello . World.|
0000000e
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk 'BEGIN{FS="\0x55\0x66"}{print $0}'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a 20 57 6f 72 6c 64 0a |Hello . World.|
0000000e
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk '1'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a 20 57 6f 72 6c 64 0a |Hello . World.|
0000000e
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk 'BEGIN{RS="\0"}{print $0}'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a |Hello .|
00000007
root@unknown:/tmp/test# echo -e "Hello \0 World" |awk 'BEGIN{RS="\0x55\0x66"}{print $0}'|hexdump -C
00000000 48 65 6c 6c 6f 20 0a |Hello .|
00000007
root@unknown:/tmp/test#
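
Since this awk cannot carry a NUL byte inside a string, one workaround is to keep raw bytes out of awk entirely: let od turn the input into decimal numbers and have awk do only the arithmetic. The following is a sketch, not a fix of the original base64encode.sh. The name b64encode is hypothetical, and it assumes the system's od accepts the POSIX options -A n -v -t u1, which has not been verified for the Busybox build used above.

```shell
# b64encode: read stdin, write its base64 encoding as a single line.
# Hypothetical helper; od emits one decimal number per input byte, so awk
# never sees a raw NUL and no record or field separator can interfere.
b64encode() {
    od -A n -v -t u1 | awk '
    BEGIN {
        alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    }
    {
        # Every field on an od output line is one byte value (0-255).
        for (f = 1; f <= NF; f++) byte[n++] = $f
    }
    END {
        out = ""
        for (p = 0; p < n; p += 3) {
            left = n - p                          # bytes left for this group
            b1 = byte[p]
            b2 = (left > 1) ? byte[p + 1] : 0     # zero pad a short group
            b3 = (left > 2) ? byte[p + 2] : 0
            v = b1 * 65536 + b2 * 256 + b3
            out = out substr(alphabet, int(v / 262144) + 1, 1)
            out = out substr(alphabet, int(v / 4096) % 64 + 1, 1)
            out = out ((left > 1) ? substr(alphabet, int(v / 64) % 64 + 1, 1) : "=")
            out = out ((left > 2) ? substr(alphabet, v % 64 + 1, 1) : "=")
        }
        print out
    }'
}

printf 'Hello' | b64encode               # prints: SGVsbG8=
printf 'Hello \000 World\n' | b64encode  # prints: SGVsbG8gACBXb3JsZAo=
```

The sketch holds the whole input in memory and does not wrap the output into fixed-width lines; both are simplifications. Unlike the runs above, the trailing newline from echo is encoded too, because awk no longer strips it as a record separator.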
