#11

Posted 2016-09-09 11:51

Reply to #10 (104359176):
The text I need to process contains Chinese, and I still have not managed to understand the concept of UTF-8.

sunzhiguolu

#12

Posted 2016-09-17 11:44

My earlier settings may have been wrong, so I changed them.
Terminal settings: Options | Text -> {Locale: C, Character Set: utf8}
.vimrc settings:
set encoding=utf-8
set fileencodings=utf-8,ucs-bom,gb2312,gbk
set fileencoding=utf-8
set termencoding=utf-8

I ran a small test:
cat text
------------------------------------
我非常喜欢 Perl 语言
同时也非常喜欢正则表达式

file -i text
------------------------------------
text: text/plain; charset=utf-8

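I want to write the contents of text out as GB2312-encoded text, but when I redirect the result to a file it comes out garbled.
I have not found the cause. Any pointers would be appreciated, thanks everyone...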

sunzhiguolu

#13

Posted 2016-09-17 11:57

Here is the actual code:

#!/usr/bin/perl
use strict;
use warnings;
binmode (STDIN, ":encoding(utf8)");
binmode (STDOUT, ":encoding(gb2312)");
while (<>){
    print;
}

Running it produces the following errors:
perl test.pl text>rst
------------------------------------------------
"\x{00e6}" does not map to euc-cn, <> line 2.
"\x{0088}" does not map to euc-cn, <> line 2.
"\x{0091}" does not map to euc-cn, <> line 2.
"\x{009d}" does not map to euc-cn, <> line 2.
"\x{009e}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{00b8}" does not map to euc-cn, <> line 2.
"\x{00b8}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{0096}" does not map to euc-cn, <> line 2.
"\x{009c}" does not map to euc-cn, <> line 2.
"\x{00e6}" does not map to euc-cn, <> line 2.
"\x{00ac}" does not map to euc-cn, <> line 2.
"\x{00a2}" does not map to euc-cn, <> line 2.
"\x{00af}" does not map to euc-cn, <> line 2.
"\x{00ad}" does not map to euc-cn, <> line 2.
"\x{0080}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{0090}" does not map to euc-cn, <> line 2.
"\x{008c}" does not map to euc-cn, <> line 2.
"\x{00e6}" does not map to euc-cn, <> line 2.
"\x{0097}" does not map to euc-cn, <> line 2.
"\x{00b6}" does not map to euc-cn, <> line 2.
"\x{00e4}" does not map to euc-cn, <> line 2.
"\x{00b9}" does not map to euc-cn, <> line 2.
"\x{009f}" does not map to euc-cn, <> line 2.
"\x{009d}" does not map to euc-cn, <> line 2.
"\x{009e}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{00b8}" does not map to euc-cn, <> line 2.
"\x{00b8}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{0096}" does not map to euc-cn, <> line 2.
"\x{009c}" does not map to euc-cn, <> line 2.
"\x{00e6}" does not map to euc-cn, <> line 2.
"\x{00ac}" does not map to euc-cn, <> line 2.
"\x{00a2}" does not map to euc-cn, <> line 2.
"\x{00e6}" does not map to euc-cn, <> line 2.
"\x{00ad}" does not map to euc-cn, <> line 2.
"\x{00a3}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{0088}" does not map to euc-cn, <> line 2.
"\x{0099}" does not map to euc-cn, <> line 2.
"\x{00a1}" does not map to euc-cn, <> line 2.
"\x{00be}" does not map to euc-cn, <> line 2.
"\x{00be}" does not map to euc-cn, <> line 2.
"\x{00e5}" does not map to euc-cn, <> line 2.
"\x{00bc}" does not map to euc-cn, <> line 2.
"\x{008f}" does not map to euc-cn, <> line 2.

file -i rst
----------------------------------------------
rst: text/plain; charset=iso-8859-1

cat rst
----------------------------------------------
\x{00e6}\x{0088}\x{0091}▒▒\x{009d}\x{009e}\x{00e5}\x{00b8}\x{00b8}\x{00e5}\x{0096}\x{009c}\x{00e6}\x{00ac}\x{00a2} Perl ▒▒\x{00af}\x{00ad}▒▒▒▒\x{0080}
\x{00e5}\x{0090}\x{008c}\x{00e6}\x{0097}\x{00b6}\x{00e4}\x{00b9}\x{009f}▒▒\x{009d}\x{009e}\x{00e5}\x{00b8}\x{00b8}\x{00e5}\x{0096}\x{009c}\x{00e6}\x{00ac}\x{00a2}\x{00e6}\x{00ad}\x{00a3}\x{00e5}\x{0088}\x{0099}▒▒\x{00a1}▒▒▒▒\x{00be}\x{00be}\x{00e5}\x{00bc}\x{008f}

It looks exactly the same when opened on Windows. Where should I start in order to solve this? Thanks everyone...
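
A likely explanation, offered as a sketch rather than a confirmed diagnosis: `binmode (STDIN, ...)` only sets a layer on the STDIN handle, while `<>` reads the file named on the command line through the separate ARGV handle. The bytes of text are then never decoded, so each UTF-8 byte reaches the gb2312 output layer as a character of its own. That would match the first three errors, `\x{00e6}`, `\x{0088}` and `\x{0091}`, which are the three UTF-8 bytes of 我. The version below opens the input explicitly with a decoding layer. The sub name `convert_file` and the two-argument command line are illustrative and not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy $src (assumed to be UTF-8) to $dst as GB2312.
sub convert_file {
    my ($src, $dst) = @_;

    # Decode on input: UTF-8 bytes become Perl characters.
    open(my $in, '<:encoding(UTF-8)', $src)
        or die "Cannot read $src: $!";

    # Encode on output: Perl characters become GB2312 bytes.
    open(my $out, '>:encoding(gb2312)', $dst)
        or die "Cannot write $dst: $!";

    while (my $line = <$in>) {
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $dst: $!";
    return;
}

# Usage: perl test.pl text rst
convert_file(@ARGV) if @ARGV == 2;
```

Converting the result back with `iconv -f gb2312 -t utf-8 rst` should reproduce the original text. Characters that GB2312 does not contain would still trigger "does not map" warnings; in that case `gbk` as the output encoding covers a larger repertoire.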

sunzhiguolu

#14

Posted 2016-09-17 11:59

Reply to #8 (104359176):
The utf8 test finally passes, thank you.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
if (ord ("中") == 20013){
    print "UTF8 Encoding.\n";
}
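
To see why this test depends on `use utf8`, the sketch below compares the two views of the same text without relying on the pragma: the raw UTF-8 bytes of 中, and the character obtained by decoding them. The bytes are written as escapes so the result does not depend on how the script file itself is saved. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 bytes of the character 中.
my $bytes = "\xE4\xB8\xAD";

# Undecoded, Perl sees three separate one-byte characters.
my $first_byte = ord($bytes);       # 0xE4 = 228
my $byte_len   = length($bytes);    # 3

# Decoded, it is a single character, U+4E2D.
my $char      = decode('UTF-8', $bytes);
my $codepoint = ord($char);         # 0x4E2D = 20013
my $char_len  = length($char);      # 1

printf "bytes: ord=%d length=%d\n", $first_byte, $byte_len;
printf "chars: ord=%d length=%d\n", $codepoint, $char_len;
```

With `use utf8`, a literal "中" in a UTF-8 source file is decoded in the same way at compile time, which is why `ord` returns 20013 in the test.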

hztj2005

#15

Posted 2016-09-17 14:57

Last edited by hztj2005 on 2016-09-23 23:58

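You really should buy a copy of the book I mentioned last time. As the old saying goes, "I once spent a whole day thinking, and it was worth less than a moment of study."

First find out which encoding the input file actually uses. On Windows, open it in Notepad and click Save As; the Encoding drop-down at the bottom of the dialog shows the file's current encoding. On a Simplified Chinese system, ANSI in that list means GB2312 (strictly, code page 936, i.e. GBK, which is a superset of GB2312). The other one is UTF-8. There are two more options that you do not need to care about for now.
If you only need to convert a few files, pick the target encoding there and save. No code is needed.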
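
The same check can be approximated on Linux without Notepad. The sketch below is only a heuristic, and the sub name `guess_encoding` and script name `guess.pl` are made up for illustration: it attempts a strict UTF-8 decode and reports whether it succeeded. A failure suggests a legacy encoding such as GB2312/GBK, assuming the file came from a Simplified Chinese system; a success is not proof, since some GB2312 byte pairs happen to be valid UTF-8 as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: classify a byte string as ASCII, UTF-8, or neither.
sub guess_encoding {
    my ($bytes) = @_;

    # Pure ASCII reads the same in UTF-8 and GB2312.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    # Strict decode: FB_CROAK makes decode() die on malformed UTF-8.
    # decode() may modify its argument in that mode, so pass a copy.
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };

    return $ok ? 'utf-8' : 'not utf-8';
}

# Usage: perl guess.pl text
if (@ARGV) {
    open(my $fh, '<:raw', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
    my $content = do { local $/; <$fh> };
    close($fh);
    print "$ARGV[0]: ", guess_encoding($content), "\n";
}
```

Reading the file with the `:raw` layer matters here: the point is to inspect the bytes as stored, before any decoding layer has touched them.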

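If you want to do the conversion in code, read the book I mentioned first. Typing all of this out is tedious, and I use ActivePerl on Windows 10, so my environment differs from yours and comparing notes is awkward.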

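The following is text from the web:
Over the past few days I have been dealing with quite a few character sets at work: Unicode, UTF-8, GB2312 and so on. When I typed notepad at the Windows command line and went to save a file, I had to choose an encoding, but there was no GB option, only ANSI. That puzzled me, so I looked it up afterwards and it finally made sense: on a Simplified Chinese system, the ANSI encoding means GB2312.
To let computers support more languages, a single character is usually represented by 2 bytes in the range 0x80~0xFF. For example, on a Chinese operating system the character '中' is stored as the two bytes [0xD6, 0xD0].
Different countries and regions drew up different standards, which produced GB2312, BIG5, JIS and other encoding standards. These extended encodings, which use 2 bytes to represent one character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI means GB2312; on a Japanese system, ANSI means JIS.
Different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded text.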
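
The byte values quoted above for '中' can be checked directly with Perl's Encode module. This is a small sketch with an illustrative helper name (`hex_bytes`); it encodes the same character, U+4E2D, as GB2312 and as UTF-8 and prints the resulting bytes in hex.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# The character 中, given by code point so the script file's own
# encoding does not matter.
my $char = "\x{4E2D}";

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

my $gb   = hex_bytes(encode('gb2312', $char));
my $utf8 = hex_bytes(encode('UTF-8',  $char));

print "gb2312: $gb\n";      # D6 D0
print "UTF-8:  $utf8\n";    # E4 B8 AD
```

One character, two different byte sequences: this is why a program has to be told which encoding a file uses before it can interpret the bytes.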