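Here is the situation:
I use LWP to request a gb2312-encoded web page, for example http://ip138.com/ips138.asp?ip=8.8.8.8&action=2

    use LWP::UserAgent;
    my $response = LWP::UserAgent->new->get("http://ip138.com/ips138.asp?ip=8.8.8.8&action=2");
    my $content  = $response->is_success ? $response->decoded_content : undef;

In theory I should get a Unicode string decoded from gb2312, but that is not what happens; the content seems to have been treated as ISO-8859-1.
If decoded_content really cannot recognize gb2312, then the module itself has a problem. But LWP is so widely used that such a problem should have surfaced long ago, so I am not sure whether the mistake is mine. I'm posting this in the hope that others can take a look.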
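To make the symptom concrete, here is a minimal offline sketch using only the core Encode module. The four sample bytes are my own illustration (the GB2312 encoding of U+4E2D U+6587), not data taken from the page above; the point is only to show how the same bytes come out under the two charsets.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the GB2312 bytes for the characters U+4E2D U+6587.
my $bytes = "\xD6\xD0\xCE\xC4";

# Decoded with the right charset, each byte pair becomes one character.
my $as_gb2312 = decode('gb2312', $bytes);

# Decoded as ISO-8859-1, every byte becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "gb2312:     %d chars, first is U+%04X\n",
    length($as_gb2312), ord($as_gb2312);
printf "ISO-8859-1: %d chars, first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
```

If the decoded page content is about twice as long as expected and full of accented Latin letters, it was most likely decoded the second way.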
I traced through the source code and reached some conclusions; the key pieces of code are pasted below.
First, the decoded_content function:

    if ($self->content_is_text || (my $is_xml = $self->content_is_xml)) {
        my $charset = lc(
            $opt{charset} ||
            $self->content_type_charset ||
            $opt{default_charset} ||
            $self->content_charset ||
            "ISO-8859-1"
        );
        # (excerpt; the rest of the block is omitted)

The page does not add a charset parameter to its HTTP header; instead it sets the encoding inside the HTML document: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
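The fallback order in the excerpt above can be sketched as a standalone function. Note that pick_charset and its hash keys are hypothetical names of mine that mirror the excerpt; they are not part of HTTP::Message.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTTP::Message): returns the first true
# candidate, lowercased, mirroring the || chain in the excerpt.
sub pick_charset {
    my (%c) = @_;
    return lc(
        $c{charset}              ||   # explicit option from the caller
        $c{content_type_charset} ||   # charset parameter of the HTTP header
        $c{default_charset}      ||   # default option from the caller
        $c{content_charset}      ||   # detected from the body, e.g. a <meta> tag
        "ISO-8859-1"                  # last resort
    );
}

print pick_charset(content_charset => 'GB2312'), "\n";
print pick_charset(), "\n";
print pick_charset(charset => 'gb2312', content_charset => 'UTF-8'), "\n";
```

The second call shows the case that matters here: when every candidate is undef, the chain silently ends at ISO-8859-1.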
HTTP::Message also supports detecting the encoding from a meta tag like this one. Stepping through the chain above:
$opt{charset} is not set, $self->content_type_charset finds no charset, and $opt{default_charset} is not set. $self->content_charset should then return gb2312, but in practice
$response->content_charset also returns undef. Tracing into the content_charset() function, the key code is:

    elsif ($self->content_is_html) {
        # look for <META charset="..."> or <META content="...">
        # http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
        require IO::HTML;
        # Use relaxed search to match previous versions of HTTP::Message:
        my $encoding = IO::HTML::find_charset_in($cref, { encoding => 1,
                                                          need_pragma => 0 });
        return $encoding->mime_name if $encoding;
    }

It turns out that IO::HTML::find_charset_in does correctly return an Encode object, but $encoding->mime_name returns nothing, while $encoding->name is correctly "euc-cn".
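The name versus mime_name difference can be checked without any network access. This is a minimal sketch using only the core Encode module; what mime_name prints depends on the Encode version installed, so I make no claim about its output here.

```perl
use strict;
use warnings;
use Encode ();

# Look up the encoding object that the label "gb2312" resolves to.
my $enc = Encode::find_encoding('gb2312')
    or die "Encode does not know gb2312\n";

# Encode's own canonical name for this encoding.
print "name:      ", $enc->name, "\n";

# mime_name is looked up in a separate table and may be undefined.
my $mime = $enc->mime_name;
print "mime_name: ", (defined $mime ? $mime : '(undef)'), "\n";
```

If the second line prints (undef), then `return $encoding->mime_name if $encoding;` returns undef even though detection succeeded.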
Tracing further into mime_name, it actually just does return Encode::MIME::Name::get_mime_name. The key code:

    our %MIME_NAME_OF = (
        'AdobeStandardEncoding' => 'Adobe-Standard-Encoding',
        'AdobeSymbol' => 'Adobe-Symbol-Encoding',
        'ascii' => 'US-ASCII',
        'big5-hkscs' => 'Big5-HKSCS',
        'cp1026' => 'IBM1026',
        'cp1047' => 'IBM1047',
        'cp1250' => 'windows-1250',
        'cp1251' => 'windows-1251',
        'cp1252' => 'windows-1252',
        'cp1253' => 'windows-1253',
        'cp1254' => 'windows-1254',
        'cp1255' => 'windows-1255',
        'cp1256' => 'windows-1256',
        'cp1257' => 'windows-1257',
        'cp1258' => 'windows-1258',
        'cp37' => 'IBM037',
        'cp424' => 'IBM424',
        'cp437' => 'IBM437',
        'cp500' => 'IBM500',
        'cp775' => 'IBM775',
        'cp850' => 'IBM850',
        'cp852' => 'IBM852',
        'cp855' => 'IBM855',
        'cp857' => 'IBM857',
        'cp860' => 'IBM860',
        'cp861' => 'IBM861',
        'cp862' => 'IBM862',
        'cp863' => 'IBM863',
        'cp864' => 'IBM864',
        'cp865' => 'IBM865',
        'cp866' => 'IBM866',
        'cp869' => 'IBM869',
        'cp936' => 'GBK',
        'euc-jp' => 'EUC-JP',
        'euc-kr' => 'EUC-KR',
        #'gb2312-raw' => 'GB2312', # no, you're wrong, I18N::Charset
        'hp-roman8' => 'hp-roman8',
        'hz' => 'HZ-GB-2312',
        'iso-2022-jp' => 'ISO-2022-JP',
        'iso-2022-jp-1' => 'ISO-2022-JP',
        'iso-2022-kr' => 'ISO-2022-KR',
        'iso-8859-1' => 'ISO-8859-1',
        'iso-8859-10' => 'ISO-8859-10',
        'iso-8859-13' => 'ISO-8859-13',
        'iso-8859-14' => 'ISO-8859-14',
        'iso-8859-15' => 'ISO-8859-15',
        'iso-8859-16' => 'ISO-8859-16',
        'iso-8859-2' => 'ISO-8859-2',
        'iso-8859-3' => 'ISO-8859-3',
        'iso-8859-4' => 'ISO-8859-4',
        'iso-8859-5' => 'ISO-8859-5',
        'iso-8859-6' => 'ISO-8859-6',
        'iso-8859-7' => 'ISO-8859-7',
        'iso-8859-8' => 'ISO-8859-8',
        'iso-8859-9' => 'ISO-8859-9',
        #'jis0201-raw' => 'JIS_X0201',
        #'jis0208-raw' => 'JIS_C6226-1983',
        #'jis0212-raw' => 'JIS_X0212-1990',
        'koi8-r' => 'KOI8-R',
        'koi8-u' => 'KOI8-U',
        #'ksc5601-raw' => 'KS_C_5601-1987',
        'shiftjis' => 'Shift_JIS',
        'UTF-16' => 'UTF-16',
        'UTF-16BE' => 'UTF-16BE',
        'UTF-16LE' => 'UTF-16LE',
        'UTF-32' => 'UTF-32',
        'UTF-32BE' => 'UTF-32BE',
        'UTF-32LE' => 'UTF-32LE',
        'UTF-7' => 'UTF-7',
        'utf8' => 'UTF-8',
        'utf-8-strict' => 'UTF-8',
        'viscii' => 'VISCII',
    );
    sub get_mime_name($) { $MIME_NAME_OF{$_[0]} };

As you can see, Encode::MIME::Name has no mime_name for euc-cn at all. I checked IANA's official charset document:
http://www.iana.org/assignments/character-sets/character-sets.xml and it really has no such entry either.
Could it really be that the HTTP::Message source code is written incorrectly?
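Whatever the root cause turns out to be, one possible way around it is to stop relying on auto-detection. The sketch below assumes the charset is known in advance, and it uses a locally built HTTP::Response with made-up body bytes in place of the real page so that it runs offline. It tries the charset option seen in the decoded_content excerpt, and also decoding the raw content by hand.

```perl
use strict;
use warnings;
use Encode ();
use HTTP::Response;

# Stand-in for the real page: no charset in the Content-Type header, only in
# the <meta> tag, and a GB2312-encoded body (illustrative bytes).
my $html = '<html><head><meta http-equiv="Content-Type" '
         . 'content="text/html; charset=gb2312"></head><body>'
         . "\xD6\xD0\xCE\xC4</body></html>";
my $response = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], $html,
);

# Option 1: tell decoded_content which charset to use.
my $text = $response->decoded_content(charset => 'gb2312');

# Option 2: take the raw bytes and decode them yourself.
my $same = Encode::decode('gb2312', $response->content);

print "both options agree\n" if $text eq $same;
printf "decoded length: %d characters\n", length $text;
```

Either way the charset is hard-coded, so this only helps for pages whose encoding you already know.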