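Originally posted by ynchnluiti on 2009-2-16 19:44
1. I have not yet found a setting that makes the dump use single quotes for tag attributes. But in Japanese pages, HTML attributes should be able to use double quotes as well, right?
2. The encoding of the script and the encoding used by the viewing software (browser, editor, etc.) must also match.
With English HTML this really is not a problem, but with Japanese HTML it comes down to the difference between double quotes and single quotes.
As shown below: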
#!/usr/bin/perl -w
use warnings;
use strict;
use Encode;
use HTML::Entities;
use HTML::TreeBuilder;
my $encode = 'shift_jis';
my $html = <<'__HTML__';
<html>
<head>
<title>TestForJP</title>
</head>
<body lang=JA>
<p style='margin-top:0mm;margin-right:0mm;margin-bottom:0mm;margin-left:28.4pt;
margin-bottom:.0001pt;'><span lang=EN-US
style='font-size:10.0pt;font-family:"MS 明朝"'>AAAAA</span><span
style='font-size:10.0pt;font-family:"MS 明朝"'>BBBBB<span
lang=EN-US>CCCCC</span>DDDDD。<span lang=EN-US><br>
EEEEE</span>FFFF<span lang=EN-US><br style='mso-special-character:line-break'>
<![if !supportLineBreakNewLine]><br style='mso-special-character:line-break'>
<![endif]><o:p></o:p></span></span></p>
</body>
</html>
__HTML__
{
    my $h = HTML::TreeBuilder->new_from_content( decode($encode, $html) );
    my $p = $h->look_down(_tag => q{p});

    for my $span ( $h->look_down(_tag => q{span}) ) {
        $span->attr(lang=>undef) if ( defined $span->attr('lang') );
        $span->replace_with_content($span->content_refs_list) if
            ( not defined $span->attr('style') );
    }
    print encode( $encode, $h->as_HTML('<>&',' ',{}) ), "\n";
    $h->delete;
}
__END__
<html>
<head>
<title>TestForJP</title>
</head>
<body lang="JA">
<p style="margin-top:0mm;margin-right:0mm;margin-bottom:0mm;margin-left:28.4pt;
margin-bottom:.0001pt;"><span style="font-size:10.0pt;font-family:"MS 明朝"">AAAAA</span><span style="font-size:10.0pt;font-family:"MS 明朝"">BBBBBCCCCCDDDDD。<br />
EEEEEFFFF<br style="mso-special-character:line-break" />
<br style="mso-special-character:line-break" /></span></p>
</body>
</html>
The result of the first pass is correct.
But if that first-pass result is tidied a second time, the resulting HTML looks like this:
<html>
<head>
<title>TestForJP</title>
</head>
<body lang="JA">
<p style="margin-top:0mm;margin-right:0mm;margin-bottom:0mm;margin-left:28.4pt;margin-bottom:.0001pt;"><span style="font-size:10.0pt;font-family:" 明朝""="明朝""" ms="MS">AAAAA</span><span style="font-size:10.0pt;font-family:" 明朝""="明朝""" ms="MS">BBBBBCCCCCDDDDD。<br /> EEEEEFFFF<br style="mso-special-character:line-break" />
<br style="mso-special-character:line-break" /></span></p>
</body>
</html>
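A possible explanation, offered only as a sketch: the dump wraps every attribute value in double quotes, but the entity set '<>&' passed to as_HTML does not include the double quote, so the inner quotes around "MS 明朝" are written out raw and the second parse splits the attribute. The snippet below uses only encode_entities from HTML::Entities to show the difference between the two entity sets. The variable names ($style, $unsafe, $safe) are made up for illustration, and the assumption that as_HTML hands its first argument to encode_entities for attribute values is my reading of HTML::Element, not something confirmed in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(encode_entities);

binmode STDOUT, ':encoding(UTF-8)';

# Attribute value from the sample above; the inner double quotes are the issue.
my $style = 'font-size:10.0pt;font-family:"MS 明朝"';

# Same entity set as in the script: only <, > and & are escaped,
# so the inner double quotes pass through unchanged.
my $unsafe = encode_entities( $style, '<>&' );

# Adding " to the set turns the inner quotes into &quot;.
my $safe = encode_entities( $style, '<>&"' );

print qq{style="$unsafe"\n};    # broken: quotes nested inside quotes
print qq{style="$safe"\n};      # well-formed: inner quotes are entities

# Assumption: the equivalent change in the script would be
#   $h->as_HTML( '<>&"', ' ', {} )
```

If that assumption holds, the first-pass output would contain font-family:&quot;MS 明朝&quot; and a second pass should parse the style attribute back as a single value.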