Perl's internal format doesn't seem to be strictly UTF-8; it is "utf8", and the two are apparently not the same thing.
UTF-8 vs. utf8
....We now view strings not as sequences of bytes, but as sequences
of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
That has been Perl's notion of UTF-8, but official UTF-8 is more strict: its range is much narrower (0 .. 0x10FFFF), and some sequences are not allowed (e.g. those used in surrogate pairs, 0xFFFE, et al.).
Now that is overruled by Larry Wall himself.
From: Larry Wall <larry@wall.org>
Date: December 04, 2004 11:51:58 JST
To: perl-unicode@perl.org
Subject: Re: Make Encode.pm support the real UTF-8
Message-Id: <20041204025158.GA28754@wall.org>
On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
: I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
: but "UTF-8" is the name of the standard and should give the
: corresponding behaviour.
For what it's worth, that's how I've always kept them straight in my
head.
Also for what it's worth, Perl 6 will mostly default to strict but
make it easy to switch back to lax.
Larry
Do you copy? As of Perl 5.8.7, "UTF-8" means strict, official UTF-8, while "utf8" means the liberal, lax version thereof. Encode version 2.10 or later thus groks the difference between "UTF-8" and "utf8".
encode("utf8", "\x{FFFF_FFFF}", 1); # okay
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
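The two calls above can be tried out with a small script. This is only a minimal sketch: the helper name try_encode is made up for illustration, and it uses the code point 0x110000 (one past the Unicode maximum) instead of 0xFFFF_FFFF so that it does not depend on the integer size of the machine. Going by the documentation quoted here, the lax "utf8" call should succeed and the strict "UTF-8" call should croak.

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep perl quiet about the deliberately out-of-range code point
use Encode qw(encode);

# FB_CROAK makes encode() die on unmappable characters;
# LEAVE_SRC stops encode() from modifying the source string in place.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# try_encode is a hypothetical helper, not part of Encode:
# it reports whether encoding under the given name succeeded or croaked.
sub try_encode {
    my ($name, $string) = @_;
    my $bytes = eval { encode($name, $string, $check) };
    return defined $bytes ? "ok" : "croaked";
}

my $beyond = chr(0x110000);    # just outside 0 .. 0x10FFFF

printf "%-6s => %s\n", $_, try_encode($_, $beyond) for qw(utf8 UTF-8);
```

Wrapping the call in eval is what lets the script keep going after the strict encoder croaks.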
"UTF-8" in Encode is actually a canonical name for "utf-8-strict". Yes, the hyphen between "UTF" and "8" is important. Without it Encode goes "liberal".
find_encoding("UTF-8")->name # is 'utf-8-strict'
find_encoding("utf-8")->name # ditto. names are case insensitive
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
find_encoding("UTF8")->name # is 'utf8'.
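To see which encoder a given spelling actually resolves to on your own installation, something like the following can be run. It is just a sketch; the list of spellings and the helper name canonical_name are my own choices for illustration, while find_encoding and ->name come from Encode itself.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# canonical_name is a hypothetical helper: it returns the canonical
# encoding name for a spelling, or "(not found)" if Encode does not know it.
sub canonical_name {
    my ($alias) = @_;
    my $enc = find_encoding($alias);
    return $enc ? $enc->name : "(not found)";
}

# Hyphen, underscore, and no separator, in both cases.
printf "%-6s -> %s\n", $_, canonical_name($_) for qw(UTF-8 utf-8 utf_8 UTF8 utf8);
```

Checking the resolved name first is a cheap way to make sure that code handling untrusted input really gets the strict encoder.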
This passage is excerpted from the Encode module's documentation.