免费注册 查看新帖 |

Chinaunix

  平台 论坛 博客 文库
最近访问板块 发新帖
查看: 1363 | 回复: 0
打印 上一主题 下一主题

PHP中GBK和UTF8编码处理 [复制链接]

论坛徽章:
0
跳转到指定楼层
1 [收藏(0)] [报告]
发表于 2009-07-23 11:48 |只看该作者 |倒序浏览
PHP中GBK和UTF8编码处理
作者:heiyeluren  来源:
[color="#0000ff"]heiyeluren的blog
2008-01-21 最后更新:2008-01-21 15:14:07
收藏到网摘:

[url=javascript:location.href='http://www.google.com/bookmarks/mark?op=add&bkmk='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)][/url]

[url=javascript:location.href='http://del.icio.us/post?&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)][/url]

[url=javascript:d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(vivi=window.open('http://myweb.cn.yahoo.com/popadd.html?url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title),'yahoo','scrollbars=no,width=720,height=420,left=75,top=20,status=no,resizable=yes'));vivi.focus();][/url]

[url=javascript:u=location.href;t=document.title;c%20=%20%22%22%20+%20(window.getSelection%20?%20window.getSelection()%20:%20document.getSelection%20?%20document.getSelection()%20:%20document.selection.createRange().text);var%20url=%22http://cang.baidu.com/do/add?it=%22+encodeURIComponent(t)+%22&iu=%22+encodeURIComponent(u)+%22&dc=%22+encodeURIComponent(c)+%22&fr=ien#nw=1%22;window.open(url,'_blank','scrollbars=no,width=600,height=450,left=75,top=20,status=no,resizable=yes');%20void%200][/url]

[url=javascript:d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(vivi=window.open('http://vivi.sina.com.cn/collect/icollect.php?pid=28&title='+escape(d.title)+'&url='+escape(d.location.href)+'&desc='+escape(t),'vivi','scrollbars=no,width=480,height=480,left=75,top=20,status=no,resizable=yes'));vivi.focus();][/url]

[url=javascript:d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(keyit=window.open('http://www.365key.com/storeit.aspx?t='+escape(d.title)+'&u='+escape(d.location.href)+'&c='+escape(t),'keyit','scrollbars=no,width=475,height=575,left=75,top=20,status=no,resizable=yes'));keyit.focus();][/url]

[url=javascript:d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(yesky=window.open('http://hot.yesky.com/dp.aspx?t='+escape(d.title)+'&u='+escape(d.location.href)+'&c='+escape(t)+'&st=2','yesky','scrollbars=no,width=400,height=480,left=75,top=20,status=no,resizable=yes'));yesky.focus();][/url]

[url=javascript:d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(keyit=window.open('http://my.poco.cn/fav/storeIt.php?t='+escape(d.title)+'&u='+escape(d.location.href)+'&c='+escape(t)+'&img=http://www.h-strong.com/blog/logo.gif','keyit','scrollbars=no,width=475,height=575,status=no,resizable=yes'));keyit.focus();][/url]

[url=javascript:t=document.title;u=location.href;e=document.selection?(document.selection.type!='None'?document.selection.createRange().text:''):(document.getSelection?document.getSelection():'');void(open('http://bookmark.hexun.com/post.aspx?title='+escape(t)+'&url='+escape(u)+'&excerpt='+escape(e),'HexunBookmark','scrollbars=no,width=600,height=450,left=80,top=80,status=no,resizable=yes'));][/url]

一、编码范围
1. GBK (GB2312/GB18030)
\x00-\xff  GBK双字节编码范围
\x20-\x7f  ASCII
\xa1-\xff  中文
\x80-\xff  中文
2. UTF-8 (Unicode)
\u4e00-\u9fa5 (中文)
\x3130-\x318F (韩文
\xAC00-\xD7A3 (韩文)
\u0800-\u4e00 (日文)
ps: 韩文是大于[\u9fa5]的字符
正则例子:
preg_replace("/([\x80-\xff])/","",$str);
preg_replace("/([u4e00-u9fa5])/","",$str);

二、代码例子
//判断内容里有没有中文-GBK (PHP)
function[color="#000000"] check_is_chinese($s){
[color="#000000"]    return[color="#000000"] preg_match('/[\x80-\xff]./',[color="#000000"] $s);
}
//获取字符串长度-GBK (PHP)
function[color="#000000"] gb_strlen($str){
[color="#000000"]    $count[color="#000000"] =[color="#000000"] 0;
[color="#000000"]    for($i=[color="#000000"]0;[color="#000000"] $istrlen($str);[color="#000000"] $i++){
[color="#000000"]        $s[color="#000000"] =[color="#000000"] substr($str,[color="#000000"] $i,[color="#000000"] 1);
[color="#000000"]        if[color="#000000"] (preg_match("/[\x80-\xff]/",[color="#000000"] $s))[color="#000000"] ++$i;
[color="#000000"]        ++$count;
[color="#000000"]    }
[color="#000000"]    return[color="#000000"] $count;
}
//截取字符串字串-GBK (PHP)
function[color="#000000"] gb_substr($str,[color="#000000"] $len){
[color="#000000"]    $count[color="#000000"] =[color="#000000"] 0;
[color="#000000"]    for($i=[color="#000000"]0;[color="#000000"] $istrlen($str);[color="#000000"] $i++){
[color="#000000"]        if($count[color="#000000"] ==[color="#000000"] $len)[color="#000000"] break;
[color="#000000"]        if(preg_match("/[\x80-\xff]/",[color="#000000"] substr($str,[color="#000000"] $i,[color="#000000"] 1)))[color="#000000"] ++$i;
[color="#000000"]        ++$count;[color="#000000"]        
    }
[color="#000000"]    return[color="#000000"] substr($str,[color="#000000"] 0,[color="#000000"] $i);
}
//统计字符串长度-UTF8 (PHP)
function[color="#000000"] utf8_strlen($str)[color="#000000"] {
[color="#000000"]    $count[color="#000000"] =[color="#000000"] 0;
[color="#000000"]    for($i[color="#000000"] =[color="#000000"] 0;[color="#000000"] $i[color="#000000"] [color="#000000"] strlen($str);[color="#000000"] $i++){
[color="#000000"]        $value[color="#000000"] =[color="#000000"] ord($str[$i]);
[color="#000000"]        if($value[color="#000000"] >[color="#000000"] 127)[color="#000000"] {
[color="#000000"]            $count++;
[color="#000000"]            if($value[color="#000000"] >=[color="#000000"] 192 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 223)[color="#000000"] $i++;
[color="#000000"]            elseif($value[color="#000000"] >=[color="#000000"] 224 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 239)[color="#000000"] $i[color="#000000"] =[color="#000000"] $i[color="#000000"] +[color="#000000"] 2;
[color="#000000"]            elseif($value[color="#000000"] >=[color="#000000"] 240 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 247)[color="#000000"] $i[color="#000000"] =[color="#000000"] $i[color="#000000"] +[color="#000000"] 3;
[color="#000000"]            else[color="#000000"] die('Not a UTF-8 compatible string');
[color="#000000"]        }
[color="#000000"]        $count++;
[color="#000000"]    }
[color="#000000"]    return[color="#000000"] $count;
}
//截取字符串-UTF8(PHP)
function[color="#000000"] utf8_substr($str,$position,$length){
[color="#000000"]    $start_position[color="#000000"] =[color="#000000"] strlen($str);
[color="#000000"]    $start_byte[color="#000000"] =[color="#000000"] 0;
[color="#000000"]    $end_position[color="#000000"] =[color="#000000"] strlen($str);
[color="#000000"]    $count[color="#000000"] =[color="#000000"] 0;
[color="#000000"]    for($i[color="#000000"] =[color="#000000"] 0;[color="#000000"] $i[color="#000000"] [color="#000000"] strlen($str);[color="#000000"] $i++){
[color="#000000"]        if($count[color="#000000"] >=[color="#000000"] $position[color="#000000"] &&[color="#000000"] $start_position[color="#000000"] >[color="#000000"] $i){
[color="#000000"]            $start_position[color="#000000"] =[color="#000000"] $i;
[color="#000000"]            $start_byte[color="#000000"] =[color="#000000"] $count;
[color="#000000"]        }
[color="#000000"]        if(($count-$start_byte)>=$length)[color="#000000"] {
[color="#000000"]            $end_position[color="#000000"] =[color="#000000"] $i;
[color="#000000"]            break;
[color="#000000"]        }[color="#000000"]   
        $value[color="#000000"] =[color="#000000"] ord($str[$i]);
[color="#000000"]        if($value[color="#000000"] >[color="#000000"] 127){
[color="#000000"]            $count++;
[color="#000000"]            if($value[color="#000000"] >=[color="#000000"] 192 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 223)[color="#000000"] $i++;
[color="#000000"]            elseif($value[color="#000000"] >=[color="#000000"] 224 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 239)[color="#000000"] $i[color="#000000"] =[color="#000000"] $i[color="#000000"] +[color="#000000"] 2;
[color="#000000"]            elseif($value[color="#000000"] >=[color="#000000"] 240 &&[color="#000000"] $value[color="#000000"] =[color="#000000"] 247)[color="#000000"] $i[color="#000000"] =[color="#000000"] $i[color="#000000"] +[color="#000000"] 3;
[color="#000000"]            else[color="#000000"] die('Not a UTF-8 compatible string');
[color="#000000"]        }
[color="#000000"]        $count++;
[color="#000000"]    }
[color="#000000"]    return(substr($str,$start_position,$end_position-$start_position));
}
//字符串长度统计-UTF8 [中文3个字节,俄文、韩文占2个字节,字母占1个字节] (Ruby)
[color="#000000"]def utf8_string_length([color="#000000"]str)
[color="#000000"]    temp =[color="#000000"] CGI::[color="#000000"]unescape([color="#000000"]str)
[color="#000000"]    i =[color="#000000"] 0;
[color="#000000"]    j =[color="#000000"] 0;
[color="#000000"]    temp.[color="#000000"]length.[color="#000000"]times{|[color="#000000"]t|
[color="#000000"]        if[color="#000000"] temp[[color="#000000"]t][color="#000000"] [color="#000000"] 127
            i +=[color="#000000"] 1
        elseif[color="#000000"] temp[[color="#000000"]t][color="#000000"] >=[color="#000000"] 127 and temp[[color="#000000"]t][color="#000000"] [color="#000000"] 224
            j +=[color="#000000"] 1
            if[color="#000000"] 0 ==[color="#000000"] ([color="#000000"]j %[color="#000000"] 2)
[color="#000000"]                i +=[color="#000000"] 2
                j =[color="#000000"] 0
            end
[color="#000000"]        else
[color="#000000"]            j +=[color="#000000"] 1
            if[color="#000000"] 0 ==[color="#000000"] ([color="#000000"]j %[color="#000000"] 3)
[color="#000000"]                i +=[color="#000000"]2
                j =[color="#000000"] 0
            end
[color="#000000"]        end
[color="#000000"]    }
[color="#000000"]    return[color="#000000"] i
}
//判断是否是有韩文-UTF-8 (JavaScript)
function[color="#000000"] checkKoreaChar([color="#000000"]str)[color="#000000"] {
[color="#000000"]    for([color="#000000"]i=[color="#000000"]0;[color="#000000"] i[color="#000000"]str.[color="#000000"]length;[color="#000000"] i++)[color="#000000"] {
[color="#000000"]        if((([color="#000000"]str.[color="#000000"]charCodeAt([color="#000000"]i)[color="#000000"] >[color="#000000"] 0x3130 &&[color="#000000"] str.[color="#000000"]charCodeAt([color="#000000"]i)[color="#000000"] [color="#000000"] 0x318F)[color="#000000"] ||[color="#000000"] ([color="#000000"]str.[color="#000000"]charCodeAt([color="#000000"]i)[color="#000000"] >=[color="#000000"] 0xAC00 &&[color="#000000"] str.[color="#000000"]charCodeAt([color="#000000"]i)[color="#000000"] =[color="#000000"] 0xD7A3)))[color="#000000"] {
[color="#000000"]            return[color="#000000"] true;
[color="#000000"]        }
[color="#000000"]    }
[color="#000000"]    return[color="#000000"] false;
}
//判断是否有中文字符-GBK (JavaScript)
function[color="#000000"] check_chinese_char([color="#000000"]s){
[color="#000000"]    return[color="#000000"] ([color="#000000"]s.[color="#000000"]length !=[color="#000000"] s.[color="#000000"]replace(/[^\[color="#000000"]x00-\[color="#000000"]xff]/[color="#000000"]g,"**").[color="#000000"]length);

三、参考文档
[color="#0000ff"]http://www.unicode.org/

[color="#0000ff"]http://examples.oreilly.com/cjkvinfo/doc/cjk.inf
[color="#0000ff"]http://www.ansell-uebersetzungen.com/gbuni.html
[color="#0000ff"]http://www.haiyan.com/steelk/navigator/ref/gbk/gbindex.htm
[color="#0000ff"]http://baike.baidu.com/view/40801.htm
[color="#0000ff"]http://www.chedong.com/tech/hello_unicode.html
另:
/*
* 中文截取,支持gb2312,gbk,utf-8,big5
*
* @param string $str 要截取的字串
* @param int $start 截取起始位置
* @param int $length 截取长度
* @param string $charset utf-8|gb2312|gbk|big5 编码
* @param $suffix 是否加尾缀
*/
public function csubstr($str, $start=0, $length, $charset="utf-8", $suffix=true)
{
if(function_exists("mb_substr"))
return mb_substr($str, $start, $length, $charset);
$re['utf-8'] = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
$re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
$re['gbk'] = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
$re['big5'] = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|\xa1-\xfe])/";
preg_match_all($re[$charset], $str, $match);
$slice = join("",array_slice($match[0], $start, $length));
if($suffix) return $slice."…";
return $slice;
}
               
               
               

本文来自ChinaUnix博客,如果查看原文请点:http://blog.chinaunix.net/u2/84280/showart_2004354.html
您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

  

北京盛拓优讯信息技术有限公司. 版权所有 京ICP备16024965号-6 北京市公安局海淀分局网监中心备案编号:11010802020122 niuxiaotong@pcpop.com 17352615567
未成年举报专区
中国互联网协会会员  联系我们:huangweiwei@itpub.net
感谢所有关心和支持过ChinaUnix的朋友们 转载本站内容请注明原作者名及出处

清除 Cookies - ChinaUnix - Archiver - WAP - TOP