Universal Encoding Detector

Posted on 2009-08-11 08:07

http://chardet.feedparser.org/
http://chardet.feedparser.org/docs/how-it-works.html
Reference: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Basic usage
The easiest way to use the Universal Encoding Detector library is with the detect function.

Example: Using the detect function
The detect function takes one argument, a non-Unicode string. It returns a dictionary containing the auto-detected character encoding and a confidence level from 0 to 1.
>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
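Once detect has returned its dictionary, a natural next step is to decode the raw bytes with the guessed encoding. The following is a minimal sketch of that step and is not part of the chardet library: the helper name decode_with_guess, the 0.5 confidence threshold, and the UTF-8 fallback are all assumptions chosen for illustration. It takes the result dictionary as an argument, so it can be tried without a network connection.

```python
def decode_with_guess(raw, guess, min_confidence=0.5, fallback='utf-8'):
    """Decode the byte string `raw` using a detect()-style result dictionary.

    `guess` is expected to look like {'encoding': 'EUC-JP', 'confidence': 0.99}.
    The threshold and the fallback encoding are illustrative assumptions.
    """
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    # Fall back when no encoding was reported or the guess is too weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess, or an encoding name Python does not know:
        # decode with the fallback and substitute undecodable bytes.
        return raw.decode(fallback, 'replace')
```

With the session above, decode_with_guess(rawdata, chardet.detect(rawdata)) would return the page contents as a Unicode string.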
Advanced usage
If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.
Create a UniversalDetector object, then call its feed method repeatedly with each block of text.  If the detector reaches a minimum threshold of confidence, it will set detector.done to True.
Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier.  Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as the chardet.detect function returns).

Example: Detecting encoding incrementally
import urllib
from chardet.universaldetector import UniversalDetector
usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}
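The loop above feeds the detector line by line. For data without meaningful line breaks, fixed-size blocks are another option, since feed is called with "each block of text". The sketch below wraps the feed/done/close sequence in a helper; the name detect_stream and the 4096-byte block size are assumptions for illustration. The detector is passed in as an argument and only needs the feed, done, close, and result members described above.

```python
def detect_stream(stream, detector, block_size=4096):
    """Feed `stream` (a binary file-like object) to `detector` in blocks.

    `detector` is expected to be a fresh (or freshly reset) UniversalDetector,
    or any object with the same feed/done/close/result members.
    """
    while not detector.done:
        block = stream.read(block_size)
        if not block:
            # Source exhausted before the detector became confident.
            break
        detector.feed(block)
    # close() runs the final calculations and fills in detector.result.
    detector.close()
    return detector.result
```

For instance, detect_stream(open(filename, 'rb'), UniversalDetector()) would read a local file only as far as the detector needs.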
If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object.  Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.

Example: Detecting encodings of multiple files
import glob
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result
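When the results are needed for further processing rather than printing, the same reset/feed/close cycle can collect them in a dictionary. This is a sketch under assumptions: the helper name detect_files is hypothetical, and each result is copied with dict() as a precaution, in case a detector implementation reuses its result object after reset.

```python
def detect_files(filenames, detector):
    """Return {filename: result dict} using one shared detector object."""
    results = {}
    for filename in filenames:
        detector.reset()  # forget whatever the previous file fed in
        with open(filename, 'rb') as f:
            for line in f:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        # Store a copy so later reset() calls cannot affect saved results.
        results[filename] = dict(detector.result)
    return results
```

For the example above, detect_files(glob.glob('*.xml'), UniversalDetector()) would return one entry per XML file.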
               
               

This post comes from a ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u3/95893/showart_2023937.html