Chinaunix
Title: Universal Encoding Detector
Author: openspace
Time: 2009-08-11 08:07
http://chardet.feedparser.org/
http://chardet.feedparser.org/docs/how-it-works.html
Reference: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Basic usage
The easiest way to use the Universal Encoding Detector library is with the detect function.
Example: Using the detect function
The detect function takes one argument, a
non-Unicode string. It returns a dictionary containing the
auto-detected character encoding and a confidence level from 0 to 1.
>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
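The snippet above is Python 2 and fetches a live page. Below is a Python 3 sketch of the same one-shot call, using locally constructed bytes instead of a network fetch; it assumes the modern chardet package is installed, whose detect function still takes a byte string and returns a dictionary with 'encoding' and 'confidence' keys. The sample text is only a stand-in for downloaded data.

```python
import chardet

# Stand-in for raw bytes read from the network: Japanese text in EUC-JP.
rawdata = ("日本語のテキスト。\n" * 20).encode("euc-jp")

# detect() takes a byte string (not str) and returns a dict.
result = chardet.detect(rawdata)
print(result["encoding"], result["confidence"])
```

Note that in Python 3 you must pass bytes, not a decoded str; passing a str would defeat the purpose, since the string would already have an encoding.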
Advanced usage
If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.
Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.
Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as
the chardet.detect function returns).
Example: Detecting encoding incrementally
import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}
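For readers on Python 3, here is a sketch of the same incremental loop, feeding lines from an in-memory buffer rather than a socket (urllib.urlopen and the print statement above are Python 2). It assumes the modern chardet package, whose UniversalDetector keeps the feed/done/close/result interface described above; the sample bytes merely stand in for a downloaded page.

```python
from io import BytesIO
from chardet.universaldetector import UniversalDetector

# Stand-in for a downloaded page: repeated Japanese text in EUC-JP.
data = BytesIO(("こんにちは、世界。\n" * 50).encode("euc-jp"))

detector = UniversalDetector()
for line in data:
    detector.feed(line)          # feed one block at a time
    if detector.done:            # detector reached its confidence threshold
        break
detector.close()                 # finalize even if the threshold wasn't hit
print(detector.result)           # dict with 'encoding' and 'confidence' keys
```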
If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.
Example: Detecting encodings of multiple files
import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result
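The reset/feed/close cycle also works without touching the filesystem. Below is a Python 3 sketch that re-uses one detector across several in-memory byte samples instead of *.xml files on disk; it assumes the modern chardet package, and the sample names and data are purely illustrative.

```python
from chardet.universaldetector import UniversalDetector

# Illustrative stand-ins for the contents of separate files.
samples = {
    "utf8-sample": "héllo wörld\n".encode("utf-8") * 30,
    "ascii-sample": b"plain ascii text\n" * 30,
}

detector = UniversalDetector()
results = {}
for name, data in samples.items():
    detector.reset()             # start fresh for each input
    detector.feed(data)
    detector.close()             # finalize this input's result
    results[name] = detector.result
print(results)
```

Re-using one detector this way avoids re-allocating its internal state machines for every file.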
This article comes from the ChinaUnix blog; for the original, see:
http://blog.chinaunix.net/u3/95893/showart_2023937.html