免费注册 查看新帖 |

Chinaunix

  平台 论坛 博客 文库
最近访问板块 发新帖
查看: 2975 | 回复: 0
打印 上一主题 下一主题

How to Use UTF-8 with Python [复制链接]

论坛徽章:
0
跳转到指定楼层
1 [收藏(0)] [报告]
发表于 2008-09-13 10:38 |只看该作者 |倒序浏览

[ 2005-October-01 20:15 ]
Tim Bray
describes
why Unicode and UTF-8 are wonderful
much better than I could, so go read that for an overview of what Unicode is, and why all your programs should support it. What I'm going to tell you is how to use Unicode, and specifically UTF-8, with one of the coolest programming languages,
Python
, but I have also written an introduction to
Using Unicode in C/C++
. Python has good support for Unicode, but there are a few tricks that you need to be aware of. I spent more than a few hours learning these tricks, and I'm hoping that by reading this you won't have to. This is a very quick and dirty introduction. If you need in depth knowledge, or need to learn about Unicode in Java or Windows, see
Unicode for Programmers
. [Updated 2005-09-01: Updated information about XML encoding declarations.]
The Basics
There are two types of strings in Python: byte strings and Unicode strings. As you may have guessed, a byte string is a sequence of bytes. When needed, Python uses your computer's default locale to convert the bytes into characters. On Mac OS X, the default locale is actually UTF-8, but everywhere else, the default is probably ASCII. This creates a byte string:
byteString = "hello world! (in my default locale)"
And this creates a Unicode string:
unicodeString = u"hello Unicode world!"
Convert a byte string into a Unicode string and back again:
s = "hello byte string"
u = unicode( s )
backToBytes = u.encode()
The previous code uses your default character set to perform the conversions. However, relying on the locale's character set is a bad idea, since your application is likely to break as soon as someone from Thailand tries to run it on their computer. In most cases it is probably better to explicitly specify the encoding of the string:
s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )
Now, the byte string s will be treated as a sequence of UTF-8 bytes to create the Unicode string u. The next line stores the UTF-8 representation of u in the byte string backToBytes.
Working With Unicode Strings
Thankfully, everything in Python is supposed to treat Unicode strings identically to byte strings. However, you need to be careful in your own code when testing to see if an object is a string. Do not do this:
if isinstance( s, str ): # BAD: Not true for Unicode strings!
Instead, use the generic string base class,
basestring
:
if isinstance( s, basestring ): # True for both Unicode and byte strings
Reading UTF-8 Files
You can manually convert strings that you read from files, however there is an easier way:
import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
The
codecs module
will take care of all the conversions for you. You can also open a file for writing and it will convert the Unicode strings you pass in to write into whatever encoding you have chosen. However, take a look at the note below about the byte-order marker (BOM).
Working with XML and minidom
I use the
minidom module
for my XML needs mostly because I am familiar with it. Unfortunately, it only handles byte strings so you need to encode your Unicode strings before passing them to minidom functions. For example:
import xml.dom.minidom
xmlData = u"Comment ça va ? Très bien ?"
dom = xml.dom.minidom.parseString( xmlData )
The last line raises an exception: UnicodeEncodeError: 'ascii' codec can't encode character '\ue7' in position 5: ordinal not in range(128). To work around this error, encode the Unicode string into the appropriate format before passing it to minidom, like this:
import xml.dom.minidom
xmlData = u"Comment ça va ? Très bien ?"
dom = xml.dom.minidom.parseString( xmlData.encode( "utf-8" ) )
Minidom can handle any format of byte string, such as Latin-1 or UTF-16. However, it will only work reliably if the XML document has an
encoding declaration
(eg. ). If the encoding declaration is missing, minidom assumes that it is UTF-8. In is a good habit to include an encoding declaration on all your XML documents, in order to guarantee compatability on all systems.
When you get XML out of minidom by calling dom.toxml() or dom.toprettyxml(), minidom returns a Unicode string. You can also pass in an additional encoding="utf-8" parameter to get an encoded byte string, perfect for writing out to a file.
The Byte-Order Marker (BOM)
UTF-8 files sometimes start with a byte-order marker (BOM) to indicate that they are encoded in UTF-8. This is commonly used on Windows. On Mac OS X, applications (eg. TextEdit) ignore the BOM and remove it if the file is saved again. The
W3C HTML Validator
warns that older applications may not be able to handle the BOM. Unicode effectively ignores the marker, so it should not matter when reading the file. You may wish to add this to the beginning of your files to determine if they are encoded in ASCII or UTF-8. The codecs module provides the constant for you to do this:
out = file( "someFile", "w" )
out.write( codecs.BOM_UTF8 )
out.write( unicodeString.encode( "utf-8" ) )
out.close()
You need to be careful when using the BOM and UTF-8. Frankly, I think this is a bug in Python, but what do I know. Python will decode the value of the BOM into a Unicode character, instead of ignoring it. For example (tested with Python 2.3):
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
For UTF-16, Python decoded the BOM into an empty string, but for UTF-8, it decoded it into a character. Why is there a difference? I think the UTF-8 decoder should do the same thing as the UTF-16 decoder and strip out the BOM. However, it doesn't, so you will probably need to detect it and remove it yourself, like this:
import codecs
if s.beginswith( codecs.BOM_UTF8 ):
        # The byte string s begins with the BOM: Do something.
        # For example, decode the string as UTF-8
       
if u[0] == unicode( codecs.BOM_UTF8, "utf8" ):
        # The unicode string begins with the BOM: Do something.
        # For example, remove the character.
# Strip the BOM from the beginning of the Unicode string, if it exists
u.lstrip( unicode( codecs.BOM_UTF8, "utf8" ) )
Writing Python Scripts in Unicode
As you may have noticed from the examples on this page, you can actually write Python scripts in UTF-8. Variables must be in ASCII, but you can include Chinese comments, or Korean strings in your source files. In order for this to work correctly, Python needs to know that your script file is not ASCII. You can do this in one of two ways. First, you can place a UTF-8
byte-order marker
at the beginning of your file, if your editor supports it. Secondly, you can place the following special comment in the first or second lines of your script:
# -*- coding: utf-8 -*-
Any ASCII-compatible encoding is permitted. For details, see the
Defining Python Source Code Encodings
specification.
Other Resources


本文来自ChinaUnix博客,如果查看原文请点:http://blog.chinaunix.net/u/8780/showart_1190447.html
您需要登录后才可以回帖 登录 | 注册

本版积分规则 发表回复

  

北京盛拓优讯信息技术有限公司. 版权所有 京ICP备16024965号-6 北京市公安局海淀分局网监中心备案编号:11010802020122 niuxiaotong@pcpop.com 17352615567
未成年举报专区
中国互联网协会会员  联系我们:huangweiwei@itpub.net
感谢所有关心和支持过ChinaUnix的朋友们 转载本站内容请注明原作者名及出处

清除 Cookies - ChinaUnix - Archiver - WAP - TOP