Posted on 2009-11-29 17:07

2. 词法分析（Lexical analysis）
A Python program is read by a parser.  Input to the parser is a stream of
tokens, generated by the lexical analyzer.  This chapter describes how the
lexical analyzer breaks a file into tokens.
一个Python程序由解析器读入, 输入解析器的是由词法分析器生成的语言符号流。本章讨论词法分析器如何把文件分隔成语言符号。
Python reads program text as Unicode code points; the encoding of a source file
can be given by an encoding declaration and defaults to UTF-8, see PEP 3120
for details.  If the source file cannot be decoded, a SyntaxError is raised.
Python使用Unicode code points作为程序文本，源程序文件的编码可以通过声明显式地修改，默认为UTF-8，详见 PEP 3120 。如果无法解码源代码，就会出现 SyntaxError 异常。
2.1. 行结构（Line structure）
A Python program is divided into a number of logical lines.
一个Python程序被分割成许多逻辑行。
2.1.1. 逻辑行（Logical lines）
The end of a logical line is represented by the token NEWLINE.  Statements
cannot cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g., between statements in compound statements). A logical line is
constructed from one or more physical lines by following the explicit or
implicit line joining rules.
逻辑行的结束以NEWLINE（新行）语言符号表示。语句不能跨多个逻辑行边界，除非语法上允许NEWLINE（例如，复合语句中的语句之间）。一个逻辑行由一个物理行，或者根据显式／隐式行连接规则连接的多个物理行构成。
2.1.2. 物理行（Physical lines）
A physical line is a sequence of characters terminated by an end-of-line
sequence.  In source files, any of the standard platform line termination
sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
form using the ASCII sequence CR LF (return followed by linefeed), or the old
Macintosh form using the ASCII CR (return) character.  All of these forms can be
used equally, regardless of platform.
一个物理行即一个字符序列，它由一个“断行符号序列”结束。在源代码中，任何平台的标准“断行符号序列”都可以使用：Unix形式为ASCII LF（换行）字符；Windows形式为ASCII字符序列CR LF（回车加换行）；旧Macintosh形式为ASCII CR（回车）字符。无论在什么平台上，以上这三种形式都可以使用。
When embedding Python, source code strings should be passed to Python APIs using
the standard C conventions for newline characters (the \n character,
representing ASCII LF, is the line terminator).
在嵌入Python（embedding Python）的场合里，传递给Python API的源代码字符串应该使用标准C的断行习惯，即Unix形式。
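As a rough illustration of the three line-ending forms, the sketch below compiles the same two-statement program with LF, CR LF and CR terminators. It assumes a Python version (3.2 or later, as far as the `compile()` documentation states) in which `compile()` accepts all three forms in a source string; the dictionary keys are just labels chosen for this example.

```python
# Three copies of the same two-statement program, one per line-ending style.
sources = {
    "unix": "x = 1\ny = x + 1\n",          # LF
    "windows": "x = 1\r\ny = x + 1\r\n",   # CR LF
    "old_mac": "x = 1\ry = x + 1\r",       # CR only
}

results = {}
for name, text in sources.items():
    namespace = {}
    # compile() runs the lexical analyzer, which splits the text into physical lines.
    exec(compile(text, "<" + name + ">", "exec"), namespace)
    results[name] = (namespace["x"], namespace["y"])

print(results)
```

All three variants should bind the same values, since the terminator only marks where a physical line ends.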
2.1.3. 注释（Comments）
A comment starts with a hash character (#) that is not part of a string
literal, and ends at the end of the physical line.  A comment signifies the end
of the logical line unless the implicit line joining rules are invoked. Comments
are ignored by the syntax; they are not tokens.
一个注释以 # 字符（它不能是串字面值的一部分）开始，结束于该物理行的结尾。如果没有隐式的行连接，那么注释就意味着该逻辑行的终止。语法分析会忽略注释，它们不被看作是语言符号.
2.1.4. 编码声明（Encoding declarations）
If a comment in the first or second line of the Python script matches the
regular expression coding[=:]\s*([-\w.]+), this comment is processed as an
encoding declaration; the first group of this expression names the encoding of
the source code file. The recommended forms of this expression are :
Python脚本第一行或者第二行中的注释如果与正则表达式 coding[=:]\s*([-\w.]+) 匹配，那么这个注释就被认为是编码声明。此正则表达式的第一组为该源代码文件指定了的编码名称。正则表达式的推荐形式为:
# -*- coding: <encoding-name> -*-
which is recognized also by GNU Emacs, and :
GNU Emacs可识别这种风格，而:
# vim:fileencoding=<encoding-name>
which is recognized by Bram Moolenaar’s VIM.
Bram Moolenaar 的 VIM 可以识别以上这种风格。
If no encoding declaration is found, the default encoding is UTF-8.  In
addition, if the first bytes of the file are the UTF-8 byte-order mark
(b'\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported,
among others, by Microsoft’s notepad).
如果没有找到任何编码声明，默认编码为UTF-8。另外，如果文件的前几个字节为UTF-8字节序标记（即byte-order mark） b'\xef\xbb\xbf' ，那么声明的文件编码就是UTF-8（微软的 notepad 等程序支持这种方式）。
If an encoding is declared, the encoding name must be recognized by Python. The
encoding is used for all lexical analysis, including string literals, comments
and identifiers. The encoding declaration must appear on a line of its own.
如果声明了一种编码，则这个编码名称必须是Python可以识别的。此编码会用于整个词法分析过程，包括字符串字面值、注释和标识符。编码声明必须独占一行。
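The declaration rules can be tried out with a small sketch. The first half applies the regular expression quoted above to individual comment lines; `declared_encoding` is a hypothetical helper written for this example. The second half uses the standard library's `tokenize.detect_encoding`, which also accounts for the UTF-8 byte-order mark; note that it reports normalized codec names, so the spelling it returns may differ from the one written in the comment.

```python
import io
import re
import tokenize

# The pattern quoted in the text above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(comment_line):
    """Return the encoding named in a comment line, or None (hypothetical helper)."""
    match = CODING_RE.search(comment_line)
    return match.group(1) if match else None

emacs_style = declared_encoding("# -*- coding: latin-1 -*-")
vim_style = declared_encoding("# vim:fileencoding=utf-8")
no_declaration = declared_encoding("# just a comment")

def detect(source_bytes):
    """Ask the standard library which encoding it would use for these bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

from_cookie = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")   # normalized name
from_bom = detect(b"\xef\xbb\xbfx = 1\n")                      # BOM present
from_default = detect(b"x = 1\n")                              # nothing declared

print(emacs_style, vim_style, no_declaration)
print(from_cookie, from_bom, from_default)
```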
2.1.5. 显式行连接（Explicit line joining）
Two or more physical lines may be joined into logical lines using backslash
characters (\), as follows: when a physical line ends in a backslash that is
not part of a string literal or comment, it is joined with the following forming
a single logical line, deleting the backslash and the following end-of-line
character.  For example:
两个或更多物理行可以使用反斜线字符( \ )合并成一个逻辑行，具体地说：当一个物理行以反斜线结束时（这个反斜线不能是字符串字面值或注释的一部分），它就同其后的物理行合并成一个逻辑行，同时将该反斜线及其后的行结束符删除，例如:
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
A line ending in a backslash cannot carry a comment.  A backslash does not
continue a comment.  A backslash does not continue a token except for string
literals (i.e., tokens other than string literals cannot be split across
physical lines using a backslash).  A backslash is illegal elsewhere on a line
outside a string literal.
以反斜线结尾的行不能带有注释。反斜线不能接续注释。除了字符串字面值，反斜线也不能接续任何语言符号（即，不是字符串字面值的语言符号不能通过反斜线跨越物理行）。在字符串字面值之外，行内其它地方出现的反斜线都是非法的。
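One way to observe explicit line joining is to run the standard `tokenize` module over a two-line statement. This is only a sketch: `tokenize` is a Python reimplementation of the lexical analyzer rather than the analyzer itself, and the variable names are chosen for the example.

```python
import io
import tokenize

# Two physical lines joined by a trailing backslash into one logical line.
source = "total = 1 + \\\n        2\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# A logical line ends with exactly one NEWLINE token.
newline_count = sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)

# The backslash and the line break leave no token behind.
token_strings = [tok.string for tok in tokens
                 if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)]

namespace = {}
exec(source, namespace)

print(newline_count, token_strings, namespace["total"])
```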
2.1.6. 隐式行连接（Implicit line joining）
Expressions in parentheses, square brackets or curly braces can be split over
more than one physical line without using backslashes. For example:
在小括号, 中括号,大括号中的表达式，不须借助反斜线就可以跨越多个物理行，例如:
month_names = ['Januari', 'Februari', 'Maart',    # These are the
            'April', 'Mei',    'Juni',    # Dutch names
            'Juli', 'Augustus', 'September',  # for the months
            'Oktober', 'November', 'December'] # of the year
Implicitly continued lines can carry comments.  The indentation of the
continuation lines is not important.  Blank continuation lines are allowed.
There is no NEWLINE token between implicit continuation lines.  Implicitly
continued lines can also occur within triple-quoted strings (see below); in that
case they cannot carry comments.
隐式接续的行可以带有注释，接续行如何缩进并不重要。空的接续行是允许的。隐式接续行之间没有NEWLINE语言符号。隐式行连接在三重引用串（后述）中也是合法的，但那种情况下不能加注释。
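The same `tokenize`-based sketch can be applied to implicit joining. Here a bracketed list spans three physical lines, one of them blank and one carrying a comment. In the `tokenize` module, line breaks inside brackets show up as `NL` tokens, which are distinct from the `NEWLINE` token that ends a logical line; that distinction belongs to the module and is an assumption of this example rather than something stated in the text above.

```python
import io
import tokenize

source = (
    "month_names = ['Januari', 'Februari',   # comment on a continued line\n"
    "\n"                                      # blank continuation line
    "  'Maart']\n"                            # indentation here does not matter
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

newline_count = sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)
nl_count = sum(1 for tok in tokens if tok.type == tokenize.NL)
comment_count = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)

namespace = {}
exec(source, namespace)

print(newline_count, nl_count, comment_count, namespace["month_names"])
```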
2.1.7. 空行（Blank lines）
A logical line that contains only spaces, tabs, formfeeds and possibly a
comment, is ignored (i.e., no NEWLINE token is generated).  During interactive
input of statements, handling of a blank line may differ depending on the
implementation of the read-eval-print loop.  In the standard interactive
interpreter, an entirely blank logical line (i.e. one containing not even
whitespace or a comment) terminates a multi-line statement.
一个仅包括空格、制表符、进纸符和一个可选注释的逻辑行，在解析过程中是被忽略的（即不会产生对应的NEWLINE语言符号）。在语句进行交互式输入时，空行的处理依赖于“输入-计算-输出”（read-eval-print）循环的实现方式而不同。在标准交互解释器中，一个纯粹的空行（即不包括任何东西，甚至注释和空白）才会结束多行语句。
2.1.8. 缩进（Indentation）
Leading whitespace (spaces and tabs) at the beginning of a logical line is used
to compute the indentation level of the line, which in turn is used to determine
the grouping of statements.
逻辑行的前导空白（空格和制表符）用于计算行的缩进层次，缩进层次然后用于语句的分组。
Tabs are replaced (from left to right) by one to eight spaces such that the
total number of characters up to and including the replacement is a multiple of
eight (this is intended to be the same rule as used by Unix).  The total number
of spaces preceding the first non-blank character then determines the line’s
indentation.  Indentation cannot be split over multiple physical lines using
backslashes; the whitespace up to the first backslash determines the
indentation.
首先，制表符被转换成（从左到右）一至八个空格，使得直到并包括替换部分的字符总数是八的倍数（这是为了与UNIX的规则保持一致）。然后，根据首个非空白字符前的空格总数计算行的缩进层次。缩进不能用反斜线跨物理行接续；只有第一个反斜线之前的空白字符才用于确定缩进层次。
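The tab rule above can be written out as a short function. This is a minimal sketch of the rule as described, not the interpreter's own code; `indentation_width` is a name invented for the example, and formfeeds and backslashes are deliberately not handled.

```python
def indentation_width(line):
    """Count leading columns, advancing each tab to the next multiple of eight."""
    column = 0
    for char in line:
        if char == " ":
            column += 1
        elif char == "\t":
            # Replace the tab by 1..8 spaces so the total becomes a multiple of 8.
            column = (column // 8 + 1) * 8
        else:
            break  # first non-blank character ends the indentation
    return column

widths = {
    "four_spaces": indentation_width("    x = 1"),
    "one_tab": indentation_width("\tx = 1"),
    "two_spaces_then_tab": indentation_width("  \tx = 1"),
    "eight_spaces_then_tab": indentation_width("        \tx = 1"),
}
print(widths)
```

For whitespace made of spaces and tabs only, the result agrees with `len(prefix.expandtabs(8))`, where `prefix` is the leading whitespace.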
Indentation is rejected as inconsistent if a source file mixes tabs and spaces
in a way that makes the meaning dependent on the worth of a tab in spaces; a
TabError is raised in that case.
如果源文件混合使用了制表符和空格，并且缩进的意义依赖于制表符相当于多少个空格，那么缩进会因为不一致而被拒绝，此时会抛出 TabError 异常。
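A small sketch of the inconsistency check: the block below indents one line with a tab and the next with eight spaces. Under the eight-column rule both lines sit at the same level, but with any other tab width they would not, so the source is expected to be rejected. The variable names are illustrative.

```python
# Line 2 is indented with one tab, line 3 with eight spaces.
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<mixed-indentation>", "exec")
except TabError:
    outcome = "TabError"
else:
    outcome = "accepted"

print(outcome)
```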
Cross-platform compatibility note: because of the nature of text editors on
non-UNIX platforms, it is unwise to use a mixture of spaces and tabs for the
indentation in a single source file.  It should also be noted that different
platforms may explicitly limit the maximum indentation level.
跨平台兼容性注意： 由于在非UNIX平台上的文本编辑器特性，在单个源文件里使用混合空格和制表符的缩进是不明智的。另一个值得注意的地方是不同平台可能明确地限制了最大缩进层次。
A formfeed character may be present at the start of the line; it will be ignored
for the indentation calculations above.  Formfeed characters occurring elsewhere
in the leading whitespace have an undefined effect (for instance, they may reset
the space count to zero).
换页符可以出现在行首，但以上介绍的缩进计算过程会忽略它。在行前置空白的其它位置上出现的换页符会导致未定义的行为（例如，它可能使空格数重置为零）。
The indentation levels of consecutive lines are used to generate INDENT and
DEDENT tokens, using a stack, as follows.
相邻行的缩进层次用于产生语言符号INDENT和DEDENT，在这个过程中使用了堆栈数据结构，如下所述。
Before the first line of the file is read, a single zero is pushed on the stack;
this will never be popped off again.  The numbers pushed on the stack will
always be strictly increasing from bottom to top.  At the beginning of each
logical line, the line’s indentation level is compared to the top of the stack.
If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
one INDENT token is generated.  If it is smaller, it must be one of the
numbers occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is generated.  At the
end of the file, a DEDENT token is generated for each number remaining on the
stack that is larger than zero.
在读入文件第一行之前，先向栈中压入（push）一个零，它以后不会被弹出（pop）。栈中的数字从底部到顶部总是严格递增的。在每个逻辑行的开头处，将该行的缩进层次与栈顶比较：如果两者相等则什么也不会发生；如果它大于栈顶，将其压入栈中，并产生一个INDENT语言符号；如果小于栈顶，那么它的值必须已经出现于栈中，栈中所有大于它的数都将被弹出，并且每弹出一个数就产生一个DEDENT语言符号。到达文件尾时，栈中每个仍然大于零的数字都会产生一个DEDENT语言符号。
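The stack procedure just described can be sketched directly. The function below takes the indentation level of each logical line and returns the resulting sequence of INDENT and DEDENT markers, with a `"LINE"` placeholder standing in for each line's own tokens. The function name and the placeholder are assumptions made for this illustration.

```python
def indent_tokens(levels):
    """Turn per-line indentation levels into INDENT/DEDENT markers."""
    stack = [0]          # the initial zero is never popped
    output = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            output.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                # The new level must be one already on the stack.
                raise IndentationError("inconsistent dedent")
            while stack[-1] > level:
                stack.pop()
                output.append("DEDENT")
        output.append("LINE")   # placeholder for the line's own tokens
    # At end of file, emit one DEDENT per remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        output.append("DEDENT")
    return output

nested = indent_tokens([0, 4, 8, 4, 0])
unterminated = indent_tokens([0, 4, 8])
print(nested)
print(unterminated)
```

Calling `indent_tokens([0, 8, 4])` raises `IndentationError`, because 4 was never pushed on the stack.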
Here is an example of a correctly (though confusingly) indented piece of Python
code:
这是一个有着正确缩进格式的Python代码的例子（虽然有点乱）:
def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
The following example shows various indentation errors:
下面的例子展示了各种缩进错误:
 def perm(l):                       # error: first line indented （首行缩进）
for i in range(len(l)):             # error: not indented （未缩进）
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent （意外缩进）
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent （不一致的缩进）
(Actually, the first three errors are detected by the parser; only the last
error is found by the lexical analyzer — the indentation of return r does
not match a level popped off the stack.)
（事实上, 前三个错误是由解析器发现的。仅仅最后一个错误是由词法分析器找到的— return r 的缩进层次与弹出堆栈的数不匹配。）
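The four kinds of error can be reproduced one at a time with `compile()`. The sketch below uses small stand-in programs rather than the listing above, plus a confusingly but validly indented `perm` for contrast; `classify` is a hypothetical helper. All four errors surface as `IndentationError`, whichever stage detects them.

```python
def classify(source):
    """Compile a snippet and report the exception class name, or 'ok'."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as exc:
        return type(exc).__name__
    return "ok"

outcomes = {
    "first_line_indented": classify(" x = 1\n"),
    "not_indented": classify("def f():\nreturn 1\n"),
    "unexpected_indent": classify("x = 1\n    y = 2\n"),
    "inconsistent_dedent": classify("def f():\n        x = 1\n    return x\n"),
}

# Odd but consistent indentation is accepted.
valid_source = (
    "def perm(l):\n"
    "        # Compute the list of all permutations of l\n"
    "    if len(l) <= 1:\n"
    "                  return [l]\n"
    "    r = []\n"
    "    for i in range(len(l)):\n"
    "             s = l[:i] + l[i+1:]\n"
    "             p = perm(s)\n"
    "             for x in p:\n"
    "              r.append(l[i:i+1] + x)\n"
    "    return r\n"
)
outcomes["confusing_but_valid"] = classify(valid_source)

namespace = {}
exec(valid_source, namespace)
permutation_count = len(namespace["perm"]([1, 2, 3]))

print(outcomes, permutation_count)
```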
2.1.9. 语言符号间的空白（Whitespace between tokens）
Except at the beginning of a logical line or in string literals, the whitespace
characters space, tab and formfeed can be used interchangeably to separate
tokens.  Whitespace is needed between two tokens only if their concatenation
could otherwise be interpreted as a different token (e.g., ab is one token, but
a b is two tokens).
除了位于逻辑行开始处或者字符串字面值当中，空格、制表符和进纸符这些空白字符可以等效地用于分隔语言符号（token）。只有当两个语言符号连接在一起会被解释成另一个语言符号时，才需要用空白分隔它们，例如，ab是一个符号，但a b是两个符号。
2.2. 其它语言符号（Other tokens）
Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist:
identifiers, keywords, literals, operators, and delimiters. Whitespace
characters (other than line terminators, discussed earlier) are not tokens, but
serve to delimit tokens. Where ambiguity exists, a token comprises the longest
possible string that forms a legal token, when read from left to right.
除了NEWLINE、INDENT和DEDENT外，还有以下几类语言符号：标识符，关键字、字面值、运算符和分隔符。空白不是语言符号（除了断行符，如前所述），但可以用于分隔语言符号。如果在构造某语言符号可能存在歧义时，就试图用尽量长的字符串（从左至右读出的）构造一个合法语言符号。
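The longest-match rule and the role of whitespace can be checked with the `tokenize` module. This is a sketch only; `significant_tokens` is a helper invented here that drops NEWLINE and end-of-input markers so that just names, numbers and operators remain.

```python
import io
import tokenize

def significant_tokens(source):
    """Return the strings of NAME, NUMBER and OP tokens in a source line."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)]

joined = significant_tokens("ab\n")            # one identifier
separated = significant_tokens("a b\n")        # whitespace makes two tokens
shift_assign = significant_tokens("x<<=2\n")   # '<<=' wins over '<', '<<', '<='
power_assign = significant_tokens("a**=b\n")   # '**=' wins over '*', '**', '*='

print(joined, separated, shift_assign, power_assign)
```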
2.3. 标识符和关键字（Identifiers and keywords）
Identifiers (also referred to as names) are described by the following lexical
definitions.
标识符（也称为名字）由以下词法定义描述。
The syntax of identifiers in Python is based on the Unicode standard annex
UAX-31, with elaboration and changes as defined below; see also PEP 3131 for
further details.
下面介绍的Python标识符定义是在Unicode standard annex UAX-31的基础上加以修改而成的，更多细节可以参考 PEP 3131 。
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
are the same as in Python 2.x: the uppercase and lowercase letters A through
Z, the underscore _ and, except for the first character, the digits
0 through 9.
在ASCII范围(U+0001..U+007F)内，标识符的有效字符与Python 2.x相同：大小写字母（A-Z）、下划线，以及不能作为标识符开始的数字（0-9）。
Python 3.0 introduces additional characters from outside the ASCII range (see
PEP 3131).  For these characters, the classification uses the version of the
Unicode Character Database as included in the unicodedata module.
Python 3.0引入了ASCII范围之外的额外字符（参见 PEP 3131 ）。这些字符的分类（classification）使用 unicodedata 模块所包含的Unicode Character Database版本。
Identifiers are unlimited in length.  Case is significant.
标识符不限长度，区分大小写。
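identifier  ::=  id_start id_continue*
id_start    ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>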
The Unicode category codes mentioned above stand for:
以上Unicode category code的缩写是：

Lu - uppercase letters

Ll - lowercase letters

Lt - titlecase letters

Lm - modifier letters

Lo - other letters

Nl - letter numbers

Mn - nonspacing marks

Mc - spacing combining marks

Nd - decimal numbers

Pc - connector punctuations
All identifiers are converted into the normal form NFKC while parsing; comparison
of identifiers is based on NFKC.
在解析时，所有标识符都被转换为NFKC形式，标识符的比较是基于NFKC的。
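A brief sketch of the identifier rules, using `str.isidentifier()` for the character classes and `unicodedata.normalize()` for normalization. One point to flag: CPython, following PEP 3131, normalizes identifiers with NFKC, and that is the form this example assumes. The candidate names are arbitrary.

```python
import unicodedata

candidates = ["spam", "_private", "naïve", "变量", "2fast", "with space"]
validity = {name: name.isidentifier() for name in candidates}

# U+FB01 is the single-character ligature "fi"; NFKC folds it to the two
# letters "f" and "i", so the name is stored as plain "file".
ligature_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", ligature_name)

namespace = {}
exec(ligature_name + " = 1", namespace)
stored_under_normalized_name = "file" in namespace

print(validity)
print(normalized, stored_under_normalized_name)
```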
A non-normative HTML file listing all valid identifier characters for Unicode
4.1 can be found at
http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
有一个非规范性的HTML文件列出了Unicode 4.1中所有有效的标识符字符，可以在这里找到：
http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html
2.3.1. 关键字（Keywords）
The following identifiers are used as reserved words, or keywords of the
language, and cannot be used as ordinary identifiers.  They must be spelled
exactly as written here:
以下标识符用作保留字, 或者叫做语言的关键字，它们不能作为普通标识符使用，而且它们必须按如下严格拼写：
False      class      finally    is         return
None       continue   for        lambda     try
True       def        from       nonlocal   while
and        del        global     not        with
as         elif       if         or         yield
assert     else       import     pass
break      except     in         raise
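The table can be cross-checked against the running interpreter with the standard `keyword` module. This is a sketch; later Python versions reserve a few more words than the 3.1 list above, so the check below only confirms that every listed word is still a keyword.

```python
import keyword

# The 33 keywords listed in the table above.
listed = (
    "False class finally is return "
    "None continue for lambda try "
    "True def from nonlocal while "
    "and del global not with "
    "as elif if or yield "
    "assert else import pass "
    "break except in raise"
).split()

# Words from the table that the running interpreter does not treat as keywords.
missing = [word for word in listed if not keyword.iskeyword(word)]

# Keywords the running interpreter knows that are not in the table.
added_later = sorted(set(keyword.kwlist) - set(listed))

# Spelling is exact: case matters.
lowercase_false_is_keyword = keyword.iskeyword("false")

print(missing, added_later, lowercase_false_is_keyword)
```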
2.3.2. 保留的标识符类型（Reserved classes of identifiers）
Certain classes of identifiers (besides keywords) have special meanings.  These
classes are identified by the patterns of leading and trailing underscore
characters:
除了关键字，某些类型的标识符也具有特殊含义，这种标识符一般都以下划线开始或结束：
_*
Not imported by from module import *.  The special identifier _ is used
in the interactive interpreter to store the result of the last evaluation; it is
stored in the builtins module.  When not in interactive mode, _
has no special meaning and is not defined. See section The import statement.
from module import * 不会导入这些名字。在交互式解释器中，特殊标识符 _ 保存上次计算（evaluation）的结果，它保存在 builtins 模块之中。在非交互方式时， _ 没有特殊含义，而且是没有定义的。参见 The import statement 一节。
Note
The name _ is often used in conjunction with internationalization;
refer to the documentation for the gettext module for more
information on this convention.
名字 _ 通常用于国际化开发，关于这个使用习惯，可以参考模块 gettext 。
__*__
System-defined names.  These names are defined by the interpreter and its
implementation (including the standard library); applications should not expect
to define additional names using this convention.  The set of names of this
class defined by Python may be extended in future versions. See section
特殊方法名（Special method names）.
系统预定义的名字。这种名字由解释器及其实现定义（包括标准库）。应用程序不应该使用这种方法定义标识符。Python的未来版本可能会引入更多的这类名字，请参考 特殊方法名（Special method names）。
__*
Class-private names.  Names in this category, when used within the context of a
class definition, are re-written to use a mangled form to help avoid name
clashes between “private” attributes of base and derived classes. See section
标识符(名字) （Identifiers (Names)）.
类私有名字。此类名字出现在类定义的上下文中时，会被改写为一种变形后的名字（mangled form），以避免基类与派生类的“私有”属性发生名字冲突，参考 标识符(名字) （Identifiers (Names)）。
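The class-private rewriting is easy to observe from outside a class. The sketch below uses two invented classes, `Account` and `SavingsAccount`, to show that each class's `__balance` is stored under its own mangled name, so the two do not clash.

```python
class Account:
    def __init__(self):
        self.__balance = 100          # stored as _Account__balance
        self._note = "not mangled"    # a single leading underscore is left alone

class SavingsAccount(Account):
    def __init__(self):
        super().__init__()
        self.__balance = 5            # stored as _SavingsAccount__balance

account = SavingsAccount()
attribute_names = sorted(vars(account))

base_value = account._Account__balance
derived_value = account._SavingsAccount__balance

# Outside a class body the name is not rewritten, so this lookup fails.
plain_lookup_works = hasattr(account, "__balance")

print(attribute_names, base_value, derived_value, plain_lookup_works)
```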
2.4. 字面值（Literals）
Literals are notations for constant values of some built-in types.
字面值是某些内置类型常量的表示法。
2.4.1. 字符串与字节的字面值（String and Bytes literals）
String literals are described by the following lexical definitions:
字符串字面值由以下词法定义描述：
stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "R"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | stringescapeseq
longstringitem  ::=  longstringchar | stringescapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
stringescapeseq ::=  "\" <any source character>

bytesliteral    ::=  bytesprefix(shortbytes | longbytes)
bytesprefix     ::=  "b" | "B"
shortbytes      ::=  "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes       ::=  "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem  ::=  shortbyteschar | bytesescapeseq
longbytesitem   ::=  longbyteschar | bytesescapeseq
shortbyteschar  ::=  <any ASCII character except "\" or newline or the quote>
longbyteschar   ::=  <any ASCII character except "\">
bytesescapeseq  ::=  "\" <any ASCII character>
One syntactic restriction not indicated by these productions is that whitespace
is not allowed between the stringprefix or bytesprefix and the
rest of the literal. The source character set is defined by the encoding
declaration; it is UTF-8 if no encoding declaration is given in the source file;
see section 编码声明（Encoding declarations）.
一个上面没有表示出来的语法限制是，在 stringprefix 或 bytesprefix 与字面值的其余部分之间不允许出现空白字符。源代码的字符集由编码声明定义，如果源文件内没有指定编码声明，则默认为UTF-8，参见 编码声明（Encoding declarations）。
In plain English: Both types of literals can be enclosed in matching single quotes
(') or double quotes (").  They can also be enclosed in matching groups
of three single or double quotes (these are generally referred to as
triple-quoted strings).  The backslash (\) character is used to escape
characters that otherwise have a special meaning, such as newline, backslash
itself, or the quote character.
通俗地讲，这两种字面值都可以用成对的单引号( ' )或双引号( " )括住。它们也可以用成对的三个单引号或三个双引号括住（这通常叫做三重引用串）。反斜线( \ )用于转义那些本来有特殊含义的字符，例如新行符、反斜线本身或者引用字符。
String literals may optionally be prefixed with a letter 'r' or 'R';
such strings are called raw strings and treat backslashes as literal
characters.  As a result, '\U' and '\u' escapes in raw strings are not
treated specially.
字符串字面值可以用 'r' 或 'R' 开头，这样的字符串字面值叫作原始串，它把反斜线当作普通字符。因此，原始串中的 '\U' 和 '\u' 转义不会得到特殊处理。
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type.  They
may only contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
字节串字面值一定要以 'b' 或 'B' 开始，这会产生一个 bytes 类型的实例，而不是 str 类型的。它只能包括ASCII字符，数值大于等于128的字节必须用转义序列表达。
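The differences between the three prefixes can be seen in a few lines. This sketch uses arbitrary sample values; the `compile()` call at the end feeds the lexer a bytes literal containing a non-ASCII character directly, which is expected to be rejected.

```python
text = "caf\u00e9"        # str: the escape names one Unicode character
data = b"caf\xc3\xa9"     # bytes: values >= 128 written with escapes
raw = r"\u00e9"           # raw str: the backslash is kept literally

types = (type(text).__name__, type(data).__name__, type(raw).__name__)
lengths = (len(text), len(data), len(raw))
round_trip = data.decode("utf-8") == text

try:
    compile('b"caf\u00e9"', "<bytes-literal>", "eval")
except SyntaxError:
    non_ascii_bytes = "rejected"
else:
    non_ascii_bytes = "accepted"

print(types, lengths, round_trip, non_ascii_bytes)
```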
In triple-quoted strings, unescaped newlines and quotes are allowed (and are
retained), except that three unescaped quotes in a row terminate the string.  (A
“quote” is the character used to open the string, i.e. either ' or ".)
在三重引用串中，未转义新行和引用字符是允许的（并且会被保留），除非三个连续的引用字符结束了该串。（引用字符指用于开始字符串的字符, 如 ’ 和 ” ）
Unless an 'r' or 'R' prefix is present, escape sequences in strings are
interpreted according to rules similar to those used by Standard C.  The
recognized escape sequences are:
如果没有使用 'r' 或 'R' 前缀，那么字符串中的转义序列就按照类似C标准中的规则解释，可接受的转义序列如下：
Escape Sequence    Meaning                            Notes
\newline           Backslash and newline ignored
\\                 Backslash (\)
\'                 Single quote (')
\"                 Double quote (")
\a                 ASCII Bell (BEL)
\b                 ASCII Backspace (BS)
\f                 ASCII Formfeed (FF)
\n                 ASCII Linefeed (LF)
\r                 ASCII Carriage Return (CR)
\t                 ASCII Horizontal Tab (TAB)
\v                 ASCII Vertical Tab (VT)
\ooo               Character with octal value ooo     (1,3)
\xhh               Character with hex value hh        (2,3)
Escape sequences only recognized in string literals are:
只由字符串字面值支持的转义字符有：
Escape Sequence    Meaning                                         Notes
\N{name}           Character named name in the Unicode database
\uxxxx             Character with 16-bit hex value xxxx            (4)
\Uxxxxxxxx         Character with 32-bit hex value xxxxxxxx        (5)
Notes:

1. As in Standard C, up to three octal digits are accepted.
   与C标准相同，最多只接受三个八进制数字。

2. Unlike in Standard C, exactly two hex digits are required.
   与C标准不同，这里要求恰好两个十六进制数字。

3. In a bytes literal, hexadecimal and octal escapes denote the byte with the
   given value. In a string literal, these escapes denote a Unicode character
   with the given value.
   在字节字面值中，十六进制和八进制转义字符都是指定一个字节的值。在字符串字面值中，这些转义字符指定的是一个Unicode字符的值。

4. Individual code units which form parts of a surrogate pair can be encoded using
   this escape sequence. Exactly four hex digits are required.
   任何构成surrogate pair一部分的单独code unit都可以使用这种转义序列编码。这里要求恰好四个十六进制数字。

5. Any Unicode character can be encoded this way, but characters outside the Basic
   Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is
   compiled to use 16-bit code units (the default).  Individual code units which
   form parts of a surrogate pair can be encoded using this escape sequence.
   任何Unicode字符都可以用这种方式编码，但如果Python是按16位code unit编译的话（默认），在基本多语言平面（BMP）之外的字符会使用surrogate pair编码。任何构成surrogate pair一部分的单独code unit都可以使用这种转义序列编码。
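The escape forms in the two tables can be compared side by side. In this sketch every entry of `same_letter` spells the letter A in a different way. The `is_rejected` helper is hypothetical; it is used to check how current CPython treats a `\x` or `\u` escape with too few hex digits.

```python
# Five spellings of the same one-character string.
same_letter = [
    "\x41",                          # hex escape
    "\101",                          # octal escape
    "\u0041",                        # 16-bit escape
    "\U00000041",                    # 32-bit escape
    "\N{LATIN CAPITAL LETTER A}",    # named character
]

# In bytes literals, hex and octal escapes denote a single byte value.
same_byte = [b"\x41", b"\101"]

# The same escape means a code point in str and a byte in bytes.
as_text = "\xe9"
as_bytes = b"\xe9"
text_utf8 = as_text.encode("utf-8")

def is_rejected(literal_source):
    """Return True if the lexer refuses the given literal (hypothetical helper)."""
    try:
        compile(literal_source, "<literal>", "eval")
    except SyntaxError:
        return True
    return False

short_hex_rejected = is_rejected(r"'\x4'")
short_u_rejected = is_rejected(r"'\u041'")

print(same_letter, same_byte, text_utf8, short_hex_rejected, short_u_rejected)
```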
Unlike Standard C, all unrecognized escape sequences are left in the string
unchanged, i.e., the backslash is left in the string.  (This behavior is
useful when debugging: if an escape sequence is mistyped, the resulting output
is more easily recognized as broken.)  It is also important to note that the
escape sequences only recognized in string literals fall into the category of
unrecognized escapes for bytes literals.
不像C标准，所有不能被解释的转义序列留在串不作改变，即反斜线留在串中（这个行为在调试中特别有用：如果有转义字符输错了，可以很容易地判断出来）。但也要留意，字节字面值并不接受那些只有在字符串字面值内有效的转义字符。
Even in a raw string, string quotes can be escaped with a backslash, but the
backslash remains in the string; for example, r"\"" is a valid string
literal consisting of two characters: a backslash and a double quote; r"\"
is not a valid string literal (even a raw string cannot end in an odd number of
backslashes).  Specifically, a raw string cannot end in a single backslash
(since the backslash would escape the following quote character).  Note also
that a single backslash followed by a newline is interpreted as those two
characters as part of the string, not as a line continuation.
即使在原始串中，引用字符也可以使用反斜线转义，但反斜线会保留在字符串中，例如， r"\"" 是一个有效的字符串字面值，它由两个字符组成：一个反斜线和一个双引号；而 r"\" 则不是有效的字符串字面值（即使是原始串也不能以奇数个反斜线结束）。具体地说，原始串不能以单个反斜线结束（因为反斜线会把后面的引用字符转义）。同时也要注意，后面紧跟新行符的单个反斜线会被解释为串中的这两个字符，而不是续行处理。
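The raw-string corner cases above can be verified directly. In this sketch the literals under test are passed to `eval()` and `compile()` as source text, so that the backslashes being examined are easy to see; the variable names are illustrative.

```python
# r"\"" : the backslash protects the quote and also stays in the string.
quoted = eval(r'''r"\""''')

# r"\" : the backslash escapes the closing quote, so the literal never ends.
try:
    compile(r'''r"\"''', "<raw>", "eval")
except SyntaxError:
    single_backslash = "rejected"
else:
    single_backslash = "accepted"

# A backslash followed by a newline inside a raw string keeps both characters.
continued = eval('r"a\\\nb"')

print(repr(quoted), single_backslash, repr(continued))
```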
2.4.2. 字符串字面值的连接（String literal concatenation）
Multiple adjacent string literals (delimited by whitespace), possibly using
different quoting conventions, are allowed, and their meaning is the same as
their concatenation.  Thus, "hello" 'world' is equivalent to
"helloworld".  This feature can be used to reduce the number of backslashes
needed, to split long strings conveniently across long lines, or even to add
comments to parts of strings, for example:
多个相邻的字符串字面值（以空白分隔），可以使用不同的引用习惯，这是允许的，其含义与它们连接起来的结果相同。因此, "hello" 'world' 等价于 "helloworld" 。这个功能可以用来减少需要的反斜线数量，方便地把长字符串拆分到多行，甚至可以给串的某个部分加注释，例如:
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
Note that this feature is defined at the syntactical level, but implemented at
compile time.  The ‘+’ operator must be used to concatenate string expressions
at run time.  Also note that literal concatenation can use different quoting
styles for each component (even mixing raw strings and triple quoted strings).
注意这个功能是在语法层次上定义的，但却是在编译时实现的。在运行时连接字符串表达式必须使用”+”运算符。再次提醒，在字面值连接时，不同的引用字符可以混用，甚至原始串与三重引用串也可以混合使用。
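A short sketch of adjacent-literal concatenation, including the mixing of quoting styles mentioned above. The names are chosen for the example; the last lines contrast it with run-time concatenation, which needs the `+` operator.

```python
import re

greeting = "hello" 'world'                 # two literals, one string

identifier_re = re.compile("[A-Za-z_]"       # letter or underscore
                           "[A-Za-z0-9_]*"   # letter, digit or underscore
                           )

mixed = r"\d+" """-""" 'x'                 # raw, triple-quoted and plain pieces

# Concatenating the values of expressions happens at run time and needs '+'.
prefix = "hello"
joined_at_runtime = prefix + "world"

print(greeting, identifier_re.pattern, mixed, joined_at_runtime)
```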
2.4.3. 数值型的字面值（Numeric literals）
There are three types of numeric literals: integers, floating point numbers, and
imaginary numbers.  There are no complex literals (complex numbers can be formed
by adding a real number and an imaginary number).
存在有三种类型的数值型字面值：整数、浮点数和虚数。没有复数字面值（复数可以用一个实数加上一个虚数的方法构造）
Note that numeric literals do not include a sign; a phrase like -1 is
actually an expression composed of the unary operator ‘-‘ and the literal
1.
注意数值型的字面值并不包括正负号，像 -1 实际上是一个由一元运算符 '-' 和字面值 1 组成的表达式。
2.4.4. 整数字面值（Integer literals）
Integer literals are described by the following lexical definitions:
整数字面值由以下词法定义描述：
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"
There is no limit for the length of integer literals apart from what can be
stored in available memory.
没有对整数长度的软件限制，其大小只取决于有效内存的容量。
Note that leading zeros in a non-zero decimal number are not allowed. This is
for disambiguation with C-style octal literals, which Python used before version
3.0.
注意，非零十进制数字中不允许用0作为前缀，这种写法会与C语言风格的八进制字面值产生歧义（用于3.0之前版本的Python）。
Some examples of integer literals:
整数字面值的一些例子:
7    2147483647                      0o177 0b100110111
3    79228162514264337593543950336    0o377 0x100000000
   79228162514264337593543950336             0xdeadbeef
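The integer forms can be evaluated to confirm what each prefix means. The sketch below reuses literals from the examples above and then checks that a non-zero decimal literal with a leading zero is refused; `leading_zero` is just a label for the outcome.

```python
octal = 0o177                     # base 8
binary = 0b100110111              # base 2
hexadecimal = 0x100000000         # base 16
big = 79228162514264337593543950336   # limited only by available memory
zeros = 000                       # "0"+ : any run of zeros is the integer 0

try:
    compile("x = 0177", "<old-octal>", "exec")   # C-style octal, pre-3.0 syntax
except SyntaxError:
    leading_zero = "rejected"
else:
    leading_zero = "accepted"

print(octal, binary, hexadecimal, big == 2 ** 96, zeros, leading_zero)
```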
2.4.5. 浮点型字面值（Floating point literals）
Floating point literals are described by the following lexical definitions:
浮点型的字面值可以用以下词法定义描述：
floatnumber   ::=  pointfloat | exponentfloat
pointfloat    ::=  [intpart] fraction | intpart "."
exponentfloat ::=  (intpart | pointfloat) exponent
intpart       ::=  digit+
fraction      ::=  "." digit+
exponent      ::=  ("e" | "E") ["+" | "-"] digit+
Note that the integer and exponent parts are always interpreted using radix 10.
For example, 077e010 is legal, and denotes the same number as 77e10. The
allowed range of floating point literals is implementation-dependent. Some
examples of floating point literals:
注意整数部分和指数部分都看作是十进制的。例如， 077e010 是合法的，它等价于 77e10 。浮点型字面的允许范围是依赖实现，以下是一些浮点数的例子:
3.14 10. .001 1e100 3.14e-10 0e0
Note that numeric literals do not include a sign; a phrase like -1 is
actually an expression composed of the unary operator - and the literal
1.
注意数值型字面值并不包括正负号，像 -1 实际上是一个由一元运算符 '-' 和字面值 1 组成的表达式。
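Two of the points above lend themselves to a quick check: the radix-10 reading of 077e010, and the fact that -1 is an operator applied to a literal. This sketch uses the standard `ast` module to inspect how `-1` is parsed; the variable names are illustrative.

```python
import ast

# Leading zeros do not make the integer or exponent part octal.
same_value = 077e010 == 77e10

samples = [3.14, 10., .001, 1e100, 3.14e-10, 0e0]
all_floats = all(type(value) is float for value in samples)

# "-1" parses as a unary minus applied to the literal 1.
node = ast.parse("-1", mode="eval").body
is_unary_minus = isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub)
operand_value = node.operand.value

print(same_value, all_floats, is_unary_minus, operand_value)
```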
2.4.6. 虚数字面值（Imaginary literals）
Imaginary literals are described by the following lexical definitions:
虚数字面值可以用下面词法定义描述：
imagnumber ::=  (floatnumber | intpart) ("j" | "J")
An imaginary literal yields a complex number with a real part of 0.0.  Complex
numbers are represented as a pair of floating point numbers and have the same
restrictions on their range.  To create a complex number with a nonzero real
part, add a floating point number to it, e.g., (3+4j).  Some examples of
imaginary literals:
虚数是实部为零的复数。复数由一对有着相同取值范围的浮点数对表示。为了创建一个非零实部的复数，可以对它增加一个浮点数，例如， (3+4j) 。下面是一些例子:
3.14j 10.j 10j    .001j 1e100j  3.14e-10j
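A minimal sketch of how imaginary literals behave: each one yields a complex number whose real part is 0.0, and a complex value with a nonzero real part comes from an addition expression. The names are illustrative.

```python
imaginary = 10j
also_imaginary = 10.j            # a floatnumber followed by j
z = 3 + 4j                       # an expression, not a single literal

kinds = (type(imaginary).__name__, type(z).__name__)
real_part = imaginary.real
parts = (z.real, z.imag)
magnitude = abs(z)

print(kinds, real_part, parts, magnitude)
```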
2.5. 运算符（Operators）
The following tokens are operators:
运算符包括以下语言符号:
+       -       *       **      /       //      %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=
2.6. 分隔符（Delimiters）
The following tokens serve as delimiters in the grammar:
以下符号用作语法上的分隔符:
(       )       [       ]       {       }
,       :       .       ;       @       =
+=      -=      *=      /=      //=     %=
&=      |=      ^=      >>=     <<=     **=
The period can also occur in floating-point and imaginary literals.  A sequence
of three periods has a special meaning as an ellipsis literal. The second half
of the list, the augmented assignment operators, serve lexically as delimiters,
but also perform an operation.
句点也可以出现在浮点数和虚数字面值中。连续三个句点的序列具有特殊含义，表示省略号（ellipsis）字面值。列表的后半部分，即增量赋值运算符，在词法上是分隔符，同时也执行运算。
The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer:
以下ASCII可打印字符，在作为其它语言符号的一部分时有特殊含义，或者对于词法分析器具有特殊作用:
'    "    #    \
The following printing ASCII characters are not used in Python.  Their
occurrence outside string literals and comments is an unconditional error:
以下ASCII可打印字符，并不在Python中使用，当它们出现在注释和字符串字面值之外时就是非法的:
$    ?

This article comes from a ChinaUnix blog; the original post is at: http://blog.chinaunix.net/u1/42957/showart_2106775.html

Python 3.1.1 Language Reference, Chinese-English parallel edition: Lexical analysis