Sexy Lexing with Python

Posted 2009-04-06 17:18

Lexical analysis, a daunting task, right? Wrong! In the following document we'll walk through different methods of lexical scanning in Python. First, we'll look at a pre-built solution found in the standard library, and then at a custom-built solution.
Using re.Scanner
In the re module there is a class called Scanner that can do lexical scanning. It is completely undocumented apart from a small example code block found on the Python-Dev mailing list, but it is well worth mentioning. It works by feeding in a list of regular expressions and callback functions linked to them. When it matches a token, it first runs its value through the appropriate callback and then appends the result to the token list being returned. If the scanner reaches a spot where a token match cannot be found, it returns whatever matches it did find along with the rest of the document that couldn't be matched. Here is an example:
import re

def identifier(scanner, token): return "IDENT", token
def operator(scanner, token):   return "OPERATOR", token
def digit(scanner, token):      return "DIGIT", token
def end_stmnt(scanner, token):  return "END_STATEMENT"

scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", identifier),
    (r"\+|\-|\\|\*|\=", operator),
    (r"[0-9]+(\.[0-9]+)?", digit),
    (r";", end_stmnt),
    (r"\s+", None),
    ])

tokens, remainder = scanner.scan("foo = 5 * 30; bar = bar - 60;")
for token in tokens:
    print token
This produces the following output:
('IDENT', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
END_STATEMENT
('IDENT', 'bar')
('OPERATOR', '=')
('IDENT', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
END_STATEMENT
Truly easy, fast, and relatively simple to understand.
This is perfect for small projects, but it has some downsides: there is no simple way to handle errors, and whitespace isn't handled implicitly (you have to add a rule to skip it). Additionally, it has to tokenize the whole document before it can return anything, which can get costly on larger inputs.
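To make the error-handling point concrete, here is a small sketch reusing the scanner built above; the input string is made up for illustration and contains a '?', which none of the rules cover, so scan() simply stops there and hands back the unconsumed remainder, leaving error reporting entirely to you:
# Reusing the scanner from the example above. '?' matches no rule,
# so scanning stops and the rest of the input comes back untouched.
tokens, remainder = scanner.scan("foo = 5 ? 30;")
for token in tokens:
    print token
print "Could not match: %r" % remainder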
Custom-Built Lexer
I decided to build a custom lexer as a way to break away from re.Scanner. Here is the code for the lexer itself. It is broken into three classes: UnknownTokenError, which is raised when an unrecognized token is found; Lexer, which holds the settings for scanning; and _InputScanner, which is in charge of scanning a specific input, as the name implies. A few benefits built into the Lexer include automatic whitespace handling (if desired) and the ability to easily make the scan case-insensitive. Additionally, you can provide a callback to run a token through before it is returned, by making the rule a tuple of the pattern and the callback.
import re


class UnknownTokenError(Exception):
    """ This exception is for use to be thrown when an unknown token is
        encountered in the token stream. It hols the line number and the
        offending token.
    """
    def __init__(self, token, lineno):
        self.token = token
        self.lineno = lineno

    def __str__(self):
        return "Line #%s, Found token: %s" % (self.lineno, self.token)


class _InputScanner(object):
    """ This class manages the scanning of a specific input. An instance of it is
        returned when scan() is called. It is built to be great for iteration. This is
        mainly to be used by the Lexer and ideally not directly.
    """
    _position = 0

    def __init__(self, lexer, input):
        """ Put the lexer into this instance so the callbacks can reference it
            if needed.
        """
        self.lexer = lexer
        self.input = input

    def __iter__(self):
        """ All of the code for iteration is controlled by the class itself.
            This and next() (or __next__() in Python 3.0) are so syntax
            like `for token in Lexer(...):` is valid and works.
        """
        return self

    def next(self):
        """ Used for iteration. It returns token after token until there
            are no more tokens. (change this to __next__(self) if using Py3.0)
        """
        if not self.done_scanning():
            token = self.scan_next()
            # scan_next() may return None if only whitespace remained.
            if token is not None:
                return token
        raise StopIteration

    def done_scanning(self):
        """ A simple boolean function that returns true if scanning is
            complete and false if it isn't.
        """
        return self._position >= len(self.input)

    def scan_next(self):
        """ Retreive the next token from the input. If the
            flag `omit_whitespace` is set to True, then it will
            skip over the whitespace characters present.
        """
        if self.done_scanning():
            return None
        if self.lexer.omit_whitespace:
            match = self.lexer.ws_regexc.match(self.input, self._position)
            if match:
                self._position = match.end()
                # Skipping trailing whitespace may finish the scan.
                if self.done_scanning():
                    return None
        match = self.lexer.regexc.match(self.input, self._position)
        if match is None:
            lineno = self.input[:self._position].count("\n") + 1
            raise UnknownTokenError(self.input[self._position], lineno)
        self._position = match.end()
        value = match.group(match.lastgroup)
        if match.lastgroup in self.lexer._callbacks:
            value = self.lexer._callbacks[match.lastgroup](self, value)
        return match.lastgroup, value


class Lexer(object):
    """ A lexical scanner. It takes in an input and a set of rules based
        on regular expressions. It then scans the input and returns the
        tokens one-by-one. It is meant to be used through iterating.
    """

    def __init__(self, rules, case_sensitive=True, omit_whitespace=True):
        """ Set up the lexical scanner. Build and compile the regular expression
            and prepare the whitespace searcher.
        """
        self.omit_whitespace = omit_whitespace
        self.case_sensitive = case_sensitive
        self._callbacks = {}
        parts = []
        for name, rule in rules.items():
            if not isinstance(rule, str):
                rule, callback = rule
                self._callbacks[name] = callback
            # Wrap each rule in a named group so match.lastgroup reports
            # which rule matched.
            parts.append("(?P<%s>%s)" % (name, rule))
        if self.case_sensitive:
            flags = re.M
        else:
            flags = re.M|re.I
        self.regexc = re.compile("|".join(parts), flags)
        self.ws_regexc = re.compile(r"\s*", re.MULTILINE)

    def scan(self, input):
        """ Return a scanner built for matching through the `input` field.
            The scanner that it returns is built well for iterating.
        """
        return _InputScanner(self, input)
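A quick note on how the combined pattern works: each rule gets wrapped in a named group (?P<NAME>...) and the groups are joined with |, so after a match, match.lastgroup reports which rule fired and match.group(match.lastgroup) gives its text. Here is a minimal stand-alone sketch of that mechanism; the two rules are just examples and not part of the Lexer itself:
import re

# Two example rules combined into one pattern via named groups.
combined = re.compile(r"(?P<DIGIT>[0-9]+)|(?P<IDENTIFIER>[a-zA-Z_]\w*)")

m = combined.match("42 apples", 0)
print m.lastgroup, m.group(m.lastgroup)   # -> DIGIT 42

m = combined.match("42 apples", 3)
print m.lastgroup, m.group(m.lastgroup)   # -> IDENTIFIER apples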
This version does on-the-fly scanning by building the scanner class as an iterator. You can work with a token the moment it is scanned, before any other tokens are scanned. This can help reduce overhead when you have a large document and may need to exit prematurely. And, of course, when you write your own lexer, it is much easier to modify it to your needs. Now let's test the above code and see what sort of token stream we end up with.
def stmnt_callback(scanner, token):
    """ This is just an example of providing a function to run the
        token through.
    """
    return ""

rules = {
    "IDENTIFIER": r"[a-zA-Z_]\w*",
    "OPERATOR":   r"\+|\-|\\|\*|\=",
    "DIGIT":      r"[0-9]+(\.[0-9]+)?",
    "END_STMNT":  (";", stmnt_callback),
    }

lex = Lexer(rules, case_sensitive=True)
for token in lex.scan("foo = 5 * 30; bar = bar - 60;"):
    print token
Outputs:
('IDENTIFIER', 'foo')
('OPERATOR', '=')
('DIGIT', '5')
('OPERATOR', '*')
('DIGIT', '30')
('END_STMNT', '')
('IDENTIFIER', 'bar')
('OPERATOR', '=')
('IDENTIFIER', 'bar')
('OPERATOR', '-')
('DIGIT', '60')
('END_STMNT', '')
Pretty easy to understand, right? A great thing about the `Lexer` is that it is easy to subclass. For instance, in a project I'm working on for a complex template parser, I added the ability to scan only inside specific tags while treating non-tag data as its own type of token. Maybe I'll cover that in more detail in a future post.
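To wrap up, here is a small sketch of the early-exit and error-handling benefits mentioned earlier, assuming the Lexer, rules, and lex objects defined above; the input and message are illustrative only:
# Stop as soon as the first statement ends; the rest of the input is
# never tokenized. Unrecognized characters surface as UnknownTokenError.
try:
    for name, value in lex.scan("foo = 5 * 30; bar = bar - 60;"):
        if name == "END_STMNT":
            break
        print name, value
except UnknownTokenError, e:
    print "Lexing failed:", e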


This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/78/showart_1891405.html