Last edited by oldbeginner on 2014-03-08 16:42
*****************************
5.6 Drawing the Line Between Lexer and Parser
Pairing the two up (as the saying goes, mixed pairs make light work)
*****************************
Because ANTLR lexer rules can use recursion, lexers are technically as powerful
as parsers. That means we could match even grammatical structure in
the lexer.
I'd like to enter the men's singles event.
Where to draw the line between the lexer and the parser is partially a function
of the language but also a function of the intended application.
Fortunately,
a few rules of thumb will get us pretty far.
• Match and discard anything in the lexer that the parser does not need to
see at all. Recognize and toss out things like whitespace and comments
for programming languages. Otherwise, the parser would have to constantly
check to see whether there are comments or whitespace in between
tokens.
• Match common tokens such as identifiers, keywords, strings, and numbers
in the lexer. The parser has more overhead than the lexer, so we shouldn’t
burden the parser with, say, putting digits together to recognize integers.
• Lump together into a single token type those lexical structures that the
parser does not need to distinguish. For example, if our application treats
integer and floating-point numbers the same, then lump them together
as token type NUMBER. There’s no point in sending separate token types
to the parser.
• Lump together anything that the parser can treat as a single entity. For
example, if the parser doesn’t care about the contents of an XML tag, the
lexer can lump everything between angle brackets into a single token type
called TAG.
• On the other hand, if the parser needs to pull apart a lump of text to
process it, the lexer should pass the individual components as tokens to
the parser. For example, if the parser needs to process the elements of
an IP address, the lexer should send individual tokens for the IP components
(integers and periods).
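The rules of thumb above can be sketched as a minimal hand-written tokenizer. This is a plain Python sketch using the standard re module, not ANTLR-generated code; the token names and TOKEN_SPEC table are illustrative assumptions chosen to mirror the log-file example that follows.

```python
import re

# Minimal hand-rolled tokenizer illustrating the rules of thumb:
# whitespace is matched then discarded, a quoted string is lumped
# into one STRING token, and digit runs are assembled into INT
# tokens so the parser never has to put digits together itself.
TOKEN_SPEC = [
    ("WS",     r"[ \t]+"),     # matched in the lexer, thrown away
    ("STRING", r'"[^"]*"'),    # lump quoted text into a single token
    ("INT",    r"[0-9]+"),     # put digits together in the lexer
    ("DOT",    r"\."),         # pass '.' through as its own token
    ("NL",     r"\n"),         # record terminator
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "WS":   # discard what the parser never sees
            tokens.append((m.lastgroup, m.group()))
    return tokens
```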
To see how the intended application affects what we match in the lexer vs.
the parser, imagine processing a log file from a web server that has one record
per line.
192.168.209.85 "GET /download/foo.html HTTP/1.0" 200
With a complete set of tokens, we can make parser rules that match the
records in a log file.
file : row+ ; // parser rule matching rows of log file
row : IP STRING INT NL ; // match log file record
IP : INT '.' INT '.' INT '.' INT ; // 192.168.209.85
INT : [0-9]+ ; // match IP octet or HTTP result code
STRING: '"' .*? '"' ; // matches the HTTP protocol command
NL : '\n' ; // match log file record terminator
WS : ' ' -> skip ; // ignore spaces
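As a sketch of what this lexer emits, here is the token stream for the sample record, simulated with Python's standard re module rather than ANTLR. One assumption to note: Python's alternation prefers the first matching alternative, so the IP pattern is listed before INT; ANTLR instead prefers the longest match, which also picks IP over INT for this input.

```python
import re

# Simulated lexer for the log-file grammar above: the whole address
# lumps into a single IP token because the IP alternative is tried
# (and matches) before INT.
LEXER = re.compile(
    r'(?P<IP>\d+\.\d+\.\d+\.\d+)'
    r'|(?P<STRING>"[^"]*")'
    r'|(?P<INT>\d+)'
    r'|(?P<NL>\n)'
    r'|(?P<WS> )'
)

def token_types(line):
    """Return the token-type sequence, with WS skipped as in the grammar."""
    return [m.lastgroup for m in LEXER.finditer(line) if m.lastgroup != "WS"]
```

Running it on the sample record yields one token per grammar element of rule row: IP, STRING, INT, NL.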
With convenient library functions like
split('.'), we could pass IP addresses as strings to the parser and process them
there. But, it’s better to have the lexer match the IP address lexical structure
and pass the components to the parser as tokens.
file : row+ ; // parser rule matching rows of log file
row : ip STRING INT NL ; // match log file record
ip : INT '.' INT '.' INT '.' INT ; // match IPs in parser
INT : [0-9]+ ; // match IP octet or HTTP result code
STRING: '"' .*? '"' ; // matches the HTTP protocol command
NL : '\n' ; // match log file record terminator
WS : ' ' -> skip ; // ignore spaces
Switching lexer rule IP to parser rule ip shows how easily we can shift the
dividing line.
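For comparison, the split('.') alternative mentioned above looks something like this in application code (a plain Python sketch; the helper name ip_octets is hypothetical, not from the book):

```python
def ip_octets(ip_text):
    """Split an IP address string into its integer octets.

    This is the library-function route: the lexer hands the parser one
    opaque IP string, and the application pulls it apart afterward,
    instead of the grammar delivering INT and '.' tokens directly.
    """
    return [int(part) for part in ip_text.split('.')]
```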
In this chapter, we learned how to work from a representative sample of the
language, or language documentation, to create grammar pseudocode and
then a formal grammar in ANTLR notation.
We also studied the common
language patterns: sequence, choice, token dependency, and nested phrase.
In the lexical realm, we looked at implementations for the most common
tokens: identifiers, numbers, strings, comments, and whitespace.