Thread starter: oldbeginner
Compiler Study Notes 02 (ANTLR, citizen of the world) - 2014_2_27

#21 · Posted 2014-03-07 17:12
*******************************
5.4 Dealing with Precedence, Left Recursion, and Associativity

So many people fighting over the same rice bowl
*******************************


   

Expressions have always been a hassle to specify with top-down grammars
and to recognize by hand with recursive-descent parsers,

first because the
most natural grammar is ambiguous and second because the most natural
specification uses a special kind of recursion called left recursion.

Keep in mind that top-down grammars and parsers, in their classic form, cannot deal with left recursion.

Literally encoding this as a grammar leads to the following
reasonable-looking rule:
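The rule itself was an image in the original post and is missing here; reconstructed from the book's example, the natural-looking expression rule is:

expr : expr '*' expr   // match subexpressions joined with '*' operator
     | expr '+' expr   // match subexpressions joined with '+' operator
     | INT             // match simple integer atom
     ;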


The problem is that the rule as specified can interpret input such as 1+2*3 in
the two ways depicted by the middle and right parse trees. They’re different
because the middle tree says to add 1 to the result of multiplying 2 and 3,
whereas the tree on the right multiplies 3 by the result of adding 1 and 2.


ANTLR resolves ambiguities in favor of the alternative given first,
implicitly allowing us to specify operator precedence. Rule expr has the multiplication
alternative before the addition alternative, so ANTLR resolves the
operator ambiguity for 1+2*3 in favor of the multiplication.

By default, ANTLR associates operators left to right as we’d expect for * and
+. Some operators like exponentiation group right to left, though, so we have
to manually specify the associativity on the operator token using option assoc.
Here’s an expression rule that properly interprets input like 2^3^4 as 2^(3^4):
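The rule (an image in the original post) is missing; the book's version, reconstructed, looks like:

expr : expr '^'<assoc=right> expr  // ^ operator is right associative
     | INT
     ;

Note that newer ANTLR releases (4.2 and later) expect the option on the alternative instead: <assoc=right> expr '^' expr.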




However, one of ANTLR v4’s major improvements
is that it can now handle direct left recursion. A left-recursive rule is one that
either directly or indirectly invokes itself on the left edge of an alternative.

While ANTLR v4 can handle direct left recursion, it can’t handle indirect left
recursion. That means we can’t factor expr into the grammatically equivalent
rules.
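The factored rules referred to (also missing from the post) would look roughly like this; expr reaches itself only through expo, so the left recursion is indirect, which ANTLR v4 rejects:

expr : expo ;   // indirect left recursion: expr -> expo -> expr
expo : expr '^'<assoc=right> expr
     | INT
     ;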




#22 · Posted 2014-03-07 18:18
******************************
5.5 Recognizing Common Lexical Structures

Tarring everyone with the same brush
******************************




Computer languages look remarkably similar lexically.

Lexically, then, functional, procedural, declarative,
and object-oriented languages look pretty much the same. Amazing!


If I told you these photos were all of the same person, would you believe me?



That’s great because we have to learn only how to describe identifiers and
integers once and, with little variation, apply them to most programming
languages.

To demonstrate what lexical rules look like, let’s build simple versions of the
common tokens, starting with our friend the humble identifier.


Matching Identifiers

In grammar pseudocode, a basic identifier is a nonempty sequence of uppercase
and lowercase letters.

ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters

As a shorthand for character sets, ANTLR supports the more familiar regular
expression set notation.

ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters



Matching Numbers

Describing integer numbers such as 10 is easy because it’s just a sequence
of digits.

INT : '0'..'9'+ ; // match 1 or more digits
or
INT : [0-9]+ ; // match 1 or more digits

Floating-point numbers are much more complicated, unfortunately, but we
can make a simplified version easily if we ignore exponents.

FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment
DIGIT : [0-9] ; // match single digit


By prefixing the rule with fragment, we let ANTLR know that the
rule will be used only by other lexical rules. It is not a token in and of itself.



Matching String Literals

The next token that computer languages tend to have in common is the string
literal like "Hello". Most use double quotes, but some use single quotes or even
both (Python). Regardless of the choice of delimiters, we match them using a
rule that consumes everything between the delimiters. In grammar pseudocode,
a string is a sequence of any characters between double quotes.

STRING : '"' .*? '"' ; // match anything in "..."

The dot wildcard operator matches any single character. Therefore, .* would
be a loop that matches any sequence of zero or more characters.

If .*? is confusing,
don’t worry about it. Just remember it as a pattern for matching stuff inside
quotes or other delimiters.

To support the common escape characters, we need something like
the following:

STRING: '"' (ESC|.)*? '"' ;
fragment
ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\


ANTLR itself needs to escape the escape character, so that’s why we need \\
to specify the backslash character.



Matching Comments and Whitespace

For example, here is how to match both single-line and multiline
comments for C-derived languages:

LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ; // Match "//" stuff '\n'
COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"


In LINE_COMMENT, .*? consumes everything after // until it sees a newline.
In COMMENT, .*? consumes everything after /* and before the terminating */.

Here is how to tell ANTLR to throw out
whitespace:

WS : (' '|'\t'|'\r'|'\n')+ -> skip ; // match 1-or-more whitespace but discard
or
WS : [ \t\r\n]+ -> skip ; // match 1-or-more whitespace but discard


Believe
it or not, that’s a great start on a lexer for even a big programming language.
It almost feels as if the author understands Chinese culture.



Here’s a lexer starter kit we can use as a reference later:
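The starter kit itself was an image in the original post; assembled from the rules shown above, it would look something like this:

ID           : [a-zA-Z]+ ;           // identifiers
INT          : [0-9]+ ;              // integers
FLOAT        : [0-9]+ '.' [0-9]*     // match 1. 39. 3.14159 etc...
             | '.' [0-9]+            // match .1 .14159
             ;
STRING       : '"' (ESC|.)*? '"' ;   // strings, with \" and \\ escapes
fragment
ESC          : '\\"' | '\\\\' ;
LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ;
COMMENT      : '/*' .*? '*/' -> skip ;
WS           : [ \t\r\n]+ -> skip ;  // discard whitespace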






Before
we move on, though, there are two important issues to consider. First, it’s
not always obvious where to draw the line between what we match in the
parser and what we match in the lexer. Second, ANTLR places a few constraints
on our grammar rules that we should know about.

#23 · Posted 2014-03-08 16:34
Last edited by oldbeginner on 2014-03-08 16:42

*****************************
5.6 Drawing the Line Between Lexer and Parser

Men and women make a good team
*****************************




Because ANTLR lexer rules can use recursion, lexers are technically as powerful
as parsers. That means we could match even grammatical structure in
the lexer.


I'd like to enter the men's singles competition

Where to draw the line between the lexer and the parser is partially a function
of the language but also a function of the intended application.


Fortunately,
a few rules of thumb will get us pretty far.

• Match and discard anything in the lexer that the parser does not need to
see at all. Recognize and toss out things like whitespace and comments
for programming languages. Otherwise, the parser would have to constantly
check to see whether there are comments or whitespace in between
tokens.

• Match common tokens such as identifiers, keywords, strings, and numbers
in the lexer. The parser has more overhead than the lexer, so we shouldn’t
burden the parser with, say, putting digits together to recognize integers.

• Lump together into a single token type those lexical structures that the
parser does not need to distinguish. For example, if our application treats
integer and floating-point numbers the same, then lump them together
as token type NUMBER. There’s no point in sending separate token types
to the parser.

• Lump together anything that the parser can treat as a single entity. For
example, if the parser doesn’t care about the contents of an XML tag, the
lexer can lump everything between angle brackets into a single token type
called TAG.

• On the other hand, if the parser needs to pull apart a lump of text to
process it, the lexer should pass the individual components as tokens to
the parser. For example, if the parser needs to process the elements of
an IP address, the lexer should send individual tokens for the IP components
(integers and periods).


To see how the intended application affects what we match in the lexer vs.
the parser, imagine processing a log file from a web server that has one record
per line.

192.168.209.85 "GET /download/foo.html HTTP/1.0" 200

With a complete set of tokens, we can make parser rules that match the
records in a log file.

file : row+ ; // parser rule matching rows of log file

row : IP STRING INT NL ; // match log file record

IP : INT '.' INT '.' INT '.' INT ; // 192.168.209.85

INT : [0-9]+ ; // match IP octet or HTTP result code

STRING: '"' .*? '"' ; // matches the HTTP protocol command

NL : '\n' ; // match log file record terminator

WS : ' ' -> skip ; // ignore spaces



With convenient library functions like
split('.'), we could pass IP addresses as strings to the parser and process them
there. But, it’s better to have the lexer match the IP address lexical structure
and pass the components to the parser as tokens.

file : row+ ; // parser rule matching rows of log file

row : ip STRING INT NL ; // match log file record

ip : INT '.' INT '.' INT '.' INT ; // match IPs in parser

INT : [0-9]+ ; // match IP octet or HTTP result code

STRING: '"' .*? '"' ; // matches the HTTP protocol command

NL : '\n' ; // match log file record terminator

WS : ' ' -> skip ; // ignore spaces



Switching lexer rule IP to parser rule ip shows how easily we can shift the
dividing line.




In this chapter, we learned how to work from a representative sample of the
language, or language documentation, to create grammar pseudocode and
then a formal grammar in ANTLR notation.

We also studied the common
language patterns: sequence, choice, token dependency, and nested phrase.

In the lexical realm, we looked at implementations for the most common
tokens: identifiers, numbers, strings, comments, and whitespace.
  

#24 · Posted 2014-03-08 17:15

Replying to oldbeginner's post of 2014-03-07 18:18 (5.5 Recognizing Common Lexical Structures):

Can assembly-line products ever compare with what nature creates?






#25 · Posted 2014-03-08 18:21
*******************************
CHAPTER 6
Exploring Some Real Grammars

Finally, time to face reality
*******************************




Heh, just kidding. In the story, Wu Song of course kills his sister-in-law; the right picture should be the one below.



Welcome to the real world.

***************************
6.1 Parsing Comma-Separated Values

***************************


The CSV file to be translated:

The corresponding grammar file. The book's focus is walking through, step by step, how this grammar file came about (just don't be like Fan Wei).


Note that we’ve introduced an extra rule called hdr for clarity. Grammatically
it’s just a row, but we’ve made its role clearer by separating it out.

Rule row is the same as before: a list of fields separated by commas and terminated
by a newline.

TEXT tokens are a sequence of characters
until we hit the next comma field separator or the end of the line.

Strings are
any characters in between double quotes.

To get a double quote inside a double-quoted string, the CSV format generally
uses two double quotes in a row. That’s what the ('""'|~'"')* subrule does in
rule STRING.
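The grammar file itself appeared as an image in the original post; the rules described above, put together, look like this (a reconstruction of the book's CSV grammar):

grammar CSV;

file : hdr row+ ;
hdr  : row ;                     // grammatically just a row, separated for clarity

row  : field (',' field)* '\r'? '\n' ;

field : TEXT
      | STRING
      |                          // empty field
      ;

TEXT   : ~[,\n\r"]+ ;            // chars until the next comma or end of line
STRING : '"' ('""'|~'"')* '"' ;  // quote-quote is an escaped quote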

Then, run it:







#26 · Posted 2014-03-09 16:26
***************************
6.2 Parsing JSON

Honestly, I don't really know JSON
***************************



JSON is a text data format that captures a collection of name-value pairs,
and since values can themselves be collections, JSON can describe arbitrarily nested data.

Our goal is to build an ANTLR grammar by reading the JSON reference
manual and looking at its syntax diagram and existing grammar.

We’ll pull
out key phrases from the manual and figure out how to encode them as ANTLR
rules, starting with the grammatical structures.





An explanation of the grammar file:
The language reference says that a JSON file can be either an object, as shown
earlier, or an array of values.

The next step is to drill down into the rule references in json.

The syntax diagram at the JSON website also indicates that names have to
be strings.

To convert this English description into grammar constructs, we break it
apart looking for key phrases that represent one of our patterns: sequence,
choice, token dependency, and nested phrase.
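The grammar was shown as images in the original post; a sketch of the grammatical rules, reconstructed from the book's JSON grammar, is:

grammar JSON;

json : object
     | array
     ;

object : '{' pair (',' pair)* '}'
       | '{' '}'                 // empty object
       ;
pair   : STRING ':' value ;      // names have to be strings

array : '[' value (',' value)* ']'
      | '[' ']'                  // empty array
      ;

value : STRING
      | NUMBER
      | object                   // values can themselves be collections
      | array
      | 'true'
      | 'false'
      | 'null'
      ;

The lexical rules for STRING and NUMBER follow the JSON specification and are omitted here.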

Here’s how our grammar parses the sample input from earlier:


The author mainly shows how to turn the JSON reference into the corresponding ANTLR grammar, much like a translation job.


   

#27 · Posted 2014-03-09 16:52
******************************
6.3 Parsing DOT

******************************




DOT is a declarative language for describing graphs such as network diagrams,
trees, or state machines.

To get a feel for the language, imagine that we want to visualize a call tree
graph for a program with four functions.



Here’s the resulting diagram created using the DOT visualizer, graphviz:


The grammar feels more complex than the earlier ones.
First,
DOT Grammatical Rules


DOT Lexical Rules


ID : LETTER (LETTER|DIGIT)*;

fragment
LETTER : [a-zA-Z\u0080-\u00FF_] ;


NUMBER : '-'? ('.' DIGIT+ | DIGIT+ ('.' DIGIT*)? ) ;

fragment
DIGIT : [0-9] ;

STRING : '"' ('\\"'|.)*? '"' ;

HTML_STRING : '<' (TAG|~[<>])* '>' ;

fragment
TAG : '<' .*? '>' ;


PREPROC : '#' .*? '\n' -> skip ;


That’s it for the DOT grammar (except for rules we’re very familiar with). We’ve
made it through our first moderately complex language!   

#28 · Posted 2014-03-09 17:27
***********************
6.4 Parsing Cymbol

A knockoff C language
***********************



We’re going to build a grammar for a language I conjured up called
Cymbol.

Cymbol is a simple non-object-oriented programming language that
looks like C without structs.

A grammar for this language serves as a good
prototype for other new programming languages, if you’d like to build one.

Here’s a program with a global variable and recursive
function declaration that shows what Cymbol code looks like:


Here’s the parse tree that illustrates how our grammar
should interpret the input:


grammar Cymbol;

file:   (functionDecl | varDecl)+ ;

varDecl
    :   type ID ('=' expr)? ';'
    ;
type:   'float' | 'int' | 'void' ; // built-in types only (no structs)

functionDecl
    :   type ID '(' formalParameters? ')' block // "void f(int x) {...}"
    ;
formalParameters
    :   formalParameter (',' formalParameter)*
    ;
formalParameter
    :   type ID
    ;

block:  '{' stat* '}' ;   // possibly empty statement block
stat:   block
    |   varDecl
    |   'if' expr 'then' stat ('else' stat)?
    |   'return' expr? ';'
    |   expr '=' expr ';' // assignment
    |   expr ';'          // func call
    ;

/* expr below becomes the following non-left recursive rule:
expr[int _p]
    :   ( '-' expr[6]
        | '!' expr[5]
        | ID
        | INT
        | '(' expr ')'
        )
        ( {8 >= $_p}? '*' expr[9]
        | {7 >= $_p}? ('+'|'-') expr[8]
        | {4 >= $_p}? '==' expr[5]
        | {10 >= $_p}? '[' expr ']'
        | {9 >= $_p}? '(' exprList? ')'
        )*
    ;
*/

expr:   ID '(' exprList? ')'    // func call like f(), f(x), f(1,2)
    |   expr '[' expr ']'       // array index like a, a[j]
    |   '-' expr                // unary minus
    |   '!' expr                // boolean not
    |   expr '*' expr
    |   expr ('+'|'-') expr
    |   expr '==' expr          // equality comparison (lowest priority op)
    |   ID                      // variable reference
    |   INT
    |   '(' expr ')'
    ;
exprList : expr (',' expr)* ;   // arg list

ID  :   LETTER (LETTER | [0-9])* ;
fragment
LETTER : [a-zA-Z] ;

INT :   [0-9]+ ;

WS  :   [ \t\n\r]+ -> skip ;

SL_COMMENT
    :   '//' .*? '\n' -> skip
    ;



Using our intuitive sense of what a struct-less or class-less Java language might
look like made building the Cymbol grammar pretty easy.

   

#29 · Posted 2014-03-09 18:35
***********************
6.5 Parsing R

Next to the township chief, R doesn't amount to anything
***********************




The person with the biggest grudge against statistics was an American.

As Mark Twain famously put it: there are three kinds of lies: lies, damned lies, and statistics.

Twain was limited by his era; nowadays it would be: statistics, and Chinese statistics.



R is an expressive domain-specific programming language for describing
statistical problems.

We must deduce R’s structure by
wading through reference manuals, examples, and a formal yacc grammar
from the existing implementation.

/**
derived from http://svn.r-project.org/R/trunk/src/main/gram.y
http://cran.r-project.org/doc/manuals/R-lang.html#Parser
*/
grammar R;

prog:   (   expr_or_assign (';'|NL)
        |   NL
        )*
        EOF
    ;

expr_or_assign
    :   expr ('<-'|'='|'<<-') expr_or_assign
    |   expr
    ;

expr:   expr '[[' sublist ']' ']'  // '[[' follows R's yacc grammar
    |   expr '[' sublist ']'
    |   expr ('::'|':::') expr
    |   expr ('$'|'@') expr
    |   expr '^'<assoc=right> expr
    |   ('-'|'+') expr
    |   expr ':' expr
    |   expr USER_OP expr // anything wrapped in %: '%' .* '%'
    |   expr ('*'|'/') expr
    |   expr ('+'|'-') expr
    |   expr ('>'|'>='|'<'|'<='|'=='|'!=') expr
    |   '!' expr
    |   expr ('&'|'&&') expr
    |   expr ('|'|'||') expr
    |   '~' expr
    |   expr '~' expr
    |   expr ('->'|'->>'|':=') expr
    |   'function' '(' formlist? ')' expr // define function
    |   expr '(' sublist ')'              // call function
    |   '{' exprlist '}' // compound statement
    |   'if' '(' expr ')' expr
    |   'if' '(' expr ')' expr 'else' expr
    |   'for' '(' ID 'in' expr ')' expr
    |   'while' '(' expr ')' expr
    |   'repeat' expr
    |   '?' expr // get help on expr, usually string or ID
    |   'next'
    |   'break'
    |   '(' expr ')'
    |   ID
    |   STRING
    |   HEX
    |   INT
    |   FLOAT
    |   COMPLEX
    |   'NULL'
    |   'NA'
    |   'Inf'
    |   'NaN'
    |   'TRUE'
    |   'FALSE'
    ;

exprlist
    :   expr_or_assign ((';'|NL) expr_or_assign?)*
    |
    ;

formlist : form (',' form)* ;

form:   ID
    |   ID '=' expr
    |   '...'
    ;

sublist : sub (',' sub)* ;
sub :   expr
    |   ID '='
    |   ID '=' expr
    |   STRING '='
    |   STRING '=' expr
    |   'NULL' '='
    |   'NULL' '=' expr
    |   '...'
    |
    ;

HEX :   '0' ('x'|'X') HEXDIGIT+ [Ll]? ;

INT :   DIGIT+ [Ll]? ;

fragment
HEXDIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

FLOAT:  DIGIT+ '.' DIGIT* EXP? [Ll]?
    |   DIGIT+ EXP? [Ll]?
    |   '.' DIGIT+ EXP? [Ll]?
    ;
fragment
DIGIT:  '0'..'9' ;
fragment
EXP :   ('E' | 'e') ('+' | '-')? INT ;

COMPLEX
    :   INT 'i'
    |   FLOAT 'i'
    ;

STRING
    :   '"' ( ESC | ~[\\"] )*? '"'
    |   '\'' ( ESC | ~[\\'] )*? '\''
    ;

fragment
ESC :   '\\' ([abtnfrv]|'"'|'\'')
    |   UNICODE_ESCAPE
    |   HEX_ESCAPE
    |   OCTAL_ESCAPE
    ;

fragment
UNICODE_ESCAPE
    :   '\\' 'u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
    |   '\\' 'u' '{' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT '}'
    ;

fragment
OCTAL_ESCAPE
    :   '\\' [0-3] [0-7] [0-7]
    |   '\\' [0-7] [0-7]
    |   '\\' [0-7]
    ;

fragment
HEX_ESCAPE
    :   '\\' HEXDIGIT HEXDIGIT?
    ;

ID  :   '.' (LETTER|'_'|'.') (LETTER|DIGIT|'_'|'.')*
    |   LETTER (LETTER|DIGIT|'_'|'.')*
    ;
fragment LETTER  : [a-zA-Z] ;

USER_OP :   '%' .*? '%' ;

COMMENT :   '#' .*? '\r'? '\n' -> type(NL) ;

// Match both UNIX and Windows newlines
NL      :   '\r'? '\n' ;

WS      :   [ \t]+ -> skip ;


   
I skipped over how the grammar above is derived; the book walks through the steps, but I felt the author's writing there isn't great.

Now that we have the grammar, let's test it:






************************************

Our goal in this chapter was to solidify our knowledge of ANTLR syntax and
to learn how to derive grammars from language reference manuals, examples,
and existing non-ANTLR grammars.

To that end, we looked at two data languages
(CSV, JSON), a declarative language (DOT), an imperative language
(Cymbol), and a functional language (R). These examples cover all of the skills
you’ll need to build grammars for most moderately complex languages.

Before
you move on, though, it’s a good idea to lock in your new expertise by downloading
the grammars and trying some simple modifications to alter the
languages.

#30 · Posted 2014-03-10 20:07
*****************************
CHAPTER 7
Decoupling Grammars from
Application-Specific Code

******************************

A reform that a certain large country has dreamed about for years, the author has actually pulled off.



To build language applications,
we need the parser to trigger specific actions when it sees specific input sentences,
phrases, or tokens.

Our goal in this chapter is to understand exactly what tree-walking facilities
ANTLR builds for us and why. We’ll start by looking at the origins of the
listener mechanism and how we can keep application-specific code out of our
grammars using listeners and visitors.

Next, we’ll learn how to get ANTLR to
generate more precise events, one for each alternative in a rule. Once we know
a little more about ANTLR’s tree walking, we’ll look at three calculator
implementations that illustrate different ways to pass around subexpression
results.

Finally, we’ll discuss the advantages and disadvantages of the three
approaches.


***************************************
7.1 Evolving from Embedded Actions to Listeners

**************************************



Lord Wei, .....

The listener and visitor mechanisms
decouple grammars from application code, providing some compelling
benefits.

Without embedded
actions, we can reuse the same grammar in different applications without
even recompiling the generated parser.

In this section, we’re going to investigate the evolution from grammar with
embedded actions to completely decoupled grammar and application.

Before decoupling:
grammar PropertyFile;

file : {«start file»} prop+ {«finish file»}
;
prop : ID '=' STRING '\n' {«process property»} ;

ID : [a-z]+ ;

STRING : '"' .*? '"' ;


After separating the grammar from the application:

grammar PropertyFile;

@members {

void startFile() { } // blank implementations

void finishFile() { }

void defineProperty(Token name, Token value) { }

}
file : {startFile();} prop+ {finishFile();} ;

prop : ID '=' STRING '\n' {defineProperty($ID, $STRING);} ;

ID : [a-z]+ ;

STRING : '"' .*? '"' ;



This decoupling makes the grammar reusable for different applications, but
the grammar is still tied to Java because of the method calls.


*****************************************************

***********************************
7.2 Implementing Applications with Parse-Tree Listeners

***********************************


To build language applications without entangling the application and the
grammar, the key is to have the parser create a parse tree and then walk it
to trigger application-specific code.






We need to write a Java file that implements the corresponding listener interface.

Then, run it:


A listener-based approach is great because all of the tree walking and method
triggering is done automatically.
   