Guide to the parser
===================

Abstract
--------

Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.

See [Changing CPython's grammar](changing_grammar.md)
for a detailed description of the grammar and the process for changing it.

How PEG parsers work
====================

A PEG (Parsing Expression Grammar) grammar differs from a
[context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar)
in that the way it is written more closely reflects how the parser will operate
when parsing. The fundamental technical difference is that the choice operator
is ordered. This means that when writing:

```
rule: A | B | C
```

a parser that implements a context-free grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.

This means that in a PEG grammar, the choice operator is not commutative.
Furthermore, unlike context-free grammars, the derivation according to a
PEG grammar cannot be ambiguous: if a string parses, it has exactly one
valid parse tree.

PEG parsers are usually constructed as a recursive descent parser in which every
rule in the grammar corresponds to a function in the program implementing the
parser, and the parsing expression (the "expansion" or "definition" of the rule)
represents the "code" in said function. Each parsing function conceptually takes
an input string as its argument, and yields one of the following results:

* A "success" result. This result indicates that the expression can be parsed by
  that rule and the function may optionally move forward or consume one or more
  characters of the input string supplied to it.
* A "failure" result, in which case no input is consumed.

Note that "failure" results do not imply that the program is incorrect, nor do
they necessarily mean that the parsing has failed. Since the choice operator is
ordered, a failure very often merely indicates "try the following option". A
direct implementation of a PEG parser as a recursive descent parser will present
exponential time performance in the worst case, because PEG parsers have
infinite lookahead (this means that they can consider an arbitrary number of
tokens before deciding on a rule). Usually, PEG parsers avoid this exponential
time complexity with a technique called
["packrat parsing"](https://pdos.csail.mit.edu/~baford/packrat/thesis/)
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
is that the parser will naturally use more memory than a simple `LL(1)` parser,
which is normally table-based.
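
To make the packrat idea concrete, here is a minimal Python sketch (invented
for illustration — CPython's real parser is generated C code). Each parsing
function takes a position and returns the new position on success or `None` on
failure, and every `(function, position)` pair is memoized so that
backtracking never re-parses the same rule at the same spot:

```
# Toy packrat parser for the rule `rule: 'a' 'b' | 'a' 'c'`.
from functools import lru_cache

TEXT = "ac"

@lru_cache(maxsize=None)  # the packrat memoization cache
def literal(pos: int, char: str):
    """Match one character; return the new position or None on failure."""
    if pos < len(TEXT) and TEXT[pos] == char:
        return pos + 1
    return None

@lru_cache(maxsize=None)
def rule(pos: int):
    """rule: 'a' 'b' | 'a' 'c' -- alternatives tried in order."""
    for second in ("b", "c"):
        mid = literal(pos, "a")          # both alternatives re-check 'a'...
        if mid is not None:
            end = literal(mid, second)   # ...but the cache answers instantly
            if end is not None:
                return end
    return None

print(rule(0))  # -> 2 (the second alternative succeeded)
```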

Key ideas
---------

- Alternatives are ordered (`A | B` is not the same as `B | A`).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
  PEG parser doesn't have a concept of "where the
  [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".

> [!IMPORTANT]
> Don't try to reason about a PEG grammar in the same way you would with an
> [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)
> or context-free grammar. PEG is optimized to describe **how** input strings will
> be parsed, while context-free grammars are optimized to generate strings of the
> language they describe (in EBNF, to know whether a given string is in the
> language, you need to do work to find out as it is not immediately obvious from
> the grammar).

Consequences of the ordered choice operator
-------------------------------------------

Although PEG may look like EBNF, its meaning is quite different. The fact
that the alternatives are ordered in a PEG grammar (which is at the core of
how PEG parsers work) has deep consequences, other than removing ambiguity.

If a rule has two alternatives and the first of them succeeds, the second one is
**not** attempted even if the caller rule fails to parse the rest of the input.
Thus the parser is said to be "eager". To illustrate this, consider
the following two rules (in these examples, a token is an individual character):

```
first_rule:  ( 'a' | 'aa' ) 'a'
second_rule: ( 'aa' | 'a' ) 'a'
```

In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
in PEG, one of these two rules accepts the string `aaa` but not the string
`aa`. The other does the opposite -- it accepts the string `aa`
but not the string `aaa`. The rule `('a'|'aa')'a'` does
not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.
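
The eager behaviour above can be reproduced in a few lines of Python. This is
a toy model (the function names and structure are invented for illustration),
where a rule "accepts" a string only if it consumes all of it:

```
def parse(text: str, alternatives: tuple[str, ...]) -> bool:
    """Parse `( alt1 | alt2 ) 'a'` against the whole of `text`."""
    for alt in alternatives:
        if text.startswith(alt):
            # Ordered choice commits here: later alternatives are never tried.
            return text[len(alt):] == "a"
    return False

def first_rule(text: str) -> bool:   # ( 'a' | 'aa' ) 'a'
    return parse(text, ("a", "aa"))

def second_rule(text: str) -> bool:  # ( 'aa' | 'a' ) 'a'
    return parse(text, ("aa", "a"))

print(first_rule("aa"), first_rule("aaa"))    # True False
print(second_rule("aa"), second_rule("aaa"))  # False True
```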

> [!CAUTION]
> The effects of ordered choice, such as the ones illustrated above, may be
> hidden by many levels of rules.

For this reason, writing rules where an alternative is contained in the next
one is in almost all cases a mistake, for example:

```
my_rule:
    | 'if' expression 'then' block
    | 'if' expression 'then' block 'else' block
```

In this example, the second alternative will never be tried because the first one will
succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:

```
my_rule:
    | 'if' expression 'then' block 'else' block
    | 'if' expression 'then' block
```

In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.

Grammar Syntax
==============

The grammar consists of a sequence of rules of the form:

```
rule_name: expression
```

Optionally, a type can be included right after the rule name, which
specifies the return type of the C or Python function corresponding to
the rule:

```
rule_name[return_type]: expression
```

If the return type is omitted, then a `void *` is returned in C and an
`Any` in Python.

Grammar expressions
-------------------

| Expression | Description and Example |
|-----------------|-----------------------------------------------------------------------------------------------------------------------|
| `# comment` | Python-style comments. |
| `e1 e2` | Match `e1`, then match `e2`. <br> `rule_name: first_rule second_rule` |
| `e1 \| e2` | Match `e1` or `e2`. <br> `rule_name[return_type]:`<br>` \| first_alt`<br>` \| second_alt` |
| `( e )` | Grouping operator: Match `e`. <br> `rule_name: (e)`<br>`rule_name: (e1 e2)*` |
| `[ e ]` or `e?` | Optionally match `e`. <br> `rule_name: [e]`<br>`rule_name: e (',' e)* [',']` |
| `e*` | Match zero or more occurrences of `e`. <br> `rule_name: (e1 e2)*` |
| `e+` | Match one or more occurrences of `e`. <br> `rule_name: (e1 e2)+` |
| `s.e+` | Match one or more occurrences of `e`, separated by `s`. <br> `rule_name: ','.e+` |
| `&e` | Positive lookahead: Succeed if `e` can be parsed, without consuming input. |
| `!e` | Negative lookahead: Fail if `e` can be parsed, without consuming input. <br> `primary: atom !'.' !'(' !'['` |
| `~` | Commit to the current alternative, even if it fails to parse (cut). <br> `rule_name: '(' ~ some_rule ')' \| some_alt` |
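
Several of these operators are often combined in one rule. As an illustrative
sketch (a hypothetical rule, not one from `python.gram`), a parenthesized,
comma-separated list of names with an optional trailing comma could be
written as:

```
# Hypothetical rule: gather ','.NAME+ matches names separated by commas,
# [','] allows an optional trailing comma, and the action returns the list.
arg_list: '(' a=','.NAME+ [','] ')' { a }
```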

Left recursion
--------------

PEG parsers normally do not support left recursion, but CPython's parser
generator implements a technique similar to the one described in
[Medeiros et al.](https://arxiv.org/pdf/1207.0443) but using the memoization
cache instead of static variables. This approach is closer to the one described
in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This
allows us to write not only simple left-recursive rules but also more
complicated rules that involve indirect left-recursion like:

```
rule1: rule2 | 'a'
rule2: rule3 | 'b'
rule3: rule1 | 'c'
```

and "hidden left-recursion" like:

```
rule: 'optional'? rule '@' some_other_rule
```
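
For completeness, a *simple* (direct) left-recursive rule of the kind
mentioned above looks like the following sketch; the rule refers to itself as
its own first item, which naturally yields left-associative parses such as
`(1 - 2) - 3`:

```
# Direct left recursion (illustrative; the arithmetic grammar further below
# uses exactly this shape).
expr: expr '-' term | term
```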

Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
`=` sign. The name can then be used in the action (see below), like this:

```
rule_name[return_type]: '(' a=some_other_rule ')' { a }
```

Grammar actions
---------------

To avoid the intermediate steps that obscure the relationship between the
grammar and the AST generation, the PEG parser allows directly generating AST
nodes for a rule via grammar actions. Grammar actions are language-specific
expressions that are evaluated when a grammar rule is successfully parsed. These
expressions can be written in Python or C depending on the desired output of the
parser generator. This means that if one wants to generate a parser in
Python and another in C, two grammar files should be written, each one with a
different set of actions, keeping everything else apart from said actions
identical in both files. As an example of a grammar with Python actions, the
piece of the parser generator that parses grammar files is bootstrapped from a
meta-grammar file with Python actions that generate the grammar tree as a result
of the parsing.

In the specific case of the PEG grammar for Python, having actions allows
directly describing how the AST is composed in the grammar itself, making it
clearer and more maintainable. This AST generation process is supported by the use
of some helper functions that factor out common AST object manipulations and
some other required operations that are not directly related to the grammar.

To indicate these actions, each alternative can be followed by the action code
inside curly braces, which specifies the return value of the alternative:

```
rule_name[return_type]:
    | first_alt1 first_alt2 { first_alt1 }
    | second_alt1 second_alt2 { second_alt1 }
```

If the action is omitted, a default action is generated:

- If there is a single name in the rule, it gets returned.
- If there are multiple names in the rule, a collection with all parsed
  expressions gets returned (the type of the collection will be different
  in C and Python).

This default behaviour is primarily made for very simple situations and for
debugging purposes.
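
For instance, in a hypothetical rule like the one below (invented for
illustration), no action is given, so the default action returns a collection
holding every parsed item:

```
# No action: the default action returns a collection with the parsed NAME,
# ':' and NAME items (its exact type differs between the C and Python
# generators).
pair: NAME ':' NAME
```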

> [!WARNING]
> It's important that the actions don't mutate any AST nodes that are passed
> into them via variables referring to other rules. The reason for mutation
> being not allowed is that the AST nodes are cached by memoization and could
> potentially be reused in a different context, where the mutation would be
> invalid. If an action needs to change an AST node, it should instead make a
> new copy of the node and change that.

The full meta-grammar for the grammars supported by the PEG generator is:

```
start[Grammar]: grammar ENDMARKER { grammar }

grammar[Grammar]:
    | metas rules { Grammar(rules, metas) }
    | rules { Grammar(rules, []) }

metas[MetaList]:
    | meta metas { [meta] + metas }
    | meta { [meta] }

meta[MetaTuple]:
    | "@" NAME NEWLINE { (name.string, None) }
    | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
    | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }

rules[RuleList]:
    | rule rules { [rule] + rules }
    | rule { [rule] }

rule[Rule]:
    | rulename ":" alts NEWLINE INDENT more_alts DEDENT {
          Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
    | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) }
    | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }

rulename[RuleName]:
    | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")}
    | NAME '[' type=NAME ']' {(name.string, type.string)}
    | NAME {(name.string, None)}

alts[Rhs]:
    | alt "|" alts { Rhs([alt] + alts.alts)}
    | alt { Rhs([alt]) }

more_alts[Rhs]:
    | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
    | "|" alts NEWLINE { Rhs(alts.alts) }

alt[Alt]:
    | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
    | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
    | items action { Alt(items, action=action) }
    | items { Alt(items, action=None) }

items[NamedItemList]:
    | named_item items { [named_item] + items }
    | named_item { [named_item] }

named_item[NamedItem]:
    | NAME '=' ~ item {NamedItem(name.string, item)}
    | item {NamedItem(None, item)}
    | it=lookahead {NamedItem(None, it)}

lookahead[LookaheadOrCut]:
    | '&' ~ atom {PositiveLookahead(atom)}
    | '!' ~ atom {NegativeLookahead(atom)}
    | '~' {Cut()}

item[Item]:
    | '[' ~ alts ']' {Opt(alts)}
    | atom '?' {Opt(atom)}
    | atom '*' {Repeat0(atom)}
    | atom '+' {Repeat1(atom)}
    | sep=atom '.' node=atom '+' {Gather(sep, node)}
    | atom {atom}

atom[Plain]:
    | '(' ~ alts ')' {Group(alts)}
    | NAME {NameLeaf(name.string) }
    | STRING {StringLeaf(string.string)}

# Mini-grammar for the actions

action[str]: "{" ~ target_atoms "}" { target_atoms }

target_atoms[str]:
    | target_atom target_atoms { target_atom + " " + target_atoms }
    | target_atom { target_atom }

target_atom[str]:
    | "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
    | NAME { name.string }
    | NUMBER { number.string }
    | STRING { string.string }
    | "?" { "?" }
    | ":" { ":" }
```

As an illustrative example, this simple grammar file allows directly
generating a full parser that can parse simple arithmetic expressions and that
returns a valid C-based Python AST:

```
start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) }

expr[expr_ty]:
    | l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
    | l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) }
    | term

term[expr_ty]:
    | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
    | l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) }
    | factor

factor[expr_ty]:
    | '(' e=expr ')' { e }
    | atom

atom[expr_ty]:
    | NAME
    | NUMBER
```

Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena`, those being variables automatically
injected by the parser; `p` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:

```
start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or []) }
expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }

expr:
    | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
    | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
    | term

term:
    | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
    | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
    | factor

factor:
    | '(' e=expr ')' { e }
    | atom

atom:
    | NAME
    | NUMBER
```

Pegen
=====

Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
  [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
  [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.

The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but normally all typical commands to interact with the parser generator are executed from
the main makefile.

How to regenerate the parser
----------------------------

Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:

```
make regen-pegen
```

using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or execute:

```
./PCbuild/build.bat --regen
```

The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).

How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
To do so just execute:

```
make regen-pegen-metaparser
```

If you are on Windows you can use the Visual Studio project files
to regenerate the parser or execute:

```
./PCbuild/build.bat --regen
```

Grammatical elements and rules
------------------------------

Pegen has some special grammatical elements and rules:

- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
- Uppercase names (for example, `NAME`) denote tokens in the
  [`Grammar/Tokens`](../Grammar/Tokens) file.
- Rule names starting with `invalid_` are used for specialized syntax errors.

  - These rules are NOT used in the first pass of the parser.
  - Only if the first pass fails to parse, a second pass including the invalid
    rules will be executed.
  - If the parser fails in the second phase with a generic syntax error, the
    location of the generic failure of the first pass will be used (this avoids
    reporting incorrect locations due to the invalid rules).
  - The order of the alternatives involving invalid rules matters
    (like any rule in PEG).

Tokenization
------------

It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:

```
make regen-token
```

If you are on Windows you can use the Visual Studio project files to regenerate
the tokens or execute:

```
./PCbuild/build.bat --regen
```

How tokens are generated and the rules governing this are completely up to the tokenizer
([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.

Memoization
-----------

As described previously, to avoid exponential time complexity in the parser,
memoization is used.

The C parser used by Python is highly optimized and memoization can be expensive
both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache), the execution time cost comes
from continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
by default** except for rules with the special marker `memo` after the rule
name (and type, if present):

```
rule_name[type] (memo):
    ...
```

By selectively turning on memoization for a handful of rules, the parser becomes
faster and uses less memory.

> [!NOTE]
> Left-recursive rules always use memoization, since the implementation of
> left-recursion depends on it.

To determine whether a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows measuring how much each rule uses
memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
-------------------

To make writing actions easier, Pegen injects some automatic variables into the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

- `p`: The parser structure.
- `EXTRA`: This is a macro that expands to
  `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.
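
The arithmetic grammar shown earlier already relies on both of these: in the
action below (repeated from that example), `EXTRA` supplies the location
variables and `p->arena` so the action can build a fully located AST node:

```
# `l` and `r` are named sub-expressions; `EXTRA` and `p` are injected
# automatically by Pegen.
expr[expr_ty]: l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
```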

Hard and soft keywords
----------------------

> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
> `'class'`) while soft keywords are defined using **double quotes** (for example,
> `"match"`).

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:

```
>>> class = 3
  File "<stdin>", line 1
    class = 3
          ^
SyntaxError: invalid syntax
>>> foo(class=3)
  File "<stdin>", line 1
    foo(class=3)
        ^^^^^
SyntaxError: invalid syntax
```

Soft keywords don't have this limitation if used in a context other than the
one where they are defined as keywords:

```
>>> match = 45
>>> foo(match="Yeah!")
```

The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.

You can get a list of all keywords defined in the grammar from Python:

```
>>> import keyword
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for',
'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or',
'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
```

as well as soft keywords:

```
>>> import keyword
>>> keyword.softkwlist
['_', 'case', 'match']
```

> [!CAUTION]
> Soft keywords can be a bit challenging to manage as they can be accepted in
> places you don't intend, given how ordered alternatives behave in PEG
> parsers (see the
> [consequences of ordered choice](#consequences-of-the-ordered-choice-operator)
> section for some background on this). In general, try to define them in places
> where there are not many alternatives.
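
As a simplified sketch of how this looks in a grammar (the real `match_stmt`
rule in `python.gram` has more alternatives and an action, and this version is
only illustrative), the soft keyword is just the double-quoted string at the
start of the rule:

```
# Simplified sketch: "match" acts as a keyword only when a statement starts
# with it; everywhere else it remains a valid identifier.
match_stmt: "match" subject=expression ':' NEWLINE INDENT case_block+ DEDENT
```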

Error handling
--------------

When a pegen-generated parser detects that an exception is raised, it will
**automatically stop parsing**, no matter what the current state of the parser
is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done to correctly propagate any
exception set by calling Python's C API functions. This also includes
[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.

> [!NOTE]
> Tokenizer errors are normally reported by raising exceptions but some special
> tokenizer errors such as unclosed parentheses will be reported only after the
> parser finishes without returning anything.

How syntax errors are reported
------------------------------

As described previously in the [how PEG parsers work](#how-peg-parsers-work)
section, PEG parsers don't have a defined concept of where errors happened
in the grammar, because a rule failure doesn't imply a parsing failure like
in context-free grammars. This means that a heuristic has to be used to report
generic errors unless something is explicitly declared as an error in the
grammar.

To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.

As the Python grammar was originally written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.

> [!CAUTION]
> Positive and negative lookaheads will try to match a token so they will affect
> the location of generic syntax errors. Use them carefully at boundaries
> between rules.

To generate more precise syntax errors, custom rules are used. This is a common
practice also in context-free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:

1. The first phase will try to parse the input stream without taking into
   account rules that start with the `invalid_` prefix. If the parsing
   succeeds it will return the generated AST and the second phase will be
   skipped.

2. If the first phase failed, a second parsing attempt is done including the
   rules that start with an `invalid_` prefix. By design this attempt
   **cannot succeed** and is only executed to give the invalid rules a
   chance to detect specific situations where custom, more precise, syntax
   errors can be raised. This also allows trading a bit of performance for
   precision when reporting errors: given that we know that the input text is
   invalid, there is typically no need to be fast because execution is going
   to stop anyway.

> [!IMPORTANT]
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
>   [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
>   exceptions (or a subclass of it).
> - Make sure **all** invalid rules start with the `invalid_` prefix to not
>   impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
>   (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).

You can find a collection of macros to raise specialized syntax errors in the
[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks that will be
displayed when the error is reported.
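
As a sketch of what such a rule can look like (the rule name and error message
here are invented; see `python.gram` for the real `invalid_` rules and
`Parser/pegen.h` for the full set of macros), the alternative matches the
known-bad construct and its action raises the error instead of building an
AST node:

```
# Illustrative invalid_ rule: the action raises a precise SyntaxError via a
# macro from Parser/pegen.h.
invalid_double_starred:
    | '**' '**' expression { RAISE_SYNTAX_ERROR("repeated '**' is not allowed") }
```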

> [!TIP]
> A good way to test whether an invalid rule will be triggered when you expect
> is to test if introducing a syntax error **after** valid code triggers the
> rule or not. For example:

```
<valid python code> $ 42
```

should trigger the syntax error at the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
`print` statements in order to create a better error message and you define it as:

```
invalid_print: "print" expression
```

This will **seem** to work because the parser will correctly parse `print(something)`, because it is valid
code and the second phase will never execute, but if you try to parse `print(something) $ 3`, the first pass
of the parser will fail (because of the `$`) and in the second phase, the rule will match
`print(something)` as `print` followed by the variable `something` in parentheses, and the error
will be reported there instead of at the `$` character.
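
A sketch of a more careful version (not the rule CPython actually uses) adds a
negative lookahead so the rule only matches when the parenthesis-free Python 2
form is really present:

```
# The negative lookahead !'(' prevents this rule from matching valid calls
# such as print(something), so the error stays at the real failure point.
invalid_print: "print" !'(' expression
```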

Generating AST objects
----------------------

The output of the C parser used by CPython, which is generated from the
[grammar file](../Grammar/python.gram), is a Python AST object (using C
structures). This means that the actions in the grammar file generate AST
objects when they succeed. Constructing these objects can be quite cumbersome
(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
helpers include functions that join AST sequences, get specific elements
from them or perform extra processing on the generated tree.

> [!CAUTION]
> Actions must **never** be used to accept or reject rules. It may be tempting
> in some situations to write a very generic rule and then check the generated
> AST to decide whether it is valid or not, but this will render the
> [official grammar](https://docs.python.org/3/reference/grammar.html) partially
> incorrect (because it does not include actions) and will make it more difficult
> for other Python implementations to adapt the grammar to their own needs.

As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.

Testing
=======

There are three files that contain tests for the grammar and the parser:

- [test_grammar.py](../Lib/test/test_grammar.py)
- [test_syntax.py](../Lib/test/test_syntax.py)
- [test_exceptions.py](../Lib/test/test_exceptions.py)

Check the contents of these files to know which is the best place for new
tests, depending on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test/test_peg_generator) directory.

Debugging generated parsers
===========================

Making experiments
------------------

As the generated C parser is the one used by Python, this means that if
something goes wrong when adding some new rules to the grammar, you cannot
correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:

```
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```

This will generate a file called `parse.py` in the same directory that you
can use to parse some input:

```
$ python parse.py file_with_source_code_to_test.py
```

As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.

Verbose mode
------------

When Python is compiled in debug mode (by adding `--with-pydebug` when
running the configure step in Linux or by adding `-d` when calling the
[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
**very** verbose mode in the generated parser. This is very useful to
debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

> [!NOTE]
> When activating verbose mode in the Python parser, it is better to not use
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.

To activate verbose mode you can add the `-d` flag when executing Python:

```
$ python -d file_to_test.py
```

This will print **a lot** of output to `stderr` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:

```
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```

Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:

- `>` indicates that a rule is going to be attempted to be parsed.
- `-` indicates that a rule has failed to be parsed.
- `+` indicates that a rule has been parsed correctly.
- `!` indicates that an exception or an error has been detected and the parser is unwinding.

The `<token_location>` part indicates the current index in the token array,
the `<rule_name>` part indicates what rule is being parsed and
the `<alternative>` part indicates what alternative within that rule
is being attempted.

> [!NOTE]
> **Document history**
>
> Pablo Galindo Salgado - Original author
>
> Irit Katriel and Jacob Coffee - Convert to Markdown