diff --git a/InternalDocs/adaptive.md b/InternalDocs/adaptive.md index 09245730b27..4ae9e85b387 100644 --- a/InternalDocs/adaptive.md +++ b/InternalDocs/adaptive.md @@ -31,8 +31,7 @@ although these are not fundamental and may change: ## Example family -The `LOAD_GLOBAL` instruction (in -[Python/bytecodes.c](https://github.com/python/cpython/blob/main/Python/bytecodes.c)) +The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c)) already has an adaptive family that serves as a relatively simple example. The `LOAD_GLOBAL` instruction performs adaptive specialization, diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md index e9608977b0c..0da4670c792 100644 --- a/InternalDocs/compiler.md +++ b/InternalDocs/compiler.md @@ -7,17 +7,16 @@ Abstract In CPython, the compilation from source code to bytecode involves several steps: -1. Tokenize the source code - [Parser/lexer/](https://github.com/python/cpython/blob/main/Parser/lexer/) - and [Parser/tokenizer/](https://github.com/python/cpython/blob/main/Parser/tokenizer/). +1. Tokenize the source code [Parser/lexer/](../Parser/lexer/) + and [Parser/tokenizer/](../Parser/tokenizer/). 2. Parse the stream of tokens into an Abstract Syntax Tree - [Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c). + [Parser/parser.c](../Parser/parser.c). 3. Transform AST into an instruction sequence - [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c). + [Python/compile.c](../Python/compile.c). 4. Construct a Control Flow Graph and apply optimizations to it - [Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c). + [Python/flowgraph.c](../Python/flowgraph.c). 5. Emit bytecode based on the Control Flow Graph - [Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c). + [Python/assemble.c](../Python/assemble.c). This document outlines how these steps of the process work. @@ -36,12 +35,10 @@ of tokens rather than a stream of characters which is more common with PEG parsers. The grammar file for Python can be found in -[Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram). -The definitions for literal tokens (such as ``:``, numbers, etc.) can be found in -[Grammar/Tokens](https://github.com/python/cpython/blob/main/Grammar/Tokens). -Various C files, including -[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c) -are generated from these. +[Grammar/python.gram](../Grammar/python.gram). +The definitions for literal tokens (such as `:`, numbers, etc.) can be found in +[Grammar/Tokens](../Grammar/Tokens). Various C files, including +[Parser/parser.c](../Parser/parser.c) are generated from these. See Also: @@ -63,7 +60,7 @@ specification of the AST nodes is specified using the Zephyr Abstract Syntax Definition Language (ASDL) [^1], [^2]. The definition of the AST nodes for Python is found in the file -[Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl). +[Parser/Python.asdl](../Parser/Python.asdl). Each AST node (representing statements, expressions, and several specialized types, like list comprehensions and exception handlers) is @@ -87,14 +84,14 @@ approach and syntax: The preceding example describes two different kinds of statements and an expression: function definitions, return statements, and yield expressions. 
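
As an illustrative aside (using the stock `ast` module, which mirrors these ASDL
definitions in pure Python; nothing below touches the C internals), the three
node kinds from the example can be observed directly:

```python
import ast

# A function containing a yield expression and a return statement.
tree = ast.parse("def f():\n    yield 1\n    return")

func = tree.body[0]
print(type(func).__name__)                # FunctionDef -- a stmt kind
print(type(func.body[0].value).__name__)  # Yield       -- an expr kind
print(type(func.body[1]).__name__)        # Return      -- a stmt kind
```
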
-All three kinds are considered of type ``stmt`` as shown by ``|`` separating +All three kinds are considered of type `stmt` as shown by `|` separating the various kinds. They all take arguments of various kinds and amounts. -Modifiers on the argument type specify the number of values needed; ``?`` -means it is optional, ``*`` means 0 or more, while no modifier means only one -value for the argument and it is required. ``FunctionDef``, for instance, -takes an ``identifier`` for the *name*, ``arguments`` for *args*, zero or more -``stmt`` arguments for *body*, and zero or more ``expr`` arguments for +Modifiers on the argument type specify the number of values needed; `?` +means it is optional, `*` means 0 or more, while no modifier means only one +value for the argument and it is required. `FunctionDef`, for instance, +takes an `identifier` for the *name*, `arguments` for *args*, zero or more +`stmt` arguments for *body*, and zero or more `expr` arguments for *decorators*. Do notice that something like 'arguments', which is a node type, is @@ -132,9 +129,9 @@ The statement definitions above generate the following C structure type: ``` Also generated are a series of constructor functions that allocate (in -this case) a ``stmt_ty`` struct with the appropriate initialization. The -``kind`` field specifies which component of the union is initialized. The -``FunctionDef()`` constructor function sets 'kind' to ``FunctionDef_kind`` and +this case) a `stmt_ty` struct with the appropriate initialization. The +`kind` field specifies which component of the union is initialized. The +`FunctionDef()` constructor function sets 'kind' to `FunctionDef_kind` and initializes the *name*, *args*, *body*, and *attributes* fields. See also @@ -156,13 +153,13 @@ In general, unless you are working on the critical core of the compiler, memory management can be completely ignored. But if you are working at either the very beginning of the compiler or the end, you need to care about how the arena works. All code relating to the arena is in either -[Include/internal/pycore_pyarena.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_pyarena.h) -or [Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c). +[Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h) +or [Python/pyarena.c](../Python/pyarena.c). -``PyArena_New()`` will create a new arena. The returned ``PyArena`` structure +`PyArena_New()` will create a new arena. The returned `PyArena` structure will store pointers to all memory given to it. This does the bookkeeping of what memory needs to be freed when the compiler is finished with the memory it -used. That freeing is done with ``PyArena_Free()``. This only needs to be +used. That freeing is done with `PyArena_Free()`. This only needs to be called in strategic areas where the compiler exits. As stated above, in general you should not have to worry about memory @@ -173,25 +170,25 @@ The only exception comes about when managing a PyObject. Since the rest of Python uses reference counting, there is extra support added to the arena to cleanup each PyObject that was allocated. These cases are very rare. However, if you've allocated a PyObject, you must tell -the arena about it by calling ``PyArena_AddPyObject()``. +the arena about it by calling `PyArena_AddPyObject()`. 
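
Before moving on to parsing, note that the ASDL-declared fields (including the
`?` and `*` modifiers) surface in Python through the `ast` module's `_fields`
tuples. This is only a convenient sanity check -- the exact field lists vary
between Python versions:

```python
import ast

# FunctionDef's fields mirror its ASDL declaration: name, args, body, ...
print(ast.FunctionDef._fields)

# Return's single 'value' field carries the '?' (optional) modifier,
# so a bare `return` simply leaves it as None.
print(ast.Return._fields)  # ('value',)
print(ast.parse("def f():\n    return").body[0].body[0].value)  # None
```
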
Source code to AST
==================

The AST is generated from source code using the function
-``_PyParser_ASTFromString()`` or ``_PyParser_ASTFromFile()``
-[Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c).
+`_PyParser_ASTFromString()` or `_PyParser_ASTFromFile()` in
+[Parser/peg_api.c](../Parser/peg_api.c).

After some checks, a helper function in
-[Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c)
+[Parser/parser.c](../Parser/parser.c)
begins applying production rules on the source code it receives, converting
source code to tokens and matching these tokens recursively to their
corresponding rule. The production rule's corresponding rule function is
called on every match. These rule functions follow the format `xx_rule`,
where *xx* is the grammar rule that the function handles, and are automatically derived from
-[Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram) by
-[Tools/peg_generator/pegen/c_generator.py](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/c_generator.py).
+[Grammar/python.gram](../Grammar/python.gram) by
+[Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py).

Each rule function in turn creates an AST node as it goes along. It does this
by allocating all the new nodes it needs, calling the proper AST node creation
@@ -202,18 +199,15 @@ there are no more rules, an error is set and the parsing ends.

The AST node creation helper functions have the name `_PyAST_{xx}`
where *xx* is the AST node that the function creates. These are defined by the
-ASDL grammar and contained in
-[Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
-(which is generated by
-[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py)
-from
-[Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl)).
-This all leads to a sequence of AST nodes stored in ``asdl_seq`` structs.
+ASDL grammar and contained in [Python/Python-ast.c](../Python/Python-ast.c)
+(which is generated by [Parser/asdl_c.py](../Parser/asdl_c.py)
+from [Parser/Python.asdl](../Parser/Python.asdl)).
+This all leads to a sequence of AST nodes stored in `asdl_seq` structs.

To demonstrate everything explained so far, here's the rule function
responsible for a simple named import statement such as
-``import sys``. Note that error-checking and debugging code has been
-omitted. Removed parts are represented by ``...``.
+`import sys`. Note that error-checking and debugging code has been
+omitted. Removed parts are represented by `...`.
Furthermore, some comments have been added for explanation. These comments
may not be present in the actual code.

@@ -255,55 +249,52 @@ may not be present in the actual code.

To improve backtracking performance, some rules (chosen by applying a
-``(memo)`` flag in the grammar file) are memoized. Each rule function checks if
+`(memo)` flag in the grammar file) are memoized. Each rule function checks if
a memoized version exists and returns that if so, else it continues in the
manner stated in the previous paragraphs.

-There are macros for creating and using ``asdl_xx_seq *`` types, where *xx* is
+There are macros for creating and using `asdl_xx_seq *` types, where *xx* is
a type of the ASDL sequence. Three main types are defined
-manually -- ``generic``, ``identifier`` and ``int``.
These types are found in -[Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c) -and its corresponding header file -[Include/internal/pycore_asdl.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_asdl.h). -Functions and macros for creating ``asdl_xx_seq *`` types are as follows: +manually -- `generic`, `identifier` and `int`. These types are found in +[Python/asdl.c](../Python/asdl.c) and its corresponding header file +[Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h). +Functions and macros for creating `asdl_xx_seq *` types are as follows: -``_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)`` - Allocate memory for an ``asdl_generic_seq`` of the specified length -``_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)`` - Allocate memory for an ``asdl_identifier_seq`` of the specified length -``_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)`` - Allocate memory for an ``asdl_int_seq`` of the specified length +`_Py_asdl_generic_seq_new(Py_ssize_t, PyArena *)` + Allocate memory for an `asdl_generic_seq` of the specified length +`_Py_asdl_identifier_seq_new(Py_ssize_t, PyArena *)` + Allocate memory for an `asdl_identifier_seq` of the specified length +`_Py_asdl_int_seq_new(Py_ssize_t, PyArena *)` + Allocate memory for an `asdl_int_seq` of the specified length In addition to the three types mentioned above, some ASDL sequence types are -automatically generated by -[Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py) -and found in -[Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h). +automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py) and found in +[Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h). Macros for using both manually defined and automatically generated ASDL sequence types are as follows: -``asdl_seq_GET(asdl_xx_seq *, int)`` - Get item held at a specific position in an ``asdl_xx_seq`` -``asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)`` - Set a specific index in an ``asdl_xx_seq`` to the specified value +`asdl_seq_GET(asdl_xx_seq *, int)` + Get item held at a specific position in an `asdl_xx_seq` +`asdl_seq_SET(asdl_xx_seq *, int, stmt_ty)` + Set a specific index in an `asdl_xx_seq` to the specified value Untyped counterparts exist for some of the typed macros. These are useful when a function needs to manipulate a generic ASDL sequence: -``asdl_seq_GET_UNTYPED(asdl_seq *, int)`` - Get item held at a specific position in an ``asdl_seq`` -``asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)`` - Set a specific index in an ``asdl_seq`` to the specified value -``asdl_seq_LEN(asdl_seq *)`` - Return the length of an ``asdl_seq`` or ``asdl_xx_seq`` +`asdl_seq_GET_UNTYPED(asdl_seq *, int)` + Get item held at a specific position in an `asdl_seq` +`asdl_seq_SET_UNTYPED(asdl_seq *, int, stmt_ty)` + Set a specific index in an `asdl_seq` to the specified value +`asdl_seq_LEN(asdl_seq *)` + Return the length of an `asdl_seq` or `asdl_xx_seq` Note that typed macros and functions are recommended over their untyped counterparts. Typed macros carry out checks in debug mode and aid -debugging errors caused by incorrectly casting from ``void *``. +debugging errors caused by incorrectly casting from `void *`. If you are working with statements, you must also worry about keeping track of what line number generated the statement. Currently the line -number is passed as the last parameter to each ``stmt_ty`` function. 
+number is passed as the last parameter to each `stmt_ty` function. See also [PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/). @@ -333,19 +324,19 @@ else: end() ``` -The ``x < 10`` guard is represented by its own basic block that -compares ``x`` with ``10`` and then ends in a conditional jump based on +The `x < 10` guard is represented by its own basic block that +compares `x` with `10` and then ends in a conditional jump based on the result of the comparison. This conditional jump allows the block -to point to both the body of the ``if`` and the body of the ``else``. The -``if`` basic block contains the ``f1()`` and ``f2()`` calls and points to -the ``end()`` basic block. The ``else`` basic block contains the ``g()`` -call and similarly points to the ``end()`` block. +to point to both the body of the `if` and the body of the `else`. The +`if` basic block contains the `f1()` and `f2()` calls and points to +the `end()` basic block. The `else` basic block contains the `g()` +call and similarly points to the `end()` block. -Note that more complex code in the guard, the ``if`` body, or the ``else`` +Note that more complex code in the guard, the `if` body, or the `else` body may be represented by multiple basic blocks. For instance, -short-circuiting boolean logic in a guard like ``if x or y:`` -will produce one basic block that tests the truth value of ``x`` -and then points both (1) to the start of the ``if`` body and (2) to +short-circuiting boolean logic in a guard like `if x or y:` +will produce one basic block that tests the truth value of `x` +and then points both (1) to the start of the `if` body and (2) to a different basic block that tests the truth value of y. CFGs are useful as an intermediate representation of the code because @@ -354,27 +345,24 @@ they are a convenient data structure for optimizations. AST to CFG to bytecode ====================== -The conversion of an ``AST`` to bytecode is initiated by a call to the function -``_PyAST_Compile()`` in -[Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c). +The conversion of an `AST` to bytecode is initiated by a call to the function +`_PyAST_Compile()` in [Python/compile.c](../Python/compile.c). The first step is to construct the symbol table. This is implemented by -``_PySymtable_Build()`` in -[Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c). +`_PySymtable_Build()` in [Python/symtable.c](../Python/symtable.c). This function begins by entering the starting code block for the AST (passed-in) and then calling the proper `symtable_visit_{xx}` function (with *xx* being the AST node type). Next, the AST tree is walked with the various code blocks that delineate the reach of a local variable as blocks are entered and exited using -``symtable_enter_block()`` and ``symtable_exit_block()``, respectively. +`symtable_enter_block()` and `symtable_exit_block()`, respectively. -Once the symbol table is created, the ``AST`` is transformed by ``compiler_codegen()`` -in [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c) -into a sequence of pseudo instructions. These are similar to bytecode, but -in some cases they are more abstract, and are resolved later into actual -bytecode. The construction of this instruction sequence is handled by several -functions that break the task down by various AST node types. The functions are -all named `compiler_visit_{xx}` where *xx* is the name of the node type (such -as ``stmt``, ``expr``, etc.). 
Each function receives a ``struct compiler *``
+Once the symbol table is created, the `AST` is transformed by `compiler_codegen()`
+in [Python/compile.c](../Python/compile.c) into a sequence of pseudo instructions.
+These are similar to bytecode, but in some cases they are more abstract, and are
+resolved later into actual bytecode. The construction of this instruction sequence
+is handled by several functions that break the task down by various AST node types.
+The functions are all named `compiler_visit_{xx}` where *xx* is the name of the node
+type (such as `stmt`, `expr`, etc.). Each function receives a `struct compiler *`
and `{xx}_ty` where *xx* is the AST node type. Typically these functions
consist of a large 'switch' statement, branching based on the kind of
node type passed to it. Simple things are handled inline in the
@@ -382,242 +370,224 @@ node type passed to it. Simple things are handled inline in the
functions named `compiler_{xx}` with *xx* being a descriptive name of what is
being handled.

-When transforming an arbitrary AST node, use the ``VISIT()`` macro.
+When transforming an arbitrary AST node, use the `VISIT()` macro.
The appropriate `compiler_visit_{xx}` function is called, based on the value
passed in for the node type (so `VISIT({c}, expr, {node})` calls
-`compiler_visit_expr({c}, {node})`). The ``VISIT_SEQ()`` macro is very similar,
+`compiler_visit_expr({c}, {node})`). The `VISIT_SEQ()` macro is very similar,
but is called on AST node sequences (those values that were created as
arguments to a node that used the '*' modifier).

Emission of bytecode is handled by the following macros:

-* ``ADDOP(struct compiler *, location, int)``
+* `ADDOP(struct compiler *, location, int)`
    add a specified opcode
-* ``ADDOP_IN_SCOPE(struct compiler *, location, int)``
-    like ``ADDOP``, but also exits current scope; used for adding return value
+* `ADDOP_IN_SCOPE(struct compiler *, location, int)`
+    like `ADDOP`, but also exits current scope; used for adding return value
    opcodes in lambdas and closures
-* ``ADDOP_I(struct compiler *, location, int, Py_ssize_t)``
+* `ADDOP_I(struct compiler *, location, int, Py_ssize_t)`
    add an opcode that takes an integer argument
-* ``ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)``
+* `ADDOP_O(struct compiler *, location, int, PyObject *, TYPE)`
    add an opcode with the proper argument based on the position of the
    specified PyObject in PyObject sequence object, but with no handling of
    mangled names; used for when you
    need to do named lookups of objects such as globals, consts, or
    parameters where name mangling is not possible and the scope of the
    name is known; *TYPE* is the name of PyObject sequence
-    (``names`` or ``varnames``)
-* ``ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)``
-    just like ``ADDOP_O``, but steals a reference to PyObject
-* ``ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)``
-    just like ``ADDOP_O``, but name mangling is also handled; used for
+    (`names` or `varnames`)
+* `ADDOP_N(struct compiler *, location, int, PyObject *, TYPE)`
+    just like `ADDOP_O`, but steals a reference to PyObject
+* `ADDOP_NAME(struct compiler *, location, int, PyObject *, TYPE)`
+    just like `ADDOP_O`, but name mangling is also handled; used for
    attribute loading or importing based on name
-* ``ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)``
-    add the ``LOAD_CONST`` opcode with the proper argument based on the
+* `ADDOP_LOAD_CONST(struct compiler *, location, PyObject *)`
+    add the `LOAD_CONST` opcode with the
proper argument based on the
    position of the specified PyObject in the consts table.
-* ``ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)``
-    just like ``ADDOP_LOAD_CONST_NEW``, but steals a reference to PyObject
-* ``ADDOP_JUMP(struct compiler *, location, int, basicblock *)``
+* `ADDOP_LOAD_CONST_NEW(struct compiler *, location, PyObject *)`
+    just like `ADDOP_LOAD_CONST`, but steals a reference to PyObject
+* `ADDOP_JUMP(struct compiler *, location, int, basicblock *)`
    create a jump to a basic block

-The ``location`` argument is a struct with the source location to be
+The `location` argument is a struct with the source location to be
associated with this instruction. It is typically extracted from an
-``AST`` node with the ``LOC`` macro. The ``NO_LOCATION`` can be used
+`AST` node with the `LOC` macro. The `NO_LOCATION` value can be used
for *synthetic* instructions, which we do not associate with a line
-number at this stage. For example, the implicit ``return None``
+number at this stage. For example, the implicit `return None`
which is added at the end of a function is not associated with any
line in the source code.

There are several helper functions that will emit pseudo-instructions
and are named `compiler_{xx}()` where *xx* is what the function helps
-with (``list``, ``boolop``, etc.). A rather useful one is ``compiler_nameop()``.
+with (`list`, `boolop`, etc.). A rather useful one is `compiler_nameop()`.
This function looks up the scope of a variable and, based on the
expression context, emits the proper opcode to load, store, or delete
the variable.

Once the instruction sequence is created, it is transformed into a CFG
-by ``_PyCfg_FromInstructionSequence()``. Then ``_PyCfg_OptimizeCodeUnit()``
+by `_PyCfg_FromInstructionSequence()`. Then `_PyCfg_OptimizeCodeUnit()`
applies various peephole optimizations, and
-``_PyCfg_OptimizedCfgToInstructionSequence()`` converts the optimized ``CFG``
+`_PyCfg_OptimizedCfgToInstructionSequence()` converts the optimized `CFG`
back into an instruction sequence. These conversions and optimizations are
-implemented in
-[Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c).
+implemented in [Python/flowgraph.c](../Python/flowgraph.c).

Finally, the sequence of pseudo-instructions is converted into actual
bytecode. This includes transforming pseudo instructions into actual instructions,
converting jump targets from logical labels to relative offsets, and
-construction of the
-[exception table](exception_handling.md) and
-[locations table](https://github.com/python/cpython/blob/main/InternalDocs/locations.md).
-The bytecode and tables are then wrapped into a ``PyCodeObject`` along with additional
-metadata, including the ``consts`` and ``names`` arrays, information about function
+construction of the [exception table](exception_handling.md) and
+[locations table](locations.md).
+The bytecode and tables are then wrapped into a `PyCodeObject` along with additional
+metadata, including the `consts` and `names` arrays, and information about the function's
reference to the source code (filename, etc.). All of this is implemented by
-``_PyAssemble_MakeCodeObject()`` in
-[Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c).
+`_PyAssemble_MakeCodeObject()` in [Python/assemble.c](../Python/assemble.c).
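
The tail end of this pipeline is easy to observe from Python. A small,
illustrative session (the exact opcodes and constant ordering vary across
CPython versions):

```python
import dis

code = compile("x = spam + 1", "<example>", "exec")

# Metadata assembled into the PyCodeObject:
print(code.co_consts)  # (1, None) -- None backs the implicit `return None`
print(code.co_names)   # ('spam', 'x')

# The final bytecode, after CFG construction and peephole optimization.
dis.dis(code)
```
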
Code objects ============ -The result of ``PyAST_CompileObject()`` is a ``PyCodeObject`` which is defined in -[Include/cpython/code.h](https://github.com/python/cpython/blob/main/Include/cpython/code.h). +The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in +[Include/cpython/code.h](../Include/cpython/code.h). And with that you now have executable Python bytecode! -The code objects (byte code) are executed in -[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c). +The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c). This file will also need a new case statement for the new opcode in the big switch -statement in ``_PyEval_EvalFrameDefault()``. +statement in `_PyEval_EvalFrameDefault()`. Important files =============== -* [Parser/](https://github.com/python/cpython/blob/main/Parser/) +* [Parser/](../Parser/) - * [Parser/Python.asdl](https://github.com/python/cpython/blob/main/Parser/Python.asdl): + * [Parser/Python.asdl](../Parser/Python.asdl): ASDL syntax file. - * [Parser/asdl.py](https://github.com/python/cpython/blob/main/Parser/asdl.py): + * [Parser/asdl.py](../Parser/asdl.py): Parser for ASDL definition files. Reads in an ASDL description and parses it into an AST that describes it. - * [Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py): + * [Parser/asdl_c.py](../Parser/asdl_c.py): Generate C code from an ASDL description. Generates - [Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c) - and - [Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h). + [Python/Python-ast.c](../Python/Python-ast.c) and + [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h). - * [Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c): - The new PEG parser introduced in Python 3.9. - Generated by - [Tools/peg_generator/pegen/c_generator.py](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/c_generator.py) - from the grammar [Grammar/python.gram](https://github.com/python/cpython/blob/main/Grammar/python.gram). + * [Parser/parser.c](../Parser/parser.c): + The new PEG parser introduced in Python 3.9. Generated by + [Tools/peg_generator/pegen/c_generator.py](../Tools/peg_generator/pegen/c_generator.py) + from the grammar [Grammar/python.gram](../Grammar/python.gram). Creates the AST from source code. Rule functions for their corresponding production rules are found here. - * [Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c): - Contains high-level functions which are - used by the interpreter to create an AST from source code. + * [Parser/peg_api.c](../Parser/peg_api.c): + Contains high-level functions which are used by the interpreter to create + an AST from source code. - * [Parser/pegen.c](https://github.com/python/cpython/blob/main/Parser/pegen.c): + * [Parser/pegen.c](../Parser/pegen.c): Contains helper functions which are used by functions in - [Parser/parser.c](https://github.com/python/cpython/blob/main/Parser/parser.c) - to construct the AST. Also contains helper functions which help raise better error messages - when parsing source code. + [Parser/parser.c](../Parser/parser.c) to construct the AST. Also contains + helper functions which help raise better error messages when parsing source code. 
- * [Parser/pegen.h](https://github.com/python/cpython/blob/main/Parser/pegen.h): - Header file for the corresponding - [Parser/pegen.c](https://github.com/python/cpython/blob/main/Parser/pegen.c). - Also contains definitions of the ``Parser`` and ``Token`` structs. + * [Parser/pegen.h](../Parser/pegen.h): + Header file for the corresponding [Parser/pegen.c](../Parser/pegen.c). + Also contains definitions of the `Parser` and `Token` structs. -* [Python/](https://github.com/python/cpython/blob/main/Python) +* [Python/](../Python) - * [Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c): + * [Python/Python-ast.c](../Python/Python-ast.c): Creates C structs corresponding to the ASDL types. Also contains code for marshalling AST nodes (core ASDL types have marshalling code in - [Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c)). - File automatically generated by - [Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py). + [Python/asdl.c](../Python/asdl.c)). + File automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py). This file must be committed separately after every grammar change - is committed since the ``__version__`` value is set to the latest + is committed since the `__version__` value is set to the latest grammar change revision number. - * [Python/asdl.c](https://github.com/python/cpython/blob/main/Python/asdl.c): + * [Python/asdl.c](../Python/asdl.c): Contains code to handle the ASDL sequence type. Also has code to handle marshalling the core ASDL types, such as number - and identifier. Used by - [Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c) + and identifier. Used by [Python/Python-ast.c](../Python/Python-ast.c) for marshalling AST nodes. - * [Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c): + * [Python/ast.c](../Python/ast.c): Used for validating the AST. - * [Python/ast_opt.c](https://github.com/python/cpython/blob/main/Python/ast_opt.c): + * [Python/ast_opt.c](../Python/ast_opt.c): Optimizes the AST. - * [Python/ast_unparse.c](https://github.com/python/cpython/blob/main/Python/ast_unparse.c): + * [Python/ast_unparse.c](../Python/ast_unparse.c): Converts the AST expression node back into a string (for string annotations). - * [Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c): + * [Python/ceval.c](../Python/ceval.c): Executes byte code (aka, eval loop). - * [Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c): + * [Python/symtable.c](../Python/symtable.c): Generates a symbol table from AST. - * [Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c): + * [Python/pyarena.c](../Python/pyarena.c): Implementation of the arena memory manager. - * [Python/compile.c](https://github.com/python/cpython/blob/main/Python/compile.c): + * [Python/compile.c](../Python/compile.c): Emits pseudo bytecode based on the AST. - * [Python/flowgraph.c](https://github.com/python/cpython/blob/main/Python/flowgraph.c): + * [Python/flowgraph.c](../Python/flowgraph.c): Implements peephole optimizations. - * [Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c): + * [Python/assemble.c](../Python/assemble.c): Constructs a code object from a sequence of pseudo instructions. 
-  * [Python/instruction_sequence.c](https://github.com/python/cpython/blob/main/Python/instruction_sequence.c):
+  * [Python/instruction_sequence.c](../Python/instruction_sequence.c):
    A data structure representing a sequence of bytecode-like pseudo-instructions.

-* [Include/](https://github.com/python/cpython/blob/main/Include/)
+* [Include/](../Include/)

-  * [Include/cpython/code.h](https://github.com/python/cpython/blob/main/Include/cpython/code.h)
-    : Header file for
-    [Objects/codeobject.c](https://github.com/python/cpython/blob/main/Objects/codeobject.c);
-    contains definition of ``PyCodeObject``.
+  * [Include/cpython/code.h](../Include/cpython/code.h)
+    : Header file for [Objects/codeobject.c](../Objects/codeobject.c);
+    contains definition of `PyCodeObject`.

-  * [Include/opcode.h](https://github.com/python/cpython/blob/main/Include/opcode.h)
-    : One of the files that must be modified if
-    [Lib/opcode.py](https://github.com/python/cpython/blob/main/Lib/opcode.py) is.
+  * [Include/opcode.h](../Include/opcode.h)
+    : One of the files that must be modified whenever
+    [Lib/opcode.py](../Lib/opcode.py) is.

-  * [Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h)
+  * [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
    : Contains the actual definitions of the C structs as generated by
-    [Python/Python-ast.c](https://github.com/python/cpython/blob/main/Python/Python-ast.c)
-    Automatically generated by
-    [Parser/asdl_c.py](https://github.com/python/cpython/blob/main/Parser/asdl_c.py).
+    [Python/Python-ast.c](../Python/Python-ast.c).
+    Automatically generated by [Parser/asdl_c.py](../Parser/asdl_c.py).

-  * [Include/internal/pycore_asdl.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_asdl.h)
-    : Header for the corresponding
-    [Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c).
+  * [Include/internal/pycore_asdl.h](../Include/internal/pycore_asdl.h)
+    : Header for the corresponding [Python/asdl.c](../Python/asdl.c).

-  * [Include/internal/pycore_ast.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_ast.h)
-    : Declares ``_PyAST_Validate()`` external (from
-    [Python/ast.c](https://github.com/python/cpython/blob/main/Python/ast.c)).
+  * [Include/internal/pycore_ast.h](../Include/internal/pycore_ast.h)
+    : Declares `_PyAST_Validate()` external (from [Python/ast.c](../Python/ast.c)).

-  * [Include/internal/pycore_symtable.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_symtable.h)
-    : Header for
-    [Python/symtable.c](https://github.com/python/cpython/blob/main/Python/symtable.c).
-    ``struct symtable`` and ``PySTEntryObject`` are defined here.
+  * [Include/internal/pycore_symtable.h](../Include/internal/pycore_symtable.h)
+    : Header for [Python/symtable.c](../Python/symtable.c).
+    `struct symtable` and `PySTEntryObject` are defined here.

-  * [Include/internal/pycore_parser.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_parser.h)
-    : Header for the corresponding
-    [Parser/peg_api.c](https://github.com/python/cpython/blob/main/Parser/peg_api.c).
+  * [Include/internal/pycore_parser.h](../Include/internal/pycore_parser.h)
+    : Header for the corresponding [Parser/peg_api.c](../Parser/peg_api.c).

-  * [Include/internal/pycore_pyarena.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_pyarena.h)
-    : Header file for the corresponding
-    [Python/pyarena.c](https://github.com/python/cpython/blob/main/Python/pyarena.c).
+  * [Include/internal/pycore_pyarena.h](../Include/internal/pycore_pyarena.h)
+    : Header file for the corresponding [Python/pyarena.c](../Python/pyarena.c).

-  * [Include/opcode_ids.h](https://github.com/python/cpython/blob/main/Include/opcode_ids.h)
-    : List of opcodes. Generated from
-    [Python/bytecodes.c](https://github.com/python/cpython/blob/main/Python/bytecodes.c)
+  * [Include/opcode_ids.h](../Include/opcode_ids.h)
+    : List of opcodes. Generated from [Python/bytecodes.c](../Python/bytecodes.c)
    by
-    [Tools/cases_generator/opcode_id_generator.py](https://github.com/python/cpython/blob/main/Tools/cases_generator/opcode_id_generator.py).
+    [Tools/cases_generator/opcode_id_generator.py](../Tools/cases_generator/opcode_id_generator.py).

-* [Objects/](https://github.com/python/cpython/blob/main/Objects/)
+* [Objects/](../Objects/)

-  * [Objects/codeobject.c](https://github.com/python/cpython/blob/main/Objects/codeobject.c)
+  * [Objects/codeobject.c](../Objects/codeobject.c)
    : Contains PyCodeObject-related code.

-  * [Objects/frameobject.c](https://github.com/python/cpython/blob/main/Objects/frameobject.c)
-    : Contains the ``frame_setlineno()`` function which should determine whether it is allowed
+  * [Objects/frameobject.c](../Objects/frameobject.c)
+    : Contains the `frame_setlineno()` function, which determines whether it is allowed
    to make a jump between two points in the bytecode.

-* [Lib/](https://github.com/python/cpython/blob/main/Lib/)
+* [Lib/](../Lib/)

-  * [Lib/opcode.py](https://github.com/python/cpython/blob/main/Lib/opcode.py)
+  * [Lib/opcode.py](../Lib/opcode.py)
    : opcode utilities exposed to Python.

-  * [Include/core/pycore_magic_number.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_magic_number.h)
-    : Home of the magic number (named ``MAGIC_NUMBER``) for bytecode versioning.
+  * [Include/internal/pycore_magic_number.h](../Include/internal/pycore_magic_number.h)
+    : Home of the magic number (named `MAGIC_NUMBER`) for bytecode versioning.


Objects
=======

* [Locations](locations.md): Describes the location table
* [Frames](frames.md): Describes frames and the frame stack
-* [Objects/object_layout.md](https://github.com/python/cpython/blob/main/Objects/object_layout.md): Describes object layout for 3.11 and later
+* [Objects/object_layout.md](../Objects/object_layout.md): Describes object layout for 3.11 and later
* [Exception Handling](exception_handling.md): Describes the exception table


diff --git a/InternalDocs/exception_handling.md b/InternalDocs/exception_handling.md
index 64a346b55b8..14066a5864b 100644
--- a/InternalDocs/exception_handling.md
+++ b/InternalDocs/exception_handling.md
@@ -68,18 +68,16 @@ Handling Exceptions
-------------------

At runtime, when an exception occurs, the interpreter calls
-``get_exception_handler()`` in
-[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c)
+`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c)
to look up the offset of the current instruction in the exception
table. If it finds a handler, control flow transfers to it. Otherwise, the
exception bubbles up to the caller, and the caller's frame is checked for a
handler covering the `CALL` instruction. This repeats until a handler is
found or the topmost frame is reached. If no handler is found, then the
interpreter function
-(``_PyEval_EvalFrameDefault()``) returns NULL. During unwinding,
+(`_PyEval_EvalFrameDefault()`) returns NULL.
During unwinding, the traceback is constructed as each frame is added to it by -``PyTraceBack_Here()``, which is in -[Python/traceback.c](https://github.com/python/cpython/blob/main/Python/traceback.c). +`PyTraceBack_Here()`, which is in [Python/traceback.c](../Python/traceback.c). Along with the location of an exception handler, each entry of the exception table also contains the stack depth of the `try` instruction @@ -174,22 +172,20 @@ which is then encoded as: for a total of five bytes. -The code to construct the exception table is in ``assemble_exception_table()`` -in [Python/assemble.c](https://github.com/python/cpython/blob/main/Python/assemble.c). +The code to construct the exception table is in `assemble_exception_table()` +in [Python/assemble.c](../Python/assemble.c). The interpreter's function to lookup the table by instruction offset is -``get_exception_handler()`` in -[Python/ceval.c](https://github.com/python/cpython/blob/main/Python/ceval.c). -The Python function ``_parse_exception_table()`` in -[Lib/dis.py](https://github.com/python/cpython/blob/main/Lib/dis.py) +`get_exception_handler()` in [Python/ceval.c](../Python/ceval.c). +The Python function `_parse_exception_table()` in [Lib/dis.py](../Lib/dis.py) returns the exception table content as a list of namedtuple instances. Exception Chaining Implementation --------------------------------- [Exception chaining](https://docs.python.org/dev/tutorial/errors.html#exception-chaining) -refers to setting the ``__context__`` and ``__cause__`` fields of an exception as it is -being raised. The ``__context__`` field is set by ``_PyErr_SetObject()`` in -[Python/errors.c](https://github.com/python/cpython/blob/main/Python/errors.c) -(which is ultimately called by all ``PyErr_Set*()`` functions). -The ``__cause__`` field (explicit chaining) is set by the ``RAISE_VARARGS`` bytecode. +refers to setting the `__context__` and `__cause__` fields of an exception as it is +being raised. The `__context__` field is set by `_PyErr_SetObject()` in +[Python/errors.c](../Python/errors.c) (which is ultimately called by all +`PyErr_Set*()` functions). The `__cause__` field (explicit chaining) is set by +the `RAISE_VARARGS` bytecode. diff --git a/InternalDocs/frames.md b/InternalDocs/frames.md index 34682adb1b4..06dc8f0702c 100644 --- a/InternalDocs/frames.md +++ b/InternalDocs/frames.md @@ -10,20 +10,19 @@ of three conceptual sections: globals dict, code object, instruction pointer, stack depth, the previous frame, etc. -The definition of the ``_PyInterpreterFrame`` struct is in -[Include/internal/pycore_frame.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_frame.h). +The definition of the `_PyInterpreterFrame` struct is in +[Include/internal/pycore_frame.h](../Include/internal/pycore_frame.h). # Allocation Python semantics allows frames to outlive the activation, so they need to be allocated outside the C call stack. To reduce overhead and improve locality of reference, most frames are allocated contiguously in a per-thread stack -(see ``_PyThreadState_PushFrame`` in -[Python/pystate.c](https://github.com/python/cpython/blob/main/Python/pystate.c)). +(see `_PyThreadState_PushFrame` in [Python/pystate.c](../Python/pystate.c)). Frames of generators and coroutines are embedded in the generator and coroutine -objects, so are not allocated in the per-thread stack. See ``PyGenObject`` in -[Include/internal/pycore_genobject.h](https://github.com/python/cpython/blob/main/Include/internal/pycore_genobject.h). 
+objects, so are not allocated in the per-thread stack. See `PyGenObject` in +[Include/internal/pycore_genobject.h](../Include/internal/pycore_genobject.h). ## Layout @@ -82,16 +81,15 @@ frames for each activation, but with low runtime overhead. ### Generators and Coroutines -Generators (objects of type ``PyGen_Type``, ``PyCoro_Type`` or -``PyAsyncGen_Type``) have a `_PyInterpreterFrame` embedded in them, so +Generators (objects of type `PyGen_Type`, `PyCoro_Type` or +`PyAsyncGen_Type`) have a `_PyInterpreterFrame` embedded in them, so that they can be created with a single memory allocation. When such an embedded frame is iterated or awaited, it can be linked with frames on the per-thread stack via the linkage fields. If a frame object associated with a generator outlives the generator, then the embedded `_PyInterpreterFrame` is copied into the frame object (see -``take_ownership()`` in -[Python/frame.c](https://github.com/python/cpython/blob/main/Python/frame.c)). +`take_ownership()` in [Python/frame.c](../Python/frame.c)). ### Field names diff --git a/InternalDocs/garbage_collector.md b/InternalDocs/garbage_collector.md index fd0246fa1a6..a6ee5c09e19 100644 --- a/InternalDocs/garbage_collector.md +++ b/InternalDocs/garbage_collector.md @@ -12,7 +12,7 @@ a local variable in some C function. When an object’s reference count becomes the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on. The reference -count field can be examined using the ``sys.getrefcount()`` function (notice that the +count field can be examined using the `sys.getrefcount()` function (notice that the value returned by this function is always 1 more as the function also has a reference to the object when called): @@ -39,7 +39,7 @@ cycles. For instance, consider this code: >>> del container ``` -In this example, ``container`` holds a reference to itself, so even when we remove +In this example, `container` holds a reference to itself, so even when we remove our reference to it (the variable "container") the reference count never falls to 0 because it still has its own internal reference. Therefore it would never be cleaned just by simple reference counting. For this reason some additional machinery @@ -127,7 +127,7 @@ GC for the free-threaded build ------------------------------ In the free-threaded build, Python objects contain a 1-byte field -``ob_gc_bits`` that is used to track garbage collection related state. The +`ob_gc_bits` that is used to track garbage collection related state. The field exists in all objects, including ones that do not support cyclic garbage collection. The field is used to identify objects that are tracked by the collector, ensure that finalizers are called only once per object, @@ -146,14 +146,14 @@ and, during garbage collection, differentiate reachable vs. unreachable objects. | ... | ``` -Note that not all fields are to scale. ``pad`` is two bytes, ``ob_mutex`` and -``ob_gc_bits`` are each one byte, and ``ob_ref_local`` is four bytes. The -other fields, ``ob_tid``, ``ob_ref_shared``, and ``ob_type``, are all +Note that not all fields are to scale. `pad` is two bytes, `ob_mutex` and +`ob_gc_bits` are each one byte, and `ob_ref_local` is four bytes. The +other fields, `ob_tid`, `ob_ref_shared`, and `ob_type`, are all pointer-sized (that is, eight bytes on a 64-bit platform). 
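
As a purely illustrative tally of the field widths quoted above (the layout
itself is defined in C; this is just the arithmetic for a 64-bit platform):

```python
# Field widths, in bytes, for the free-threaded build as described above.
fields = [
    ("ob_tid",        8),  # thread id (pointer-sized)
    ("pad",           2),
    ("ob_mutex",      1),
    ("ob_gc_bits",    1),
    ("ob_ref_local",  4),  # local reference count
    ("ob_ref_shared", 8),  # shared reference count (pointer-sized)
    ("ob_type",       8),  # type pointer
]

offset = 0
for name, size in fields:
    print(f"{name:<14} offset {offset:2d}, size {size}")
    offset += size
print("header total:", offset, "bytes")  # 32 on a 64-bit platform
```
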
-The garbage collector also temporarily repurposes the ``ob_tid`` (thread ID) -and ``ob_ref_local`` (local reference count) fields for other purposes during +The garbage collector also temporarily repurposes the `ob_tid` (thread ID) +and `ob_ref_local` (local reference count) fields for other purposes during collections. @@ -165,17 +165,17 @@ objects with GC support. These APIs can be found in the [Garbage Collector C API documentation](https://docs.python.org/3/c-api/gcsupport.html). Apart from this object structure, the type object for objects supporting garbage -collection must include the ``Py_TPFLAGS_HAVE_GC`` in its ``tp_flags`` slot and -provide an implementation of the ``tp_traverse`` handler. Unless it can be proven +collection must include the `Py_TPFLAGS_HAVE_GC` in its `tp_flags` slot and +provide an implementation of the `tp_traverse` handler. Unless it can be proven that the objects cannot form reference cycles with only objects of its type or unless -the type is immutable, a ``tp_clear`` implementation must also be provided. +the type is immutable, a `tp_clear` implementation must also be provided. Identifying reference cycles ============================ The algorithm that CPython uses to detect those reference cycles is -implemented in the ``gc`` module. The garbage collector **only focuses** +implemented in the `gc` module. The garbage collector **only focuses** on cleaning container objects (that is, objects that can contain a reference to one or more objects). These can be arrays, dictionaries, lists, custom class instances, classes in extension modules, etc. One could think that @@ -195,7 +195,7 @@ the interpreter create cycles everywhere. Some notable examples: To correctly dispose of these objects once they become unreachable, they need to be identified first. To understand how the algorithm works, let’s take the case of a circular linked list which has one link referenced by a -variable ``A``, and one self-referencing object which is completely +variable `A`, and one self-referencing object which is completely unreachable: ```pycon @@ -234,7 +234,7 @@ objects have a refcount larger than the number of incoming references from within the candidate set. Every object that supports garbage collection will have an extra reference -count field initialized to the reference count (``gc_ref`` in the figures) +count field initialized to the reference count (`gc_ref` in the figures) of that object when the algorithm starts. This is because the algorithm needs to modify the reference count to do the computations and in this way the interpreter will not modify the real reference count field. @@ -243,43 +243,43 @@ interpreter will not modify the real reference count field. The GC then iterates over all containers in the first list and decrements by one the `gc_ref` field of any other object that container is referencing. Doing -this makes use of the ``tp_traverse`` slot in the container class (implemented +this makes use of the `tp_traverse` slot in the container class (implemented using the C API or inherited by a superclass) to know what objects are referenced by each container. After all the objects have been scanned, only the objects that have -references from outside the “objects to scan” list will have ``gc_ref > 0``. +references from outside the “objects to scan” list will have `gc_ref > 0`. ![gc-image2](images/python-cyclic-gc-2-new-page.png) -Notice that having ``gc_ref == 0`` does not imply that the object is unreachable. 
-This is because another object that is reachable from the outside (``gc_ref > 0``) -can still have references to it. For instance, the ``link_2`` object in our example -ended having ``gc_ref == 0`` but is referenced still by the ``link_1`` object that +Notice that having `gc_ref == 0` does not imply that the object is unreachable. +This is because another object that is reachable from the outside (`gc_ref > 0`) +can still have references to it. For instance, the `link_2` object in our example +ended having `gc_ref == 0` but is referenced still by the `link_1` object that is reachable from the outside. To obtain the set of objects that are really unreachable, the garbage collector re-scans the container objects using the -``tp_traverse`` slot; this time with a different traverse function that marks objects with -``gc_ref == 0`` as "tentatively unreachable" and then moves them to the +`tp_traverse` slot; this time with a different traverse function that marks objects with +`gc_ref == 0` as "tentatively unreachable" and then moves them to the tentatively unreachable list. The following image depicts the state of the lists in a -moment when the GC processed the ``link_3`` and ``link_4`` objects but has not -processed ``link_1`` and ``link_2`` yet. +moment when the GC processed the `link_3` and `link_4` objects but has not +processed `link_1` and `link_2` yet. ![gc-image3](images/python-cyclic-gc-3-new-page.png) -Then the GC scans the next ``link_1`` object. Because it has ``gc_ref == 1``, +Then the GC scans the next `link_1` object. Because it has `gc_ref == 1`, the gc does not do anything special because it knows it has to be reachable (and is already in what will become the reachable list): ![gc-image4](images/python-cyclic-gc-4-new-page.png) -When the GC encounters an object which is reachable (``gc_ref > 0``), it traverses -its references using the ``tp_traverse`` slot to find all the objects that are +When the GC encounters an object which is reachable (`gc_ref > 0`), it traverses +its references using the `tp_traverse` slot to find all the objects that are reachable from it, moving them to the end of the list of reachable objects (where -they started originally) and setting its ``gc_ref`` field to 1. This is what happens -to ``link_2`` and ``link_3`` below as they are reachable from ``link_1``. From the -state in the previous image and after examining the objects referred to by ``link_1`` -the GC knows that ``link_3`` is reachable after all, so it is moved back to the -original list and its ``gc_ref`` field is set to 1 so that if the GC visits it again, +they started originally) and setting its `gc_ref` field to 1. This is what happens +to `link_2` and `link_3` below as they are reachable from `link_1`. From the +state in the previous image and after examining the objects referred to by `link_1` +the GC knows that `link_3` is reachable after all, so it is moved back to the +original list and its `gc_ref` field is set to 1 so that if the GC visits it again, it will know that it's reachable. To avoid visiting an object twice, the GC marks all -objects that have already been visited once (by unsetting the ``PREV_MASK_COLLECTING`` +objects that have already been visited once (by unsetting the `PREV_MASK_COLLECTING` flag) so that if an object that has already been processed is referenced by some other object, the GC does not process it twice. @@ -295,7 +295,7 @@ list are really unreachable and can thus be garbage collected. 
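
The end-to-end effect of the algorithm is observable from Python. A minimal
sketch (the reported count is illustrative; `gc.collect()` returns the number
of unreachable objects it found):

```python
import gc

class Link:
    def __init__(self):
        self.next = None

gc.collect()            # start from a clean slate

link_1 = Link()
link_2 = Link()
link_1.next = link_2
link_2.next = link_1    # a reference cycle, as in the figures above

del link_1, link_2      # reference counts never reach zero on their own
print(gc.collect())     # >= 2: the cycle detector found and freed them
```
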
Pragmatically, it's important to note that no recursion is required by any of this, and neither does it in any other way require additional memory proportional to the number of objects, number of pointers, or the lengths of pointer chains. Apart from -``O(1)`` storage for internal C needs, the objects themselves contain all the storage +`O(1)` storage for internal C needs, the objects themselves contain all the storage the GC algorithms require. Why moving unreachable objects is better @@ -331,7 +331,7 @@ with the objective of completely destroying these objects. Roughly, the process follows these steps in order: 1. Handle and clear weak references (if any). Weak references to unreachable objects - are set to ``None``. If the weak reference has an associated callback, the callback + are set to `None`. If the weak reference has an associated callback, the callback is enqueued to be called once the clearing of weak references is finished. We only invoke callbacks for weak references that are themselves reachable. If both the weak reference and the pointed-to object are unreachable we do not execute the callback. @@ -339,15 +339,15 @@ follows these steps in order: object and support for weak references predates support for object resurrection. Ignoring the weak reference's callback is fine because both the object and the weakref are going away, so it's legitimate to say the weak reference is going away first. -2. If an object has legacy finalizers (``tp_del`` slot) move it to the - ``gc.garbage`` list. -3. Call the finalizers (``tp_finalize`` slot) and mark the objects as already +2. If an object has legacy finalizers (`tp_del` slot) move it to the + `gc.garbage` list. +3. Call the finalizers (`tp_finalize` slot) and mark the objects as already finalized to avoid calling finalizers twice if the objects are resurrected or if other finalizers have removed the object first. 4. Deal with resurrected objects. If some objects have been resurrected, the GC finds the new subset of objects that are still unreachable by running the cycle detection algorithm again and continues with them. -5. Call the ``tp_clear`` slot of every object so all internal links are broken and +5. Call the `tp_clear` slot of every object so all internal links are broken and the reference counts fall to 0, triggering the destruction of all unreachable objects. @@ -376,9 +376,9 @@ generations. Every collection operates on the entire heap. In order to decide when to run, the collector keeps track of the number of object allocations and deallocations since the last collection. When the number of -allocations minus the number of deallocations exceeds ``threshold_0``, +allocations minus the number of deallocations exceeds `threshold_0`, collection starts. Initially only generation 0 is examined. If generation 0 has -been examined more than ``threshold_1`` times since generation 1 has been +been examined more than `threshold_1` times since generation 1 has been examined, then generation 1 is examined as well. With generation 2, things are a bit more complicated; see [Collecting the oldest generation](#Collecting-the-oldest-generation) for @@ -393,8 +393,8 @@ function: ``` The content of these generations can be examined using the -``gc.get_objects(generation=NUM)`` function and collections can be triggered -specifically in a generation by calling ``gc.collect(generation=NUM)``. +`gc.get_objects(generation=NUM)` function and collections can be triggered +specifically in a generation by calling `gc.collect(generation=NUM)`. 
```pycon
>>> import gc
@@ -433,7 +433,7 @@ Collecting the oldest generation
--------------------------------

In addition to the various configurable thresholds, the GC only triggers a full
-collection of the oldest generation if the ratio ``long_lived_pending / long_lived_total``
+collection of the oldest generation if the ratio `long_lived_pending / long_lived_total`
is above a given value (hardwired to 25%). The reason is that, while "non-full"
collections (that is, collections of the young and middle generations) will always
examine roughly the same number of objects (determined by the aforementioned
@@ -463,12 +463,12 @@ used for tags or to keep other information – most often as a bit field (each
bit a separate tag) – as long as code that uses the pointer masks out these
bits before accessing memory. For example, on a 32-bit architecture (for both
addresses and word size), a word is 32 bits = 4 bytes, so word-aligned
-addresses are always a multiple of 4, hence end in ``00``, leaving the last 2 bits
+addresses are always a multiple of 4, hence end in `00`, leaving the last 2 bits
available; while on a 64-bit architecture, a word is 64 bits = 8 bytes, so
-word-aligned addresses end in ``000``, leaving the last 3 bits available.
+word-aligned addresses end in `000`, leaving the last 3 bits available.

The CPython GC makes use of two fat pointers that correspond to the extra fields
-of ``PyGC_Head`` discussed in the `Memory layout and object structure`_ section:
+of `PyGC_Head` discussed in the
+[Memory layout and object structure](#memory-layout-and-object-structure) section:

> [!WARNING]
> Because of the presence of extra information, "tagged" or "fat" pointers cannot be
> dereferenced directly; the extra bits must be masked off before the pointer is used.
> Special care needs to be taken with functions that directly manipulate the linked
> lists, as they normally assume the pointers inside the lists are in a consistent state.

-- The ``_gc_prev`` field is normally used as the "previous" pointer to maintain the
+- The `_gc_prev` field is normally used as the "previous" pointer to maintain the
  doubly linked list but its lowest two bits are used to keep the flags
-  ``PREV_MASK_COLLECTING`` and ``_PyGC_PREV_MASK_FINALIZED``. Between collections,
-  the only flag that can be present is ``_PyGC_PREV_MASK_FINALIZED`` that indicates
-  if an object has been already finalized. During collections ``_gc_prev`` is
-  temporarily used for storing a copy of the reference count (``gc_ref``), in
+  `PREV_MASK_COLLECTING` and `_PyGC_PREV_MASK_FINALIZED`. Between collections,
+  the only flag that can be present is `_PyGC_PREV_MASK_FINALIZED`, which indicates
+  whether an object has already been finalized. During collections `_gc_prev` is
+  temporarily used for storing a copy of the reference count (`gc_ref`), in
  addition to two flags, and the GC linked list becomes a singly linked list until
-  ``_gc_prev`` is restored.
+  `_gc_prev` is restored.

-- The ``_gc_next`` field is used as the "next" pointer to maintain the doubly linked
+- The `_gc_next` field is used as the "next" pointer to maintain the doubly linked
  list but during collection its lowest bit is used to keep the
-  ``NEXT_MASK_UNREACHABLE`` flag that indicates if an object is tentatively
+  `NEXT_MASK_UNREACHABLE` flag that indicates if an object is tentatively
  unreachable during the cycle detection algorithm. This is a drawback to using only
  doubly linked lists to implement partitions: while most needed operations are
  constant-time, there is no efficient way to determine which partition an object is
  currently in.
  Instead, when that's needed, ad hoc tricks (like the
-  ``NEXT_MASK_UNREACHABLE`` flag) are employed.
+  `NEXT_MASK_UNREACHABLE` flag) are employed.

Optimization: delay tracking containers
=======================================
@@ -531,7 +531,7 @@ benefit from delayed tracking:
   full garbage collection (all generations), the collector will untrack any dictionaries
   whose contents are not tracked.

-The garbage collector module provides the Python function ``is_tracked(obj)``, which returns
+The garbage collector module provides the Python function `is_tracked(obj)`, which returns
the current tracking status of the object. Subsequent garbage collections may change the
tracking status of the object.
@@ -556,20 +556,20 @@ Differences between GC implementations

This section summarizes the differences between the GC implementation in the
default build and the implementation in the free-threaded build.

-The default build implementation makes extensive use of the ``PyGC_Head`` data
+The default build implementation makes extensive use of the `PyGC_Head` data
structure, while the free-threaded build implementation does not use that data
structure.

- The default build implementation stores all tracked objects in a doubly
-  linked list using ``PyGC_Head``. The free-threaded build implementation
+  linked list using `PyGC_Head`. The free-threaded build implementation
  instead relies on the embedded mimalloc memory allocator to scan the heap
  for tracked objects.
-- The default build implementation uses ``PyGC_Head`` for the unreachable
+- The default build implementation uses `PyGC_Head` for the unreachable
  object list. The free-threaded build implementation repurposes the
-  ``ob_tid`` field to store a unreachable objects linked list.
-- The default build implementation stores flags in the ``_gc_prev`` field of
-  ``PyGC_Head``. The free-threaded build implementation stores these flags
-  in ``ob_gc_bits``.
+  `ob_tid` field to store a linked list of unreachable objects.
+- The default build implementation stores flags in the `_gc_prev` field of
+  `PyGC_Head`. The free-threaded build implementation stores these flags
+  in `ob_gc_bits`.

The default build implementation relies on the
diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md
index 11aaf112536..6398ba6cd28 100644
--- a/InternalDocs/parser.md
+++ b/InternalDocs/parser.md
@@ -9,12 +9,12 @@
Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
-the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser.
+the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
-[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram).
+[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.

See the devguide's
[Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
@@ -33,9 +33,9 @@ is ordered.
This means that when writing:

```
rule: A | B | C
```

-a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will
+a parser that implements a context-free grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
-(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will
+(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.
@@ -67,21 +67,21 @@ time complexity with a technique called
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
-is that the parser will naturally use more memory than a simple ``LL(1)`` parser,
+is that the parser will naturally use more memory than a simple `LL(1)` parser,
which is normally table-based.

Key ideas
---------

-- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
+- Alternatives are ordered ( `A | B` is not the same as `B | A` ).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text),
  the PEG parser doesn't have a concept of "where the
-  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
+  [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".

> [!IMPORTANT]
@@ -111,16 +111,16 @@ the following two rules (in these examples, a token is an individual character):
second_rule: ('aa' | 'a' ) 'a'
```

-In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but
-in PEG, one of these two rules accepts the string ``aaa`` but not the string
-``aa``. The other does the opposite -- it accepts the string ``aa``
-but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does
-not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the
-final ``a`` in the rule consume the second, and leaving out the third ``a``.
+In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
+in PEG, one of these two rules accepts the string `aaa` but not the string
+`aa`. The other does the opposite -- it accepts the string `aa`
+but not the string `aaa`. The rule `('a'|'aa')'a'` does
+not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
+final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
-``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does
-not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing
-for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not
+`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
+not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
+for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.

> [!CAUTION]
@@ -137,7 +137,7 @@ one is in almost all cases a mistake, for example:
```

In this example, the second alternative will never be tried because the first one will
-succeed first (even if the input string has an ``'else' block`` that follows). To correctly
+succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:

```
@@ -146,7 +146,7 @@ write this rule:
 | 'if' expression 'then' block
```

-In this case, if the input string doesn't have an ``'else' block``, the first alternative
+In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.

Grammar Syntax
@@ -166,8 +166,8 @@ the rule:
rule_name[return_type]: expression
```

-If the return type is omitted, then a ``void *`` is returned in C and an
-``Any`` in Python.
+If the return type is omitted, then a `void *` is returned in C and an
+`Any` in Python.

Grammar expressions
-------------------
@@ -214,7 +214,7 @@ Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
-``=`` sign. The name can then be used in the action (see below), like this:
+`=` sign. The name can then be used in the action (see below), like this:

```
rule_name[return_type]: '(' a=some_other_rule ')' { a }
```
@@ -387,9 +387,9 @@ returns a valid C-based Python AST:
 | NUMBER
```

-Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset,
-end_lineno, end_col_offset, p->arena``, those being variables automatically
-injected by the parser; ``p`` points to an object that holds on to all state
+Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
+end_lineno, end_col_offset, p->arena`, those being variables automatically
+injected by the parser; `p` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:
@@ -422,50 +422,47 @@ Pegen
Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
-grammar located in
-[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram)
-and produce the final C parser. It contains the following pieces:
+grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
+the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
-  [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen).
+  [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
-  [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+  [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and
  Python AST objects.

-The source code for Pegen lives at
-[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen)
+The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but normally all typical commands to interact with the parser generator
are executed from the main makefile.

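As an illustration of the ordered-choice behaviour described in the sections above, the following Python sketch shows why `('a'|'aa') 'a'` and `('aa'|'a') 'a'` accept different strings. It is illustrative only; `literal`, `choice` and `sequence` are hypothetical helpers written for this document, not part of Pegen:

```python
# Toy PEG combinators: each parser takes (text, pos) and returns the new
# position on success, or None on failure.

def literal(s):
    def parse(text, pos):
        # Match the exact string `s` at the current position.
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def choice(*parsers):
    def parse(text, pos):
        # PEG ordered choice: commit to the *first* alternative that
        # succeeds; later alternatives are never reconsidered.
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def sequence(*parsers):
    def parse(text, pos):
        # Run each parser in turn, threading the position through.
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

first_rule = sequence(choice(literal("a"), literal("aa")), literal("a"))   # ('a'|'aa') 'a'
second_rule = sequence(choice(literal("aa"), literal("a")), literal("a"))  # ('aa'|'a') 'a'

def accepts(rule, text):
    # A string is accepted only if the rule consumes the entire input.
    return rule(text, 0) == len(text)

print(accepts(first_rule, "aa"), accepts(first_rule, "aaa"))    # True False
print(accepts(second_rule, "aa"), accepts(second_rule, "aaa"))  # False True
```
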
How to regenerate the parser
----------------------------

-Once you have made the changes to the grammar files, to regenerate the ``C``
+Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:

```
make regen-pegen
```

-using the ``Makefile`` in the main directory. If you are on Windows you can
+using the `Makefile` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser, or execute:

```
./PCbuild/build.bat --regen
```

-The generated parser file is located at
-[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c).
+The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).

How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
-[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
@@ -488,11 +485,11 @@ Grammatical elements and rules

Pegen has some special grammatical elements and rules:

-- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS.
-- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS.
-- Uppercase names (for example, ``NAME``) denote tokens in the
-  [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file.
-- Rule names starting with ``invalid_`` are used for specialized syntax errors.
+- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
+- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
+- Uppercase names (for example, `NAME`) denote tokens in the
+  [`Grammar/Tokens`](../Grammar/Tokens) file.
+- Rule names starting with `invalid_` are used for specialized syntax errors.

  - These rules are NOT used in the first pass of the parser.
  - Only if the first pass fails to parse, a second pass including the invalid
@@ -509,14 +506,13 @@ Tokenization

It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
-boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for
+boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
-be found in thei
-[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens)
+be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file.
If you change this file to add new tokens, make sure to regenerate the files
by executing:
@@ -532,9 +528,7 @@ the tokens or to execute:
```

How tokens are generated and the rules governing this are completely up to the tokenizer
-([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer)
-and
-[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer));
+([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.

Memoization
@@ -548,7 +542,7 @@ both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
from continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
-by default** except for rules with the special marker ``memo`` after the rule
+by default** except for rules with the special marker `memo` after the rule
name (and type, if present):

```
@@ -567,8 +561,7 @@ To determine whether a new rule needs memoization or not, benchmarking is requir
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows measuring how much each rule uses
-memoization (check the
-[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c)
+memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
@@ -578,9 +571,9 @@ To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

-- ``p``: The parser structure.
-- ``EXTRA``: This is a macro that expands to
-  ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``,
+- `p`: The parser structure.
+- `EXTRA`: This is a macro that expands to
+  `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.

Hard and soft keywords
@@ -590,13 +583,13 @@

> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
-> ``'class'``) while soft keywords are defined using **double quotes** (for example,
-> ``"match"``).
+> `'class'`) while soft keywords are defined using **double quotes** (for example,
+> `"match"`).

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
-(for example, ``x = class + 1``), while soft keywords only get a special
+(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:
@@ -621,7 +614,7 @@ one where they are defined as keywords:

>>> foo(match="Yeah!")
```

-The ``match`` and ``case`` keywords are soft keywords, so that they are
+The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.

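To make the hard/soft distinction concrete, here is a small runnable illustration (it assumes Python 3.10 or later, where the `match` statement exists):

```python
# `match` is a soft keyword: it is only treated as a keyword at the
# start of a match statement, so it remains usable as a regular name.
match = "just a variable"

def describe(value):
    match value:               # here `match` introduces a match statement
        case str():
            return "a string"
        case _:
            return "something else"

print(describe(match))         # prints: a string

# A hard keyword such as `class` can never be used this way:
# `class = 1` is always a SyntaxError.
```
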
@@ -662,7 +655,7 @@ is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will stop at that
exact point. This is done so that any exception set by
calling Python's C API functions is propagated correctly. This also includes
-[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom
syntax error messages.
@@ -684,10 +677,10 @@ grammar.

To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
-(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has
+(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.

-As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic
+As the Python grammar was originally written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.
@@ -699,19 +692,19 @@ can impact this.

To generate more precise syntax errors, custom rules are used. This is a common
practice also in context-free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
-for that construct. In pegen grammars, these rules start with the ``invalid_``
+for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:

1. The first phase will try to parse the input stream without taking into
-   account rules that start with the ``invalid_`` prefix. If the parsing
+   account rules that start with the `invalid_` prefix. If the parsing
   succeeds it will return the generated AST and the second phase will be
   skipped.

2. If the first phase failed, a second parsing attempt is done including the
-   rules that start with an ``invalid_`` prefix. By design this attempt
+   rules that start with an `invalid_` prefix. By design this attempt
   **cannot succeed** and is only executed to give the invalid rules a
   chance to detect specific situations where custom, more precise, syntax
   errors can be raised. This also allows trading a bit of performance for
@@ -723,15 +716,15 @@ acts in two phases:
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
->   [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+>   [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
>   exceptions (or a subclass of it).
-> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not
+> - Make sure **all** invalid rules start with the `invalid_` prefix to not
>   impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
>   (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).

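As a concrete illustration, the specialized message that CPython emits for a Python 2 style `print` statement is produced by one of these `invalid_` rules during the second pass. The exact wording and caret placement vary across versions; on a recent CPython the output looks roughly like this:

```pycon
>>> print "hello"
  File "<stdin>", line 1
    print "hello"
    ^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
```
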
You can find a collection of macros to raise specialized syntax errors in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
+[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros also allow reporting ranges for the custom errors, which
will be highlighted in the tracebacks that will be displayed when the error is
reported.
@@ -746,35 +739,33 @@ displayed when the error is reported.
   $ 42
```

-should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this
+should trigger the syntax error at the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
-``print`` statements in order to create a better error message and you define it as:
+`print` statements in order to create a better error message and you define it as:

```
invalid_print: "print" expression
```

-This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid
-code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass
-of the parser will fail (because of the ``$``) and in the second phase, the rule will match the
-``print(something)`` as ``print`` followed by the variable ``something`` between parentheses and the error
-will be reported there instead of the ``$`` character.
+This will **seem** to work because the parser will correctly parse `print(something)`, as it is valid
+code, so the second phase will never execute. But if you try to parse `print(something) $ 3`, the first
+pass of the parser will fail (because of the `$`) and, in the second phase, the rule will match
+`print(something)` as `print` followed by the variable `something` between parentheses; the error
+will then be reported there instead of at the `$` character.

Generating AST objects
----------------------

The output of the C parser used by CPython, which is generated from the
-[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram),
-is a Python AST object (using C structures). This means that the actions in the
-grammar file generate AST objects when they succeed. Constructing these objects
-can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
+[grammar file](../Grammar/python.gram), is a Python AST object (using C
+structures). This means that the actions in the grammar file generate AST
+objects when they succeed. Constructing these objects can be quite cumbersome
+(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
-declared in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
-header file and defined in the
-[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
-file. The helpers include functions that join AST sequences, get specific elements
+declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
+in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
+helpers include functions that join AST sequences, get specific elements
from them, or perform extra processing on the generated tree.

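Since the grammar actions ultimately build standard Python AST nodes, an easy way to inspect what they produce (without touching the C helpers at all) is the `ast` module, which drives the same pegen-generated parser:

```python
import ast

# ast.parse() runs the pegen-generated C parser and returns the
# resulting tree, wrapped as Python AST objects.
tree = ast.parse("x = [i * i for i in range(3)]")
print(ast.dump(tree, indent=4))
```
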
@@ -788,11 +779,9 @@ from them or to perform extra processing on the generated tree.

As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
-custom helper in
-[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
-and expose it in the
-[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
-header file so that it can be used from the grammar.
+custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
+and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
+it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.
@@ -801,16 +790,15 @@ Testing
=======

There are three files that contain tests for the grammar and the parser:

-- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py)
-- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py)
-- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py)
+- [test_grammar.py](../Lib/test/test_grammar.py)
+- [test_syntax.py](../Lib/test/test_syntax.py)
+- [test_exceptions.py](../Lib/test/test_exceptions.py)

-Check the contents of these files to know which is the best place for new tests, depending
-on the nature of the new feature you are adding.
+Check the contents of these files to know which is the best place for new
+tests, depending on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
-[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test_peg_generator)
-directory.
+[test_peg_generator](../Lib/test_peg_generator) directory.

Debugging generated parsers
@@ -825,33 +813,32 @@ correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
-parser. To do this, you can go to the
-[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator)
+parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:

```
$ python -m pegen python
```

-This will generate a file called ``parse.py`` in the same directory that you
+This will generate a file called `parse.py` in the same directory that you
can use to parse some input:

```
$ python parse.py file_with_source_code_to_test.py
```

-As the generated ``parse.py`` file is just Python code, you can modify it
+As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.

Verbose mode
------------

-When Python is compiled in debug mode (by adding ``--with-pydebug`` when
-running the configure step in Linux or by adding ``-d`` when calling the
-[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)),
-it is possible to activate a **very** verbose mode in the generated parser. This
-is very useful to debug the generated parser and to understand how it works, but it
+When Python is compiled in debug mode (by adding `--with-pydebug` when
+running the configure step on Linux or by adding `-d` when calling the
+[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
+**very** verbose mode in the generated parser. This is very useful to
+debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

> [!NOTE]
@@ -859,13 +846,13 @@ can be a bit hard to understand at first.
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.

-To activate verbose mode you can add the ``-d`` flag when executing Python:
+To activate verbose mode you can add the `-d` flag when executing Python:

```
$ python -d file_to_test.py
```

-This will print **a lot** of output to ``stderr`` so it is probably better to dump
+This will print **a lot** of output to `stderr`, so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:
@@ -873,17 +860,17 @@

```
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```

-Every line is indented by a different amount (``<indentation>``) depending on how
+Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:

-- ``>`` indicates that a rule is going to be attempted to be parsed.
-- ``-`` indicates that a rule has failed to be parsed.
-- ``+`` indicates that a rule has been parsed correctly.
-- ``!`` indicates that an exception or an error has been detected and the parser is unwinding.
+- `>` indicates that a rule is going to be attempted to be parsed.
+- `-` indicates that a rule has failed to be parsed.
+- `+` indicates that a rule has been parsed correctly.
+- `!` indicates that an exception or an error has been detected and the parser is unwinding.

-The ``<token_location>`` part indicates the current index in the token array,
-the ``<rule_name>`` part indicates what rule is being parsed and
-the ``<alternative>`` part indicates what alternative within that rule
+The `<token_location>` part indicates the current index in the token array,
+the `<rule_name>` part indicates what rule is being parsed and
+the `<alternative>` part indicates what alternative within that rule
is being attempted.
@@ -891,4 +878,5 @@ is being attempted.
> **Document history**
>
> Pablo Galindo Salgado - Original author
+>
> Irit Katriel and Jacob Coffee - Convert to Markdown