The bytecode interpreter
Overview
This document describes the workings and implementation of the bytecode interpreter, the part of Python that executes compiled Python code. Its entry point is in Python/ceval.c.
At a high level, the interpreter consists of a loop that iterates over the bytecode instructions, executing each of them via a switch statement that has a case implementing each opcode. This switch statement is generated from the instruction definitions in Python/bytecodes.c, which are written in a DSL developed for this purpose.
Recall that the Python Compiler produces a CodeObject, which contains the bytecode instructions along with static data that is required to execute them, such as the consts list, variable names, exception table, and so on.
When the interpreter's PyEval_EvalCode() function is called to execute a CodeObject, it constructs a Frame and calls _PyEval_EvalFrame() to execute the code object in this frame. The frame holds the dynamic state of the CodeObject's execution, including the instruction pointer, the globals and builtins. It also has a reference to the CodeObject itself.
In addition to the frame, _PyEval_EvalFrame() also receives a Thread State object, tstate, which includes things like the exception state and the recursion depth. The thread state also provides access to the per-interpreter state (tstate->interp), which has a pointer to the per-runtime (that is, truly global) state (tstate->interp->runtime).
Finally, _PyEval_EvalFrame() receives an integer argument throwflag which, when nonzero, indicates that the interpreter should just raise the current exception (this is used in the implementation of gen.throw).
By default, _PyEval_EvalFrame() simply calls _PyEval_EvalFrameDefault() to execute the frame. However, as per PEP 523, this is configurable by setting interp->eval_frame. In the following, we describe the default function, _PyEval_EvalFrameDefault().
Instruction decoding
The first task of the interpreter is to decode the bytecode instructions.
Bytecode is stored as an array of 16-bit code units (_Py_CODEUNIT).
Each code unit contains an 8-bit opcode and an 8-bit argument (oparg), both unsigned.
In order to make the bytecode format independent of the machine byte order when stored on disk, opcode is always the first byte and oparg is always the second byte.
Macros are used to extract the opcode and oparg from a code unit (_Py_OPCODE(word) and _Py_OPARG(word)).
Some instructions (for example, NOP or POP_TOP) have no argument; in this case we ignore oparg.
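The byte layout described above can be illustrated with a small, self-contained sketch. The `code_unit` type and `read_code_unit()` helper are hypothetical stand-ins invented for this example; they are not the definitions of _Py_CODEUNIT, _Py_OPCODE or _Py_OPARG used by CPython.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for _Py_CODEUNIT: two bytes, opcode first. */
typedef struct {
    uint8_t opcode;
    uint8_t oparg;
} code_unit;

/* Read one code unit from its serialized form. Indexing individual bytes
   (instead of loading a 16-bit integer) gives the same result on
   little-endian and big-endian machines. */
static code_unit
read_code_unit(const uint8_t *bytes)
{
    code_unit unit = { bytes[0], bytes[1] };
    return unit;
}
```

Because the opcode is defined as "the first byte" rather than "the low byte", no byte swapping is needed when a .pyc file moves between machines.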
A simplified version of the interpreter's main loop looks like this:
    _Py_CODEUNIT *first_instr = code->co_code_adaptive;
    _Py_CODEUNIT *next_instr = first_instr;
    while (1) {
        _Py_CODEUNIT word = *next_instr++;
        unsigned char opcode = _Py_OPCODE(word);
        unsigned int oparg = _Py_OPARG(word);
        switch (opcode) {
        // ... A case for each opcode ...
        }
    }
This loop iterates over the instructions, decoding each into its opcode and oparg, and then executes the switch case that implements this opcode.
The instruction format supports 256 different opcodes, which is sufficient.
However, it also limits oparg to 8-bit values, which is too restrictive.
To overcome this, the EXTENDED_ARG opcode allows us to prefix any instruction with one or more additional data bytes, which combine into a larger oparg.
For example, this sequence of code units:
EXTENDED_ARG 1
EXTENDED_ARG 0
LOAD_CONST 2
would set opcode to LOAD_CONST and oparg to 65538 (that is, 0x1_00_02).
The compiler should limit itself to at most three EXTENDED_ARG prefixes, to allow the resulting oparg to fit in 32 bits, but the interpreter does not check this.
In the following, a code unit is always two bytes, while an instruction is a sequence of code units consisting of zero to three EXTENDED_ARG opcodes followed by a primary opcode.
The following loop, to be inserted just above the switch statement, will make the above snippet decode a complete instruction:
    while (opcode == EXTENDED_ARG) {
        word = *next_instr++;
        opcode = _Py_OPCODE(word);
        oparg = (oparg << 8) | _Py_OPARG(word);
    }
For various reasons we'll get to later (mostly efficiency, given that EXTENDED_ARG is rare) the actual code is different.
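To check the 65538 example by hand, the decoding loop can be wrapped in a standalone function. This is only a sketch: the `code_unit` and `instruction` types, the `decode_instruction()` name and the opcode numbers are all made up for illustration and do not match CPython's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcode numbers only; the real values are generated. */
enum { OP_LOAD_CONST = 1, OP_EXTENDED_ARG = 2 };

typedef struct { uint8_t opcode; uint8_t oparg; } code_unit;

typedef struct {
    uint8_t opcode;   /* the primary opcode */
    uint32_t oparg;   /* the combined argument */
    size_t length;    /* number of code units consumed */
} instruction;

/* Decode one complete instruction, folding any EXTENDED_ARG prefixes
   into the argument of the primary opcode. */
static instruction
decode_instruction(const code_unit *next_instr)
{
    instruction result = { next_instr->opcode, next_instr->oparg, 1 };
    while (result.opcode == OP_EXTENDED_ARG) {
        next_instr++;
        result.opcode = next_instr->opcode;
        result.oparg = (result.oparg << 8) | next_instr->oparg;
        result.length++;
    }
    return result;
}
```

For the sequence EXTENDED_ARG 1, EXTENDED_ARG 0, LOAD_CONST 2, the argument evolves as 1, then (1 << 8) | 0 = 256, then (256 << 8) | 2 = 65538.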
Jumps
Note that when the switch statement is reached, next_instr (the "instruction offset") already points to the next instruction.
Thus, jump instructions can be implemented by manipulating next_instr:
- A jump forward (JUMP_FORWARD) sets next_instr += oparg.
- A jump backward sets next_instr -= oparg.
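The arithmetic behind these two rules can be sketched with plain indices into the code array. The helper names are hypothetical, offsets are counted in code units, and the sketch ignores inline cache entries.

```c
#include <assert.h>
#include <stddef.h>

/* Both helpers take the index of the jump instruction itself. By the time
   the instruction executes, next_instr has already moved one unit past it. */
static size_t
jump_forward_target(size_t instr_index, unsigned int oparg)
{
    size_t next_instr = instr_index + 1;
    return next_instr + oparg;
}

static size_t
jump_backward_target(size_t instr_index, unsigned int oparg)
{
    size_t next_instr = instr_index + 1;
    assert(oparg <= next_instr);   /* must not jump before the start */
    return next_instr - oparg;
}
```

Note that a backward jump with oparg 1 targets the jump instruction itself, because the offset is relative to the already advanced next_instr.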
Inline cache entries
Some (specialized or specializable) instructions have an associated "inline cache".
The inline cache consists of one or more two-byte entries included in the bytecode array as additional words following the opcode/oparg pair.
The size of the inline cache for a particular instruction is fixed by its opcode.
Moreover, the inline cache size for all instructions in a family of specialized/specializable instructions (for example, LOAD_ATTR, LOAD_ATTR_SLOT, LOAD_ATTR_MODULE) must all be the same. Cache entries are reserved by the compiler and initialized with zeros.
Although they are represented by code units, cache entries do not conform to the opcode/oparg format.
If an instruction has an inline cache, the layout of its cache is described by a struct definition in pycore_code.h (../Include/internal/pycore_code.h).
This allows us to access the cache by casting next_instr to a pointer to this struct.
The size of such a struct must be independent of the machine architecture, word size and alignment requirements. For a 32-bit field, the struct should use _Py_CODEUNIT field[2].
The instruction implementation is responsible for advancing next_instr past the inline cache.
For example, if an instruction's inline cache is four bytes (that is, two code units) in size, the code for the instruction must contain next_instr += 2;.
This is equivalent to a relative forward jump by that many code units.
(In the interpreter definition DSL, this is coded as JUMPBY(n), where n is the number of code units to jump, typically given as a named constant.)
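The following sketch shows what such a cache layout might look like. The `example_cache` struct, its fields and the helper functions are invented for illustration and are not taken from pycore_code.h; in particular, the way the 32-bit value is split across the two code units is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t code_unit;   /* stand-in for _Py_CODEUNIT */

/* Hypothetical cache layout: a counter followed by a 32-bit value.
   The 32-bit field is declared as two code units, so the struct is
   three code units long with no padding on any architecture. */
typedef struct {
    code_unit counter;
    code_unit version[2];
} example_cache;

#define EXAMPLE_CACHE_ENTRIES (sizeof(example_cache) / sizeof(code_unit))

/* Store and load a 32-bit value through a two-unit field. */
static void
write_u32(code_unit *field, uint32_t value)
{
    field[0] = (code_unit)(value & 0xFFFF);
    field[1] = (code_unit)(value >> 16);
}

static uint32_t
read_u32(const code_unit *field)
{
    return (uint32_t)field[0] | ((uint32_t)field[1] << 16);
}

/* Advance past the cache, as the instruction implementation must do. */
static const code_unit *
skip_cache(const code_unit *next_instr)
{
    return next_instr + EXAMPLE_CACHE_ENTRIES;
}
```

Deriving the number of entries from sizeof keeps the skip distance in sync with the struct definition, which is the role a named constant plays in JUMPBY(n).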
Serializing non-zero cache entries would present a problem because the serialization (marshal) format must be independent of the machine byte order.
More information about the use of inline caches can be found in PEP 659.
The evaluation stack
Most instructions read or write some data in the form of object references (PyObject *).
The CPython bytecode interpreter is a stack machine, meaning that its instructions operate by pushing data onto and popping it off the stack.
The stack forms part of the frame for the code object. Its maximum depth is calculated by the compiler and stored in the co_stacksize field of the code object, so that the stack can be pre-allocated as a contiguous array of PyObject* pointers when the frame is created.
The stack effects of each instruction are also exposed in the opcode metadata through two functions that report how many stack elements the instruction consumes and how many it produces (_PyOpcode_num_popped and _PyOpcode_num_pushed).
For example, the BINARY_OP instruction pops two objects from the stack and pushes the result back onto the stack.
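As a rough illustration of how popped and pushed counts determine the stack level, consider the sketch below. The functions `example_num_popped()` and `example_num_pushed()` and the opcode constants are hypothetical; they only mimic the idea of the metadata functions and do not reproduce their real signatures or tables.

```c
#include <assert.h>

/* Illustrative opcodes only. */
enum { EX_LOAD_FAST, EX_POP_TOP, EX_BINARY_OP };

/* How many stack items each instruction consumes. */
static int
example_num_popped(int opcode)
{
    switch (opcode) {
    case EX_POP_TOP:   return 1;
    case EX_BINARY_OP: return 2;
    default:           return 0;
    }
}

/* How many stack items each instruction produces. */
static int
example_num_pushed(int opcode)
{
    switch (opcode) {
    case EX_LOAD_FAST: return 1;
    case EX_BINARY_OP: return 1;
    default:           return 0;
    }
}

/* Stack level after executing `opcode` at stack level `depth`. */
static int
depth_after(int depth, int opcode)
{
    assert(depth >= example_num_popped(opcode));
    return depth - example_num_popped(opcode) + example_num_pushed(opcode);
}
```

With these counts, two loads followed by a binary operation leave exactly one item on the stack.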
The stack grows up in memory; the operation PUSH(x) is equivalent to *stack_pointer++ = x, whereas x = POP() means x = *--stack_pointer.
Overflow and underflow checks are active in debug mode, but are otherwise optimized away.
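A minimal model of these two operations, with assert() playing the role of the debug-mode checks, might look as follows. The `eval_stack` type and the function names are assumptions of this sketch, and it stores void * instead of PyObject * so that it compiles without Python.h.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_SIZE 4   /* plays the role of co_stacksize */

typedef struct {
    void *slots[STACK_SIZE];   /* pre-allocated, contiguous storage */
    void **stack_pointer;      /* points at the next free slot */
} eval_stack;

static void
stack_init(eval_stack *s)
{
    s->stack_pointer = s->slots;
}

static size_t
stack_level(const eval_stack *s)
{
    return (size_t)(s->stack_pointer - s->slots);
}

static void
push(eval_stack *s, void *x)
{
    assert(stack_level(s) < STACK_SIZE);   /* overflow check */
    *s->stack_pointer++ = x;
}

static void *
pop(eval_stack *s)
{
    assert(stack_level(s) > 0);            /* underflow check */
    return *--s->stack_pointer;
}
```

Compiling with NDEBUG defined removes the assert() calls, leaving only the pointer arithmetic.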
At any point during execution, the stack level is knowable based on the instruction pointer alone, and some properties of each item on the stack are also known.
In particular, only a few instructions may push a NULL onto the stack, and the positions that may be NULL are known.
A few other instructions (GET_ITER, FOR_ITER) push or pop an object that is known to be an iterator.
Instruction sequences that do not allow statically knowing the stack depth are deemed illegal; the bytecode compiler never generates such sequences. For example, the following sequence is illegal, because it keeps pushing items on the stack:
LOAD_FAST 0
JUMP_BACKWARD 2
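One way to see why this sequence is rejected is to record the stack level at each instruction and compare it with the level at every jump target. The checker below is a deliberately simplified sketch, not the algorithm used by the compiler: it handles only three made-up opcodes, walks the code linearly, and all of its names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative opcodes only. */
enum { EX_LOAD_FAST, EX_POP_TOP, EX_JUMP_BACKWARD };

typedef struct { int opcode; unsigned int oparg; } example_instr;

#define MAX_INSTRS 16

/* Return true if every backward jump arrives at its target with the
   same stack level that the target had when it was first reached. */
static bool
depth_is_consistent(const example_instr *code, size_t n)
{
    int depth_at[MAX_INSTRS];
    int depth = 0;
    assert(n <= MAX_INSTRS);
    for (size_t i = 0; i < n; i++) {
        depth_at[i] = depth;   /* level before executing instruction i */
        switch (code[i].opcode) {
        case EX_LOAD_FAST:
            depth += 1;
            break;
        case EX_POP_TOP:
            if (depth == 0) {
                return false;   /* would underflow */
            }
            depth -= 1;
            break;
        case EX_JUMP_BACKWARD:
            if (code[i].oparg == 0 || code[i].oparg > i + 1) {
                return false;   /* target outside the visited code */
            }
            if (depth_at[i + 1 - code[i].oparg] != depth) {
                return false;   /* level differs on each iteration */
            }
            break;
        }
    }
    return true;
}
```

For LOAD_FAST 0 followed by JUMP_BACKWARD 2, the target is reached first at level 0 and then at level 1, so the check fails. Inserting a POP_TOP before the jump (and jumping back by 3) restores level 0 and makes the loop consistent.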
Note
Do not confuse the evaluation stack with the call stack, which is used to implement calling and returning from functions.
Error handling
When the implementation of an opcode raises an exception, it jumps to the exception_unwind label in Python/ceval.c.
The exception is then handled as described in the exception handling documentation.
Python-to-Python calls
The _PyEval_EvalFrameDefault() function is recursive, because sometimes the interpreter calls some C function that calls back into the interpreter.
In 3.10 and before, this was the case even when a Python function called another Python function:
The CALL opcode would call the tp_call dispatch function of the callee, which would extract the code object, create a new frame for the call stack, and then call back into the interpreter. This approach is very general but consumes several C stack frames for each nested Python call, thereby increasing the risk of an (unrecoverable) C stack overflow.
Since 3.11, the CALL instruction special-cases function objects to "inline" the call. When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter "jumps" to the start of the callee's bytecode.
When an inlined callee executes a RETURN_VALUE instruction, its frame is popped off the call stack and the interpreter returns to the caller by "jumping" to the return address.
There is a flag in the frame (frame->is_entry) that indicates whether the frame was inlined; the flag is set if the frame was not inlined.
If RETURN_VALUE finds this flag set, it performs the usual cleanup and returns from _PyEval_EvalFrameDefault() altogether, to a C caller.
A similar check is performed when an unhandled exception occurs.
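The role of the flag can be modelled with a toy frame chain. This sketch is an assumption-laden simplification: the `frame` struct has only the two fields needed here, and `push_inlined()` and `return_from()` are hypothetical helpers, not functions in ceval.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct frame {
    struct frame *previous;   /* caller's frame, or NULL */
    bool is_entry;            /* true if the frame was entered from C */
} frame;

/* Model of an inlined call: link the callee to its caller and keep
   running in the same invocation of the interpreter loop. */
static frame *
push_inlined(frame *callee, frame *caller)
{
    callee->previous = caller;
    callee->is_entry = false;
    return callee;
}

/* Model of RETURN_VALUE: pop the current frame. *leave_loop tells the
   caller whether to return to C or to keep interpreting the frame that
   is returned. */
static frame *
return_from(frame *current, bool *leave_loop)
{
    *leave_loop = current->is_entry;
    return current->previous;
}
```

Returning from an inlined frame continues in the caller's bytecode, while returning from an entry frame ends the current invocation of the evaluation function.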
The call stack
Up through 3.10, the call stack was implemented as a singly-linked list of frame objects. This was expensive because each call would require a heap allocation for the stack frame.
Since 3.11, frames are no longer fully-fledged objects. Instead, a leaner internal _PyInterpreterFrame structure is used. It is allocated and initialized by a custom allocator function (_PyThreadState_BumpFramePointer()). Usually a frame allocation is just a pointer bump, which improves memory locality.
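The idea of a pointer-bump allocation can be sketched in a few lines. The `frame_chunk` type and the `bump_alloc()` and `bump_free()` functions are hypothetical; the sketch ignores alignment and does not show how the real allocator obtains a new chunk when the current one is full.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous chunk of memory from which frames are carved. */
typedef struct {
    char *top;     /* next free byte */
    char *limit;   /* end of the chunk */
} frame_chunk;

/* Allocate `size` bytes by advancing `top`. Returns NULL when the chunk
   is full. */
static void *
bump_alloc(frame_chunk *chunk, size_t size)
{
    if ((size_t)(chunk->limit - chunk->top) < size) {
        return NULL;
    }
    void *result = chunk->top;
    chunk->top += size;
    return result;
}

/* Frames are released in reverse order of allocation, so freeing the
   most recent frame just moves `top` back to where that frame started. */
static void
bump_free(frame_chunk *chunk, void *frame)
{
    assert((char *)frame <= chunk->top);
    chunk->top = (char *)frame;
}
```

Because consecutive frames sit next to each other in memory, a call and its return touch adjacent addresses, which is where the locality benefit comes from.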
Sometimes an actual PyFrameObject is needed, such as when Python code calls sys._getframe() or an extension module calls PyEval_GetFrame().
In this case we allocate a proper PyFrameObject and initialize it from the _PyInterpreterFrame.
Things get more complicated when generators are involved, since those do not follow the push/pop model. This includes async functions, which are based on the same mechanism. A generator object has space for a _PyInterpreterFrame structure, including the variable-size part (used for locals and the eval stack).
When a generator (or async) function is first called, a special opcode RETURN_GENERATOR is executed, which is responsible for creating the generator object. The generator object's _PyInterpreterFrame is initialized with a copy of the current stack frame. The current stack frame is then popped off the frame stack and the generator object is returned.
(Details differ depending on the is_entry flag.)
When the generator is resumed, the interpreter pushes its _PyInterpreterFrame onto the frame stack and resumes execution.
See also the generators section.
Introducing a new bytecode instruction
It is occasionally necessary to add a new opcode in order to implement a new feature or change the way that existing features are compiled. This section describes the changes required to do this.
First, you must choose a name for the bytecode, implement it in Python/bytecodes.c and add a documentation entry in Doc/library/dis.rst.
Then run make regen-cases to assign a number for it (see Include/opcode_ids.h) and regenerate a number of files with the actual implementation of the bytecode in Python/generated_cases.c.h and metadata about it in additional files.
With a new bytecode you must also change what is called the "magic number" for .pyc files: bump the value of the variable MAGIC_NUMBER in Lib/importlib/_bootstrap_external.py.
Changing this number will cause all .pyc files with the old MAGIC_NUMBER to be recompiled by the interpreter on import. Whenever MAGIC_NUMBER is changed, the ranges in the magic_values array in PC/launcher.c may also need to be updated. Changes to Lib/importlib/_bootstrap_external.py will take effect only after running make regen-importlib.
Note
Running make regen-importlib before adding the new bytecode target to Python/bytecodes.c (followed by make regen-cases) will result in an error. You should only run make regen-importlib after the new bytecode target has been added.
Note
On Windows, running the ./build.bat script will automatically regenerate the required files without requiring additional arguments.
Finally, you need to introduce the use of the new bytecode. Update Python/codegen.c to emit code with this bytecode.
Optimizations in Python/flowgraph.c may also need to be updated. If the new opcode affects control flow or the block stack, you may have to update the frame_setlineno() function in Objects/frameobject.c. It may also be necessary to update Lib/dis.py if the new opcode interprets its argument in a special way (like FORMAT_VALUE or MAKE_FUNCTION).
If you make a change here that can affect the output of bytecode that
is already in existence and you do not change the magic number, make
sure to delete your old .py(c|o) files! Even though you will end up changing
the magic number if you change the bytecode, while you are debugging your work
you may be changing the bytecode output without constantly bumping up the
magic number. This can leave you with stale .pyc files that will not be
recreated.
Running find . -name '*.py[co]' -exec rm -f '{}' + should delete all .pyc files you have, forcing new ones to be created and thus allowing you to test your new bytecode properly. Run make regen-importlib to update the bytecode of frozen importlib files. You have to run make again after this to recompile the generated C files.