mirror of
https://github.com/python/cpython.git
synced 2024-11-24 17:47:13 +01:00
6f1d448bc1
* Add an InternalDocs file describing how interning should work and how to use it.
* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`
* Switch uses of `PyUnicode_InternInPlace` to those.
* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're
    possibly immortalizing a redundant copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.
* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`,
  used in refleak test machinery.
* Make sure the statically allocated string singletons are unique.
  This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.
* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for
  one-character latin-1 singletons everywhere (including Clinic).
* Intern `_Py_STR` singletons at startup.
* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.
* Beef up the tests. Cover internal details (marked with `@cpython_only`).
* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
123 lines
4.7 KiB
Markdown
# String interning

*Interned* strings are conceptually part of an interpreter-global
*set* of interned strings, meaning that:
- no two interned strings have the same content (across an interpreter);
- two interned strings can be safely compared using pointer equality
  (Python `is`).

This is used to optimize dict and attribute lookups, among other things.

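From Python code, the two properties above can be observed through `sys.intern`. The snippet below is a minimal sketch and not part of the interning implementation; the variable names are illustrative, and the strings are built at run time because equal literals may already be shared by the compiler.

```python
import sys

# Build two equal strings at run time so they start out as distinct objects.
parts = ["interning", "demo"]
a = "-".join(parts)
b = "-".join(parts)

same_content = a == b          # content comparison
distinct_before = a is not b   # two separate objects hold the same text

# sys.intern() returns the canonical object for this content.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = ia is ib         # identity comparison is now sufficient
```

After interning, `ia is ib` holds even though `a` and `b` were created separately, which is what lets lookups compare pointers before falling back to a content comparison.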
Python uses three different mechanisms to intern strings:

- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros.
  These are statically allocated, and collected using `make regen-global-objects`
  (`Tools/build/generate_global_objects.py`), which generates code
  for declaration, initialization and finalization.

  The difference between the two kinds is not important. (A `_Py_ID` string is
  a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain
  non-identifier characters, so it needs a separate C-compatible name.)

  The empty string is in this category (as `_Py_STR(empty)`).

  These singletons are interned in a runtime-global lookup table,
  `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`),
  at runtime initialization.

- The 256 possible one-character latin-1 strings are singletons, which can be
  retrieved with `_Py_LATIN1_CHR(c)` and are stored in runtime-global
  arrays, `_PyRuntime.static_objects.strings.ascii` and
  `_PyRuntime.static_objects.strings.latin1`.

  These are NOT interned at startup in the normal build.
  In the free-threaded build, they are; this avoids modifying the
  global lookup table after threads are started.

  Interning a one-char latin-1 string will always intern the corresponding
  singleton.

- All other strings are allocated dynamically, and have their
  `_PyUnicode_STATE(s).statically_allocated` flag set to zero.
  When interned, such strings are added to an interpreter-wide dict,
  `PyInterpreterState.cached_objects.interned_strings`.

  The key and value of each entry in this dict reference the same object.

The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`)
are disjoint.
If you have such a singleton, it (and no other copy) will be interned.

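The "key and value reference the same object" layout of the interned dict can be modelled in a few lines of Python. This is a toy model for illustration only: `intern_in` and the `interned` dict are hypothetical names, not CPython API, and the real table is maintained in C.

```python
def intern_in(table: dict, s: str) -> str:
    """Return the canonical object for the content of s, adding s if absent."""
    # On a miss, setdefault stores s as both key and value;
    # on a hit, it returns the object that was stored first.
    return table.setdefault(s, s)


interned = {}  # hypothetical stand-in for the interpreter-wide dict

# Two equal strings created separately at run time.
first = "".join(["sp", "am"])
second = "".join(["sp", "am"])

canon1 = intern_in(interned, first)    # miss: first becomes canonical
canon2 = intern_in(interned, second)   # hit: first is returned again

key, value = next(iter(interned.items()))
```

Here `canon1` and `canon2` are both `first`, `second` stays a separate un-interned copy, and `key is value` holds for the single entry.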
## Immortality and reference counting

Invariant: Every immortal string is interned, *except* the one-char latin-1
singletons (which may or may not be interned).

In practice, this means that you must not use `_Py_SetImmortal` on
a string. (If you know it's already immortal, don't immortalize it;
if you know it's not interned you might be immortalizing a redundant copy;
if it's interned and mortal it needs extra processing in
`_PyUnicode_InternImmortal`.)

The converse is not true: interned strings can be mortal.
For mortal interned strings:
- the 2 references from the interned dict (key & value) are excluded from
  their refcount
- the deallocator (`unicode_dealloc`) removes the string from the interned dict
- at shutdown, when the interned dict is cleared, the references are added back

As with any type, you should only immortalize strings that will live until
interpreter shutdown.
We currently also immortalize strings contained in code objects and similar,
specifically in the compiler and in `marshal`.
These are “close enough” to immortal: even in use cases like hot reloading
or `eval`-ing user input, the number of distinct identifiers and string
constants is expected to stay low.

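The bookkeeping for mortal interned strings can be sketched with a toy model. Everything here is hypothetical: `ToyStr`, `intern_mortal` and `decref` are illustrative names with a hand-maintained refcount, not CPython API, and the shutdown step (adding the two references back before the dict is cleared) is left out.

```python
class ToyStr:
    """Hypothetical stand-in for a string object with a manual refcount."""

    def __init__(self, text):
        self.text = text
        self.refcount = 1      # the creator's reference
        self.interned = False


interned = {}  # toy interned dict: text -> ToyStr


def intern_mortal(s):
    canonical = interned.setdefault(s.text, s)
    if canonical is s:
        # The dict's key and value references are deliberately not counted,
        # so the refcount is left unchanged.
        s.interned = True
    return canonical


def decref(s):
    s.refcount -= 1
    if s.refcount == 0 and s.interned:
        # Toy deallocator: drop the entry so the dict never holds a dead string.
        del interned[s.text]


s = ToyStr("spam")
intern_mortal(s)
count_after_intern = s.refcount        # unchanged by interning
decref(s)                              # last counted reference goes away
entry_survives = "spam" in interned    # the entry was removed on deallocation
```

Because the dict's references are not counted, dropping the last ordinary reference is enough to free the string, and the deallocator keeps the dict consistent.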
## Internal API

We have the following *internal* API for interning:

- `_PyUnicode_InternMortal`: just intern the string
- `_PyUnicode_InternImmortal`: intern, and immortalize the result
- `_PyUnicode_InternStatic`: intern a static singleton (`_Py_STR`, `_Py_ID`
  or a one-character latin-1 singleton). Not for general use.

All take an interpreter state, and a pointer to a `PyObject*` which they
modify in place.

The functions take ownership of (“steal”) the reference to their argument,
and update the argument with a *new* reference.
This means:
- They're “reference neutral”.
- They must not be called with a borrowed reference.

## State

The intern state (retrieved by `PyUnicode_CHECK_INTERNED(s)`;
stored in `_PyUnicode_STATE(s).interned`) can be:

- `SSTATE_NOT_INTERNED` (defined as 0, which is useful in a boolean context)
- `SSTATE_INTERNED_MORTAL` (1)
- `SSTATE_INTERNED_IMMORTAL` (2)
- `SSTATE_INTERNED_IMMORTAL_STATIC` (3)

The valid transitions between these states are:

- For dynamically allocated strings:

  - 0 -> 1 (`_PyUnicode_InternMortal`)
  - 1 -> 2 or 0 -> 2 (`_PyUnicode_InternImmortal`)

  Using `_PyUnicode_InternStatic` on these is an error; the other cases
  don't change the state.

- One-char latin-1 singletons can be interned (0 -> 3) using any interning
  function; after that the functions don't change the state.

- Other statically allocated strings are interned (0 -> 3) at runtime init;
  after that, no interning function changes the state.

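The transition rules above can be written down as a small state machine. This is a sketch of the rules as stated in this document, not a translation of the C code: the `State` enum mirrors the `SSTATE_*` values, while `next_state` and its `"mortal"` / `"immortal"` / `"static"` labels are hypothetical names standing in for the three internal functions.

```python
from enum import IntEnum


class State(IntEnum):
    NOT_INTERNED = 0
    INTERNED_MORTAL = 1
    INTERNED_IMMORTAL = 2
    INTERNED_IMMORTAL_STATIC = 3


def next_state(state: State, func: str, statically_allocated: bool) -> State:
    """Toy model: the state of a string after one interning call.

    func is "mortal", "immortal" or "static" (hypothetical labels for
    _PyUnicode_InternMortal, _PyUnicode_InternImmortal, _PyUnicode_InternStatic).
    """
    if func not in ("mortal", "immortal", "static"):
        raise ValueError(f"unknown interning function: {func!r}")
    if statically_allocated:
        # Singletons go 0 -> 3 with any function; afterwards nothing changes.
        return State.INTERNED_IMMORTAL_STATIC
    if func == "static":
        raise ValueError("InternStatic must not be used on dynamic strings")
    # Dynamic strings only ever move upwards: 0 -> 1, 0 -> 2 or 1 -> 2.
    target = State.INTERNED_MORTAL if func == "mortal" else State.INTERNED_IMMORTAL
    return max(state, target)


after_mortal = next_state(State.NOT_INTERNED, "mortal", False)
after_immortal = next_state(after_mortal, "immortal", False)
unchanged = next_state(after_immortal, "mortal", False)
singleton = next_state(State.NOT_INTERNED, "mortal", True)
```

Using `max` captures the rule that an already-immortal string is never downgraded by a later `_PyUnicode_InternMortal` call, and `State.NOT_INTERNED` being 0 keeps it falsy in a boolean context.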