mirror of https://github.com/python/cpython.git synced 2024-11-22 05:26:10 +01:00
cpython/InternalDocs/string_interning.md
Petr Viktorin 49f6beb56a
[3.12] gh-113993: Make interned strings mortal (GH-120520, GH-121364, GH-121903, GH-122303) (#123065)
This backports several PRs for gh-113993, making interned strings mortal so they can be garbage-collected when no longer needed.

* Allow interned strings to be mortal, and fix related issues (GH-120520)

  * Add an InternalDocs file describing how interning should work and how to use it.

  * Add internal functions to *explicitly* request what kind of interning is done:
    - `_PyUnicode_InternMortal`
    - `_PyUnicode_InternImmortal`
    - `_PyUnicode_InternStatic`

  * Switch uses of `PyUnicode_InternInPlace` to those.

  * Disallow using `_Py_SetImmortal` on strings directly.
    You should use `_PyUnicode_InternImmortal` instead:
    - Strings should be interned before immortalization, otherwise you're possibly
      immortalizing a redundant copy.
    - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
      `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
      backports, as they are now part of public API and version-specific ABI.

  * Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

  * Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
    - `_Py_ID`
    - `_Py_STR` (including the empty string)
    - one-character latin-1 singletons

    Now, when you intern a singleton, that exact singleton will be interned.

  * Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

  * Intern `_Py_STR` singletons at startup.

  * Beef up the tests. Cover internal details (marked with `@cpython_only`).

  * Add lots of assertions

* Don't immortalize in PyUnicode_InternInPlace; keep immortalizing in other API (GH-121364)

  * Switch PyUnicode_InternInPlace to _PyUnicode_InternMortal, clarify docs

  * Document immortality in some functions that take `const char *`

  These are PyUnicode_InternFromString;
  PyDict_SetItemString, PyObject_SetAttrString;
  PyObject_DelAttrString;
  and the PyModule_Add convenience functions.

  Always point out a non-immortalizing alternative.

  * Don't immortalize user-provided attr names in _ctypes

* Immortalize names in code objects to avoid crash (GH-121903)

* Intern latin-1 one-byte strings at startup (GH-122303)

There are some 3.12-specific changes, mainly to allow statically allocated strings in deepfreeze. (In 3.13, deepfreeze switched to the general `_Py_ID`/`_Py_STR`.)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
2024-09-27 13:28:48 -07:00


String interning

Interned strings are conceptually part of an interpreter-global set of interned strings, meaning that:

  • no two interned strings have the same content (across an interpreter);
  • two interned strings can be safely compared using pointer equality (the Python is operator).

This is used to optimize dict and attribute lookups, among other things.
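From Python code, the same guarantee is visible through sys.intern. The following is a minimal sketch rather than anything taken from the CPython sources; the helper build() is hypothetical and exists only to create equal strings at runtime, so that the two inputs are produced independently:

```python
import sys


def build(parts):
    # Hypothetical helper: assemble a string at runtime so that the two
    # results below are equal in content but created independently.
    return "".join(parts)


a = build(["string", "_", "interning"])
b = build(["string", "_", "interning"])

# Equal content on its own says nothing about object identity.
assert a == b

# sys.intern returns the canonical object for that content, so the two
# results can be compared by identity instead of character by character.
ia = sys.intern(a)
ib = sys.intern(b)
assert ia is ib
assert ia == a
```

Whether a and b themselves are the same object is an implementation detail; only the interned results are guaranteed to be identical.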

Python uses two different mechanisms to intern strings: singletons and dynamic interning.

Singletons

The 256 possible one-character latin-1 strings, which can be retrieved with _Py_LATIN1_CHR(c), are stored in statically allocated arrays, _PyRuntime.static_objects.strings.ascii and _PyRuntime.static_objects.strings.latin1.

Longer singleton strings are marked in C source with _Py_ID (if the string is a valid C identifier fragment) or _Py_STR (if it needs a separate C-compatible name). These are also stored in statically allocated arrays. They are collected from CPython sources using make regen-global-objects (Tools/build/generate_global_objects.py), which generates code for declaration, initialization and finalization.

The empty string is one of the singletons: _Py_STR(empty).

Deep-frozen modules (see Tools/build/deepfreeze.py) use either singletons or statically allocated strings. These are added to the runtime-global INTERNED_STRINGS table (described below) at runtime initialization, when deepfreeze modules are loaded.

These sets of singletons (_Py_LATIN1_CHR, _Py_ID, _Py_STR, deepfreeze) are disjoint. If you have such a singleton, it (and no other copy) will be interned.

These singletons are interned in a runtime-global lookup table, _PyRuntime.cached_objects.interned_strings (INTERNED_STRINGS), at runtime initialization. The table is immutable from then on, until it is torn down at runtime finalization. It is shared across threads and interpreters without any synchronization.

Dynamically allocated strings

All other strings are allocated dynamically, and have their _PyUnicode_STATE(s).statically_allocated flag set to zero. When interned, such strings are added to an interpreter-wide dict, PyInterpreterState.cached_objects.interned_strings.

The key and value of each entry in this dict reference the same object.
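The "key and value are the same object" layout can be modelled with an ordinary dict. This is a toy sketch in Python, not CPython's implementation; the class InternTable and its methods are hypothetical names chosen for illustration:

```python
class InternTable:
    """Toy model of an interned-strings dict; not CPython's implementation."""

    def __init__(self):
        self._table = {}

    def intern(self, s):
        # On first sight, setdefault stores s as both key and value.
        # On later calls with equal content, it returns the stored object.
        return self._table.setdefault(s, s)

    def entries(self):
        return self._table.items()


table = InternTable()
first = "".join(["spam", "_", "eggs"])
second = "".join(["spam", "_", "eggs"])

canonical = table.intern(first)
assert canonical is first

# An equal string maps to the object that was interned first.
assert table.intern(second) is first

# Every entry's key and value reference the same object.
assert all(key is value for key, value in table.entries())
```

Storing the string as its own value means a single lookup by content yields the canonical object directly.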

Immortality and reference counting

Invariant: Every immortal string is interned.

In practice, this means that you must not use _Py_SetImmortal on a string. (If you know it's already immortal, don't immortalize it; if you know it's not interned you might be immortalizing a redundant copy; if it's interned and mortal it needs extra processing in _PyUnicode_InternImmortal.)

The converse is not true: interned strings can be mortal. For mortal interned strings:

  • the 2 references from the interned dict (key & value) are excluded from their refcount
  • the deallocator (unicode_dealloc) removes the string from the interned dict
  • at shutdown, when the interned dict is cleared, the references are added back
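The first two points can be illustrated with a toy model that tracks reference counts by hand. Everything here (ToyString, ToyInterpreter, and their methods) is hypothetical and only mimics the bookkeeping described above; the model keys its dict by text rather than by the object itself, and it leaves out the shutdown step:

```python
class ToyString:
    """Stand-in for a string object with an explicit reference count."""

    def __init__(self, text):
        self.text = text
        self.refcount = 1  # the creator's reference


class ToyInterpreter:
    """Toy model of mortal interning; not CPython's implementation."""

    def __init__(self):
        self.interned = {}  # text -> ToyString

    def intern_mortal(self, s):
        existing = self.interned.get(s.text)
        if existing is None:
            # The dict's references are deliberately not counted,
            # so the refcount stays as it was.
            self.interned[s.text] = s
            return s
        if existing is not s:
            # Consume the caller's reference and hand back a new one
            # to the canonical object.
            self.decref(s)
            existing.refcount += 1
        return existing

    def decref(self, s):
        s.refcount -= 1
        if s.refcount == 0:
            self._dealloc(s)

    def _dealloc(self, s):
        # The deallocator removes the string from the interned dict,
        # but only if this object is the interned one.
        if self.interned.get(s.text) is s:
            del self.interned[s.text]


interp = ToyInterpreter()
name = interp.intern_mortal(ToyString("attr"))
assert name.refcount == 1          # dict references are excluded

dup = interp.intern_mortal(ToyString("attr"))
assert dup is name and name.refcount == 2

interp.decref(dup)
interp.decref(name)
assert "attr" not in interp.interned  # removed on deallocation
```

Because the dict's references are not counted, dropping the last outside reference is enough to deallocate the string and remove its entry.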

As with any type, you should only immortalize strings that will live until interpreter shutdown. We currently also immortalize strings contained in code objects and similar, specifically in the compiler and in marshal. These are “close enough” to immortal: even in use cases like hot reloading or eval-ing user input, the number of distinct identifiers and string constants is expected to stay low.

Internal API

We have the following internal API for interning:

  • _PyUnicode_InternMortal: just intern the string
  • _PyUnicode_InternImmortal: intern, and immortalize the result
  • _PyUnicode_InternStatic: intern a static singleton (_Py_STR, _Py_ID or one-byte). Not for general use.

All three take an interpreter state and a pointer to a PyObject*, which they modify in place.

The functions take ownership of (“steal”) the reference to their argument, and update the argument with a new reference. This means:

  • They're “reference neutral”.
  • They must not be called with a borrowed reference.

State

The intern state (retrieved by PyUnicode_CHECK_INTERNED(s); stored in _PyUnicode_STATE(s).interned) can be:

  • SSTATE_NOT_INTERNED (defined as 0, which is useful in a boolean context)
  • SSTATE_INTERNED_MORTAL (1)
  • SSTATE_INTERNED_IMMORTAL (2)
  • SSTATE_INTERNED_IMMORTAL_STATIC (3)

The valid transitions between these states are:

  • For dynamically allocated strings:

    • 0 -> 1 (_PyUnicode_InternMortal)
    • 1 -> 2 or 0 -> 2 (_PyUnicode_InternImmortal)

    Using _PyUnicode_InternStatic on these is an error; the other cases don't change the state.

  • Singletons are interned (0 -> 3) at runtime init; after that, none of the interning functions change their state.
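The transitions above can be summarized as a small state machine. This is a sketch in Python that models only the state value. The function names mirror the internal C API for readability but are hypothetical Python stand-ins, and raising ValueError is this model's substitute for whatever CPython does on misuse:

```python
from enum import IntEnum


class InternState(IntEnum):
    # Numeric values as listed in the State section.
    NOT_INTERNED = 0
    INTERNED_MORTAL = 1
    INTERNED_IMMORTAL = 2
    INTERNED_IMMORTAL_STATIC = 3


def intern_mortal(state):
    """Model of the state change made by _PyUnicode_InternMortal."""
    if state == InternState.NOT_INTERNED:
        return InternState.INTERNED_MORTAL
    return state  # already interned: unchanged


def intern_immortal(state):
    """Model of the state change made by _PyUnicode_InternImmortal."""
    if state in (InternState.NOT_INTERNED, InternState.INTERNED_MORTAL):
        return InternState.INTERNED_IMMORTAL
    return state  # already immortal (dynamic or static): unchanged


def intern_static(state, statically_allocated):
    """Model of _PyUnicode_InternStatic; only valid for singletons."""
    if not statically_allocated:
        # Stand-in for the error described above.
        raise ValueError("static interning of a dynamically allocated string")
    return InternState.INTERNED_IMMORTAL_STATIC


S = InternState
assert intern_mortal(S.NOT_INTERNED) == S.INTERNED_MORTAL       # 0 -> 1
assert intern_immortal(S.INTERNED_MORTAL) == S.INTERNED_IMMORTAL  # 1 -> 2
assert intern_immortal(S.NOT_INTERNED) == S.INTERNED_IMMORTAL     # 0 -> 2
assert intern_mortal(S.INTERNED_IMMORTAL) == S.INTERNED_IMMORTAL  # no change
assert intern_mortal(S.INTERNED_IMMORTAL_STATIC) == S.INTERNED_IMMORTAL_STATIC
```

Note that there is no transition back to a lower state: once a string is immortal, interning functions leave it alone.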