0
0
mirror of https://github.com/django/django.git synced 2024-11-29 22:56:46 +01:00
django/docs/topics/db/optimization.txt
2012-11-02 17:58:24 -04:00

341 lines
12 KiB
Plaintext

============================
Database access optimization
============================
Django's database layer provides various ways to help developers get the most
out of their databases. This document gathers together links to the relevant
documentation, and adds various tips, organized under a number of headings that
outline the steps to take when attempting to optimize your database usage.
Profile first
=============
As general programming practice, this goes without saying. Find out :ref:`what
queries you are doing and what they are costing you
<faq-see-raw-sql-queries>`. You may also want to use an external project like
django-debug-toolbar_, or a tool that monitors your database directly.
Remember that you may be optimizing for speed or memory or both, depending on
your requirements. Sometimes optimizing for one will be detrimental to the
other, but sometimes they will help each other. Also, work that is done by the
database process might not have the same cost (to you) as the same amount of
work done in your Python process. It is up to you to decide what your
priorities are, where the balance must lie, and profile all of these as required
since this will depend on your application and server.
With everything that follows, remember to profile after every change to ensure
that the change is a benefit, and a big enough benefit given the decrease in
readability of your code. **All** of the suggestions below come with the caveat
that in your circumstances the general principle might not apply, or might even
be reversed.
.. _django-debug-toolbar: https://github.com/django-debug-toolbar/django-debug-toolbar/
Use standard DB optimization techniques
=======================================
...including:
* Indexes. This is a number one priority, *after* you have determined from
profiling what indexes should be added. Use
:attr:`django.db.models.Field.db_index` to add these from Django.
* Appropriate use of field types.
We will assume you have done the obvious things above. The rest of this document
focuses on how to use Django in such a way that you are not doing unnecessary
work. This document also does not address other optimization techniques that
apply to all expensive operations, such as :doc:`general purpose caching
</topics/cache>`.
Understand QuerySets
====================
Understanding :doc:`QuerySets </ref/models/querysets>` is vital to getting good
performance with simple code. In particular:
Understand QuerySet evaluation
------------------------------
To avoid performance problems, it is important to understand:
* that :ref:`QuerySets are lazy <querysets-are-lazy>`.
* when :ref:`they are evaluated <when-querysets-are-evaluated>`.
* how :ref:`the data is held in memory <caching-and-querysets>`.
Understand cached attributes
----------------------------
As well as caching of the whole ``QuerySet``, there is caching of the result of
attributes on ORM objects. In general, attributes that are not callable will be
cached. For example, assuming the :ref:`example Weblog models
<queryset-model-example>`::
>>> entry = Entry.objects.get(id=1)
>>> entry.blog # Blog object is retrieved at this point
>>> entry.blog # cached version, no DB access
But in general, callable attributes cause DB lookups every time::
>>> entry = Entry.objects.get(id=1)
>>> entry.authors.all() # query performed
>>> entry.authors.all() # query performed again
Be careful when reading template code - the template system does not allow use
of parentheses, but will call callables automatically, hiding the above
distinction.
Be careful with your own custom properties - it is up to you to implement
caching.
Use the ``with`` template tag
-----------------------------
To make use of the caching behavior of ``QuerySet``, you may need to use the
:ttag:`with` template tag.
Use ``iterator()``
------------------
When you have a lot of objects, the caching behavior of the ``QuerySet`` can
cause a large amount of memory to be used. In this case,
:meth:`~django.db.models.query.QuerySet.iterator()` may help.
Do database work in the database rather than in Python
======================================================
For instance:
* At the most basic level, use :ref:`filter and exclude <queryset-api>` to do
filtering in the database.
* Use :ref:`F() object query expressions <query-expressions>` to do filtering
against other fields within the same model.
* Use :doc:`annotate to do aggregation in the database </topics/db/aggregation>`.
If these aren't enough to generate the SQL you need:
Use ``QuerySet.extra()``
------------------------
A less portable but more powerful method is
:meth:`~django.db.models.query.QuerySet.extra()`, which allows some SQL to be
explicitly added to the query. If that still isn't powerful enough:
Use raw SQL
-----------
Write your own :doc:`custom SQL to retrieve data or populate models
</topics/db/sql>`. Use ``django.db.connection.queries`` to find out what Django
is writing for you and start from there.
Retrieve individual objects using a unique, indexed column
==========================================================
There are two reasons to use a column with
:attr:`~django.db.models.Field.unique` or
:attr:`~django.db.models.Field.db_index` when using
:meth:`~django.db.models.query.QuerySet.get` to retrieve individual objects.
First, the query will be quicker because of the underlying database index.
Also, the query could run much slower if multiple objects match the lookup;
having a unique constraint on the column guarantees this will never happen.
So using the :ref:`example Weblog models <queryset-model-example>`::
>>> entry = Entry.objects.get(id=10)
will be quicker than:
>>> entry = Entry.object.get(headline="News Item Title")
because ``id`` is indexed by the database and is guaranteed to be unique.
Doing the following is potentially quite slow:
>>> entry = Entry.objects.get(headline__startswith="News")
First of all, `headline` is not indexed, which will make the underlying
database fetch slower.
Second, the lookup doesn't guarantee that only one object will be returned.
If the query matches more than one object, it will retrieve and transfer all of
them from the database. This penalty could be substantial if hundreds or
thousands of records are returned. The penalty will be compounded if the
database lives on a separate server, where network overhead and latency also
play a factor.
Retrieve everything at once if you know you will need it
========================================================
Hitting the database multiple times for different parts of a single 'set' of
data that you will need all parts of is, in general, less efficient than
retrieving it all in one query. This is particularly important if you have a
query that is executed in a loop, and could therefore end up doing many database
queries, when only one was needed. So:
Use ``QuerySet.select_related()`` and ``prefetch_related()``
------------------------------------------------------------
Understand :meth:`~django.db.models.query.QuerySet.select_related` and
:meth:`~django.db.models.query.QuerySet.prefetch_related` thoroughly, and use
them:
* in view code,
* and in :doc:`managers and default managers </topics/db/managers>` where
appropriate. Be aware when your manager is and is not used; sometimes this is
tricky so don't make assumptions.
Don't retrieve things you don't need
====================================
Use ``QuerySet.values()`` and ``values_list()``
-----------------------------------------------
When you just want a ``dict`` or ``list`` of values, and don't need ORM model
objects, make appropriate usage of
:meth:`~django.db.models.query.QuerySet.values()`.
These can be useful for replacing model objects in template code - as long as
the dicts you supply have the same attributes as those used in the template,
you are fine.
Use ``QuerySet.defer()`` and ``only()``
---------------------------------------
Use :meth:`~django.db.models.query.QuerySet.defer()` and
:meth:`~django.db.models.query.QuerySet.only()` if there are database columns
you know that you won't need (or won't need in most cases) to avoid loading
them. Note that if you *do* use them, the ORM will have to go and get them in
a separate query, making this a pessimization if you use it inappropriately.
Also, be aware that there is some (small extra) overhead incurred inside
Django when constructing a model with deferred fields. Don't be too aggressive
in deferring fields without profiling as the database has to read most of the
non-text, non-VARCHAR data from the disk for a single row in the results, even
if it ends up only using a few columns. The ``defer()`` and ``only()`` methods
are most useful when you can avoid loading a lot of text data or for fields
that might take a lot of processing to convert back to Python. As always,
profile first, then optimize.
Use QuerySet.count()
--------------------
...if you only want the count, rather than doing ``len(queryset)``.
Use QuerySet.exists()
---------------------
...if you only want to find out if at least one result exists, rather than ``if
queryset``.
But:
Don't overuse ``count()`` and ``exists()``
------------------------------------------
If you are going to need other data from the QuerySet, just evaluate it.
For example, assuming an Email model that has a ``body`` attribute and a
many-to-many relation to User, the following template code is optimal:
.. code-block:: html+django
{% if display_inbox %}
{% with emails=user.emails.all %}
{% if emails %}
<p>You have {{ emails|length }} email(s)</p>
{% for email in emails %}
<p>{{ email.body }}</p>
{% endfor %}
{% else %}
<p>No messages today.</p>
{% endif %}
{% endwith %}
{% endif %}
It is optimal because:
1. Since QuerySets are lazy, this does no database queries if 'display_inbox'
is False.
#. Use of :ttag:`with` means that we store ``user.emails.all`` in a variable
for later use, allowing its cache to be re-used.
#. The line ``{% if emails %}`` causes ``QuerySet.__bool__()`` to be called,
which causes the ``user.emails.all()`` query to be run on the database, and
at the least the first line to be turned into an ORM object. If there aren't
any results, it will return False, otherwise True.
#. The use of ``{{ emails|length }}`` calls ``QuerySet.__len__()``, filling
out the rest of the cache without doing another query.
#. The :ttag:`for` loop iterates over the already filled cache.
In total, this code does either one or zero database queries. The only
deliberate optimization performed is the use of the :ttag:`with` tag. Using
``QuerySet.exists()`` or ``QuerySet.count()`` at any point would cause
additional queries.
Use ``QuerySet.update()`` and ``delete()``
------------------------------------------
Rather than retrieve a load of objects, set some values, and save them
individual, use a bulk SQL UPDATE statement, via :ref:`QuerySet.update()
<topics-db-queries-update>`. Similarly, do :ref:`bulk deletes
<topics-db-queries-delete>` where possible.
Note, however, that these bulk update methods cannot call the ``save()`` or
``delete()`` methods of individual instances, which means that any custom
behavior you have added for these methods will not be executed, including
anything driven from the normal database object :doc:`signals </ref/signals>`.
Use foreign key values directly
-------------------------------
If you only need a foreign key value, use the foreign key value that is already on
the object you've got, rather than getting the whole related object and taking
its primary key. i.e. do::
entry.blog_id
instead of::
entry.blog.id
Insert in bulk
==============
When creating objects, where possible, use the
:meth:`~django.db.models.query.QuerySet.bulk_create()` method to reduce the
number of SQL queries. For example::
Entry.objects.bulk_create([
Entry(headline="Python 3.0 Released"),
Entry(headline="Python 3.1 Planned")
])
...is preferable to::
Entry.objects.create(headline="Python 3.0 Released")
Entry.objects.create(headline="Python 3.1 Planned")
Note that there are a number of :meth:`caveats to this method
<django.db.models.query.QuerySet.bulk_create>`, so make sure it's appropriate
for your use case.
This also applies to :class:`ManyToManyFields
<django.db.models.ManyToManyField>`, so doing::
my_band.members.add(me, my_friend)
...is preferable to::
my_band.members.add(me)
my_band.members.add(my_friend)
...where ``Bands`` and ``Artists`` have a many-to-many relationship.