On the Complexity of Python's list.append

01 Mar 2022

Let’s start with a simple example. There is a list x = [] to which we append new elements using x.append. After each append, we call sys.getsizeof(x) to check how many bytes the list object x itself takes up in memory.

from sys import getsizeof

x = []
for _ in range(3):
    x.append(1)
    print(f"length={len(x)}, {getsizeof(x)} bytes")

The output of the snippet looks like this:

length=1, 88 bytes
length=2, 88 bytes
length=3, 88 bytes

We are adding more stuff to x, but its size stays the same. What’s going on here? If we kept appending past these three elements, the size would eventually increase. We can visualize this by appending 50 times and plotting the list’s size in bytes after each append.
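A minimal way to collect the data could look like the sketch below; the plotting step is assumed to use matplotlib, which the original snippet does not show:

from sys import getsizeof
import matplotlib.pyplot as plt  # assumed plotting library

x, sizes = [], []
for _ in range(50):
    x.append(1)
    sizes.append(getsizeof(x))  # record the size after each append

plt.step(range(1, 51), sizes, where="post")
plt.xlabel("length of x")
plt.ylabel("size in bytes")
plt.show()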

[Figure: list size in bytes after each of 50 appends]

We can see that every so often the size of the list increases, then stays the same for a while before growing again. The list behaves this way because it is internally implemented as a dynamic array.

Dynamic Array Is a Space-Time Tradeoff

Imagine an array of $n$ values. To append to the array, we would need to allocate a block of memory that can hold $n+1$ values, copy all the data to this new memory block, add the new value at the end, and then free up the old block of size $n$ so that it can be reused. The complexity of such an operation is $O(n)$ because every time we want to append, we need to copy $n$ elements. For large $n$, this can take a lot of time.
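To make the cost concrete, here is a sketch of the naive strategy in Python, with a list standing in for a raw memory block; the function naive_append is illustrative, not how CPython actually works:

def naive_append(block, value):
    # Allocate a new block of size n+1, copy all n existing
    # elements into it, then write the new value -- O(n) per append.
    new_block = [None] * (len(block) + 1)
    for i, item in enumerate(block):
        new_block[i] = item
    new_block[-1] = value
    return new_block  # the old block can now be reused

block = []
for i in range(3):
    block = naive_append(block, i)
print(block)  # [0, 1, 2]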

A dynamic array works around this by allocating more memory than it immediately needs. If a dynamic array of size $n$ allocates a block for $n+k$ values during the first append, the next $k-1$ appends will be very fast. The complexity is

\[O\left(\frac{n}{k}\right) < O(n), \quad \forall k > 1\]

because the copying happens in only 1 out of every $k$ appends. This is an example of a space-time tradeoff: the dynamic array is faster but less memory efficient than the naive approach described above.
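Here is a toy dynamic array built on the same idea; the class DynamicArray and its doubling growth policy are simplifications for illustration, not CPython’s actual scheme (more on that below):

class DynamicArray:
    """Toy dynamic array: over-allocates so most appends are O(1)."""

    def __init__(self):
        self._block = []  # stands in for the allocated memory block
        self._size = 0    # number of slots actually in use

    def append(self, value):
        if self._size == len(self._block):
            # Out of space: grow the block (doubling, for simplicity)
            # and copy the existing elements -- the rare O(n) case.
            new_block = [None] * max(1, 2 * len(self._block))
            for i in range(self._size):
                new_block[i] = self._block[i]
            self._block = new_block
        # The common case: write into a pre-allocated slot in O(1).
        self._block[self._size] = value
        self._size += 1

arr = DynamicArray()
for i in range(10):
    arr.append(i)
print(arr._size, len(arr._block))  # 10 16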

Let’s again append 50 times to a list, this time measuring how long each append operation takes. We can use the timeit module to do the measurements.

import timeit
import pandas as pd

def time_appends(n=50):
    x = []
    # Time each of the n consecutive appends individually.
    return [
        timeit.timeit(
            stmt="x.append(1)",
            globals={"x": x},  # run the statement against our list x
            number=1,          # a single execution per measurement
        )
        for _ in range(n)
    ]

# Repeat the experiment many times to average out the noise.
append_times = [time_appends() for _ in range(10_000)]
means = pd.DataFrame(append_times).mean(axis=0)

We repeat the whole thing 10,000 times to filter out noise, ending up with the average time each of the 50 appends takes. If we plot the data, we can see peaks at the exact spots where the list runs out of space and has to copy its contents to a new memory block.

[Figure: mean append time at each list length, with peaks where the list reallocates]

Growth Factor

What is the value of $k$? In Python’s case, it is not a constant. The maximum length $C(i)$ of a list after the $i$-th pre-allocation is

\[C(i) = \sum_{j=0}^{i-1}\left[(j + (j \gg 3) + 6) \land \lnot 3\right]\]

\[k_i = C(i) - C(i - 1)\]

where $\gg$, $\land$, and $\lnot$ are the bitwise right-shift, AND, and complement operators, respectively. You can find the formula in Python’s source code.
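We can reproduce the growth pattern directly. The sketch below simulates the over-allocation rule from list_resize in CPython’s Objects/listobject.c; the function capacities and the simulation loop are our own scaffolding:

def capacities(n_appends=50):
    # Simulate CPython's list over-allocation for a run of appends.
    allocated, size, caps = 0, 0, []
    for _ in range(n_appends):
        size += 1
        if size > allocated:
            # CPython: new_allocated = (newsize + (newsize >> 3) + 6) & ~3
            allocated = (size + (size >> 3) + 6) & ~3
        caps.append(allocated)
    return caps

print(capacities(20))
# [4, 4, 4, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 24, 24, 24, 24]

On the 64-bit build used here, getsizeof reports 56 bytes for the list header plus 8 bytes per allocated slot, which is why lengths 1 through 3 in the first example all showed 88 bytes: the list had already pre-allocated 4 slots.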

List Elements Are Pointers

So far, all the elements in our list x have been of the same type. However, we can add an object of any type to x. How does the list know how much memory to pre-allocate for the next $k-1$ elements if it doesn’t know their type in advance? A string "A" * 1000 of a thousand characters definitely requires more memory than an integer.

[Figure: list elements as pointers to the objects they reference]

The answer is that the list does not store the actual objects. Every element is just a pointer holding the memory address where the object lives. The pointers always have the same size, regardless of the objects they point to.
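We can check this with getsizeof: a list holding a small integer and a list holding a thousand-character string report the same size, because each stores just one pointer (the exact byte counts assume a 64-bit CPython build):

from sys import getsizeof

small = [1]         # one pointer to a small integer
big = ["A" * 1000]  # one pointer to a 1000-character string

# Both lists store a single 8-byte pointer; the string's characters
# live elsewhere and are not counted by getsizeof of the list itself.
print(getsizeof(small), getsizeof(big))  # 64 64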