Finding test isolation issues with PyTest

Matthew Wilkes on 2020-07-06

One of the most frustrating problems you can encounter with your test suite is it not behaving deterministically. Sometimes your tests will pass when you run them, so you submit a pull request and the CI runner rejects your changes because they break the tests. Other times, you see a failing test during a full test-run, then re-run with pytest --lf and suddenly the test is no longer failing. Both of these are common sights when you have test isolation issues.

When writing test fixtures, we usually focus on making sure that the right data is available at the end of the setup phase. We don't think so much about the teardown phase. This can cause problems when the test suite grows, as when we add a test we may see leakage from previous tests.

There are two main ways this can occur. The first is a fixture with a wider scope than a single test whose teardown is incomplete. The second is mocking that changes the global state of the Python interpreter, such as replacing a function without restoring the original. Listing 1 shows an example of a misunderstanding of class and instance variables that results in incorrect mocking.

import sys

import pytest


class PythonVersion:
    def value(self):
        return sys.version_info

    @classmethod
    def format(cls, value):
        if value[2] == 0 and value[3] == "alpha":
            return "{0}.{1}.{2}a{4}".format(*value)
        return "{0}.{1}".format(*value)


@pytest.fixture
def sensor():
    return PythonVersion()


@pytest.fixture
def alpha_sensor():
    def returns_alpha(self):
        return (3, 9, 0, "alpha", 1)
    PythonVersion.value = returns_alpha
    return PythonVersion()


def test_value_is_right(sensor):
    assert sensor.value() == sys.version_info


def test_formatter(sensor):
    assert PythonVersion.format(sensor.value()) == "3.7"


def test_alpha_specialcase(alpha_sensor):
    assert PythonVersion.format(alpha_sensor.value()) == "3.9.0a1"

Listing 1. Some basic tests for the PythonVersion class that do not mock functions correctly.

These tests pass, but they have an isolation issue. Specifically, the alpha_sensor fixture patches out the value() method of PythonVersion but does not ensure that the original is restored. So long as the test runner executes the set of tests in the order that they're defined then there will be no problem.

When we write tests it's quite natural to check that they do what they're supposed to do, then write tests for edge cases afterwards. The tests that don't do any patching are run first, before the state is messed up by later fixtures.
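To see why the patch in Listing 1 leaks, it may help to strip away pytest entirely. The sketch below repeats the assignment made by the alpha_sensor fixture and shows that it affects every instance, including ones created earlier. It reuses the PythonVersion class from Listing 1 in cut-down form; the names before and after are only for illustration.

```python
import sys


class PythonVersion:
    def value(self):
        return sys.version_info


def returns_alpha(self):
    return (3, 9, 0, "alpha", 1)


before = PythonVersion()             # created before any patching
PythonVersion.value = returns_alpha  # the assignment made by alpha_sensor
after = PythonVersion()              # created later, by an unrelated fixture

# Methods are looked up on the class, so both instances see the replacement.
print(before.value())  # (3, 9, 0, 'alpha', 1)
print(after.value())   # (3, 9, 0, 'alpha', 1)
```

Nothing ever assigns the original function back, so the replacement stays in place for the rest of the interpreter's lifetime, which for a test run means every test that follows.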

Randomising test order

One way of finding out when you've done something like this is to randomise your test order. This lets you run tests that are passing and find the assumptions about state that you've inadvertently integrated into your test code. There are a few PyTest plugins that help with test randomisation, but I use pytest-random-order. This is controlled with the --random-order flag. With this in place, the tests will randomly fail about two thirds of the time. Specifically, they will fail any time that the test_alpha_specialcase test is not picked last.
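The two-thirds figure can be checked by listing every possible ordering of the three tests from Listing 1. This is a small sketch that assumes each ordering is equally likely; run_fails is a stand-in for the rule stated above, not something the plugin provides.

```python
from itertools import permutations

tests = ["test_value_is_right", "test_formatter", "test_alpha_specialcase"]


def run_fails(order):
    """A run fails if any test comes after the one that leaks the patch."""
    return order[-1] != "test_alpha_specialcase"


orders = list(permutations(tests))
failing = [order for order in orders if run_fails(order)]
print(f"{len(failing)} of {len(orders)} orderings fail")  # 4 of 6 orderings fail
```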

When the test suite runs with randomisation enabled a couple of lines of context are added to the output:

========================================== test session starts ==========================================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
Using --random-order-bucket=module
Using --random-order-seed=447300

rootdir: /secure/coding/advancedpython.dev/play/pytest-randomisation
plugins: random-order-1.0.4
collected 3 items                                                                                       

listing1.py .FF                                                                                   [100%]

=============================================== FAILURES ================================================
____________________________________________ test_formatter _____________________________________________

sensor = <listing1.PythonVersion object at 0x7fa62a8b88d0>

    def test_formatter(sensor):
>       assert PythonVersion.format(sensor.value()) == "3.7"
E       AssertionError: assert '3.9.0a1' == '3.7'
E         - 3.7
E         + 3.9.0a1

listing1.py:35: AssertionError
__________________________________________ test_value_is_right __________________________________________

sensor = <listing1.PythonVersion object at 0x7fa62a8a8910>

    def test_value_is_right(sensor):
>       assert sensor.value() == sys.version_info
E       AssertionError: assert (3, 9, 0, 'alpha', 1) == sys.version_i...al', serial=0)
E         At index 1 diff: 9 != 7
E         Use -v to get the full diff

listing1.py:31: AssertionError
======================================== short test summary info ========================================
FAILED listing1.py::test_formatter - AssertionError: assert '3.9.0a1' == '3.7'
FAILED listing1.py::test_value_is_right - AssertionError: assert (3, 9, 0, 'alpha', 1) == sys.version_...
====================================== 2 failed, 1 passed in 0.03s ======================================

The two lines beginning with "Using" near the top of the output are the randomisation options that were picked for this test run. This is really helpful, as the hardest test isolation problems to debug are the ones that are least likely to be triggered. It could be a very specific set of circumstances that trigger the bug, which makes reproducing it reliably difficult. Once you've managed to trigger the bug you're looking for, you can re-run the tests with these two parameters to ensure that the tests will be picked in the same order, so it will reliably fail every time.

The seed parameter is the randomisation seed that was used to pick the ordering, but the bucket parameter is just as important for repeatable results. If you've not specified a bucket then module level will be used. This means that the various test files will have their order randomised, and tests within a module will also be randomised. All tests in a specific module will complete before the next module is started. Similarly, --random-order-bucket=class will randomise the test classes and their contents, but will complete a class before moving on to the next one. These both help keep execution times down, as any fixtures that are defined at module or class scope can behave as normal.

If you use the --random-order-bucket=global option then all tests will be randomised completely, so there is no guarantee that tests within a class will be run as a block. Any class-scoped fixtures therefore need to be set up and torn down multiple times depending on the ordering of the tests.

This shouldn't be a problem for any fixture that perfectly reverts its changes during the teardown phase, although it could cause the test run to take many times longer than it would otherwise.
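The difference between the bucket types can be pictured with a toy model. The sketch below is not how pytest-random-order is implemented; it only mimics the behaviour described above with the standard random module. The test ids and the helper names (shuffle_by_module, shuffle_globally, module_switches) are made up for illustration.

```python
import random

# Hypothetical test ids, written as (module, test name).
TESTS = [
    ("test_a.py", "test_one"),
    ("test_a.py", "test_two"),
    ("test_b.py", "test_three"),
    ("test_b.py", "test_four"),
    ("test_c.py", "test_five"),
]


def shuffle_by_module(tests, seed):
    """Shuffle the modules, then the tests inside each module."""
    rng = random.Random(seed)
    buckets = {}
    for module, name in tests:
        buckets.setdefault(module, []).append((module, name))
    modules = list(buckets)
    rng.shuffle(modules)
    ordered = []
    for module in modules:
        rng.shuffle(buckets[module])
        ordered.extend(buckets[module])
    return ordered


def shuffle_globally(tests, seed):
    """Shuffle every test together, ignoring module boundaries."""
    rng = random.Random(seed)
    ordered = list(tests)
    rng.shuffle(ordered)
    return ordered


def module_switches(ordered):
    """Count how often consecutive tests come from different modules."""
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a[0] != b[0])


by_module = shuffle_by_module(TESTS, seed=447300)
everything = shuffle_globally(TESTS, seed=447300)
print(module_switches(by_module))   # 2: each of the three modules runs as one block
print(module_switches(everything))  # depends on the seed; can be higher
```

Each switch between modules is a point where module-scoped fixtures may need to be torn down and set up again, which is why the global bucket can be so much slower.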

Solving isolation issues

Once you've found a randomisation that illustrates the problem, you can begin to find what the cause is. The optimal demonstration is having a test fail very close to the start of the test run. A test failing tells you that one of the tests before it was the cause of the isolation problem, so a test failing right at the end isn't very helpful.

We can see the ordering of tests by running pytest with the -v flag to print each test being run and its status, rather than just . or F. We can then add filters to the test run to limit the tests being run. By trying different filters with the same randomisation settings we can find a minimal set of tests that cause our test to fail. At this step it can be helpful to run the tests with the -x flag, which aborts the test run at the first failure.
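The process of narrowing down the filters can be thought of as a simple loop: drop one earlier test at a time and keep it dropped if the failure survives. The sketch below captures that idea only. Both minimise and the still_fails callback are hypothetical; in practice still_fails would mean re-running the suite by hand (or from a script) with the chosen tests and the same seed and bucket, and here it is replaced by a fake so the example can run on its own.

```python
def minimise(preceding, still_fails):
    """Drop earlier tests one at a time while the target test keeps failing.

    preceding: ids of the tests that ran before the failing test, in order.
    still_fails: callable taking a list of test ids and returning True if
        the target test still fails when only those tests run before it.
    """
    kept = list(preceding)
    for test in list(preceding):
        trial = [t for t in kept if t != test]
        if still_fails(trial):
            kept = trial
    return kept


def fake_still_fails(tests):
    # Stand-in for a real test run: the failure needs the leaking test.
    return "test_alpha_specialcase" in tests


ran_first = ["test_unrelated_a", "test_alpha_specialcase", "test_unrelated_b"]
print(minimise(ran_first, fake_still_fails))  # ['test_alpha_specialcase']
```

Removing tests one at a time gives a set from which nothing more can be dropped individually, which is usually enough to point at the culprit even if it is not the smallest possible set.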

It's not always easy to see what fixtures are involved, especially when there are lots of tests being run. One way of finding out is to write a fixture that introspects all your tests and logs the fixtures in use. One such example is shown as Listing 2.

@pytest.fixture(autouse=True)
def testinfo(request):
    for fixture in request.fixturenames:
        print(f"Using fixture {fixture}")

Listing 2. An autouse fixture to show fixtures in use.

N.B. It's important to use the -s flag when using this fixture, as that causes printed statements to be shown when the test is run, rather than captured and shown only in the report for failing tests.

Once you've found the fixtures involved in your minimal reproduction of the problem you can find which of them is responsible. I generally approach this by (temporarily!) deleting the bodies of the test functions that were run before my failing test and ensuring the test still fails. If it doesn't, that means that some of the test code is responsible for the isolation issue. Assuming it does fail, you can begin to remove fixtures from the set until you find a minimal set that triggers the issue.

A minimal reproducer for a bug is about as good as you can get for debugging, so hopefully from this point you will be able to find the exact problem, potentially using a debugger.
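For the problem in Listing 1, the repair is to make sure the original value() method is put back once the test has finished. One possible way of doing that, sketched here with only the standard library, is to apply the patch through unittest.mock.patch.object inside a context manager. The alpha_sensor context manager below is an illustrative rewrite and not the only option.

```python
import sys
from contextlib import contextmanager
from unittest import mock


class PythonVersion:
    def value(self):
        return sys.version_info

    @classmethod
    def format(cls, value):
        if value[2] == 0 and value[3] == "alpha":
            return "{0}.{1}.{2}a{4}".format(*value)
        return "{0}.{1}".format(*value)


def returns_alpha(self):
    return (3, 9, 0, "alpha", 1)


@contextmanager
def alpha_sensor():
    """Yield a sensor whose value() is patched only inside the with block."""
    # patch.object puts the original attribute back when the block exits,
    # even if the body raises.
    with mock.patch.object(PythonVersion, "value", returns_alpha):
        yield PythonVersion()


with alpha_sensor() as sensor:
    print(PythonVersion.format(sensor.value()))  # 3.9.0a1

# The original method is back, so later tests see the real interpreter version.
print(PythonVersion().value() == sys.version_info)  # True
```

In a pytest suite the same with block can live inside a fixture that uses yield instead of return, so that the patch is undone during teardown. pytest's built-in monkeypatch fixture offers a setattr method that gives the same restore-on-teardown behaviour.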

Writing robust tests

While the ideal is to write tests and fixtures that are perfectly safe to use out of order, it pays to be pragmatic. Tests can often be made robust to isolation issues by limiting the assertions to test only what you intend to test. As such, there are some patterns to avoid in tests, as some isolation issues are very common.

The most common is ordering issues. Dictionaries either have no defined ordering or an implicit ordering (insertion order, from Python 3.7 onwards). If you're looping over the items in a dictionary in your code it's quite possible that the order will change. This is quite common when serialising data, for example generating URLs with parameters or creating a JSON document. For this reason, you should always try to write tests that check what you care about, rather than checking a large string is exactly equal. It may seem counter-intuitive to parse a URL you've just generated to check a single parameter, but that's better than unintentionally asserting that the order of the parameters must always be the same.
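As a sketch of the URL case, the example below builds the same URL from two differently ordered dictionaries and checks a single parameter after parsing. The build_search_url function and the example.com address are hypothetical stand-ins for whatever code is under test.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(params):
    """Hypothetical code under test: builds a URL from a dict of parameters."""
    return "https://example.com/search?" + urlencode(params)


url = build_search_url({"q": "pytest", "page": "2"})
same_url_other_order = build_search_url({"page": "2", "q": "pytest"})

# Brittle: passes or fails depending on how the parameters were ordered.
# assert url == "https://example.com/search?q=pytest&page=2"

# Robust: parse the URL and check only the parameter we care about.
for candidate in (url, same_url_other_order):
    query = parse_qs(urlsplit(candidate).query)
    assert query["page"] == ["2"]
```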

It's also common to have test isolation issues when working with a database. We often want to keep a database between tests because it's slow and error-prone to tear down an entire database and regenerate it for every test. Therefore, we sometimes find problems where data is committed to the database unintentionally, or where some non-transactional resource is updated. Asserting that an object has a specific id is an example of this, as databases will often allow gaps in their ids for better transaction support. Never assert that a database id is exactly something unless it absolutely must be; instead, assert that it has an id, or some other invariant. Similarly, when what you want to check is that a number has changed, for example that deleting an object succeeded, assert on the change rather than on an exact number of query results.
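The same advice can be sketched with an in-memory SQLite database. The readings table, its contents and the count_readings helper are invented for this example; the two leftover rows stand in for data leaked by an earlier test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value TEXT)")
# Rows left behind by an earlier test: the kind of leakage described above.
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)", [("old",), ("older",)]
)


def count_readings(connection):
    """Return the current number of rows in the readings table."""
    return connection.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


cursor = conn.execute("INSERT INTO readings (value) VALUES (?)", ("3.7",))
new_id = cursor.lastrowid

# Check that an id was assigned, not that it is a particular number.
assert new_id is not None

# Check that the count changed by one, not that it equals a fixed total.
before = count_readings(conn)
conn.execute("DELETE FROM readings WHERE id = ?", (new_id,))
assert count_readings(conn) == before - 1
```

Both assertions hold whether or not the leftover rows are present, which is the point: the test checks its own effect and nothing else.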

Some large projects end up declaring a form of isolation bankruptcy: it's too difficult to fix the isolation issues that have crept in over the years, especially when there are many thousands of tests that take hours to run. In this case they use markers and other filters to split the test suite into logical components that don't suffer from isolation issues and provide a script to run these different segments in order. This is far from perfect, but especially for very slow test suites it can make sense as it allows the tests to be run in parallel on different CI hosts. I wouldn't recommend this as a go-to solution, but being able to run the tests in an inconvenient way is better than not being able to run them at all, so it's worth being aware of, especially as the first step in a programme of improving test reliability.