flaky test

A flaky test is an automated test that passes sometimes and fails other times against the same code, the same test suite, and the same environment. The inconsistency is also called test non-determinism, and it erodes trust in continuous integration faster than almost any other quality problem because the team can no longer tell a red build from a real bug.

Martin Fowler popularized the framing in his 2011 article “Eradicating Non-Determinism in Tests.” Google’s testing team has since documented that almost 16% of their tests show some level of flakiness, with roughly 1.5% of test results failing flakily on any given run.

How It Shows Up in Practice

A Python developer usually meets a flaky test in one of three places. A pull request check turns red, then green after a manual rerun, with no code change in between. A CI dashboard surfaces the same handful of tests at the top of a “flakiest tests of the week” report. Or a test is tagged with @pytest.mark.flaky(reruns=3) from the pytest-rerunfailures plugin, which retries that test up to three times before reporting a final failure.

A short example shows one common shape, an assertion that depends on the iteration order of a set:

def get_tags():
    return {"python", "testing", "ci"}

# Order-dependent. The set holds the same three strings every run,
# but a set does not guarantee iteration order across Python processes.
first = next(iter(get_tags()))
print(f"first tag this run: {first!r}")

# Order-independent. Membership does not care about iteration order.
print(f"'python' in tags: {'python' in get_tags()}")

first tag this run: 'ci'
'python' in tags: True

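Asserting that first == "python" would pass on some Python invocations and fail on others, with no code change in between, because Python randomizes string hashes per process by default and a set's iteration order follows those hashes. Asserting membership is stable.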
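To reproduce this kind of flakiness on demand, one option is to launch fresh interpreters with different values of the PYTHONHASHSEED environment variable, which controls string hash randomization. The following is a minimal sketch rather than a definitive diagnostic tool. The names SNIPPET and first_tag are made up for this illustration, and which tag comes first for a given seed depends on the Python build, so no particular output is assumed:

```python
import os
import subprocess
import sys

# The same expression as in the example above, run in a fresh interpreter.
SNIPPET = 'print(next(iter({"python", "testing", "ci"})))'


def first_tag(hash_seed: int) -> str:
    """Return the first tag a new Python process sees for the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# One fresh process per seed. Different seeds may produce different first tags.
observed = {seed: first_tag(seed) for seed in range(5)}
for seed, tag in observed.items():
    print(f"PYTHONHASHSEED={seed}: first tag is {tag!r}")
```

Pinning PYTHONHASHSEED makes the order repeatable for debugging, but the durable fix is still to assert on membership or on a sorted copy rather than on iteration order.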

Common Causes

Most flakiness traces back to a small set of repeating patterns, and concentrates near the top of the test pyramid where tests touch real systems:

Test ordering and isolation: one test relies on data, files, or database rows left behind by the test that ran before it. Running the suite in a different order, or in parallel, exposes the hidden dependency.
Asynchronous behavior: a test uses time.sleep() to wait for a background task and the chosen interval is sometimes too short on a busy CI runner.
External services: a test calls a real network endpoint or a rate-limited third-party API, or it runs on a CI runner without internet access.
Time and randomness: a test calls datetime.now(), uses random.random() without a fixed seed, or generates random UUIDs with uuid.uuid4(), so the inputs differ on each run.
Resource exhaustion: a long-running suite leaks file handles, sockets, or database connections, and the last few tests fail under pressure.
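The first pattern in the list, a hidden ordering dependency, can be sketched without a test runner. In the example below, the shared list _cart, the two test functions, and the run_in_order helper are all hypothetical names invented for this illustration. The helper stands in for a runner executing the same tests in two different orders:

```python
# Module-level state shared by both tests. This is the hidden dependency.
_cart = []


def test_add_item():
    _cart.append("book")
    assert _cart == ["book"]


def test_cart_has_item():
    # Passes only if test_add_item already ran in the same process.
    assert "book" in _cart


def run_in_order(tests):
    """Run the tests in the given order and record each outcome."""
    _cart.clear()  # start each simulated run from a clean state
    outcomes = {}
    for test in tests:
        try:
            test()
            outcomes[test.__name__] = "passed"
        except AssertionError:
            outcomes[test.__name__] = "failed"
    _cart.clear()  # leave nothing behind for the next simulated run
    return outcomes


forward = run_in_order([test_add_item, test_cart_has_item])
reverse = run_in_order([test_cart_has_item, test_add_item])
print("forward order:", forward)
print("reverse order:", reverse)
```

In the forward order both tests pass. In the reverse order test_cart_has_item fails because nothing has populated the list yet, which is the failure a shuffled or parallel run would surface.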

How Teams Respond

Healthy responses fall into three steps, applied in order:

Identify and quarantine. Tag the test as flaky in the tracker, move it out of the blocking pipeline, and treat the quarantine list as a debt that has to shrink over time.
Fix the root cause. Replace sleep with a polling helper, mock the external service, freeze the clock, seed the randomness, or rebuild test data between cases. Common Python helpers include unittest.mock.patch, freezegun, and the responses library.
Last resort, automatic reruns. Plugins like pytest-rerunfailures retry a failing test a fixed number of times and report success if any attempt passes. Reruns mask the symptom rather than fix it, so a healthy team uses them only on tests already marked for repair.
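As one illustration of fixing the root cause, a fixed time.sleep() can be swapped for a polling helper that returns as soon as a condition holds and gives up only after a generous timeout. This is a minimal sketch under stated assumptions: wait_until is a hypothetical helper written for this example, not part of pytest or the standard library, and the threading.Timer stands in for real background work:

```python
import threading
import time


def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline


def test_background_task_finishes():
    done = threading.Event()
    # Stand-in for background work whose duration varies between runs.
    worker = threading.Timer(0.05, done.set)
    worker.start()
    try:
        # Returns shortly after the task completes instead of sleeping a fixed time.
        assert wait_until(done.is_set), "background task did not finish in time"
    finally:
        worker.join()


test_background_task_finishes()
print("background task observed as finished")
```

The timeout can be set well above the expected duration without slowing down the passing case, because the helper stops polling as soon as the condition is true. When the signal is already a threading.Event, calling its wait() method with a timeout achieves the same effect. The polling form is shown because it also works for conditions such as a file appearing or a database row changing.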

The cost of ignoring flaky tests compounds. Once developers learn to manually rerun a red build, they stop investigating real failures too, and the regression value of the whole suite drains away.

Tutorial

Continuous Integration With Python: An Introduction

In this Python tutorial, you'll learn the core concepts behind Continuous Integration (CI) and why they are essential for modern software engineering teams. Find out how to set up Continuous Integration for your Python project to automatically create environments, install dependencies, and run tests.


For additional information on related topics, take a look at the following resources:

Understanding the Python Mock Object Library (Tutorial)
How to Provide Test Fixtures for Django Models in Pytest (Tutorial)
pytest Tutorial: Effective Python Testing (Tutorial)
Build Robust Continuous Integration With Docker and Friends (Tutorial)
Continuous Integration and Deployment for Python With GitHub Actions (Tutorial)
Continuous Integration With Python (Course)
Improving Your Tests With the Python Mock Object Library (Course)
Exploring unittest.mock in Python (Course)
Understanding the Python Mock Object Library (Quiz)
Testing Your Code With pytest (Course)
Effective Testing with Pytest (Quiz)
Python Continuous Integration and Deployment Using GitHub Actions (Course)
GitHub Actions for Python (Quiz)

By Martin Breuss • Updated May 28, 2026
