Flaky Tests: The Complete Guide to Detection & Prevention


Same code. Same environment. Different result.

In CI/CD, this is the definition of a flaky test. It is the single biggest obstacle to a fast, reliable release cycle.

Consider the scale of the problem:

  • Google found that 84% of pass-to-fail transitions in their CI involved a flaky test, not a real bug.

  • Atlassian loses thousands of developer hours a year to "retry" culture.

To solve flakiness, you have to stop treating it as a bug and start treating it as a systemic health issue. This guide provides a roadmap based on research from 51 major open-source projects and a 4-pillar framework we use to organize what teams at Google, Slack, and Atlassian have built internally.

TL;DR

Flaky tests come from 6 root causes:

  • Timing - bare sleeps, missing waits, race conditions between test and app
  • Shared state - tests that depend on each other's data or run order
  • Environment - CI vs local differences, resource starvation, container limits
  • External dependencies - remote services, 3rd-party APIs, unreliable backends
  • Resource leaks - unclosed connections, memory leaks, file handles left open
  • Non-determinism - time zones, random values, date-sensitive logic

We organize this guide around 4 pillars, based on patterns from teams at Google, Slack, and Atlassian:

  1. Detect: track per-test pass rates from CI history and reruns; a practical starting point is to investigate anything above ~2% (how to detect)
  2. Notify: assign each test an owner; alert when flakiness crosses a threshold (ownership + alerts)
  3. Triage: capture artifacts on every failure; reproduce with retries off; quarantine while fixing.
  4. Prevent: enforce a flaky-resistant checklist in code review; tighten CI config.

Where to start: pick your single flakiest test. Run it through the debugging workflow. One test, one week, full cycle.

What is a flaky test?

A flaky test is a software test that produces both passing and failing results on the same code, commit, and environment — without any changes. Also called non-deterministic or intermittent test failures, flaky tests are the leading cause of unreliable CI/CD pipelines.

Flaky tests break trust in both directions. A flake to red wastes time investigating a non-bug. A flake to green lets a real bug slip through. Most people only worry about the first kind. The second is worse.

The cost of flaky tests

  • Google (2016): 84% of pass-to-fail transitions are flaky, not real bugs. 16% of all tests have some flakiness. (Google Testing Blog)

  • Atlassian (2025): 150,000 developer hours per year wasted on flaky tests. (Atlassian Engineering)

  • Slack (2022): 57% of build failures were caused by test job failures (flaky + genuinely failing tests). After automated detection and suppression, test job failure rate dropped to under 4%. (Slack Engineering)

Quick math for your team: An industrial case study (Leinen et al., ICST 2024) found that a team of ~30 developers spent 2.5% of their productive time dealing with flaky tests, including 1.3% on repairs alone. Your number will vary. Track it: flaky_tests x avg_investigation_minutes x developer_hourly_rate x frequency.
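As a sketch, that back-of-the-envelope formula in code (the helper name and all inputs are illustrative, not benchmarks; "frequency" is folded into the per-month failure count):

```typescript
// Hypothetical helper mirroring the formula above.
function flakyCostPerMonth(
  flakyFailuresPerMonth: number, // flaky failures hit per month
  investigationMinutes: number,  // avg minutes spent per failure
  hourlyRate: number             // loaded developer cost per hour
): number {
  return (flakyFailuresPerMonth * investigationMinutes * hourlyRate) / 60;
}

// e.g. 40 failures x 20 min x $90/h = $1,200/month
```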

Diagram showing the 4-pillar flaky test lifecycle: detect, notify, triage, prevent

Fowler's 2011 essay identified most of these patterns early. A University of Illinois study (Luo et al., FSE 2014) analyzed 201 flaky-test fixes across 51 open-source projects and found recurring causes such as async waits, concurrency, and test order dependency. Those findings map well to these 6 buckets:

| Category | Summary | What helps |
|---|---|---|
| Timing | Bare sleeps, missing waits, race conditions | Event-driven waits (Playwright auto-waiting; see framework fixes for Cypress and Selenium) |
| Shared state | Tests depend on each other's data or run order | Each test owns its state: fresh browser contexts, isolated setup/teardown |
| Environment | CI vs local differences, non-deterministic environments, resource starvation | Deterministic environments, proper CI resources, capped parallel workers |
| External deps | Remote services, 3rd-party APIs, rate limiting | Mock at the network layer (Playwright, Cypress intercept, WireMock) |
| Resource leaks | Unclosed connections, memory, file handles | Set pools to size 1 (Fowler, 2011); track duration drift |
| Non-determinism | Time zones, random values, date-sensitive logic | Freeze clock (Playwright, Cypress clock, Sinon); seed random values |

The 6 root causes of flaky tests: timing, shared state, environment, external dependencies, resource leaks, and non-determinism

Timing and async coordination issues show up repeatedly in both research and engineering write-ups. An ICSE 2021 study of UI flaky tests found that async-wait subcategories accounted for roughly 45% of UI-specific flaky tests (106 of 235 studied). The pattern is consistent across frameworks. Here is the fix:

checkout.spec.ts
// Brittle - fixed wait guesses when the app will be ready
await page.getByRole('button', { name: 'Submit' }).click();
await page.waitForTimeout(2000);
await expect(page.getByText('Ticket submitted')).toBeVisible();

// Better - wait for the real user-visible outcome
await page.getByRole('button', { name: 'Submit' }).click();
await expect(page.getByText('Ticket submitted')).toBeVisible();

Tip: Tests pass with --workers=1 but fail with --workers=4? Shared state is a strong suspect. Google (2017) found flakiness rose sharply with test size: about 0.5% for small tests vs. 14% for large tests.

How to detect flaky tests

You can't fix what you can't measure. This pillar turns "we have some flaky tests" into "here are the tests causing most of our CI disruption."

1. Rerun-based detection

Run the same test multiple times on the same commit. If a test fails then passes without code changes, it's flaky.

For detection, use repeat/rerun modes: Playwright --repeat-each, pytest pytest-repeat, or loop scripts. This is different from automatic retries (Playwright --retries, Jest jest.retryTimes(), Cypress retries), which re-run failures to get a green build. Retries keep CI moving but mask flakiness. Use repeat-runs to detect, retries to unblock.
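The rerun loop itself is trivial. Here is a sketch in shell, with the real test command stubbed out so the script is self-contained (replace run_test with something like npx playwright test checkout.spec.ts --retries=0):

```shell
#!/usr/bin/env sh
# Stub for the real test command; this stand-in fails on every
# 5th run so the loop has something to count.
run_test() {
  [ $(( $1 % 5 )) -ne 0 ]
}

runs=20
fails=0
i=1
while [ "$i" -le "$runs" ]; do
  run_test "$i" || fails=$((fails + 1))
  i=$((i + 1))
done
echo "failed $fails of $runs runs"   # prints: failed 4 of 20 runs
```

Any result other than 0 failures on the same commit means the test is flaky.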

Limitation: rerun-based detection only catches high-frequency flakes. A test that flakes 1-in-300 runs won't show up in 10 reruns. That's where heuristic-based detection and historical cross-run analysis pay off. For a more advanced approach, DeFlaker (Bell et al., ICSE 2018) detects flaky tests without rerunning them by tracking which code a test actually covers and comparing against recent changes.

2. Historical analysis and metrics

Per-test flakiness scores over a rolling window give you a much clearer picture than any single run. Here's a minimal data model (from Reddit Engineering's FTQS):

| Column | Example |
|---|---|
| test_name | checkout.spec.ts > guest pays by card |
| status | pass, fail, flaky |
| commit_sha | a1b2c3d |
| branch | main |
| duration_ms | 4230 |
| runner_id | runner-3 |
| timestamp | 2026-03-15T08:12:00Z |

With this data, find your worst offenders: any test where (fail_count / total_runs) > 0.02 over the last 14 days. A data-driven QA approach turns this raw data into prioritized action items.
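A minimal sketch of that worst-offender query in TypeScript, assuming run records shaped like the table above (field names and the 2% threshold are illustrative):

```typescript
// One row per test run, as in the data model above.
type RunRecord = { testName: string; status: "pass" | "fail"; timestamp: string };

// Returns tests whose failure rate exceeds the threshold in the window.
function flakyTests(runs: RunRecord[], windowDays = 14, threshold = 0.02): string[] {
  const cutoff = Date.now() - windowDays * 24 * 60 * 60 * 1000;
  const byTest = new Map<string, { fails: number; total: number }>();
  for (const r of runs) {
    if (Date.parse(r.timestamp) < cutoff) continue; // outside the rolling window
    const s = byTest.get(r.testName) ?? { fails: 0, total: 0 };
    s.total += 1;
    if (r.status === "fail") s.fails += 1;
    byTest.set(r.testName, s);
  }
  return [...byTest.entries()]
    .filter(([, s]) => s.total > 0 && s.fails / s.total > threshold)
    .map(([name]) => name);
}
```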

Thresholds: a practical starting point is to investigate anything above 2% over a rolling window. Google reported roughly 1.5% overall flakiness and still considered it a drag. Many teams escalate hard once a test reaches the mid-single digits.

Flaky test metrics to track: flaky rate (% needing retries), MTTR (detection to fix), duration trends (tests getting slower = likely leak), environment correlation (failures tied to specific runners/shards).
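Duration drift, one of the metrics above, can be flagged with a simple rolling comparison (the window size and 1.5x ratio are arbitrary starting points, not recommendations):

```typescript
// Flags a test whose recent runs are markedly slower than the prior
// batch - a common signal of a resource leak.
function durationDrift(durationsMs: number[], window = 10, ratio = 1.5): boolean {
  if (durationsMs.length < window * 2) return false; // not enough history
  const recent = durationsMs.slice(-window);
  const prior = durationsMs.slice(-window * 2, -window);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return mean(recent) > mean(prior) * ratio;
}
```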

Build your own (Atlassian built Flakinator, Reddit piped SQL into JIRA), or use tools like TestDino that track this out of the box. TestDino's flaky detection calculates a stability percentage per test and categorizes root causes automatically. It also sends scheduled PDF reports with pass rates, failure trends, and flaky test lists.

If you only do 1 thing from this section: track per-test pass rates over a rolling 14-day window. Even a spreadsheet works.

How to assign ownership and notify the right people

Detection without notification is a dashboard nobody checks.

1. Define test ownership in code

Tests without owners get ignored. The fix: put ownership where developers already look - in the test file itself.

The simplest approach works for any framework: a CODEOWNERS file or a test-owners.yaml mapping tests/payments/ to team-payments. When a test starts flaking, ownership is already clear.
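A minimal test-owners.yaml along those lines (paths and team names are illustrative; a CI script or bot would consume it when routing alerts):

```yaml
# Maps test directories to owning teams.
tests/payments/: team-payments
tests/checkout/: team-checkout
tests/auth/: team-identity
```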

In Playwright, you can go further with per-test annotations that tag each test with an owner, priority, and notification target:

checkout.spec.ts
test('guest checkout completes with valid card', {
  annotation: [
    { type: 'owner', description: '@sarah' },
    { type: 'ticket', description: 'TA-242' },
    { type: 'docs', description: 'docs.abx.com/guest' },
  ],
}, async ({ page }) => {
  // test body
});

TestDino reads these annotations and routes alerts automatically - when this test fails, @sarah gets a Slack message in #checkout-alerts with the failure context.

SLA example (adapt to your team):

  • Flaky test detected: owner notified within 10 minutes

  • Owner triages within 2 working days

  • Critical-path flakes (p0/p1): fixed or quarantined within 5 working days

Note: Assign a person, not "the team." Without a named owner with a deadline, quarantine becomes permanent.

2. Set up automated alerts

A good alert includes: test name, flakiness rate, link to the failing run, and a suggested cause. The anti-pattern is silent retry - a test quietly passes on attempt 2 and nobody knows it flaked.

One routing rule that saves time: scattered failures across different tests = test flakiness (route to the test owner). A wall of failures across all tests = infrastructure (route to DevOps).
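A sketch of that rule, with an assumed 50% cutoff for "a wall of failures":

```typescript
// Scattered failures -> route to test owners; most of the suite
// failing at once -> suspect infrastructure, route to DevOps.
function routeFailures(failedTests: number, totalTests: number): "owner" | "devops" {
  return failedTests / totalTests > 0.5 ? "devops" : "owner";
}
```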

TestDino's AI failure analysis adds root cause context by classifying each failure as Actual Bug, UI Change, Unstable Test, or Miscellaneous and grouping by cause bucket (Timing Related, Environment Dependent, Network Dependent, Assertion Intermittent). The evidence panel attaches screenshots, video, traces, console logs, and error details per attempt.

3. Surface flakiness in pull requests

The best place to catch flakiness is before merge. Block or warn on PRs that touch historically flaky tests - even a simple CI step that checks recent flake rates and posts a comment gives reviewers context.

TestDino's GitHub integration does this automatically: it attaches test summaries to PRs with links to the dashboard for traces and root cause details. Status Checks enforce merge gates with a configurable pass rate threshold and mandatory tags (e.g., @critical, @payment). 2 modes:

  • Strict: flaky tests count as failures. Use on production branches

  • Neutral (default): flaky tests excluded from pass rate, so known flakes don't block merges while you fix them

The 3 most common mistakes in flaky test management

Before the debugging workflow, here are the 3 patterns that keep teams stuck:

  1. Treating retries as a fix. Retries keep CI green but hide the problem. Teams that rely on retries alone see flakiness grow because nobody investigates the root cause. Use retries to unblock, not to close the ticket.
  2. No owner, no deadline. A flaky test without a named owner and a fix-by date will sit in quarantine forever. Atlassian, Reddit, and Slack all built ownership routing and ticketing into their flaky-test workflows, and each credited it as a key factor in driving resolution rates.
  3. Ignoring CI resource limits. A test that passes on your 16GB laptop but fails on a 4GB CI runner isn't flaky, it's resource-starved. Before blaming the test, check whether the environment can actually support it.

How to fix flaky tests step by step

Step-by-step debugging workflow for flaky tests: capture evidence, reproduce, narrow cause, fix, verify

This is the section most developers want to jump straight to. But triage only works when you know which tests to fix (detection) and someone is accountable (ownership).

1. Capture evidence and reproduce

When a flaky failure happens, you need 2 things: artifacts from the failure, and a way to reproduce it.

Artifacts to capture on every CI failure (configure this once):

  • Traces: Playwright trace viewer, Cypress time-travel, Selenium logging

  • Screenshots and video from the failing run

  • Console/browser logs, network logs / HAR files

  • Infrastructure context: worker, shard, runner, browser, OS, headless/headed

  • Git context: commit SHA, branch, test seed/order
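In Playwright, the "configure this once" part is a few lines of config. A sketch that keeps artifacts only when something goes wrong:

```typescript
// playwright.config.ts - capture artifacts only on failure/retry
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry',       // full trace when a test is retried
    screenshot: 'only-on-failure', // screenshot for every failed test
    video: 'retain-on-failure',    // keep video only when the test fails
  },
});
```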

Reproduce with retries off:

  • --retries=0 (or equivalent)

  • Run the test alone 20-50 times. Always passes? The problem is likely interaction with other tests

  • Run with 1 worker, then increase. Fails only in parallel? Shared state or resource contention

  • Run the exact failing commit, not latest main

Tip: If the test flakes rarely, 20 runs won't catch it. Kinaxis reported tests that failed once in every 300,000 runs. For low-frequency flakes, run 100+ times or rely on historical analysis instead.

2. Narrow down the root cause

Compare local vs CI conditions. Your machine has more CPU, RAM, and faster disk. Our guide to debugging Playwright tests covers this comparison in detail.

  • Only fails in CI? Likely environment. Reproduce with Docker: docker run --cpus=2 --memory=4g

  • Only on 1 shard or runner? Under-provisioned

  • Only under parallel load? Shared state or resource contention

  • Only on 1 browser or headless mode? Browser-specific behavior

  • Only at certain times of day? Timezone logic or shared infrastructure load

Once you have a theory, stress it:

  • Timing? Add network latency with tc (Linux) or Clumsy (Windows)

  • Shared state? Randomize test order, run with maximum parallelism

  • Resources? Constrain: docker run --cpus=1 --memory=2g

  • Non-determinism? TZ=America/Los_Angeles, different locales, different dates

  • External dependency? Inject mock failures: slow responses, 500 errors, timeouts

3. Apply the fix and verify

Match evidence to a root cause:

| If the evidence points to... | The fix is... |
|---|---|
| Bare sleeps or timeouts | Replace with event-driven waits (auto-waiting) |
| Shared state between tests | Isolate each test: own state, own setup, own teardown |
| CI-only failures | Match CI resources locally with Docker |
| External API flakiness | Mock at the network layer; contract tests to keep mocks honest |
| Gradually increasing failures | Resource leak. Set pools to size 1, track duration drift |
| Timezone or random-value sensitivity | Freeze clock, seed random values |

Do not stop at a retry or a longer timeout. Use them only as temporary containment while you fix the actual cause. Then verify: run the test 50+ times locally and in CI before closing.

Illustrative example: timing flake

support.spec.ts
// BEFORE: fixed wait guesses at backend timing
test('submits support ticket', async ({ page }) => {
  await page.goto('/support');
  await page.fill('#message', 'Help!');
  await page.getByRole('button', { name: 'Submit' }).click();
  await page.waitForTimeout(2000);
  await expect(page.locator('.success')).toBeVisible(); // fails when response > 2s
});

// AFTER: wait for the actual response, not a guess
test('submits support ticket and sees confirmation', async ({ page }) => {
  await page.goto('/support');
  await page.getByLabel('Message').fill('Help!');
  await page.getByRole('button', { name: 'Submit' }).click();
  await expect(page.getByText('Ticket submitted')).toBeVisible({ timeout: 10000 });
});

A representative investigation looks like this: repeated CI runs fail at the .success assertion on slower runners. The waitForTimeout(2000) guesses at backend timing, so it breaks under resource contention. Replace the guess with an assertion on the real success state.

Google's research team built on this approach at scale: their root cause localization tool (IEEE ICSME 2020) automatically pinpoints the code responsible for flaky behavior. Apple took a different angle with test repetition modes in Xcode (WWDC 2021), letting you run tests in retry-until-failure or retry-until-success loops to surface flakes before shipping.


4. Quarantine while you fix

CI pipeline diagram showing main test suite blocking merges while quarantined flaky tests run separately without blocking


Tag the flaky test, move it to a separate non-blocking CI job, fix it within a deadline.

Teams usually pick one of three strategies. Here's how they compare:

| Strategy | What happens | Keeps coverage? | Fixes root cause? | Best for |
|---|---|---|---|---|
| Retry | Re-run on failure, pass on 2nd attempt | Yes | No (masks the problem) | Unblocking CI while you investigate |
| Quarantine | Move to non-blocking job, track with deadline | Yes | Eventually (if deadline enforced) | Tests you plan to fix within 1-2 sprints |
| Delete | Remove the test entirely | No | N/A | Tests that are unfixable or no longer relevant |

inventory.spec.ts
test('inventory sync after bulk upload', {
  tag: ['@flaky'],
  annotation: [
    { type: 'testdino:flaky-reason', description: 'Redis cache race condition on parallel writes' },
    { type: 'testdino:owner', description: '@mike' },
  ],
}, async ({ page }) => {
  // test body
});

.github/workflows/playwright.yml
# Main suite - blocks merge
- name: Run tests
  run: npx playwright test --grep-invert @flaky

# Quarantine suite - runs but doesn't block
- name: Run quarantined tests
  run: npx playwright test --grep @flaky
  continue-on-error: true

Quarantine policy (adapt to your team):

  • Cap: no more than 5% of your suite in quarantine

  • Time limit: 1-2 sprints

  • Not fixed by the deadline? Fix it or delete it. Quarantine without a deadline is just deletion with extra steps

If you only do 1 thing from this section: configure CI to save artifacts (traces, screenshots, logs) on every failure. That alone usually shortens debugging time substantially.

How to prevent flaky tests from reaching CI

New flaky tests appear at roughly the same rate you fix old ones. Prevention is the only way to bend that curve.

1. Code review checklist

Add this to your PR template. Reviewers check each item on any PR that adds or modifies tests:

  • Shared state? Isolate it

  • Bare sleep() or fixed waits? Use event-driven waits

  • Teardown guaranteed on failure? Use try/finally or framework hooks

  • Test order dependency? Randomize and verify

  • Unmocked external services? Mock non-critical deps

  • Non-deterministic assertions? (Random data, timestamps, UUIDs)

  • Works with reduced CPU/RAM? Think CI constraints

  • Could this test be smaller? Push it down the testing pyramid: unit over integration over E2E

For a more complete version, see our Playwright automation checklist.

If your team uses AI to generate tests, the same checklist applies. AI-written Playwright tests can introduce flaky patterns (bare waits, shared state) just as easily as hand-written ones, so review them with the same rigor.

Team agreement (paste into your wiki): "We don't merge tests that use bare sleeps, global mutable state, or unmocked external services. Flaky tests are either fixed within 2 sprints or quarantined with a named owner and removal date."

2. Write tests for diagnosability

Write every test as if someone will debug it from a single failed CI run, without rerunning it. Using structured Playwright test scripts with the Page Object Model helps keep tests consistent and easier to triage.

  • Precise name: not checkout works, but guest user can pay by card and sees order confirmation

  • Labeled steps: Playwright test.step(), Cypress custom commands. When a test with 8 steps fails at step 5, you skip 4 steps of investigation

  • Logged context: user ID, order ID, seed value. If data is random, log the seed

  • Precise assertions: assert the specific element or status code, not just "page loaded." Playwright assertions, Cypress assertions, Selenium locators

Example: a diagnosable test vs a black-box test

checkout.spec.ts
// Hard to debug - no steps, vague name, no context on failure
test('checkout works', async ({ page }) => {
  await page.goto('/products');
  await page.click('.add-to-cart');
  await page.click('.checkout');
  await page.fill('#card', '4242424242424242');
  await page.click('#pay');
  await expect(page.locator('.confirmation')).toBeVisible();
});

// Easy to debug - labeled steps, precise name, logged context
test('guest user can pay by card and sees order confirmation', async ({ page }) => {
  await test.step('add product to cart', async () => {
    await page.goto('/products');
    await page.getByRole('button', { name: 'Add to cart' }).click();
  });

  await test.step('complete checkout with card payment', async () => {
    await page.getByRole('link', { name: 'Checkout' }).click();
    await page.getByLabel('Card number').fill('4242424242424242');
    console.log(`Checkout at ${new Date().toISOString()}, URL: ${page.url()}`);
    await page.getByRole('button', { name: 'Pay' }).click();
  });

  await test.step('verify order confirmation', async () => {
    await expect(page.getByText('Order confirmed')).toBeVisible();
  });
});

When the second test fails at "complete checkout with card payment," you know exactly where to look. The first just says "locator '.confirmation' not found" with no indication which step broke.

2 rules that get broken constantly:

  • Don't let cleanup destroy evidence. If a test fails, keep artifacts and state. Clean up on the next run's setup, not the failed run's teardown

  • Don't use silent retries as your only safety net. A retry that passes hides the first failure, which has the useful information

3. CI configuration that reduces flakiness

  • Allocate enough CPU and RAM per runner. Under-provisioned runners are a common source of intermittent failures

  • Shard tests across multiple CI jobs instead of cramming workers into 1 runner

  • Use retry-on-fail with quarantine tracking, not silent retry

  • Add a smoke test that checks database, permissions, and connectivity before the full suite. Fail fast with 5 tests, not 2,000 timeouts
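The smoke-check idea can be sketched framework-agnostically. The probe function is injected so the logic stays testable; the endpoint URLs in the usage note below are placeholders:

```typescript
// A probe hits one dependency and reports HTTP-style status.
type Probe = (url: string) => Promise<{ ok: boolean; status: number }>;

// Probes each critical dependency before the full suite runs and
// returns a list of failure descriptions (empty = safe to proceed).
async function smokeCheck(urls: string[], probe: Probe): Promise<string[]> {
  const failures: string[] = [];
  for (const url of urls) {
    try {
      const res = await probe(url);
      if (!res.ok) failures.push(`${url} -> HTTP ${res.status}`);
    } catch (err) {
      failures.push(`${url} -> ${(err as Error).message}`);
    }
  }
  return failures;
}
```

In CI you would call this with a real fetch-based probe against endpoints like a database health route and an auth route, and fail the pipeline immediately if it returns anything.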

If you only do 1 thing from this section: add the code review checklist to your PR template. It costs nothing and catches the most common patterns.

Framework-specific fixes for flaky tests

Flaky tests in Playwright

  1. Enable traces on first retry. Set trace: 'on-first-retry' in your Playwright config. When a test flakes, you get a full trace without the storage cost of tracing every run. See our trace viewer guide for setup

  2. Prefer semantic locators. page.getByRole() and page.getByLabel() are usually more resilient than brittle CSS chains because they target user-facing structure. Playwright still runs actionability checks before clicking either way. More in our locators guide

  3. Lean on built-in isolation. Playwright creates a fresh BrowserContext per test, so cookies and storage do not leak between tests unless you explicitly opt into reuse

For more, see Playwright flaky tests: how to find and fix them.

Flaky tests in Cypress

  1. Use cy.intercept() for network control. Mock unstable APIs at the network layer instead of hitting real services. This removes a common source of external-dependency flakes

  2. Enable test isolation. Cypress 12+ has test isolation on by default, clearing cookies and storage between tests. Make sure it's not disabled in your config

  3. Avoid cy.wait(ms). Use cy.intercept() with cy.wait('@alias') to wait for specific network responses instead of guessing with fixed timeouts

Flaky tests in Selenium

  1. Use explicit waits, not implicit. WebDriverWait with ExpectedConditions targets the specific element you need. Implicit waits apply globally and mask timing problems

  2. Isolate browser state. Create a fresh WebDriver session per test or clear cookies/storage in setup. Selenium doesn't do this automatically like Playwright does

  3. Make browser mode explicit in CI. Use a consistent headless setup or a consistent virtual display setup so you are not debugging environment drift between local and CI

Flaky tests in pytest

  1. Use pytest-randomly to catch order dependencies. Shuffles test order on every run. If tests pass in sequence but fail when randomized, you have shared state. Pin a seed to reproduce: pytest --randomly-seed=12345

  2. Reset database state between tests. Use pytest-django fixtures/marks or your own rollback/reset fixtures so each test starts clean. Shared database state is a common source of backend flakiness

  3. Freeze time with freezegun. freezegun or time-machine eliminates date/timezone flakes. Faster than mocking datetime manually and works across imports

Flaky tests in JUnit

  1. Use @RepeatedTest to surface flakes early. JUnit 5's @RepeatedTest(50) runs a test N times in a single invocation. Add this to suspect tests before promoting to CI

  2. Isolate with @TestInstance(Lifecycle.PER_METHOD). This is the JUnit 5 default, but some codebases override it to PER_CLASS for performance. PER_CLASS shares state across tests in the same class and can create ordering flakes

  3. Use Testcontainers for disposable infrastructure. Testcontainers gives you fresh Dockerized dependencies for the test scope you choose, which reduces shared-infrastructure flakes in integration suites

Flaky test management tools

You can manage flaky tests with custom scripts, open-source tooling, or a dedicated platform. Here's how the options compare:

| Approach | Examples | Strengths | Limitations |
|---|---|---|---|
| Custom scripts + SQL | Bash loops, BigQuery, Grafana dashboards | Full control, no vendor lock-in | You build and maintain everything: ingestion, UI, alerts, reports |
| CI-native features | GitHub Actions retry, GitLab CI retry:, Jenkins Flaky Test Handler | Zero setup, works inside your existing pipeline | Mostly retry-oriented. Historical tracking, ownership routing, and root cause analysis usually require extra tooling |
| Open-source frameworks | Allure, ReportPortal | Rich reporting, community plugins | Self-hosted infrastructure. Flaky detection is manual or limited |
| Dedicated platforms | TestDino, Datadog CI Visibility, BrowserStack Test Observability, Trunk Flaky Tests | Automated detection, AI classification, ownership routing, merge gates | SaaS cost. Varying levels of framework support |

The right choice depends on your team size and how much infrastructure you want to own. Small teams can start with CI-native retries and a spreadsheet. Once you're tracking 500+ tests across multiple pipelines, a dedicated platform saves more engineering time than it costs.

Enterprise case studies

| Team | What they did | Result |
|---|---|---|
| Slack | Auto-detection + suppression at scale | 57% to under 4% test-job failure rate |
| GitHub | 3-scenario retry + impact scoring | 18x reduction in flaky failures |
| Atlassian | Flakinator: detection + auto-ownership | Recovered 22,000+ builds |
| Reddit | FTQS: quarantine + ownership workflow | Separated flaky-test noise from blocking CI |
| Spotify | Simple visibility dashboard | 33% reduction in 2 months |

More: Meta, Airtable, Uber, Kinaxis have all published similar playbooks. See also how OpenObserve reduced flaky tests by 90% after centralizing test reporting.


How flaky tests affect developer experience

The cost of flaky tests goes beyond CI minutes. Test execution flakiness erodes the feedback loop that makes CI/CD valuable, and becomes a CI/CD pipeline bottleneck that slows down the entire engineering organization.

  • Trust decay: When tests cry wolf, developers stop trusting CI. They merge with failing checks, skip test runs locally, and treat red builds as noise. Once that habit sets in, real bugs slip through

  • Context switching: A flaky failure pulls a developer out of their current task to investigate something that turns out to be nothing. Research on interrupted work suggests it takes around 23 minutes to recover from that kind of context switch

  • Slow merges: Teams with high flakiness add manual "re-run CI" steps to their merge process. This adds 15-30 minutes of idle waiting per PR, multiplied across every developer, every day

  • Onboarding friction: New team members can't tell which failures are real and which are "just that flaky test." They either waste time investigating known flakes or learn to ignore CI too early

Spotify found that simply making flakiness visible on a dashboard reduced it by 33% in 2 months (from 6% to 4%). They did not attribute the improvement to any single technical change, which suggests that visibility itself was a major factor. Setting up proper Playwright reporting makes that visibility automatic.

Conclusion

Flaky tests are expensive and pervasive. Google reported that 84% of pass-to-fail transitions were flaky, not real regressions.

  • The 4 pillars (Detect, Notify, Triage, Prevent) only work as an ongoing system, not a one-time cleanup.

  • Visibility alone drives reduction. Spotify got 33% from a dashboard.

  • Assign owners with deadlines. Without a named person, nothing gets fixed.

  • Prevention compounds. Fixing is linear, preventing is exponential.

Your next step: open your CI dashboard, find the test with the highest failure count, and run it through the debugging workflow. Document the process as you go - that document becomes your team's flaky test runbook.

FAQ

What is a flaky test?

A flaky test is a test that passes and fails on the same code without any changes. Run it twice on the same commit and get different results. A common cause is timing: the test assumes something will be ready in X milliseconds, and sometimes it is not. Google found 84% of pass-to-fail transitions were flaky, not real bugs.

Can I just use retries to fix flaky tests?
No. Retries keep CI moving, but they mask the problem if you stop there. Use retries to unblock while you fix. Track every retried test so the underlying cause still gets addressed.
What's the difference between a flaky test and an intermittent bug?

A flaky test produces different results on the same code. An intermittent bug is a real defect that only shows under certain conditions. Flaky tests need test-level fixes (waits, isolation, mocking); intermittent bugs need product-level fixes. TestDino's AI classification sorts failures into Actual Bug vs Unstable Test to help tell them apart.

How many flaky tests is too many?
A practical rule is to investigate anything above about 2% over a rolling window. Once a test reaches the mid-single digits, it usually deserves active triage or quarantine. Flakiness is not evenly distributed, so focus on the worst offenders first.
What causes most flaky tests?

The top 3 causes are: 1) async wait/timing issues, 2) concurrency and shared state, and 3) test order dependency. This comes from a University of Illinois analysis of 201 flaky test fixes across 51 open-source projects. An ICSE 2021 study found that async-wait issues accounted for roughly 45% of UI-specific flaky tests. The fix for timing is almost always the same: replace fixed waits with event-driven waits. For a full overview of the research, see Parry et al.'s survey of flaky tests (ACM TOSEM). See common causes for framework-specific links.

Should I delete or fix flaky tests?
Quarantine first, then fix. Deleting loses coverage. But quarantine with no deadline is just deletion with extra steps. Set a cap (no more than 5% of your suite) and a time limit. Not fixed by the deadline? Then yes, delete it - a test you never fix and never trust is worse than no test.
How do you detect flaky tests in CI/CD?
2 approaches. First, rerun-based: run the same test 20-50 times on the same commit. If it both passes and fails, it is flaky. Playwright has --repeat-each for this. Second, historical: track per-test pass rates over a 14-day rolling window and start investigating anything above roughly 2% failure rate. Tools like TestDino automate the historical approach.
Pratik Patel

Founder & CEO

Pratik Patel is the founder of TestDino, a Playwright-focused observability and CI optimization platform that helps engineering and QA teams gain clear visibility into automated test results, flaky failures, and CI pipeline health. With 12+ years of QA automation experience, he has worked closely with startups and enterprise organizations to build and scale high-performing QA teams, including companies such as Scotts Miracle-Gro, Avenue One, and Huma.

Pratik is an active contributor to the open-source community and a member of the Test Tribe community. He previously authored Make the Move to Automation with Appium and has supported many QA engineers with practical tools, consulting, and educational resources. He regularly writes about modern testing practices, Playwright, and developer productivity.
