Playwright flaky tests: detection, causes, and fixes

Fix flaky tests and build a stable automation suite using Playwright best practices, root cause analysis, and AI insights to restore CI trust.


TestDino

Dec 18, 2025


Flaky tests break trust and destroy test stability. One minute your CI fails, the next it passes, and your team loses hours chasing issues that aren’t real bugs.

Retries only hide the instability. Real test stability comes from identifying the root causes (timing issues, race conditions, environment noise), not masking them.

This guide shows how to fix instability at the source with Playwright best practices and AI-powered insights, so your team moves from firefighting to consistent, reliable automation.

What makes tests flaky, and why teams ignore CI

A flaky test is one that passes, then fails, without any code change. It doesn't just cause annoyance; it quietly destroys trust in your entire automation suite.

The real cost of test instability

Flakiness creates real business drag:

  • Wasted engineering hours: Teams lose significant time each week debugging false failures instead of building features.
  • Eroded developer trust: Red builds get dismissed as “probably just flaky,” letting real bugs slip through. A GitLab survey found 36% of developers face delayed releases due to test failures at least once a month.
  • Slower feedback loops: Retries inflate CI time and turn quick checks into slow bottlenecks.

The critical insight

Research shows a large share of failures come from environmental and infrastructure noise, not product bugs. In other words, much of the debugging effort goes into chasing issues that never existed in the product.

The 4 root causes of flakiness

Let's break down what's actually happening when tests flake:

1. Race conditions

Your test runs faster than your application code.

Common scenario: The test tries to click a button before the page finishes loading. In Playwright, this looks like:

js
// ❌ This will flake
await page.goto('https://app.example.com');
await page.click('button'); // Clicks before button is interactive

Why it flakes: Asynchronous rendering means the DOM element exists but isn't ready for interaction yet.
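
A minimal sketch of the stable version, assuming the same placeholder URL and a button labeled "Submit": use a locator and a web-first assertion so Playwright waits until the element is actually interactive before clicking.

ts
import { test, expect } from '@playwright/test';

test('clicks only when the button is ready', async ({ page }) => {
  await page.goto('https://app.example.com'); // placeholder URL

  // Locators auto-wait: the click proceeds only once the button is
  // attached, visible, stable, and enabled.
  const submit = page.getByRole('button', { name: 'Submit' });
  await expect(submit).toBeEnabled();
  await submit.click();
});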

2. Uncontrolled state

Your tests depend on data or sessions from previous tests.

The trap: Test A creates a user account. Test B uses that account. When CI runs tests in parallel or shuffles the order, Test B fails because the account doesn't exist.

Warning sign: Tests pass locally but fail in CI, or fail when run in isolation.
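
One way to break that dependency is to make each test provision its own data. The sketch below assumes a hypothetical /api/users setup endpoint and a configured baseURL; the point is that this test no longer relies on anything another test created.

ts
import { test, expect } from '@playwright/test';

test('dashboard greets the signed-in user', async ({ page, request }) => {
  // Provision a dedicated account for this test only.
  // The endpoint and payload are hypothetical placeholders for your setup API.
  const email = `user-${Date.now()}@example.com`;
  await request.post('/api/users', { data: { email, password: 'test-pass' } });

  await page.goto('/login'); // relative paths assume baseURL is configured
  await page.getByLabel('Email').fill(email);
  await page.getByLabel('Password').fill('test-pass');
  await page.getByRole('button', { name: 'Sign in' }).click();

  await expect(page.getByText(email)).toBeVisible();
});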

3. Fixed timeouts

You're waiting for arbitrary amounts of time instead of actual conditions.

// ❌ Unstable: Fast machines pass, slow ones fail
await page.waitForTimeout(2000);

The problem: 2 seconds might work on your M3 Mac but fail on a containerized CI runner.

4. External dependencies

Your tests rely on third-party services that respond unpredictably.

Examples: Payment gateways, email verification APIs, or rate-limited external APIs. When these services slow down or temporarily fail, your tests fail even though your code is fine.
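
A common mitigation is to stub the third-party call at the network layer so the test no longer waits on the external service. This sketch uses page.route() with a made-up payment endpoint and response body:

ts
import { test, expect } from '@playwright/test';

test('checkout succeeds with a stubbed payment gateway', async ({ page }) => {
  // Intercept calls to the external gateway and return a canned response.
  // The URL pattern and JSON body are assumptions for illustration.
  await page.route('**/payments.example.com/**', route =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ status: 'approved', id: 'test-charge-123' }),
    })
  );

  await page.goto('https://app.example.com/checkout'); // placeholder URL
  await page.getByRole('button', { name: 'Pay now' }).click();
  await expect(page.getByText('Payment approved')).toBeVisible();
});

For flows where the third party must be exercised for real, keep those tests in a separate, clearly labeled suite so their instability doesn't block everything else.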

How to identify flaky tests

You can't fix what you don't measure. Here's how to separate real bugs from flaky noise.

The 4 metrics that actually matter

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| Flaky Rate | % of tests that fail initially but pass on retry | Your #1 stability indicator. Keep it near 0%. |
| Pass Rate Trend | Daily success rate of your entire suite | Spot stability degradation immediately after deploys. |
| Error Variants | Number of unique error messages per failure | High variance = deep instability; the test fails in different ways. |
| EWMA | Exponentially weighted moving average of test duration | Catches tests that are slowly getting slower, a leading indicator of future flakiness. |
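
As a rough illustration of how the first and last metrics can be computed from raw run history (the record shape below is invented, not any particular tool's schema):

ts
// Hypothetical shape of one historical run of a single test.
type RunRecord = { passedFirstTry: boolean; passedAfterRetry: boolean; durationMs: number };

// Flaky rate: share of runs that failed initially but passed on retry.
function flakyRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0;
  const flaky = runs.filter(r => !r.passedFirstTry && r.passedAfterRetry).length;
  return flaky / runs.length;
}

// EWMA of duration: recent runs weigh more, so a slow upward drift
// (a leading indicator of future flakiness) shows up early.
function ewmaDuration(runs: RunRecord[], alpha = 0.3): number {
  return runs.reduce(
    (avg, r, i) => (i === 0 ? r.durationMs : alpha * r.durationMs + (1 - alpha) * avg),
    0
  );
}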

The old way vs. the smart way

In the traditional debugging workflow, a test fails in CI, and the developer checks the logs, only to find a generic or unclear error.

When the issue cannot be reproduced locally, debugging becomes guesswork. The developer reruns the test, adds a random sleep() to stabilize timing, and eventually ships the change while hoping it fixes the problem.

With an AI-powered workflow, the same failure is handled more intelligently. When a test fails in CI, an analytics tool groups the failure by its root cause and provides a confidence score.

The developer sees an insight like "Timing issue (94% confidence): element not interactive," applies a targeted fix with an explicit wait, and verifies it against historical trend data to confirm the problem is actually fixed.

Spot flaky tests fast

Pinpoint root causes and fix them before CI breaks.

Get Started

Using AI to categorize failures

Modern test reporting tools use pattern recognition to group failures into actionable categories:

  • Actual Bug: Consistent failure across environments → fix product code.
  • UI Change: Selector changed after a DOM update → update locators.
  • Unstable Test: Intermittent failure → apply timing fixes or quarantine.
  • Miscellaneous: Setup issues, data problems, or CI configuration errors.

The key advantage: You stop chasing symptoms and start fixing root causes!
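
The grouping itself can be approximated with simple pattern matching on error messages. The rules below are a simplified, rule-based sketch for illustration, not how any particular AI platform works:

ts
type Category = 'Actual Bug' | 'UI Change' | 'Unstable Test' | 'Miscellaneous';

// Illustrative heuristics only; real platforms combine many more signals
// (history, environment, git context) before assigning a confidence score.
function categorizeFailure(errorMessage: string): Category {
  if (/strict mode violation|selector resolved to|element.*not found/i.test(errorMessage)) {
    return 'UI Change';
  }
  if (/timeout.*exceeded|not (visible|stable|enabled)|navigation interrupted/i.test(errorMessage)) {
    return 'Unstable Test';
  }
  if (/expect.*(received|toBe|toEqual)/i.test(errorMessage)) {
    return 'Actual Bug';
  }
  return 'Miscellaneous';
}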

Fixing flaky tests: Playwright best practices

Now let's get tactical. These patterns eliminate the most common sources of flakiness.

Fix #1: Replace every fixed timeout

The Problem:

// ❌ AVOID: Unstable fixed wait
await page.waitForTimeout(2000);
await page.click('button#submit');

The Solution:

// ✅ USE: Explicitly wait for the element to be ready
await expect(page.getByRole('button', { name: 'Submit' })).toBeVisible();
await page.getByRole('button', { name: 'Submit' }).click();

Why this works: Playwright's auto-waiting ensures the element is visible and interactive before clicking. No guessing, no arbitrary delays.

Fix #2: Isolate test state completely

Every test should be atomic; it creates its own data and cleans up after itself.

Best practice setup:
import { test, expect } from '@playwright/test';

// Enable traces and screenshots for debugging
test.use({ trace: 'on-first-retry', screenshot: 'only-on-failure' });

test.beforeEach(async ({ page }) => {
  // Each test gets a clean slate
  await page.goto(process.env.BASE_URL);
});

test('login shows dashboard', async ({ page }) => {
  // Use unique, dedicated test data
  const testUser = `test-${Date.now()}@example.com`;
  const testPass = process.env.USER_PASS;

  await page.getByLabel('Email').fill(testUser);
  await page.getByLabel('Password').fill(testPass);
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Wait for the final state before asserting
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});

Key principle: No test should ever depend on another test's side effects.

Fix #3: Handle async operations properly

For complex interactions that involve network requests or animations:

example.spec.js
// ✅ Wait for network to settle
await page.goto('https://app.example.com');
await page.waitForLoadState('networkidle');

// ✅ Wait for specific API response
await page.waitForResponse(
  response => response.url().includes('/api/data') && response.status() === 200
);

// ✅ Wait for element to be stable (no more animations)
await expect(page.locator('.modal')).toBeVisible();
await page.locator('.modal').waitFor({ state: 'visible' });

Retry configuration: when and how?

Retries are a tool, not a cure. Use them strategically:

playwright.config.ts
export default {
  // Only retry in CI, not locally
  retries: process.env.CI ? 2 : 0,

  // But give each test attempt enough time
  timeout: 30000,
};

Golden rule: If a test needs retries to pass consistently, it's still broken. Retries just hide the instability.
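
If a small group of tests legitimately depends on slow external systems, retries can be scoped to just that group instead of the whole suite. A sketch, with placeholder test content:

ts
import { test, expect } from '@playwright/test';

// Only this block gets extra retries; the rest of the suite keeps the
// global (CI-only) retry setting from playwright.config.ts.
test.describe('network-dependent checks', () => {
  test.describe.configure({ retries: 2 });

  test('status page reports healthy', async ({ page }) => {
    await page.goto('https://status.example.com'); // placeholder URL
    await expect(page.getByText('All systems operational')).toBeVisible();
  });
});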

Building test stability into your CI/CD pipeline

As your test suite grows, stability becomes an organizational challenge. Here's how to scale without losing control.

Integrating stability checks in CI

Your CI pipeline should automatically surface flakiness, without manual investigation.

Minimal GitHub Actions setup:
name: e2e
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=list,html,junit
      # Upload artifacts for analysis
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: |
            playwright-report/
            test-results/
            test-results/*.xml

What happens next: Analytics platforms ingest these artifacts and link failures to specific commits, PRs, and branches. No more "which commit broke this?"

Role-based dashboards

Different roles need different views:

QA engineers need:

  • Overall pass/fail breakdown
  • Failure category distribution (Bug vs. Flaky)
  • New failures since last deploy

Developers need:

  • Active blockers on their branches
  • Flaky tests in their scope
  • Direct links to failed runs

Engineering managers need:

  • High-level metrics (pass rate, flaky rate)
  • Trend lines showing that stability is improving
  • ROI data on automation investment

Role-based dashboards

Instant insights into test stability for your role

Try TestDino

Quarantine strategy for unstable tests

When you identify tests with >5% flaky rate over 30 days:

1. Don't delete them (they might catch real bugs)

2. Don't leave them in the main suite (they erode trust)

3. Quarantine and monitor:

ts
// Mark as quarantined
test.skip('payment flow completes', async ({ page }) => {
  // TODO: Fix timing issue with external payment gateway
});

Run quarantined tests in a separate, non-blocking CI job. Fix them when you have dedicated time.
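
One way to wire this up, assuming quarantined tests carry a @quarantine tag in their titles: split the suite into a blocking "main" project and a non-blocking "quarantine" project, then run them as separate CI jobs (with the quarantine job marked continue-on-error or equivalent).

playwright.config.ts
import { defineConfig } from '@playwright/test';

// Sketch: run `npx playwright test --project=main` in the blocking job
// and `npx playwright test --project=quarantine` in the non-blocking one.
export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  projects: [
    { name: 'main', grepInvert: /@quarantine/ },
    { name: 'quarantine', grep: /@quarantine/ },
  ],
});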

Debugging flaky tests: Context is everything

The difference between a 2-hour debugging session and a 10-minute fix comes down to having the right evidence at hand.

What you need to debug effectively

When a test fails, you should have instant access to:

1. Step-by-step execution history

  • Which step failed (line 47: await page.click('#submit'))
  • How long each step took (helps identify timeouts)

2. Visual proof

  • Screenshots at the moment of failure
  • Full video recording of the entire test run
  • DOM snapshot showing element state

3. Console logs

  • Browser console errors
  • Network request failures
  • JavaScript exceptions

4. Environment context

  • Which branch, commit, and PR
  • Which browser/viewport configuration
  • What test data was used

Modern test reporting tools bundle all of this into a single view. You click a failed test, see the video, spot the issue, and fix it with no local reproduction needed.
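
Most of that evidence can be captured by Playwright itself. A minimal config sketch that retains artifacts only when something goes wrong:

playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Full trace (DOM snapshots, console, network) on the first retry,
    // plus video and a screenshot whenever a test fails.
    trace: 'on-first-retry',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});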

Preventing flaky tests

Prevention beats debugging every time. Build these practices into your development workflow:

1. Write tests with stability in mind

DO:

  • ✅ Use explicit waits (toBeVisible(), waitForResponse())
  • ✅ Create fresh test data for each test
  • ✅ Use Playwright's auto-retry assertions
  • ✅ Test against stable states, not intermediate animations

DON'T:

  • ❌ Use waitForTimeout() for anything
  • ❌ Chain tests together with shared state
  • ❌ Hard-code selectors without fallback strategies
  • ❌ Skip writing teardown logic

2. Code review checklist

Before approving any new test, verify:

[ ] No fixed timeouts

[ ] Test can run in isolation

[ ] Uses role-based selectors (more stable than CSS)

[ ] Includes retry configuration only for network-dependent tests

[ ] Has proper cleanup in afterEach (see the sketch below)
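
A minimal cleanup sketch, assuming a hypothetical DELETE endpoint for test data and a configured baseURL:

ts
import { test } from '@playwright/test';

let createdUserId: string | undefined;

test.afterEach(async ({ request }) => {
  // Remove whatever this test created so the next run starts clean.
  // The endpoint is a hypothetical placeholder for your own cleanup API.
  if (createdUserId) {
    await request.delete(`/api/users/${createdUserId}`);
    createdUserId = undefined;
  }
});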

3. Monitor suite health over time

Track your stability metrics weekly:

| Week | Pass Rate | Flaky Rate | Avg Test Duration |
| --- | --- | --- | --- |
| Week 1 | 87% | 15% | 4m 23s |
| Week 2 | 91% | 11% | 3m 58s |
| Week 3 | 94% | 7% | 3m 45s |
| Week 4 | 96% | 3% | 3m 30s |

Goal: Pass rate ↑, Flaky rate ↓, Duration ↓

Case study: Saving hours per week

Microsoft tackled flaky tests with a company-wide strategy:

Problem: Developers spent hours on false failures, and CI/CD pipelines were unreliable.

Solution: They implemented DeFlaker with automated detection, a “fix-or-remove” policy, continuous monitoring, and developer training.

Results: Flakiness dropped 18% in 6 months, productivity improved, and engineering time was redirected to feature development.

Takeaway: Combining automation, clear policies, and team commitment drives real test stability.

Tools and framework comparison

Framework-specific stability features

Playwright:

  • ✅ Built-in auto-waiting
  • ✅ Trace viewer with timeline
  • ✅ Network interception
  • ⚠️ Requires explicit retry configuration

Cypress:

  • ✅ Automatic retries for assertions
  • ✅ Time-travel debugging
  • ⚠️ Limited multi-tab support

Selenium:

  • ⚠️ Manual wait management
  • ⚠️ No built-in video recording
  • ✅ Mature ecosystem

What makes a test reporting tool actually useful

| Feature | Basic JUnit Viewer | Modern AI Platform |
| --- | --- | --- |
| Failure categorization | Manual | Automatic with confidence scores |
| Root cause detection | None | AI-powered (race condition, timing, etc.) |
| Git awareness | Run ID only | Links to specific PRs, branches, commits |
| Evidence collection | Logs only | Videos, screenshots, traces, console logs |
| Time to fix | Hours | Minutes |

Why this matters: A basic JUnit viewer tells you what failed. An AI platform tells you why it failed and how to fix it.

Real-world success stories

Spotify's Approach: Spotify reduced test flakiness from 6% to 4% in just 2 months by making visibility tools available to all developers. Their key insight? Giving developers immediate feedback about flaky tests dramatically accelerated fixes.

Meta's Innovation: Meta developed a probabilistic flakiness scoring system that predicts which tests are likely to become flaky before they cause widespread problems. This proactive approach prevents flakiness from spreading through the codebase.

Conclusion

Flaky tests aren’t just technical annoyances; they erode trust in CI, slow releases, and waste developer time on false failures. This affects productivity and team morale.

Test stability is achievable by tracking flaky rates, using AI-powered root cause analysis, replacing arbitrary waits with explicit actions, and isolating test state with proper async handling.

Focusing on real issues rather than retries restores confidence in automation. Teams ship faster, catch genuine bugs sooner, and save significant debugging time. Start improving test stability with TestDino today and see immediate results.

Boost test stability instantly

Track, analyze, and fix flaky tests with AI-powered insights

Check TestDino

FAQs

Should I just add retries to fix flaky tests?

No. Retries hide the symptom, not the cause. Your test is still unstable; it just passes on the second or third try. This adds 2-3x to your CI time. Use retries sparingly for legitimate network failures, but always fix the underlying issue.
