Playwright flaky tests: detection, causes, and fixes

Fix flaky tests and build a stable automation suite using Playwright best practices, root cause analysis, and AI insights to restore CI trust.


TestDino

Dec 18, 2025


Flaky tests break trust and destroy test stability. One minute your CI fails, the next it passes, and your team loses hours chasing issues that aren’t real bugs.

Retries only hide the instability. Real test stability comes from identifying the root causes (timing issues, race conditions, environment noise), not masking them.

This guide shows how to fix instability at the source with Playwright best practices and AI-powered insights, so your team moves from firefighting to consistent, reliable automation.

What makes tests flaky, and why teams ignore CI

A flaky test is one that passes, then fails, without any code change. It doesn't just cause annoyance; it quietly destroys trust in your entire automation suite.

The real cost of test instability

Flakiness creates real business drag:

  • Wasted engineering hours: Teams lose significant time each week debugging false failures instead of building features.
  • Eroded developer trust: Red builds get dismissed as “probably just flaky,” letting real bugs slip through. A GitLab survey found 36% of developers face delayed releases due to test failures at least once a month.
  • Slower feedback loops: Retries inflate CI time and turn quick checks into slow bottlenecks.

The critical insight

Research shows a large share of failures come from environmental and infrastructure noise, not product bugs. In other words, much of the debugging effort goes into chasing issues that never existed in the product.

The 4 root causes of flakiness

Let's break down what's actually happening when tests flake:

1. Race conditions

Your test runs faster than your application code.

Common scenario: The test tries to click a button before the page finishes loading. In Playwright, this looks like:

js
// ❌ This will flake
await page.goto('https://app.example.com');
await page.click('button'); // Clicks before button is interactive

Why it flakes: Asynchronous rendering means the DOM element exists but isn't ready for interaction yet.
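
A minimal sketch of the stable version, assuming the same placeholder URL and a button labeled "Submit": use a locator and a web-first assertion so Playwright waits until the element is actually interactive before clicking.

ts
import { test, expect } from '@playwright/test';

test('clicks only when the button is ready', async ({ page }) => {
  await page.goto('https://app.example.com'); // placeholder URL

  // Locators auto-wait: the click proceeds only once the button is
  // attached, visible, stable, and enabled.
  const submit = page.getByRole('button', { name: 'Submit' });
  await expect(submit).toBeEnabled();
  await submit.click();
});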

2. Uncontrolled state

Your tests depend on data or sessions from previous tests.

The trap: Test A creates a user account. Test B uses that account. When CI runs tests in parallel or shuffles the order, Test B fails because the account doesn't exist.

Warning sign: Tests pass locally but fail in CI, or fail when run in isolation.
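
One way to break that dependency is to make each test provision its own data. The sketch below assumes a hypothetical /api/users setup endpoint and a configured baseURL; the point is that this test no longer relies on anything another test created.

ts
import { test, expect } from '@playwright/test';

test('dashboard greets the signed-in user', async ({ page, request }) => {
  // Provision a dedicated account for this test only.
  // The endpoint and payload are hypothetical placeholders for your setup API.
  const email = `user-${Date.now()}@example.com`;
  await request.post('/api/users', { data: { email, password: 'test-pass' } });

  await page.goto('/login'); // relative paths assume baseURL is configured
  await page.getByLabel('Email').fill(email);
  await page.getByLabel('Password').fill('test-pass');
  await page.getByRole('button', { name: 'Sign in' }).click();

  await expect(page.getByText(email)).toBeVisible();
});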

3. Fixed timeouts

You're waiting for arbitrary amounts of time instead of actual conditions.

// ❌ Unstable: Fast machines pass, slow ones fail
await page.waitForTimeout(2000);

The problem: 2 seconds might work on your M3 Mac but fail on a containerized CI runner.

4. External dependencies

Your tests rely on third-party services that respond unpredictably.

Examples: Payment gateways, email verification APIs, or rate-limited external APIs. When these services slow down or temporarily fail, your tests fail even though your code is fine.
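
A common mitigation is to stub the third-party call at the network layer so the test no longer waits on the external service. This sketch uses page.route() with a made-up payment endpoint and response body:

ts
import { test, expect } from '@playwright/test';

test('checkout succeeds with a stubbed payment gateway', async ({ page }) => {
  // Intercept calls to the external gateway and return a canned response.
  // The URL pattern and JSON body are assumptions for illustration.
  await page.route('**/payments.example.com/**', route =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ status: 'approved', id: 'test-charge-123' }),
    })
  );

  await page.goto('https://app.example.com/checkout'); // placeholder URL
  await page.getByRole('button', { name: 'Pay now' }).click();
  await expect(page.getByText('Payment approved')).toBeVisible();
});

For flows where the third party must be exercised for real, keep those tests in a separate, clearly labeled suite so their instability doesn't block everything else.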

How to identify flaky tests

You can't fix what you don't measure. Here's how to separate real bugs from flaky noise.

The 4 metrics that actually matter

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| Flaky Rate | % of tests that fail initially but pass on retry | Your #1 stability indicator. Keep it near 0%. |
| Pass Rate Trend | Daily success rate of your entire suite | Spot stability degradation immediately after deploys. |
| Error Variants | Number of unique error messages per failure | High variance = deep instability; the test fails in different ways. |
| EWMA | Exponentially weighted moving average of test duration | Catches tests that are slowly getting slower, a leading indicator of future flakiness. |
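
As a rough illustration of how the first and last metrics can be computed from raw run history (the record shape below is invented, not any particular tool's schema):

ts
// Hypothetical shape of one historical run of a single test.
type RunRecord = { passedFirstTry: boolean; passedAfterRetry: boolean; durationMs: number };

// Flaky rate: share of runs that failed initially but passed on retry.
function flakyRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0;
  const flaky = runs.filter(r => !r.passedFirstTry && r.passedAfterRetry).length;
  return flaky / runs.length;
}

// EWMA of duration: recent runs weigh more, so a slow upward drift
// (a leading indicator of future flakiness) shows up early.
function ewmaDuration(runs: RunRecord[], alpha = 0.3): number {
  return runs.reduce(
    (avg, r, i) => (i === 0 ? r.durationMs : alpha * r.durationMs + (1 - alpha) * avg),
    0
  );
}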

The old way vs. the smart way

In the traditional debugging workflow, a test fails in CI, and the developer checks the logs, only to find a generic or unclear error.

When the issue cannot be reproduced locally, debugging becomes guesswork. The developer reruns the test, adds a random sleep() to stabilize timing, and eventually ships the change while hoping it fixes the problem.

With an AI-powered workflow, the same failure is handled more intelligently. When a test fails in CI, an analytics tool groups the failure by its root cause and provides a confidence score.

The developer sees an insight like "Timing issue (94% confidence): element not interactive," applies a targeted fix with an explicit wait, and verifies it against historical trend data to confirm the problem is actually fixed.

Spot flaky tests fast

Pinpoint root causes and fix them before CI breaks.

Get Started

Using AI to categorize failures

Modern test reporting tools use pattern recognition to group failures into actionable categories:

  • Actual Bug: Consistent failure across environments → fix product code.
  • UI Change: Selector changed after a DOM update → update locators.
  • Unstable Test: Intermittent failure → apply timing fixes or quarantine.
  • Miscellaneous: Setup issues, data problems, or CI configuration errors.

The key advantage: You stop chasing symptoms and start fixing root causes!
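
The grouping itself can be approximated with simple pattern matching on error messages. The rules below are a simplified, rule-based sketch for illustration, not how any particular AI platform works:

ts
type Category = 'Actual Bug' | 'UI Change' | 'Unstable Test' | 'Miscellaneous';

// Illustrative heuristics only; real platforms combine many more signals
// (history, environment, git context) before assigning a confidence score.
function categorizeFailure(errorMessage: string): Category {
  if (/strict mode violation|selector resolved to|element.*not found/i.test(errorMessage)) {
    return 'UI Change';
  }
  if (/timeout.*exceeded|not (visible|stable|enabled)|navigation interrupted/i.test(errorMessage)) {
    return 'Unstable Test';
  }
  if (/expect.*(received|toBe|toEqual)/i.test(errorMessage)) {
    return 'Actual Bug';
  }
  return 'Miscellaneous';
}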

Fixing flaky tests: Playwright best practices

Now let's get tactical. These patterns eliminate the most common sources of flakiness.

Fix #1: Replace every fixed timeout

The Problem:

// ❌ AVOID: Unstable fixed wait
await page.waitForTimeout(2000);
await page.click('button#submit');

The Solution:

// ✅ USE: Explicitly wait for the element to be ready
await expect(page.getByRole('button', { name: 'Submit' })).toBeVisible();
await page.getByRole('button', { name: 'Submit' }).click();

Why this works: Playwright's auto-waiting ensures the element is visible and interactive before clicking. No guessing, no arbitrary delays.

Fix #2: Isolate test state completely

Every test should be atomic; it creates its own data and cleans up after itself.

Best practice setup:
import { test, expect } from '@playwright/test';

// Enable traces and screenshots for debugging
test.use({ trace: 'on-first-retry', screenshot: 'only-on-failure' });

test.beforeEach(async ({ page }) => {
  // Each test gets a clean slate
  await page.goto(process.env.BASE_URL);
});

test('login shows dashboard', async ({ page }) => {
  // Use unique, dedicated test data
  const testUser = `test-${Date.now()}@example.com`;
  const testPass = process.env.USER_PASS;

  await page.getByLabel('Email').fill(testUser);
  await page.getByLabel('Password').fill(testPass);
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Wait for the final state before asserting
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});

Key principle: No test should ever depend on another test's side effects.

Fix #3: Handle async operations properly

For complex interactions that involve network requests or animations:

example.spec.js
// ✅ Wait for network to settle
await page.goto('https://app.example.com');
await page.waitForLoadState('networkidle');

// ✅ Wait for specific API response
await page.waitForResponse(
  response => response.url().includes('/api/data') && response.status() === 200
);

// ✅ Wait for element to be stable (no more animations)
await expect(page.locator('.modal')).toBeVisible();
await page.locator('.modal').waitFor({ state: 'visible' });

Retry configuration: when and how?

Retries are a tool, not a cure. Use them strategically:

playwright.config.ts
export default {
  // Only retry in CI, not locally
  retries: process.env.CI ? 2 : 0,

  // But give each test attempt enough time
  timeout: 30000,
};

Golden rule: If a test needs retries to pass consistently, it's still broken. Retries just hide the instability.
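
If a small group of tests legitimately depends on slow external systems, retries can be scoped to just that group instead of the whole suite. A sketch, with placeholder test content:

ts
import { test, expect } from '@playwright/test';

// Only this block gets extra retries; the rest of the suite keeps the
// global (CI-only) retry setting from playwright.config.ts.
test.describe('network-dependent checks', () => {
  test.describe.configure({ retries: 2 });

  test('status page reports healthy', async ({ page }) => {
    await page.goto('https://status.example.com'); // placeholder URL
    await expect(page.getByText('All systems operational')).toBeVisible();
  });
});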

Building test stability into your CI/CD pipeline

As your test suite grows, stability becomes an organizational challenge. Here's how to scale without losing control.

Integrating stability checks in CI

Your CI pipeline should automatically surface flakiness, without manual investigation.

Minimal GitHub Actions setup:
name: e2e
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=list,html,junit
      # Upload artifacts for analysis
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: |
            playwright-report/
            test-results/
            test-results/*.xml

What happens next: Analytics platforms ingest these artifacts and link failures to specific commits, PRs, and branches. No more "which commit broke this?"

Role-based dashboards

Different roles need different views:

QA engineers need:

  • Overall pass/fail breakdown
  • Failure category distribution (Bug vs. Flaky)
  • New failures since last deploy

Developers need:

  • Active blockers on their branches
  • Flaky tests in their scope
  • Direct links to failed runs

Engineering managers need:

  • High-level metrics (pass rate, flaky rate)
  • Trend lines showing that stability is improving
  • ROI data on automation investment

Role-based dashboards

Instant insights into test stability for your role

Try TestDino

Quarantine strategy for unstable tests

When you identify tests with >5% flaky rate over 30 days:

1. Don't delete them (they might catch real bugs)

2. Don't leave them in the main suite (they erode trust)

3. Quarantine and monitor:

ts
// Mark as quarantined
test.skip('payment flow completes', async ({ page }) => {
  // TODO: Fix timing issue with external payment gateway
});

Run quarantined tests in a separate, non-blocking CI job. Fix them when you have dedicated time.
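
One way to wire this up, assuming quarantined tests carry a @quarantine tag in their titles: split the suite into a blocking "main" project and a non-blocking "quarantine" project, then run them as separate CI jobs (with the quarantine job marked continue-on-error or equivalent).

playwright.config.ts
import { defineConfig } from '@playwright/test';

// Sketch: run `npx playwright test --project=main` in the blocking job
// and `npx playwright test --project=quarantine` in the non-blocking one.
export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  projects: [
    { name: 'main', grepInvert: /@quarantine/ },
    { name: 'quarantine', grep: /@quarantine/ },
  ],
});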

Debugging flaky tests: Context is everything

The difference between a 2-hour debugging session and a 10-minute fix comes down to having the right evidence at hand.

What you need to debug effectively

When a test fails, you should have instant access to:

1. Step-by-step execution history

  • Which step failed (line 47: await page.click('#submit'))
  • How long each step took (helps identify timeouts)

2. Visual proof

  • Screenshots at the moment of failure
  • Full video recording of the entire test run
  • DOM snapshot showing element state

3. Console logs

  • Browser console errors
  • Network request failures
  • JavaScript exceptions

4. Environment context

  • Which branch, commit, and PR
  • Which browser/viewport configuration
  • What test data was used

Modern test reporting tools bundle all of this into a single view. You click a failed test, see the video, spot the issue, and fix it with no local reproduction needed.
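
Most of that evidence can be captured by Playwright itself. A minimal config sketch that retains artifacts only when something goes wrong:

playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Full trace (DOM snapshots, console, network) on the first retry,
    // plus video and a screenshot whenever a test fails.
    trace: 'on-first-retry',
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});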

Preventing flaky tests

Prevention beats debugging every time. Build these practices into your development workflow:

1. Write tests with stability in mind

DO:

  • ✅ Use explicit waits (toBeVisible(), waitForResponse())
  • ✅ Create fresh test data for each test
  • ✅ Use Playwright's auto-retry assertions
  • ✅ Test against stable states, not intermediate animations

DON'T:

  • ❌ Use waitForTimeout() for anything
  • ❌ Chain tests together with shared state
  • ❌ Hard-code selectors without fallback strategies
  • ❌ Skip writing teardown logic

2. Code review checklist

Before approving any new test, verify:

[ ] No fixed timeouts

[ ] Test can run in isolation

[ ] Uses role-based selectors (more stable than CSS)

[ ] Includes retry configuration only for network-dependent tests

[ ] Has proper cleanup in afterEach (see the sketch below)
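
A minimal cleanup sketch, assuming a hypothetical DELETE endpoint for test data and a configured baseURL:

ts
import { test } from '@playwright/test';

let createdUserId: string | undefined;

test.afterEach(async ({ request }) => {
  // Remove whatever this test created so the next run starts clean.
  // The endpoint is a hypothetical placeholder for your own cleanup API.
  if (createdUserId) {
    await request.delete(`/api/users/${createdUserId}`);
    createdUserId = undefined;
  }
});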

3. Monitor suite health over time

Track your stability metrics weekly:

| Week | Pass Rate | Flaky Rate | Avg Test Duration |
| --- | --- | --- | --- |
| Week 1 | 87% | 15% | 4m 23s |
| Week 2 | 91% | 11% | 3m 58s |
| Week 3 | 94% | 7% | 3m 45s |
| Week 4 | 96% | 3% | 3m 30s |

Goal: Pass rate ↑, Flaky rate ↓, Duration ↓

Case study: Saving hours per week

Microsoft tackled flaky tests with a company-wide strategy:

Problem: Developers spent hours on false failures, and CI/CD pipelines were unreliable.

Solution: They implemented DeFlaker with automated detection, a “fix-or-remove” policy, continuous monitoring, and developer training.

Results: Flakiness dropped 18% in 6 months, productivity improved, and engineering time was redirected to feature development.

Takeaway: Combining automation, clear policies, and team commitment drives real test stability.

Tools and framework comparison

Framework-specific stability features

Playwright:

  • ✅ Built-in auto-waiting
  • ✅ Trace viewer with timeline
  • ✅ Network interception
  • ⚠️ Requires explicit retry configuration

Cypress:

  • ✅ Automatic retries for assertions
  • ✅ Time-travel debugging
  • ⚠️ Limited multi-tab support

Selenium:

  • ⚠️ Manual wait management
  • ⚠️ No built-in video recording
  • ✅ Mature ecosystem

What makes a test reporting tool actually useful

| Feature | Basic JUnit Viewer | Modern AI Platform |
| --- | --- | --- |
| Failure categorization | Manual | Automatic with confidence scores |
| Root cause detection | None | AI-powered (race condition, timing, etc.) |
| Git awareness | Run ID only | Links to specific PRs, branches, commits |
| Evidence collection | Logs only | Videos, screenshots, traces, console logs |
| Time to fix | Hours | Minutes |

Why this matters: A basic JUnit viewer tells you what failed. An AI platform tells you why it failed and how to fix it.

Real-world success stories

Spotify's Approach: Spotify reduced test flakiness from 6% to 4% in just 2 months by making visibility tools available to all developers. Their key insight? Giving developers immediate feedback about flaky tests dramatically accelerated fixes.

Meta's Innovation: Meta developed a probabilistic flakiness scoring system that predicts which tests are likely to become flaky before they cause widespread problems. This proactive approach prevents flakiness from spreading through the codebase.

Conclusion

Flaky tests aren’t just technical annoyances; they erode trust in CI, slow releases, and waste developer time on false failures. This affects productivity and team morale.

Test stability is achievable by tracking flaky rates, using AI-powered root cause analysis, replacing arbitrary waits with explicit actions, and isolating test state with proper async handling.

Focusing on real issues rather than retries restores confidence in automation. Teams ship faster, catch genuine bugs sooner, and save significant debugging time. Start improving test stability with TestDino today and see immediate results.

Boost test stability instantly

Track, analyze, and fix flaky tests with AI-powered insights

Check TestDino

FAQs

Should I just add retries to fix flaky tests?

No. Retries hide the symptom, not the cause. Your test is still unstable; it just passes on the second or third try. This adds 2-3x to your CI time. Use retries sparingly for legitimate network failures, but always fix the underlying issue.
