Playwright flaky tests: detection, causes, and fixes
Fix flaky tests and build a stable automation suite using Playwright best practices, root cause analysis, and AI insights to restore CI trust.
Flaky tests break trust and destroy test stability. One minute your CI fails, the next it passes, and your team loses hours chasing issues that aren't real bugs.
Retries only hide the instability. Real test stability comes from identifying root causes (timing issues, race conditions, environment noise), not masking them.
This guide shows how to fix instability at the source with Playwright best practices and AI-powered insights, so your team moves from firefighting to consistent, reliable automation.
What makes tests flaky, and why teams ignore CI
A flaky test is one that passes, then fails, without any code change. It isn't just an annoyance; it quietly destroys trust in your entire automation suite.
The real cost of test instability
Flakiness creates real business drag:
- Wasted engineering hours: Teams lose significant weekly time debugging false failures instead of building features.
- Eroded developer trust: Red builds get dismissed as "probably just flaky," letting real bugs slip through. A GitLab survey found 36% of developers face delayed releases due to test failures at least once a month.
- Slower feedback loops: Retries inflate CI time and turn quick checks into slow bottlenecks.
The critical insight
Research shows a large share of CI failures come from environmental and infrastructure noise, not product bugs. In other words, much of your debugging time is spent chasing issues that never existed in the product at all.
The 4 root causes of flakiness
Let's break down what's actually happening when tests flake:
1. Race conditions
Your test runs faster than your application code.
Common scenario: The test tries to click a button before the page finishes loading. In Playwright, this looks like:
// ❌ This will flake
await page.goto('https://app.example.com');
await page.click('button'); // May fire before the click handler is attached
Why it flakes: With asynchronous rendering, the button can exist in the DOM and even look ready while the framework hasn't attached its event handlers yet, so the click lands on a dead element.
2. Uncontrolled state
Your tests depend on data or sessions from previous tests.
The trap: Test A creates a user account. Test B uses that account. When CI runs tests in parallel or shuffles the order, Test B fails because the account doesn't exist.
Warning sign: Tests pass locally but fail in CI, or fail when run in isolation.
3. Fixed timeouts
You're waiting for arbitrary amounts of time instead of actual conditions.
// ❌ Unstable: Fast machines pass, slow ones fail
await page.waitForTimeout(2000);
The problem: 2 seconds might work on your M3 Mac but fail on a containerized CI runner.
4. External dependencies
Your tests rely on third-party services that respond unpredictably.
Examples: Payment gateways, email verification APIs, or rate-limited external APIs. When these services slow down or temporarily fail, your tests fail even though your code is fine.
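One common mitigation is to stub the third-party call at the network layer with Playwright's `page.route`, so the test exercises your code without depending on the external service. A minimal sketch; the `/api/payments` endpoint, the response shape, and the page copy are hypothetical stand-ins for your own app:

```typescript
import { test, expect } from '@playwright/test';

test('checkout succeeds with a stubbed payment gateway', async ({ page }) => {
  // Intercept requests to the (hypothetical) gateway endpoint and return a
  // canned response, removing the external service's latency and downtime
  // from the test's failure modes.
  await page.route('**/api/payments/**', route =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ status: 'succeeded', id: 'pay_test_123' }),
    })
  );

  await page.goto('https://app.example.com/checkout');
  await page.getByRole('button', { name: 'Pay now' }).click();
  await expect(page.getByText('Payment successful')).toBeVisible();
});
```

Keep a few contract tests against the real service in a separate, non-blocking job so you still notice genuine integration breakage.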
How to identify flaky tests
You can't fix what you don't measure. Here's how to separate real bugs from flaky noise.
The 4 metrics that actually matter
| Metric | What it measures | Why it matters |
|---|---|---|
| Flaky Rate | % of tests that fail initially but pass on retry | This is your #1 stability indicator. Keep this near 0%. |
| Pass Rate Trend | Daily success rate of your entire suite | Spot stability degradation immediately after deploys. |
| Error Variants | Number of unique error messages per failure | High variance = deep instability. The test fails in different ways. |
| EWMA | Exponentially weighted moving average for duration | Catches tests that are slowly getting slower, a leading indicator of future flakiness. |
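To make these concrete, here is how the first and last metrics could be computed from raw run data. The `TestRun` shape is an assumption for illustration, not any particular tool's schema:

```typescript
interface TestRun {
  name: string;
  attempts: boolean[]; // pass/fail per attempt; [false, true] = passed on retry
  durationMs: number;
}

// Flaky rate: share of runs that failed at least once but ultimately passed.
function flakyRate(runs: TestRun[]): number {
  const flaky = runs.filter(
    r => r.attempts.includes(false) && r.attempts[r.attempts.length - 1]
  );
  return runs.length ? flaky.length / runs.length : 0;
}

// EWMA of duration: recent runs weigh more, so a slow drift upward
// becomes visible long before a hard timeout starts firing.
function ewmaDuration(runs: TestRun[], alpha = 0.3): number {
  return runs.reduce(
    (avg, r, i) => (i === 0 ? r.durationMs : alpha * r.durationMs + (1 - alpha) * avg),
    0
  );
}
```

Feeding these a rolling window of your suite's history gives you the trend lines to watch week over week.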
The Old Way vs. The Smart Way
In the traditional debugging workflow, a test fails in CI, and the developer checks the logs, only to find a generic or unclear error.
When the issue cannot be reproduced locally, debugging becomes guesswork. The developer reruns the test, adds a random sleep() to stabilize timing, and eventually ships the change while hoping it fixes the problem.
With an AI-powered workflow, the same failure is handled more intelligently. When a test fails in CI, an analytics tool groups the failure by its root cause and provides a confidence score.
The developer sees an insight like "Timing issue (94% confidence): element not interactive," applies a targeted fix using an explicit wait, and verifies the solution against historical trend data to make sure the problem is actually fixed.
Using AI to categorize failures
Modern test reporting tools use pattern recognition to group failures into actionable categories:
- Actual Bug: Consistent failure across environments → fix product code.
- UI Change: Selector changed after a DOM update → update locators.
- Unstable Test: Intermittent failure → apply timing fixes or quarantine.
- Miscellaneous: Setup issues, data problems, or CI configuration errors.
The key advantage: You stop chasing symptoms and start fixing root causes!
Fixing flaky tests: Playwright best practices
Now let's get tactical. These are the patterns that eliminate the vast majority of common flakiness.
Fix #1: Replace every fixed timeout
The Problem:
// ❌ AVOID: Unstable fixed wait
await page.waitForTimeout(2000);
await page.click('button#submit');
The Solution:
// ✅ USE: Explicitly wait for the element to be ready
const submit = page.getByRole('button', { name: 'Submit' });
await expect(submit).toBeVisible();
await submit.click();
Why this works: Playwright's auto-waiting ensures the element is visible and interactive before clicking. No guessing, no arbitrary delays.
Fix #2: Isolate test state completely
Every test should be atomic; it creates its own data and cleans up after itself.
Best practice setup:
import { test, expect } from '@playwright/test';

// Enable traces and screenshots for debugging
test.use({
  trace: 'on-first-retry',
  screenshot: 'only-on-failure',
});

test.beforeEach(async ({ page }) => {
  // Each test gets a clean slate
  await page.goto(process.env.BASE_URL!);
});

test('login shows dashboard', async ({ page }) => {
  // Use unique, dedicated test data
  const testUser = `test-${Date.now()}@example.com`;
  const testPass = process.env.USER_PASS!;

  await page.getByLabel('Email').fill(testUser);
  await page.getByLabel('Password').fill(testPass);
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Wait for the final state before asserting
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
Key principle: No test should ever depend on another test's side effects.
Fix #3: Handle async operations properly
For complex interactions that involve network requests or animations:
// ✅ Wait for network to settle
await page.goto('https://app.example.com');
await page.waitForLoadState('networkidle');
// ✅ Wait for specific API response
await page.waitForResponse(
  response => response.url().includes('/api/data') && response.status() === 200
);
// ✅ Wait for the modal to be visible (Playwright's actionability checks
// also wait for it to stop moving before any click)
await expect(page.locator('.modal')).toBeVisible();
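One subtlety with `waitForResponse`: register the wait before the action that triggers the request, otherwise the response can arrive before the listener exists and the wait times out. A sketch of the standard pattern; the button name and URL are placeholders:

```typescript
// Register the wait first, then trigger the request, then await the promise.
const responsePromise = page.waitForResponse(
  response => response.url().includes('/api/data') && response.status() === 200
);
await page.getByRole('button', { name: 'Load data' }).click();
await responsePromise;
```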
Retry configuration: When and How?
Retries are a tool, not a cure. Use them strategically:
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Only retry in CI, not locally
  retries: process.env.CI ? 2 : 0,
  // But give each test attempt enough time
  timeout: 30000,
});
Golden rule: If a test needs retries to pass consistently, it's still broken. Retries just hide the instability.
Building test stability into your CI/CD pipeline
As your test suite grows, stability becomes an organizational challenge. Here's how to scale without losing control.
Integrating stability checks in CI
Your CI pipeline should automatically surface flakiness, without manual investigation.
Minimal GitHub Actions setup:
name: e2e
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=list,html,junit
      # Upload artifacts for analysis
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: |
            playwright-report/
            test-results/
What happens next: Analytics platforms ingest these artifacts and link failures to specific commits, PRs, and branches. No more "which commit broke this?"
Role-based dashboards
Different roles need different views:
QA engineers need:
- Overall pass/fail breakdown
- Failure category distribution (Bug vs. Flaky)
- New failures since last deploy
Developers need:
- Active blockers on their branches
- Flaky tests in their scope
- Direct links to failed runs
Engineering managers need:
- High-level metrics (pass rate, flaky rate)
- Trend lines showing stability is improving
- ROI data on automation investment
Quarantine strategy for unstable tests
When you identify tests with >5% flaky rate over 30 days:
- Don't delete them (they might catch real bugs)
- Don't leave them in the main suite (they erode trust)
- Quarantine and monitor:
// Tag the test so it can be filtered out of the main suite
test('payment flow completes @quarantine', async ({ page }) => {
  // TODO: Fix timing issue with external payment gateway
});
Run quarantined tests in a separate, non-blocking CI job (e.g. `npx playwright test --grep @quarantine`, while the main job passes `--grep-invert @quarantine`). Fix them when you have dedicated time.
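The ">5% flaky rate over 30 days" rule is easy to automate against your run history. A minimal sketch; the `RunRecord` shape is hypothetical, not any particular reporting tool's schema:

```typescript
interface RunRecord {
  testName: string;
  date: Date;
  wasFlaky: boolean; // failed at least once, then passed on retry
}

// Return the names of tests whose flaky rate inside the window
// exceeds the threshold, making them quarantine candidates.
function quarantineCandidates(
  history: RunRecord[],
  now: Date,
  windowDays = 30,
  threshold = 0.05
): string[] {
  const cutoff = new Date(now.getTime() - windowDays * 24 * 60 * 60 * 1000);
  const byTest = new Map<string, { runs: number; flaky: number }>();
  for (const r of history) {
    if (r.date < cutoff) continue; // ignore runs outside the window
    const s = byTest.get(r.testName) ?? { runs: 0, flaky: 0 };
    s.runs += 1;
    if (r.wasFlaky) s.flaky += 1;
    byTest.set(r.testName, s);
  }
  return [...byTest.entries()]
    .filter(([, s]) => s.flaky / s.runs > threshold)
    .map(([name]) => name);
}
```

Wiring this into a weekly job that opens a ticket per candidate keeps the quarantine list actively managed instead of becoming a graveyard.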
Debugging flaky tests: Context is everything
The difference between a 2-hour debugging session and a 10-minute fix comes down to having the right evidence at hand.
What you need to debug effectively
When a test fails, you should have instant access to:
1. Step-by-step execution history
- Which step failed (line 47: await page.click('#submit'))
- How long each step took (helps identify timeouts)
2. Visual proof
- Screenshots at the moment of failure
- Full video recording of the entire test run
- DOM snapshot showing element state
3. Console logs
- Browser console errors
- Network request failures
- JavaScript exceptions
4. Environment context
- Which branch, commit, and PR
- Which browser/viewport configuration
- What test data was used
Modern test reporting tools bundle all of this into a single view. You click a failed test, see the video, spot the issue, and fix it with no local reproduction needed.
Preventing flaky tests
Prevention beats debugging every time. Build these practices into your development workflow:
1. Write tests with stability in mind
DO:
- ✅ Use explicit waits (toBeVisible(), waitForResponse())
- ✅ Create fresh test data for each test
- ✅ Use Playwright's auto-retry assertions
- ✅ Test against stable states, not intermediate animations
DON'T:
- ❌ Use waitForTimeout() for anything
- ❌ Chain tests together with shared state
- ❌ Hard-code selectors without fallback strategies
- ❌ Skip writing teardown logic
2. Code review checklist
Before approving any new test, verify:
[ ] No fixed timeouts
[ ] Test can run in isolation
[ ] Uses role-based selectors (more stable than CSS)
[ ] Includes retry configuration only for network-dependent tests
[ ] Has proper cleanup in afterEach
3. Monitor suite health over time
Track your stability metrics weekly:
| Week | Pass Rate | Flaky Rate | Avg Test Duration |
|---|---|---|---|
| Week 1 | 87% | 15% | 4m 23s |
| Week 2 | 91% | 11% | 3m 58s |
| Week 3 | 94% | 7% | 3m 45s |
| Week 4 | 96% | 3% | 3m 30s |
Goal: Pass rate ↑, Flaky rate ↓, Duration ↓
Tools and framework comparison
Framework-specific stability features
Playwright:
- ✅ Built-in auto-waiting
- ✅ Trace viewer with timeline
- ✅ Network interception
- ⚠️ Requires explicit retry configuration
Cypress:
- ✅ Automatic retries for assertions
- ✅ Time-travel debugging
- ⚠️ Limited multi-tab support
Selenium:
- ⚠️ Manual wait management
- ⚠️ No built-in video recording
- ✅ Mature ecosystem
What makes a test reporting tool actually useful
| Feature | Basic JUnit Viewer | Modern AI Platform |
|---|---|---|
| Failure categorization | Manual | Automatic with confidence scores |
| Root cause detection | None | AI-powered (race condition, timing, etc.) |
| Git awareness | Run ID only | Links to specific PRs, branches, commits |
| Evidence collection | Logs only | Videos, screenshots, traces, console logs |
| Time to fix | Hours | Minutes |
Why this matters: A basic JUnit viewer tells you what failed. An AI platform tells you why it failed and how to fix it.
Conclusion
Flaky tests aren't just technical annoyances; they erode trust in CI, slow releases, and waste developer time on false failures. This affects productivity and team morale.
Test stability is achievable by tracking flaky rates, using AI-powered root cause analysis, replacing arbitrary waits with explicit actions, and isolating test state with proper async handling.
Focusing on real issues rather than retries restores confidence in automation. Teams ship faster, catch genuine bugs sooner, and save significant debugging time. Start improving test stability with TestDino today and see immediate results.