We Analyzed Playwright Test Failures: Here’s the Breakdown
Async wait issues cause nearly half of all UI test failures, making them the biggest source of Playwright flakiness. Here’s what research reveals about why tests fail and how to fix them faster.
The problem today is that most teams debug failures one at a time, without a system to understand why tests break.
We combined data from peer-reviewed studies, open-source project analyses, and real-world failure classification patterns from Playwright teams to build the most complete breakdown of test failures available.
The goal: show you where your debugging time goes, and how to cut it in half.
Finding 1: Almost Half of All Failures Trace Back to Async Wait Issues
The single biggest category of E2E test failures is async wait, and it’s not close.
Romano et al. found that 45% of flaky UI tests fail because tests don't properly wait for asynchronous operations to complete. Luo et al.'s foundational study confirmed the same pattern: nearly half of flaky-test-fixing commits addressed async wait as the root cause.
In Playwright terms, this shows up as:
- Clicking before an element is interactive: The element exists in the DOM but isn't clickable yet
- Asserting before data loads: API responses haven't returned, so the UI shows a stale or empty state
- Racing against animations/transitions: The page is mid-render when the test acts

| Root Cause Category | % of UI Test Failures | Playwright Equivalent |
|---|---|---|
| Async Wait | 45% | Missing waitFor, premature assertions |
| Concurrency / Race Conditions | ~24% | Parallel tab/context conflicts, shared state |
| Platform / Environment | ~12% | Browser-specific rendering, OS differences |
| Network | ~9% | API timeouts, failed backend requests |
| Test Order Dependency | ~5% | Shared database state between tests |
| Other (IO, Randomness, Time) | ~5% | Timezone-dependent logic, random test data |
Sources: Romano et al., 2021; Luo et al., 2014; Parry et al., 2022
Tip: Playwright's built-in auto-waiting covers many async cases automatically. But auto-wait only applies to actions like click() and fill(). Assertions like expect(locator).toHaveText() need an explicit await, and custom logic that checks DOM state manually bypasses auto-wait entirely. That's where most async failures hide.
If you're spending time on random test failures, start by auditing your assertions. Replace manual DOM checks with Playwright's web-first assertions (expect(locator).toBeVisible() instead of checking element properties directly). This alone can address nearly half your failure backlog.
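As a minimal sketch of that audit, here is what the swap looks like in a Playwright test (the /cart route, #cart-total selector, and expected values are hypothetical):

```ts
import { test, expect } from '@playwright/test';

test('cart total updates after the API responds', async ({ page }) => {
  await page.goto('/cart');

  // Anti-pattern: a one-shot DOM read races the API response and fails
  // whenever the data hasn't rendered yet.
  // const total = await page.locator('#cart-total').textContent();
  // expect(total).toBe('$42.00');

  // Web-first assertion: Playwright re-checks the locator until it matches
  // or the expect timeout expires, absorbing async rendering delays.
  await expect(page.locator('#cart-total')).toHaveText('$42.00');
  await expect(page.getByRole('button', { name: 'Checkout' })).toBeEnabled();
});
```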
Finding 2: Timeouts Are the Silent Killer
Timeouts sit in a gray zone between real failures and infrastructure noise. The SAP HANA study found that system tests suffer from timeout issues in 18% of cases, compared to just 7% for unit tests. The more complex the test, the more likely a timeout becomes the failure mode.
In Playwright, timeouts show up in three flavors:
- Action timeouts (default 30s): page.click() waiting for an element that never appears
- Navigation timeouts (default 30s): page.goto() on a slow or unresponsive server
- Test timeouts (default 30s per test): The entire test exceeds its time budget
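If you want those three budgets to be explicit and independently tunable, they live in the Playwright config. A minimal sketch, with illustrative values only:

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000,               // test timeout: budget for the whole test
  expect: { timeout: 10_000 },   // how long web-first assertions keep retrying
  use: {
    actionTimeout: 15_000,       // click(), fill(), and other actions
    navigationTimeout: 30_000,   // goto(), waitForURL()
  },
});
```

Keeping the budgets distinct means the failure message tells you which kind of timeout fired, which feeds directly into the diagnosis below.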
Eck et al.'s study of 200 flaky tests at Mozilla found that test case timeouts and test suite timeouts were frequent enough to warrant their own categories, separate from async wait.
Tip: Don't just increase timeout values when tests fail. That masks the problem. Instead, check whether the timeout is caused by a slow backend (network issue), a missing element (locator issue), or a legitimately slow operation. TestDino's Analytics dashboard tracks duration trends per spec, so you can spot tests gradually slowing down before they start timing out.
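One way to run that diagnosis inside the test itself is to give the backend call its own wait, separate from the UI assertion. A sketch (the /api/orders endpoint, button name, and test id are hypothetical):

```ts
import { test, expect } from '@playwright/test';

test('orders table loads', async ({ page }) => {
  await page.goto('/orders');

  // Start listening before triggering the request. If this wait times out,
  // the problem is the backend or network, not the locator.
  const ordersResponse = page.waitForResponse(
    (res) => res.url().includes('/api/orders') && res.ok(),
    { timeout: 20_000 },
  );
  await page.getByRole('button', { name: 'Load orders' }).click();
  await ordersResponse;

  // If this assertion times out instead, the data arrived but the UI
  // or the locator is at fault.
  await expect(page.getByTestId('orders-table')).toBeVisible();
});
```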
Finding 3: Resource and Environment Flakiness Accounts for Almost Half of All Flaky Tests
Here's a finding that surprised us. A study of resource-affected flaky tests (RAFTs) covering 52 projects across Java, JavaScript, and Python found that 46.5% of flaky tests fall into this category: their pass/fail behavior depends entirely on available CPU, memory, and I/O at execution time.
This means almost half of your flaky tests aren't flaky because of bad test code — they're flaky because CI runners have variable resources. In Playwright, this shows up as:
- Tests passing locally but failing in CI: Your machine has more CPU/RAM than the CI runner
- Inconsistent behavior across shards: Some shards get resource-starved containers
- Tests failing only under load: Parallel execution increases resource contention
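If contention under parallel load is the suspect, one knob worth pinning is the worker count on CI. A sketch, assuming the flakiness tracks resource pressure (the numbers are illustrative):

```ts
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Fewer parallel workers on small CI runners reduces CPU/memory contention;
  // locally, Playwright picks a worker count from the available cores.
  workers: process.env.CI ? 2 : undefined,
  // A retry or two on CI helps separate resource-induced flakiness from real defects.
  retries: process.env.CI ? 2 : 0,
});
```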
| Factor | Impact on Flakiness | Practical Solution |
|---|---|---|
| CPU availability | Higher CPU reduces the flaky rate substantially | Use consistent CI runner specs |
| Memory pressure | Causes browser crashes and slow renders | Monitor container memory limits |
| Parallel execution | Increases contention between tests | Use Playwright's browser context isolation |
| Network variability | API call timing becomes unpredictable | Mock external APIs in tests |
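For the network row, the usual Playwright tool is page.route(). A sketch with a hypothetical endpoint and payload:

```ts
import { test, expect } from '@playwright/test';

test('dashboard renders with mocked metrics API', async ({ page }) => {
  // Intercept the (hypothetical) metrics endpoint so the test no longer
  // depends on backend latency or third-party availability.
  await page.route('**/api/metrics', (route) =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ activeUsers: 128, errorRate: 0.02 }),
    }),
  );

  await page.goto('/dashboard');
  await expect(page.getByText('128')).toBeVisible();
});
```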

Before blaming the test code, check your CI environment. TestDino's Environment Mapping lets you compare test results across environments, so you can determine whether failures are coding issues or infrastructure problems.
Finding 4: 13% of CI Build Failures Are Caused by Flaky Tests
A study of open-source projects by Labuschagne et al. found that 13% of failed CI builds were caused by flaky tests, rather than actual code defects. GitHub's internal data from 2020 showed that 1 in 11 commits (9%) had at least one red build due to a flaky test.
At Google's scale, the numbers are even more striking. Almost 16% of their tests show some level of flakiness, and 84% of test transitions from pass to fail involved a flaky test. For a project with 1,000 tests and a 1.5% flaky rate, roughly 15 tests will likely fail on every run, each requiring investigation.
| Metric | Value | Source |
|---|---|---|
| Flaky tests at Google | ~16% of all tests | Micco, 2016 |
| CI builds failed due to flaky tests | 13% | Labuschagne et al., 2017 |
| Commits with flaky red builds (GitHub) | 9% (1 in 11) | GitHub internal data, 2020 |
| Pass-to-fail transitions that are flaky (Google) | 84% | Micco, 2016 |
| Cost of poor software quality (US) | $2.41 trillion | CISQ |
Tip: The real cost isn't the CI minutes. It's the developer time. An industrial case study found developers spend 1.28% of their working time specifically on repairing flaky tests, at a monthly cost of $2,250 per developer. For a team of 10, that's $270,000/year burned on flaky test repairs alone.
Teams that invest in flaky test detection early recoup that cost within weeks.
Finding 5: JavaScript Tests Fail Differently Than Java Tests
Most flaky test research focuses on Java projects. But Verdecchia et al.'s JavaScript-specific study found a meaningfully different root cause distribution for JS/TS codebases:
| Root Cause | In Java Studies | In JavaScript Studies |
|---|---|---|
| Async wait / Concurrency | Common | Dominant (primary cause) |
| Test order dependency | Top 3 cause | Rare |
| OS-specific behavior | Occasionally noted | Ranks higher |
| Network stability | Moderate factor | Bigger factor (event-driven arch.) |
| Infrastructure flakiness | Not reported | New category (Gruber et al.) |
This matters for Playwright teams because Playwright tests run in JavaScript/TypeScript by default, in an inherently asynchronous environment where timing is the primary failure mode.
Don't apply Java-focused flaky test advice directly to your Playwright suite. Focus on async patterns, network mocking, and environment consistency rather than test isolation and execution ordering. For a practical checklist, see the Playwright automation checklist.
Finding 6: Flaky Tests Cluster Together
One of the most practical findings comes from systemic flakiness research. Across a dataset of 10,000 test suite runs from 24 projects, researchers found:
- 75% of flaky tests belong to a cluster of tests that fail together
- Mean cluster size: 13.5 flaky tests sharing the same root cause
- Predominant causes: intermittent networking issues, unstable external dependencies, shared infrastructure problems
The practical implication: fixing one root cause can stabilize over a dozen tests at once.
Tip: When you find a flaky test, don't just fix that one test. Look at what else failed in the same run. TestDino's error grouping clusters similar failures automatically, so you can see when 15 "different" failures all trace to one broken API endpoint.
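If you want a rough, do-it-yourself approximation of that grouping, you can cluster failures from Playwright's JSON reporter output by a normalized error signature. This is only a sketch: it assumes a report written with npx playwright test --reporter=json > report.json, and the exact report shape can vary between Playwright versions.

```ts
import { readFileSync } from 'node:fs';

// Group failed tests by a normalized error signature so failures that share
// one root cause (e.g. a single broken endpoint) surface as a cluster.
type Clusters = Map<string, string[]>;

function normalize(message: string): string {
  // Strip volatile details (numbers, quoted selectors/URLs) so similar errors group together.
  return message.replace(/\d+/g, 'N').replace(/["'][^"']*["']/g, '"…"').slice(0, 120);
}

function collect(suite: any, clusters: Clusters): void {
  for (const child of suite.suites ?? []) collect(child, clusters);
  for (const spec of suite.specs ?? []) {
    for (const t of spec.tests ?? []) {
      for (const result of t.results ?? []) {
        if (result.status !== 'failed') continue;
        const message: string = result.errors?.[0]?.message ?? 'unknown error';
        const key = normalize(message);
        clusters.set(key, [...(clusters.get(key) ?? []), spec.title]);
      }
    }
  }
}

const report = JSON.parse(readFileSync('report.json', 'utf8'));
const clusters: Clusters = new Map();
for (const suite of report.suites ?? []) collect(suite, clusters);

// Largest clusters first: many tests sharing one signature hints at one shared root cause.
[...clusters.entries()]
  .sort((a, b) => b[1].length - a[1].length)
  .forEach(([signature, tests]) => console.log(`${tests.length}× ${signature}`));
```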

How TestDino Automates Failure Analysis
The patterns in this article are exactly what TestDino was built to detect and classify automatically. Instead of manually investigating each failure, TestDino's AI analyzes error messages, stack traces, and historical patterns to categorize every test failure:
- Actual Bug — Consistent failure across environments, pointing to a real product defect
- UI Change — Selector or layout changed after a DOM update, needs locator refresh
- Flaky Test — Intermittent failure from timing, network, or environment issues
- Miscellaneous — Setup issues, data problems, or CI configuration errors
Each classification comes with a confidence score and suggested next steps.
| Research Finding | TestDino Feature | What It Does |
|---|---|---|
| Async wait causes 45% of failures | AI Insights | Classifies timing-related failures with confidence scores |
| Timeouts are the silent killer | Analytics | Tracks duration trends per test case to catch slowdowns |
| 46.5% of flakiness is resource-related | Environment Mapping | Compares results across CI environments |
| Flaky tests cluster together | Debug Failures | Groups similar errors across tests and runs |
| 13% of CI builds fail from flakiness | Flaky Tests tracking | Flags flaky tests with flakiness percentage |
| JS has unique failure patterns | Annotations | Add owner, flaky reason, Slack alerts, and custom tags |
| Faster triage cuts developer cost | Test Explorer | Browse all test cases with full execution history |
What's New in TestDino for Failure Analysis
TestDino's recent releases added features directly relevant to the failure patterns we have discussed:
- Real-Time Streaming: Watch tests run live in the dashboard with shard-aware tracking. No more waiting for the full suite to complete before seeing failures. Powered by WebSockets with automatic polling fallback.
- Test Explorer: Browse and search all test cases across a project. Filter by status, tags, or spec files. See how each test behaved over time, including failure and flaky patterns.
- Enhanced GitHub integration: PR comments and status checks with AI-generated summaries posted directly to pull requests and commits.
- Code Coverage: Per-run coverage tab with statement, branch, function, and line metrics. Includes sharded-coverage merging and a coverage trend chart in Analytics.
- Custom Annotations: Owner, Note, Flaky Reason, Link, Notify Slack, and Metric tags for richer failure context and analytics.
- Slack App integration: Test run notifications with status tables and annotation-based Slack alerts, including private channel support.
For teams running sharded Playwright suites, TestDino provides shard-aware tracking to catch failures as they happen. Combined with CI Optimization (rerun only failed tests via npx tdpw last-failed), you can cut investigation time from hours to minutes.
What This Means for Your Team
Here are the highest-impact actions ranked by expected failure reduction:
- Audit your assertions first. Async wait causes 45% of UI failures. Replace manual DOM checks with Playwright's web-first assertions. This is the single highest-ROI fix.
- Track duration trends, not just pass/fail. Timeouts are failures waiting to happen. Monitor which tests are getting slower over time using test analytics.
- Standardize CI runner resources. Nearly half of flaky tests are resource-affected. Pin your CI container specs and compare results across environments.
- Fix clusters, not individual tests. 75% of flaky tests share a root cause. When one test flakes, investigate what else failed in the same run using error grouping.
- Classify before investigating. Don't spend 20 minutes debugging a flaky test that would have passed on retry. Use AI classification to separate actual bugs from noise.
That's the workflow TestDino was built around. AI classifies every failure the moment it happens. Error grouping surfaces the clusters. Analytics tracks whether your fixes actually moved the numbers. Instead of spending 6-8 hours a week on manual triage, your team spends that time shipping features.
The data is in the studies. The patterns are in your test suite. The only question is whether you're tracking them.