How Long Should E2E Tests Take? Performance Data from Real Teams

If your E2E suite takes 47 minutes, that’s not “normal”, it’s a release bottleneck. Here’s how to score your CI speed, see what healthy teams look like, and cut time without guessing.

Jashn Jain

Feb 24, 2026

How-Long-Should-E2E-Tests-Take_-Performance-Data-from-Real-Teams

Your E2E test suite takes 47 minutes to run. Is that normal?

Nobody on your team knows. The tests have been getting slower for months, but nobody can point to the moment it happened or the tests that caused it. Meanwhile, PRs queue up, engineers context-switch while they wait, and releases slip by half a day because someone has to rerun a flaky build.

This article gives you three things: a way to score your current CI/CD pipeline speed, a breakdown of what "normal" looks like based on data from millions of real workflows, and a decision framework for cutting total CI time without guessing where to start.

The goal isn't faster tests. It's a measurable time budget, a way to find what's burning it, and a system to keep it from creeping back.

Score Your Suite: A 60-Second Self-Check

Before reading any benchmarks, answer four questions about your E2E suite:

Question	Your Number	Green	Yellow	Red
Total CI time for E2E tests (minutes)	___	Under 10	10-30	Over 30
Suite size (test count)	___	Under 100	100-500	Over 500
Parallelism (number of CI workers)	___	4+	2-3	1
Flake rate (% of runs needing reruns)	___	Under 3%	3-10%	Over 10%

What your zone means:

Mostly green: Your suite is in good shape. Focus on keeping it there. Set alerts for when the duration creeps above your budget.

Mostly yellow: You have room to improve, and it's worth the investment. A 10-minute reduction in CI time pays back 30+ minutes of developer wait time per PR (including context-switch recovery). Prioritize based on the largest bucket, using the time budget model below.

Mostly red: Your test suite is actively slowing down releases. Engineers are likely ignoring failures or merging without waiting for CI to complete. This is costing you real shipping velocity. Read the rest of this article with urgency.

Note: These thresholds come from CircleCI's benchmark data across 28 million workflows and DORA research on elite-performing teams. They're not arbitrary. Teams that hit 5-10 minute workflow durations with 90%+ success rates consistently ship faster and recover from failures in under an hour.

The Time Budget Model: Where Your Minutes Actually Go

Most teams only measure one thing: how long the tests take to run. That's like budgeting for a road trip by only counting miles, ignoring gas stops, traffic, and the wrong turn you'll take in New Jersey.

Here's the real model:

Total CI time = Setup + Execution + Reruns + Debug

Each component eats time differently, and each has a different fix.

Setup time includes dependency installation, Docker container startup, browser downloads, and environment provisioning. For teams using containerized CI, this alone can add 5-7 minutes before the first test even runs.

Execution time is what most people think of as "test time." The time the actual tests take from the first click to the last assertion.

Rerun time is the hidden tax. When a flaky test fails, the pipeline either reruns automatically or a developer manually triggers a retry. Each rerun repeats the full execution cycle, sometimes the setup too.

Debug time happens after CI finishes. An engineer sees a red build, opens the logs, tries to figure out what failed and why. This is the most expensive component because it pulls a developer out of feature work and into investigation mode.

Tip: Track all four components separately, not just execution time. If you only optimize execution but your flake rate is 15%, you'll save 2 minutes on test speed and lose 20 minutes to reruns and debugging. The net effect is negative.

What "Normal" Looks Like: Benchmarks from Real Teams

Here's what the data says about E2E test execution time across real engineering organizations. Each benchmark below pairs a number with its meaning and what to do about it.

Benchmark 1: Total Workflow Duration

The CircleCI 2026 State of Software Delivery analyzed 28 million workflows across 22,000 organizations. Their recommended benchmark for workflow duration is 5-10 minutes.

Percentile	Workflow Duration	What It Means
Top 5%	Under 2 minutes	Elite. Highly parallelized, minimal setup, tight test suite
Top 25%	3-5 minutes	Strong. Tests are fast and focused on critical paths
Median	~11-12 minutes	Average. Room to optimize but not blocking releases
Bottom 25%	20-30+ minutes	Slow. Engineers context-switch during builds, PRs queue
Bottom 10%	45+ minutes	Critical. Suite is a release bottleneck

What to do if you're above 20 minutes: Don't start by rewriting tests. First, figure out which of the four budget components (setup, execution, reruns, debug) is the largest. That tells you where to cut.

What is context switch threshold?

Context-switch threshold is the CI duration above which engineers stop waiting and start working on something else. Research and practitioner experience put this around 20-25 minutes. Below that, a developer can do a shallow task (code review, Slack message) and come back. Above it, they start a new task, and recovering from the interruption when CI finishes costs 15-20 additional minutes of refocusing.

Benchmark 2: Individual Test Duration

A single E2E test should run between 5 seconds and 2 minutes. Anything longer usually means the test is doing too much.

Test Type	Expected Duration	Common Cause if Slow
Simple flow (login, page load)	5-15 seconds	Hard-coded waits instead of auto-wait
Medium flow (search, filter, form)	15-45 seconds	Unnecessary full-page reloads
Complex flow (checkout, multi-step)	45-120 seconds	No API shortcuts for setup/teardown
Over 2 minutes	Fix it	The test is doing too many things. Split it

Fastest lever: Replace sleep() and hard waits with framework-native waiting. Playwright's auto-wait and Cypress's retry-ability handle this automatically. For Selenium, use explicit WebDriverWait with conditions, not Thread.sleep().

Benchmark 3: Parallelization Impact

Running tests in parallel is the single biggest lever for reducing total E2E execution time.

Workers	Suite of 200 tests (serial baseline: 40 min)	Reduction
1	40 minutes	-
2	~22 minutes	45%
4	~12 minutes	70%
8	~7 minutes	82%

These are rough estimates assuming an even test distribution. In practice, results vary based on how well tests are split. If your longest test takes 3 minutes and all others take 30 seconds, 4 workers won't help much because one worker is bottlenecked.

What to do: Use duration-based sharding, not random splitting. Assign the longest tests first, then fill the remaining capacity. Playwright supports this natively with --shard and fullyParallel mode.

Benchmark 4: Recovery Time

When a build fails, how fast can your team get back to green?

CircleCI's 2026 data shows the median team takes 72 minutes to recover from a failed workflow, up 13% year over year. The top 5% recover in 1 minute 36 seconds. That's a 45x difference.

The math is brutal. A team pushing 5 changes per day with a 70% success rate (the current industry median for the main branch) experiences 1.5 failures per day. At 72 minutes recovery each, that's nearly 2 hours lost daily to getting back to green. Scale that across a year, and it equals roughly 250 hours of blocked deployments.

Fastest lever: Better failure diagnostics. If your engineer can see why a test failed in 30 seconds instead of 10 minutes of log-reading, your recovery time drops immediately.

Benchmark 5: Success Rate

Your main branch success rate should be above 90%. The current industry median has dropped to 70.8%, the lowest in five years, largely due to AI-generated code volume outpacing validation.

That means 3 out of 10 merges to main are failing. If even a third of those failures are flaky (not real bugs), your team is spending significant time chasing ghosts.

The Hidden Tax: Reruns and Flakiness

This is the section where the real cost lives.

Setup time you can cache. Execution time you can parallelize. But reruns are pure waste, and they're growing. The Bitrise Mobile Insights report found that the probability of hitting a flaky test rose from 10% in 2022 to 26% in 2025.

Here's what flakiness costs at scale:

Google's research found that flaky tests account for 4.56% of all test failures, costing over 2% of total coding time. For a team of 50 engineers, that's a full person-year lost annually. Atlassian reported that 15% of their Jira backend build failures came from flaky tests, costing over 150,000 hours of developer time per year. A separate five-year industrial study found that dealing with flaky tests consumed at least 2.5% of productive developer time.

The pattern is consistent: speed gains don't stick if flake rate stays high. You can cut your execution time in half, but if 10% of runs still need a rerun, you've just made the wasted cycle happen twice as fast.

Tip: Track flakiness by individual test, not just as a suite-wide percentage. A 5% flake rate might mean 5 tests that flake constantly, or 250 tests that flake rarely. The fix is completely different.

How TestDino Shows You Exactly Where Flakiness Lives

This is where most teams get stuck. You know you have flaky tests. You don't know which ones, how often they flake, or whether it's getting better or worse.

TestDino tracks flakiness at the individual test level across every CI run. Instead of a single pass/fail signal, you get a stability score per test that factors in how often it flips between passing and failing over the last 30, 60, or 90 days.

Here's what that looks like in practice:

Most Flaky Tests

What you need to see	Where TestDino shows it
Which tests flake the most	QA Dashboard ranks tests by flaky rate percentage with direct links to the latest run
Who owns the flaky tests	Developer Dashboard filters flaky tests by author so ownership is clear
Is flakiness trending up or down	Analytics Summary charts flaky rate over time. A rising line means your suite is getting less stable
What kind of flakiness is it	Each test is classified by root cause: Timing Related, Environment Dependent, Network Dependent, or Assertion Intermittent
Which environments cause flakes	Cross-Environment Comparison shows flaky rate per environment side by side
Should flaky tests block CI	CI Check modes: Strict (flaky = failure) for production branches, Neutral (flaky excluded) for development

The difference between "we have flaky tests" and "test X flakes 23% of the time on Chrome Linux runners in staging" is the difference between a vague concern and a fix you can ship this sprint.

The Decision Checklist: If This, Then That

You've scored your suite. You've seen the benchmarks. Now pick the right fix based on where your time is going.

If setup dominates (over 30% of total CI time):

Cache dependency installs between runs (npm, pip, browsers)
Use pre-built Docker images with browsers already installed
Move to faster CI runners (GitHub Actions large runners, CircleCI resource classes)

If execution dominates (tests themselves are slow):

Find the 10 slowest tests and optimize them first (80/20 rule)
Replace hard waits with framework-native auto-waiting
Use API calls instead of UI clicks for test setup
Parallelize with duration-based sharding
Split into smoke (every PR) and full regression (nightly)

If reruns dominate (high flake rate eating CI cycles):

Identify the top 10 flakiest tests and fix or quarantine them
- Track flake rate per test using TestDino's flaky test detection
Add retry logic carefully: retries mask the problem, visibility fixes it
Mock external API dependencies that cause timing failures

If debug dominates (engineers spend too long figuring out failures):

Add Playwright Trace Viewer or equivalent for full execution recordings
Use AI-powered failure classification to triage failures before a human looks
Post failure summaries to PRs automatically using GitHub integration

Tip: Pick one bucket. The biggest one. Fix that first. Don't try to optimize all four simultaneously. Teams that spread effort across setup, execution, flakiness, and debugging at the same time make slow progress on all of them and fast progress on none.

Setting a Time Budget That Sticks

A benchmark is useful once. A budget is useful forever.

Step 1: Measure your baseline. Run your E2E suite 10 times. Record setup, execution, reruns, and total wall clock time separately.

Step 2: Set a target. Based on the benchmarks above, 10-15 minutes total CI time is achievable for most teams with parallelization and basic optimization.

Step 3: Create an alert. Fire a CI notification when the total E2E duration exceeds your budget by 20%. This catches the creep before it becomes a crisis.

Step 4: Review monthly. Test suites grow. Without regular review, a 10-minute suite becomes 30 minutes in 6 months. TestDino's test case history shows duration trends per test, making it easy to spot which new tests are breaking your budget.

Conclusion

Your E2E suite doesn't need to be fast in the abstract. It needs to fit within a time budget that keeps your team shipping.

That budget has four components: setup, execution, reruns, and debug. Most teams only measure one. The others are where the real time goes.

Measure all four. Fix the biggest. Keep watching.

FAQs

How long should a full E2E test suite take?

For a suite to fit within a healthy CI pipeline, aim for under 10-15 minutes of total wall-clock time with parallelization. The best-performing teams (top 25%) complete their full CI workflows in 3-5 minutes. Individual tests should range from 5 seconds (simple flows) to 2 minutes (complex multi-step scenarios).

What's a normal flake rate for E2E tests?

A healthy flake rate is under 3%. The Bitrise Mobile Insights report found that 26% of test runs encountered at least one flaky test in 2025, up from 10% in 2022. Google's internal research found that flaky tests account for 4.56% of all test failures. If your flake rate is above 10%, it's actively costing you shipping velocity and should be your first optimization target.

How many CI workers should I use for E2E tests?

Four workers is a good starting point for suites over 50 tests. Going from 1 to 4 workers typically cuts total execution time by 70%. Beyond 8 workers, the gains diminish unless your suite is very large. Use duration-based sharding to distribute tests evenly across workers.

Does faster test execution actually speed up releases?

Only if execution is your bottleneck. If 40% of your CI time goes to reruns and debugging (common in teams with high flake rates), faster test execution just means you hit the flakiness wall sooner. The time budget model helps you identify which component to fix for actual release speed improvements.

How do I know if my test suite is getting slower over time?

Track median and P95 duration for your CI workflow weekly. TestDino's test case history shows per-test duration trends, making it easy to spot individual tests that are getting slower. Set an alert when your total CI time exceeds your budget by 20% so you catch creep early.

Jashn Jain

Product & Growth Engineer

Jashn is a Product and Growth Engineer at TestDino with 1+ years of experience in automation strategy, technical marketing, and applied AI/ML research. She specializes in Playwright automation, developer tooling, and AI-driven testing workflows, turning complex automation concepts into clear, actionable resources for engineering teams.

At TestDino, she bridges product development and customer success, helping engineering teams get the most out of the platform through developer-first education and thoughtful positioning. She collaborates closely with the tech team on automation strategy while nurturing a growing community of practitioners.

Jashn's research includes a presentation at ICICV 2025 (CNR NANOTEC, University of Calabria, Italy) on explainability in ethnicity-altered synthetic images. She also has a publication in ACL Anthology at RANLP 2025 evaluating multimodal LLM performance on face recognition, age estimation, and gender classification.

View all posts →

Table of content

Flaky tests killing your velocity?

TestDino auto-detects flakiness, categorizes root causes, tracks patterns over time.

See Your Flakiest Tests

How Long Should E2E Tests Take? Performance Data from Real Teams

Score Your Suite: A 60-Second Self-Check

The Time Budget Model: Where Your Minutes Actually Go