Flaky Tests Are Killing Your CI: How to Find and Fix Them for Good

A test suite nobody trusts is worse than no suite at all. Here is a systematic approach to detecting, diagnosing, and eliminating flaky tests in Playwright and CI pipelines.

Fixing flaky tests in CI pipelines illustration
7 min read

Every engineering team knows the ritual: CI fails, someone comments "just re-run it," the second run passes, and everyone moves on. Each re-run erodes trust in the suite — until the day a real regression gets re-run into production because failures had become background noise.

Flakiness is not random bad luck. It has a small number of root causes, and each one has a systematic fix.

The four root causes of flaky tests

Race conditions dominate: the test asserts before the app has finished reacting, whether that is a fetch still in flight, an animation mid-transition, or a debounced input not yet settled. Hard-coded waits, such as sleep(2000) or Playwright's page.waitForTimeout(2000), are the classic symptom, and they fail the moment CI runs slower than a dev laptop.
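To make the failure mode concrete, here is a minimal sketch that simulates a slow app and compares a fixed wait with condition-based polling. Everything in it (startSlowSave, checkWithFixedSleep, pollUntil, checkWithPolling, and the timings) is a hypothetical stand-in written for this illustration, not Playwright's implementation.

```typescript
// Simulated app: the "saved" flag flips after a delay, standing in for
// a fetch or animation that finishes late on a slow CI machine.
function startSlowSave(delayMs: number): { isSaved: () => boolean } {
  let saved = false;
  setTimeout(() => {
    saved = true;
  }, delayMs);
  return { isSaved: () => saved };
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Brittle: passes only if the app happens to finish within the fixed wait.
async function checkWithFixedSleep(delayMs: number, waitMs: number): Promise<boolean> {
  const app = startSlowSave(delayMs);
  await sleep(waitMs);
  return app.isSaved();
}

// Robust: re-check the condition until it holds or the deadline passes.
async function pollUntil(
  condition: () => boolean,
  timeoutMs: number,
  intervalMs = 10,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (condition()) return true;
    await sleep(intervalMs);
  }
  return condition(); // one final check at the deadline
}

async function checkWithPolling(delayMs: number, timeoutMs: number): Promise<boolean> {
  const app = startSlowSave(delayMs);
  return pollUntil(app.isSaved, timeoutMs);
}
```

With a save that takes 200 ms, checkWithFixedSleep(200, 20) reports failure because the wait was tuned for a faster machine, while checkWithPolling(200, 2000) succeeds as soon as the flag flips and only waits as long as it needs to.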

The other three: shared state between tests (one test's data pollutes the next), external dependencies (third-party APIs, real clocks, real networks), and order dependence (tests that only pass when run after another test). Parallel execution exposes all three brutally.

Diagnose with data, not vibes

Track pass/fail history per test across CI runs. As a working threshold, treat any test that fails more than 2% of runs without a related code change as flaky. Playwright's trace viewer shows the DOM snapshots, network requests, and console output for each step leading up to the failure. To reproduce the race locally, run the suspect test 50 times with --repeat-each=50.
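The 2% rule can be sketched as a small function over CI history. The TestRun shape, the relatedChange flag, and the minRuns floor are assumptions made for this example; a real pipeline would populate them from its own CI data.

```typescript
// One CI result for one test. `relatedChange` (an assumed field) marks runs
// where the code under test was modified, so a failure may be a real regression.
interface TestRun {
  testId: string;
  passed: boolean;
  relatedChange: boolean;
}

interface FlakeReport {
  testId: string;
  runs: number;
  failures: number;
  failureRate: number;
}

// Returns tests whose failure rate on unrelated-change runs exceeds the
// threshold, worst first. `minRuns` (an assumption, not from the article)
// avoids flagging a test on a handful of samples.
function findFlakyTests(
  history: TestRun[],
  threshold = 0.02,
  minRuns = 50,
): FlakeReport[] {
  const stats = new Map<string, { runs: number; failures: number }>();
  for (const run of history) {
    if (run.relatedChange) continue; // skip runs that may be genuine regressions
    const entry = stats.get(run.testId) ?? { runs: 0, failures: 0 };
    entry.runs += 1;
    if (!run.passed) entry.failures += 1;
    stats.set(run.testId, entry);
  }

  const reports: FlakeReport[] = [];
  stats.forEach(({ runs, failures }, testId) => {
    const failureRate = failures / runs;
    if (runs >= minRuns && failureRate > threshold) {
      reports.push({ testId, runs, failures, failureRate });
    }
  });
  return reports.sort((a, b) => b.failureRate - a.failureRate);
}
```

Excluding runs with a related change keeps genuine regressions out of the flake statistics, which is the distinction the threshold depends on.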

At QaLock, flaky-test triage is part of every maintenance retainer: we quarantine suspect tests the day they first flake, diagnose from traces, and return them to the blocking suite only after 100 consecutive green runs.
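One way to encode a release rule like this is to count the current green streak at the end of a test's run history. The function names and the boolean-array input are hypothetical and only illustrate the idea.

```typescript
// Results for one quarantined test, oldest first; true means the run was green.
function consecutiveGreenRuns(results: boolean[]): number {
  let streak = 0;
  // Walk backwards from the most recent run until the first failure.
  for (let i = results.length - 1; i >= 0 && results[i]; i--) {
    streak++;
  }
  return streak;
}

// A test leaves quarantine only once its current streak reaches the bar.
function canLeaveQuarantine(results: boolean[], requiredStreak = 100): boolean {
  return consecutiveGreenRuns(results) >= requiredStreak;
}
```

A single red run resets the streak to zero, so a test that still flakes occasionally stays out of the blocking suite.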

Fix patterns that actually hold

Replace every fixed sleep with a web-first assertion: when expect() is given a locator, Playwright retries the check until the condition holds or the assertion times out (5 seconds by default). Isolate test data per run with unique identifiers, so parallel workers never collide. Mock external services at the network layer for deterministic responses, and reserve live-integration checks for a small nightly suite that does not block merges.
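A minimal sketch of the three patterns in one Playwright test might look like the following. The page under test, its labels and test ID, the /checkout path, and the shipping-rates endpoint are all hypothetical; only the Playwright calls themselves are real API.

```typescript
import { test, expect } from '@playwright/test';
import { randomUUID } from 'node:crypto';

test('new customer sees a shipping quote', async ({ page }) => {
  // Unique per-run data: parallel workers never create the same customer.
  const email = `customer-${randomUUID()}@example.test`;

  // Stub the third-party rates API at the network layer so the response
  // is deterministic (the URL pattern is a hypothetical endpoint).
  await page.route('**/api/shipping-rates*', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ carrier: 'TestCarrier', price: 4.99 }),
    });
  });

  // Assumes `baseURL` is set in the Playwright config.
  await page.goto('/checkout');
  await page.getByLabel('Email').fill(email);
  await page.getByRole('button', { name: 'Get quote' }).click();

  // No fixed sleep: the assertion re-checks the locator until it matches
  // or the expect timeout elapses.
  await expect(page.getByTestId('shipping-quote')).toHaveText(/4\.99/);
});
```

Because the stubbed response and the generated email are the only inputs, the test should behave the same on a laptop and on a loaded CI worker.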

The payoff is compounding: a suite the team trusts gets run more, catches more, and earns its place as a release gate instead of a rubber stamp.
