Nothing interrupts a CI/CD pipeline quite like an intermittent test failure. Over time, these "flaky" tests erode confidence in automation and become a drag on velocity.
Industry data confirms the pain: a 2023 survey found that flaky tests account for nearly 5% of all test failures, costing organizations up to 2% of total development time each month [1]. When tests that once guarded quality instead generate noise, developers learn to ignore failures, and genuine defects can slip through unnoticed.
What if your QA system could do more than just report failures -- actually analyzing patterns, identifying likely causes, and recommending targeted fixes? In this article, I'll show you how to build a lightweight AI-driven assistant that calculates flaky-test rates, leverages a large language model to diagnose root causes, and delivers actionable remediation suggestions -- all without touching your codebase.
The Vision: From Failure Logs to Actionable Fixes
Imagine a QA workflow where every failed test isn't just a red mark but a trigger for insight. I wanted more than a dashboard full of failures -- I needed an assistant that could sift through test history, quantify flakiness, and translate cryptic errors into clear next steps.
Here's the goal in a nutshell: read the test run history, quantify how often each test flakes, hand the failing context to an LLM for diagnosis, and surface a concrete fix suggestion in the dashboard.
This flow transforms passive test results into a proactive QA companion, reducing the manual toil of failure analysis and helping me fix flaky tests faster.
Architecture Breakdown
To bring this vision to life, I built a simple, extensible pipeline using open-source and low-code components:
1. Reading Test History (n8n)
I use n8n to read my results.json -- a time-ordered list of test runs -- via a "Read Files" node.
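For reference, here's a hypothetical shape for results.json; the field names (name, status, error, steps) are assumptions for illustration, since the real schema depends on how your test reporter writes results:

```json
[
  {
    "name": "checkout shows confirmation",
    "timestamp": "2024-05-01T10:15:00Z",
    "status": "failed",
    "error": "TimeoutError: waiting for selector '#confirmation' failed",
    "steps": [
      "await page.goto('/checkout');",
      "await page.click('#place-order');",
      "await expect(page.locator('#confirmation')).toContainText('Thank you');"
    ]
  },
  {
    "name": "checkout shows confirmation",
    "timestamp": "2024-05-01T11:02:00Z",
    "status": "passed"
  }
]
```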
2. Calculating Flaky Rate (n8n Code Node)
A JavaScript snippet slices the last 10 runs and computes the ratio of failed runs to total runs in that window -- the flaky rate (the full calculation is shown in the next section).
If flakyRate > 0.3, the flow flags the test as flaky.
3. AI Diagnosis (LLM Chain)
The flow then packages the error message, flaky rate, and test steps into a JSON prompt for the LLM. I've configured it to return a JSON object with keys like analysis, fix_suggestion, and confidence.
4. Saving Recommendations (n8n Write File)
The LLM output is formatted into markdown and written to recommendation.md.
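As a rough sketch of that formatting step (the keys mirror the JSON structure described above; the exact wiring into the Write File node depends on your n8n setup):

```javascript
// n8n Code node: turn the LLM's structured output into markdown for recommendation.md.
// Assumes the LLM node returned { analysis, fix_suggestion, confidence } as parsed JSON.
const result = $input.first().json;

const markdown = [
  '# Flaky Test Recommendation',
  '',
  '## Analysis',
  result.analysis,
  '',
  '## Suggested Fix',
  typeof result.fix_suggestion === 'string'
    ? result.fix_suggestion
    : JSON.stringify(result.fix_suggestion, null, 2),
  '',
  `_Confidence: ${result.confidence ?? 'n/a'}_`,
].join('\n');

// Hand the markdown string to the next node, which writes recommendation.md.
return [{ json: { markdown } }];
```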
5. Dashboard Integration (Flask)
A small Flask app serves two routes: one that renders the test-results dashboard, and /recommendation, which renders the saved analysis.
Detecting Flakiness: The Numbers
Flakiness is inherently a numbers game -- without clear metrics, you're just guessing which tests need attention. I settled on a sliding window of the last 10 runs because it strikes a balance between recency and statistical significance. Here's how I compute it:
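A minimal JavaScript sketch of that sliding-window calculation, assuming an n8n Code node running once for all items and a status field like the one in the results.json example above (your field names may differ):

```javascript
// n8n Code node: compute the flaky rate for one test over the last 10 runs.
// Assumes the previous node emits one item per run, oldest first, each carrying
// a `status` field ("passed" / "failed") in its JSON payload.
const WINDOW_SIZE = 10;
const FLAKY_THRESHOLD = 0.3;

const runs = $input.all().map((item) => item.json);
const recentRuns = runs.slice(-WINDOW_SIZE);
const failures = recentRuns.filter((run) => run.status === 'failed').length;
const flakyRate = recentRuns.length > 0 ? failures / recentRuns.length : 0;

// Pass the computed values along to the next node in the flow.
return [{ json: { flakyRate, isFlaky: flakyRate > FLAKY_THRESHOLD, failures } }];
```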
If the flakyRate exceeds 30%, I classify the test as flaky. This threshold isn't arbitrary -- Jeff Morgan's QA Journal research recommends 25-35% as a practical cutoff for intermittent failures in UI pipelines, balancing sensitivity and noise reduction [3].
AI-Powered Diagnosis and Recommendation
Once a test is flagged as flaky, I want more than just a pass/fail -- I need insight. That's where the LLM chain in n8n comes in. I bundle together the test's metadata (name, timestamp), the computed flakyRate, raw error messages, and the exact Playwright steps that failed, and send it as one JSON payload. By including both context (the last-known test steps) and quantitative data (failure ratio), I give the model everything it needs to pinpoint the root cause.
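To make that concrete, the payload I assemble looks roughly like this -- illustrative field names, not the exact schema from my flow:

```json
{
  "test_name": "checkout shows confirmation",
  "timestamp": "2024-05-01T10:15:00Z",
  "flaky_rate": 0.4,
  "error": "TimeoutError: waiting for selector '#confirmation' failed",
  "steps": [
    "await page.goto('/checkout');",
    "await page.click('#place-order');",
    "await expect(page.locator('#confirmation')).toContainText('Thank you');"
  ]
}
```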
In practice, my LLM system prompt looks like this:
"You are a QA expert. Given a test's failure logs, flaky rate, and test steps, return a JSON object with an 'analysis' section explaining why the test fails, and a 'fix_suggestion' section that shows the original failing line, the new code lines to insert, and a brief explanation."
The result is powerful: rather than digging into long stack traces, I see instantly that a missing waitForSelector is the culprit, along with the exact code snippet to insert. Confidence scores in the output let me decide when to trust the AI's recommendation or fall back to manual debugging. Over dozens of flaky tests, this approach has cut my diagnosis time from hours to minutes, and the consistent format allows easy tracking of "fix adoption" over time.
From JSON to Dashboard: Delivering Developer-Facing Fixes
Once the LLM has generated its structured diagnosis and fix suggestion, the final step is surfacing this insight in a way that developers and QA engineers can actually act on. I chose to keep things simple and transparent by writing the LLM output to a local Markdown file (recommendation.md). This file is then rendered by the existing Flask dashboard under a new route: /recommendation.
This addition required only a lightweight extension to the original dashboard logic. Now, when a test fails and is deemed flaky, the LLM's recommendation is saved, and a new "Show" button appears in the dashboard's table row. Clicking it opens the dedicated analysis page, showing:
* A plain-language explanation of why the test likely failed
* A precise Playwright code snippet highlighting the issue
* A replacement suggestion or fix, e.g., using waitForSelector() or expect(locator).toBeVisible() (see the sketch after this list)
* A confidence score or explanation when appropriate
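To give a sense of what those suggestions look like in code, here's a hypothetical before/after for a timing-related flake; the test name and selectors are invented for illustration, and it assumes a configured baseURL in the Playwright project:

```javascript
const { test, expect } = require('@playwright/test');

test('checkout shows confirmation', async ({ page }) => {
  await page.goto('/checkout');
  await page.click('#place-order');

  // Before (flaky): textContent() resolves as soon as the element is attached,
  // even if the confirmation text has not rendered yet, and the plain
  // toContain() assertion does not retry.
  // const message = await page.locator('#confirmation').textContent();
  // expect(message).toContain('Thank you');

  // After (suggested fix): wait for the element, then use a retrying assertion.
  await page.waitForSelector('#confirmation', { state: 'visible' });
  await expect(page.locator('#confirmation')).toContainText('Thank you');
});
```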
A view of the /recommendation page shows the LLM analysis, the identified flaky test line, and the suggested code fix.
This format proved far more usable than logs or stack traces alone. It turns debugging into guided troubleshooting. More importantly, it closes the feedback loop: engineers no longer just know what broke -- they get a strong hint for how to fix it.
Limitations and What I Plan to Build Next
While the flaky test recommendation system has already made debugging more efficient and less frustrating, it's far from a silver bullet. There are several limitations -- both technical and practical -- that I encountered while building and using it.
Not All Failures Are Diagnosable
The LLM performs well on common failure patterns -- missing waits, incorrect selectors, timing issues -- but it struggles with edge cases. There, it can still offer educated guesses, but it's better treated as a supportive assistant than a source of truth.
Subtle Test Bugs Still Slip Through
Some flaky tests don't leave obvious traces in logs -- they might pass 4 out of 5 times, or fail under specific screen sizes or device emulations. While I experimented with flaky rate thresholds to identify these, a more robust solution would involve test repetition, trend analysis, or even machine learning-based pattern detection [4].
For now, the system relies on surface-level signals (e.g., number of recent failures), which are useful but not infallible.
Conclusion: Toward Smarter, More Resilient QA
Test flakiness is more than just a nuisance -- it's a silent killer of confidence, developer time, and continuous delivery momentum. For years, I treated flaky failures as inevitable artifacts of UI testing. But building this system helped me rethink that assumption.
By layering an AI assistant on top of my existing test workflow, I now get contextual feedback the moment a test fails. Instead of wasting time parsing stack traces, I get actionable suggestions tailored to the specific error. This doesn't eliminate all flakiness, but it gives me a fighting chance to address it faster and smarter.
The best part? It didn't require rewriting my whole QA strategy. I built this on open-source tools I already used -- Playwright, Flask, n8n -- and added a lightweight LLM layer for analysis. It was the smallest change with the biggest leverage.
Even in its current state, this experiment proved a powerful point: AI can make QA feel less like firefighting and more like flow. It can turn a frustrating red failure into a teachable moment -- and turn brittle tests into self-improving systems.
If you're drowning in test failures or tired of debugging the same issues repeatedly, I'd encourage you to start small. Add just one layer of feedback -- one diagnosis step -- and see what clarity that brings.
Sometimes, the smartest thing your QA system can do isn't just run tests -- it's tell you why they failed, and what to do next.