Your GenAI pilot worked beautifully in the demo. The chatbot answered questions, the document summariser cut review time, the sales assistant drafted emails that sounded human. Leadership was impressed. Budget was approved. And then — nothing happened. The pilot stayed a pilot. Six months later, the project is quietly shelved, the team is reassigned, and the only lasting artefact is a slide deck nobody opens.
If this sounds familiar, you are not alone. MIT’s 2025 GenAI Divide report found that 95% of enterprise generative AI pilots fail to deliver return on investment. Gartner pegs the failure rate at roughly 80% for projects that reach production but miss their business targets. NTT DATA’s global survey puts it between 70% and 85%. The numbers vary, but the pattern does not: most GenAI initiatives die somewhere between a promising proof-of-concept and a working production system.
This post is not another list of “AI trends to watch.” It is a diagnostic. We break down the seven failure modes we see most often across mid-market and enterprise engagements in Singapore, Hong Kong, the Middle East, and India — and for each one, we describe what a fix actually looks like in practice. If you are a CTO, VP of Engineering, or Head of Digital who has a stalled GenAI initiative, this is for you.
The Real Scale of the Problem — and Why It Matters Now
In 2025, global enterprises invested an estimated USD 684 billion in AI initiatives. By year-end, more than USD 547 billion of that, roughly 80%, had failed to deliver intended business value. One-third of projects were abandoned before reaching production. Another 28% reached completion but missed their ROI targets. Only about one in five AI initiatives achieved or exceeded their business objectives.
The gap is not closing. If anything, it is widening. As models improve and costs drop, more teams are launching pilots — but the organisational, data, and engineering problems that cause failure remain unchanged. The result is what analysts now call “pilot purgatory”: organisations flooded with GenAI experiments but lacking the structure, alignment, or prioritisation to operationalise any of them.
For enterprises in regulated markets — banking in Singapore, insurance in the UAE, healthcare in India — the stakes are higher still. A failed GenAI pilot does not just waste budget. It erodes trust in the technology, making it harder to get the next initiative approved. And it creates a growing gap between your organisation and competitors who are already shipping.
Why GenAI Pilots Actually Fail: The Seven Failure Modes
After reviewing dozens of stalled GenAI projects across industries, we have identified seven failure modes that recur with striking consistency. Most failed pilots exhibit three or four of these simultaneously — which is why fixing just one rarely unsticks a project.
1. Starting with the Technology, Not the Problem
The single most common failure mode. A team hears about GPT-4, Claude, or Gemini and asks, “What can we do with this?” instead of asking, “What business problem costs us the most, and could GenAI solve it better than what we have?” The result is a pilot that is technically interesting but commercially irrelevant — a chatbot nobody asked for, a summariser that saves five minutes a week, a code assistant that developers refuse to use because it does not fit their workflow.
The fix: Start with a process audit, not a technology evaluation. Identify the three to five workflows where manual effort, error rates, or cycle times are highest. Map each to a GenAI capability (generation, summarisation, extraction, classification, conversational interface). Only then choose a model and architecture. The best pilots we have seen start with a sentence like “Our compliance team spends 14 hours per week manually reviewing trade confirmations” — not “We want to build an AI agent.”
2. Treating Data as an Afterthought
Gartner’s research shows that 85% of AI projects fail due to poor data quality or a lack of relevant data. Yet in most pilot proposals we review, the data strategy section is a single paragraph that says something like “We will use our existing data lake.” The pilot team builds a beautiful RAG pipeline, points it at the data lake, and discovers that the documents are inconsistent, incomplete, duplicated, or structured in ways the retrieval layer cannot parse.
The fix: Treat data preparation as phase one of the project, not a prerequisite someone else handles. Budget 30–40% of your pilot timeline for data audit, cleaning, and structuring. For document-heavy use cases (compliance, legal, knowledge management), invest in a proper chunking and indexing strategy before you write a single prompt. The data work is not glamorous, but it is where the ROI lives.
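As a rough illustration of what the first pass of a data audit can look like, the sketch below flags near-empty documents and exact duplicates in a corpus. It assumes the documents are already extracted to plain text and held in a dictionary keyed by ID; the function names, the `min_chars` threshold, and the normalisation rule are our own illustrative choices, not a standard.

```python
import hashlib
from collections import defaultdict


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())


def audit_documents(docs: dict[str, str], min_chars: int = 50) -> dict:
    """Report near-empty documents and groups of exact duplicates (after normalisation)."""
    by_hash = defaultdict(list)
    too_short = []
    for doc_id, text in docs.items():
        clean = normalise(text)
        if len(clean) < min_chars:  # hypothetical cut-off for "effectively empty"
            too_short.append(doc_id)
            continue
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        by_hash[digest].append(doc_id)
    duplicate_groups = [sorted(ids) for ids in by_hash.values() if len(ids) > 1]
    return {
        "total": len(docs),
        "too_short": sorted(too_short),
        "duplicate_groups": duplicate_groups,
    }
```

A real audit would also need to cover format inconsistencies, staleness, and near-duplicates, which exact hashing does not catch. Still, even a check this simple tends to surface problems before they reach the retrieval layer.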
3. Optimising for Demo Day Instead of Day 200
Pilots are often designed to impress a steering committee, not to survive contact with real users at scale. The demo uses a curated dataset of 50 documents. The production environment has 50,000 documents in 14 formats across three languages. The demo handles 10 queries per minute. Production needs to handle 500. The demo runs on a single API key with no authentication, rate limiting, logging, or error handling.
The fix: Define production requirements from day one, even if the pilot only implements a subset. Write down the target latency, throughput, data volume, user count, and availability. Build the pilot on infrastructure that can scale — not on a Jupyter notebook connected to your personal OpenAI account. This does not mean over-engineering. It means making deliberate choices about what you defer versus what you build to last.
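One lightweight way to write the requirements down is as a small typed record that the pilot can be checked against. The sketch below is illustrative only: the field names are ours, and in the usage example only the throughput and document counts echo the figures above, while the latency, user, and availability numbers are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionTargets:
    p95_latency_ms: float
    queries_per_minute: int
    document_count: int
    user_count: int
    availability_pct: float


def gaps(targets: ProductionTargets, measured: ProductionTargets) -> list[str]:
    """Return the names of the targets that the measured system does not yet meet."""
    missing = []
    # Latency is "lower is better"; every other field is "higher is better".
    if measured.p95_latency_ms > targets.p95_latency_ms:
        missing.append("p95_latency_ms")
    for name in ("queries_per_minute", "document_count", "user_count", "availability_pct"):
        if getattr(measured, name) < getattr(targets, name):
            missing.append(name)
    return missing


# Placeholder values: only 500 qpm / 50,000 documents and 10 qpm / 50 documents
# come from the text; the rest are assumptions for illustration.
targets = ProductionTargets(2000, 500, 50_000, 1_000, 99.5)
demo = ProductionTargets(1500, 10, 50, 20, 95.0)
print(gaps(targets, demo))
```

Keeping the gap list visible in every steering update makes the deferred work explicit, so the gaps do not come as a surprise at launch.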
4. Underestimating Infrastructure and Operational Costs
The API call that costs USD 0.03 in a demo costs USD 3,000 per day at a production volume of 100,000 calls per day. Retrieved data context can represent 50–65% of total query token costs in many GenAI workloads. A pilot that looked economically viable at 100 users becomes ruinous at 10,000. Most organisations discover this after launch, not before. By then, the budget conversation is politically difficult because leadership has already announced the rollout.
The fix: Model your cost curve before you build. Calculate token costs at 10x, 100x, and 1,000x your pilot volume. Factor in embedding storage, vector database hosting, model inference (especially if you are using GPU compute for fine-tuned models), monitoring infrastructure, and the human cost of prompt maintenance and evaluation. If the unit economics do not work at scale, redesign the architecture — smaller models, fewer retrieval steps, caching, or a different approach entirely — before you ship the pilot.
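A back-of-the-envelope version of this cost curve fits in a few lines. The sketch below covers token costs only and assumes retrieved context is billed as input tokens. The token counts and per-1,000-token prices are placeholders, not any vendor's actual pricing, and embedding storage, hosting, monitoring, and people costs would need to be added on top.

```python
def cost_per_query_usd(prompt_tokens, retrieved_tokens, output_tokens,
                       usd_per_1k_input, usd_per_1k_output):
    """Token cost of one query; retrieved context is billed as input tokens."""
    input_cost = (prompt_tokens + retrieved_tokens) / 1000 * usd_per_1k_input
    output_cost = output_tokens / 1000 * usd_per_1k_output
    return input_cost + output_cost


def retrieval_share(prompt_tokens, retrieved_tokens, output_tokens,
                    usd_per_1k_input, usd_per_1k_output):
    """Fraction of the per-query token cost attributable to retrieved context."""
    total = cost_per_query_usd(prompt_tokens, retrieved_tokens, output_tokens,
                               usd_per_1k_input, usd_per_1k_output)
    return (retrieved_tokens / 1000 * usd_per_1k_input) / total


def cost_curve(pilot_queries_per_day, per_query_usd, multipliers=(1, 10, 100, 1000)):
    """Daily token cost at each multiple of the pilot's query volume."""
    return {m: pilot_queries_per_day * m * per_query_usd for m in multipliers}


# Placeholder figures for illustration only.
per_query = cost_per_query_usd(500, 2000, 500, 0.01, 0.03)
for multiplier, usd in cost_curve(100, per_query).items():
    print(f"{multiplier:>5}x pilot volume: USD {usd:,.2f} per day")
```

Running the same function with a smaller model's prices or a reduced `retrieved_tokens` value gives a quick read on which architectural change moves the curve most.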
5. No Evaluation Framework Beyond “Looks Good”
Most pilot teams evaluate their GenAI system by reading a few outputs and deciding they “look good.” There are no automated evaluation metrics, no ground-truth datasets, no A/B tests, no user-satisfaction tracking. When the pilot moves toward production and outputs start degrading (because the data changes, the model updates, or the prompts drift), there is no way to detect or quantify the regression.
The fix: Build evaluation into the system from week one. Define three to five metrics that matter for your use case: accuracy against a labelled test set, latency percentiles (p50, p95, p99), user satisfaction scores, task completion rates, hallucination frequency. Create a golden dataset of 100–500 question-answer pairs that represents the production distribution. Run automated evals on every prompt change, model update, or data refresh. This is not optional tooling — it is the difference between a system you can trust and a system you are guessing about.
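To make this concrete, here is a minimal sketch of an evaluation harness that scores a system against a golden dataset and reports latency percentiles. It assumes the system is exposed as a plain function from question to answer (`answer_fn` is a hypothetical name) and uses exact-match scoring, which is a simplification: free-text answers usually need a more forgiving grader. The regression threshold is also an assumption.

```python
import math
import time
from typing import Callable


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(answer_fn: Callable[[str], str], golden: list[tuple[str, str]]) -> dict:
    """Run every golden question through the system and collect accuracy and latency."""
    correct = 0
    latencies_ms = []
    for question, expected in golden:
        start = time.perf_counter()
        got = answer_fn(question)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        # Exact match after trimming and lowercasing: a deliberately strict grader.
        if got.strip().lower() == expected.strip().lower():
            correct += 1
    return {
        "accuracy": correct / len(golden),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


def regressed(current: dict, baseline: dict, max_accuracy_drop: float = 0.02) -> bool:
    """Flag a change if accuracy fell by more than the allowed margin."""
    return baseline["accuracy"] - current["accuracy"] > max_accuracy_drop
```

Wiring `evaluate` and `regressed` into the same pipeline that ships prompt changes is what turns the golden dataset from a one-off exercise into a guardrail.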
6. Ignoring Organisational Change Management
A technically flawless GenAI system fails if the people who are supposed to use it do not change their behaviour. The compliance analyst who has reviewed documents manually for 15 years is not going to trust an AI summary overnight. The sales team that already has a CRM workflow is not going to adopt a separate AI assistant that does not integrate with their tools. MIT’s research found that 84% of GenAI failures are leadership-driven: 73% lack clear metrics, 68% underinvest in foundations, and 56% lose C-suite sponsorship within six months.
The fix: Assign a change management lead to every GenAI initiative — someone whose job is adoption, not technology. Start with a small group of enthusiastic early adopters (five to ten people) and let them shape the workflow. Integrate the AI into existing tools (Slack, email, CRM, internal portals) rather than asking users to learn a new interface. Measure adoption weekly, not quarterly. And get executive sponsorship that lasts beyond the announcement — a named executive who reviews progress monthly and removes blockers.
7. Building Instead of Buying (or Buying Instead of Building)
Research from Pertama Partners shows that purchasing AI tools from specialised vendors and building partnerships succeeds about 67% of the time, while internal builds succeed only one-third as often. Yet many engineering teams default to building custom solutions for problems that off-the-shelf tools solve well — burning months on prompt engineering and RAG infrastructure when a commercial product would have worked in weeks. The reverse is also true: some organisations buy generic tools and then spend months trying to customise them into something that fits their workflow, when a targeted build would have been faster.
The fix: Use a simple decision framework. If your use case is common (customer support chatbot, document search, meeting summarisation), evaluate commercial tools first — the build-vs-buy threshold is higher than most engineering teams think. If your use case involves proprietary data, proprietary workflows, or regulatory requirements that commercial tools cannot meet, build — but build on top of proven frameworks (LangChain, LlamaIndex, Haystack) rather than from scratch. The sweet spot for most enterprises is a commercial platform with custom integrations, not a fully custom stack.
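The decision framework above can be written down as a few explicit rules, which is a useful exercise if only because it forces the team to answer each question honestly. The sketch below is one possible encoding, not a definitive one. The field names are ours, and two choices are assumptions: proprietary constraints take precedence when a use case is both common and constrained, and anything else falls back to a commercial platform with custom integrations.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    is_common: bool               # e.g. support chatbot, document search
    proprietary_data: bool
    proprietary_workflow: bool
    unmet_regulatory_needs: bool  # requirements commercial tools cannot meet


def recommend(use_case: UseCase) -> str:
    """Map a use case to a starting recommendation."""
    # Assumption: proprietary or regulatory constraints outrank "common use case".
    if (use_case.proprietary_data
            or use_case.proprietary_workflow
            or use_case.unmet_regulatory_needs):
        return "build on a proven framework"
    if use_case.is_common:
        return "evaluate commercial tools first"
    # Assumption: the default for everything else.
    return "commercial platform with custom integrations"


print(recommend(UseCase(is_common=True, proprietary_data=False,
                        proprietary_workflow=False, unmet_regulatory_needs=False)))
```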
The Anatomy of the 5% That Succeed
If 95% of GenAI pilots fail, what do the 5% that succeed have in common? Based on our work with enterprises that have moved from pilot to production, the pattern is consistent:
They started with a specific, measurable business problem — not a technology. They had executive sponsorship that lasted beyond the first quarter. They invested 30–40% of their timeline in data preparation. They defined evaluation metrics before writing the first prompt. They modelled costs at production scale before building the demo. They assigned a change management lead alongside the technical lead. And they chose the simplest architecture that could solve the problem, rather than the most impressive one.
None of this is revolutionary. That is precisely the point. GenAI pilot success does not require a breakthrough in model capability. It requires the same engineering discipline, business alignment, and organisational maturity that makes any technology initiative succeed — applied consistently to a technology that is uniquely good at producing impressive demos and uniquely bad at surviving contact with reality.
A Practical Recovery Playbook for Stalled Pilots
If you have a GenAI pilot that is stalled, shelved, or underperforming, here is a structured approach to diagnosing and recovering it — or deciding, deliberately, to kill it.
Step 1: Revalidate the Business Case (Week 1)
Go back to the original problem statement. Is it still the right problem? Has the business context changed? Can you quantify the cost of the problem in hours, dollars, or error rates? If you cannot articulate a clear, measurable business outcome, stop here. Kill the project and redirect the team.
Step 2: Audit the Data Foundation (Weeks 2–3)
Assess the data your pilot depends on. How complete is it? How current? How consistent? How accessible? If the data is fragmented across systems, inconsistent in format, or stale, fix the data pipeline before touching the model. This is where most recovered pilots find their biggest gains.
Step 3: Redesign for Production Economics (Weeks 3–4)
Model the cost at 10x and 100x current volume. If the economics do not work, explore architectural changes: smaller models (Haiku-class instead of Opus-class), aggressive caching, reduced retrieval depth, or a hybrid approach where GenAI handles the long tail and rules-based systems handle the high-volume cases.
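The hybrid approach combined with caching can be sketched as a small router: known high-volume questions are answered from a rules table, and everything else goes to the model once and is cached afterwards. This is a simplified illustration. `HybridRouter`, the exact-match rule lookup, and the unbounded in-memory cache are our assumptions, and `llm_fn` stands in for whatever model call your system actually makes.

```python
from typing import Callable


class HybridRouter:
    """Answer from rules when possible; otherwise call the model and cache the result."""

    def __init__(self, rules: dict[str, str], llm_fn: Callable[[str], str]):
        self.rules = {self._key(q): a for q, a in rules.items()}
        self.llm_fn = llm_fn
        self.cache: dict[str, str] = {}
        self.llm_calls = 0  # track how often we actually pay for inference

    @staticmethod
    def _key(query: str) -> str:
        # Normalise case and whitespace so trivial variations hit the same entry.
        return " ".join(query.lower().split())

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self.rules:        # high-volume case: no model call
            return self.rules[key]
        if key not in self.cache:    # long tail: call the model once per distinct query
            self.llm_calls += 1
            self.cache[key] = self.llm_fn(query)
        return self.cache[key]
```

Comparing `llm_calls` with total queries over a week of real traffic gives a direct measure of how much the rules and cache are saving. A production version would also need cache expiry, so that answers do not go stale when the underlying data changes.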
Step 4: Build the Evaluation Layer (Weeks 4–5)
Create a golden test set. Define metrics. Set up automated evaluation pipelines. This is the single highest-leverage investment you can make — it turns your GenAI system from a black box into something you can measure, debug, and improve systematically.
Step 5: Relaunch with a Controlled Cohort (Weeks 6–8)
Pick five to ten users who represent the target persona. Give them the redesigned system with clear instructions and a feedback channel. Measure adoption, satisfaction, and business impact weekly. Iterate based on real usage, not assumptions. Only expand to the next cohort when the metrics from the first cohort meet your thresholds.
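The expansion rule is easy to make mechanical. The sketch below assumes metrics are recorded once per week as a dictionary and that every metric is "higher is better"; the metric names, the thresholds, and the two-consecutive-weeks requirement are illustrative assumptions rather than recommended values.

```python
def ready_to_expand(weekly_metrics: list[dict[str, float]],
                    thresholds: dict[str, float],
                    consecutive_weeks: int = 2) -> bool:
    """True only if every metric met its threshold in each of the most recent weeks."""
    if len(weekly_metrics) < consecutive_weeks:
        return False
    recent = weekly_metrics[-consecutive_weeks:]
    return all(
        week.get(name, 0.0) >= minimum  # a missing metric counts as not met
        for week in recent
        for name, minimum in thresholds.items()
    )


# Hypothetical thresholds and readings for a first cohort.
thresholds = {"weekly_active_rate": 0.7, "satisfaction": 4.0, "task_completion": 0.8}
weeks = [
    {"weekly_active_rate": 0.5, "satisfaction": 3.8, "task_completion": 0.70},
    {"weekly_active_rate": 0.8, "satisfaction": 4.1, "task_completion": 0.85},
    {"weekly_active_rate": 0.9, "satisfaction": 4.3, "task_completion": 0.90},
]
print(ready_to_expand(weeks, thresholds))
```

Agreeing the thresholds before the relaunch, rather than after the first numbers arrive, keeps the expansion decision honest.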
What a Well-Scoped GenAI Engagement Actually Costs
One of the reasons pilots fail is that organisations budget for the demo, not the production system. Here is what a realistic GenAI engagement looks like in terms of investment, based on our work across Singapore, Hong Kong, and India:
| Phase | Duration | Typical Investment (SGD) | What You Get |
|---|---|---|---|
| Discovery and Problem Framing | 2–3 weeks | 15,000–25,000 | Validated use case, data audit, architecture recommendation |
| Pilot Build (Production-Grade) | 6–10 weeks | 60,000–120,000 | Working system with evaluation framework, tested on real data |
| Controlled Rollout | 4–6 weeks | 30,000–50,000 | User onboarding, feedback integration, performance tuning |
| Production Hardening | 4–8 weeks | 40,000–80,000 | Monitoring, scaling, security, compliance, SLA-grade reliability |
| Ongoing Optimisation | Monthly | 8,000–15,000/month | Prompt tuning, model updates, data refresh, eval maintenance |
Total investment for a production-grade GenAI system: SGD 145,000–275,000 over four to six months, plus ongoing optimisation. This is significantly more than the SGD 20,000–40,000 that most organisations budget for a “pilot” — and that gap in expectations is itself a major cause of failure.
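For readers who want to adapt these figures to their own phases, the arithmetic behind the totals is reproduced below. The phase data is copied from the table above. The sketch assumes the phases run sequentially, and the six months of ongoing optimisation in the usage example is our own assumption for illustrating a first-year figure.

```python
# (phase, min_weeks, max_weeks, min_sgd, max_sgd), copied from the table above
PHASES = [
    ("Discovery and Problem Framing", 2, 3, 15_000, 25_000),
    ("Pilot Build (Production-Grade)", 6, 10, 60_000, 120_000),
    ("Controlled Rollout", 4, 6, 30_000, 50_000),
    ("Production Hardening", 4, 8, 40_000, 80_000),
]
MONTHLY_OPTIMISATION_SGD = (8_000, 15_000)


def totals(phases):
    """Sum durations and investment, assuming phases run one after another."""
    return {
        "weeks": (sum(p[1] for p in phases), sum(p[2] for p in phases)),
        "sgd": (sum(p[3] for p in phases), sum(p[4] for p in phases)),
    }


def with_optimisation_sgd(phases, monthly, optimisation_months):
    """Build cost plus a chosen number of months of ongoing optimisation."""
    low, high = totals(phases)["sgd"]
    return (low + monthly[0] * optimisation_months,
            high + monthly[1] * optimisation_months)


print(totals(PHASES))
# Assumption: six months of optimisation after go-live.
print(with_optimisation_sgd(PHASES, MONTHLY_OPTIMISATION_SGD, 6))
```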
How Sthambh Helps Enterprises Move GenAI from Pilot to Production
Sthambh works with mid-market and enterprise clients across Singapore, Hong Kong, India, and the Middle East to design, build, and operationalise GenAI systems that survive contact with production. Our approach is informed by the failure patterns described in this post — and specifically designed to avoid them.
We start with a two-to-three-week discovery sprint that validates the business case, audits the data, and produces a concrete architecture recommendation — before a line of production code is written. We build pilots on production-grade infrastructure from day one, with automated evaluation frameworks, cost modelling, and a clear path to scale. We pair every technical lead with a change management plan that covers user onboarding, workflow integration, and adoption tracking.
Whether you need to rescue a stalled pilot, design a new initiative from scratch, or build the internal capability to run GenAI at scale, our team has the engineering depth and the business context to make it work. We have delivered production GenAI systems for document processing, compliance automation, customer service, and internal knowledge management — across banking, insurance, healthcare, logistics, and professional services.
FAQs
Q. Why do most GenAI pilots fail even when the technology works in demos?
A. Demos use curated data, low volume, and no integration with real workflows. Production requires dirty data, high throughput, security, monitoring, and user adoption. The gap between a working demo and a working product is where most pilots die — it is an engineering and organisational problem, not a model capability problem.
Q. What is the most common mistake enterprises make with GenAI?
A. Starting with the technology instead of the problem. Teams that begin by asking “How can we use GPT?” instead of “What is our most expensive manual process?” almost always build something impressive but commercially irrelevant.
Q. How much should we budget for a production-grade GenAI system?
A. For a single well-scoped use case, expect SGD 145,000–275,000 over four to six months, including discovery, build, rollout, and production hardening. Ongoing optimisation runs SGD 8,000–15,000 per month. Budgeting only for the pilot (SGD 20,000–40,000) is a common cause of failure.
Q. Can we fix a stalled GenAI pilot, or should we start over?
A. It depends on why it stalled. If the business case is still valid and the data foundation is recoverable, a structured recovery — revalidate, audit data, redesign architecture, build evaluation, relaunch with a controlled cohort — typically takes six to eight weeks. If the original problem was wrong or the data does not exist, starting over with a different use case is faster and cheaper.
Q. Should we build our GenAI system in-house or buy a commercial tool?
A. If your use case is common (customer support chatbot, document search, meeting notes), evaluate commercial tools first: they succeed about 67% of the time, while internal builds succeed only about one-third as often. If you have proprietary data, proprietary workflows, or regulatory requirements that commercial tools cannot meet, build on top of proven open-source frameworks rather than from scratch.
Q. How do we measure whether a GenAI system is actually working?
A. Define three to five metrics before you build: accuracy against a labelled test set, latency percentiles, user satisfaction scores, task completion rates, and hallucination frequency. Create a golden dataset of 100–500 examples and run automated evaluations on every change. If you cannot measure it, you cannot improve it — and you certainly cannot trust it.
Q. What role does data quality play in GenAI success?
A. It is the single biggest determinant. Gartner’s research shows 85% of AI projects fail due to poor data quality or a lack of relevant data. Budget 30–40% of your timeline for data audit, cleaning, and structuring. The model is only as good as the context you feed it.
Q. Is the 95% GenAI failure rate real?
A. MIT’s 2025 report, based on 150 executive interviews, 350 employee surveys, and 300 public deployment analyses, found that 95% of enterprise GenAI implementations fall short of ROI targets. Other studies put the number between 70% and 85%. The exact figure varies, but the pattern is consistent: the vast majority of GenAI pilots do not deliver their intended business value.
Nikhil Khandelwal
Co-founder & CTO, Sthambh
