Roughly 80 percent of enterprise GenAI pilots never reach production. In regulated industries, that number is worse. Banks in Singapore and Hong Kong, insurers across APAC, UK fintechs under FCA scrutiny, and healthcare groups serving the EU all face the same wall. A demo wins applause. The same system stalls the moment compliance, audit, security, or operations teams ask the second-order questions.
If you are running an enterprise GenAI program and the gap between Friday’s demo and Monday’s production review keeps widening, this guide is for you. It covers the four gaps that kill pilots, the regulatory compliance layer you cannot skip, an architecture checklist for production-shaped builds, a real twelve-week case study, and a week-by-week roadmap.
Why GenAI Pilots Stall Inside Regulated Industries
Pilots succeed because they answer one question well in a controlled setting. Production fails for the opposite reason. It must answer many questions, on real data, under real load, with real consequences when it gets something wrong.
In regulated industries, three forces compound that gap.
The regulator. The Monetary Authority of Singapore’s FEAT principles (Fairness, Ethics, Accountability, Transparency) and the HKMA’s Generative AI guidance circular both require explainability, traceability, and human oversight on every consequential output. The EU AI Act extends that envelope to anything classified as high-risk, including AI systems that inform credit decisions, insurance pricing, or employee management. A pilot that hallucinates with charm in a sandbox cannot do the same in a customer-facing or compliance-facing workflow. The Bank of England and PRA’s model risk management expectations under SS1/23 add a further dimension: firms must demonstrate that AI models used in regulated activities have been validated, documented, and monitored. None of these frameworks cares that you were “just piloting.” They care whether the system was used in a consequential workflow.
For APAC-headquartered groups selling into Europe, the EU AI Act’s general-purpose AI transparency requirements — covering models with systemic risk above the 10^25 FLOP training compute threshold — apply to the model layer even when the deploying firm is outside the EU. If your pilot uses a frontier model and your output informs any decision touching EU citizens, that is in scope.
The data. A pilot uses a clean, hand-picked corpus. Production runs on the messy reality of your document estate. Internal policies in PDF, customer correspondence in email threads, regulatory filings in Traditional Chinese or Cantonese, claims notes typed by a human in a hurry, scanned documents with imperfect OCR, multilingual tables mixing English headers with Mandarin data cells. The retrieval layer that handled twenty test documents collapses against twenty thousand — not because the model is worse but because the indexing and metadata architecture was never designed to handle that variety.
Specific problems that pilot teams underestimate: scanned PDFs with no embedded text require an OCR pipeline before chunking; multilingual documents require language-aware tokenization that preserves proper nouns and regulatory identifiers; and structured data (spreadsheets, database exports) requires a different chunking strategy from flowing prose. If any of these document types exist in your production corpus and your pilot did not account for them, you have a grounding gap.
The operating model. A pilot has one engineer babysitting the system. Production needs incident response, change management, a model risk owner, and a clear path to roll back when something goes wrong. It needs a defined owner for each layer of the stack: the embedding model, the vector store, the LLM inference endpoint, the prompt templates, and the evaluation set. Most pilots never plan for that handover. When the pilot engineer moves on to the next proof of concept — which they always do — the system has no owner and quietly rots until it is cancelled.
Get one of these wrong and the pilot stalls. Get all three wrong and you join the 80 percent.
The Four Production Gaps That Kill Enterprise GenAI Pilots
After working with banks, insurers, and platform teams across Singapore, Hong Kong, the UK, and the US, we see the same four gaps every time. None of them is about model quality. All of them are about the system around the model.
1. The Evaluation Gap
Pilots run on vibes. Someone on the team types a question, reads the answer, and says “yes that’s good.” Production cannot work that way.
You need an evaluation set: a few hundred questions, mapped to known correct answers, scored on retrieval precision, answer faithfulness, and citation accuracy. Without that, you have no way to tell whether the system is improving, regressing, or quietly drifting after a model update. Regulators ask for this explicitly. So do your own internal audit teams.
The three metrics that matter most for regulated RAG systems:
- Context precision: of all the chunks retrieved, what fraction were actually relevant to the question? Low context precision means your answers are grounded in noise.
- Answer faithfulness: is every claim in the generated answer traceable to a retrieved source chunk? This is the hallucination metric. You want it above 95 percent before going live.
- Citation accuracy: does the cited source — document name, section, date — correctly identify the passage the answer drew from? This is what the regulator looks at when they ask “how did your system reach that conclusion?”
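To make the three metrics concrete, here is a minimal Python sketch of how they could be computed once a reviewer or an LLM judge has labelled each evaluation question. The `EvalResult` fields and function names are illustrative assumptions, not RAGAS or TruLens APIs, and context precision here is the simple fraction described above, not a rank-weighted variant.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One scored evaluation question. All field names are hypothetical."""
    retrieved_ids: list[str]   # chunk IDs the retriever returned
    relevant_ids: set[str]     # chunk IDs a reviewer marked as relevant
    claims_total: int          # claims identified in the generated answer
    claims_supported: int      # claims traceable to a retrieved chunk
    cited_id: str              # chunk the answer cited
    expected_cited_id: str     # chunk the reviewer expected to be cited


def context_precision(r: EvalResult) -> float:
    """Fraction of retrieved chunks that were actually relevant."""
    if not r.retrieved_ids:
        return 0.0
    hits = sum(1 for cid in r.retrieved_ids if cid in r.relevant_ids)
    return hits / len(r.retrieved_ids)


def faithfulness(r: EvalResult) -> float:
    """Fraction of answer claims supported by a retrieved chunk."""
    if r.claims_total == 0:
        return 1.0  # assumption: an answer with no claims has nothing unsupported
    return r.claims_supported / r.claims_total


def citation_accurate(r: EvalResult) -> bool:
    """True when the cited chunk is the one the reviewer expected."""
    return r.cited_id == r.expected_cited_id


def summarise(results: list[EvalResult]) -> dict[str, float]:
    """Average each metric across the evaluation set."""
    if not results:
        raise ValueError("evaluation set is empty")
    n = len(results)
    return {
        "context_precision": sum(context_precision(r) for r in results) / n,
        "answer_faithfulness": sum(faithfulness(r) for r in results) / n,
        "citation_accuracy": sum(citation_accurate(r) for r in results) / n,
    }
```

In practice the labels feeding this would come from your curated evaluation set plus a judge step; the arithmetic is the easy part.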
The open-source evaluation frameworks RAGAS and TruLens cover much of this out of the box. RAGAS uses LLM-as-judge to score faithfulness and relevance at scale. TruLens provides deeper tracing into the retrieval chain, which is useful for debugging low-precision retrievals. Citation accuracy usually needs a custom check against your own document metadata. Neither framework replaces a human-curated evaluation set, but both make it practical to run hundreds of evaluations on every deployment.
A practical baseline for regulated industries is 200 to 500 questions per use case, refreshed quarterly, with at least 20 percent of those questions designed to probe edge cases the regulator cares about: anti-money laundering thresholds, sanctions screening edge cases, customer suitability rules, policy supersession scenarios. The questions the system must refuse or flag are as important as the ones it must answer correctly.
2. The Grounding Gap
Most pilots use whatever embedding model and chunking strategy was in the tutorial. That is fine for a demo. It is not fine for a production system that has to cite the right paragraph of the right policy when the regulator asks.
Production-grade RAG in regulated industries needs three things the pilot probably skipped.
First, source-aware chunking that respects document structure — page numbers, section IDs, clause numbers, and heading hierarchy — so every answer can be traced back to a specific line in a specific version of a specific document. For scanned PDFs, this requires an OCR step that preserves positional metadata. For multilingual documents, it requires a language-detection step that preserves identifiers and proper nouns across chunk boundaries.
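As a rough illustration of source-aware chunking, the sketch below splits already-extracted page text on numbered section headings and carries the document ID, version, section ID, and starting page on every chunk. It assumes headings look like "3.2 Record Retention" and that text extraction or OCR has already happened; the `Chunk` class and `chunk_by_section` function are hypothetical names, and a real pipeline would need a more robust heading detector.

```python
import re
from dataclasses import dataclass

# Assumed heading style: dotted section number, then a title ("3.2 Record Retention").
# A body line that happens to start with a number would be misread as a heading.
SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Chunk:
    doc_id: str
    doc_version: str
    section_id: str
    heading: str
    start_page: int  # page on which the section begins
    text: str = ""


def chunk_by_section(pages: list[str], doc_id: str, doc_version: str) -> list[Chunk]:
    """Emit one chunk per numbered section, keeping traceability metadata."""
    chunks: list[Chunk] = []
    current = None
    for page_no, page_text in enumerate(pages, start=1):
        for raw_line in page_text.splitlines():
            line = raw_line.strip()
            if not line:
                continue
            match = SECTION_RE.match(line)
            if match:
                # A new heading closes the previous section and opens a new chunk.
                current = Chunk(doc_id, doc_version, match.group(1), match.group(2), page_no)
                chunks.append(current)
                continue
            if current is None:
                # Text before the first heading is kept under a "preamble" chunk.
                current = Chunk(doc_id, doc_version, "preamble", "", page_no)
                chunks.append(current)
            # Sections that run across a page break stay in the same chunk.
            current.text = f"{current.text} {line}".strip()
    return chunks
```

With metadata attached at chunk time, a citation can name the document, version, section, and page without any lookup after generation.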
Second, rich metadata filtering so a query about a 2024 regulation does not retrieve a superseded 2019 document. The minimum metadata schema for regulated financial documents should include: effective_date (ISO 8601), supersedes (list of document IDs this version replaces), classification (public / internal / restricted / partner-confidential), issuer (MAS / HKMA / FCA / internal), jurisdiction, and language. Without supersedes metadata, your retrieval layer will happily surface invalidated guidance alongside current guidance, and the model will blend them into an answer that looks authoritative but is partly wrong.
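The schema above maps naturally onto a small typed record. The sketch below is one possible shape, with a helper that drops any document replaced by a newer one already in force on the query date. `DocMeta` and `current_documents` are illustrative names, and the supersession rule (a document is only retired once its replacement is effective) is an assumption you would confirm with your compliance team.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocMeta:
    doc_id: str
    effective_date: date          # stored as ISO 8601 in the index
    supersedes: tuple[str, ...]   # doc_ids this version replaces
    classification: str           # public / internal / restricted / partner-confidential
    issuer: str                   # MAS / HKMA / FCA / internal
    jurisdiction: str
    language: str


def current_documents(docs: list[DocMeta], as_of: date) -> list[DocMeta]:
    """Return documents in force on `as_of` that have not been superseded."""
    in_force = [d for d in docs if d.effective_date <= as_of]
    # Only a replacement that is itself in force can retire an older document.
    superseded = {old_id for d in in_force for old_id in d.supersedes}
    return [d for d in in_force if d.doc_id not in superseded]
```

The resulting list of document IDs can then be passed to the vector store as a metadata filter so superseded guidance never reaches the model.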
Third, a hybrid retrieval layer combining dense vector search with sparse BM25 search so domain terminology and exact identifiers — a policy number, a CUSIP, a Hong Kong ID format, a specific MAS circular reference — work reliably alongside semantic queries. Dense search alone fails on exact identifier lookups. Sparse search alone fails on semantic queries. Regulated financial documents contain both, so you need both.
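One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below shows RRF as an example of the idea; the text above does not prescribe a fusion method, so treat the choice of RRF and the conventional constant `k = 60` as assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so a chunk
    found by both dense and sparse search outranks one found by only one.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties on chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for a query containing an exact circular reference.
dense_hits = ["c2", "c1", "c3"]   # semantic neighbours
sparse_hits = ["c4", "c2"]        # BM25 exact-term matches
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Here `c2` comes first because both retrievers returned it, and `c4`, the exact-match hit that dense search missed entirely, still lands in second place.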
If your pilot answer says “according to our policy” without naming the policy, the section, the effective date, and the version, it is not production-ready.
3. The Governance Gap
Pilots have one user. Production has hundreds, sometimes thousands. Every one of them needs the right level of access to the right documents, and every interaction needs to be logged in a way that satisfies the bank’s record-keeping rules.
That means row-level access controls inside the vector store, keyed on user role and document classification. A junior analyst should not be able to retrieve partner-confidential counterparty assessments. A branch user in one jurisdiction should not be able to retrieve policies scoped to another. Most off-the-shelf vector databases support metadata filtering but not true row-level security — implementing this correctly requires either a pre-query filter injected from the access control system or a post-retrieval filter before the context is passed to the LLM.
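A minimal sketch of the pre-query filter approach follows, with the same check reused after retrieval as a second line of defence. The role names, the `ROLE_CLEARANCE` mapping, and the filter shape are all hypothetical; in a real deployment the mapping would come from your identity provider and the filter would be translated into your vector store's own filter syntax.

```python
from dataclasses import dataclass

# Hypothetical mapping from user role to the classifications it may read.
ROLE_CLEARANCE: dict[str, frozenset[str]] = {
    "junior_analyst": frozenset({"public", "internal"}),
    "senior_analyst": frozenset({"public", "internal", "restricted"}),
    "partner": frozenset({"public", "internal", "restricted", "partner-confidential"}),
}


@dataclass(frozen=True)
class User:
    user_id: str
    role: str
    jurisdictions: frozenset[str]


def build_access_filter(user: User) -> dict[str, frozenset[str]]:
    """Build the metadata filter injected into every retrieval call."""
    # Unknown roles fall back to public documents only.
    allowed = ROLE_CLEARANCE.get(user.role, frozenset({"public"}))
    return {"classification": allowed, "jurisdiction": user.jurisdictions}


def is_visible(chunk_meta: dict[str, str], access_filter: dict[str, frozenset[str]]) -> bool:
    """Post-retrieval check: every filtered field must hold an allowed value."""
    return all(chunk_meta.get(name) in allowed for name, allowed in access_filter.items())
```

Applying the filter before the query keeps restricted chunks out of the candidate set; running `is_visible` again before prompt assembly catches anything a misconfigured index lets through.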
On logging: the minimum fields for a compliant audit log entry in an APAC financial services context are: timestamp (UTC), user_id (pseudonymised), session_id, query_text (or its hash if the query contains PII), retrieved_chunk_ids (list), model_version, prompt_template_version, response_text (or hash), and human_review_flag (boolean indicating whether the output was reviewed before being acted on). Retention for seven years is the standard for most MAS and HKMA use cases — this is not the kind of log you can purge after thirty days.
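The field list above can be sketched as an immutable record serialised to one JSON line per query. This version always stores hashes of the query and response, and pseudonymises the user ID with a keyed hash; both are assumptions about how you choose to handle PII, and `AuditLogEntry`, `make_entry`, and `pseudonym_key` are illustrative names.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pseudonymise(user_id: str, key: bytes) -> str:
    """Keyed hash so the raw user ID never appears in the log."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class AuditLogEntry:
    timestamp: str                        # UTC, ISO 8601
    user_id: str                          # pseudonymised
    session_id: str
    query_hash: str
    retrieved_chunk_ids: tuple[str, ...]
    model_version: str
    prompt_template_version: str
    response_hash: str
    human_review_flag: bool


def make_entry(*, user_id: str, session_id: str, query_text: str,
               retrieved_chunk_ids: list[str], model_version: str,
               prompt_template_version: str, response_text: str,
               human_reviewed: bool, pseudonym_key: bytes) -> AuditLogEntry:
    return AuditLogEntry(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user_id=pseudonymise(user_id, pseudonym_key),
        session_id=session_id,
        query_hash=sha256_hex(query_text),
        retrieved_chunk_ids=tuple(retrieved_chunk_ids),
        model_version=model_version,
        prompt_template_version=prompt_template_version,
        response_hash=sha256_hex(response_text),
        human_review_flag=human_reviewed,
    )


def to_json_line(entry: AuditLogEntry) -> str:
    """Serialise with sorted keys so log lines are stable and diffable."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Immutability and retention are properties of the log store, not of this record, so they still have to be enforced wherever these lines are written.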
A human review path must also be defined for high-stakes outputs. What constitutes a high-stakes output varies by use case: for compliance document retrieval, it might be any answer that cites fewer than two sources (low confidence signal) or any answer that references a regulatory threshold (because the consequence of an error is a compliance breach). The SLA for human review — how long can the output sit in a queue before the analyst acts on it — must be defined before go-live.
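The two example triggers above could be expressed as a simple routing rule. The sketch below is deliberately crude: it treats any currency amount or percentage in the answer as a "regulatory threshold" signal, which is an assumption that will over-flag in practice. The regex, the `needs_human_review` name, and the default of two sources are illustrative only.

```python
import re

# Crude proxy for "references a regulatory threshold": a currency amount
# or a percentage anywhere in the answer text.
THRESHOLD_RE = re.compile(r"[$£€]\s?\d|\d\s?(?:%|percent)", re.IGNORECASE)


def needs_human_review(answer_text: str, cited_source_ids: list[str],
                       min_sources: int = 2) -> bool:
    """Route an answer to the review queue if either trigger fires."""
    few_sources = len(set(cited_source_ids)) < min_sources
    mentions_threshold = bool(THRESHOLD_RE.search(answer_text))
    return few_sources or mentions_threshold
```

Whatever rule you settle on, the result is what populates the `human_review_flag` in the audit log, so the rule itself should be versioned alongside the prompt templates.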
Most pilots have none of this. Building it after the fact, on top of an architecture that did not anticipate it, is two to three times more expensive than building it in from the start.
4. The Operations Gap
Production GenAI is software. It needs deployment pipelines, observability, alerting, and an on-call rotation. It needs a playbook for what to do when the model provider has an outage, when retrieval latency spikes, when a hallucination is reported by a user, when a new regulation comes into force and twenty documents in your corpus need to be updated.
Specific runbook scenarios every regulated GenAI system must address:
- Model provider outage: does the system fail gracefully or silently return degraded outputs? Is there a fallback model or a clear user-facing error message?
- Latency spike: what is the P95 latency SLO? Who gets paged when it is breached?
- Hallucination report: what is the intake process? Who investigates? How quickly must a confirmed hallucination be disclosed to compliance?
- Corpus update: when a new regulation comes into force and supersedes an existing document, how quickly does the new document reach the vector store? What happens to queries in flight?
- Model drift: embedding model updates can shift the retrieval distribution even without changing any query. How often do you re-run the evaluation set to catch silent regressions?
Pilots can survive without any of that. Production cannot. If your roadmap from pilot to production does not include a runbook, an SLO, and a named on-call engineer, you do not have a roadmap. You have a wish list.
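For the latency scenario, the SLO check itself is small. The sketch below computes P95 with the nearest-rank method over a window of request latencies and compares it to a target; the function names and the millisecond unit are assumptions, and most monitoring stacks will compute this for you.

```python
def p95(latencies_ms: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples in window")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * n + 99) // 100
    return ordered[rank - 1]


def slo_breached(latencies_ms: list[float], slo_ms: float) -> bool:
    """True when the window's P95 exceeds the agreed SLO."""
    return p95(latencies_ms) > slo_ms
```

The harder questions are the ones the runbook has to answer: which window, which SLO value, and who is paged when `slo_breached` returns true.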
Pilot vs Production: What Actually Changes in Regulated Industries
The table below shows what a typical pilot looks like on each dimension versus what a production-ready system requires. The gap is not primarily in the model — it is in everything around the model.
| Dimension | Typical Pilot | Production-Ready System |
|---|---|---|
| Knowledge corpus size | 20–100 hand-picked documents, clean PDFs | 5,000–50,000 documents, mixed formats, multiple languages, ongoing ingestion pipeline |
| Retrieval approach | Dense vector search only, default chunking | Hybrid dense+sparse, source-aware chunking, metadata filtering (effective date, classification, jurisdiction) |
| Evaluation method | Ad hoc manual queries, no scoring | 200–500 question evaluation set, RAGAS/TruLens scoring on context precision, faithfulness, citation accuracy |
| Audit logging | None or application logs only | Structured per-query logs: timestamp, user ID, query hash, chunk IDs, model version, prompt version, response hash. 7-year retention. |
| Access controls | Single user or open access | Row-level access control keyed on user role + document classification. Pre-query filter injected from identity system. |
| Data residency | Cloud-hosted, region not specified | Data residency locked to required jurisdiction (SG, HK, EU). Self-hosted or regional inference endpoint. Documented in data flow diagram. |
| On-call ownership | One pilot engineer, informal | Named model risk owner, named on-call rotation, runbook for 5+ failure scenarios, P95 latency SLO defined |
| Regulator readiness | Not considered | Compliance evidence pack: evaluation set results, audit log architecture, data residency proof, human oversight design, model version history |
The Regulatory Compliance Layer You Cannot Skip
Regulation is not a post-launch concern. It shapes architecture decisions. Here is what the major frameworks actually require of a production GenAI system in each key market.
1. MAS FEAT Principles (Singapore)
The Monetary Authority of Singapore’s FEAT principles are not aspirational guidelines — they are the framework MAS supervisors reference when assessing AI governance in financial institutions. Each principle has direct architectural implications.
Fairness requires that the system does not produce discriminatory outputs based on protected characteristics. For a RAG compliance system, this means ensuring the retrieval layer does not systematically return different quality answers for different user groups, and that the evaluation set tests for this explicitly.
Ethics requires that the system aligns with societal norms and that the firm can demonstrate this. In practice, this means maintaining a documented content policy — what topics the system will and will not address — and logging when users attempt to elicit out-of-scope responses.
Accountability requires that someone is responsible for every AI output that informs a decision. This translates directly to the model risk owner requirement: one named individual who is accountable for the system’s behaviour, has the authority to roll it back, and signs off on major configuration changes.
Transparency requires that the system’s outputs are explainable to the people affected by them. For a compliance analyst using the system, this means every answer includes its source citations. For the regulator, it means the firm can reconstruct, from audit logs, exactly what the system told any user and on what basis.
2. HKMA Generative AI Guidance
The HKMA’s guidance on the use of Generative AI by authorised institutions (issued late 2023, updated in engagement with the industry through 2024 and 2025) focuses on three areas that translate directly to system design. First, model validation: firms should validate that the generative AI output is accurate and appropriate before deploying in any customer-facing or decision-support workflow. This is the evaluation set requirement. Second, data governance: firms should ensure that customer data is not used to train or fine-tune external models without appropriate consent and data handling controls. This is the data residency and model selection constraint. Third, human oversight: for high-risk outputs, a qualified person should review the AI output before it informs a consequential decision. This is the human review path requirement.
3. EU AI Act: General-Purpose AI and High-Risk Provisions
The EU AI Act’s provisions that are most likely to affect APAC-headquartered firms deploying GenAI into European operations are twofold.
The general-purpose AI (GPAI) model rules require providers of models above the 10^25 FLOP training compute threshold — which includes the major frontier models from OpenAI, Google, and Anthropic — to publish technical documentation and comply with transparency obligations. As a deployer of these models, your firm inherits some of these obligations: you must document which model you are using, at which version, and for what purpose.
The high-risk AI system provisions apply when your system is used in contexts listed in Annex III of the Act: employment decisions, credit scoring, insurance risk assessment, biometric categorisation, or access to essential services. If any of these apply to your use case, you need a conformity assessment, an EU-accessible technical documentation package, and post-market monitoring. The timelines for these obligations are rolling in through 2025 and 2026 depending on the risk category.
For APAC firms not yet actively deploying into Europe but considering it: design your audit logging and documentation architecture to EU AI Act standards from day one. Retrofitting conformity documentation is considerably more expensive than capturing the evidence as you build.
4. UK PRA: SS1/23 Model Risk Management
The Bank of England and PRA’s Supervisory Statement SS1/23 on model risk management applies to PRA-regulated firms, which includes most banks and insurers operating in the UK. It requires firms to have a model risk management framework that covers AI and machine learning models used in regulated activities. Key requirements: model inventory, pre-deployment validation, ongoing performance monitoring, and model risk tiering (which determines the depth of validation required). For a RAG compliance system at a UK bank, this means the system needs a model card, a validation report covering the evaluation set results, and a monitoring plan that defines what metrics will be tracked and at what frequency.
How to Design a Production-Shaped GenAI Pilot
The fix is not to build bigger pilots. It is to build pilots that are already production-shaped, just smaller. Five practices make the difference.
1. Start with the Production Use Case, Not the Demo Use Case
Pick a workflow that real users do every day, that has a measurable outcome, and that touches real data. Customer onboarding screening, internal policy lookup, claims triage, regulatory horizon scanning. Avoid demo-friendly use cases like “ask any question about anything” — they look great on stage and never reach production because they have no evaluation criteria and no clear success metric.
The right question is not “what can GenAI do.” It is “what does this team currently do every Tuesday morning that AI can take from two hours to ten minutes.”
2. Build the Evaluation Set Before You Build the System
This sounds backwards. It is not. If you cannot write the evaluation set, you do not understand the use case well enough to build for it. Spend the first two weeks writing 200 questions and curating known-good answers. Then build. Your pilot will outperform teams that started with the architecture, because you have already resolved the ambiguities that kill systems at review time.
3. Architect for Audit on Day One
Every retrieval, every prompt, every response — logged with a timestamp, a user ID, a source citation, and a model version. This is not optional for regulated industries and it is not something you can bolt on later without rebuilding half the system. Treat your RAG pipeline like any other regulated software system: versioned, observable, and reproducible. The teams who get this right can answer the regulator’s question “what did your system tell this customer six months ago?” in under an hour.
4. Pick a Stack That Respects Data Residency
If your data has to stay in Singapore, in Hong Kong, or inside the EU, your model and your vector store must respect that. That rules out some hosted services and rules in self-hosted open-weight models, regional inference endpoints (such as Azure East Asia, GCP Singapore, AWS ap-southeast-1), and on-premises vector databases. Decide this at architecture time, not at procurement time. Sthambh’s RAG pipeline practice is built around this exact constraint, because almost every regulated client in APAC raises data residency on day one and most pilot teams have no documented answer.
5. Plan the Handover from Day One
Every pilot has a target handover team. Identify them at kickoff. Invite them to the architecture review. Get their non-negotiables in writing — access controls, logging, SLAs, monitoring tooling, change management process — and build to those non-negotiables. The pilot that reaches production is the one whose target operations team has been in the room since week one. If you cannot name the team that will run this in production, you are not running a pilot. You are running a science fair.
A Real Use Case: Compliance Document Retrieval at an APAC Bank
A regional bank running operations in Singapore and Hong Kong wanted to deploy a RAG system to help compliance analysts answer questions about internal policies and external regulations. The pilot worked well in a sandbox with 40 selected documents. Six months later it had not reached production, and the compliance team had quietly stopped using it.
The reasons were the four gaps. No formal evaluation set — progress was assessed by one champion user who had memorised most of the documents and was effectively pattern-matching rather than genuinely relying on the system. Citations pointed to a document but not a section, making them useless for a compliance analyst who needed to cite the exact clause in a response to a regulator. No row-level access controls — some policies were partner-confidential and the system could not safely be opened to all analysts. And no on-call ownership — the pilot engineer had been reassigned.
Phase 1: Use Case Definition and Evaluation Set (Weeks 1–2)
We started by scoping the use case precisely: compliance analysts answering internal queries about MAS and HKMA regulatory obligations and internal policy. We excluded customer-facing use cases from scope. With three senior compliance analysts, we built an evaluation set of 350 questions: 120 on MAS regulations, 90 on HKMA regulations, 100 on internal policy, and 40 edge cases — superseded regulations, policy conflicts between jurisdictions, and deliberately ambiguous queries that the system should flag rather than answer confidently.
Phase 2: Architecture and Compliance Design (Weeks 3–4)
We rebuilt the chunking pipeline to preserve section IDs, clause numbers, page numbers, and effective dates. We added metadata fields for issuer, jurisdiction, effective_date, supersedes, and classification. We designed a row-level access layer keyed on the bank’s existing identity provider (Azure AD), with document classification mapped to user roles.
We also made the data residency decision explicit: all model inference would run on a regionally-hosted endpoint within Singapore, with no data leaving the APAC region. This ruled out several hosted services and led to a self-hosted embedding model and a managed vector store on a Singapore-region cloud instance.
Phase 3: Build and Internal Pilot (Weeks 5–8)
We implemented hybrid retrieval (dense + BM25) and ran the evaluation set against each retrieval configuration variant. Default dense-only retrieval scored 71 percent context precision on the evaluation set. Hybrid retrieval with metadata filtering on effective_date and classification reached 89 percent. Adding source-aware chunking that preserved section boundaries pushed this to 94 percent.
We also instrumented the full audit log — per-query logging to a structured log store with the fields noted in the governance section above — and wired it into the bank’s existing SIEM for retention.
Phase 4: Compliance Review and Expanded Beta (Weeks 9–10)
We ran the evaluation set results and the audit log architecture through the bank’s model risk team. The compliance evidence pack we prepared included: evaluation set methodology and results, the data flow diagram showing Singapore-only data residency, the access control architecture, the human review SLA (any query flagged as high-confidence-low-source-count reviewed within four business hours), and the model version and prompt template change log.
The model risk team approved the system for controlled rollout to 15 analysts across Singapore and Hong Kong compliance. Feedback from the beta confirmed the evaluation set results held on real traffic.
Phase 5: Production Handover (Weeks 11–12)
We ran two sessions with the platform team, walked through the runbook, and trained two on-call engineers on the five failure scenarios. We handed over the evaluation set as a living document, with a quarterly refresh process documented and owned by the compliance operations team.
The compliance team’s average time-to-answer on a policy question fell from 18 minutes to under 2 minutes. Retrieval precision on the evaluation set held at 94 percent through the first month of production traffic. None of that improvement was about a better model. All of it was about a better system around the model.
For more on the architecture patterns we use in regulated APAC environments, see our deep dive on RAG for financial services in Asia.
From Pilot to Production: A 12-Week Roadmap
Most regulated GenAI production deployments run 10 to 16 weeks from scoping to production rollout. The variance is driven by data preparation complexity and compliance sign-off cycles, not the model or retrieval layer itself. Here is a practical week-by-week structure.
Phase 1: Use Case Definition and Evaluation Set (Weeks 1–2)
Define the use case with specificity: who uses it, what they are doing today, what a successful output looks like, and what a harmful output looks like. Build the evaluation set. 200 questions minimum, with at least 20 percent edge cases. Do not start building the system until the evaluation set is reviewed by a domain expert and a compliance representative.
Also in this phase: document the data residency requirement, identify the on-call team, and get the model risk team’s intake criteria in writing.
Phase 2: Architecture and Compliance Design (Weeks 3–4)
Make every consequential architectural decision explicit and documented: embedding model, vector store, LLM inference endpoint, chunking strategy, metadata schema, access control architecture, audit log schema, data residency approach. Have this architecture reviewed by the operations team before a line of production code is written.
Draft the compliance evidence pack structure at this stage — it is far easier to capture documentation as you build than to reconstruct it at review time.
Phase 3: Build and Internal Pilot (Weeks 5–8)
Build the system. Run the evaluation set at the end of each sprint. Fix retrieval before fixing generation — low context precision cannot be corrected at the prompt layer. Instrument audit logging from the first deployment, not as a last-minute addition.
At the end of week 8, the system should score above 90 percent on context precision and above 95 percent on answer faithfulness on the evaluation set before it is shown to any internal users.
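These two thresholds can be enforced as an automated gate at the end of each sprint. The sketch below assumes the evaluation run produces a dictionary of metric scores between 0 and 1; the metric keys and the `gate_failures` function are hypothetical names.

```python
# Minimum scores from the roadmap; each metric must be strictly above its bar.
GATES: dict[str, float] = {
    "context_precision": 0.90,
    "answer_faithfulness": 0.95,
}


def gate_failures(scores: dict[str, float],
                  gates: dict[str, float] = GATES) -> list[str]:
    """Return a human-readable failure per unmet gate; empty means pass."""
    failures: list[str] = []
    for metric, minimum in gates.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value <= minimum:
            failures.append(f"{metric}: {value:.2f} is not above {minimum:.2f}")
    return failures
```

Wiring this into the deployment pipeline means a regression blocks the release instead of surfacing in front of users.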
Phase 4: Compliance Review and Expanded Beta (Weeks 9–10)
Submit the compliance evidence pack to the model risk team. Run a controlled beta with 10 to 20 users from the target team. Collect feedback systematically — not “did you like it?” but “was any answer wrong, and if so, which query and what was wrong?” Add the confirmed errors to the evaluation set.
Address any findings from the model risk review before proceeding.
Phase 5: Production Handover (Weeks 11–12)
Run handover sessions with the operations team. Walk through the runbook. Test the failure scenarios in a staging environment. Set the monitoring alerts. Confirm the on-call rotation. Go live to the full user group. Watch the first week of production metrics against the evaluation set. Brief the compliance team on the incident reporting process.
Where Regulated GenAI Goes Next
The regulators are converging. MAS, HKMA, the EU AI Act, and the FCA’s emerging AI framework all point in the same direction: explainability, traceability, human oversight, and demonstrable evaluation. The teams who build for that now will not be scrambling in 2027 when enforcement tightens.
The technical envelope is also shifting. Agentic RAG — where the system plans multi-step retrieval and reasoning across tools — is moving from research into production deployments. That raises the bar on governance again, because now you are auditing not just an answer but a sequence of decisions: which tool the agent chose to call, what query it constructed, what it did with the result, and why it chose to stop. If you have not solved governance for single-shot RAG, you cannot solve it for agentic systems.
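To illustrate what auditing a sequence of decisions might involve, here is a minimal trace structure that records each step's tool, query, result, and rationale, plus the reason the agent stopped. The class and field names are hypothetical and not tied to any agent framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AgentStep:
    tool: str             # which tool the agent chose to call
    query: str            # the query it constructed
    result_summary: str   # what came back, summarised or hashed
    rationale: str        # why the agent took this step


@dataclass
class AgentTrace:
    session_id: str
    steps: list[AgentStep] = field(default_factory=list)
    stop_reason: Optional[str] = None

    def record(self, tool: str, query: str, result_summary: str, rationale: str) -> None:
        if self.stop_reason is not None:
            raise RuntimeError("trace is closed; no further steps may be recorded")
        self.steps.append(AgentStep(tool, query, result_summary, rationale))

    def stop(self, reason: str) -> None:
        self.stop_reason = reason

    def is_auditable(self) -> bool:
        """A trace is complete when it has a stop reason and every step has a rationale."""
        return self.stop_reason is not None and all(s.rationale for s in self.steps)
```

A trace like this would sit alongside the per-query audit log entry, keyed on the same session ID.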
Start with the basics. Earn the right to build the more advanced systems. For teams ready to take that next step, our work on agentic AI is built on the same production-first principles laid out in this guide.
How Sthambh Takes Enterprise GenAI from Pilot to Production
Sthambh works with banks, insurers, and enterprise platform teams in Singapore, Hong Kong, the UK, and the US to build GenAI systems that reach production — not just demo. Our RAG pipeline practice covers the full stack: evaluation set design, source-aware chunking, hybrid retrieval architecture, metadata schema for regulated documents, audit logging, and data residency compliance. We have delivered production RAG systems under MAS and HKMA governance frameworks, and we build compliance evidence packs as a standard deliverable.
For teams at the pilot stage, we offer a structured production readiness assessment: a two-week engagement that scores your current system against the eight dimensions in the table above and produces a prioritised remediation plan. For teams starting from scratch, we run the full 12-week pilot-to-production program as a fixed-scope engagement with a defined production success criterion agreed at kickoff.
Our agentic AI practice extends this foundation to multi-step reasoning systems, with the same governance-first approach applied to the full agent loop — tool selection, reasoning trace logging, and human-in-the-loop design. See our agentic AI services page for more detail.
If you are running a GenAI pilot in a regulated industry and want to close the gap to production, book a RAG readiness call with our team.
FAQs
Q. Why do most enterprise GenAI pilots fail to reach production?
A. The four most common reasons are: no formal evaluation set (so there is no way to prove the system is safe to deploy), inadequate grounding architecture (retrieval that works on 40 documents breaks on 40,000), missing governance infrastructure (access controls, audit logging, and human review paths were not built in), and no defined operations model (no on-call owner, no runbook, no monitoring). None of these are about the model being insufficient. They are about the system around the model not being production-ready.
Q. How long does it take to move a GenAI pilot to production in a regulated industry?
A. Most regulated-industry RAG deployments run 10 to 16 weeks from scoping to production rollout. A 12-week program is achievable when data preparation is straightforward and the compliance review process is well-defined. The variance is almost entirely in data preparation (ingestion, chunking, metadata tagging) and compliance sign-off cycles, not in the model or retrieval layer itself. Teams that skip the evaluation set design phase and the architecture review typically take longer, not shorter, because they encounter the compliance gaps during review rather than before build.
Q. What does a production-ready RAG evaluation set look like?
A. A production-ready evaluation set for regulated RAG contains 200 to 500 questions with known-good answers, curated by domain experts. For a compliance document retrieval use case, that means questions drawn from MAS and HKMA circulars, internal policy, and edge cases — superseded regulations, policy conflicts between jurisdictions, and high-stakes threshold queries. At least 20 percent should be edge cases the system must handle correctly or refuse confidently. The set is scored on context precision, answer faithfulness, and citation accuracy using RAGAS or TruLens, and refreshed quarterly or after any major corpus update.
Q. How do I handle audit logging for GenAI in Singapore or Hong Kong financial services?
A. The minimum audit log entry for MAS and HKMA compliance should include: timestamp (UTC), user ID (pseudonymised where PII is present), session ID, query text or hash, retrieved chunk IDs, model version, prompt template version, response text or hash, and a human review flag. Retention for seven years is the standard for most APAC financial services use cases — consistent with MAS Notice on Technology Risk Management and HKMA guidance on record-keeping. The log should be immutable once written and should feed into the firm’s existing SIEM or audit infrastructure rather than being held in application logs only.
Q. What is the difference between a GenAI pilot and a production GenAI system?
A. A pilot answers one question well in a controlled setting with a small, clean corpus, one user, informal evaluation, and no governance infrastructure. A production system answers many questions on real data, under real load, with formal evaluation, access controls, audit logging, data residency compliance, human oversight for high-stakes outputs, and a named operations team with a runbook. The model is often the same in both. Everything around the model is completely different. The table in this post maps the differences across eight specific dimensions.
Q. How does the EU AI Act affect our GenAI production deployment?
A. The EU AI Act’s impact depends on two factors: whether you are deploying a general-purpose AI model above the 10^25 FLOP training compute threshold (which includes major frontier models), and whether your use case falls into a high-risk category under Annex III (credit scoring, insurance risk, employment decisions, biometric categorisation, access to essential services). If the first condition applies, you inherit transparency obligations from the model provider, including the requirement to document which model version is in use. If the second condition also applies, your deployment requires a conformity assessment, technical documentation, and post-market monitoring. For APAC firms selling into Europe or processing data about EU citizens, these obligations apply regardless of where the deploying firm is headquartered. The safest approach is to design your audit logging and documentation architecture to EU AI Act standards from day one — it is significantly cheaper than retrofitting compliance documentation after launch.
Nikhil Khandelwal
Co-founder & CTO, Sthambh
