LLM Cost Optimisation in Production: How APAC Enterprises Cut Token Costs 40–60%

Table of contents

Share This Article

The most expensive part of a production LLM deployment is not the model. It is the second-order behaviour of the application around it. By mid-2026, most APAC enterprises running real workloads have learnt this the hard way. Their first month’s invoice is fine. Their fourth month is not. The token bill grows three to five times faster than traffic, and nobody on the engineering team can explain why in a single sentence.

This is the practitioner’s guide to LLM cost optimisation in production for APAC enterprises. It assumes you already have something live, you are paying real money, and your CFO has started asking pointed questions. We will cover the seven cost levers that consistently deliver 40–60% reductions without degrading quality, why APAC data-residency rules narrow the option space versus US peers, and a 90-day roadmap that prioritises quick wins before structural rebuilds. The economics matter now: enterprise model-API spend crossed USD 8.4 billion globally in 2025 and is projected to grow further through 2026, and most teams have no systematic plan to control it.

Why LLM Bills Are Quietly Breaking APAC AI Budgets

The pattern is consistent across the Singapore, Hong Kong, and broader APAC enterprises Sthambh works with. A successful pilot ships in Q1. By Q3, the same workload is consuming token volumes that nobody modelled. The CFO asks for a cost-per-conversation number, and the engineering team realises they do not have one.

There are three structural reasons this keeps happening, and none of them are about the model being too expensive in isolation.

The first is that context windows have grown faster than discipline. A retrieval-augmented generation pipeline that started with 4,000 tokens of context per call now sends 60,000. Conversation histories accumulate. System prompts get longer every sprint as product managers stack instructions to fix edge cases. By month four, 80% of every token you pay for is repeated context that the model has seen on the previous turn.

The second is that agentic workloads multiply token consumption non-linearly. A traditional chat call is one round trip. An agent loop is fifteen. Each tool call has its own system prompt, its own tool definitions, and its own context. The unit economics of a single user request quietly went from $0.02 to $0.40 the day you upgraded the chatbot to an agent. Most teams discover this after the fact, when they read the operational tradeoffs we covered in our agentic RAG enterprise guide for Singapore and Hong Kong and realise the cost section was the part they should have read first.

The third is that APAC enterprises are paying a “geography tax” that US peers do not. Cross-region inference latency forces production teams onto a smaller set of regional endpoints. Data-residency rules under MAS, HKMA, IMDA, and PDPA shut down cheaper offshore options. Currency conversion and partner-cloud markups add 10–20% to underlying pricing. Most APAC token bills are inflated 15–30% relative to what a US team running the same workload would pay.

The good news is that the cost levers available in 2026 are far more powerful than they were in 2024. The teams that deploy them systematically routinely report 40–60% reductions, and a meaningful minority report 70–80%. The bad news is that none of these levers ship out of the box. Every one requires deliberate engineering, evaluation discipline, and a willingness to treat token cost as a product KPI rather than a back-office line item.

What LLM Cost Optimisation Actually Means in Production

Cost optimisation is not the same thing as “use a cheaper model.” A cheaper model with worse quality routes more requests to escalation paths, generates more retries, frustrates more users, and ultimately costs more in absolute terms. Real cost optimisation is a portfolio of decisions across six dimensions: which model you call, when you call it, how much you send it, how much you ask it to send back, whether the same call has been made before, and where the compute physically runs.

A useful working definition for an APAC enterprise is this: LLM cost optimisation in production is the discipline of reducing the cost per successful task — not per token, not per request, per successful task — without degrading user-perceived quality, latency budgets, or regulatory posture. The “successful task” part matters because it forces you to define what success is. For a compliance copilot, it might be “first-pass acceptance rate of the suspicious activity report.” For a customer-service assistant, it might be “containment rate without human escalation.” Cost-per-task is the only metric a CFO and a head of engineering will ever both agree on.

With that definition in place, the rest of this guide is a tour of the levers and how they combine.

The Seven Cost Levers Every Production Team Should Pull

These are listed roughly in order of return on engineering effort. The first three are quick wins. The next two require sustained discipline. The last two are structural decisions that take a quarter or more to land.

1. Prompt Caching for Stable Prefixes

Prompt caching is the single highest-ROI lever available in 2026. Both Anthropic and OpenAI now offer it, and it changes the unit economics of any workload with a stable prompt prefix more aggressively than any other technique.

The mechanism is straightforward. The model provider caches the processed key-value (KV) state of a stable prompt prefix — typically your system prompt, tool definitions, and long static context. Subsequent calls that share that prefix reuse the cached state and pay a fraction of the normal input rate. Anthropic charges 10% of the normal input rate on cache reads (a 90% discount) with a small write surcharge of 1.25x on the first call. OpenAI automatically caches stable prefixes above 1,024 tokens and bills cached reads at 50% of normal input rate.

For a RAG pipeline with a 12,000-token system prompt and tool definition block, prompt caching converts the dominant cost component into background noise. Real-world production deployments routinely report 50–70% savings on input-token cost from caching alone. One frequently cited case study from a developer-tools company showed a 59% overall LLM cost reduction from caching, climbing to 70% after several rounds of optimisation. Anthropic’s published data suggests 30–50% cost reductions are typical on agent loops and RAG pipelines.

The technique works only if your prompt is structured for caching. The placement principle is to put the most stable content first (system prompt), followed by tool definitions, then long static context, then slowly changing context, with the current user message last. Teams that randomise the order of context blocks or interleave dynamic timestamps inside their system prompt invalidate the cache on every call and capture none of the savings.

2. Model Routing and Cascading

Most production workloads do not need a frontier model for every request. The hard distribution of difficulty in a real chatbot or agent looks roughly like a power law: 60–80% of requests are simple and can be answered correctly by a cheap model, 15–30% are moderate, and 5–10% genuinely require the strongest model available.

Two architectures exploit this. Routing classifies each request before calling any model and sends it to one model based on the classification. Cascading calls the cheap model first, scores the output against a quality check, and escalates to a stronger model only when the cheap output fails. Research using RouteLLM-style approaches shows that maintaining 95% of frontier-model quality while sending 85% of queries to cheaper models is achievable, with 45–85% cost reductions depending on workload shape. By 2026, 37% of enterprises run five or more models in production routed by this kind of logic.

The pricing math is what makes this work. Frontier-tier models in 2026 cost roughly $3–5 per million input tokens and $15–25 per million output tokens. Mid-tier models like Claude Sonnet or GPT-4o sit at $2.50–3 input and $10–15 output. Lightweight models like Claude Haiku 4.5 or GPT-4o Mini sit at $0.15–1.00 input and $0.60–5.00 output. A workload that lands 80% of requests on the lightweight tier and 20% on the mid-tier pays an effective blended rate that is 70–85% lower than the same workload running entirely on frontier models.

The trap to avoid is undisciplined cascading. A naive cascade that re-runs every failed cheap call through a frontier model can end up paying for both calls and being slower than the original architecture. Production cascades need calibrated quality thresholds, latency budgets that account for the worst case, and clear stop conditions.

3. Batch API for Asynchronous Workloads

Both Anthropic and OpenAI offer a Batch API that processes requests asynchronously within 24 hours at a flat 50% discount across every model in the catalogue. This is the most underused lever in the 2026 toolkit.

The discount applies with no quota negotiation, no contract amendment, and no quality difference — the same model produces the same output, just queued. Any workload that does not require real-time response should be routed through batch by default. In practice that includes nightly document ingestion, weekly market-research summarisation, end-of-day compliance reconciliation, embedding refreshes, evaluation runs, content backfills, and synthetic data generation. Teams that build a batch-first architecture for everything that can tolerate a 24-hour SLA typically shave 15–25% off their total token bill.

A small architectural pattern helps: a “tier-classification” middleware in front of the LLM gateway that tags each incoming job with a latency budget (interactive, near-real-time, or batch) and routes it accordingly. The same business logic produces a 50% bill reduction on every job that did not need to be interactive but was treated as if it did.

4. Context Window Discipline

The cheapest token is the one you do not send. The single biggest cost-amplification mistake we see in APAC production deployments is the unbounded growth of prompts and context windows. Three patterns recur.

The first is the unbounded conversation history. Chatbot architects often pass the entire conversation transcript on every turn. By turn fifteen, the user message is 200 tokens and the prompt is 18,000. The fix is a sliding-window or summary-compression strategy: keep the last N turns verbatim and replace older turns with a model-generated summary. Cost reduction from this single change is typically 30–50% for active chat workloads.

The second is the over-retrieval RAG pipeline. Teams often default to retrieving 20 chunks and feeding all of them into the prompt because “more context is safer.” It is not. It is more expensive and often worse, because long-context attention dilutes signal. A well-tuned retrieval pipeline returning 5–8 highly relevant chunks routinely outperforms a noisy 20-chunk pipeline at half the input cost. Our GenAI pilot-to-production playbook for regulated industries covers retrieval evaluation in depth, including the specific metrics to track before tuning your retrieval count.

The third is the runaway system prompt. Product managers keep stacking instructions to fix edge cases until the system prompt is 6,000 tokens of historical patches. Most of those instructions can be consolidated or moved into deterministic pre- and post-processing logic. A system-prompt audit every quarter, with measurable cost-per-call before and after, is one of the highest-ROI engineering rituals a production AI team can run.

5. Output Token Compression

Output tokens cost three to five times more than input tokens at every major provider. Compressing output is therefore disproportionately valuable.

The technique is mostly about prompt design. Asking for structured output in the smallest viable format (JSON over prose, codes over verbose descriptions, ID references over full repetition of source text) routinely cuts output token volume by 30–60%. Asking for explicit length limits in the prompt (“respond in one paragraph of at most 80 words”) works for narrative outputs. Using a smaller, faster model for the output-formatting step after a larger model has done the reasoning is another pattern that compounds with model routing.

Output compression has a quality dimension that input compression does not. Compressed outputs can read clinical or robotic. Production teams should A/B-test compressed and uncompressed variants against user satisfaction, not just token count.

6. Quantization and Open-Source Hosting (Selective)

For workloads with predictable shape, regulatory pressure to keep inference on-premise, or extreme volume, self-hosting an open-source model at INT4 or INT8 quantization becomes economically rational at a known crossover point.

Branch8 and other APAC infrastructure analyses through 2025 and 2026 have published the regional benchmarks. On AWS Singapore, AWS Tokyo, and Alibaba Cloud Hong Kong, INT4 quantization combined with spot instances reduces inference cost 77–82% compared to on-demand FP16 deployments. The headline number is that self-hosted Singapore inference for a quantized open-source model lands around USD 12–15 per hour on spot, versus USD 65 or more per hour for a two-GPU FP16 on-demand setup.

The catch is that quantization has accuracy implications, and self-hosting has an operating-cost component the spreadsheet rarely captures upfront. Oracle’s production deployment case studies for code-generation tasks reported 5–8% accuracy degradation on HumanEval from aggressive INT4 quantization, which had to be recovered with task-specific calibration. And the engineering load of running production-grade inference (autoscaling, batching, observability, model updates) is real. Self-hosting is the right answer when monthly token volume crosses roughly one to two billion tokens, when data residency forces in-region compute, or when both. Below that threshold, the API economics usually still win.

7. Provisioned Throughput and Reserved Capacity

For predictable, high-volume workloads, both AWS Bedrock and Azure OpenAI offer provisioned-throughput options that swap per-token billing for a fixed hourly rate. The crossover is workload-dependent, but at sustained throughput above roughly 10–20 million tokens per hour, provisioned capacity often wins on a per-token basis and improves tail latency at the same time.

In APAC, provisioned throughput has the additional benefit of pinning inference to a specific in-region endpoint, which simplifies the residency story for MAS- and HKMA-regulated workloads. Cross-region inference (CRIS) on Bedrock is now available across Thailand, Malaysia, Singapore, Indonesia, and Taiwan for the latest Claude Opus, Sonnet, and Haiku models, which means an APAC enterprise can get the resilience and throughput characteristics of US-style deployments without exfiltrating data outside the region.

The tradeoff is commitment. Reserved capacity is paid for whether you use it or not. Teams that overcommit on provisioned throughput in a launch month and then watch usage flatline below the reservation line have paid the cost of the optimisation without realising the benefit. A 90-day usage curve is the minimum baseline before committing to reserved capacity.

Comparing the Cost Levers: Impact, Latency, and Effort

A pragmatic way to sequence the work is to map each lever against three axes: realistic cost impact, latency impact (helpful or harmful), and engineering effort to land. The table below is the version Sthambh uses with clients to plan a 90-day cost-reduction sprint.

Lever Typical Cost Reduction Latency Impact Engineering Effort When to Prioritise
Prompt caching 30–70% on input cost Reduces latency 30–80% on cache hits Low (prompt restructuring) Any workload with stable system prompt or tool definitions
Model routing and cascading 45–85% on total bill Improves median, may hurt tail Medium (classifier + evaluation) Heterogeneous workloads with mixed difficulty
Batch API 50% on any batchable workload 24-hour SLA Low (job queue middleware) Async ingestion, evaluation, content generation
Context window discipline 30–50% on input cost Reduces latency proportionally Medium (summarisation + retrieval tuning) Long conversations, over-retrieving RAG
Output token compression 15–30% on output cost Improves latency Low (prompt + schema) Output-heavy workloads (summaries, drafts)
Quantized self-hosting 60–80% at sufficient scale Variable High (full inference stack) Above 1–2 billion tokens/month, residency constraints
Provisioned throughput 10–30% above sustained volume Improves tail latency Low to medium (capacity planning) Predictable, high-volume APAC workloads

A team with no prior cost work in place can realistically pull levers 1, 3, and 4 inside the first 30 days and capture 35–50% of total savings. Levers 2 and 5 are weeks four through ten. Levers 6 and 7 are the structural ones that pay off in the second quarter.

How MAS, HKMA, IMDA, and PDPA Rules Constrain Your Optimisation Choices

APAC enterprises operate under a tighter set of data-residency and outsourcing rules than US peers, and these rules shape which levers are realistically on the table.

For MAS-regulated entities in Singapore, the Outsourcing Guidelines and the Technology Risk Management Guidelines impose due-diligence and notification obligations on material outsourcing of customer or transaction data, including to AI model providers. The MAS Notice 644 cross-border data flow requirements and the broader PDPA framework constrain where inference can run. In practice, most MAS-regulated workloads now require inference inside the AWS Singapore, Azure Singapore, or Google Cloud Singapore regions, and the choice of model is filtered by what is actually available in those regions on Bedrock, Azure OpenAI, or Vertex AI.

For HKMA-regulated banks in Hong Kong, similar logic applies under the SA-2 outsourcing guidance and the SPM module on AI risk management. Cross-region inference into the HKMA-supervised perimeter is now possible via the regional CRIS configuration on Bedrock and equivalent capabilities on Azure, but offshore routing to mainland-hosted inference APIs is generally off the table for PDPO-regulated customer data, even when the pricing is attractive.

For IMDA-aligned AI deployments under the Model AI Governance Framework, the operational expectation is provenance and auditability across the inference chain. Cost optimisation that routes the same workload through different models at different times without an audit trail will fail an MAGF review. Any routing or cascading architecture needs request-level logging of which model produced which output, which our practitioner’s guide to Singapore’s agentic AI governance framework breaks down in detail for engineering teams.

For multi-jurisdiction APAC deployments — for example, a Singapore-headquartered bank serving Hong Kong and Indonesian customers — the optimisation strategy has to be partitioned by jurisdiction. A single model-routing rule that works in Singapore may breach residency requirements in Indonesia under PP71/2019 or in Vietnam under Decree 53/2022. The clean architecture is a regional gateway that enforces residency policy before the routing decision, not after.

The net effect is that APAC cost optimisation has a smaller option space than the US equivalent, but the optimisations that survive the residency filter are still powerful. The 40–60% reduction range cited in this guide is a real-world figure observed across MAS- and HKMA-regulated deployments, not a US benchmark imported into an APAC context.

Implementation Roadmap: From Cost Audit to Production Savings in 90 Days

A disciplined 90-day cost-reduction sprint follows a predictable shape. The phases below are the version Sthambh runs with mid-market and enterprise clients in Singapore and Hong Kong.

Phase 1: Audit and Instrumentation (Weeks 1–2)

You cannot optimise what you do not measure. The first two weeks are about instrumenting the production stack to expose where the cost actually is.

The deliverables are a cost-per-task baseline across the top three production workloads, a token-volume breakdown by model and by call type, a histogram of prompt sizes and output sizes, and a list of every workload that is currently real-time but does not need to be. The instrumentation usually requires adding a thin observability layer in front of the LLM provider (LiteLLM, OpenRouter, or an internal gateway) that captures structured per-call telemetry. Teams that try to do this with provider dashboards alone usually miss the largest savings, because the dashboards are billing-aware, not workload-aware.

The audit also exposes the things you did not know were there. Forgotten background jobs that call frontier models in a loop. Stale evaluation harnesses left running. Debug prompts shipped to production with verbose chain-of-thought enabled. We routinely find 5–15% of total spend in these “ghost workloads” inside the first week of an engagement.

Phase 2: Quick Wins (Weeks 3–6)

The next four weeks land the three low-effort, high-impact levers: prompt caching, batch API migration, and the worst context-window offences.

The order matters. Prompt caching is first because it requires only prompt restructuring, the savings are visible in the next invoice cycle, and the work overlaps cleanly with whatever observability the audit has just introduced. Batch API migration is second because it is purely an infrastructure change with no quality risk. Context-window cleanup is third because it requires evaluation discipline, which the team will have built up while implementing prompt caching.

Realistic outcomes from Phase 2: 30–45% total cost reduction, no degradation in quality metrics, and a much clearer picture of the structural problems that remain.

Phase 3: Structural Optimisation (Weeks 7–12)

The final six weeks tackle model routing, cascading, and output compression. These are the levers that require sustained engineering work and a real evaluation harness, because misclassified requests or aggressive output limits can silently degrade quality.

The deliverables are a production model router with calibrated quality thresholds, a cascading architecture for the workloads where it makes sense, output-schema discipline across at least the top five workloads, and a regression test suite that prevents the next code change from re-inflating the bill. By end of Phase 3, the cumulative reduction across all three phases typically lands in the 50–65% range against the original baseline.

What does not fit in 90 days is structural self-hosting. Quantized open-source inference is a Q2 decision after the API-side optimisations have run their course. Most APAC enterprises do not need it. The ones that do — high-volume, residency-constrained, predictable workload — should treat it as a separate three-to-six-month programme, not an extension of the cost-reduction sprint.

Real-World APAC Examples of Production Cost Optimisation

The patterns above land differently in different verticals. A few examples from the kind of deployments common across Sthambh’s APAC client base.

A Singapore mid-market bank running a compliance copilot for transaction-monitoring analysts was paying USD 180,000 a month in token cost by Q4 2025. The pilot used a frontier model for every call, no caching, and a 20-chunk RAG retrieval over the sanctions and PEP database. A 90-day optimisation sprint introduced prompt caching on the system prompt and sanctions context (47% reduction on input cost), cascaded to a mid-tier model for the 70% of queries that were routine classification (additional 25% reduction), and tightened retrieval from 20 chunks to 6 with a re-ranker (another 12%). Total bill at the end of Phase 3 was USD 71,000 a month — 60% lower — with first-pass analyst acceptance rate unchanged.

A Hong Kong insurer running a customer-service assistant for policy enquiries had a different problem. Their bill was not the issue. Their P99 latency was. By caching the policy-document context and routing simple FAQ-style queries to Haiku-tier, they cut both cost (38%) and P99 latency (52%). The cost saving was a side-effect of a latency optimisation, which is the more honest framing for many enterprise programmes.

A Jakarta-based logistics platform was running a multi-agent shipment-exception system under Indonesian PP71/2019 residency rules. They could not self-host (no in-house ML engineering depth) and could not route offshore (residency). Their optimisation was almost entirely about reducing the number of agent loop iterations: more deterministic pre-processing of shipment data before the agent saw it, tighter tool definitions, and a hard cap of six loop iterations with a deterministic fallback. Cost per resolved exception dropped 44%.

A Singapore healthcare provider deploying a clinical-summary assistant under PDPA and HSA constraints sat at the other end of the spectrum. Token volume was small (under 50 million per month) and the value of getting a summary wrong was high. Their optimisation was not about cheaper models. It was about prompt caching the patient-record context and batching overnight summary generation for next-morning ward rounds. The savings were modest in absolute terms (around USD 8,000 per month), but the latency improvement on the interactive path was the real win.

The common thread across these four cases is that none of them switched to a cheaper model as the primary lever. Cheaper models played a role in the bank and the insurer, but the structural savings came from caching, batching, retrieval discipline, and loop control. This pattern repeats consistently in the APAC market we work in.

Common Failure Modes That Erase Your Savings

Cost optimisation programmes that look successful at the end of the sprint often regress within two quarters. The recurring failure modes are worth naming explicitly so they can be designed against.

The first is the absence of a regression test suite. Without an automated cost-per-task harness running in CI, the next product manager who adds a paragraph to the system prompt or the next engineer who bumps the retrieval count “to be safe” silently un-optimises everything. The harness does not need to be elaborate, but it does need to exist and to block deploys that exceed budget.

The second is treating routing as a static decision. The right routing thresholds in May are the wrong thresholds in November as models, prices, and user behaviour shift. Production routers need a recalibration cadence — typically quarterly — that re-evaluates threshold accuracy against fresh data.

The third is the “free upgrade” trap. When a provider releases a newer, faster, smarter model at the same price as the older one, teams quickly migrate to it. The migration is usually correct on quality grounds, but the cache they spent weeks tuning is invalidated and the unit economics shift. Provider-version changes need to be treated as cost events as well as quality events.

The fourth is uncontrolled agent sprawl. A single optimised agent that gets cloned into seventeen variants by seventeen teams without shared infrastructure ends up with seventeen separate cost problems. Our analysis of the broader operational risks of unmanaged agent sprawl in APAC enterprises covers the governance side; the cost dimension is that every additional agent is a new prompt, a new cache strategy, and a new opportunity to under-route. Centralising the gateway, the routing logic, and the cache configuration across agents prevents the sprawl from compounding into a sprawl-shaped bill.

The fifth is taking your eye off output costs. Most cost dashboards focus on input tokens because they dominate volume. Output tokens dominate cost on a per-token basis. A workload that doubles its output length to “be more helpful” can easily double its bill while the input volume stays flat.

How Sthambh Helps APAC Enterprises Cut LLM Production Costs

Sthambh works with mid-market and enterprise teams across Singapore, Hong Kong, and the broader APAC region to take production LLM deployments from “expensive and hard to defend” to “predictable and CFO-ready.” The work is engineering-led, not consulting-deck-led. We instrument, measure, redesign, and ship.

A typical engagement is the 90-day sprint described above: a two-week audit and instrumentation phase, a four-week quick-wins phase covering prompt caching and batch migration, and a six-week structural phase for routing, cascading, and output discipline. We bring the gateway scaffolding, the evaluation harness, and the FinOps reporting templates that a small in-house team would otherwise spend a quarter building from scratch. The deliverable is a measurable cost-per-task reduction (typically 40–60% on Phase 3 exit), a regression suite that prevents reversion, and an internal handover so your team owns the optimised system, not us.

For teams who want to go further, we run vector and retrieval audits, agentic-loop redesigns, and selective self-hosting evaluations for workloads that have crossed the API-versus-self-host crossover. Our broader LLM integration playbook for enterprise teams in 2026 covers the full lifecycle for teams that are earlier in their production journey.

If you have a live LLM deployment, a token bill that is growing faster than your traffic, and a CFO asking pointed questions, the next sensible step is a 30-minute RAG and LLM cost readiness call. Book the readiness call and we will walk through your current spend, the highest-leverage levers for your workload shape, and what a realistic 90-day savings target looks like in your specific APAC regulatory context.

FAQs

Q. What is a realistic cost reduction target for a production LLM workload in APAC?

A. For most APAC enterprises with no prior cost-optimisation work, 40–60% reduction over 90 days is the typical range observed across MAS- and HKMA-regulated deployments. Teams with heavily redundant prompts, long conversations, or no prior batch-API usage frequently land at the higher end. Teams that have already implemented prompt caching and batching may see 15–25% from the remaining levers.

Q. Does prompt caching work for RAG pipelines, or only for stable chat prompts?

A. It works very well for RAG pipelines, often better than for general chat. The retrieved chunks are dynamic, but the system prompt, tool definitions, and any long static reference context (regulations, taxonomies, schemas) are usually stable across calls. Placing the dynamic retrieved chunks last in the prompt and everything stable above them is the structure that maximises cache hit rate. Cache savings of 50–70% on input cost for production RAG are common.

Q. How does MAS or HKMA regulation affect model routing decisions?

A. Routing itself is permitted, but every routing decision must be auditable. MAS Notice 644 and the broader Technology Risk Management Guidelines require that you can explain, on a per-request basis, which model produced which output and why. HKMA’s SA-2 and the SPM module on AI risk impose similar audit requirements. Practically, this means routing must be logged at request level, the routing logic must be documented and version-controlled, and any model used in the route must be in scope of your outsourcing notifications. Routing through models that are not in your approved vendor list is a compliance failure regardless of how much it saves.

Q. Should we self-host an open-source model to cut costs?

A. The crossover where self-hosting beats API economics is typically around one to two billion tokens per month, assuming you can keep the inference stack busy. Below that, the operating cost of running production inference (autoscaling, observability, model updates, on-call) usually exceeds the savings. Above that, or in cases where data residency forces in-region compute and the API options are limited, self-hosting becomes rational. INT4 quantization on AWS Singapore or Alibaba Cloud Hong Kong typically lands inference cost around USD 12–15 per hour on spot instances.

Q. How do I prevent cost regressions after the optimisation work is done?

A. Treat cost-per-task as a first-class production metric, alongside latency and quality. Implement an automated cost regression test in CI that runs the optimised prompts against a representative sample and fails the build if the cost-per-task exceeds budget. Recalibrate model-routing thresholds quarterly. Treat every model-provider version upgrade as a cost event that re-runs the evaluation harness. Centralise the gateway and routing logic across teams to prevent fragmented configurations from drifting.

Q. Is batch API usable for customer-facing workloads?

A. Not directly, because the SLA is up to 24 hours. The pattern is to use batch for everything that does not need to be interactive, which is usually more than teams expect. Document ingestion, embedding refreshes, evaluation runs, content backfills, overnight summarisation, and end-of-day reconciliation are all batchable. Routing these to batch frees up your interactive budget and the 50% discount applies on every job. A “tier-classification” middleware in front of your LLM gateway that tags each request with a latency budget is the cleanest way to land this.

Q. What is the right cost-per-task baseline to aim for?

A. It is workload-dependent, but the discipline is more important than the absolute number. For a compliance copilot handling transaction-monitoring alerts, well-optimised production teams target cost-per-resolved-alert in the USD 0.05–0.15 range. For a customer-service assistant, cost-per-contained-conversation in the USD 0.02–0.10 range is typical after optimisation. For a clinical summary tool, cost-per-summary in the USD 0.03–0.20 range is reasonable depending on document length. The number that matters is the trend, not the headline.

Q. How quickly can a team realistically capture meaningful savings?

A. Prompt caching and batch API migration are usually live in two to four weeks and visible in the next invoice cycle. Model routing and cascading take six to ten weeks because of the evaluation work required. Structural changes (self-hosting, provisioned throughput) take a quarter or more. A disciplined 90-day sprint typically lands 40–60% of total savings; the remaining headroom requires the structural quarter.

Picture of Nikhil Khandelwal
Nikhil Khandelwal

Co-founder & CTO, Sthambh

Let's Build Digital Excellence Together

Share This Article
case studies

See More Blog

Contact us

Partner with Us for Comprehensive IT

We’re happy to answer any questions you may have and help you determine which of our services best fit your needs.

Your benefits:
What happens next?
1

We Schedule a call at your convenience 

2

We do a discovery and consulting meeting 

3

We prepare a proposal 

Schedule a Free Consultation