How to Run a Model Routing Pilot Without Rebuilding Your AI Stack
21 April 2026 | By Ashley Marshall
You can run a model routing pilot by inserting a lightweight proxy layer (LiteLLM, OpenRouter, or RouteLLM) in front of your existing API calls. Start with two or three use cases, route on simple rules like task type or token budget, and measure cost and quality before touching any application logic.
Your LLM bill is climbing, your latency is erratic, and swapping one model for another feels like open-heart surgery on a running system. It doesn't have to be.
Why Your Single-Model Setup Will Eventually Break
Most AI teams start sensibly. They pick one model, usually the best they can justify on budget, and wire every use case through a single API endpoint. Customer support chatbot? Same model. Document summarisation? Same model. Spam classification? You guessed it. The logic is sound: one integration, predictable behaviour, faster shipping.
The problem arrives when that sensible decision starts costing real money. According to a 2025 survey by Kong Inc, 37% of enterprises now spend over $250,000 annually on LLM API costs, and 73% spend more than $50,000 a year. Those numbers are rising. The AI market is forecast to grow at a 79.8% compound annual rate through to 2030, and as usage grows with it, your costs only move in one direction if you do nothing about them.
The single-model approach also creates fragility that only shows up in production. When OpenAI suffered multi-hour outages during high-demand periods in 2024 and early 2025, teams with a single integration saw their entire AI layer go dark. There was no fallback, no alternative, just a blank screen and an apology email. For customer-facing products, that is an existential problem.
Beyond cost and resilience, there is a quality mismatch hidden in the single-model approach. You are paying supercomputer prices to solve calculator problems. A frontier model like GPT-4o or Claude 3 Opus makes sense for nuanced creative tasks or complex multi-step reasoning. It is dramatic overkill for extracting a postcode from a form submission or classifying whether a support ticket is about billing or a product bug. The price difference between a frontier model and a capable small model like GPT-4o mini or Haiku can be 20 to 100 times, with no meaningful quality difference on simple tasks.
This is the routing problem. It is not exotic infrastructure engineering. It is basic triage logic: send the right request to the right model at the right cost. The difficult part is doing it without rewriting your entire codebase, convincing your board to approve a platform migration, or taking three months off feature work. That is exactly what this guide addresses.
What Model Routing Actually Means in Practice
Model routing is decision logic that directs an incoming AI request to the most appropriate model based on the request's characteristics. That sounds abstract, so here is a concrete version: instead of every call going to one model, a thin layer sits between your application and the models, reads some properties of the request, and picks where to send it.
The confusion in the market is that people hear "routing" and imagine a complex ML system doing real-time semantic analysis of every prompt. That exists, and it is genuinely useful at scale, but it is not where you start. Most production routing is conditional logic. If the request is short and classification-like, use the cheap model. If it involves more than 2,000 tokens or the task type is flagged as complex, use the capable model. If the primary provider returns a 429 or 503, try the backup. This is not machine learning; it is a well-structured if-else block.
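The conditional logic described above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the model aliases, the Request fields, and the choice to default to the capable model are assumptions made for the example, and the 2,000-token threshold is simply the figure used in the paragraph above.

```python
from dataclasses import dataclass

# Hypothetical model aliases; substitute whatever names your proxy exposes.
CHEAP_MODEL = "small-model"
CAPABLE_MODEL = "frontier-model"

SIMPLE_TASKS = {"classification", "extraction"}
TOKEN_THRESHOLD = 2000  # threshold quoted in the text above


@dataclass
class Request:
    task_type: str                 # label supplied by the caller (an assumption)
    input_tokens: int              # token count estimated before the call
    flagged_complex: bool = False  # explicit override set by the caller


def choose_model(req: Request) -> str:
    """Pick a model alias using plain conditional rules."""
    # Long or explicitly complex requests always go to the capable model.
    if req.flagged_complex or req.input_tokens > TOKEN_THRESHOLD:
        return CAPABLE_MODEL
    # Short, classification-like work goes to the cheap model.
    if req.task_type in SIMPLE_TASKS:
        return CHEAP_MODEL
    # Anything unrecognised stays on the capable model (conservative default).
    return CAPABLE_MODEL
```

Defaulting unknown task types to the capable model is a deliberate choice here: a routing mistake then costs money rather than quality.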
IBM Research describes LLM routing as working "almost like an air traffic controller" that evaluates each query and dispatches it to the most appropriate model. That framing is useful because it makes clear what routing is not: it is not a replacement for models, and it is not a fundamental change to your product. It is orchestration logic that already belongs in a well-built AI stack.
In practice, routing decisions cluster around four dimensions. Cost and volume: route high-volume, low-value tasks to cheap models and reserve expensive models for tasks that justify the price. Latency: route interactive requests to fast models and background jobs to slower but more capable ones. Task complexity: route pattern-matching and extraction tasks to small models and nuanced reasoning to frontier models. Reliability: route to backup providers when the primary is slow or unavailable.
The key insight is that routing and quality are not in tension. FutureAGI research published in late 2025 found that organisations implementing systematic model selection and routing achieve a 30% reduction in LLM spend without any measurable drop in output quality. A December 2025 implementation guide by engineer Michael Hannecke documented 88% cost reduction using Ollama and LiteLLM together. You are not choosing between cheap and good; you are choosing between paying for capability you need and paying for capability you do not.
What this means in practice for a UK mid-market team: if you have a product that does any combination of chat, summarisation, classification, and generation, you almost certainly have two or three distinct routing tiers already hiding in your use cases. You just have not separated them yet.
The Proxy Approach: How to Add Routing Without Touching Your Application Code
The fastest path to a routing pilot is the proxy approach. You insert a routing layer between your application and the model APIs. Your application continues to send OpenAI-compatible API calls to a single endpoint. The proxy receives those calls, applies routing logic, and dispatches them to the appropriate backend model. Your application code does not change.
Three tools dominate this space in early 2026. LiteLLM is an open-source proxy that translates between provider APIs and provides a unified OpenAI-compatible interface. It supports over 100 model providers including OpenAI, Anthropic, Google Gemini, Azure OpenAI, and open-source models running on Ollama or a self-hosted vLLM instance. You define routing rules in a YAML configuration file and point your application at the LiteLLM proxy URL instead of the provider URL. Existing code requires no modification.
OpenRouter is a managed gateway service that provides a similar OpenAI-compatible API across hundreds of models. It is particularly useful for teams that do not want to manage infrastructure. You replace your API base URL with the OpenRouter endpoint, swap your provider key for an OpenRouter API key in the same bearer-token format, and gain access to routing across all major providers and dozens of open-source models. OpenRouter handles load balancing and fallback automatically.
RouteLLM, developed by the LMSYS team at UC Berkeley, takes a more sophisticated approach. It trains lightweight router models that predict which model is most likely to answer a given query correctly. In its published evaluations, RouteLLM reduces GPT-4-class calls by up to 40% while maintaining benchmark performance. It requires more setup than LiteLLM or OpenRouter but produces smarter routing decisions for teams with sufficient query volume to train on.
For a typical pilot, LiteLLM is the right starting point. It is self-hosted (important for UK organisations with data residency requirements under the UK GDPR and guidance from the ICO), well-documented, and actively maintained. A basic deployment takes under an hour. You run the proxy as a Docker container or directly as a Python process, write a config that maps a primary model and a fallback, and update your application's base URL environment variable. That is the entire change.
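To make the "one environment variable" claim concrete, here is a small sketch of an application-side helper that reads its base URL from the environment. It uses only the Python standard library and builds the request without sending it. The variable names LLM_BASE_URL and LLM_API_KEY, the helper name, and the default URL are assumptions for illustration; most teams will already have an equivalent setting in their OpenAI client configuration.

```python
import json
import os
import urllib.request

# Assumed default; in a single-provider setup this is the provider's own URL.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def build_chat_request(messages, model, env=os.environ):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    # The only routing-related setting the application knows about.
    base_url = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    api_key = env.get("LLM_API_KEY", "")
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Pointing LLM_BASE_URL at the proxy (for example a locally running instance) redirects every call through the routing layer, and unsetting it reverts to the direct provider endpoint.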
Azure AI Foundry also launched a native Model Router capability in 2025, which routes intelligently between Azure-hosted models. For organisations already inside the Azure ecosystem, this provides routing without any additional infrastructure at all. Microsoft's documentation describes it as being able to "swap in new LLMs as they emerge without rebuilding workflows", which captures the fundamental value proposition clearly.
Running the Pilot: A Five-Step Process
A routing pilot should be scoped tightly, run for two to four weeks, and produce clear data on cost, quality, and reliability. Here is a repeatable process that works without any architectural commitment.
Step one: Map your existing use cases by task type. Before touching any code, list every place in your product or workflow where you call an LLM. Categorise each by the task type (classification, extraction, summarisation, generation, reasoning), the typical input length in tokens, whether users wait for the response or it runs in the background, and the acceptable quality floor. This exercise usually takes two to three hours and immediately reveals which tasks are obviously overpowered and which genuinely need a frontier model. A UK financial services firm doing document ingestion and customer FAQ generation will typically find that 60 to 70% of their call volume is extraction or classification work that could run on a much cheaper model.
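The mapping exercise in step one fits comfortably in a spreadsheet, but a short script makes the "how much of our volume is simple work" question explicit. The sketch below is illustrative only: the field names, the example use cases, and every number in the inventory are made up.

```python
from dataclasses import dataclass

SIMPLE_TASKS = {"classification", "extraction"}


@dataclass
class UseCase:
    name: str
    task_type: str            # classification, extraction, summarisation, generation, reasoning
    typical_input_tokens: int
    interactive: bool         # True if a user waits for the response
    quality_floor: str        # free-text description of the acceptable minimum
    daily_calls: int


def simple_volume_share(use_cases):
    """Fraction of daily call volume that is classification or extraction work."""
    total = sum(u.daily_calls for u in use_cases)
    if total == 0:
        return 0.0
    simple = sum(u.daily_calls for u in use_cases if u.task_type in SIMPLE_TASKS)
    return simple / total


# Illustrative entries only; names and volumes are invented for the example.
inventory = [
    UseCase("ticket triage", "classification", 300, False, "matches human label", 6000),
    UseCase("postcode extraction", "extraction", 150, False, "exact match", 1000),
    UseCase("support chat", "generation", 1200, True, "no escalation needed", 2000),
    UseCase("contract review", "reasoning", 6000, True, "reviewed by a lawyer", 1000),
]
```

With this made-up inventory, simple_volume_share(inventory) comes to 0.7, meaning 70% of daily calls are candidates for a cheaper model.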
Step two: Instrument your current baseline. Before routing anything, add logging to capture the model used, token counts, latency, cost per request, and a simple quality signal (user rating, downstream error rate, or a spot-check sample). You need this baseline to prove what routing actually saves. Without it, you are guessing. Tools like Langfuse or Helicone provide observability for LLM calls with minimal integration effort.
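If you are not ready to adopt an observability tool, the baseline in step two can start as one structured log line per call. The following is a rough sketch under stated assumptions: the record fields mirror the list above, the model names are placeholders, and the prices are invented numbers that must be replaced with your providers' current price lists.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional

# Placeholder prices in USD per million tokens; NOT real price-list values.
PRICES_PER_MILLION = {
    "frontier-model": {"input": 5.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.00},
}


@dataclass
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    quality_ok: Optional[bool]  # spot-check or user rating; None until reviewed
    timestamp: float


def request_cost(model, input_tokens, output_tokens):
    """Cost of one call in USD from token counts and per-million prices."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def make_record(model, input_tokens, output_tokens, latency_ms, quality_ok=None):
    """Assemble a baseline record for one completed call."""
    cost = request_cost(model, input_tokens, output_tokens)
    return CallRecord(model, input_tokens, output_tokens, latency_ms,
                      cost, quality_ok, time.time())


def to_log_line(record):
    """Serialise a record as one JSON line for a log file or log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the quality signal in the same record as cost and latency matters later: it lets you compare all three per model once routing is switched on.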
Step three: Deploy a proxy with a simple routing rule. Start with one rule: if the request body contains a task_type hint of "classification" or "extraction", route to GPT-4o mini or Haiku. Everything else continues to the primary model. Deploy LiteLLM locally or as a container, update your base URL, and run this for one week. You will see cost savings immediately on the routed tasks.
Step four: Expand routing rules incrementally. Add a second rule based on token count: if input tokens plus the request's output limit (max_tokens) come to under 500, route to the small model. The actual output length is not known until the response arrives, so the limit stands in for it at routing time. Add a fallback rule: if the primary provider returns a 5xx error, retry on the backup provider. Add a latency rule: background jobs can use a slower but cheaper model. Each rule can be tested and rolled back independently. This is the critical advantage of the proxy approach: you can tune routing logic without deploying a new version of your application.
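A proxy such as LiteLLM implements the fallback rule for you through configuration, so you would not normally write it by hand. The sketch below only illustrates the behaviour, with providers passed in as plain callables. ProviderError and call_with_fallback are hypothetical names, and treating 429 alongside 5xx as retryable is an assumption that follows the examples given earlier in this guide.

```python
class ProviderError(Exception):
    """Raised by a provider callable when the upstream API returns an error."""

    def __init__(self, status_code):
        super().__init__(f"provider returned HTTP {status_code}")
        self.status_code = status_code


def is_retryable(status_code):
    # Rate limiting and server-side failures are worth retrying elsewhere.
    return status_code == 429 or 500 <= status_code <= 599


def call_with_fallback(prompt, providers):
    """Try each (name, send) pair in order; move on only for retryable errors."""
    last_error = None
    for name, send in providers:
        try:
            return name, send(prompt)
        except ProviderError as err:
            if not is_retryable(err.status_code):
                raise  # e.g. a 400: another provider will not fix a bad request
            last_error = err
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error
```

Returning the provider name with the response makes it easy to log how often the fallback actually fired during the pilot.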
Step five: Measure and decide. After two to four weeks, compare your cost per unit of work, latency distributions, and quality metrics against the baseline. In most cases you will see 20 to 40% cost reduction on overall API spend, improved average latency (because cheap models are often faster for simple tasks), and at least one provider incident absorbed by the fallback rule before it reached users. Present this data to your stakeholders and make a scoped decision about whether to extend routing to more use cases or formalise the proxy as permanent infrastructure.
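The comparison in step five can be reduced to a handful of numbers. As a hedged sketch, assume each logged call is a dict with cost_usd, latency_ms, and quality_ok keys (hypothetical field names), and that cost per request is an acceptable stand-in for cost per unit of work:

```python
from statistics import mean, median, quantiles


def summarise(records):
    """Summarise a list of call records (dicts) into headline metrics."""
    latencies = [r["latency_ms"] for r in records]
    return {
        "cost_per_request": mean([r["cost_usd"] for r in records]),
        "p50_latency_ms": median(latencies),
        # With n=20 the last cut point is the 95th percentile.
        "p95_latency_ms": quantiles(latencies, n=20)[-1],
        "quality_pass_rate": mean([1.0 if r["quality_ok"] else 0.0 for r in records]),
    }


def compare(baseline, pilot):
    """Express the pilot's metrics relative to the baseline."""
    b, p = summarise(baseline), summarise(pilot)
    return {
        "cost_reduction_pct": 100 * (1 - p["cost_per_request"] / b["cost_per_request"]),
        "p95_latency_change_ms": p["p95_latency_ms"] - b["p95_latency_ms"],
        "quality_change_pts": 100 * (p["quality_pass_rate"] - b["quality_pass_rate"]),
    }
```

A positive cost_reduction_pct alongside a quality_change_pts close to zero is the outcome the pilot is looking for; a large negative quality change is the signal to tighten the routing rules.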
Governance, Data Residency and the UK Regulatory Angle
Model routing is not just a cost and performance problem. For UK organisations, it also touches on data governance in ways that a single-provider setup makes easier to ignore. Once you start routing traffic between multiple model providers, you need to think explicitly about where your data goes and what your obligations are under the UK GDPR and any sector-specific regulations.
The ICO's guidance on AI and data protection makes clear that organisations must understand the processing activities of any third-party sub-processor, including AI model providers. If you route a customer query to OpenAI, that data is processed under OpenAI's terms. If you route the same class of query to Anthropic or Google, the processing terms differ. For most general business data this is manageable, but for organisations handling personal data, health information, or financial data under FCA requirements, routing to multiple cloud-hosted providers requires a data processing agreement review for each.
This is one of the genuine counterarguments to multi-provider routing: it complicates your data processing inventory. A single provider relationship is easier to audit and explain to your DPO. Adding three or four providers multiplies the compliance surface.
The practical mitigation is twofold. First, tier your routing by data sensitivity. Route queries that contain personal data only to pre-approved providers with signed DPAs, and route anonymised or synthetic data more freely. Second, consider self-hosted models for the highest-sensitivity tier. LiteLLM supports routing to locally hosted models running on Ollama or vLLM. For a regulated UK firm, routing classification and extraction tasks to a locally hosted Llama 3 or Mistral instance eliminates the third-party data transfer entirely for that tier. You get cost savings and compliance simultaneously.
The NCSC's guidance on AI security, updated in 2024, also emphasises the importance of understanding the attack surface of third-party AI services. Routing adds a proxy layer to your architecture, which needs its own security consideration: authenticate the proxy endpoint, ensure API keys are scoped per provider and not shared, and add rate limiting to the proxy to prevent abuse. LiteLLM includes all of these as configuration options.
What this means in practice: treat the routing pilot as the moment to get your AI data processing inventory in order. Map each use case to a data sensitivity level and document which providers are approved for each level. This work is valuable independent of routing; it is good AI governance hygiene that the ICO expects organisations to have in place. Routing just makes the gaps visible sooner.
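The sensitivity mapping described above can also be enforced in code rather than only documented. The sketch below assumes the application tags each request with a data classification and that you maintain an approval register per tier. The tier names, model aliases, and register contents are all hypothetical.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"        # anonymised or synthetic data
    INTERNAL = "internal"    # business data without personal data
    PERSONAL = "personal"    # personal, health, or financial data


# Hypothetical approval register: which targets may see which data tier.
APPROVED = {
    Sensitivity.PUBLIC: {"small-cloud-model", "frontier-cloud-model", "local-model"},
    Sensitivity.INTERNAL: {"frontier-cloud-model", "local-model"},
    Sensitivity.PERSONAL: {"local-model"},
}


def allowed_targets(sensitivity, preferred):
    """Keep the caller's preference order, dropping unapproved targets."""
    return [m for m in preferred if m in APPROVED[sensitivity]]


def route(sensitivity, preferred):
    """Return the first approved target, or refuse if none is approved."""
    targets = allowed_targets(sensitivity, preferred)
    if not targets:
        raise PermissionError(
            f"no approved model for {sensitivity.value} data among {preferred}"
        )
    return targets[0]
```

Failing closed with an exception, rather than silently falling back to an unapproved provider, keeps the approval register authoritative and gives you an auditable record of refused requests.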
The Counterargument: Is Routing Worth the Complexity?
There is a legitimate case against routing, and it deserves a direct answer rather than hand-waving. The core argument is this: you are adding infrastructure complexity, a new failure mode (the proxy itself), and operational overhead in exchange for cost savings that may not materialise at your current scale. For some teams, at some stages of growth, that trade-off is genuinely wrong.
LogRocket's February 2026 engineering analysis articulates this clearly: if your LLM costs are under $300 a month, routing is a waste of engineering time. The overhead of maintaining a proxy, managing multiple provider relationships, and debugging routing logic is not justified by the savings. The same analysis suggests that teams under around 10,000 LLM calls per day are usually better served by optimising prompts and reducing token counts before touching routing.
There is also the quality risk. Simple routing rules based on task type or token count will sometimes route a complex query to a cheap model. The cheap model will sometimes fail or produce a noticeably worse output. If you are routing a customer-facing feature, that degradation is visible to users. Teams that implement routing without a quality monitoring layer can end up saving money while silently degrading their product. The baseline instrumentation in step two of the pilot process exists precisely to catch this.
The counterargument also extends to vendor complexity. Every provider you add brings new API quirks, new rate limit behaviours, new pricing changes, and new terms of service to track. OpenAI, Anthropic, and Google all change their pricing and models frequently. In 2025 alone, GPT-4o mini pricing dropped twice and Google expanded Gemini Flash throughput, and Anthropic had released Claude 3.5 Haiku only months earlier, in late 2024. Keeping routing rules calibrated to actual current pricing requires ongoing attention.
The honest summary: routing is the right move if you have diverse use cases, meaningful call volume, and cost pressure. It is premature optimisation if you do not. The proxy approach described in this guide minimises the downside by keeping the complexity in the routing layer rather than the application. If the pilot does not produce compelling savings, you remove the proxy and revert to a single endpoint with a one-line environment variable change. The optionality cost is low; the information value is high. Run the pilot, measure what happens, then decide with data rather than assumptions.
Frequently Asked Questions
Do I need to change my application code to implement model routing?
No. With a proxy approach using LiteLLM or OpenRouter, you change only the base URL your application points to. The proxy receives your existing OpenAI-compatible API calls and handles routing transparently. Your application continues to call one endpoint with one API key format.
Which routing tool should I start with: LiteLLM, OpenRouter, or RouteLLM?
Start with LiteLLM if you want full control and self-hosting (important for UK data residency requirements). Use OpenRouter if you want a managed service with no infrastructure to run. RouteLLM is worth exploring once you have significant call volume and want ML-based routing decisions rather than rule-based logic.
How much can I realistically save with model routing?
Research from FutureAGI suggests 30% overall LLM spend reduction is achievable with systematic routing. Individual implementations have reported higher figures: an 88% cost reduction has been documented for workloads heavily weighted toward classification and extraction tasks routed to small local models. Your savings depend entirely on how much of your call volume is genuinely simple work being sent to an expensive model.
Does routing to cheaper models mean lower quality outputs?
Not for tasks where cheaper models perform comparably. For classification, extraction, structured data output, and short-form responses, small models like GPT-4o mini or Claude Haiku typically produce output that is hard to distinguish from frontier models at a fraction of the cost. Quality degradation only appears when you route genuinely complex reasoning tasks to underpowered models. The pilot process includes quality monitoring specifically to detect this.
What are the UK GDPR implications of routing to multiple model providers?
Each model provider you use as a sub-processor requires a data processing agreement (DPA). The ICO expects you to maintain a record of processing activities that includes all sub-processors. For personal data sent outside the UK, you need to ensure the transfer is covered by UK adequacy regulations or an appropriate safeguard such as the International Data Transfer Agreement. Routing to locally hosted models eliminates this concern for the tiers where it matters most.
How do I handle provider outages in a routing setup?
Configure a fallback provider in your routing layer. In LiteLLM, this is a two-line config change: define a primary model and a fallback model. When the proxy receives a 5xx error or timeout from the primary, it automatically retries on the fallback. This is one of the strongest arguments for routing: you gain resilience for free once the proxy layer exists.
How long does a model routing pilot typically take to show results?
Most teams see meaningful cost data within the first week if they have sufficient call volume. A full pilot with baseline comparison, routing rule iteration, and quality assessment typically takes two to four weeks. For teams with fewer than a few thousand daily calls, the pilot may need to run four to six weeks to accumulate enough data for statistically confident conclusions.
Can I use routing to comply with data residency requirements and still use cloud models?
Yes. You can build a routing tier that restricts personal data to Azure OpenAI (which offers UK region deployment) or AWS Bedrock (with UK regions available) while routing non-personal data to public API endpoints for cost efficiency. LiteLLM supports routing rules based on custom metadata you attach to each request, so you can tag requests by data classification at the application layer and route accordingly.