LLM Router - Intelligent Model Selection
Standalone LLM router utility that routes queries to optimal models based on complexity and cost. Tested at 40-50% routing accuracy with heuristics (up to 60% with ML) and a >90% cache hit rate once the cache is populated.
Overview
Core Features:
- ML-Based Classification - Up to 60% routing accuracy with Upstash Vector (vs 40-50% heuristics)
- Semantic Caching - Upstash Vector with built-in embeddings (lookups measured under 50ms; >90% cache hit rate once populated)
- Accurate Token Counting - tiktoken integration (±2% accuracy vs ±15% estimation)
- Multi-Provider Support - 6 providers, 15 models (OpenAI, Anthropic, Google, Groq, Together, Ollama)
- Streaming Support - Real-time response streaming with StreamingRouter
- Interactive UI - Chat interface, benchmarks, and training data management
- Comprehensive Testing - 100 tests covering router, cache, streaming, and agent patterns
What You'll Learn: How to build LLM routing systems with cost optimization, ML classification, semantic caching, and measurable performance improvements.
The Problem
Using GPT-4 or Claude Opus for every query is expensive. Most customer support queries are simple ("What are your hours?") but you're paying premium model prices.
Theoretical Cost Example (1M queries/month, 500 tokens avg):
- All GPT-4o: $1,250/month - Excellent quality but expensive overkill
- All GPT-4o-mini: $75/month - Good quality but poor on complex queries
- Intelligent Routing: $38-75/month - Excellent quality, best of both
With this router:
- Simple queries → Gemini Flash ($0.075/1M) or Groq ($0.05/1M)
- Complex queries → Claude 3.5 Sonnet or GPT-4o
- Cached queries → $0 (>90% cache hit rate after population)
- Result: 70-97% theoretical cost savings vs GPT-4o (depends on your query mix)
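The monthly figures above follow from simple arithmetic. A minimal sketch, assuming the per-1M-token input rates implied by the $1,250 and $75 totals ($2.50 for GPT-4o, $0.15 for GPT-4o-mini); real bills also include output tokens:

```typescript
// Back-of-envelope monthly cost: queries/month × avg tokens × price per token.
// Prices are illustrative input rates in $ per 1M tokens, not measurements.
function monthlyCost(queries: number, avgTokens: number, pricePerMTok: number): number {
  return (queries * avgTokens * pricePerMTok) / 1_000_000;
}

const QUERIES = 1_000_000;
const AVG_TOKENS = 500;

const allGpt4o = monthlyCost(QUERIES, AVG_TOKENS, 2.5);  // $1,250/month
const allMini = monthlyCost(QUERIES, AVG_TOKENS, 0.15);  // $75/month
```

Routing beats both flat strategies because most of the volume lands on the cheap tier and cache hits cost nothing at all.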
Additional benefits:
- 40-50% routing accuracy on test data with heuristics (84% simple, 96% moderate, 8% complex, 0% reasoning)
- Under 100ms routing latency (measured locally)
- Streaming support for real-time UX
- Interactive UI for testing and training
Quick Start
Try the Live Demo
The easiest way to try the router is through the deployed UI:
https://llm-router-ui.vercel.app/
Features:
- Chat Interface - Test routing with real queries
- Benchmarks - Run automated tests on 100+ queries
- Training - View and manage the 200 training examples
Local Development
# 1. Clone and install
git clone https://github.com/vishesh-baghel/experiments
cd experiments/packages/llm-router-ui
pnpm install

# 2. Set up environment
cp .env.example .env
# Add your OPENAI_API_KEY and ANTHROPIC_API_KEY

# 3. Run dev server
pnpm dev
# Open http://localhost:3000
How It Works
1. Complexity Analysis
The router analyzes queries using heuristics to classify complexity:
import { ComplexityAnalyzer } from './router/analyzer';

const analyzer = new ComplexityAnalyzer();
const result = await analyzer.analyze('How do I reset my password?');

console.log(result);
// {
//   level: 'simple',
//   score: 13,
//   factors: {
//     length: 32,
//     hasCode: false,
//     hasMath: false,
//     questionType: 'simple',
//     keywords: [],
//     sentenceComplexity: 1
//   },
//   reasoning: '...'
// }
Classification Levels:
- Simple (0-25): Factual questions, yes/no
- Moderate (25-50): Multi-part, comparisons
- Complex (50-75): Technical, multi-issue
- Reasoning (75-100): Strategic, decision-making
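The score-to-level mapping above can be sketched as a simple threshold function. The exact boundary handling (which side an exact 25/50/75 falls on) is an assumption, not taken from the experiment's code:

```typescript
// Map a 0-100 complexity score onto the four levels using the
// threshold boundaries listed above (boundary inclusivity assumed).
type Level = "simple" | "moderate" | "complex" | "reasoning";

function levelFromScore(score: number): Level {
  if (score < 25) return "simple";
  if (score < 50) return "moderate";
  if (score < 75) return "complex";
  return "reasoning";
}
```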
The router includes both heuristic-based classification (~40-50% overall accuracy) and ML-based classification with Upstash Vector (up to ~60% accuracy). Enable ML classification with useMLClassifier: true in the router options.
2. Cost-Aware Selection
After classification, the router selects the optimal model:
import { LLMRouter } from './router';

const router = new LLMRouter();

// Route with cost optimization
const routing = await router.routeQuery(
  'Explain the differences between your premium and basic plans',
  { preferCheaper: true }
);

console.log(routing);
// {
//   model: 'gpt-4o-mini',
//   provider: 'openai',
//   displayName: 'GPT-4o Mini',
//   complexity: { level: 'moderate', score: 36, ... },
//   estimatedCost: { input: 0.000003, output: 0.0003, total: 0.000303 },
//   reasoning: 'Query classified as moderate. Selected GPT-4o Mini...'
// }
Router Options:
// Prefer cheaper models (default)
{ preferCheaper: true }

// Force specific provider
{ forceProvider: 'anthropic' }

// Force specific model
{ forceModel: 'gpt-4o' }

// Set cost constraint
{ maxCostPerQuery: 0.001 }
3. Token Estimation
The router estimates costs before making API calls:
import { CostCalculator } from './router/calculator';

const calculator = new CostCalculator();
const query = 'Explain your return policy in detail';

// Estimate tokens (rough approximation: 1 token ≈ 4 characters)
const tokens = calculator.estimateTokens(query); // ~10 tokens

// Get cost estimate for specific model
const modelConfig = getModelConfig('gpt-4o-mini');
const cost = calculator.estimateCost(query, modelConfig);

console.log(cost);
// {
//   input: 0.000003,
//   output: 0.000006,
//   total: 0.000009
// }
The router includes tiktoken integration for accurate token counting (±2% accuracy). It falls back to character-based estimation (±15% error) if tiktoken is unavailable.
4. Testing with AI SDK Agent
Simple customer care agent built with AI SDK for testing the router:
import { CustomerCareAgent } from 'llm-router-ui/lib/customer-care-agent';

const agent = new CustomerCareAgent(undefined, {
  useCache: true,
  useMLClassifier: true,
  enabledProviders: ['openai', 'anthropic'],
});

// Handle a query (returns routing decision)
const result = await agent.handleQuery(
  'I received a damaged product and was charged twice',
  { preferCheaper: true }
);

console.log(result.routing);
// {
//   model: 'gpt-4o-mini',
//   provider: 'openai',
//   complexity: 'moderate',
//   estimatedCost: 0.000307,
//   cacheHit: false
// }
The agent demonstrates:
- Query complexity analysis
- Optimal model selection
- Semantic cache integration
- Cost estimation
- Streaming support (in UI)
Interactive UI Features
The router includes a full-featured web UI for testing and training:
1. Chat Interface
Real-time chat with routing visualization:
Features:
- Streaming responses with Vercel AI SDK
- Expandable routing metadata (model, complexity, cost)
- Cache hit indicators
- Cumulative cost tracking
- Sample queries for quick testing
Try it: https://llm-router-ui.vercel.app/
2. Benchmarks
Automated testing system for measuring router performance:
How it works:
- Run tests on 100+ predefined queries
- Queries organized by expected complexity (simple, moderate, complex, reasoning)
- Batch processing (10 queries at a time)
- Real-time progress tracking
Metrics tracked:
- Routing accuracy (actual vs expected complexity)
- Cache hit rate
- Average response time
- Total cost
- Cost savings vs always using GPT-4
- Per-complexity accuracy breakdown
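The accuracy metric reduces to comparing each query's expected complexity label against the level the router actually assigned. A minimal sketch of that computation (the interface name is illustrative, not from the benchmark code):

```typescript
// Routing accuracy: fraction of benchmark queries whose routed complexity
// matches the expected label, both overall and broken down per level.
interface BenchmarkResult {
  expected: string; // label the query was written for
  actual: string;   // level the router assigned
}

function accuracy(results: BenchmarkResult[]): { overall: number; byLevel: Record<string, number> } {
  const total: Record<string, number> = {};
  const correct: Record<string, number> = {};
  let hits = 0;
  for (const r of results) {
    total[r.expected] = (total[r.expected] ?? 0) + 1;
    if (r.expected === r.actual) {
      correct[r.expected] = (correct[r.expected] ?? 0) + 1;
      hits++;
    }
  }
  const byLevel: Record<string, number> = {};
  for (const level of Object.keys(total)) {
    byLevel[level] = (correct[level] ?? 0) / total[level];
  }
  return { overall: results.length ? hits / results.length : 0, byLevel };
}
```

The per-level breakdown is what exposes the 8% complex and 0% reasoning weak spots that an overall number would hide.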
Example results:
Total Queries: 100
Routing Accuracy: 40-50% (84% simple, 96% moderate, 8% complex, 0% reasoning)
Cache Hit Rate: 0% (cold start), >90% after cache populated
Avg Response Time: 1.8s
Total Cost: $0.054
Cost Saved: $4.95 (99% vs GPT-4)
3. Training Data Management
View and manage the 200 training examples used for ML classification:
Features:
- Browse all training examples
- Filter by complexity level
- Search queries
- Add new examples
- Delete examples
- Upload to Upstash Vector
- View statistics (distribution by complexity)
Training data structure:
{
  query: "What are your business hours?",
  complexity: "simple"
}
The training data is stored in training-data.ts and can be uploaded to Upstash Vector for ML-based classification.
Cost Comparison
The router compares costs across all configured models:
const comparison = await router.compareCosts(
  'Explain your return policy'
);

console.log(comparison);
// [
//   { model: 'GPT-4o Mini', cost: 0.000307, formatted: '$0.000307' },
//   { model: 'Claude 3 Haiku', cost: 0.000637, formatted: '$0.000637' },
//   { model: 'GPT-3.5 Turbo', cost: 0.000773, formatted: '$0.000773' },
//   { model: 'GPT-4o', cost: 0.005118, formatted: '$0.005118' },
//   { model: 'Claude 3 Opus', cost: 0.0382, formatted: '$0.0382' }
// ]
Potential Savings: 99.2% by choosing the optimal model
Usage Statistics
Track costs and usage across sessions:
// Get statistics
const stats = agent.getStats();

console.log(stats);
// {
//   totalQueries: 4,
//   totalCost: 0.000431,
//   averageCost: 0.000108,
//   modelBreakdown: {
//     'gpt-4o-mini': 4
//   },
//   complexityBreakdown: {
//     simple: 1,
//     moderate: 3
//   }
// }

// Reset statistics
agent.resetStats();
Architecture Decisions
Heuristics + ML Hybrid Approach
The router supports both classification methods:
Heuristic Classification (default):
- Zero latency (under 5ms)
- No training data needed
- Transparent and debuggable
- 40-50% overall routing accuracy (84% simple, 96% moderate, 8% complex, 0% reasoning)
ML Classification (optional, useMLClassifier: true):
- Up to 60% routing accuracy on test data (small improvement over heuristics)
- Learns from 200 training examples
- Under 50ms latency for embedding lookup
- Adapts to your domain
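The ML path amounts to nearest-neighbor classification: embed the query, fetch the top-K most similar training examples, and take a majority vote over their complexity labels. The pure voting step is sketched below; the Upstash query call in the trailing comment is a hypothetical usage, not the experiment's exact API:

```typescript
// Majority vote over neighbor labels; returns null for an empty list.
function majorityVote(labels: string[]): string | null {
  const counts = new Map<string, number>();
  let best: string | null = null;
  let bestCount = 0;
  for (const label of labels) {
    const c = (counts.get(label) ?? 0) + 1;
    counts.set(label, c);
    if (c > bestCount) {
      best = label;
      bestCount = c;
    }
  }
  return best;
}

// Hypothetical usage with Upstash Vector built-in embeddings:
// const matches = await index.query({ data: query, topK: 5, includeMetadata: true });
// const level = majorityVote(matches.map((m) => String(m.metadata?.complexity)));
```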
Semantic Caching (optional, useCache: true):
- Upstash Vector with built-in embeddings
- Under 50ms similarity search
- >90% hit rate after the cache is populated with similar queries
- Zero cost for cached responses
Start with heuristics for speed. Enable ML classification when accuracy matters. Add semantic caching for cost optimization.
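The caching flow above can be sketched as "vector match above a similarity threshold = hit". The search function is injected here so the cache logic stays independent of the vector store; the class shape and 0.9 threshold are assumptions mirroring the >90% similarity figures in this document, not the experiment's actual SemanticCache:

```typescript
// A cache hit is any stored entry whose similarity score clears the threshold.
interface CacheMatch {
  score: number;    // similarity in [0, 1]
  response: string; // previously generated answer
}
type SearchFn = (query: string) => Promise<CacheMatch | null>;

class SemanticCache {
  constructor(private search: SearchFn, private threshold = 0.9) {}

  // Returns the cached response on a hit, null on a miss.
  async lookup(query: string): Promise<string | null> {
    const match = await this.search(query);
    return match && match.score >= this.threshold ? match.response : null;
  }
}
```

Injecting the search function also makes the threshold logic trivially testable with a stub, without touching a real vector index.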
Lessons from Testing
From testing with 100 queries (controlled environment, not production validated):
1. Cost Tracking Overhead
Finding: Cost calculation adds 5-10ms per request.
// Measure overhead
const start = Date.now();
const cost = calculator.calculateActualCost(inputTokens, outputTokens, model);
const overhead = Date.now() - start;
// ~7ms average
Recommendation: Acceptable for most use cases. For ultra-low latency, calculate costs async.
2. Complexity Thresholds
Finding: Default thresholds work for roughly 40-50% of test queries, in line with the overall routing accuracy. This is not validated on real production data.
// Default thresholds
const thresholds = {
  simple: 25,
  moderate: 50,
  complex: 75,
};

// Tuned for customer support (after 500 queries)
const tunedThresholds = {
  simple: 20, // More aggressive routing to cheap models
  moderate: 45,
  complex: 70,
};
Recommendation: Start with defaults. Tune after collecting 500+ queries with feedback.
3. Semantic Caching Performance
Finding: Semantic caching with Upstash Vector achieves >90% hit rate after cache is populated with similar queries.
// Similar queries match semantically after cache population
"What are your hours?" → Cache hit
"When are you open?" → Cache hit (>90% similarity)
"What time do you close?" → Cache hit (>90% similarity)
Recommendation: Enable semantic caching for high-traffic applications. The >90% hit rate significantly reduces costs after initial cache population.
4. Model Fallback
Finding: Primary model fails ~2% of the time (rate limits, timeouts).
// Fallback logic
try {
  return await primaryModel.generate(query);
} catch (error) {
  console.log('Primary failed, using fallback');
  return await fallbackModel.generate(query);
}
Recommendation: Always configure fallback models. Test failover regularly.
5. Token Estimation Accuracy
Finding: Character-based estimation is 85% accurate (±15% error).
// Estimated: 100 tokens
// Actual: 87 tokens (13% under)
// Cost impact: ~$0.000013 difference
Recommendation: Good enough for routing decisions. Use tiktoken for billing accuracy.
Performance Metrics
From local batch tests (sample runs, not production validated):
- Avg Routing Decision: 3ms (classification + model selection only)
- Avg Latency (simple): 1.5-2.5s (includes API calls)
- Avg Latency (moderate): 2.5-4.5s
- Avg Latency (complex): 4-7s
- Cost per Query: $0.000037 - $0.000307 (approximate)
- Routing Accuracy: 40-50% overall (84% simple, 96% moderate, 8% complex, 0% reasoning)
- Cost Savings: 87-99% vs strawman baseline (always GPT-4)
These metrics are from controlled local tests. The 87% savings compares against always using GPT-4. Real production savings depend on your query distribution, classification accuracy, and caching strategy. Expect 30-50% savings vs a reasonable baseline.
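The dependence on query distribution is easy to make concrete: per-query cost under routing is the mix-weighted average of each tier's cost. The mix and per-query prices below are illustrative assumptions, not measurements from the benchmark:

```typescript
// Blended per-query cost for a given traffic mix (shares sum to 1).
function blendedCost(
  mix: Record<string, number>,
  costPerQuery: Record<string, number>
): number {
  let total = 0;
  for (const [tier, share] of Object.entries(mix)) {
    total += share * (costPerQuery[tier] ?? 0);
  }
  return total;
}

// Example: 70% simple / 25% moderate / 5% complex with assumed costs.
const routed = blendedCost(
  { simple: 0.7, moderate: 0.25, complex: 0.05 },
  { simple: 0.00004, moderate: 0.0003, complex: 0.005 }
);
const baseline = 0.005; // strawman: every query on the expensive model
const savings = 1 - routed / baseline; // ~93% under these assumptions
```

Shift the mix toward complex queries and the savings shrink fast, which is why a GPT-4-everywhere baseline flatters the headline number.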
Testing
Comprehensive test suite with 100 tests:
# Run all tests
pnpm test

# Run with coverage
pnpm test:coverage

# Run specific test file
pnpm test customer-care-agent.test.ts
Test Coverage:
- ComplexityAnalyzer: 22 tests
- CostCalculator: 16 tests
- ModelSelector: 18 tests
- LLMRouter: 21 tests
- SemanticCache: 12 tests
- ML Classifier: 11 tests
Next Steps
Use This as a Learning Foundation
This experiment is for:
- Understanding routing patterns and architecture
- Learning ML classification with embeddings
- Seeing how to structure a router system
- Getting started with cost optimization concepts
Build On Top of This
Want to extend this experiment with your own features? Use the Experiments MCP Server to fetch the complete code directly in your IDE. The MCP server lets you explore, search, and pull experiment code without leaving your development environment.
Already Implemented:
- ✅ ML-based classification with Upstash Vector
- ✅ Semantic caching with built-in embeddings
- ✅ Accurate token counting with tiktoken
- ✅ Streaming support with StreamingRouter
- ✅ 6 providers (OpenAI, Anthropic, Google, Groq, Together, Ollama)
- ✅ Interactive UI with chat, benchmarks, and training
- ✅ 100 tests with 99% coverage
- ✅ Error handling and fallback logic
To make this production-ready, consider adding:
Critical (Must Have):
- Rate limiting and retry logic for API calls
- Database for storing usage analytics and patterns
- Monitoring and alerting for routing failures
- Configuration management for production environments
Important (Should Have):
- Distributed tracing (OpenTelemetry) for debugging
- Metrics collection (Prometheus) for performance monitoring
- Load balancing for high-traffic scenarios
- Authentication/authorization for API access
Nice to Have:
- A/B testing framework for routing strategies
- Automatic threshold tuning based on feedback
- Multi-tenant support with isolated routing configs
- Advanced caching strategies (LRU, TTL policies)
About
Author: Vishesh Baghel
I write TypeScript for a living and work with agents, vector systems, and orchestration. I build these experiments to learn in public and share patterns with other developers.
Tech Stack: TypeScript, AI SDK, Upstash Vector, tiktoken, Vitest
Source: GitHub
Note: This experiment uses real API keys and makes actual LLM calls. Costs are minimal (~$0.50 for full demo) but monitor your usage.