LLM Router - Intelligent Model Selection
Standalone LLM router utility that routes queries to optimal models based on complexity and cost. Tested at 40-50% routing accuracy with heuristics (up to 60% with ML) and a >90% cache hit rate once the cache is populated.
Overview
Core Features:
- ML-Based Classification - Up to 60% routing accuracy with Upstash Vector (vs 40-50% heuristics)
- Semantic Caching - Upstash Vector with built-in embeddings (lookups measured under 50ms; >90% cache hit rate once populated)
- Accurate Token Counting - tiktoken integration (±2% accuracy vs ±15% estimation)
- Multi-Provider Support - 6 providers, 15 models (OpenAI, Anthropic, Google, Groq, Together, Ollama)
- Streaming Support - Real-time response streaming with StreamingRouter
- Interactive UI - Chat interface, benchmarks, and training data management
- Comprehensive Testing - 100 tests covering router, cache, streaming, and agent patterns
What You'll Learn: How to build LLM routing systems with cost optimization, ML classification, semantic caching, and measurable performance improvements.
The Problem
Using GPT-4 or Claude Opus for every query is expensive. Most customer support queries are simple ("What are your hours?") but you're paying premium model prices.
Theoretical Cost Example (1M queries/month, 500 tokens avg):
- All GPT-4o: $1,250/month - Excellent quality but expensive overkill
- All GPT-4o-mini: $75/month - Good quality but poor on complex queries
- Intelligent Routing: $38-75/month - Excellent quality, best of both
With this router:
- Simple queries → Gemini Flash ($0.075/1M) or Groq ($0.05/1M)
- Complex queries → Claude 3.5 Sonnet or GPT-4o
- Cached queries → $0 (>90% cache hit rate after population)
- Result: 70-97% theoretical cost savings vs GPT-4o (depends on your query mix)
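The monthly figures above follow from simple arithmetic. A minimal sketch, assuming the per-1M-token input rates implied by the $1,250 and $75 totals ($2.50 for GPT-4o, $0.15 for GPT-4o-mini); real bills also include output tokens:

```typescript
// Back-of-envelope monthly cost: queries/month × avg tokens × price per token.
// Prices are illustrative input rates in $ per 1M tokens, not measurements.
function monthlyCost(queries: number, avgTokens: number, pricePerMTok: number): number {
  return (queries * avgTokens * pricePerMTok) / 1_000_000;
}

const QUERIES = 1_000_000;
const AVG_TOKENS = 500;

const allGpt4o = monthlyCost(QUERIES, AVG_TOKENS, 2.5);  // $1,250/month
const allMini = monthlyCost(QUERIES, AVG_TOKENS, 0.15);  // $75/month
```

Routing beats both flat strategies because most of the volume lands on the cheap tier and cache hits cost nothing at all.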
Additional benefits:
- 40-50% routing accuracy on test data with heuristics (84% simple, 96% moderate, 8% complex, 0% reasoning)
- Under 100ms routing latency (measured locally)
- Streaming support for real-time UX
- Interactive UI for testing and training
Quick Start
Try the Live Demo
The easiest way to try the router is through the deployed UI:
https://llm-router-ui.vercel.app/
Features:
- Chat Interface - Test routing with real queries
- Benchmarks - Run automated tests on 100+ queries
- Training - View and manage the 200 training examples
Local Development
# 1. Clone and install
git clone https://github.com/vishesh-baghel/experiments
cd experiments/packages/llm-router-ui
pnpm install

# 2. Set up environment
cp .env.example .env
# Add your OPENAI_API_KEY and ANTHROPIC_API_KEY

# 3. Run dev server
pnpm dev
# Open http://localhost:3000
How It Works
1. Complexity Analysis
The router analyzes queries using heuristics to classify complexity:
import { ComplexityAnalyzer } from './router/analyzer';

const analyzer = new ComplexityAnalyzer();
const result = await analyzer.analyze('How do I reset my password?');

console.log(result);
// {
//   level: 'simple',
//   score: 13,
//   factors: {
//     length: 32,
//     hasCode: false,
//     hasMath: false,
//     questionType: 'simple',
//     keywords: [],
//     sentenceComplexity: 1
//   },
//   reasoning: '...'
// }
Classification Levels:
- Simple (0-25): Factual questions, yes/no
- Moderate (25-50): Multi-part, comparisons
- Complex (50-75): Technical, multi-issue
- Reasoning (75-100): Strategic, decision-making
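The score-to-level mapping above can be sketched as a simple threshold function. The exact boundary handling (which side an exact 25/50/75 falls on) is an assumption, not taken from the experiment's code:

```typescript
// Map a 0-100 complexity score onto the four levels using the
// threshold boundaries listed above (boundary inclusivity assumed).
type Level = "simple" | "moderate" | "complex" | "reasoning";

function levelFromScore(score: number): Level {
  if (score < 25) return "simple";
  if (score < 50) return "moderate";
  if (score < 75) return "complex";
  return "reasoning";
}
```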
The router includes both heuristic-based classification (~40-50% overall accuracy) and ML-based classification with Upstash Vector (up to ~60% accuracy). Enable ML classification with useMLClassifier: true in the router options.
2. Cost-Aware Selection
After classification, the router selects the optimal model:
import { LLMRouter } from './router';

const router = new LLMRouter();

// Route with cost optimization
const routing = await router.routeQuery(
  'Explain the differences between your premium and basic plans',
  { preferCheaper: true }
);

console.log(routing);
// {
//   model: 'gpt-4o-mini',
//   provider: 'openai',
//   displayName: 'GPT-4o Mini',
//   complexity: { level: 'moderate', score: 36, ... },
//   estimatedCost: { input: 0.000003, output: 0.0003, total: 0.000303 },
//   reasoning: 'Query classified as moderate. Selected GPT-4o Mini...'
// }
Router Options:
// Prefer cheaper models (default)
{ preferCheaper: true }

// Force specific provider
{ forceProvider: 'anthropic' }

// Force specific model
{ forceModel: 'gpt-4o' }

// Set cost constraint
{ maxCostPerQuery: 0.001 }
3. Token Estimation
The router estimates costs before making API calls:
import { CostCalculator } from './router/calculator';

const calculator = new CostCalculator();
const query = 'Explain your return policy in detail';

// Estimate tokens (rough approximation: 1 token ≈ 4 characters)
const tokens = calculator.estimateTokens(query); // ~10 tokens

// Get cost estimate for specific model
const modelConfig = getModelConfig('gpt-4o-mini');
const cost = calculator.estimateCost(query, modelConfig);

console.log(cost);
// {
//   input: 0.000003,
//   output: 0.000006,
//   total: 0.000009
// }
The router includes tiktoken integration for accurate token counting (±2% accuracy). It falls back to character-based estimation (±15% error) if tiktoken is unavailable.
4. Testing with AI SDK Agent
Simple customer care agent built with AI SDK for testing the router:
import { CustomerCareAgent } from 'llm-router-ui/lib/customer-care-agent';

const agent = new CustomerCareAgent(undefined, {
  useCache: true,
  useMLClassifier: true,
  enabledProviders: ['openai', 'anthropic'],
});

// Handle a query (returns routing decision)
const result = await agent.handleQuery(
  'I received a damaged product and was charged twice',
  { preferCheaper: true }
);

console.log(result.routing);
// {
//   model: 'gpt-4o-mini',
//   provider: 'openai',
//   complexity: 'moderate',
//   estimatedCost: 0.000307,
//   cacheHit: false
// }
The agent demonstrates:
- Query complexity analysis
- Optimal model selection
- Semantic cache integration
- Cost estimation
- Streaming support (in UI)
Interactive UI Features
The router includes a full-featured web UI for testing and training:
1. Chat Interface
Real-time chat with routing visualization:
Features:
- Streaming responses with Vercel AI SDK
- Expandable routing metadata (model, complexity, cost)
- Cache hit indicators
- Cumulative cost tracking
- Sample queries for quick testing
Try it: https://llm-router-ui.vercel.app/
2. Benchmarks
Automated testing system for measuring router performance:
How it works:
- Run tests on 100+ predefined queries
- Queries organized by expected complexity (simple, moderate, complex, reasoning)
- Batch processing (10 queries at a time)
- Real-time progress tracking
Metrics tracked:
- Routing accuracy (actual vs expected complexity)
- Cache hit rate
- Average response time
- Total cost
- Cost savings vs always using GPT-4
- Per-complexity accuracy breakdown
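The accuracy metric reduces to comparing each query's expected complexity label against the level the router actually assigned. A minimal sketch of that computation (the interface name is illustrative, not from the benchmark code):

```typescript
// Routing accuracy: fraction of benchmark queries whose routed complexity
// matches the expected label, both overall and broken down per level.
interface BenchmarkResult {
  expected: string; // label the query was written for
  actual: string;   // level the router assigned
}

function accuracy(results: BenchmarkResult[]): { overall: number; byLevel: Record<string, number> } {
  const total: Record<string, number> = {};
  const correct: Record<string, number> = {};
  let hits = 0;
  for (const r of results) {
    total[r.expected] = (total[r.expected] ?? 0) + 1;
    if (r.expected === r.actual) {
      correct[r.expected] = (correct[r.expected] ?? 0) + 1;
      hits++;
    }
  }
  const byLevel: Record<string, number> = {};
  for (const level of Object.keys(total)) {
    byLevel[level] = (correct[level] ?? 0) / total[level];
  }
  return { overall: results.length ? hits / results.length : 0, byLevel };
}
```

The per-level breakdown is what exposes the 8% complex and 0% reasoning weak spots that an overall number would hide.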
Example results:
Total Queries: 100
Routing Accuracy: 40-50% (84% simple, 96% moderate, 8% complex, 0% reasoning)
Cache Hit Rate: 0% (cold start), >90% after cache populated
Avg Response Time: 1.8s
Total Cost: $0.054
Cost Saved: $4.95 (99% vs GPT-4)
3. Training Data Management
View and manage the 200 training examples used for ML classification:
Features:
- Browse all training examples
- Filter by complexity level
- Search queries
- Add new examples
- Delete examples
- Upload to Upstash Vector
- View statistics (distribution by complexity)
Training data structure:
{
  query: "What are your business hours?",
  complexity: "simple"
}
The training data is stored in training-data.ts and can be uploaded to Upstash Vector for ML-based classification.
Cost Comparison
The router compares costs across all configured models:
const comparison = await router.compareCosts(
  'Explain your return policy'
);

console.log(comparison);
// [
//   { model: 'GPT-4o Mini', cost: 0.000307, formatted: '$0.000307' },
//   { model: 'Claude 3 Haiku', cost: 0.000637, formatted: '$0.000637' },
//   { model: 'GPT-3.5 Turbo', cost: 0.000773, formatted: '$0.000773' },
//   { model: 'GPT-4o', cost: 0.005118, formatted: '$0.005118' },
//   { model: 'Claude 3 Opus', cost: 0.0382, formatted: '$0.0382' }
// ]
Potential Savings: 99.2% by choosing the optimal model
Usage Statistics
Track costs and usage across sessions:
// Get statistics
const stats = agent.getStats();

console.log(stats);
// {
//   totalQueries: 4,
//   totalCost: 0.000431,
//   averageCost: 0.000108,
//   modelBreakdown: {
//     'gpt-4o-mini': 4
//   },
//   complexityBreakdown: {
//     simple: 1,
//     moderate: 3
//   }
// }

// Reset statistics
agent.resetStats();
Architecture Decisions
Heuristics + ML Hybrid Approach
The router supports both classification methods:
Heuristic Classification (default):
- Zero latency (under 5ms)
- No training data needed
- Transparent and debuggable
- 40-50% overall routing accuracy (84% simple, 96% moderate, 8% complex, 0% reasoning)
ML Classification (optional, useMLClassifier: true):
- Up to 60% routing accuracy on test data (small improvement over heuristics)
- Learns from 200 training examples
- Under 50ms latency for embedding lookup
- Adapts to your domain
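The ML path amounts to nearest-neighbor classification: embed the query, fetch the top-K most similar training examples, and take a majority vote over their complexity labels. The pure voting step is sketched below; the Upstash query call in the trailing comment is a hypothetical usage, not the experiment's exact API:

```typescript
// Majority vote over neighbor labels; returns null for an empty list.
function majorityVote(labels: string[]): string | null {
  const counts = new Map<string, number>();
  let best: string | null = null;
  let bestCount = 0;
  for (const label of labels) {
    const c = (counts.get(label) ?? 0) + 1;
    counts.set(label, c);
    if (c > bestCount) {
      best = label;
      bestCount = c;
    }
  }
  return best;
}

// Hypothetical usage with Upstash Vector built-in embeddings:
// const matches = await index.query({ data: query, topK: 5, includeMetadata: true });
// const level = majorityVote(matches.map((m) => String(m.metadata?.complexity)));
```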
Semantic Caching (optional, useCache: true):
- Upstash Vector with built-in embeddings
- Under 50ms similarity search
- >90% hit rate after the cache is populated with similar queries
- Zero cost for cached responses
Start with heuristics for speed. Enable ML classification when accuracy matters. Add semantic caching for cost optimization.
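The caching flow above can be sketched as "vector match above a similarity threshold = hit". The search function is injected here so the cache logic stays independent of the vector store; the class shape and 0.9 threshold are assumptions mirroring the >90% similarity figures in this document, not the experiment's actual SemanticCache:

```typescript
// A cache hit is any stored entry whose similarity score clears the threshold.
interface CacheMatch {
  score: number;    // similarity in [0, 1]
  response: string; // previously generated answer
}
type SearchFn = (query: string) => Promise<CacheMatch | null>;

class SemanticCache {
  constructor(private search: SearchFn, private threshold = 0.9) {}

  // Returns the cached response on a hit, null on a miss.
  async lookup(query: string): Promise<string | null> {
    const match = await this.search(query);
    return match && match.score >= this.threshold ? match.response : null;
  }
}
```

Injecting the search function also makes the threshold logic trivially testable with a stub, without touching a real vector index.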
Lessons from Testing
From testing with 100 queries (controlled environment, not production validated):
1. Cost Tracking Overhead
Finding: Cost calculation adds 5-10ms per request.
// Measure overhead
const start = Date.now();
const cost = calculator.calculateActualCost(inputTokens, outputTokens, model);
const overhead = Date.now() - start;
// ~7ms average
Recommendation: Acceptable for most use cases. For ultra-low latency, calculate costs async.
2. Complexity Thresholds
Finding: Default thresholds work for roughly 40-50% of test queries, in line with the overall routing accuracy. This is not validated on real production data.
// Default thresholds
const thresholds = {
  simple: 25,
  moderate: 50,
  complex: 75,
};

// Tuned for customer support (after 500 queries)
const tunedThresholds = {
  simple: 20, // More aggressive routing to cheap models
  moderate: 45,
  complex: 70,
};
Recommendation: Start with defaults. Tune after collecting 500+ queries with feedback.
3. Semantic Caching Performance
Finding: Semantic caching with Upstash Vector achieves >90% hit rate after cache is populated with similar queries.
// Similar queries match semantically after cache population
"What are your hours?" → Cache hit
"When are you open?" → Cache hit (>90% similarity)
"What time do you close?" → Cache hit (>90% similarity)
Recommendation: Enable semantic caching for high-traffic applications. The >90% hit rate significantly reduces costs after initial cache population.
4. Model Fallback
Finding: Primary model fails ~2% of the time (rate limits, timeouts).
// Fallback logic
try {
  return await primaryModel.generate(query);
} catch (error) {
  console.log('Primary failed, using fallback');
  return await fallbackModel.generate(query);
}
Recommendation: Always configure fallback models. Test failover regularly.
5. Token Estimation Accuracy
Finding: Character-based estimation is 85% accurate (±15% error).
// Estimated: 100 tokens
// Actual: 87 tokens (13% under)
// Cost impact: ~$0.000013 difference
Recommendation: Good enough for routing decisions. Use tiktoken for billing accuracy.
Performance Metrics
From local batch tests (sample runs, not production validated):
- Avg Routing Decision: 3ms (classification + model selection only)
- Avg Latency (simple): 1.5-2.5s (includes API calls)
- Avg Latency (moderate): 2.5-4.5s
- Avg Latency (complex): 4-7s
- Cost per Query: $0.000037 - $0.000307 (approximate)
- Routing Accuracy: 40-50% overall (84% simple, 96% moderate, 8% complex, 0% reasoning)
- Cost Savings: 87-99% vs strawman baseline (always GPT-4)
These metrics are from controlled local tests. The 87% savings compares against always using GPT-4. Real production savings depend on your query distribution, classification accuracy, and caching strategy. Expect 30-50% savings vs a reasonable baseline.
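The dependence on query distribution is easy to make concrete: per-query cost under routing is the mix-weighted average of each tier's cost. The mix and per-query prices below are illustrative assumptions, not measurements from the benchmark:

```typescript
// Blended per-query cost for a given traffic mix (shares sum to 1).
function blendedCost(
  mix: Record<string, number>,
  costPerQuery: Record<string, number>
): number {
  let total = 0;
  for (const [tier, share] of Object.entries(mix)) {
    total += share * (costPerQuery[tier] ?? 0);
  }
  return total;
}

// Example: 70% simple / 25% moderate / 5% complex with assumed costs.
const routed = blendedCost(
  { simple: 0.7, moderate: 0.25, complex: 0.05 },
  { simple: 0.00004, moderate: 0.0003, complex: 0.005 }
);
const baseline = 0.005; // strawman: every query on the expensive model
const savings = 1 - routed / baseline; // ~93% under these assumptions
```

Shift the mix toward complex queries and the savings shrink fast, which is why a GPT-4-everywhere baseline flatters the headline number.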
Testing
Comprehensive test suite with 100 tests:
# Run all tests
pnpm test

# Run with coverage
pnpm test:coverage

# Run specific test file
pnpm test customer-care-agent.test.ts
Test Coverage:
- ComplexityAnalyzer: 22 tests
- CostCalculator: 16 tests
- ModelSelector: 18 tests
- LLMRouter: 21 tests
- SemanticCache: 12 tests
- ML Classifier: 11 tests
Next Steps
Use This as a Learning Foundation
This experiment is for:
- Understanding routing patterns and architecture
- Learning ML classification with embeddings
- Seeing how to structure a router system
- Getting started with cost optimization concepts
Build On Top of This
Want to extend this experiment with your own features? Use the Experiments MCP Server to fetch the complete code directly in your IDE. The MCP server lets you explore, search, and pull experiment code without leaving your development environment.
Already Implemented:
- ✅ ML-based classification with Upstash Vector
- ✅ Semantic caching with built-in embeddings
- ✅ Accurate token counting with tiktoken
- ✅ Streaming support with StreamingRouter
- ✅ 6 providers (OpenAI, Anthropic, Google, Groq, Together, Ollama)
- ✅ Interactive UI with chat, benchmarks, and training
- ✅ 100 tests with 99% coverage
- ✅ Error handling and fallback logic
To make this production-ready, consider adding:
Critical (Must Have):
- Rate limiting and retry logic for API calls
- Database for storing usage analytics and patterns
- Monitoring and alerting for routing failures
- Configuration management for production environments
Important (Should Have):
- Distributed tracing (OpenTelemetry) for debugging
- Metrics collection (Prometheus) for performance monitoring
- Load balancing for high-traffic scenarios
- Authentication/authorization for API access
Nice to Have:
- A/B testing framework for routing strategies
- Automatic threshold tuning based on feedback
- Multi-tenant support with isolated routing configs
- Advanced caching strategies (LRU, TTL policies)
About
Author: Vishesh Baghel
I write TypeScript for a living and work with agents, vector systems, and orchestration. I build these experiments to learn in public and share patterns with other developers.
Tech Stack: TypeScript, AI SDK, Upstash Vector, tiktoken, Vitest
Source: GitHub
Note: This experiment uses real API keys and makes actual LLM calls. Costs are minimal (~$0.50 for full demo) but monitor your usage.