# vLLM / Local Backend

The `vllm` backend allows `kanoa` to connect to locally hosted models or any OpenAI-compatible API endpoint. This is ideal for students, researchers, and organizations running their own inference infrastructure.

## Configuration

To use vLLM or a local model, point `kanoa` to your API endpoint.

### Basic Setup

```python
from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",  # URL of your vLLM server
    model="allenai/Molmo-7B-D-0924",      # Model name served by vLLM
    api_key="EMPTY",  # vLLM usually doesn't require a key  pragma: allowlist secret
)
```
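Before wiring `kanoa` to the server, it can help to confirm the endpoint is reachable. A minimal sketch using the official `openai` client (assuming the server from the example above is running on `localhost:8000`):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server is serving; the output should include the
# model name you pass to AnalyticsInterpreter.
for model in client.models.list():
    print(model.id)
```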
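For multimodal models, you can also sanity-check vision support directly, since vLLM exposes the OpenAI vision chat format. A sketch, assuming the Molmo server above and a placeholder image URL (whether image inputs are accepted depends on the model and your vLLM version):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="allenai/Molmo-7B-D-0924",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Replace with any image URL reachable from the server.
                {"type": "image_url", "image_url": {"url": "https://example.com/plot.png"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

If this returns a sensible description, `kanoa` should be able to send figures to the same endpoint.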
## Supported Models

### Tested Models

These models have been verified to work with specific hardware configurations:

| Model | Parameters | VRAM Required | Hardware Tested | Vision Support | Status |
|-------|------------|---------------|-----------------|----------------|--------|
| **Molmo 7B-D** (Allen AI) | 7B | 12GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (31.1 tok/s avg) |
| **Gemma 3 4B** (Google) | 4B | 8GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (2.5 tok/s) |
| **Gemma 3 12B** (Google) | 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (10.3 tok/s avg) |
| **Gemma 3 12B** (Google) | 12B | 24GB (fp16) | GCP L4 GPU (24GB) | ✓ | [ ] Planned |

#### Performance Benchmarks (RTX 5080 16GB)

**Molmo 7B Statistics** (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| **Vision - Boardwalk Photo** | 29.3 | 5.8 | 24.0 | 35.5 | Stable ✓ |
| **Vision - Complex Plot** | 32.7 | 6.3 | 25.7 | 38.0 | Stable ✓ |
| **Vision - Data Interpretation** | 28.8 | 8.8 | 18.8 | 35.4 | Medium variance |
| **Overall Average** | 31.1 | 5.9 | 24.2 | 34.5 | 19% CV |

**Gemma 3 12B Statistics** (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| **Vision - Boardwalk** | 2.2 | 0.3 | 2.0 | 2.5 | Stable ✓ |
| **Vision - Chart** | 13.6 | 1.0 | 12.5 | 14.4 | Stable ✓ |
| **Basic Chat** | 12.6 | 1.4 | 10.9 | 13.7 | Stable ✓ |
| **Code Generation** | 16.0 | 2.9 | 14.3 | 19.4 | Medium variance |
| **Reasoning** | 2.3 | 2.3 | 0.8 | 4.9 | High variance ⚠️ |
| **Structured Output** | 25.1 | 18.0 | 13.5 | 45.8 | High variance ⚠️ |
| **Multi-turn** | 0.2 | 0.2 | 0.1 | 0.3 | High variance ⚠️ |
| **Overall Average** | 10.3 | 3.5 | 8.1 | 14.4 | 34% CV |

**Model Comparison**:

| Metric | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best For |
|--------|------------|-------------|----------|----------|
| **Avg Throughput** | 2.5 tok/s | 10.3 tok/s | **31.1 tok/s** | **Molmo 7B (3x faster)** |
| **Vision Tasks** | 0.8-1.5 tok/s | 2.0-14.4 tok/s | **28.8-32.7 tok/s** | **Molmo 7B** |
| **Text Tasks** | 3.3-7.1 tok/s | 12.6-25.1 tok/s | N/A | **Gemma 3 12B** |
| **Stability (CV)** | ~20% | 34% | 19% | **Molmo 7B / Gemma 4B** |
| **VRAM Required** | 8GB | 14GB | 12GB | **Gemma 4B (lowest)** |

**Key Findings**:

- **Molmo 7B is 3x faster than Gemma 3 12B** for vision tasks, with excellent stability
- **Gemma 3 12B excels at text reasoning** and structured outputs (12-25 tok/s)
- **Gemma 3 4B is the most efficient** for limited VRAM (8GB) but significantly slower
- **Vision tasks**: Molmo dominates (29-33 tok/s) vs Gemma 3 12B (2-14 tok/s)
- **Complex reasoning**: Gemma 3 12B shows high variance due to KV cache pressure

**Performance Notes**:

- High variance in reasoning/multi-turn tasks indicates KV cache pressure
- Vision tasks show excellent stability despite large image inputs
- Prefix caching helps with repeated queries but may evict under memory pressure
- Both models fit in 16GB VRAM with 4-bit quantization + FP8 KV cache

**Recommendations**:

- **Vision-focused workflows**: Use **Molmo 7B** (3x faster than Gemma, 31 tok/s average)
- **Text reasoning & structured outputs**: Use **Gemma 3 12B** (strong for code, JSON, multi-turn)
- **Limited VRAM (<12GB)**: Use **Gemma 3 4B** (fits in 8GB but slower)
- **General use with 16GB VRAM**: Start with **Molmo 7B** for vision, switch to Gemma 3 12B for text-heavy tasks

Monitor vLLM metrics (`/metrics` endpoint) to track cache performance if you experience latency spikes, as sketched below. See the [GPU Metrics Monitoring](../user_guide/monitoring.md) guide for setting up Prometheus and Grafana to visualize these metrics in real-time.
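A quick way to spot-check cache behavior without a full Prometheus setup is to scrape the endpoint directly. A minimal sketch (exact metric names vary across vLLM versions, so this simply filters for cache-related lines):

```python
import requests

# vLLM serves Prometheus-format metrics on the same port as the API server.
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print cache-related gauges/counters; skip the "# HELP"/"# TYPE" comment lines.
for line in metrics.splitlines():
    if "cache" in line and not line.startswith("#"):
        print(line)
```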
For detailed technical analysis of these performance differences, see [Performance Analysis: Molmo vs Gemma](https://github.com/lhzn-io/kanoa-mlops/blob/main/docs/source/performance-analysis.md) in the kanoa-mlops repository.

### Theoretically Supported

These models should work with vLLM but have not been tested with kanoa:

| Model | Parameters | Est. VRAM (4-bit) | Vision Support | Notes |
|-------|------------|-------------------|----------------|-------|
| **Llama 3.2 Vision** (Meta) | 11B, 90B | 12GB, 48GB | ✓ | Strong multimodal capabilities |
| **Llama 4 Scout/Maverick** (Meta) | 17B active (16E/128E) | 16GB, 32GB | ✓ | Meta's natively multimodal mixture-of-experts series |
| **Qwen2.5-VL** (Alibaba) | 3B, 7B, 72B | 6GB, 12GB, 40GB | ✓ | Qwen vision-language model, predecessor to Qwen3-VL |
| **Qwen3-VL** (Alibaba) | 2B, 4B, 32B, 235B | 4GB, 6GB, 20GB, 120GB | ✓ | Newest Qwen vision-language series |
| **InternVL 3 / 3.5** (OpenGVLab) | 1B, 4B, 8B, 26B, 78B | 4GB, 6GB, 12GB, 20GB, 40GB | ✓ | Latest InternVL series |
| **Llama 3.1** (Meta) | 8B, 70B, 405B | 8GB, 40GB, 200GB+ | ✗ | Text-only, excellent reasoning |
| **Mistral** | 7B | 8GB | ✗ | Fast, efficient text model |
| **Mixtral 8x7B** | 47B total | 28GB | ✗ | Mixture-of-experts architecture |

### Hardware Requirements

**Minimum Configuration:**

- NVIDIA GPU with CUDA support
- 12GB VRAM (for 7B models with 4-bit quantization)
- 16GB system RAM

**Recommended Configuration:**

- 16GB+ VRAM for 12B models
- 24GB+ VRAM for full-precision inference
- PCIe 3.0 x4 or better (important for eGPU setups)

For detailed infrastructure setup and more hardware configurations, see the [kanoa-mlops repository](https://github.com/lhzn-io/kanoa-mlops).

## Features

### Vision Capabilities

Vision support depends on the underlying model:

- **Multimodal models** (Molmo, Llama 3.2 Vision, Gemma 3, Qwen-VL, InternVL): `kanoa` can send figures as images
- **Text-only models**: Passing a figure will result in an error or the image being ignored

### Knowledge Base

The vLLM backend supports **Text Knowledge Bases**:

```python
# Load a text-based knowledge base
interpreter = interpreter.with_kb(kb_path="data/docs")  # Auto-detects file types
```

## Cost Tracking

Since local models don't have standard API pricing, `kanoa` estimates computational cost to help track usage:

- **Default Estimate**: ~$0.10 per 1 million tokens (input + output)

This is a rough heuristic for tracking relative usage intensity rather than actual dollar spend.

## Advanced Configuration

### Custom Model Parameters

```python
# Example with additional vLLM parameters
interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
    temperature=0.7,
    max_tokens=2048,
)
```

### Multiple Model Endpoints

Switch between different local models by restarting the vLLM server with the new model; on the client side, only the `model` name changes:

```python
# Molmo for vision tasks
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
)

# Gemma 3 4B for text reasoning (restart the server with this model first)
gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="google/gemma-3-4b-it",
)
```

## See Also

- [Getting Started with Local Inference](../user_guide/getting_started_local.md)
- [GPU Metrics Monitoring](../user_guide/monitoring.md) - Set up Prometheus and Grafana for vLLM metrics
- [kanoa-mlops Repository](https://github.com/lhzn-io/kanoa-mlops)
- [vLLM Documentation](https://docs.vllm.ai/)