# vLLM / Local Backend

The `vllm` backend allows kanoa to connect to locally hosted models or any OpenAI-compatible API endpoint. This is ideal for students, researchers, and organizations running their own inference infrastructure.
## Configuration

To use vLLM or a local model, point kanoa to your API endpoint.

### Basic Setup
```python
from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",  # URL of your vLLM server
    model="allenai/Molmo-7B-D-0924",      # Model name served by vLLM
    api_key="EMPTY",  # vLLM usually doesn't require a key  # pragma: allowlist secret
)
```
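Before wiring kanoa up, it can be worth confirming the endpoint responds. A minimal sketch using the standard `openai` client (a separate dependency, not part of kanoa), which works because vLLM serves an OpenAI-compatible API:

```python
# Optional sanity check (not part of kanoa): verify the vLLM server is up
# by listing the models it is currently serving.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for model in client.models.list():
    print(model.id)  # e.g. allenai/Molmo-7B-D-0924
```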
## Supported Models

### Tested Models

These models have been verified to work with the hardware configurations listed below:
| Model | Parameters | VRAM Required | Hardware Tested | Vision Support | Status |
|---|---|---|---|---|---|
| Molmo 7B-D (Allen AI) | 7B | 12GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (31.1 tok/s avg) |
| Gemma 3 4B (Google) | 4B | 8GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (2.5 tok/s) |
| Gemma 3 12B (Google) | 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (eGPU, 16GB) | ✓ | [✓] Verified (10.3 tok/s avg) |
| Gemma 3 12B (Google) | 12B | 24GB (fp16) | GCP L4 GPU (24GB) | ✓ | [ ] Planned |
### Performance Benchmarks (RTX 5080 16GB)
**Molmo 7B Statistics** (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|---|---|---|---|---|---|
| Vision - Boardwalk Photo | 29.3 | 5.8 | 24.0 | 35.5 | Stable ✓ |
| Vision - Complex Plot | 32.7 | 6.3 | 25.7 | 38.0 | Stable ✓ |
| Vision - Data Interpretation | 28.8 | 8.8 | 18.8 | 35.4 | Medium variance |
| Overall Average | 31.1 | 5.9 | 24.2 | 34.5 | 19% CV |
**Gemma 3 12B Statistics** (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|---|---|---|---|---|---|
| Vision - Boardwalk | 2.2 | 0.3 | 2.0 | 2.5 | Stable ✓ |
| Vision - Chart | 13.6 | 1.0 | 12.5 | 14.4 | Stable ✓ |
| Basic Chat | 12.6 | 1.4 | 10.9 | 13.7 | Stable ✓ |
| Code Generation | 16.0 | 2.9 | 14.3 | 19.4 | Medium variance |
| Reasoning | 2.3 | 2.3 | 0.8 | 4.9 | High variance ⚠️ |
| Structured Output | 25.1 | 18.0 | 13.5 | 45.8 | High variance ⚠️ |
| Multi-turn | 0.2 | 0.2 | 0.1 | 0.3 | High variance ⚠️ |
| Overall Average | 10.3 | 3.5 | 8.1 | 14.4 | 43% CV |
**Model Comparison:**

| Metric | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best For |
|---|---|---|---|---|
| Avg Throughput | 2.5 tok/s | 10.3 tok/s | 31.1 tok/s | Molmo 7B (3x faster) |
| Vision Tasks | 0.8-1.5 tok/s | 2.0-14.4 tok/s | 28.8-32.7 tok/s | Molmo 7B |
| Text Tasks | 3.3-7.1 tok/s | 12.6-25.1 tok/s | N/A | Gemma 3 12B |
| Stability (CV) | ~20% | 34% | 19% | Molmo 7B / Gemma 4B |
| VRAM Required | 8GB | 14GB | 12GB | Gemma 4B (lowest) |
**Key Findings:**

- Molmo 7B averages roughly 3x the throughput of Gemma 3 12B, with excellent stability on vision tasks
- Gemma 3 12B excels at text reasoning and structured outputs (12-25 tok/s)
- Gemma 3 4B is the most efficient for limited VRAM (8GB) but significantly slower
- Vision tasks: Molmo dominates (29-33 tok/s) vs Gemma 3 12B (2-14 tok/s)
- Complex reasoning: Gemma 3 12B shows high variance due to KV cache pressure
**Performance Notes:**

- High variance in reasoning and multi-turn tasks indicates KV cache pressure
- Vision tasks show excellent stability despite large image inputs
- Prefix caching helps with repeated queries but may evict entries under memory pressure
- Both models fit in 16GB VRAM with 4-bit quantization + FP8 KV cache (see the sketch after this list)
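As a sketch of what that configuration looks like on the vLLM side (argument names assume a recent vLLM release and the offline `LLM` API; `vllm serve` exposes the same options as `--quantization`, `--kv-cache-dtype`, and `--enable-prefix-caching`):

```python
# Illustrative vLLM engine configuration for fitting a 12B model in 16GB
# VRAM, per the notes above. Exact option names vary across vLLM versions.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",
    quantization="bitsandbytes",   # 4-bit weight quantization
    kv_cache_dtype="fp8",          # FP8 KV cache halves cache memory
    enable_prefix_caching=True,    # reuse KV blocks across repeated prefixes
    gpu_memory_utilization=0.90,   # leave headroom for activations
)
```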
**Recommendations:**

- Vision-focused workflows: use Molmo 7B (~3x faster than Gemma 3 12B overall, 31 tok/s average)
- Text reasoning & structured outputs: use Gemma 3 12B (strong for code, JSON, multi-turn)
- Limited VRAM (<12GB): use Gemma 3 4B (fits in 8GB but slower)
- General use with 16GB VRAM: start with Molmo 7B for vision and switch to Gemma 3 12B for text-heavy tasks (a model-selection sketch follows this list)
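One way to encode these recommendations in code, as a minimal sketch (the helper and its thresholds are illustrative, derived from the VRAM figures in the tables above):

```python
# Hypothetical helper mapping the recommendations above to a model id.
# Thresholds come from the "VRAM Required" column in the comparison table.
def pick_model(vram_gb: int, needs_vision: bool) -> str:
    if needs_vision and vram_gb >= 12:
        return "allenai/Molmo-7B-D-0924"  # fastest for vision (31 tok/s avg)
    if vram_gb >= 14:
        return "google/gemma-3-12b-it"    # best for text reasoning / JSON
    return "google/gemma-3-4b-it"         # fits in 8GB, but slower
```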
Monitor vLLM metrics (the `/metrics` endpoint) to track cache performance if you experience latency spikes. See the GPU Metrics Monitoring guide for setting up Prometheus and Grafana to visualize these metrics in real time.
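For a quick look without a full monitoring stack, you can poll the endpoint directly; a minimal sketch (the metric name shown comes from recent vLLM releases and may differ in yours):

```python
# Minimal sketch: poll vLLM's Prometheus endpoint and print KV cache usage.
# The metric name (vllm:gpu_cache_usage_perc) may vary across vLLM versions.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("vllm:gpu_cache_usage_perc"):
        print(line)
```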
For detailed technical analysis of these performance differences, see Performance Analysis: Molmo vs Gemma in the kanoa-mlops repository.
### Theoretically Supported
These models should work with vLLM but have not been tested with kanoa:
| Model | Parameters | Est. VRAM (4-bit) | Vision Support | Notes |
|---|---|---|---|---|
| Llama 3.2 Vision (Meta) | 11B, 90B | 12GB, 48GB | ✓ | Strong multimodal capabilities |
| Llama 4 Scout/Maverick (Meta) | 17B (16E/128E) | 16GB, 32GB | ✓ | Latest from Meta, any-to-any model |
| Qwen2.5-VL (Alibaba) | 3B, 7B, 72B | 6GB, 12GB, 40GB | ✓ | Qwen2.5 vision series |
| Qwen3-VL (Alibaba) | 2B, 4B, 32B, 235B | 4GB, 6GB, 20GB, 120GB | ✓ | Newest Qwen vision series |
| InternVL 3 / 3.5 (OpenGVLab) | 1B, 4B, 8B, 26B, 78B | 4GB, 6GB, 12GB, 20GB, 40GB | ✓ | Latest InternVL series |
| Llama 3.1 (Meta) | 8B, 70B, 405B | 8GB, 40GB, 200GB+ | ✗ | Text-only, excellent reasoning |
| Mistral | 7B | 8GB | ✗ | Fast, efficient text model |
| Mixtral 8x7B | 47B total | 28GB | ✗ | Mixture-of-experts architecture |
## Hardware Requirements

**Minimum Configuration:**

- NVIDIA GPU with CUDA support
- 12GB VRAM (for 7B models with 4-bit quantization; see the rough estimate below)
- 16GB system RAM

**Recommended Configuration:**

- 16GB+ VRAM for 12B models
- 24GB+ VRAM for full-precision inference
- PCIe 3.0 x4 or better (important for eGPU setups)
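These figures line up with a back-of-envelope estimate (a heuristic, not a vLLM formula): weights take roughly `params × bits / 8` bytes, and the rest of the budget goes to the KV cache, activations, and CUDA overhead.

```python
# Back-of-envelope weight-memory estimate (heuristic only):
# 1B parameters at 8 bits is ~1GB, so scale by bit width.
def weight_vram_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8

print(weight_vram_gb(7, 4))   # ~3.5GB of weights for a 7B model at 4-bit
print(weight_vram_gb(12, 4))  # ~6.0GB for 12B; KV cache and overhead fill the rest
```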
For detailed infrastructure setup and more hardware configurations, see the kanoa-mlops repository.
## Features

### Vision Capabilities

Vision support depends on the underlying model:
- **Multimodal models** (Molmo, Llama 3.2 Vision, Gemma 3, Qwen-VL, InternVL): kanoa can send figures as images (example below)
- **Text-only models**: passing a figure will result in an error or the image being ignored
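For illustration only, a vision call might look like the sketch below; the `interpret` method name and its parameters are assumptions here, so check kanoa's API reference for the actual call:

```python
# Hypothetical usage sketch: `interpret`, `fig`, and `question` are assumed
# names for illustration, not confirmed kanoa API.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Example trend")

# With a multimodal model (e.g. Molmo), the figure is sent as an image.
result = interpreter.interpret(fig=fig, question="What trend does this show?")
```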
### Knowledge Base
The vLLM backend supports Text Knowledge Bases:
```python
# Load a text-based knowledge base
interpreter = interpreter.with_kb(kb_path="data/docs")  # Auto-detects file types
```
### Cost Tracking
Since local models don’t have standard API pricing, kanoa estimates computational cost to help track usage:
**Default Estimate:** ~$0.10 per 1 million tokens (input + output)

This is a rough heuristic for tracking relative usage intensity rather than actual dollar spend.
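For example, a session that consumes 50,000 combined input and output tokens would be logged at roughly half a cent:

```python
# Worked example of the default heuristic: ~$0.10 per 1M tokens.
tokens_used = 50_000  # input + output tokens
estimated_cost = tokens_used * 0.10 / 1_000_000
print(f"${estimated_cost:.4f}")  # -> $0.0050
```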
## Advanced Configuration

### Custom Model Parameters
```python
# Example with additional vLLM parameters
interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
    temperature=0.7,
    max_tokens=2048,
)
```
### Multiple Model Endpoints
Switch between different local models by restarting the vLLM server:
```python
# Molmo for vision tasks
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
)

# Gemma 3 4B for text reasoning (restart server with different model)
gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="google/gemma-3-4b-it",
)
```
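With a single 16GB GPU the restart is unavoidable: the quantized models above each need roughly 8-14GB of VRAM, so only one fits in memory at a time. With more VRAM or a second GPU, you could instead run two vLLM servers on different ports and give each interpreter its own `api_base`.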
## See Also

- GPU Metrics Monitoring - Set up Prometheus and Grafana for vLLM metrics