vLLM / Local Backend

The vllm backend allows kanoa to connect to locally hosted models or any OpenAI-compatible API endpoint. This is ideal for students, researchers, and organizations running their own inference infrastructure.

Configuration

To use vLLM or a local model, point kanoa to your API endpoint.

Basic Setup

from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",  # URL of your vLLM server
    model="allenai/Molmo-7B-D-0924",      # Model name served by vLLM
    api_key="EMPTY"                        # vLLM usually doesn't require a key
)
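
Before connecting kanoa, it can help to confirm that the vLLM server is up and serving the expected model. A minimal check, assuming only the requests package and vLLM's standard OpenAI-compatible /v1/models route:

import requests

# vLLM exposes an OpenAI-compatible endpoint that lists the models it is serving
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()

served = [m["id"] for m in resp.json()["data"]]
print(served)  # e.g. ['allenai/Molmo-7B-D-0924']

The model name passed to AnalyticsInterpreter should match one of the IDs returned here.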

Supported Models

Tested Models

These models have been verified to work with the following hardware configurations:

| Model | Parameters | VRAM Required | Hardware Tested | Vision Support | Status |
|-------|------------|---------------|-----------------|----------------|--------|
| Molmo 7B-D (Allen AI) | 7B | 12GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (31.1 tok/s avg) |
| Gemma 3 4B (Google) | 4B | 8GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (2.5 tok/s) |
| Gemma 3 12B (Google) | 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (10.3 tok/s avg) |
| Gemma 3 12B (Google) | 12B | 24GB (fp16) | GCP L4 GPU (24GB) | Yes | [ ] Planned |

Performance Benchmarks (RTX 5080 16GB)

Molmo 7B Statistics (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| Vision - Boardwalk Photo | 29.3 | 5.8 | 24.0 | 35.5 | Stable ✓ |
| Vision - Complex Plot | 32.7 | 6.3 | 25.7 | 38.0 | Stable ✓ |
| Vision - Data Interpretation | 28.8 | 8.8 | 18.8 | 35.4 | Medium variance |
| Overall Average | 31.1 | 5.9 | 24.2 | 34.5 | 19% CV |

Gemma 3 12B Statistics (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| Vision - Boardwalk | 2.2 | 0.3 | 2.0 | 2.5 | Stable ✓ |
| Vision - Chart | 13.6 | 1.0 | 12.5 | 14.4 | Stable ✓ |
| Basic Chat | 12.6 | 1.4 | 10.9 | 13.7 | Stable ✓ |
| Code Generation | 16.0 | 2.9 | 14.3 | 19.4 | Medium variance |
| Reasoning | 2.3 | 2.3 | 0.8 | 4.9 | High variance ⚠️ |
| Structured Output | 25.1 | 18.0 | 13.5 | 45.8 | High variance ⚠️ |
| Multi-turn | 0.2 | 0.2 | 0.1 | 0.3 | High variance ⚠️ |
| Overall Average | 10.3 | 3.5 | 8.1 | 14.4 | 43% CV |

Model Comparison:

| Metric | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best For |
|--------|------------|-------------|----------|----------|
| Avg Throughput | 2.5 tok/s | 10.3 tok/s | 31.1 tok/s | Molmo 7B (3x faster) |
| Vision Tasks | 0.8-1.5 tok/s | 2.0-14.4 tok/s | 28.8-32.7 tok/s | Molmo 7B |
| Text Tasks | 3.3-7.1 tok/s | 12.6-25.1 tok/s | N/A | Gemma 3 12B |
| Stability (CV) | ~20% | 34% | 19% | Molmo 7B / Gemma 4B |
| VRAM Required | 8GB | 14GB | 12GB | Gemma 4B (lowest) |

Key Findings:

  • Molmo 7B averages roughly 3x the overall throughput of Gemma 3 12B, with excellent run-to-run stability

  • Gemma 3 12B is strongest on text tasks such as chat, code generation, and structured outputs (12-25 tok/s)

  • Gemma 3 4B is the most efficient for limited VRAM (8GB) but significantly slower

  • Vision tasks: Molmo dominates (29-33 tok/s) vs Gemma 3 12B (2-14 tok/s)

  • Complex reasoning: Gemma 3 12B shows high variance due to KV cache pressure

Performance Notes:

  • High variance in reasoning/multi-turn tasks indicates KV cache pressure

  • Vision tasks show excellent stability despite large image inputs

  • Prefix caching helps with repeated queries, but cached blocks may be evicted under memory pressure

  • Both models fit in 16GB VRAM with 4-bit quantization + FP8 KV cache

Recommendations:

  • Vision-focused workflows: Use Molmo 7B (3x faster than Gemma, 31 tok/s average)

  • Text-heavy & structured-output workflows: Use Gemma 3 12B (strong for chat, code generation, and JSON output)

  • Limited VRAM (<12GB): Use Gemma 3 4B (fits in 8GB but slower)

  • General use with 16GB VRAM: Start with Molmo 7B for vision, switch to Gemma 3 12B for text-heavy tasks

Monitor vLLM metrics (/metrics endpoint) to track cache performance if you experience latency spikes. See the GPU Metrics Monitoring guide for setting up Prometheus and Grafana to visualize these metrics in real-time.
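
As a lightweight alternative to a full Prometheus/Grafana setup, the raw /metrics output can be inspected directly. A minimal sketch, assuming the default vLLM port; exact metric names vary between vLLM versions:

import requests

# vLLM serves Prometheus-format text at /metrics (note: no /v1 prefix)
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print cache-related metrics; names differ slightly across vLLM versions
for line in metrics.splitlines():
    if not line.startswith("#") and "cache" in line:
        print(line)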

For detailed technical analysis of these performance differences, see Performance Analysis: Molmo vs Gemma in the kanoa-mlops repository.

Theoretically Supported

These models should work with vLLM but have not been tested with kanoa:

| Model | Parameters | Est. VRAM (4-bit) | Vision Support | Notes |
|-------|------------|-------------------|----------------|-------|
| Llama 3.2 Vision (Meta) | 11B, 90B | 12GB, 48GB | Yes | Strong multimodal capabilities |
| Llama 4 Scout/Maverick (Meta) | 17B (16E/128E) | 16GB, 32GB | Yes | Latest from Meta, any-to-any multimodal model |
| Qwen2.5-VL (Alibaba) | 3B, 7B, 72B | 6GB, 12GB, 40GB | Yes | Qwen vision-language series |
| Qwen3-VL (Alibaba) | 2B, 4B, 32B, 235B | 4GB, 6GB, 20GB, 120GB | Yes | Newest Qwen vision series |
| InternVL 3 / 3.5 (OpenGVLab) | 1B, 4B, 8B, 26B, 78B | 4GB, 6GB, 12GB, 20GB, 40GB | Yes | Latest InternVL series |
| Llama 3.1 (Meta) | 8B, 70B, 405B | 8GB, 40GB, 200GB+ | No | Text-only, excellent reasoning |
| Mistral | 7B | 8GB | No | Fast, efficient text model |
| Mixtral 8x7B | 47B total | 28GB | No | Mixture-of-experts architecture |

Hardware Requirements

Minimum Configuration:

  • NVIDIA GPU with CUDA support

  • 12GB VRAM (for 7B models with 4-bit quantization)

  • 16GB system RAM

Recommended Configuration:

  • 16GB+ VRAM for 12B models

  • 24GB+ VRAM for full-precision inference

  • PCIe 3.0 x4 or better (important for eGPU setups)

For detailed infrastructure setup and more hardware configurations, see the kanoa-mlops repository.
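
To confirm that a GPU meets these VRAM figures before launching a model, a quick PyTorch query is usually enough (assuming torch with CUDA support is installed):

import torch

# Report the total VRAM of each visible CUDA device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")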

Features

Vision Capabilities

Vision support depends on the underlying model:

  • Multimodal models (Molmo, Llama 3.2 Vision, Gemma 3, Qwen-VL, InternVL): kanoa can send figures as images (see the sketch below)

  • Text-only models: Passing a figure will result in an error or the image being ignored
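
As an illustration of the multimodal path, the sketch below builds a matplotlib figure and hands it to a Molmo-backed interpreter. The interpret(fig=..., question=...) call is hypothetical shorthand; check the kanoa API reference for the exact method name and signature:

import matplotlib.pyplot as plt
from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",  # must be a multimodal model
)

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Monthly active users")

# NOTE: interpret(fig=..., question=...) is a hypothetical call shown for
# illustration; consult the kanoa API reference for the actual signature.
result = interpreter.interpret(fig=fig, question="What trend does this plot show?")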

Knowledge Base

The vLLM backend supports Text Knowledge Bases:

# Load a text-based knowledge base
interpreter = interpreter.with_kb(kb_path="data/docs")  # Auto-detects file types

Cost Tracking

Since local models don’t have standard API pricing, kanoa estimates computational cost to help track usage:

  • Default Estimate: ~$0.10 per 1 million tokens (input + output)

This is a rough heuristic for tracking relative usage intensity rather than actual dollar spend.
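
For example, under the default heuristic a session that consumes 250,000 tokens in total would be logged as roughly $0.025:

# Sketch of the default heuristic: ~$0.10 per 1M tokens (input + output)
RATE_PER_MILLION_TOKENS = 0.10

tokens_used = 250_000  # combined input + output tokens for a session
estimated_cost = tokens_used / 1_000_000 * RATE_PER_MILLION_TOKENS
print(f"${estimated_cost:.3f}")  # -> $0.025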

Advanced Configuration

Custom Model Parameters

# Example with additional vLLM parameters
interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
    temperature=0.7,
    max_tokens=2048
)

Multiple Model Endpoints

Switch between different local models by restarting the vLLM server:

# Molmo for vision tasks
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924"
)

# Gemma 3 4B for text reasoning (restart server with different model)
gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="google/gemma-3-4b-it"
)
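
If your GPU (or GPUs) can hold both models at once, an alternative to restarting is to run two vLLM servers on different ports and point each interpreter at its own endpoint. A sketch assuming servers on ports 8000 and 8001:

from kanoa import AnalyticsInterpreter

# One vLLM server per model, each on its own port
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924"
)

gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8001/v1",
    model="google/gemma-3-4b-it"
)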

See Also