vLLM / Local Backend

The vllm backend allows kanoa to connect to locally hosted models or any OpenAI-compatible API endpoint. This is ideal for students, researchers, and organizations running their own inference infrastructure.

Configuration

To use vLLM or a local model, point kanoa to your API endpoint.

Basic Setup

from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",  # URL of your vLLM server
    model="allenai/Molmo-7B-D-0924",      # Model name served by vLLM
    api_key="EMPTY"                        # vLLM usually doesn't require a key
)
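
Before connecting kanoa, it can help to confirm that the vLLM server is up and serving the expected model. A minimal check, assuming only the requests package and vLLM's standard OpenAI-compatible /v1/models route:

import requests

# vLLM exposes an OpenAI-compatible endpoint that lists the models it is serving
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()

served = [m["id"] for m in resp.json()["data"]]
print(served)  # e.g. ['allenai/Molmo-7B-D-0924']

The model name passed to AnalyticsInterpreter should match one of the IDs returned here.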

Supported Models

Tested Models

These models have been verified to work with the following hardware configurations:

| Model | Parameters | VRAM Required | Hardware Tested | Vision Support | Status |
|-------|------------|---------------|-----------------|----------------|--------|
| Molmo 7B-D (Allen AI) | 7B | 12GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (31.1 tok/s avg) |
| Gemma 3 4B (Google) | 4B | 8GB (4-bit) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (2.5 tok/s) |
| Gemma 3 12B (Google) | 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (eGPU, 16GB) | Yes | [✓] Verified (10.3 tok/s avg) |
| Gemma 3 12B (Google) | 12B | 24GB (fp16) | GCP L4 GPU (24GB) | Yes | [ ] Planned |

Performance Benchmarks (RTX 5080 16GB)

Molmo 7B Statistics (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| Vision - Boardwalk Photo | 29.3 | 5.8 | 24.0 | 35.5 | Stable ✓ |
| Vision - Complex Plot | 32.7 | 6.3 | 25.7 | 38.0 | Stable ✓ |
| Vision - Data Interpretation | 28.8 | 8.8 | 18.8 | 35.4 | Medium variance |
| Overall Average | 31.1 | 5.9 | 24.2 | 34.5 | 19% CV |

Gemma 3 12B Statistics (3-run benchmark on RTX 5080 eGPU):

| Test | Mean (tok/s) | StdDev | Min | Max | Notes |
|------|--------------|--------|-----|-----|-------|
| Vision - Boardwalk | 2.2 | 0.3 | 2.0 | 2.5 | Stable ✓ |
| Vision - Chart | 13.6 | 1.0 | 12.5 | 14.4 | Stable ✓ |
| Basic Chat | 12.6 | 1.4 | 10.9 | 13.7 | Stable ✓ |
| Code Generation | 16.0 | 2.9 | 14.3 | 19.4 | Medium variance |
| Reasoning | 2.3 | 2.3 | 0.8 | 4.9 | High variance ⚠️ |
| Structured Output | 25.1 | 18.0 | 13.5 | 45.8 | High variance ⚠️ |
| Multi-turn | 0.2 | 0.2 | 0.1 | 0.3 | High variance ⚠️ |
| Overall Average | 10.3 | 3.5 | 8.1 | 14.4 | 43% CV |

Model Comparison:

| Metric | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best For |
|--------|------------|-------------|----------|----------|
| Avg Throughput | 2.5 tok/s | 10.3 tok/s | 31.1 tok/s | Molmo 7B (3x faster) |
| Vision Tasks | 0.8-1.5 tok/s | 2.0-14.4 tok/s | 28.8-32.7 tok/s | Molmo 7B |
| Text Tasks | 3.3-7.1 tok/s | 12.6-25.1 tok/s | N/A | Gemma 3 12B |
| Stability (CV) | ~20% | 34% | 19% | Molmo 7B / Gemma 4B |
| VRAM Required | 8GB | 14GB | 12GB | Gemma 4B (lowest) |

Key Findings:

  • Molmo 7B averages roughly 3x the overall throughput of Gemma 3 12B, with excellent run-to-run stability

  • Gemma 3 12B is strongest on text tasks such as chat, code generation, and structured outputs (12-25 tok/s)

  • Gemma 3 4B is the most efficient for limited VRAM (8GB) but significantly slower

  • Vision tasks: Molmo dominates (29-33 tok/s) vs Gemma 3 12B (2-14 tok/s)

  • Complex reasoning: Gemma 3 12B shows high variance due to KV cache pressure

Performance Notes:

  • High variance in reasoning/multi-turn tasks indicates KV cache pressure

  • Vision tasks show excellent stability despite large image inputs

  • Prefix caching helps with repeated queries, but cached blocks may be evicted under memory pressure

  • Both models fit in 16GB VRAM with 4-bit quantization + FP8 KV cache

Recommendations:

  • Vision-focused workflows: Use Molmo 7B (3x faster than Gemma, 31 tok/s average)

  • Text-heavy & structured-output workflows: Use Gemma 3 12B (strong for chat, code generation, and JSON output)

  • Limited VRAM (<12GB): Use Gemma 3 4B (fits in 8GB but slower)

  • General use with 16GB VRAM: Start with Molmo 7B for vision, switch to Gemma 3 12B for text-heavy tasks

Monitor vLLM metrics (/metrics endpoint) to track cache performance if you experience latency spikes. See the GPU Metrics Monitoring guide for setting up Prometheus and Grafana to visualize these metrics in real-time.
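
As a lightweight alternative to a full Prometheus/Grafana setup, the raw /metrics output can be inspected directly. A minimal sketch, assuming the default vLLM port; exact metric names vary between vLLM versions:

import requests

# vLLM serves Prometheus-format text at /metrics (note: no /v1 prefix)
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print cache-related metrics; names differ slightly across vLLM versions
for line in metrics.splitlines():
    if not line.startswith("#") and "cache" in line:
        print(line)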

For detailed technical analysis of these performance differences, see Performance Analysis: Molmo vs Gemma in the kanoa-mlops repository.

Theoretically Supported

These models should work with vLLM but have not been tested with kanoa:

| Model | Parameters | Est. VRAM (4-bit) | Vision Support | Notes |
|-------|------------|-------------------|----------------|-------|
| Llama 3.2 Vision (Meta) | 11B, 90B | 12GB, 48GB | Yes | Strong multimodal capabilities |
| Llama 4 Scout/Maverick (Meta) | 17B (16E/128E) | 16GB, 32GB | Yes | Latest from Meta, any-to-any multimodal model |
| Qwen2.5-VL (Alibaba) | 3B, 7B, 72B | 6GB, 12GB, 40GB | Yes | Qwen vision-language series |
| Qwen3-VL (Alibaba) | 2B, 4B, 32B, 235B | 4GB, 6GB, 20GB, 120GB | Yes | Newest Qwen vision series |
| InternVL 3 / 3.5 (OpenGVLab) | 1B, 4B, 8B, 26B, 78B | 4GB, 6GB, 12GB, 20GB, 40GB | Yes | Latest InternVL series |
| Llama 3.1 (Meta) | 8B, 70B, 405B | 8GB, 40GB, 200GB+ | No | Text-only, excellent reasoning |
| Mistral | 7B | 8GB | No | Fast, efficient text model |
| Mixtral 8x7B | 47B total | 28GB | No | Mixture-of-experts architecture |

Hardware Requirements

Minimum Configuration:

  • NVIDIA GPU with CUDA support

  • 12GB VRAM (for 7B models with 4-bit quantization)

  • 16GB system RAM

Recommended Configuration:

  • 16GB+ VRAM for 12B models

  • 24GB+ VRAM for full-precision inference

  • PCIe 3.0 x4 or better (important for eGPU setups)

For detailed infrastructure setup and more hardware configurations, see the kanoa-mlops repository.
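
To confirm that a GPU meets these VRAM figures before launching a model, a quick PyTorch query is usually enough (assuming torch with CUDA support is installed):

import torch

# Report the total VRAM of each visible CUDA device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")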

Features

Vision Capabilities

Vision support depends on the underlying model:

  • Multimodal models (Molmo, Llama 3.2 Vision, Gemma 3, Qwen-VL, InternVL): kanoa can send figures as images (see the sketch below)

  • Text-only models: Passing a figure will result in an error or the image being ignored
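
As an illustration of the multimodal path, the sketch below builds a matplotlib figure and hands it to a Molmo-backed interpreter. The interpret(fig=..., question=...) call is hypothetical shorthand; check the kanoa API reference for the exact method name and signature:

import matplotlib.pyplot as plt
from kanoa import AnalyticsInterpreter

interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",  # must be a multimodal model
)

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Monthly active users")

# NOTE: interpret(fig=..., question=...) is a hypothetical call shown for
# illustration; consult the kanoa API reference for the actual signature.
result = interpreter.interpret(fig=fig, question="What trend does this plot show?")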

Knowledge Base

The vLLM backend supports Text Knowledge Bases:

# Load a text-based knowledge base
interpreter = interpreter.with_kb(kb_path="data/docs")  # Auto-detects file types

Cost Tracking

Since local models don’t have standard API pricing, kanoa estimates computational cost to help track usage:

  • Default Estimate: ~$0.10 per 1 million tokens (input + output)

This is a rough heuristic for tracking relative usage intensity rather than actual dollar spend.
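
For example, under the default heuristic a session that consumes 250,000 tokens in total would be logged as roughly $0.025:

# Sketch of the default heuristic: ~$0.10 per 1M tokens (input + output)
RATE_PER_MILLION_TOKENS = 0.10

tokens_used = 250_000  # combined input + output tokens for a session
estimated_cost = tokens_used / 1_000_000 * RATE_PER_MILLION_TOKENS
print(f"${estimated_cost:.3f}")  # -> $0.025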

Advanced Configuration

Custom Model Parameters

# Example with additional vLLM parameters
interpreter = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924",
    temperature=0.7,
    max_tokens=2048
)

Multiple Model Endpoints

Switch between different local models by restarting the vLLM server:

# Molmo for vision tasks
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924"
)

# Gemma 3 4B for text reasoning (restart server with different model)
gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="google/gemma-3-4b-it"
)
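
If your GPU (or GPUs) can hold both models at once, an alternative to restarting is to run two vLLM servers on different ports and point each interpreter at its own endpoint. A sketch assuming servers on ports 8000 and 8001:

from kanoa import AnalyticsInterpreter

# One vLLM server per model, each on its own port
molmo = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8000/v1",
    model="allenai/Molmo-7B-D-0924"
)

gemma = AnalyticsInterpreter(
    backend="vllm",
    api_base="http://localhost:8001/v1",
    model="google/gemma-3-4b-it"
)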

See Also