Getting Started with Local Inference

Run models locally with full control, privacy, and zero API costs. We recommend Molmo 7B for vision tasks (31 tok/s, about 3x faster than Gemma 3 12B) or Gemma 3 12B for text reasoning (10.3 tok/s, strong for code and structured JSON output). The kanoa-mlops repository provides the infrastructure for local hosting.

Prerequisites

  • Python 3.11 or higher

  • kanoa installed (pip install kanoa); a quick check is sketched after this list

  • NVIDIA GPU (see hardware requirements)

  • kanoa-mlops repository cloned
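
A quick way to confirm the Python-side prerequisites is the sketch below (it only checks the interpreter version and that kanoa imports; GPU and repository setup are covered in the following steps):

import sys
import importlib.util

# Require Python 3.11+
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version.split()[0]}"

# Confirm kanoa is importable (pip install kanoa)
assert importlib.util.find_spec("kanoa") is not None, "kanoa is not installed"

print("Python and kanoa prerequisites look good")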

Quick Start

Step 1: Set Up Infrastructure

Clone and set up the kanoa-mlops repository:

git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops

# Create environment
conda env create -f environment.yml
conda activate kanoa-mlops

Step 2: Download and Start Model

Molmo 7B (recommended for vision tasks)

# Download Molmo 7B (verified working)
./scripts/download-models.sh molmo-7b-d

# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d

The server will be available at http://localhost:8000.
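
On first start the container also has to load the model weights, so the API may take a few minutes to answer. The sketch below (using the requests package, which is assumed to be available in your environment) polls vLLM's /health endpoint and then lists the served model ids; these are the same endpoints used in the Troubleshooting section:

import time
import requests

BASE_URL = "http://localhost:8000"

# Poll /health until the server answers (model loading can take a few minutes)
for _ in range(60):
    try:
        if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(5)
else:
    raise RuntimeError("vLLM server did not become healthy in time")

# List the model id(s) the server is actually serving
models = requests.get(f"{BASE_URL}/v1/models", timeout=5).json()
print([m["id"] for m in models.get("data", [])])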

Step 3: Connect kanoa to Local Server

import numpy as np
import matplotlib.pyplot as plt
from kanoa import AnalyticsInterpreter

# Create sample data
x = np.linspace(0, 10, 100)
y = np.exp(-x/5) * np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title("Damped Oscillation")
plt.xlabel("Time")
plt.ylabel("Amplitude")

# Connect to the local vLLM server; the model name must match what the
# server is serving (check http://localhost:8000/v1/models for the exact id)
interpreter = AnalyticsInterpreter(
    backend='openai',  # vLLM exposes an OpenAI-compatible API
    api_base='http://localhost:8000/v1',
    model='google/gemma-3-12b-it'  # replace with the Molmo id if you started Molmo in Step 2
)

# Interpret the plot
result = interpreter.interpret(
    fig=plt.gcf(),
    context="Physics simulation results",
    focus="Describe the pattern and suggest what physical process this could represent"
)

print(result.text)
print(f"Tokens: {result.usage.total_tokens}, Cost: ${result.usage.cost:.4f}")

Hardware Requirements

Verified Working Configurations

| Model       | VRAM Required         | Hardware Tested        | Avg Throughput    |
|-------------|-----------------------|------------------------|-------------------|
| Molmo 7B    | 12GB (4-bit)          | NVIDIA RTX 5080 (16GB) | 31.1 tok/s (±5.9) |
| Gemma 3 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (16GB) | 10.3 tok/s (±3.5) |
| Gemma 3 4B  | 8GB (4-bit)           | NVIDIA RTX 5080 (16GB) | 2.5 tok/s         |

Recommendations:

  • Vision-focused: Use Molmo 7B — it’s 3x faster than Gemma 3 12B (31 tok/s average)

  • Text reasoning/code: Use Gemma 3 12B — better for structured outputs, multi-turn chat

  • Limited VRAM: Use Gemma 3 4B — fits in 8GB but significantly slower

Minimum Requirements

  • GPU: NVIDIA GPU with CUDA support

  • VRAM: 12GB minimum for 7B models with 4-bit quantization (a quick check is sketched after this list)

  • Storage: 20-30GB for model weights

  • RAM: 16GB system RAM

  • PCIe: 3.0 x4 or better (important for eGPU setups)
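
If you are unsure which configuration fits your card, a rough look at free VRAM is enough to choose between the models above. A minimal sketch, assuming PyTorch is installed (vLLM pulls it in):

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; see 'GPU not detected' under Troubleshooting")

free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, free VRAM: {free_gb:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")

# Rough mapping to the verified configurations above (4-bit quantization)
if free_gb >= 14:
    print("Enough headroom for Gemma 3 12B or Molmo 7B")
elif free_gb >= 12:
    print("Enough headroom for Molmo 7B")
elif free_gb >= 8:
    print("Consider Gemma 3 4B")
else:
    print("Below the documented minimum for local inference")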

Tested Configurations

See vLLM Backend Reference for the complete list of tested hardware configurations.

Supported Models

Performance Comparison

Based on a 3-run benchmark on an RTX 5080 (16GB):

| Task Type         | Gemma 3 4B | Gemma 3 12B      | Molmo 7B         | Best Model      |
|-------------------|------------|------------------|------------------|-----------------|
| Vision (photos)   | 1.0 tok/s  | 2.2 ± 0.3 tok/s  | 29.3 ± 5.8 tok/s | Molmo 7B (13x)  |
| Vision (charts)   | 1.5 tok/s  | 13.6 ± 1.0 tok/s | 32.7 ± 6.3 tok/s | Molmo 7B (2.4x) |
| Vision (data viz) | ~1 tok/s   | ~10 tok/s        | 28.8 ± 8.8 tok/s | Molmo 7B (2.9x) |
| Basic chat        | 3.3 tok/s  | 12.6 ± 1.4 tok/s | Not tested       | Gemma 3 12B     |
| Code generation   | 5.0 tok/s  | 16.0 ± 2.9 tok/s | Not tested       | Gemma 3 12B     |
| Overall Average   | 2.5 tok/s  | 10.3 ± 3.5 tok/s | 31.1 ± 5.9 tok/s | Molmo 7B (3x)   |

Performance Notes:

  • Molmo 7B dominates vision tasks (29-33 tok/s) — 3x faster than Gemma 3 12B overall

  • Gemma 3 12B excels at text reasoning, code, and structured outputs (12-16 tok/s in this benchmark)

  • Molmo 7B throughput is more stable (19% coefficient of variation) than Gemma 3 12B (34%)

  • Complex reasoning on Gemma may show higher latency due to KV cache pressure

  • Monitor the vLLM /metrics endpoint to track cache hits and GPU utilization (a small scraping sketch follows this list)
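
A quick way to watch those numbers without extra tooling is to scrape the Prometheus text that vLLM serves at /metrics on the same port as the API. The sketch below simply filters lines mentioning cache or GPU; exact metric names vary between vLLM versions, so treat the filter as an assumption:

import requests

# vLLM exposes Prometheus-format metrics on the same port as the API
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# Show only cache- and GPU-related samples (metric names differ across vLLM versions)
for line in metrics.splitlines():
    if line.startswith("#"):
        continue
    if "cache" in line.lower() or "gpu" in line.lower():
        print(line)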

For a comprehensive list of models (including theoretical support), see the vLLM Backend Reference.

Troubleshooting

Server connection failed

Verify the server is running:

# Check server health
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

Check logs:

# For direct vLLM process (Gemma 3)
ps aux | grep vllm

# For Docker (Molmo)
docker compose -f docker/vllm/docker-compose.molmo.yml logs -f

Out of memory errors

If you hit VRAM limits:

# For Gemma 3 12B (reduce GPU memory allocation)
vllm serve google/gemma-3-12b-it --gpu-memory-utilization 0.85

# Or switch to 4B variant
vllm serve google/gemma-3-4b-it

# For Docker setups: 4-bit quantization is already the default in the configs;
# if you still run out of memory, reduce the --max-model-len parameter in docker-compose.yml

See kanoa-mlops hardware guide for detailed memory optimization.
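
As a rough back-of-envelope check before changing flags, you can estimate the weight memory alone from parameter count and quantization width. This ignores the KV cache, activations, and vLLM's preallocation via --gpu-memory-utilization, so treat it as a lower bound:

# Rough lower bound: weight memory = parameters x bytes per parameter
def weight_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * (bits / 8) / 1024**3

for name, params in [("Gemma 3 4B", 4), ("Molmo 7B", 7), ("Gemma 3 12B", 12)]:
    print(f"{name}: ~{weight_gib(params, 4):.1f} GiB at 4-bit, "
          f"~{weight_gib(params, 16):.1f} GiB at 16-bit (weights only)")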

GPU not detected

# Verify GPU detection
nvidia-smi

# For WSL2 users
# See kanoa-mlops/docs/source/wsl2-gpu-setup.md
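
If nvidia-smi works but the server still cannot see the GPU, confirm CUDA visibility from inside the same Python environment. A minimal sketch, assuming PyTorch is installed in the kanoa-mlops environment:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime version:", torch.version.cuda)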