# Getting Started with Local Inference
Run models locally with full control, privacy, and zero API costs. We recommend Molmo 7B for vision tasks (31 tok/s, 3x faster than Gemma) or Gemma 3 12B for text reasoning (10.3 tok/s, strong for code/JSON). The kanoa-mlops repository provides infrastructure for local hosting.
## Prerequisites
- Python 3.11 or higher
- kanoa installed (`pip install kanoa`)
- NVIDIA GPU (see hardware requirements)
- kanoa-mlops repository cloned
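
A quick preflight check for the Python-side prerequisites (a minimal sketch; it assumes kanoa is installed under the package name `kanoa` and that the NVIDIA driver's `nvidia-smi` is on your PATH):

```python
import sys
import shutil
from importlib.metadata import version, PackageNotFoundError

# Python 3.11+
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version}"

# kanoa installed
try:
    print("kanoa version:", version("kanoa"))
except PackageNotFoundError:
    print("kanoa not installed: run `pip install kanoa`")

# NVIDIA driver / GPU visible
print("nvidia-smi found:", shutil.which("nvidia-smi") is not None)
```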
## Quick Start
### Step 1: Set Up Infrastructure
Clone and set up the kanoa-mlops repository:
```bash
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops

# Create environment
conda env create -f environment.yml
conda activate kanoa-mlops
```
### Step 2: Download and Start Model
**Option A: Gemma 3 12B (Recommended for 16GB VRAM)**
```bash
# Start vLLM server (downloads model automatically)
vllm serve google/gemma-3-12b-it --port 8000
```
**Option B: Molmo 7B (Best for vision tasks)**
```bash
# Download Molmo 7B (verified working)
./scripts/download-models.sh molmo-7b-d

# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
```
The server will be available at `http://localhost:8000`.
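
Before wiring up kanoa, you can sanity-check the endpoint directly. A minimal sketch using `requests` (assumed to be available; the model ID returned depends on which option you started):

```python
import requests

# vLLM exposes a health probe and an OpenAI-compatible model listing
print(requests.get("http://localhost:8000/health", timeout=5).status_code)  # expect 200

models = requests.get("http://localhost:8000/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])  # e.g. ['google/gemma-3-12b-it']
```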
### Step 3: Connect kanoa to Local Server
```python
import numpy as np
import matplotlib.pyplot as plt
from kanoa import AnalyticsInterpreter

# Create sample data
x = np.linspace(0, 10, 100)
y = np.exp(-x/5) * np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title("Damped Oscillation")
plt.xlabel("Time")
plt.ylabel("Amplitude")

# Connect to local vLLM server (Gemma 3 12B)
interpreter = AnalyticsInterpreter(
    backend='openai',  # vLLM uses OpenAI-compatible API
    api_base='http://localhost:8000/v1',
    model='google/gemma-3-12b-it'  # Use whatever model you started
)

# Interpret the plot
result = interpreter.interpret(
    fig=plt.gcf(),
    context="Physics simulation results",
    focus="Describe the pattern and suggest what physical process this could represent"
)

print(result.text)
print(f"Tokens: {result.usage.total_tokens}, Cost: ${result.usage.cost:.4f}")
```
## Hardware Requirements
### Verified Working Configurations
| Model | VRAM Required | Hardware Tested | Avg Throughput |
|---|---|---|---|
| Molmo 7B | 12GB (4-bit) | NVIDIA RTX 5080 (16GB) | 31.1 tok/s (±5.9) |
| Gemma 3 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (16GB) | 10.3 tok/s (±3.5) |
| Gemma 3 4B | 8GB (4-bit) | NVIDIA RTX 5080 (16GB) | 2.5 tok/s |
**Recommendations:**

- **Vision-focused:** Use Molmo 7B — it’s 3x faster than Gemma 3 12B (31 tok/s average)
- **Text reasoning/code:** Use Gemma 3 12B — better for structured outputs and multi-turn chat
- **Limited VRAM:** Use Gemma 3 4B — fits in 8GB but significantly slower
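
If you script model selection, the recommendations above reduce to a small lookup. A sketch only: `TASK_TO_MODEL` and `pick_model` are illustrative names, not part of kanoa or kanoa-mlops:

```python
# Illustrative mapping from task type to the verified model IDs listed below
TASK_TO_MODEL = {
    "vision": "allenai/Molmo-7B-D-0924",   # fastest for image/chart interpretation
    "text": "google/gemma-3-12b-it",       # best for code, JSON, multi-turn chat
    "low_vram": "google/gemma-3-4b-it",    # fits in 8GB, slower
}

def pick_model(task: str) -> str:
    """Return a verified model ID for the given task type."""
    return TASK_TO_MODEL.get(task, "google/gemma-3-12b-it")

print(pick_model("vision"))  # allenai/Molmo-7B-D-0924
```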
### Minimum Requirements
- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** 12GB minimum (for 7B models with 4-bit quantization)
- **Storage:** 20-30GB for model weights
- **RAM:** 16GB system RAM
- **PCIe:** 3.0 x4 or better (important for eGPU setups)
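
To check whether a machine meets the VRAM minimum, you can query the driver directly. A minimal sketch; it assumes `nvidia-smi` is installed, which it is wherever the NVIDIA driver is present:

```python
import subprocess

# Total VRAM per GPU in MiB, reported by the NVIDIA driver
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    name, mem = [part.strip() for part in line.split(",")]
    print(f"{name}: {mem}")  # e.g. "NVIDIA GeForce RTX 5080: 16303 MiB"
```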
### Tested Configurations
See vLLM Backend Reference for the complete list of tested hardware configurations.
## Supported Models
### Recommended Models (Verified)
**For 16GB VRAM:**

- ✅ Molmo 7B (`allenai/Molmo-7B-D-0924`) — Best for vision, 31 tok/s average, 3x faster than Gemma
- ✅ Gemma 3 12B (`google/gemma-3-12b-it`) — Best for text reasoning, 10.3 tok/s average

**For <16GB VRAM:**

- ✅ Gemma 3 4B (`google/gemma-3-4b-it`) — Fits in 8GB, slower but capable
### Performance Comparison
Based on a 3-run benchmark (RTX 5080 16GB):
| Task Type | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best Model |
|---|---|---|---|---|
| Vision (photos) | 1.0 tok/s | 2.2 ± 0.3 tok/s | 29.3 ± 5.8 tok/s | Molmo 7B (13x) |
| Vision (charts) | 1.5 tok/s | 13.6 ± 1.0 tok/s | 32.7 ± 6.3 tok/s | Molmo 7B (2.4x) |
| Vision (data viz) | ~1 tok/s | ~10 tok/s | 28.8 ± 8.8 tok/s | Molmo 7B (2.9x) |
| Basic chat | 3.3 tok/s | 12.6 ± 1.4 tok/s | Not tested | Gemma 3 12B |
| Code generation | 5.0 tok/s | 16.0 ± 2.9 tok/s | Not tested | Gemma 3 12B |
| Overall Average | 2.5 tok/s | 10.3 ± 3.5 tok/s | 31.1 ± 5.9 tok/s | Molmo 7B (3x) |
**Performance Notes:**

- Molmo 7B dominates vision tasks (29-33 tok/s) — 3x faster than Gemma 3 12B overall
- Gemma 3 12B excels at text reasoning, code, and structured outputs (12-25 tok/s)
- Molmo 7B shows better run-to-run stability (19% CV) than Gemma 3 12B (34% CV)
- Complex reasoning on Gemma may show higher latency due to KV cache pressure
- Monitor the vLLM `/metrics` endpoint to track cache hits and GPU utilization (see the sketch below)
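
The `/metrics` endpoint serves Prometheus text format, so you can peek at the cache- and GPU-related series without a full Prometheus setup. A sketch using `requests`; exact metric names vary by vLLM version:

```python
import requests

# vLLM exposes Prometheus-format metrics at /metrics
text = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print only the lines that look relevant to KV cache and GPU usage
for line in text.splitlines():
    if line.startswith("#"):
        continue
    if "cache" in line or "gpu" in line:
        print(line)
```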
For a comprehensive list of models (including theoretical support), see the vLLM Backend Reference.
## Next Steps
- **Model Selection:** Check the vLLM Backend Reference for model options
- **Performance Monitoring:** See kanoa-mlops GPU Monitoring for Prometheus and Grafana setup
- **Infrastructure Details:** See the kanoa-mlops repository for advanced setup
- **Knowledge Bases:** See the Knowledge Bases Guide
- **Cost Tracking:** Understand Cost Management for local models
## Troubleshooting
### Server connection failed
Verify the server is running:

```bash
# Check server health
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models
```
Check logs:

```bash
# For direct vLLM process (Gemma 3)
ps aux | grep vllm

# For Docker (Molmo)
docker compose -f docker/vllm/docker-compose.molmo.yml logs -f
```
### Out of memory errors
If you hit VRAM limits:

```bash
# For Gemma 3 12B (reduce GPU memory allocation)
vllm serve google/gemma-3-12b-it --gpu-memory-utilization 0.85

# Or switch to 4B variant
vllm serve google/gemma-3-4b-it

# For Docker setups: use 4-bit quantization (default in configs)
# Reduce --max-model-len parameter in docker-compose.yml
```
See kanoa-mlops hardware guide for detailed memory optimization.
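
As a rough sanity check before tuning flags, you can estimate the weight footprint from parameter count and quantization width. This is a back-of-envelope sketch only: KV cache, activations, and CUDA overhead come on top, which is why the table above lists ~14GB for Gemma 3 12B rather than ~6GB:

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM used by model weights alone."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 4-bit quantization, weights only
print(round(weight_footprint_gb(12, 4), 1))  # ~6.0 GB for Gemma 3 12B
print(round(weight_footprint_gb(7, 4), 1))   # ~3.5 GB for Molmo 7B
```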
### GPU not detected
```bash
# Verify GPU detection
nvidia-smi

# For WSL2 users
# See kanoa-mlops/docs/source/wsl2-gpu-setup.md
```
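
The same check from Python looks like this (a sketch; it assumes PyTorch is installed, which is the case in any working vLLM environment):

```python
import torch

# True only if the CUDA driver and a compatible GPU are visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```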