# Getting Started with Local Inference
Run models locally with full control, privacy, and zero API costs. We recommend Molmo 7B for vision tasks (31 tok/s, 3x faster than Gemma) or Gemma 3 12B for text reasoning (10.3 tok/s, strong for code/JSON). The kanoa-mlops repository provides infrastructure for local hosting.
## Prerequisites
- Python 3.11 or higher
- kanoa installed (`pip install kanoa`)
- NVIDIA GPU (see hardware requirements)
- kanoa-mlops repository cloned
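
A quick preflight check for the Python-side prerequisites (a minimal sketch; it assumes kanoa is installed under the package name `kanoa` and that the NVIDIA driver's `nvidia-smi` is on your PATH):

```python
import sys
import shutil
from importlib.metadata import version, PackageNotFoundError

# Python 3.11+
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version}"

# kanoa installed
try:
    print("kanoa version:", version("kanoa"))
except PackageNotFoundError:
    print("kanoa not installed: run `pip install kanoa`")

# NVIDIA driver / GPU visible
print("nvidia-smi found:", shutil.which("nvidia-smi") is not None)
```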
## Quick Start
### Step 1: Set Up Infrastructure
Clone and set up the kanoa-mlops repository:
```bash
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops

# Create environment
conda env create -f environment.yml
conda activate kanoa-mlops
```
### Step 2: Download and Start Model
**Option A: Gemma 3 12B (Recommended for 16GB VRAM)**
```bash
# Start vLLM server (downloads model automatically)
vllm serve google/gemma-3-12b-it --port 8000
```
**Option B: Molmo 7B (Best for vision tasks)**
```bash
# Download Molmo 7B (verified working)
./scripts/download-models.sh molmo-7b-d

# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
```
The server will be available at `http://localhost:8000`.
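
Before wiring up kanoa, you can sanity-check the endpoint directly. A minimal sketch using `requests` (assumed to be available; the model ID returned depends on which option you started):

```python
import requests

# vLLM exposes a health probe and an OpenAI-compatible model listing
print(requests.get("http://localhost:8000/health", timeout=5).status_code)  # expect 200

models = requests.get("http://localhost:8000/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])  # e.g. ['google/gemma-3-12b-it']
```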
### Step 3: Connect kanoa to Local Server
```python
import numpy as np
import matplotlib.pyplot as plt
from kanoa import AnalyticsInterpreter

# Create sample data
x = np.linspace(0, 10, 100)
y = np.exp(-x/5) * np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title("Damped Oscillation")
plt.xlabel("Time")
plt.ylabel("Amplitude")

# Connect to local vLLM server (Gemma 3 12B)
interpreter = AnalyticsInterpreter(
    backend='openai',  # vLLM uses OpenAI-compatible API
    api_base='http://localhost:8000/v1',
    model='google/gemma-3-12b-it'  # Use whatever model you started
)

# Interpret the plot
result = interpreter.interpret(
    fig=plt.gcf(),
    context="Physics simulation results",
    focus="Describe the pattern and suggest what physical process this could represent"
)

print(result.text)
print(f"Tokens: {result.usage.total_tokens}, Cost: ${result.usage.cost:.4f}")
```
## Hardware Requirements
### Verified Working Configurations
| Model | VRAM Required | Hardware Tested | Avg Throughput |
|---|---|---|---|
| Molmo 7B | 12GB (4-bit) | NVIDIA RTX 5080 (16GB) | 31.1 tok/s (±5.9) |
| Gemma 3 12B | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (16GB) | 10.3 tok/s (±3.5) |
| Gemma 3 4B | 8GB (4-bit) | NVIDIA RTX 5080 (16GB) | 2.5 tok/s |
**Recommendations:**

- **Vision-focused:** Use Molmo 7B — it’s 3x faster than Gemma 3 12B (31 tok/s average)
- **Text reasoning/code:** Use Gemma 3 12B — better for structured outputs and multi-turn chat
- **Limited VRAM:** Use Gemma 3 4B — fits in 8GB but significantly slower
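
If you script model selection, the recommendations above reduce to a small lookup. A sketch only: `TASK_TO_MODEL` and `pick_model` are illustrative names, not part of kanoa or kanoa-mlops:

```python
# Illustrative mapping from task type to the verified model IDs listed below
TASK_TO_MODEL = {
    "vision": "allenai/Molmo-7B-D-0924",   # fastest for image/chart interpretation
    "text": "google/gemma-3-12b-it",       # best for code, JSON, multi-turn chat
    "low_vram": "google/gemma-3-4b-it",    # fits in 8GB, slower
}

def pick_model(task: str) -> str:
    """Return a verified model ID for the given task type."""
    return TASK_TO_MODEL.get(task, "google/gemma-3-12b-it")

print(pick_model("vision"))  # allenai/Molmo-7B-D-0924
```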
### Minimum Requirements
- **GPU:** NVIDIA GPU with CUDA support
- **VRAM:** 12GB minimum (for 7B models with 4-bit quantization)
- **Storage:** 20-30GB for model weights
- **RAM:** 16GB system RAM
- **PCIe:** 3.0 x4 or better (important for eGPU setups)
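
To check whether a machine meets the VRAM minimum, you can query the driver directly. A minimal sketch; it assumes `nvidia-smi` is installed, which it is wherever the NVIDIA driver is present:

```python
import subprocess

# Total VRAM per GPU in MiB, reported by the NVIDIA driver
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    name, mem = [part.strip() for part in line.split(",")]
    print(f"{name}: {mem}")  # e.g. "NVIDIA GeForce RTX 5080: 16303 MiB"
```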
### Tested Configurations
See vLLM Backend Reference for the complete list of tested hardware configurations.
## Supported Models
### Recommended Models (Verified)
**For 16GB VRAM:**

- ✅ Molmo 7B (`allenai/Molmo-7B-D-0924`) — Best for vision, 31 tok/s average, 3x faster than Gemma
- ✅ Gemma 3 12B (`google/gemma-3-12b-it`) — Best for text reasoning, 10.3 tok/s average

**For <16GB VRAM:**

- ✅ Gemma 3 4B (`google/gemma-3-4b-it`) — Fits in 8GB, slower but capable
### Performance Comparison
Based on a 3-run benchmark (RTX 5080 16GB):
| Task Type | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best Model |
|---|---|---|---|---|
| Vision (photos) | 1.0 tok/s | 2.2 ± 0.3 tok/s | 29.3 ± 5.8 tok/s | Molmo 7B (13x) |
| Vision (charts) | 1.5 tok/s | 13.6 ± 1.0 tok/s | 32.7 ± 6.3 tok/s | Molmo 7B (2.4x) |
| Vision (data viz) | ~1 tok/s | ~10 tok/s | 28.8 ± 8.8 tok/s | Molmo 7B (2.9x) |
| Basic chat | 3.3 tok/s | 12.6 ± 1.4 tok/s | Not tested | Gemma 3 12B |
| Code generation | 5.0 tok/s | 16.0 ± 2.9 tok/s | Not tested | Gemma 3 12B |
| Overall Average | 2.5 tok/s | 10.3 ± 3.5 tok/s | 31.1 ± 5.9 tok/s | Molmo 7B (3x) |
**Performance Notes:**

- Molmo 7B dominates vision tasks (29-33 tok/s) — 3x faster than Gemma 3 12B overall
- Gemma 3 12B excels at text reasoning, code, and structured outputs (12-25 tok/s)
- Molmo 7B shows better run-to-run stability (19% CV) than Gemma 3 12B (34% CV)
- Complex reasoning on Gemma may show higher latency due to KV cache pressure
- Monitor the vLLM `/metrics` endpoint to track cache hits and GPU utilization (see the sketch below)
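
The `/metrics` endpoint serves Prometheus text format, so you can peek at the cache- and GPU-related series without a full Prometheus setup. A sketch using `requests`; exact metric names vary by vLLM version:

```python
import requests

# vLLM exposes Prometheus-format metrics at /metrics
text = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print only the lines that look relevant to KV cache and GPU usage
for line in text.splitlines():
    if line.startswith("#"):
        continue
    if "cache" in line or "gpu" in line:
        print(line)
```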
For a comprehensive list of models (including theoretical support), see the vLLM Backend Reference.
## Next Steps
- **Model Selection:** Check the vLLM Backend Reference for model options
- **Performance Monitoring:** See kanoa-mlops GPU Monitoring for Prometheus and Grafana setup
- **Infrastructure Details:** See the kanoa-mlops repository for advanced setup
- **Knowledge Bases:** See the Knowledge Bases Guide
- **Cost Tracking:** Understand Cost Management for local models
## Troubleshooting
### Server connection failed
Verify the server is running:

```bash
# Check server health
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models
```
Check logs:

```bash
# For direct vLLM process (Gemma 3)
ps aux | grep vllm

# For Docker (Molmo)
docker compose -f docker/vllm/docker-compose.molmo.yml logs -f
```
### Out of memory errors
If you hit VRAM limits:

```bash
# For Gemma 3 12B (reduce GPU memory allocation)
vllm serve google/gemma-3-12b-it --gpu-memory-utilization 0.85

# Or switch to 4B variant
vllm serve google/gemma-3-4b-it

# For Docker setups: use 4-bit quantization (default in configs)
# Reduce --max-model-len parameter in docker-compose.yml
```
See kanoa-mlops hardware guide for detailed memory optimization.
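
As a rough sanity check before tuning flags, you can estimate the weight footprint from parameter count and quantization width. This is a back-of-envelope sketch only: KV cache, activations, and CUDA overhead come on top, which is why the table above lists ~14GB for Gemma 3 12B rather than ~6GB:

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM used by model weights alone."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 4-bit quantization, weights only
print(round(weight_footprint_gb(12, 4), 1))  # ~6.0 GB for Gemma 3 12B
print(round(weight_footprint_gb(7, 4), 1))   # ~3.5 GB for Molmo 7B
```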
### GPU not detected
```bash
# Verify GPU detection
nvidia-smi

# For WSL2 users
# See kanoa-mlops/docs/source/wsl2-gpu-setup.md
```
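
The same check from Python looks like this (a sketch; it assumes PyTorch is installed, which is the case in any working vLLM environment):

```python
import torch

# True only if the CUDA driver and a compatible GPU are visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```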