# Getting Started with Local Inference

Run models locally with full control, privacy, and zero API costs. We recommend **Molmo 7B** for vision tasks (31 tok/s, 3x faster than Gemma) or **Gemma 3 12B** for text reasoning (10.3 tok/s, strong for code/JSON). The `kanoa-mlops` repository provides infrastructure for local hosting.

## Prerequisites

- Python 3.11 or higher
- kanoa installed (`pip install kanoa`)
- NVIDIA GPU (see [hardware requirements](#hardware-requirements))
- kanoa-mlops repository cloned

## Quick Start

### Step 1: Set Up Infrastructure

Clone and set up the `kanoa-mlops` repository:

```bash
git clone https://github.com/lhzn-io/kanoa-mlops.git
cd kanoa-mlops

# Create environment
conda env create -f environment.yml
conda activate kanoa-mlops
```

### Step 2: Download and Start Model

#### Option A: Gemma 3 12B (Recommended for 16GB VRAM)

```bash
# Start vLLM server (downloads model automatically)
vllm serve google/gemma-3-12b-it --port 8000
```

#### Option B: Molmo 7B (Best for vision tasks)

```bash
# Download Molmo 7B (verified working)
./scripts/download-models.sh molmo-7b-d

# Start vLLM server
docker compose -f docker/vllm/docker-compose.molmo.yml up -d
```

The server will be available at `http://localhost:8000`.

### Step 3: Connect kanoa to Local Server

```python
import numpy as np
import matplotlib.pyplot as plt

from kanoa import AnalyticsInterpreter

# Create sample data
x = np.linspace(0, 10, 100)
y = np.exp(-x/5) * np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title("Damped Oscillation")
plt.xlabel("Time")
plt.ylabel("Amplitude")

# Connect to local vLLM server (Gemma 3 12B)
interpreter = AnalyticsInterpreter(
    backend='openai',  # vLLM exposes an OpenAI-compatible API
    api_base='http://localhost:8000/v1',
    model='google/gemma-3-12b-it'  # Use whichever model you started
)

# Interpret the plot
result = interpreter.interpret(
    fig=plt.gcf(),
    context="Physics simulation results",
    focus="Describe the pattern and suggest what physical process this could represent"
)

print(result.text)
print(f"Tokens: {result.usage.total_tokens}, Cost: ${result.usage.cost:.4f}")
```

## Hardware Requirements

### Verified Working Configurations

| Model | VRAM Required | Hardware Tested | Avg Throughput |
|-------|---------------|-----------------|----------------|
| **Molmo 7B** | 12GB (4-bit) | NVIDIA RTX 5080 (16GB) | **31.1 tok/s** (±5.9) |
| **Gemma 3 12B** | 14GB (4-bit + FP8 KV) | NVIDIA RTX 5080 (16GB) | **10.3 tok/s** (±3.5) |
| **Gemma 3 4B** | 8GB (4-bit) | NVIDIA RTX 5080 (16GB) | 2.5 tok/s |

**Recommendations**:

- **Vision-focused**: Use **Molmo 7B** — it's **3x faster** than Gemma 3 12B (31 tok/s average)
- **Text reasoning/code**: Use **Gemma 3 12B** — better for structured outputs and multi-turn chat
- **Limited VRAM**: Use **Gemma 3 4B** — fits in 8GB but is significantly slower

### Minimum Requirements

- **GPU**: NVIDIA GPU with CUDA support
- **VRAM**: 12GB minimum (for 7B models with 4-bit quantization)
- **Storage**: 20-30GB for model weights
- **RAM**: 16GB system RAM
- **PCIe**: 3.0 x4 or better (important for eGPU setups)

### Tested Configurations

See the [vLLM Backend Reference](../backends/vllm.md#tested-models) for the complete list of tested hardware configurations.
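If you are not sure which of the configurations above your machine matches, a quick VRAM check can guide the model choice. The snippet below is a minimal sketch, not part of kanoa; it assumes a CUDA-enabled PyTorch build is installed, and the thresholds simply mirror the table in this section.

```python
import torch

# Rough guide only: map detected VRAM to the models verified above.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected (see the Troubleshooting section).")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3

if vram_gb >= 14:
    suggestion = "Gemma 3 12B (or Molmo 7B for vision)"
elif vram_gb >= 12:
    suggestion = "Molmo 7B (4-bit)"
elif vram_gb >= 8:
    suggestion = "Gemma 3 4B (4-bit)"
else:
    suggestion = "below the documented minimum; consider a hosted backend"

print(f"{props.name}: {vram_gb:.1f} GB VRAM -> suggested starting point: {suggestion}")
```

The same information is available from `nvidia-smi`; the script is simply a convenience when you are already working inside the Python environment where kanoa runs.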
## Supported Models

### Recommended Models (Verified)

**For 16GB VRAM:**

- ✅ **Molmo 7B** (`allenai/Molmo-7B-D-0924`) — Best for vision, 31 tok/s average, 3x faster than Gemma
- ✅ **Gemma 3 12B** (`google/gemma-3-12b-it`) — Best for text reasoning, 10.3 tok/s average

**For <16GB VRAM:**

- ✅ **Gemma 3 4B** (`google/gemma-3-4b-it`) — Fits in 8GB, slower but capable

### Performance Comparison

Based on a 3-run benchmark (RTX 5080 16GB):

| Task Type | Gemma 3 4B | Gemma 3 12B | Molmo 7B | Best Model |
|-----------|------------|-------------|----------|------------|
| **Vision (photos)** | 1.0 tok/s | 2.2 ± 0.3 tok/s | **29.3 ± 5.8 tok/s** | **Molmo 7B (13x)** |
| **Vision (charts)** | 1.5 tok/s | 13.6 ± 1.0 tok/s | **32.7 ± 6.3 tok/s** | **Molmo 7B (2.4x)** |
| **Vision (data viz)** | ~1 tok/s | ~10 tok/s | **28.8 ± 8.8 tok/s** | **Molmo 7B (2.9x)** |
| **Basic chat** | 3.3 tok/s | **12.6 ± 1.4 tok/s** | Not tested | **Gemma 3 12B** |
| **Code generation** | 5.0 tok/s | **16.0 ± 2.9 tok/s** | Not tested | **Gemma 3 12B** |
| **Overall Average** | 2.5 tok/s | 10.3 ± 3.5 tok/s | **31.1 ± 5.9 tok/s** | **Molmo 7B (3x)** |

**Performance Notes**:

- **Molmo 7B dominates vision tasks** (29-33 tok/s) — 3x faster than Gemma 3 12B overall
- **Gemma 3 12B excels at text** reasoning, code, and structured outputs (12-16 tok/s)
- **Molmo 7B has more consistent throughput** (19% coefficient of variation vs 34% for Gemma 3 12B)
- Complex reasoning prompts on Gemma may show higher latency due to KV cache pressure
- Monitor the vLLM `/metrics` endpoint to track cache hits and GPU utilization

For a comprehensive list of models (including theoretical support), see the [vLLM Backend Reference](../backends/vllm.md).

## Next Steps

- **Model Selection**: Check the [vLLM Backend Reference](../backends/vllm.md) for model options
- **Performance Monitoring**: See [kanoa-mlops GPU Monitoring](https://github.com/lhzn-io/kanoa-mlops/blob/main/docs/source/gpu-monitoring.md) for Prometheus and Grafana setup
- **Infrastructure Details**: See the [kanoa-mlops repository](https://github.com/lhzn-io/kanoa-mlops) for advanced setup
- **Knowledge Bases**: Learn about knowledge bases in the [Knowledge Bases Guide](knowledge_bases.md)
- **Cost Tracking**: Understand [Cost Management](cost_management.md) for local models

## Troubleshooting

### Server connection failed

Verify the server is running:

```bash
# Check server health
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models
```

Check logs:

```bash
# For a direct vLLM process (Gemma 3)
ps aux | grep vllm

# For Docker (Molmo)
docker compose -f docker/vllm/docker-compose.molmo.yml logs -f
```

### Out of memory errors

If you hit VRAM limits:

```bash
# For Gemma 3 12B (reduce GPU memory allocation)
vllm serve google/gemma-3-12b-it --gpu-memory-utilization 0.85

# Or switch to the 4B variant
vllm serve google/gemma-3-4b-it

# For Docker setups: 4-bit quantization is the default in the configs;
# reduce the --max-model-len parameter in docker-compose.yml if needed
```

See the [kanoa-mlops hardware guide](https://github.com/lhzn-io/kanoa-mlops#hardware-testing-roadmap) for detailed memory optimization.

### GPU not detected

```bash
# Verify GPU detection
nvidia-smi

# For WSL2 users
# See kanoa-mlops/docs/source/wsl2-gpu-setup.md
```
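If the shell checks above look fine but kanoa still cannot connect, it can help to run the same diagnostics from the Python environment where kanoa is installed. The snippet below is a minimal sketch, not part of kanoa; it assumes the `requests` package and a CUDA-enabled PyTorch build are available and that the server uses the default port from this guide.

```python
import requests
import torch

BASE_URL = "http://localhost:8000"  # adjust if you started vLLM on another port

# 1. Is the GPU visible from Python? (complements `nvidia-smi`)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible from Python (check drivers / WSL2 setup)")

# 2. Is the vLLM server reachable? /health returns HTTP 200 when the server is up.
try:
    status = requests.get(f"{BASE_URL}/health", timeout=5).status_code
    print(f"/health -> HTTP {status}")

    # 3. Which model is being served? The `id` field is the value to pass as
    #    `model=` when constructing AnalyticsInterpreter.
    for entry in requests.get(f"{BASE_URL}/v1/models", timeout=5).json().get("data", []):
        print("Serving:", entry["id"])
except requests.ConnectionError:
    print(f"Could not reach {BASE_URL}; is the vLLM server running?")
```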