kanoa Integration Guide

This guide helps you integrate kanoa into your data science project with domain-specific customizations.

Quick Start

1. Install kanoa

pip install kanoa
# or add to requirements.txt
echo "kanoa>=0.1.0" >> requirements.txt

2. Basic Usage

from kanoa import AnalyticsInterpreter
import matplotlib.pyplot as plt

# Create interpreter
interpreter = AnalyticsInterpreter(backend='gemini')

# Interpret a plot
plt.plot(data)
result = interpreter.interpret(plt.gcf(), context="Your analysis context")
print(result.text)

Knowledge Base Setup

Option 1: Markdown Documentation

Place .md files in your docs/ directory:

your_project/
├── docs/
│   ├── methods.md
│   ├── background.md
│   └── glossary.md

kanoa will automatically load and use these for context.

Option 3: Mixed Content

your_project/
├── docs/
│   ├── methods.md
│   ├── glossary.md
│   └── refs/
│       ├── paper1.pdf
│       └── paper2.pdf

kanoa will auto-detect and use both.
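
In either case, pointing kb_path at the docs/ directory is all the configuration needed. A minimal sketch (the path is illustrative):

from kanoa import AnalyticsInterpreter

# Markdown files and PDFs found under ./docs are loaded automatically
# and used as knowledge-base context for interpretations.
interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs'
)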

Knowledge Base Strategies

kanoa supports multiple strategies for integrating domain knowledge, each optimized for different use cases:

Strategy 1: Context Stuffing (Default)

Best for: Small to medium knowledge bases (<200K tokens), simple setup

The default approach loads your entire knowledge base into the model’s context window. With Gemini 3 Pro’s 2M token context and context caching, this is cost-effective for most use cases.

interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    enable_caching=True  # Reuse KB across calls
)

Pros:

  • Simple setup, no additional configuration

  • Works with all content types (PDFs, markdown, code)

  • Leverages Gemini’s native vision for PDFs

  • Context caching makes it cost-effective

Cons:

  • Limited by context window size

  • All content loaded every time (even with caching)

  • May include irrelevant information

Cost: ~$0.02-0.05 per interpretation (with caching)

Strategy Comparison

| Strategy             | Setup    | Cost/Call  | KB Size Limit | Use Case                          |
|----------------------|----------|------------|---------------|-----------------------------------|
| Context Stuffing     | Simple   | $0.02-0.05 | ~800K tokens  | Small KBs, PDFs with figures      |
| RAG Engine           | Moderate | $0.01-0.02 | Unlimited     | Large KBs, production             |
| Google Search        | Simple   | $0.05-0.10 | N/A           | Current events, general knowledge |
| Hybrid (RAG+Search)  | Complex  | $0.03-0.08 | Unlimited     | Production, comprehensive         |

Choosing a Strategy

Use Context Stuffing if:

  • Your KB is <200K tokens (~150 pages)

  • You have PDFs with important figures/tables

  • You want the simplest setup

  • You’re prototyping

Use RAG Engine if:

  • Your KB is >500K tokens

  • You have many documents (>50 PDFs)

  • You need cost optimization

  • You’re deploying to production

  • You need to update KB frequently

Use Google Search if:

  • You need current information

  • You’re analyzing trends or recent events

  • You want verifiable web citations

  • Your domain has good web coverage

Use Hybrid if:

  • You need both domain expertise and current info

  • Cost is less important than comprehensiveness

  • You’re building a production research assistant

  • You need maximum accuracy

Use BigQuery if:

  • Your data is already in BigQuery

  • You need to query structured/tabular data

  • You’re analyzing time-series or sensor data

  • You want to combine historical data with current analysis

  • You have embeddings stored in BigQuery for semantic search

  • You’re in an enterprise environment with data warehouses
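
If you are deciding between the first two strategies programmatically, a minimal sketch might switch on an estimated knowledge base size. This is only a starting point; the 200K-token threshold, corpus name, and project settings below are illustrative assumptions, not kanoa defaults:

import os
from kanoa import AnalyticsInterpreter

def make_interpreter(kb_path, estimated_kb_tokens):
    """Context stuffing for small KBs, RAG Engine for large ones."""
    if estimated_kb_tokens < 200_000:
        # Small KB: default context stuffing with caching
        return AnalyticsInterpreter(
            backend='gemini',
            kb_path=kb_path,
            enable_caching=True,
        )
    # Large KB: retrieve only the relevant chunks via RAG Engine
    return AnalyticsInterpreter(
        backend='gemini',
        kb_path=kb_path,
        grounding_mode='rag_engine',
        rag_config={
            'project_id': os.environ.get('GCP_PROJECT_ID'),
            'location': 'us-central1',
            'corpus_display_name': 'my-kb',
            'top_k': 5,
        },
    )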

Advanced RAG Engine Configuration

Custom Chunking Strategy

Different content types benefit from different chunking strategies:

# For academic papers (preserve section context)
rag_config = {
    'chunk_size': 1024,  # Larger chunks for papers
    'chunk_overlap': 100,  # More overlap for continuity
    'chunking_strategy': 'semantic',  # Respect paragraph boundaries
}

# For code documentation (smaller, focused chunks)
rag_config = {
    'chunk_size': 256,
    'chunk_overlap': 25,
    'chunking_strategy': 'fixed',  # Fixed-size chunks
}

# For mixed content (adaptive)
rag_config = {
    'chunk_size': 512,
    'chunk_overlap': 50,
    'chunking_strategy': 'auto',  # Auto-detect best strategy
}
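
Whichever dict matches your content is then passed via the rag_config parameter alongside grounding_mode='rag_engine', as in the sections that follow:

interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    grounding_mode='rag_engine',
    rag_config=rag_config  # one of the dicts above
)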

Multi-Corpus Setup

Organize knowledge by topic or project:

from kanoa import AnalyticsInterpreter

# Methods corpus
methods_interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs/methods',
    grounding_mode='rag_engine',
    rag_config={
        'corpus_display_name': 'methods-kb',
        'top_k': 3
    }
)

# Literature corpus
literature_interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs/literature',
    grounding_mode='rag_engine',
    rag_config={
        'corpus_display_name': 'literature-kb',
        'top_k': 5
    }
)

# Use appropriate interpreter for each analysis
methods_interpreter.interpret(fig1, context="Methodology validation")
literature_interpreter.interpret(fig2, context="Compare with prior work")

Corpus Management

from kanoa.knowledge_base import VertexRAGKnowledgeBase

# Initialize corpus manager
kb = VertexRAGKnowledgeBase(
    project_id='my-project',
    location='us-central1',
    corpus_display_name='my-kb'
)

# List all documents in corpus
docs = kb.list_documents()
for doc in docs:
    print(f"{doc.display_name}: {doc.chunk_count} chunks")

# Update specific document
kb.update_document(
    document_name='paper_2024.pdf',
    new_path='./docs/paper_2024_revised.pdf'
)

# Delete outdated documents
kb.delete_document('old_paper_2020.pdf')

# Get corpus statistics
stats = kb.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total tokens: {stats['total_tokens']}")
print(f"Storage cost: ${stats['monthly_cost_usd']:.2f}/month")

Complete RAG Engine Example

Here’s a complete workflow for setting up kanoa with Vertex AI RAG Engine for a marine biology research project:

# marine_project/analysis/interpretation.py
from pathlib import Path
from kanoa import AnalyticsInterpreter
import os

class MarineBioInterpreter:
    """Marine biology interpreter with RAG Engine."""

    def __init__(self, use_rag_engine=True):
        project_root = Path(__file__).parent.parent.parent
        kb_path = project_root / "docs"

        if use_rag_engine:
            # Production: Use RAG Engine for large KB
            self.interpreter = AnalyticsInterpreter(
                backend='gemini',
                kb_path=kb_path,
                grounding_mode='rag_engine',
                rag_config={
                    'project_id': os.environ.get('GCP_PROJECT_ID'),
                    'location': 'us-central1',
                    'corpus_display_name': 'marine-bio-kb-2025',
                    'chunk_size': 512,
                    'chunk_overlap': 50,
                    'top_k': 5,
                    'similarity_threshold': 0.7
                },
                enable_caching=True,
                track_costs=True
            )
        else:
            # Development: Use context stuffing for simplicity
            self.interpreter = AnalyticsInterpreter(
                backend='gemini',
                kb_path=kb_path,
                enable_caching=True,
                track_costs=True
            )

    def interpret_dive_profile(self, fig, species=None, deployment_id=None):
        """Interpret dive profile with marine biology context."""
        context = f"Dive profile analysis"
        if species:
            context += f" for {species}"
        if deployment_id:
            context += f" (deployment: {deployment_id})"

        result = self.interpreter.interpret(
            fig=fig,
            context=context,
            focus="Dive frequency, depth range, behavioral patterns, anomalies"
        )

        # Show retrieved sources
        if 'retrieved_chunks' in result.metadata:
            print("\nGrounded with sources:")
            for chunk in result.metadata['retrieved_chunks']:
                print(f"  - {chunk['source']} (relevance: {chunk['score']:.2f})")

        return result

    def get_cost_summary(self):
        """Get interpretation cost summary."""
        summary = self.interpreter.get_cost_summary()

        # Add RAG-specific metrics
        if hasattr(self.interpreter.kb, 'get_stats'):
            rag_stats = self.interpreter.kb.get_stats()
            summary['rag_corpus_size'] = rag_stats['total_chunks']
            summary['rag_storage_cost_monthly'] = rag_stats['monthly_cost_usd']

        return summary

# Convenience function
def interpret(fig=None, **kwargs):
    return MarineBioInterpreter().interpreter.interpret(fig=fig, **kwargs)

Usage in Notebooks

from marine_project.analysis import MarineBioInterpreter
import matplotlib.pyplot as plt

# Initialize (one-time per session)
interpreter = MarineBioInterpreter(use_rag_engine=True)

# Interpret dive profiles
plt.figure(figsize=(12, 6))
plt.plot(time, depth)
plt.title("Whale Shark Dive Profile - Deployment RED001")

result = interpreter.interpret_dive_profile(
    fig=plt.gcf(),
    species="Whale Shark",
    deployment_id="RED001"
)

# View cost summary
summary = interpreter.get_cost_summary()
print(f"\nSession costs:")
print(f"  Total calls: {summary['total_calls']}")
print(f"  Total cost: ${summary['total_cost_usd']:.4f}")
print(f"  Avg per call: ${summary['avg_cost_per_call']:.4f}")
if 'rag_corpus_size' in summary:
    print(f"  RAG corpus: {summary['rag_corpus_size']} chunks")
    print(f"  Storage: ${summary['rag_storage_cost_monthly']:.2f}/month")

Migrating from Context Stuffing to RAG Engine

If you’re currently using context stuffing and want to migrate:

# Step 1: Test RAG Engine with existing code
interpreter_rag = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',  # Same KB path
    grounding_mode='rag_engine',
    rag_config={
        'project_id': 'my-project',
        'location': 'us-central1',
        'corpus_display_name': 'test-migration',
        'top_k': 5
    }
)

# Step 2: Compare results
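# (interpreter_context_stuffing refers to your existing default-mode interpreter)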
result_context = interpreter_context_stuffing.interpret(fig, context="Test")
result_rag = interpreter_rag.interpret(fig, context="Test")

print("Context stuffing cost:", result_context.usage.cost)
print("RAG Engine cost:", result_rag.usage.cost)

# Step 3: Validate accuracy (manual review)
# Compare interpretations for quality

# Step 4: Switch to RAG Engine in production
# Update your wrapper to use grounding_mode='rag_engine'

Claude Sonnet 4.5

  • Best for: Proven reliability, text-only knowledge bases

  • Requires: ANTHROPIC_API_KEY environment variable

  • Cost: ~$0.30 per interpretation

interpreter = AnalyticsInterpreter(backend='claude')

OpenAI GPT 5.1

  • Best for: Vector store integration

  • Requires: OPENAI_API_KEY environment variable

interpreter = AnalyticsInterpreter(backend='openai')

Molmo (Local)

  • Best for: Privacy-sensitive data, no API costs

  • Requires: GPU, local model download

interpreter = AnalyticsInterpreter(backend='molmo')

Advanced Configuration

Custom System Prompts

result = interpreter.interpret(
    fig=fig,
    context="Your context",
    custom_prompt="Analyze this plot focusing on X, Y, Z..."
)

Cost Tracking

interpreter = AnalyticsInterpreter(track_costs=True)

# ... multiple interpretations ...

summary = interpreter.get_cost_summary()
print(f"Total cost: ${summary['total_cost_usd']:.4f}")

Backend-Specific Options

# Gemini with high thinking level
interpreter = AnalyticsInterpreter(
    backend='gemini',
    thinking_level='high'
)

# Claude with specific model
interpreter = AnalyticsInterpreter(
    backend='claude',
    model='claude-sonnet-4-5-20250929'
)

Testing Your Integration

1. Unit Tests

import pytest
from your_project.analysis import YourProjectInterpreter

def test_interpreter_initialization():
    """Test wrapper initializes correctly."""
    interp = YourProjectInterpreter()
    assert interp.interpreter is not None

def test_domain_specific_method():
    """Test domain-specific convenience method."""
    interp = YourProjectInterpreter()
    # Mock or use test fixtures
    result = interp.interpret_your_viz_type(test_fig)
    assert result.text is not None

2. Integration Tests

@pytest.mark.integration
def test_real_interpretation(test_data):
    """Test with real API (requires API key)."""
    from your_project.analysis import interpret

    fig = create_test_plot(test_data)
    result = interpret(fig, context="Test")

    assert len(result.text) > 100
    assert result.usage.cost > 0

Best Practices

  1. Set API Keys in Environment

    export GOOGLE_API_KEY='your-key'
    export ANTHROPIC_API_KEY='your-key'
    export OPENAI_API_KEY='your-key'
    
  2. Use Context Caching for Repeated Calls

    # Initialize once, reuse for multiple interpretations
    interpreter = AnalyticsInterpreter(enable_caching=True)
    
    for fig in figures:
        result = interpreter.interpret(fig)
    
  3. Provide Specific Context

    # Good: Specific context
    result = interpreter.interpret(
        fig,
        context="Water temperature time series from Station A, July 2024",
        focus="Identify anomalies and trends"
    )
    
    # Less effective: Vague context
    result = interpreter.interpret(fig, context="Data")
    
  4. Organize Knowledge Base

    • Keep documentation up to date

    • Remove outdated PDFs

    • Use clear, descriptive filenames

Troubleshooting

“No module named ‘kanoa’”

pip install kanoa

“API key not found”

Set environment variables:

export GOOGLE_API_KEY='your-key'

“PDF knowledge base requires Gemini backend”

Either:

  • Switch to Gemini: backend='gemini'

  • Remove the PDFs from your knowledge base (or convert them to markdown) and keep your current backend

High costs

  • Enable caching: enable_caching=True

  • Use Gemini instead of Claude (10x cheaper with caching)

  • Reduce knowledge base size
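
A lower-cost setup combining these tips might look like the following sketch (adjust the path to your project):

# Gemini backend with context caching; keep ./docs trimmed to what you need
interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    enable_caching=True,
    track_costs=True  # confirm savings with get_cost_summary()
)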

Example: Complete Integration

Here’s a complete example for a marine biology project:

# marine_project/analysis/interpretation.py
from pathlib import Path
from kanoa import AnalyticsInterpreter


class MarineBioInterpreter:
    """Marine biology analysis interpreter."""

    def __init__(self, backend='gemini', **kwargs):
        project_root = Path(__file__).parent.parent.parent
        kb_path = project_root / "docs"

        self.interpreter = AnalyticsInterpreter(
            backend=backend,
            kb_path=kb_path,
            enable_caching=True,
            track_costs=True,
            **kwargs
        )

    def interpret_dive_profile(self, fig, species=None, deployment_id=None):
        """Interpret dive profile with marine biology context."""
        context = f"Dive profile analysis"
        if species:
            context += f" for {species}"
        if deployment_id:
            context += f" (deployment: {deployment_id})"

        return self.interpreter.interpret(
            fig=fig,
            context=context,
            focus="Dive frequency, depth range, behavioral patterns, anomalies"
        )

    def get_cost_summary(self):
        """Get interpretation cost summary."""
        return self.interpreter.get_cost_summary()


# Convenience function
def interpret(fig=None, **kwargs):
    return MarineBioInterpreter().interpreter.interpret(fig=fig, **kwargs)

Resources & References

Official Vertex AI RAG Engine Documentation

1. Vertex AI RAG Engine Quickstart (Start Here)

The fastest way to get your proprietary data indexed and connected to Gemini for Q&A.

  • Intro to Vertex AI RAG Engine

  • What it covers:

    • Create a RAG Corpus

    • Import files (PDFs, documents)

    • Configure RAG retrieval as a Tool for Gemini

    • Generate responses grounded in your data

2. Multimodal RAG Codelab (For PDFs with Plots/Tables)

Essential for handling documents with both text and images (plots, tables, charts).

  • Multimodal RAG using Gemini API

  • What it covers:

    • Extract and index text + images from PDFs

    • Generate multimodal embeddings

    • Retrieve relevant text chunks AND images

    • Pass both text and image context to Gemini

    • Advanced grounded reasoning over complex documents

3. Document AI Layout Parser (For Complex PDFs)

For PDFs with complex layouts (tables, multi-column text, charts).

  • Document AI Layout Parser Integration

  • What it covers:

    • Enable Document AI layout parser for RAG Corpus

    • Accurate parsing of visual elements and structure

    • Superior retrieval accuracy for complex documents

Additional Resources

kanoa Support