kanoa Integration Guide

This guide helps you integrate kanoa into your data science project with domain-specific customizations.

Quick Start

1. Install kanoa

pip install kanoa
# or add to requirements.txt
echo "kanoa>=0.1.0" >> requirements.txt

2. Basic Usage

from kanoa import AnalyticsInterpreter
import matplotlib.pyplot as plt

# Create interpreter
interpreter = AnalyticsInterpreter(backend='gemini')

# Interpret a plot
plt.plot(data)
result = interpreter.interpret(plt.gcf(), context="Your analysis context")
print(result.text)

Knowledge Base Setup

Option 1: Markdown Documentation

Place .md files in your docs/ directory:

your_project/
├── docs/
│   ├── methods.md
│   ├── background.md
│   └── glossary.md

kanoa will automatically load and use these for context.

Option 3: Mixed Content

your_project/
├── docs/
│   ├── methods.md
│   ├── glossary.md
│   └── refs/
│       ├── paper1.pdf
│       └── paper2.pdf

kanoa will auto-detect and use both.
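
In either case, pointing kb_path at the docs/ directory is all the configuration needed. A minimal sketch (the path is illustrative):

from kanoa import AnalyticsInterpreter

# Markdown files and PDFs found under ./docs are loaded automatically
# and used as knowledge-base context for interpretations.
interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs'
)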

Knowledge Base Strategies

kanoa supports multiple strategies for integrating domain knowledge, each optimized for different use cases:

Strategy 1: Context Stuffing (Default)

Best for: Small to medium knowledge bases (<200K tokens), simple setup

The default approach loads your entire knowledge base into the model’s context window. With Gemini 3 Pro’s 2M token context and context caching, this is cost-effective for most use cases.

interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    enable_caching=True  # Reuse KB across calls
)

Pros:

  • Simple setup, no additional configuration

  • Works with all content types (PDFs, markdown, code)

  • Leverages Gemini’s native vision for PDFs

  • Context caching makes it cost-effective

Cons:

  • Limited by context window size

  • All content loaded every time (even with caching)

  • May include irrelevant information

Cost: ~$0.02-0.05 per interpretation (with caching)

Strategy Comparison

| Strategy             | Setup    | Cost/Call  | KB Size Limit | Use Case                          |
|----------------------|----------|------------|---------------|-----------------------------------|
| Context Stuffing     | Simple   | $0.02-0.05 | ~800K tokens  | Small KBs, PDFs with figures      |
| RAG Engine           | Moderate | $0.01-0.02 | Unlimited     | Large KBs, production             |
| Google Search        | Simple   | $0.05-0.10 | N/A           | Current events, general knowledge |
| Hybrid (RAG+Search)  | Complex  | $0.03-0.08 | Unlimited     | Production, comprehensive         |

Choosing a Strategy

Use Context Stuffing if:

  • Your KB is <200K tokens (~150 pages)

  • You have PDFs with important figures/tables

  • You want the simplest setup

  • You’re prototyping

Use RAG Engine if:

  • Your KB is >500K tokens

  • You have many documents (>50 PDFs)

  • You need cost optimization

  • You’re deploying to production

  • You need to update KB frequently

Use Google Search if:

  • You need current information

  • You’re analyzing trends or recent events

  • You want verifiable web citations

  • Your domain has good web coverage

Use Hybrid if:

  • You need both domain expertise and current info

  • Cost is less important than comprehensiveness

  • You’re building a production research assistant

  • You need maximum accuracy

Use BigQuery if:

  • Your data is already in BigQuery

  • You need to query structured/tabular data

  • You’re analyzing time-series or sensor data

  • You want to combine historical data with current analysis

  • You have embeddings stored in BigQuery for semantic search

  • You’re in an enterprise environment with data warehouses
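
If you are deciding between the first two strategies programmatically, a minimal sketch might switch on an estimated knowledge base size. This is only a starting point; the 200K-token threshold, corpus name, and project settings below are illustrative assumptions, not kanoa defaults:

import os
from kanoa import AnalyticsInterpreter

def make_interpreter(kb_path, estimated_kb_tokens):
    """Context stuffing for small KBs, RAG Engine for large ones."""
    if estimated_kb_tokens < 200_000:
        # Small KB: default context stuffing with caching
        return AnalyticsInterpreter(
            backend='gemini',
            kb_path=kb_path,
            enable_caching=True,
        )
    # Large KB: retrieve only the relevant chunks via RAG Engine
    return AnalyticsInterpreter(
        backend='gemini',
        kb_path=kb_path,
        grounding_mode='rag_engine',
        rag_config={
            'project_id': os.environ.get('GCP_PROJECT_ID'),
            'location': 'us-central1',
            'corpus_display_name': 'my-kb',
            'top_k': 5,
        },
    )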

Advanced RAG Engine Configuration

Custom Chunking Strategy

Different content types benefit from different chunking strategies:

# For academic papers (preserve section context)
rag_config = {
    'chunk_size': 1024,  # Larger chunks for papers
    'chunk_overlap': 100,  # More overlap for continuity
    'chunking_strategy': 'semantic',  # Respect paragraph boundaries
}

# For code documentation (smaller, focused chunks)
rag_config = {
    'chunk_size': 256,
    'chunk_overlap': 25,
    'chunking_strategy': 'fixed',  # Fixed-size chunks
}

# For mixed content (adaptive)
rag_config = {
    'chunk_size': 512,
    'chunk_overlap': 50,
    'chunking_strategy': 'auto',  # Auto-detect best strategy
}
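
Whichever dict matches your content is then passed via the rag_config parameter alongside grounding_mode='rag_engine', as in the sections that follow:

interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    grounding_mode='rag_engine',
    rag_config=rag_config  # one of the dicts above
)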

Multi-Corpus Setup

Organize knowledge by topic or project:

from kanoa import AnalyticsInterpreter

# Methods corpus
methods_interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs/methods',
    grounding_mode='rag_engine',
    rag_config={
        'corpus_display_name': 'methods-kb',
        'top_k': 3
    }
)

# Literature corpus
literature_interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs/literature',
    grounding_mode='rag_engine',
    rag_config={
        'corpus_display_name': 'literature-kb',
        'top_k': 5
    }
)

# Use appropriate interpreter for each analysis
methods_interpreter.interpret(fig1, context="Methodology validation")
literature_interpreter.interpret(fig2, context="Compare with prior work")

Corpus Management

from kanoa.knowledge_base import VertexRAGKnowledgeBase

# Initialize corpus manager
kb = VertexRAGKnowledgeBase(
    project_id='my-project',
    location='us-central1',
    corpus_display_name='my-kb'
)

# List all documents in corpus
docs = kb.list_documents()
for doc in docs:
    print(f"{doc.display_name}: {doc.chunk_count} chunks")

# Update specific document
kb.update_document(
    document_name='paper_2024.pdf',
    new_path='./docs/paper_2024_revised.pdf'
)

# Delete outdated documents
kb.delete_document('old_paper_2020.pdf')

# Get corpus statistics
stats = kb.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total tokens: {stats['total_tokens']}")
print(f"Storage cost: ${stats['monthly_cost_usd']:.2f}/month")

Complete RAG Engine Example

Here’s a complete workflow for setting up kanoa with Vertex AI RAG Engine for a marine biology research project:

# marine_project/analysis/interpretation.py
from pathlib import Path
from kanoa import AnalyticsInterpreter
import os

class MarineBioInterpreter:
    """Marine biology interpreter with RAG Engine."""

    def __init__(self, use_rag_engine=True):
        project_root = Path(__file__).parent.parent.parent
        kb_path = project_root / "docs"

        if use_rag_engine:
            # Production: Use RAG Engine for large KB
            self.interpreter = AnalyticsInterpreter(
                backend='gemini',
                kb_path=kb_path,
                grounding_mode='rag_engine',
                rag_config={
                    'project_id': os.environ.get('GCP_PROJECT_ID'),
                    'location': 'us-central1',
                    'corpus_display_name': 'marine-bio-kb-2025',
                    'chunk_size': 512,
                    'chunk_overlap': 50,
                    'top_k': 5,
                    'similarity_threshold': 0.7
                },
                enable_caching=True,
                track_costs=True
            )
        else:
            # Development: Use context stuffing for simplicity
            self.interpreter = AnalyticsInterpreter(
                backend='gemini',
                kb_path=kb_path,
                enable_caching=True,
                track_costs=True
            )

    def interpret_dive_profile(self, fig, species=None, deployment_id=None):
        """Interpret dive profile with marine biology context."""
        context = f"Dive profile analysis"
        if species:
            context += f" for {species}"
        if deployment_id:
            context += f" (deployment: {deployment_id})"

        result = self.interpreter.interpret(
            fig=fig,
            context=context,
            focus="Dive frequency, depth range, behavioral patterns, anomalies"
        )

        # Show retrieved sources
        if 'retrieved_chunks' in result.metadata:
            print("\nGrounded with sources:")
            for chunk in result.metadata['retrieved_chunks']:
                print(f"  - {chunk['source']} (relevance: {chunk['score']:.2f})")

        return result

    def get_cost_summary(self):
        """Get interpretation cost summary."""
        summary = self.interpreter.get_cost_summary()

        # Add RAG-specific metrics
        if hasattr(self.interpreter.kb, 'get_stats'):
            rag_stats = self.interpreter.kb.get_stats()
            summary['rag_corpus_size'] = rag_stats['total_chunks']
            summary['rag_storage_cost_monthly'] = rag_stats['monthly_cost_usd']

        return summary

# Convenience function
def interpret(fig=None, **kwargs):
    return MarineBioInterpreter().interpreter.interpret(fig=fig, **kwargs)

Usage in Notebooks

from marine_project.analysis import MarineBioInterpreter
import matplotlib.pyplot as plt

# Initialize (one-time per session)
interpreter = MarineBioInterpreter(use_rag_engine=True)

# Interpret dive profiles
plt.figure(figsize=(12, 6))
plt.plot(time, depth)
plt.title("Whale Shark Dive Profile - Deployment RED001")

result = interpreter.interpret_dive_profile(
    fig=plt.gcf(),
    species="Whale Shark",
    deployment_id="RED001"
)

# View cost summary
summary = interpreter.get_cost_summary()
print(f"\nSession costs:")
print(f"  Total calls: {summary['total_calls']}")
print(f"  Total cost: ${summary['total_cost_usd']:.4f}")
print(f"  Avg per call: ${summary['avg_cost_per_call']:.4f}")
if 'rag_corpus_size' in summary:
    print(f"  RAG corpus: {summary['rag_corpus_size']} chunks")
    print(f"  Storage: ${summary['rag_storage_cost_monthly']:.2f}/month")

Migrating from Context Stuffing to RAG Engine

If you’re currently using context stuffing and want to migrate:

# Step 1: Test RAG Engine with existing code
interpreter_rag = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',  # Same KB path
    grounding_mode='rag_engine',
    rag_config={
        'project_id': 'my-project',
        'location': 'us-central1',
        'corpus_display_name': 'test-migration',
        'top_k': 5
    }
)

# Step 2: Compare results
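# (interpreter_context_stuffing refers to your existing default-mode interpreter)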
result_context = interpreter_context_stuffing.interpret(fig, context="Test")
result_rag = interpreter_rag.interpret(fig, context="Test")

print("Context stuffing cost:", result_context.usage.cost)
print("RAG Engine cost:", result_rag.usage.cost)

# Step 3: Validate accuracy (manual review)
# Compare interpretations for quality

# Step 4: Switch to RAG Engine in production
# Update your wrapper to use grounding_mode='rag_engine'

Claude Sonnet 4.5

  • Best for: Proven reliability, text-only knowledge bases

  • Requires: ANTHROPIC_API_KEY environment variable

  • Cost: ~$0.30 per interpretation

interpreter = AnalyticsInterpreter(backend='claude')

OpenAI GPT 5.1

  • Best for: Vector store integration

  • Requires: OPENAI_API_KEY environment variable

interpreter = AnalyticsInterpreter(backend='openai')

Molmo (Local)

  • Best for: Privacy-sensitive data, no API costs

  • Requires: GPU, local model download

interpreter = AnalyticsInterpreter(backend='molmo')

Advanced Configuration

Custom System Prompts

result = interpreter.interpret(
    fig=fig,
    context="Your context",
    custom_prompt="Analyze this plot focusing on X, Y, Z..."
)

Cost Tracking

interpreter = AnalyticsInterpreter(track_costs=True)

# ... multiple interpretations ...

summary = interpreter.get_cost_summary()
print(f"Total cost: ${summary['total_cost_usd']:.4f}")

Backend-Specific Options

# Gemini with high thinking level
interpreter = AnalyticsInterpreter(
    backend='gemini',
    thinking_level='high'
)

# Claude with specific model
interpreter = AnalyticsInterpreter(
    backend='claude',
    model='claude-sonnet-4-5-20250929'
)

Testing Your Integration

1. Unit Tests

import pytest
from your_project.analysis import YourProjectInterpreter

def test_interpreter_initialization():
    """Test wrapper initializes correctly."""
    interp = YourProjectInterpreter()
    assert interp.interpreter is not None

def test_domain_specific_method():
    """Test domain-specific convenience method."""
    interp = YourProjectInterpreter()
    # Mock or use test fixtures
    result = interp.interpret_your_viz_type(test_fig)
    assert result.text is not None

2. Integration Tests

@pytest.mark.integration
def test_real_interpretation(test_data):
    """Test with real API (requires API key)."""
    from your_project.analysis import interpret

    fig = create_test_plot(test_data)
    result = interpret(fig, context="Test")

    assert len(result.text) > 100
    assert result.usage.cost > 0

Best Practices

  1. Set API Keys in Environment

    export GOOGLE_API_KEY='your-key'
    export ANTHROPIC_API_KEY='your-key'
    export OPENAI_API_KEY='your-key'
    
  2. Use Context Caching for Repeated Calls

    # Initialize once, reuse for multiple interpretations
    interpreter = AnalyticsInterpreter(enable_caching=True)
    
    for fig in figures:
        result = interpreter.interpret(fig)
    
  3. Provide Specific Context

    # Good: Specific context
    result = interpreter.interpret(
        fig,
        context="Water temperature time series from Station A, July 2024",
        focus="Identify anomalies and trends"
    )
    
    # Less effective: Vague context
    result = interpreter.interpret(fig, context="Data")
    
  4. Organize Knowledge Base

    • Keep documentation up to date

    • Remove outdated PDFs

    • Use clear, descriptive filenames

Troubleshooting

“No module named ‘kanoa’”

pip install kanoa

“API key not found”

Set environment variables:

export GOOGLE_API_KEY='your-key'

“PDF knowledge base requires Gemini backend”

Either:

  • Switch to Gemini: backend='gemini'

  • Remove the PDFs from your knowledge base (or convert them to markdown) and keep your current backend

High costs

  • Enable caching: enable_caching=True

  • Use Gemini instead of Claude (10x cheaper with caching)

  • Reduce knowledge base size
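
A lower-cost setup combining these tips might look like the following sketch (adjust the path to your project):

# Gemini backend with context caching; keep ./docs trimmed to what you need
interpreter = AnalyticsInterpreter(
    backend='gemini',
    kb_path='./docs',
    enable_caching=True,
    track_costs=True  # confirm savings with get_cost_summary()
)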

Example: Complete Integration

Here’s a complete example for a marine biology project:

# marine_project/analysis/interpretation.py
from pathlib import Path
from kanoa import AnalyticsInterpreter


class MarineBioInterpreter:
    """Marine biology analysis interpreter."""

    def __init__(self, backend='gemini', **kwargs):
        project_root = Path(__file__).parent.parent.parent
        kb_path = project_root / "docs"

        self.interpreter = AnalyticsInterpreter(
            backend=backend,
            kb_path=kb_path,
            enable_caching=True,
            track_costs=True,
            **kwargs
        )

    def interpret_dive_profile(self, fig, species=None, deployment_id=None):
        """Interpret dive profile with marine biology context."""
        context = f"Dive profile analysis"
        if species:
            context += f" for {species}"
        if deployment_id:
            context += f" (deployment: {deployment_id})"

        return self.interpreter.interpret(
            fig=fig,
            context=context,
            focus="Dive frequency, depth range, behavioral patterns, anomalies"
        )

    def get_cost_summary(self):
        """Get interpretation cost summary."""
        return self.interpreter.get_cost_summary()


# Convenience function
def interpret(fig=None, **kwargs):
    return MarineBioInterpreter().interpreter.interpret(fig=fig, **kwargs)

Resources & References

Official Vertex AI RAG Engine Documentation

1. Vertex AI RAG Engine Quickstart (Start Here)

The fastest way to get your proprietary data indexed and connected to Gemini for Q&A.

  • Intro to Vertex AI RAG Engine

  • What it covers:

    • Create a RAG Corpus

    • Import files (PDFs, documents)

    • Configure RAG retrieval as a Tool for Gemini

    • Generate responses grounded in your data

2. Multimodal RAG Codelab (For PDFs with Plots/Tables)

Essential for handling documents with both text and images (plots, tables, charts).

  • Multimodal RAG using Gemini API

  • What it covers:

    • Extract and index text + images from PDFs

    • Generate multimodal embeddings

    • Retrieve relevant text chunks AND images

    • Pass both text and image context to Gemini

    • Advanced grounded reasoning over complex documents

3. Document AI Layout Parser (For Complex PDFs)

For PDFs with complex layouts (tables, multi-column text, charts).

  • Document AI Layout Parser Integration

  • What it covers:

    • Enable Document AI layout parser for RAG Corpus

    • Accurate parsing of visual elements and structure

    • Superior retrieval accuracy for complex documents

Additional Resources

kanoa Support