# PageIndex - Vectorless Reasoning-based RAG

**Key Point:** Vectorless, reasoning-based RAG that simulates how a human expert navigates a document. No vector DB, no chunking: just hierarchical tree-based retrieval, reaching 98.7% accuracy on FinanceBench.
## 🚀 Quick Start

### Installation

```bash
pip3 install --upgrade -r requirements.txt
```

### Set API Key

Create a `.env` file:

```
CHATGPT_API_KEY=your_openai_key_here
```
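`run_pageindex.py` reads this key from the environment. If you want to load the `.env` file in your own scripts without adding a dependency, a minimal stdlib parser like the sketch below works (the `python-dotenv` package does the same thing more robustly; this is only an illustration):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ, skipping comments/blanks."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
api_key = os.environ.get("CHATGPT_API_KEY")
```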
### Run PageIndex

```bash
# Process a PDF document
python3 run_pageindex.py --pdf_path /path/to/document.pdf

# Process Markdown (experimental)
python3 run_pageindex.py --md_path /path/to/document.md
```
## 🎯 Core Concepts

### The Problem with Vector-Based RAG

Traditional RAG relies on similarity, not relevance:

- ❌ **Similarity ≠ Relevance**: semantic similarity often misses the truly relevant content
- ❌ **Chunking breaks context**: artificial chunks lose document structure
- ❌ **Opaque retrieval**: "vibe retrieval" with no clear reasoning path
- ❌ **Poor explainability**: hard to trace why specific content was retrieved

### The PageIndex Solution

Reasoning-based retrieval through a tree structure:

- ✅ **Generate a tree index**: create a hierarchical "table of contents" structure
- ✅ **Tree-search retrieval**: an LLM reasons through the tree to find relevant sections
- ✅ **Human-like navigation**: simulates expert document exploration
- ✅ **Traceable & explainable**: clear reasoning path with page/section references
## 📊 Key Features

1. **No vector database**: uses document structure and LLM reasoning instead of vector similarity search.
2. **No chunking**: documents are organized into their natural sections, not artificial chunks.
3. **Human-like retrieval**: simulates how human experts navigate and extract knowledge from complex documents.
4. **Better explainability**: retrieval is based on reasoning, so it is traceable and interpretable, with page references.
5. **State-of-the-art accuracy**: 98.7% accuracy on the FinanceBench benchmark for financial document analysis.
## 🌲 Tree Structure Format

PageIndex transforms a PDF document into a semantic tree structure:

```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
```
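A tree like the one above can be walked with a few lines of Python. This sketch (field names taken from the structure shown) returns an indented outline with page ranges, one line per node:

```python
def outline(node, depth=0):
    """Return one 'title (pages start-end)' line per node, indented two spaces per level."""
    lines = [("  " * depth) + f"{node['title']} (pages {node['start_index']}-{node['end_index']})"]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines

# The sample tree from above, abbreviated
tree = {
    "title": "Financial Stability", "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation", "start_index": 28, "end_index": 31},
    ],
}
print("\n".join(outline(tree)))
```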
## ⚙️ Command Line Options

### Basic Usage

```bash
# Process a PDF
python3 run_pageindex.py --pdf_path document.pdf

# Process Markdown
python3 run_pageindex.py --md_path document.md
```

### Optional Parameters

#### Model Configuration

```bash
# Specify the OpenAI model (default: gpt-4o-2024-11-20)
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o-mini
```

#### Tree Structure Options

```bash
# Pages to check for a table of contents (default: 20)
--toc-check-pages 30

# Max pages per node (default: 10)
--max-pages-per-node 15

# Max tokens per node (default: 20000)
--max-tokens-per-node 25000

# Add node IDs (yes/no, default: yes)
--if-add-node-id yes

# Add node summaries (yes/no, default: yes)
--if-add-node-summary yes

# Add a document description (yes/no, default: yes)
--if-add-doc-description yes
```

### Complete Example

```bash
python3 run_pageindex.py \
    --pdf_path financial_report.pdf \
    --model gpt-4o-2024-11-20 \
    --toc-check-pages 25 \
    --max-pages-per-node 12 \
    --max-tokens-per-node 22000 \
    --if-add-node-id yes \
    --if-add-node-summary yes \
    --if-add-doc-description yes
```
## 💡 Common Use Cases

### 1. Financial Document Analysis

```bash
# Analyze SEC filings, earnings reports, 10-K/10-Q forms
python3 run_pageindex.py --pdf_path sec_10k.pdf

# Then query the generated tree, e.g.:
# "What were the key financial risks disclosed in the 10-K?"
```

### 2. Legal Document Review

```bash
# Process legal contracts, regulations, case files
python3 run_pageindex.py --pdf_path legal_contract.pdf

# Reasoning-based retrieval, e.g.:
# "Find all clauses related to liability and indemnification"
```

### 3. Academic Research

```bash
# Process research papers, textbooks, dissertations
python3 run_pageindex.py --pdf_path research_paper.pdf

# Navigate complex academic content, e.g.:
# "Explain the methodology section and key findings"
```

### 4. Technical Documentation

```bash
# Process API docs, technical manuals, specifications
python3 run_pageindex.py --pdf_path technical_manual.pdf

# Retrieve specific technical details, e.g.:
# "What are the system requirements and setup procedures?"
```

### 5. Regulatory Compliance

```bash
# Process compliance documents, policy manuals
python3 run_pageindex.py --pdf_path compliance_policy.pdf

# Find relevant compliance requirements, e.g.:
# "List all data privacy requirements and GDPR clauses"
```
## 🔧 Integration Examples

### Using PageIndex Output

After running PageIndex, you get a tree-structure JSON:

```python
import json

# Load the PageIndex output
with open('pageindex_output.json', 'r') as f:
    tree = json.load(f)

def find_section(tree, keyword):
    """Find sections whose title contains the keyword."""
    results = []

    def search(node):
        if keyword.lower() in node.get('title', '').lower():
            results.append({
                'title': node['title'],
                'node_id': node['node_id'],
                'pages': f"{node['start_index']}-{node['end_index']}",
                'summary': node.get('summary', '')
            })
        for child in node.get('nodes', []):
            search(child)

    search(tree)
    return results

# Find financial-risk sections
risk_sections = find_section(tree, 'risk')
print(json.dumps(risk_sections, indent=2))
```
### Reasoning-based Retrieval

```python
from openai import OpenAI
import json

client = OpenAI()

# Load the PageIndex tree
with open('pageindex_output.json', 'r') as f:
    tree = json.load(f)

def reason_through_tree(query, tree):
    """Use an LLM to reason through the tree and find relevant sections."""
    prompt = f"""
    Given this document tree structure:
    {json.dumps(tree, indent=2)}

    Query: {query}

    Reason through the tree to find the most relevant sections.
    Provide:
    1. The reasoning path you took
    2. The relevant node IDs
    3. The page ranges to examine
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a document navigation expert."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

# Example usage
query = "What are the key financial risks mentioned?"
reasoning = reason_through_tree(query, tree)
print(reasoning)
```
### Building a RAG System

```python
from openai import OpenAI
import json

client = OpenAI()

class PageIndexRAG:
    def __init__(self, tree_path, pdf_path):
        with open(tree_path, 'r') as f:
            self.tree = json.load(f)
        self.pdf_path = pdf_path

    def retrieve(self, query):
        """Retrieve relevant content based on the query."""
        # Step 1: reason through the tree
        prompt = f"""
        Document tree: {json.dumps(self.tree, indent=2)}
        Query: {query}
        Navigate the tree to find relevant sections.
        Return node IDs and page ranges.
        """
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        # Step 2: extract content from the identified pages
        # (implementation depends on your PDF library)
        return response.choices[0].message.content

    def answer(self, query):
        """Answer a query using the retrieved content."""
        # Retrieve relevant content
        context = self.retrieve(query)
        # Generate the answer
        prompt = f"""
        Context: {context}
        Query: {query}
        Provide a detailed answer based on the context.
        """
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

# Usage
rag = PageIndexRAG('pageindex_output.json', 'document.pdf')
answer = rag.answer("What were the Q4 earnings results?")
print(answer)
```
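Step 2 of `retrieve` is left open above. As one illustration, assuming you have already extracted per-page text with whatever PDF library you use, mapping a node's page range onto that list is a one-liner; the `pages_text` list and the 1-based, inclusive range convention here are assumptions for the sketch:

```python
def extract_range(pages_text, start_index, end_index):
    """Concatenate the text of pages start_index..end_index (1-based, inclusive)."""
    return "\n".join(pages_text[start_index - 1:end_index])

# Hypothetical per-page text, e.g. produced by a PDF library
pages_text = ["page one text", "page two text", "page three text"]
print(extract_range(pages_text, 2, 3))
```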
## 📊 PageIndex vs Vector RAG
| Feature | PageIndex | Vector-based RAG |
|---|---|---|
| Retrieval Method | Reasoning-based tree search | Semantic similarity |
| Document Structure | Natural sections preserved | Artificial chunks |
| Vector Database | ❌ Not needed | ✅ Required |
| Explainability | ✅ Clear reasoning path | ❌ Opaque similarity |
| Context Preservation | ✅ Hierarchical structure | ❌ Lost in chunks |
| Accuracy (FinanceBench) | 98.7% | Lower |
| Setup Complexity | Low (just LLM) | High (vector DB + embeddings) |
| Traceability | ✅ Page/section references | ⚠️ Chunk IDs only |
## 🎓 Advanced Features

### Document Search Strategy

PageIndex enables two retrieval modes:

- **Document Search**: find relevant documents in a collection
- **Tree Search**: navigate within a single document's tree structure

```python
import json

def search_documents(query, document_trees):
    """Search across multiple documents (sketch: the LLM call is omitted)."""
    relevant_docs = []
    for doc_name, tree in document_trees.items():
        # Use an LLM to assess document relevance
        prompt = f"""
        Document: {doc_name}
        Tree structure: {json.dumps(tree, indent=2)[:500]}...
        Query: {query}
        Is this document relevant? (yes/no)
        If yes, which sections?
        """
        # Send `prompt` to an LLM, parse the answer, and collect relevant docs
        # ...
    return relevant_docs

# Tree search within a selected document
def tree_search(query, tree):
    """Navigate the tree structure for specific content."""
    # Reasoning-based navigation through the tree (see the retrieval example above)
    # ...
    pass
```
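For a runnable illustration of the descent itself, here is a minimal sketch in which a crude keyword-overlap score stands in for the LLM relevance judgment (in PageIndex proper, an LLM makes that call). It walks down one level at a time, always following the best-scoring child, and stops when no child looks relevant:

```python
def keyword_score(query, node):
    """Stand-in for an LLM relevance call: count query words found in title + summary."""
    text = (node.get("title", "") + " " + node.get("summary", "")).lower()
    return sum(1 for word in query.lower().split() if word in text)

def descend(query, node, path=None):
    """Walk down the tree, following the best-scoring child at each level."""
    path = (path or []) + [node["title"]]
    children = node.get("nodes", [])
    if not children:
        return path, node
    best = max(children, key=lambda child: keyword_score(query, child))
    if keyword_score(query, best) == 0:
        return path, node  # no child looks relevant; stop at this node
    return descend(query, best, path)

tree = {
    "title": "Financial Stability",
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "summary": "Monitoring of financial vulnerabilities ...", "nodes": []},
        {"title": "Domestic and International Cooperation",
         "summary": "Cooperation with other agencies ...", "nodes": []},
    ],
}
path, node = descend("financial vulnerabilities", tree)
print(" > ".join(path))
```

Swapping `keyword_score` for an LLM call that ranks children given the query recovers the reasoning-based behavior described above.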
### Vision-based Vectorless RAG

PageIndex supports OCR-free, vision-only RAG:

```bash
# Process PDF page images directly
python3 run_pageindex.py \
    --pdf_path document.pdf \
    --vision_mode true \
    --model gpt-4o
```

This enables:

- OCR-free processing
- Preserved visual formatting
- Better handling of tables, charts, and diagrams
- Reasoning-native retrieval over page images
## 📚 Integration Patterns

### With LangChain

The sketch below is illustrative only: `PageIndexLoader` is a hypothetical loader name, not a stock LangChain import, so adapt the loading step to however you produce the PageIndex tree in your setup.

```python
from langchain.document_loaders import PageIndexLoader  # hypothetical loader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load the document with PageIndex
loader = PageIndexLoader(pdf_path='document.pdf')
documents = loader.load()

# Create a retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=documents.as_retriever()
)

# Ask questions
answer = qa_chain.run("What are the key financial metrics?")
```

### With LlamaIndex

Again illustrative: `PageIndexReader` is a hypothetical reader, and note that a vector-store index would reintroduce embeddings; for a truly vectorless pipeline you would query the tree directly instead.

```python
from llama_index import PageIndexReader, GPTVectorStoreIndex  # hypothetical reader

# Load with PageIndex
reader = PageIndexReader()
documents = reader.load_data(file_path='document.pdf')

# Create an index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the risk factors")
```
### API Integration

```python
import requests

# Use the hosted PageIndex API (if available; endpoint and options shown are indicative)
response = requests.post(
    'https://api.pageindex.ai/v1/index',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'pdf_url': 'https://example.com/document.pdf',
        'model': 'gpt-4o',
        'options': {
            'max_pages_per_node': 10,
            'add_node_summary': True
        }
    }
)
tree = response.json()
```
## 🛠️ Best Practices

### 1. Optimize the Tree Structure

```bash
# For short documents (< 50 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 8

# For long documents (> 200 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15 \
    --max-tokens-per-node 25000
```

### 2. Choose the Right Model

```bash
# For cost efficiency
--model gpt-4o-mini

# For best accuracy
--model gpt-4o-2024-11-20

# For speed
--model gpt-3.5-turbo
```

### 3. Leverage Node Summaries

```bash
# Enable these for better navigation
--if-add-node-summary yes
--if-add-doc-description yes
```

### 4. Handle Large Documents

```bash
# Increase node capacity
--max-pages-per-node 20
--max-tokens-per-node 30000

# Check more pages for the TOC
--toc-check-pages 30
```

### 5. Write Reasoning-friendly Queries

```text
# Good query
"Navigate through the financial report to find sections discussing
revenue growth in Q4 2023, then extract the specific numbers"

# Vague query (doesn't leverage reasoning)
"Find Q4 revenue"
```
## 🔍 Troubleshooting

### Issue: Tree Structure Too Deep

```bash
# Solution: increase max pages per node
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15
```

### Issue: Missing Table of Contents

```bash
# Solution: check more pages
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --toc-check-pages 30
```

### Issue: Markdown Hierarchy Wrong

Ensure a proper heading structure: use `#` for level 1, `##` for level 2, and so on. Alternatively, convert the document to PDF, which is parsed more reliably.

### Issue: API Rate Limits

Add retry logic with exponential backoff around whatever function drives tree generation (`generate_tree` below is a placeholder for your own call):

```python
import time
from openai import RateLimitError

def generate_tree_with_retry(pdf_path, max_retries=3):
    for attempt in range(max_retries):
        try:
            return generate_tree(pdf_path)  # placeholder for your tree-generation call
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff
            else:
                raise
```
## 📈 Performance Benchmarks

### FinanceBench Results

PageIndex-powered Mafin 2.5: 98.7% accuracy.

Comparison with vector-based RAG:

- Traditional RAG: ~70-85% accuracy
- PageIndex RAG: 98.7% accuracy
- Improvement: roughly 13-28 percentage points

### Processing Speed

- Small PDF (< 50 pages): ~2-3 minutes
- Medium PDF (50-200 pages): ~5-10 minutes
- Large PDF (> 200 pages): ~15-30 minutes

### Cost Efficiency

- Tree generation is a one-time cost per document
- Retrieval uses fewer tokens than passing the full document
- No vector DB hosting costs
## 🌐 Official Resources

- Website: pageindex.ai
- GitHub: github.com/VectifyAI/PageIndex
- Documentation: PageIndex Docs
- Discord: Join Community
- Blog: Technical Articles

### Additional Resources

- Cookbooks: hands-on examples and use cases
- Tutorials: Document Search and Tree Search guides
- MCP Setup: Model Context Protocol integration
- API Docs: API integration details
## 🎯 Summary

PageIndex is ideal for:

- 📊 Financial document analysis (10-K filings, earnings reports, SEC filings)
- ⚖️ Legal document review (contracts, regulations)
- 🎓 Academic research (papers, textbooks)
- 📚 Technical documentation (manuals, specifications)
- 📋 Regulatory compliance (policies, standards)

**Key Advantages:**

- 98.7% accuracy on FinanceBench
- No vector database required
- No chunking, so document structure is preserved
- Reasoning-based retrieval
- Human-like navigation
- Better explainability
- Clear page/section references
- Lower setup complexity

**When to Choose PageIndex:**

- You work with long professional documents
- You need high accuracy and explainability
- You want to preserve document structure
- You prefer reasoning over similarity
- You want to avoid vector DB complexity

**When to Use Vector RAG:**

- Short documents or snippets
- Similarity search is sufficient
- You already have vector infrastructure
- You need sub-second retrieval (cached)

**Bottom Line:** PageIndex marks a shift from similarity-based to reasoning-based RAG. By preserving document hierarchy and using LLM reasoning for navigation, it achieves higher accuracy while eliminating the need for vector databases and chunking, making it well suited to professional document analysis where accuracy and explainability matter.