
PageIndex - Vectorless Reasoning-based RAG

Key Point: Vectorless, reasoning-based RAG that simulates human expert document navigation. No vector DB, no chunking, just hierarchical tree-based retrieval with 98.7% accuracy on FinanceBench.

🚀 Quick Start

Installation

pip3 install --upgrade -r requirements.txt

Set API Key

Create .env file:

CHATGPT_API_KEY=your_openai_key_here

Run PageIndex

# Process PDF document
python3 run_pageindex.py --pdf_path /path/to/document.pdf

# Process Markdown (experimental)
python3 run_pageindex.py --md_path /path/to/document.md

🎯 Core Concepts

The Problem with Vector-Based RAG

Traditional RAG relies on similarity, not relevance:

  • Similarity ≠ Relevance: Semantic similarity often misses the truly relevant content
  • Chunking Breaks Context: Artificial chunks lose document structure
  • Opaque Retrieval: "Vibe retrieval" with no clear reasoning path
  • Poor Explainability: Hard to trace why specific content was retrieved

PageIndex Solution

Reasoning-based retrieval through a tree structure:

  1. Generate Tree Index: Create hierarchical "Table of Contents" structure
  2. Tree Search Retrieval: LLM reasons through tree to find relevant sections
  3. Human-like Navigation: Simulates expert document exploration
  4. Traceable & Explainable: Clear reasoning path with page/section references

📊 Key Features

1. No Vector Database

Uses document structure and LLM reasoning instead of vector similarity search.

2. No Chunking

Documents organized into natural sections, not artificial chunks.

3. Human-like Retrieval

Simulates how human experts navigate and extract knowledge from complex documents.

4. Better Explainability

Retrieval is grounded in explicit reasoning, making it traceable and interpretable with page references.

5. State-of-the-Art Accuracy

98.7% accuracy on FinanceBench benchmark for financial document analysis.

🌲 Tree Structure Format

PageIndex transforms PDF documents into a semantic tree structure:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
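
The node fields above (title, `node_id`, page range, children) make the tree easy to walk with a few lines of standard-library Python. A minimal sketch, using an abridged copy of the sample tree shown above (summaries omitted):

```python
tree = {
    "title": "Financial Stability", "node_id": "0006",
    "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28, "nodes": []},
        {"title": "Domestic and International Cooperation", "node_id": "0008",
         "start_index": 28, "end_index": 31, "nodes": []},
    ],
}

def outline(node, depth=0):
    """Yield one indented 'title (pages a-b)' line per node, depth-first."""
    yield "  " * depth + f"{node['title']} (pages {node['start_index']}-{node['end_index']})"
    for child in node.get("nodes", []):
        yield from outline(child, depth + 1)

lines = list(outline(tree))
print("\n".join(lines))
```

This prints a table-of-contents view, one line per node, with children indented under their parent.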

⚙️ Command Line Options

Basic Usage

# Process PDF
python3 run_pageindex.py --pdf_path document.pdf

# Process Markdown
python3 run_pageindex.py --md_path document.md

Optional Parameters

Model Configuration

# Specify OpenAI model (default: gpt-4o-2024-11-20)
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o-mini

Tree Structure Options

# Pages to check for table of contents (default: 20)
--toc-check-pages 30

# Max pages per node (default: 10)
--max-pages-per-node 15

# Max tokens per node (default: 20000)
--max-tokens-per-node 25000

# Add node ID (yes/no, default: yes)
--if-add-node-id yes

# Add node summary (yes/no, default: yes)
--if-add-node-summary yes

# Add document description (yes/no, default: yes)
--if-add-doc-description yes

Complete Example

python3 run_pageindex.py \
    --pdf_path financial_report.pdf \
    --model gpt-4o-2024-11-20 \
    --toc-check-pages 25 \
    --max-pages-per-node 12 \
    --max-tokens-per-node 22000 \
    --if-add-node-id yes \
    --if-add-node-summary yes \
    --if-add-doc-description yes

💡 Common Use Cases

1. Financial Document Analysis

# Analyze SEC filings, earnings reports, 10-K/10-Q forms
python3 run_pageindex.py --pdf_path sec_10k.pdf

# Query the generated tree
# "What were the key financial risks disclosed in the 10-K?"

2. Legal Document Review

# Process legal contracts, regulations, case files
python3 run_pageindex.py --pdf_path legal_contract.pdf

# Enable reasoning-based retrieval
# "Find all clauses related to liability and indemnification"

3. Academic Research

# Process research papers, textbooks, dissertations
python3 run_pageindex.py --pdf_path research_paper.pdf

# Navigate complex academic content
# "Explain the methodology section and key findings"

4. Technical Documentation

# Process API docs, technical manuals, specifications
python3 run_pageindex.py --pdf_path technical_manual.pdf

# Retrieve specific technical details
# "What are the system requirements and setup procedures?"

5. Regulatory Compliance

# Process compliance documents, policy manuals
python3 run_pageindex.py --pdf_path compliance_policy.pdf

# Find relevant compliance requirements
# "List all data privacy requirements and GDPR clauses"

🔧 Integration Examples

Using PageIndex Output

After running PageIndex, you get a tree structure JSON:

import json

# Load PageIndex output
with open('pageindex_output.json', 'r') as f:
    tree = json.load(f)

# Navigate tree structure
def find_section(tree, keyword):
    """Find sections whose title contains the keyword"""
    results = []

    def search(node):
        if keyword.lower() in node.get('title', '').lower():
            results.append({
                'title': node['title'],
                'node_id': node['node_id'],
                'pages': f"{node['start_index']}-{node['end_index']}",
                'summary': node.get('summary', '')
            })
        for child in node.get('nodes', []):
            search(child)

    search(tree)
    return results

# Find financial risk sections
risk_sections = find_section(tree, 'risk')
print(json.dumps(risk_sections, indent=2))

Reasoning-based Retrieval

from openai import OpenAI
import json

client = OpenAI()

# Load PageIndex tree
with open('pageindex_output.json', 'r') as f:
tree = json.load(f)

def reason_through_tree(query, tree):
"""Use LLM to reason through tree for relevant sections"""

prompt = f"""
Given this document tree structure:
{json.dumps(tree, indent=2)}

Query: {query}

Reason through the tree to find the most relevant sections.
Provide:
1. The reasoning path you took
2. The relevant node IDs
3. The page ranges to examine
"""

response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a document navigation expert."},
{"role": "user", "content": prompt}
]
)

return response.choices[0].message.content

# Example usage
query = "What are the key financial risks mentioned?"
reasoning = reason_through_tree(query, tree)
print(reasoning)

Building RAG System

from openai import OpenAI
import json

client = OpenAI()

class PageIndexRAG:
    def __init__(self, tree_path, pdf_path):
        with open(tree_path, 'r') as f:
            self.tree = json.load(f)
        self.pdf_path = pdf_path

    def retrieve(self, query):
        """Retrieve relevant content based on query"""

        # Step 1: Reason through tree
        prompt = f"""
        Document tree: {json.dumps(self.tree, indent=2)}
        Query: {query}

        Navigate the tree to find relevant sections.
        Return node IDs and page ranges.
        """

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )

        # Step 2: Extract content from identified pages
        # (Implementation depends on PDF library)

        return response.choices[0].message.content

    def answer(self, query):
        """Answer query using retrieved content"""

        # Retrieve relevant content
        context = self.retrieve(query)

        # Generate answer
        prompt = f"""
        Context: {context}
        Query: {query}

        Provide a detailed answer based on the context.
        """

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )

        return response.choices[0].message.content

# Usage
rag = PageIndexRAG('pageindex_output.json', 'document.pdf')
answer = rag.answer("What were the Q4 earnings results?")
print(answer)
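
Step 2 of `retrieve` is left open above because page extraction depends on your PDF library. One stdlib-only piece you need regardless is mapping the node IDs the LLM selects back to page ranges. A minimal sketch (`pages_for_nodes` is an illustrative helper, not part of PageIndex):

```python
def pages_for_nodes(tree, node_ids):
    """Map selected node_ids to their (start_index, end_index) page ranges."""
    ranges = {}

    def walk(node):
        if node.get("node_id") in node_ids:
            ranges[node["node_id"]] = (node["start_index"], node["end_index"])
        for child in node.get("nodes", []):
            walk(child)

    walk(tree)
    return ranges

# Example: nodes 0007/0008 as in the tree format shown earlier
sample = {
    "node_id": "0006", "start_index": 21, "end_index": 22,
    "nodes": [
        {"node_id": "0007", "start_index": 22, "end_index": 28, "nodes": []},
        {"node_id": "0008", "start_index": 28, "end_index": 31, "nodes": []},
    ],
}
selected = pages_for_nodes(sample, {"0007", "0008"})
```

The returned ranges can then be handed to any PDF library (e.g. pypdf) to pull the actual page text for the answer prompt.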

📊 PageIndex vs Vector RAG

| Feature | PageIndex | Vector-based RAG |
|---|---|---|
| Retrieval Method | Reasoning-based tree search | Semantic similarity |
| Document Structure | Natural sections preserved | Artificial chunks |
| Vector Database | ❌ Not needed | ✅ Required |
| Explainability | ✅ Clear reasoning path | ❌ Opaque similarity |
| Context Preservation | ✅ Hierarchical structure | ❌ Lost in chunks |
| Accuracy (FinanceBench) | 98.7% | Lower |
| Setup Complexity | Low (just LLM) | High (vector DB + embeddings) |
| Traceability | ✅ Page/section references | ⚠️ Chunk IDs only |

🎓 Advanced Features

Document Search Strategy

PageIndex enables two retrieval modes:

  1. Document Search: Find relevant documents in a collection
  2. Tree Search: Navigate within a document's tree structure

# Document Search
def search_documents(query, document_trees):
    """Search across multiple documents"""
    relevant_docs = []

    for doc_name, tree in document_trees.items():
        # Use LLM to assess document relevance
        prompt = f"""
        Document: {doc_name}
        Tree structure: {json.dumps(tree, indent=2)[:500]}...
        Query: {query}

        Is this document relevant? (yes/no)
        If yes, which sections?
        """

        # Send the prompt to the LLM and collect relevant docs
        # ...

    return relevant_docs

# Tree Search within selected document
def tree_search(query, tree):
    """Navigate tree structure for specific content"""
    # Reasoning-based navigation through tree
    # ...
    pass

Vision-based Vectorless RAG

PageIndex supports OCR-free, vision-only RAG:

# Process PDF page images directly
python3 run_pageindex.py \
    --pdf_path document.pdf \
    --vision_mode true \
    --model gpt-4o

This enables:

  • OCR-free processing
  • Preserved visual formatting
  • Better handling of tables, charts, and diagrams
  • Reasoning-native retrieval over page images

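
To feed page images to a vision-capable model, each page is typically encoded as a base64 data URL inside the chat message. A minimal sketch of the encoding step, using only the standard library; the message layout follows the OpenAI vision chat format, and the PNG bytes here are a stand-in for a rendered page (the model call itself is omitted):

```python
import base64

def image_to_data_url(png_bytes):
    """Encode raw PNG bytes as a data URL accepted by vision-capable chat models."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"

url = image_to_data_url(b"\x89PNG\r\n\x1a\n")  # PNG magic bytes as a stand-in

# The data URL slots into an OpenAI-style vision message:
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this report page say about revenue?"},
        {"type": "image_url", "image_url": {"url": url}},
    ],
}
```

In a real pipeline you would render each PDF page to PNG (e.g. with a rasterizing library), encode it this way, and send one message per page or page batch.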
📚 Integration Patterns

With LangChain

# Note: PageIndexLoader is shown for illustration — check the PageIndex
# docs for the current LangChain integration and retriever interface.
from langchain.document_loaders import PageIndexLoader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load document with PageIndex
loader = PageIndexLoader(pdf_path='document.pdf')
documents = loader.load()

# Create retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=documents.as_retriever()
)

# Ask questions
answer = qa_chain.run("What are the key financial metrics?")

With LlamaIndex

# Note: PageIndexReader is shown for illustration — check the PageIndex
# docs for the current LlamaIndex integration.
from llama_index import PageIndexReader, GPTVectorStoreIndex

# Load with PageIndex
reader = PageIndexReader()
documents = reader.load_data(file_path='document.pdf')

# Create index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the risk factors")

API Integration

import requests

# Use PageIndex API (if available)
response = requests.post(
    'https://api.pageindex.ai/v1/index',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'pdf_url': 'https://example.com/document.pdf',
        'model': 'gpt-4o',
        'options': {
            'max_pages_per_node': 10,
            'add_node_summary': True
        }
    }
)

tree = response.json()

🛠️ Best Practices

1. Optimize Tree Structure

# For short documents (< 50 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 8

# For long documents (> 200 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15 \
    --max-tokens-per-node 25000

2. Choose Right Model

# For cost efficiency
--model gpt-4o-mini

# For best accuracy
--model gpt-4o-2024-11-20

# For speed
--model gpt-3.5-turbo

3. Leverage Node Summaries

# Always enable for better navigation
--if-add-node-summary yes
--if-add-doc-description yes

4. Handle Large Documents

# Increase node capacity
--max-pages-per-node 20
--max-tokens-per-node 30000

# Check more pages for TOC
--toc-check-pages 30

5. Reasoning Prompts

# Good prompt
"Navigate through the financial report to find sections discussing
revenue growth in Q4 2023, then extract specific numbers"

# Bad prompt
"Find Q4 revenue" # Too vague, doesn't leverage reasoning

🔍 Troubleshooting

Issue: Tree Structure Too Deep

# Solution: Increase max pages per node
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15

Issue: Missing Table of Contents

# Solution: Check more pages
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --toc-check-pages 30

Issue: Markdown Hierarchy Wrong

# Solution: Ensure proper heading structure
# Use "#" for level 1, "##" for level 2, etc.
# Or use PDF instead for better accuracy

Issue: API Rate Limits

# Solution: Add retry logic
import time
from openai import RateLimitError

# generate_tree stands in for your PageIndex tree-building call
def generate_tree_with_retry(pdf_path, max_retries=3):
    for attempt in range(max_retries):
        try:
            return generate_tree(pdf_path)
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

📈 Performance Benchmarks

FinanceBench Results

PageIndex-powered Mafin 2.5: 98.7% accuracy

Comparison with vector-based RAG:

  • Traditional RAG: ~70-85% accuracy
  • PageIndex RAG: 98.7% accuracy
  • Improvement: roughly 14-29 percentage points over the ~70-85% baseline

Processing Speed

  • Small PDF (< 50 pages): ~2-3 minutes
  • Medium PDF (50-200 pages): ~5-10 minutes
  • Large PDF (> 200 pages): ~15-30 minutes

Cost Efficiency

  • Tree generation: One-time cost
  • Retrieval: Lower token usage than full document
  • No vector DB hosting costs

🌐 Official Resources

  • Cookbooks: Hands-on examples and use cases
  • Tutorials: Document Search and Tree Search guides
  • MCP Setup: Model Context Protocol integration
  • API Docs: API integration details

🎯 Summary

PageIndex is ideal for:

  • 📊 Financial document analysis (10-K, earnings, SEC filings)
  • ⚖️ Legal document review (contracts, regulations)
  • 🎓 Academic research (papers, textbooks)
  • 📚 Technical documentation (manuals, specifications)
  • 📋 Regulatory compliance (policies, standards)

Key Advantages:

  • 98.7% accuracy on FinanceBench
  • No vector database required
  • No chunking - preserves structure
  • Reasoning-based retrieval
  • Human-like navigation
  • Better explainability
  • Clear page/section references
  • Lower setup complexity

When to Choose PageIndex:

  • Working with long professional documents
  • Need high accuracy and explainability
  • Want to preserve document structure
  • Prefer reasoning over similarity
  • Avoid vector DB complexity

When to Use Vector RAG:

  • Short documents or snippets
  • Similarity is sufficient
  • Already have vector infrastructure
  • Need sub-second retrieval (cached)

Bottom Line: PageIndex represents a paradigm shift from similarity-based to reasoning-based RAG. By preserving document hierarchy and using LLM reasoning for navigation, it achieves superior accuracy while eliminating the need for vector databases and chunking. Essential for professional document analysis requiring high accuracy and explainability.