# PageIndex - Vectorless Reasoning-based RAG

**Key Point:** Vectorless, reasoning-based RAG that simulates how a human expert navigates a document. No vector DB, no chunking: just hierarchical tree-based retrieval, reaching 98.7% accuracy on FinanceBench.
## 🚀 Quick Start

### Installation

```bash
pip3 install --upgrade -r requirements.txt
```

### Set API Key

Create a `.env` file:

```
CHATGPT_API_KEY=your_openai_key_here
```
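`run_pageindex.py` reads this key from the environment. If you want to load the `.env` file in your own scripts without adding a dependency, a minimal stdlib parser like the sketch below works (the `python-dotenv` package does the same thing more robustly; this is only an illustration):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ, skipping comments/blanks."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
api_key = os.environ.get("CHATGPT_API_KEY")
```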
### Run PageIndex

```bash
# Process a PDF document
python3 run_pageindex.py --pdf_path /path/to/document.pdf

# Process Markdown (experimental)
python3 run_pageindex.py --md_path /path/to/document.md
```
## 🎯 Core Concepts

### The Problem with Vector-Based RAG

Traditional RAG relies on similarity, not relevance:

- ❌ **Similarity ≠ Relevance**: semantic similarity often misses the truly relevant content
- ❌ **Chunking breaks context**: artificial chunks lose document structure
- ❌ **Opaque retrieval**: "vibe retrieval" with no clear reasoning path
- ❌ **Poor explainability**: hard to trace why specific content was retrieved

### The PageIndex Solution

Reasoning-based retrieval through a tree structure:

- ✅ **Generate a tree index**: create a hierarchical "table of contents" structure
- ✅ **Tree-search retrieval**: an LLM reasons through the tree to find relevant sections
- ✅ **Human-like navigation**: simulates expert document exploration
- ✅ **Traceable & explainable**: clear reasoning path with page/section references
## 📊 Key Features

1. **No vector database**: uses document structure and LLM reasoning instead of vector similarity search.
2. **No chunking**: documents are organized into their natural sections, not artificial chunks.
3. **Human-like retrieval**: simulates how human experts navigate and extract knowledge from complex documents.
4. **Better explainability**: retrieval is based on reasoning, so it is traceable and interpretable, with page references.
5. **State-of-the-art accuracy**: 98.7% accuracy on the FinanceBench benchmark for financial document analysis.
## 🌲 Tree Structure Format

PageIndex transforms a PDF document into a semantic tree structure:

```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
```
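A tree like the one above can be walked with a few lines of Python. This sketch (field names taken from the structure shown) returns an indented outline with page ranges, one line per node:

```python
def outline(node, depth=0):
    """Return one 'title (pages start-end)' line per node, indented two spaces per level."""
    lines = [("  " * depth) + f"{node['title']} (pages {node['start_index']}-{node['end_index']})"]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines

# The sample tree from above, abbreviated
tree = {
    "title": "Financial Stability", "start_index": 21, "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation", "start_index": 28, "end_index": 31},
    ],
}
print("\n".join(outline(tree)))
```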
## ⚙️ Command Line Options

### Basic Usage

```bash
# Process a PDF
python3 run_pageindex.py --pdf_path document.pdf

# Process Markdown
python3 run_pageindex.py --md_path document.md
```

### Optional Parameters

#### Model Configuration

```bash
# Specify the OpenAI model (default: gpt-4o-2024-11-20)
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o-mini
```

#### Tree Structure Options

```bash
# Pages to check for a table of contents (default: 20)
--toc-check-pages 30

# Max pages per node (default: 10)
--max-pages-per-node 15

# Max tokens per node (default: 20000)
--max-tokens-per-node 25000

# Add node IDs (yes/no, default: yes)
--if-add-node-id yes

# Add node summaries (yes/no, default: yes)
--if-add-node-summary yes

# Add a document description (yes/no, default: yes)
--if-add-doc-description yes
```

### Complete Example

```bash
python3 run_pageindex.py \
    --pdf_path financial_report.pdf \
    --model gpt-4o-2024-11-20 \
    --toc-check-pages 25 \
    --max-pages-per-node 12 \
    --max-tokens-per-node 22000 \
    --if-add-node-id yes \
    --if-add-node-summary yes \
    --if-add-doc-description yes
```
## 💡 Common Use Cases

### 1. Financial Document Analysis

```bash
# Analyze SEC filings, earnings reports, 10-K/10-Q forms
python3 run_pageindex.py --pdf_path sec_10k.pdf

# Then query the generated tree, e.g.:
# "What were the key financial risks disclosed in the 10-K?"
```

### 2. Legal Document Review

```bash
# Process legal contracts, regulations, case files
python3 run_pageindex.py --pdf_path legal_contract.pdf

# Reasoning-based retrieval, e.g.:
# "Find all clauses related to liability and indemnification"
```

### 3. Academic Research

```bash
# Process research papers, textbooks, dissertations
python3 run_pageindex.py --pdf_path research_paper.pdf

# Navigate complex academic content, e.g.:
# "Explain the methodology section and key findings"
```

### 4. Technical Documentation

```bash
# Process API docs, technical manuals, specifications
python3 run_pageindex.py --pdf_path technical_manual.pdf

# Retrieve specific technical details, e.g.:
# "What are the system requirements and setup procedures?"
```

### 5. Regulatory Compliance

```bash
# Process compliance documents, policy manuals
python3 run_pageindex.py --pdf_path compliance_policy.pdf

# Find relevant compliance requirements, e.g.:
# "List all data privacy requirements and GDPR clauses"
```
## 🔧 Integration Examples

### Using PageIndex Output

After running PageIndex, you get a tree-structure JSON:

```python
import json

# Load the PageIndex output
with open('pageindex_output.json', 'r') as f:
    tree = json.load(f)

def find_section(tree, keyword):
    """Find sections whose title contains the keyword."""
    results = []

    def search(node):
        if keyword.lower() in node.get('title', '').lower():
            results.append({
                'title': node['title'],
                'node_id': node['node_id'],
                'pages': f"{node['start_index']}-{node['end_index']}",
                'summary': node.get('summary', '')
            })
        for child in node.get('nodes', []):
            search(child)

    search(tree)
    return results

# Find financial-risk sections
risk_sections = find_section(tree, 'risk')
print(json.dumps(risk_sections, indent=2))
```
### Reasoning-based Retrieval

```python
from openai import OpenAI
import json

client = OpenAI()

# Load the PageIndex tree
with open('pageindex_output.json', 'r') as f:
    tree = json.load(f)

def reason_through_tree(query, tree):
    """Use an LLM to reason through the tree and find relevant sections."""
    prompt = f"""
    Given this document tree structure:
    {json.dumps(tree, indent=2)}

    Query: {query}

    Reason through the tree to find the most relevant sections.
    Provide:
    1. The reasoning path you took
    2. The relevant node IDs
    3. The page ranges to examine
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a document navigation expert."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

# Example usage
query = "What are the key financial risks mentioned?"
reasoning = reason_through_tree(query, tree)
print(reasoning)
```
### Building a RAG System

```python
from openai import OpenAI
import json

client = OpenAI()

class PageIndexRAG:
    def __init__(self, tree_path, pdf_path):
        with open(tree_path, 'r') as f:
            self.tree = json.load(f)
        self.pdf_path = pdf_path

    def retrieve(self, query):
        """Retrieve relevant content based on the query."""
        # Step 1: reason through the tree
        prompt = f"""
        Document tree: {json.dumps(self.tree, indent=2)}
        Query: {query}
        Navigate the tree to find relevant sections.
        Return node IDs and page ranges.
        """
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        # Step 2: extract content from the identified pages
        # (implementation depends on your PDF library)
        return response.choices[0].message.content

    def answer(self, query):
        """Answer a query using the retrieved content."""
        # Retrieve relevant content
        context = self.retrieve(query)
        # Generate the answer
        prompt = f"""
        Context: {context}
        Query: {query}
        Provide a detailed answer based on the context.
        """
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

# Usage
rag = PageIndexRAG('pageindex_output.json', 'document.pdf')
answer = rag.answer("What were the Q4 earnings results?")
print(answer)
```
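Step 2 of `retrieve` is left open above. As one illustration, assuming you have already extracted per-page text with whatever PDF library you use, mapping a node's page range onto that list is a one-liner; the `pages_text` list and the 1-based, inclusive range convention here are assumptions for the sketch:

```python
def extract_range(pages_text, start_index, end_index):
    """Concatenate the text of pages start_index..end_index (1-based, inclusive)."""
    return "\n".join(pages_text[start_index - 1:end_index])

# Hypothetical per-page text, e.g. produced by a PDF library
pages_text = ["page one text", "page two text", "page three text"]
print(extract_range(pages_text, 2, 3))
```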
## 📊 PageIndex vs Vector RAG
| Feature | PageIndex | Vector-based RAG |
|---|---|---|
| Retrieval Method | Reasoning-based tree search | Semantic similarity |
| Document Structure | Natural sections preserved | Artificial chunks |
| Vector Database | ❌ Not needed | ✅ Required |
| Explainability | ✅ Clear reasoning path | ❌ Opaque similarity |
| Context Preservation | ✅ Hierarchical structure | ❌ Lost in chunks |
| Accuracy (FinanceBench) | 98.7% | Lower |
| Setup Complexity | Low (just LLM) | High (vector DB + embeddings) |
| Traceability | ✅ Page/section references | ⚠️ Chunk IDs only |
## 🎓 Advanced Features

### Document Search Strategy

PageIndex enables two retrieval modes:

- **Document Search**: find relevant documents in a collection
- **Tree Search**: navigate within a single document's tree structure

```python
import json

def search_documents(query, document_trees):
    """Search across multiple documents (sketch: the LLM call is omitted)."""
    relevant_docs = []
    for doc_name, tree in document_trees.items():
        # Use an LLM to assess document relevance
        prompt = f"""
        Document: {doc_name}
        Tree structure: {json.dumps(tree, indent=2)[:500]}...
        Query: {query}
        Is this document relevant? (yes/no)
        If yes, which sections?
        """
        # Send `prompt` to an LLM, parse the answer, and collect relevant docs
        # ...
    return relevant_docs

# Tree search within a selected document
def tree_search(query, tree):
    """Navigate the tree structure for specific content."""
    # Reasoning-based navigation through the tree (see the retrieval example above)
    # ...
    pass
```
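For a runnable illustration of the descent itself, here is a minimal sketch in which a crude keyword-overlap score stands in for the LLM relevance judgment (in PageIndex proper, an LLM makes that call). It walks down one level at a time, always following the best-scoring child, and stops when no child looks relevant:

```python
def keyword_score(query, node):
    """Stand-in for an LLM relevance call: count query words found in title + summary."""
    text = (node.get("title", "") + " " + node.get("summary", "")).lower()
    return sum(1 for word in query.lower().split() if word in text)

def descend(query, node, path=None):
    """Walk down the tree, following the best-scoring child at each level."""
    path = (path or []) + [node["title"]]
    children = node.get("nodes", [])
    if not children:
        return path, node
    best = max(children, key=lambda child: keyword_score(query, child))
    if keyword_score(query, best) == 0:
        return path, node  # no child looks relevant; stop at this node
    return descend(query, best, path)

tree = {
    "title": "Financial Stability",
    "summary": "The Federal Reserve ...",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "summary": "Monitoring of financial vulnerabilities ...", "nodes": []},
        {"title": "Domestic and International Cooperation",
         "summary": "Cooperation with other agencies ...", "nodes": []},
    ],
}
path, node = descend("financial vulnerabilities", tree)
print(" > ".join(path))
```

Swapping `keyword_score` for an LLM call that ranks children given the query recovers the reasoning-based behavior described above.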
### Vision-based Vectorless RAG

PageIndex supports OCR-free, vision-only RAG:

```bash
# Process PDF page images directly
python3 run_pageindex.py \
    --pdf_path document.pdf \
    --vision_mode true \
    --model gpt-4o
```

This enables:

- OCR-free processing
- Preserved visual formatting
- Better handling of tables, charts, and diagrams
- Reasoning-native retrieval over page images
## 📚 Integration Patterns

### With LangChain

The sketch below is illustrative only: `PageIndexLoader` is a hypothetical loader name, not a stock LangChain import, so adapt the loading step to however you produce the PageIndex tree in your setup.

```python
from langchain.document_loaders import PageIndexLoader  # hypothetical loader
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load the document with PageIndex
loader = PageIndexLoader(pdf_path='document.pdf')
documents = loader.load()

# Create a retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=documents.as_retriever()
)

# Ask questions
answer = qa_chain.run("What are the key financial metrics?")
```

### With LlamaIndex

Again illustrative: `PageIndexReader` is a hypothetical reader, and note that a vector-store index would reintroduce embeddings; for a truly vectorless pipeline you would query the tree directly instead.

```python
from llama_index import PageIndexReader, GPTVectorStoreIndex  # hypothetical reader

# Load with PageIndex
reader = PageIndexReader()
documents = reader.load_data(file_path='document.pdf')

# Create an index
index = GPTVectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the risk factors")
```
### API Integration

```python
import requests

# Use the hosted PageIndex API (if available; endpoint and options shown are indicative)
response = requests.post(
    'https://api.pageindex.ai/v1/index',
    headers={'Authorization': 'Bearer YOUR_API_KEY'},
    json={
        'pdf_url': 'https://example.com/document.pdf',
        'model': 'gpt-4o',
        'options': {
            'max_pages_per_node': 10,
            'add_node_summary': True
        }
    }
)
tree = response.json()
```
## 🛠️ Best Practices

### 1. Optimize the Tree Structure

```bash
# For short documents (< 50 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 8

# For long documents (> 200 pages)
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15 \
    --max-tokens-per-node 25000
```

### 2. Choose the Right Model

```bash
# For cost efficiency
--model gpt-4o-mini

# For best accuracy
--model gpt-4o-2024-11-20

# For speed
--model gpt-3.5-turbo
```

### 3. Leverage Node Summaries

```bash
# Enable these for better navigation
--if-add-node-summary yes
--if-add-doc-description yes
```

### 4. Handle Large Documents

```bash
# Increase node capacity
--max-pages-per-node 20
--max-tokens-per-node 30000

# Check more pages for the TOC
--toc-check-pages 30
```

### 5. Write Reasoning-friendly Queries

```text
# Good query
"Navigate through the financial report to find sections discussing
revenue growth in Q4 2023, then extract the specific numbers"

# Vague query (doesn't leverage reasoning)
"Find Q4 revenue"
```
## 🔍 Troubleshooting

### Issue: Tree Structure Too Deep

```bash
# Solution: increase max pages per node
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --max-pages-per-node 15
```

### Issue: Missing Table of Contents

```bash
# Solution: check more pages
python3 run_pageindex.py \
    --pdf_path doc.pdf \
    --toc-check-pages 30
```

### Issue: Markdown Hierarchy Wrong

Ensure a proper heading structure: use `#` for level 1, `##` for level 2, and so on. Alternatively, convert the document to PDF, which is parsed more reliably.

### Issue: API Rate Limits

Add retry logic with exponential backoff around whatever function drives tree generation (`generate_tree` below is a placeholder for your own call):

```python
import time
from openai import RateLimitError

def generate_tree_with_retry(pdf_path, max_retries=3):
    for attempt in range(max_retries):
        try:
            return generate_tree(pdf_path)  # placeholder for your tree-generation call
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff
            else:
                raise
```
## 📈 Performance Benchmarks

### FinanceBench Results

PageIndex-powered Mafin 2.5: 98.7% accuracy.

Comparison with vector-based RAG:

- Traditional RAG: ~70-85% accuracy
- PageIndex RAG: 98.7% accuracy
- Improvement: roughly 13-28 percentage points

### Processing Speed

- Small PDF (< 50 pages): ~2-3 minutes
- Medium PDF (50-200 pages): ~5-10 minutes
- Large PDF (> 200 pages): ~15-30 minutes

### Cost Efficiency

- Tree generation is a one-time cost per document
- Retrieval uses fewer tokens than passing the full document
- No vector DB hosting costs
## 🌐 Official Resources

- Website: pageindex.ai
- GitHub: github.com/VectifyAI/PageIndex
- Documentation: PageIndex Docs
- Discord: Join Community
- Blog: Technical Articles

### Additional Resources

- Cookbooks: hands-on examples and use cases
- Tutorials: Document Search and Tree Search guides
- MCP Setup: Model Context Protocol integration
- API Docs: API integration details
## 🎯 Summary

PageIndex is ideal for:

- 📊 Financial document analysis (10-K filings, earnings reports, SEC filings)
- ⚖️ Legal document review (contracts, regulations)
- 🎓 Academic research (papers, textbooks)
- 📚 Technical documentation (manuals, specifications)
- 📋 Regulatory compliance (policies, standards)

**Key Advantages:**

- 98.7% accuracy on FinanceBench
- No vector database required
- No chunking, so document structure is preserved
- Reasoning-based retrieval
- Human-like navigation
- Better explainability
- Clear page/section references
- Lower setup complexity

**When to Choose PageIndex:**

- You work with long professional documents
- You need high accuracy and explainability
- You want to preserve document structure
- You prefer reasoning over similarity
- You want to avoid vector DB complexity

**When to Use Vector RAG:**

- Short documents or snippets
- Similarity search is sufficient
- You already have vector infrastructure
- You need sub-second retrieval (cached)

**Bottom Line:** PageIndex marks a shift from similarity-based to reasoning-based RAG. By preserving document hierarchy and using LLM reasoning for navigation, it achieves higher accuracy while eliminating the need for vector databases and chunking, making it well suited to professional document analysis where accuracy and explainability matter.