# Smart Firecrawl API

A comprehensive web scraping and crawling API service built with Node.js, TypeScript, and Express. It provides web scraping with multiple content extraction methods, AI-powered features, and scalable queue-based processing.
## 🏗️ Project Structure

```
smart-firecrawl-api/
├── src/                          # Main source code
│   ├── controllers/              # API controllers
│   │   ├── v0/                   # Legacy API endpoints
│   │   └── v1/                   # Current API endpoints
│   ├── lib/                      # Core libraries and utilities
│   │   ├── extract/              # Content extraction modules
│   │   ├── deep-research/        # AI-powered research features
│   │   └── generate-llmstxt/     # LLM text generation
│   ├── routes/                   # Express route definitions
│   ├── scraper/                  # Web scraping engines
│   │   ├── scrapeURL/            # URL scraping logic
│   │   └── WebScraper/           # Web scraper implementations
│   ├── services/                 # Background services
│   │   ├── billing/              # Credit and billing management
│   │   ├── queue-service.ts      # Queue management
│   │   └── redis.ts              # Redis configuration
│   ├── search/                   # Search functionality
│   └── types/                    # TypeScript type definitions
├── sharedLibs/                   # Native libraries
│   ├── crawler/                  # Rust-based crawler
│   ├── go-html-to-md/            # Go-based HTML to Markdown
│   ├── html-transformer/         # Rust HTML transformer
│   └── pdf-parser/               # Rust PDF parser
├── chrome-extension/             # Browser extension
├── firecrawl-extension/          # Firecrawl extension
├── page-discovery-extension/     # Page discovery extension
├── dist/                         # Compiled JavaScript output
├── node_modules/                 # Node.js dependencies
└── tests/                        # Test files
```
## 🚀 Getting Started

### Prerequisites

- Node.js 22+
- pnpm (recommended) or npm
- Redis server
- Docker (for containerized deployment)
### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd smart-firecrawl-api
   ```

2. **Install dependencies**

   ```bash
   pnpm install
   # or
   npm install
   ```

3. **Set up environment variables**

   Create a `.env` file with the following variables:

   ```bash
   # Database
   DATABASE_URL=your_database_url
   REDIS_URL=redis://localhost:6379

   # Authentication
   JWT_SECRET=your_jwt_secret

   # External Services
   SENTRY_DSN=your_sentry_dsn
   SLACK_WEBHOOK_URL=your_slack_webhook

   # AI Services
   OPENAI_API_KEY=your_openai_key
   ANTHROPIC_API_KEY=your_anthropic_key

   # Storage
   GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
   ```

4. **Start the Redis server**

   ```bash
   # Using Docker
   docker run -d -p 6379:6379 redis:latest
   ```

   (Note: the repository's `npm run mongo-docker` script starts a MongoDB container, not Redis.)
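With the `.env` template above in place, a startup check that fails fast on missing configuration can save debugging time. A minimal sketch in TypeScript — the `missingEnvVars` helper and the variable subset below are illustrative, not part of the project:

```typescript
// Return the names of required variables that are missing or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name] || env[name]!.trim() === "");
}

// Example: a partially filled environment.
const sample = { DATABASE_URL: "postgres://...", REDIS_URL: "" };
console.log(missingEnvVars(sample, ["DATABASE_URL", "REDIS_URL", "JWT_SECRET"]));
// → ["REDIS_URL", "JWT_SECRET"]
```

At startup, passing `process.env` and exiting when the returned array is non-empty gives a clear failure mode instead of obscure runtime errors later.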
### Development

1. **Start the development server**

   ```bash
   npm run start:dev
   # or
   pnpm start:dev
   ```

2. **Start the queue workers** (in separate terminals)

   ```bash
   # Main queue worker
   npm run workers

   # Index worker
   npm run index-worker
   ```

3. **Run tests**

   ```bash
   npm test
   ```
### Production

1. **Build the project**

   ```bash
   npm run build
   ```

2. **Start the production server**

   ```bash
   npm run start:production
   ```

3. **Start the production workers**

   ```bash
   npm run worker:production
   npm run index-worker:production
   ```
## 🔧 Core Functionality

### API Endpoints

#### Scraping

- `POST /v1/scrape` - Scrape a single URL
- `GET /v1/scrape/:jobId` - Get scrape job status

#### Crawling

- `POST /v1/crawl` - Start a website crawl
- `GET /v1/crawl/:jobId` - Get crawl job status
- `DELETE /v1/crawl/:jobId` - Cancel a crawl job

#### Batch Operations

- `POST /v1/batch/scrape` - Batch scrape multiple URLs
- `GET /v1/batch/scrape/:jobId` - Get batch job status

#### AI Features

- `POST /v1/extract` - Extract structured data using AI
- `POST /v1/deep-research` - Perform deep research on topics
- `POST /v1/llmstxt` - Generate LLM-optimized text

#### Search

- `POST /v1/search` - Search the web
- `POST /v1/map` - Map website structure
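As a usage sketch, a `POST /v1/scrape` call could be assembled as below. The `ScrapeRequest` fields and the `buildScrapeCall` helper are illustrative assumptions; consult the controllers in `src/controllers/v1/` for the parameters the deployed version actually accepts.

```typescript
// Hypothetical request shape; field names mirror common Firecrawl-style
// options and may differ from this API's actual schema.
interface ScrapeRequest {
  url: string;
  formats?: string[]; // e.g. ["markdown", "html"]
}

// Assemble the endpoint URL and fetch init for POST /v1/scrape.
function buildScrapeCall(baseUrl: string, req: ScrapeRequest, token: string) {
  return {
    endpoint: `${baseUrl}/v1/scrape`,
    init: {
      method: "POST" as const,
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(req),
    },
  };
}

const call = buildScrapeCall(
  "http://localhost:3002",
  { url: "https://example.com", formats: ["markdown"] },
  "YOUR_API_KEY",
);
// fetch(call.endpoint, call.init).then((res) => res.json());
```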
### Key Features

- **Multi-Engine Scraping**: Supports various scraping engines including Playwright, Puppeteer, and custom implementations
- **AI-Powered Extraction**:
  - Structured data extraction using LLMs
  - Content summarization and analysis
  - Deep research capabilities
- **Queue-Based Processing**:
  - BullMQ for job management
  - Redis for caching and coordination
  - Priority-based job processing
- **Content Processing**:
  - HTML to Markdown conversion
  - PDF parsing and extraction
  - Image and media handling
- **Rate Limiting & Authentication**:
  - Team-based authentication
  - Credit-based billing system
  - Per-endpoint rate limiting
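To illustrate the rate-limiting idea, here is a minimal in-memory token-bucket sketch. The real service applies per-team, per-endpoint limits backed by Redis; this class is only a teaching aid, not the project's implementation:

```typescript
// Token bucket: `capacity` tokens, refilled at `refillPerSec` tokens/second.
// Each accepted request consumes one token; bursts drain the bucket.
class TokenBucket {
  private tokens: number;
  constructor(
    private capacity: number,
    private refillPerSec: number,
    private lastRefill = 0,
  ) {
    this.tokens = capacity;
  }

  // `now` is a timestamp in seconds, injected to keep the sketch testable.
  tryRemove(now: number): boolean {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Injecting `now` rather than calling `Date.now()` internally makes the limiter deterministic under test; a Redis-backed variant would store `tokens` and `lastRefill` per team key.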
## 📚 Libraries Used

### Core Dependencies

- **Express.js** - Web framework
- **TypeScript** - Type-safe JavaScript
- **BullMQ** - Queue management
- **Redis** - Caching and job storage
- **Winston** - Logging

### AI & ML Libraries

- **@ai-sdk/anthropic** - Anthropic Claude integration
- **@ai-sdk/openai** - OpenAI integration
- **@ai-sdk/google** - Google AI integration
- **@ai-sdk/groq** - Groq integration

### Web Scraping

- **Playwright** - Browser automation
- **Cheerio** - Server-side jQuery-like HTML parsing
- **Axios** - HTTP client
- **Undici** - High-performance HTTP client

### Content Processing

- **Turndown** - HTML to Markdown conversion
- **pdf-parse** - PDF text extraction
- **Mammoth** - Word document processing
- **JSDOM** - DOM manipulation

### Native Libraries

- **Rust-based crawler** - High-performance web crawling
- **Go HTML-to-Markdown** - Fast HTML conversion
- **Rust PDF parser** - Efficient PDF processing

### Database & Storage

- **Supabase** - Database and authentication
- **Google Cloud Storage** - File storage
- **MongoDB** - Document storage

### Monitoring & Analytics

- **Sentry** - Error tracking
- **PostHog** - Analytics
- **Winston** - Structured logging
## 🐳 Docker Deployment

### Build and Run

```bash
# Build the Docker image
docker build -t smart-firecrawl-api .

# Run the container
docker run -p 3002:3002 \
  -e REDIS_URL=redis://host.docker.internal:6379 \
  -e DATABASE_URL=your_database_url \
  smart-firecrawl-api
```
### Docker Compose

```yaml
version: "3.8"
services:
  app:
    build: .
    ports:
      - "3002:3002"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=your_database_url
    depends_on:
      - redis
  redis:
    image: redis:latest
    ports:
      - "6379:6379"
```
## 🔍 Development Tools

### Available Scripts

- `npm start` - Start the development server
- `npm run build` - Build for production
- `npm test` - Run tests
- `npm run format` - Format code with Prettier
- `npm run workers` - Start queue workers

### Testing

- **Jest** - Testing framework
- **Supertest** - HTTP testing
- **E2E tests** - End-to-end testing
- **Unit tests** - Component testing

### Code Quality

- **TypeScript** - Static type checking
- **Prettier** - Code formatting
- **ESLint** - Code linting (if configured)
## 🚀 Deployment

### Fly.io Deployment

```bash
# Deploy to Fly.io
npm run deploy:fly

# Deploy to staging
npm run deploy:fly:staging
```

### Environment Variables for Production

```bash
IS_PRODUCTION=true
ENV=production
PORT=3002
HOST=0.0.0.0
```
## 📊 Monitoring

### Health Checks

- `GET /serverHealthCheck` - Server health status
- `GET /is-production` - Production status
### Queue Monitoring

- Bull Board UI available at `/admin/{BULL_AUTH_KEY}/queues`
- Real-time queue statistics
- Job status tracking
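Job status tracking can be pictured as a small state machine. The states below loosely mirror a BullMQ-style job lifecycle; the transition table is illustrative only, not the project's actual logic:

```typescript
type JobState = "waiting" | "active" | "completed" | "failed";

// Allowed transitions in this simplified lifecycle sketch.
const transitions: Record<JobState, JobState[]> = {
  waiting: ["active"],               // a worker picks the job up
  active: ["completed", "failed"],   // the job finishes one way or the other
  completed: [],                     // terminal
  failed: ["waiting"],               // a failed job may be re-queued for retry
};

function canTransition(from: JobState, to: JobState): boolean {
  return transitions[from].includes(to);
}
```

A guard like this makes illegal status updates (e.g. `completed → active`) fail loudly instead of silently corrupting job records.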
### Logging

- Structured logging with Winston
- Error tracking with Sentry
- Performance monitoring
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

## 📄 License

This project is licensed under the ISC License.

## 🆘 Support

For support and questions:

- Create an issue in the repository
- Contact the development team
- Check the documentation for common issues

---

**Note**: This is a production-ready web scraping API with enterprise features including AI integration, scalable queue processing, and comprehensive monitoring capabilities.