Smart Firecrawl API

A web scraping and crawling API service built with Node.js, TypeScript, and Express. It provides single-URL scraping, full-site crawling, multiple content extraction methods, AI-powered features, and scalable queue-based processing.

🏗️ Project Structure

smart-firecrawl-api/
├── src/                          # Main source code
│   ├── controllers/              # API controllers
│   │   ├── v0/                   # Legacy API endpoints
│   │   └── v1/                   # Current API endpoints
│   ├── lib/                      # Core libraries and utilities
│   │   ├── extract/              # Content extraction modules
│   │   ├── deep-research/        # AI-powered research features
│   │   └── generate-llmstxt/     # LLM text generation
│   ├── routes/                   # Express route definitions
│   ├── scraper/                  # Web scraping engines
│   │   ├── scrapeURL/            # URL scraping logic
│   │   └── WebScraper/           # Web scraper implementations
│   ├── services/                 # Background services
│   │   ├── billing/              # Credit and billing management
│   │   ├── queue-service.ts      # Queue management
│   │   └── redis.ts              # Redis configuration
│   ├── search/                   # Search functionality
│   └── types/                    # TypeScript type definitions
├── sharedLibs/                   # Native libraries
│   ├── crawler/                  # Rust-based crawler
│   ├── go-html-to-md/            # Go-based HTML to Markdown
│   ├── html-transformer/         # Rust HTML transformer
│   └── pdf-parser/               # Rust PDF parser
├── chrome-extension/             # Browser extension
├── firecrawl-extension/          # Firecrawl extension
├── page-discovery-extension/     # Page discovery extension
├── dist/                         # Compiled JavaScript output
├── node_modules/                 # Node.js dependencies
└── tests/                        # Test files

🚀 Getting Started

Prerequisites

  • Node.js 22+
  • pnpm (recommended) or npm
  • Redis server
  • Docker (for containerized deployment)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd smart-firecrawl-api
  2. Install dependencies

    pnpm install
    # or
    npm install
  3. Set up environment variables. Create a .env file with the following variables:

    # Database
    DATABASE_URL=your_database_url
    REDIS_URL=redis://localhost:6379

    # Authentication
    JWT_SECRET=your_jwt_secret

    # External Services
    SENTRY_DSN=your_sentry_dsn
    SLACK_WEBHOOK_URL=your_slack_webhook

    # AI Services
    OPENAI_API_KEY=your_openai_key
    ANTHROPIC_API_KEY=your_anthropic_key

    # Storage
    GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
  4. Start Redis server

    # Using Docker
    docker run -d -p 6379:6379 redis:latest

Development

  1. Start the development server

    npm run start:dev
    # or
    pnpm start:dev
  2. Start queue workers (in separate terminals)

    # Main queue worker
    npm run workers

    # Index worker
    npm run index-worker
  3. Run tests

    npm test

Production

  1. Build the project

    npm run build
  2. Start production server

    npm run start:production
  3. Start production workers

    npm run worker:production
    npm run index-worker:production

🔧 Core Functionality

API Endpoints

Scraping

  • POST /v1/scrape - Scrape a single URL
  • GET /v1/scrape/:jobId - Get scrape job status
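
For example, a minimal TypeScript sketch of a scrape request (the port, bearer-token header, and body shape are assumptions based on this README, not a confirmed contract):

// Scrape a single URL via the local API (hypothetical request shape).
const res = await fetch("http://localhost:3002/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer <your-api-key>", // assumed auth scheme
  },
  body: JSON.stringify({ url: "https://example.com" }),
});
console.log(await res.json());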

Crawling

  • POST /v1/crawl - Start a website crawl
  • GET /v1/crawl/:jobId - Get crawl job status
  • DELETE /v1/crawl/:jobId - Cancel a crawl job
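
Crawls run asynchronously, so a client typically starts a job and polls its status. A minimal sketch (the id and status response fields are assumptions for illustration):

// Start a crawl, then poll until it leaves the "scraping" state (hypothetical states).
const start = await fetch("http://localhost:3002/v1/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", limit: 50 }),
});
const { id } = await start.json(); // assumed response field

let status = "scraping";
while (status === "scraping") {
  await new Promise((r) => setTimeout(r, 2000)); // poll every 2 seconds
  const poll = await fetch(`http://localhost:3002/v1/crawl/${id}`);
  ({ status } = await poll.json()); // assumed response field
}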

Batch Operations

  • POST /v1/batch/scrape - Batch scrape multiple URLs
  • GET /v1/batch/scrape/:jobId - Get batch job status
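
A batch job follows the same start-then-poll pattern; a minimal sketch (the urls field is an assumption inferred from the endpoint name):

// Submit several URLs as one batch job (hypothetical body shape).
const res = await fetch("http://localhost:3002/v1/batch/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ urls: ["https://example.com", "https://example.org"] }),
});
const { id } = await res.json(); // then poll GET /v1/batch/scrape/:jobId
console.log("batch job id:", id);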

AI Features

  • POST /v1/extract - Extract structured data using AI
  • POST /v1/deep-research - Perform deep research on topics
  • POST /v1/llmstxt - Generate LLM-optimized text
  • POST /v1/search - Search the web
  • POST /v1/map - Map website structure
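
For instance, structured extraction is usually driven by a prompt plus a JSON schema; a minimal sketch (the body shape is an assumption for illustration):

// Ask the API to extract fields matching a JSON schema (hypothetical body shape).
const res = await fetch("http://localhost:3002/v1/extract", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    urls: ["https://example.com/pricing"],
    prompt: "Extract the product name and monthly price.",
    schema: {
      type: "object",
      properties: {
        product: { type: "string" },
        monthlyPrice: { type: "number" },
      },
      required: ["product", "monthlyPrice"],
    },
  }),
});
console.log(await res.json());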

Key Features

  1. Multi-Engine Scraping: Supports various scraping engines including Playwright, Puppeteer, and custom implementations

  2. AI-Powered Extraction:

    • Structured data extraction using LLMs
    • Content summarization and analysis
    • Deep research capabilities
  3. Queue-Based Processing (see the BullMQ sketch after this list):

    • BullMQ for job management
    • Redis for caching and coordination
    • Priority-based job processing
  4. Content Processing:

    • HTML to Markdown conversion
    • PDF parsing and extraction
    • Image and media handling
  5. Rate Limiting & Authentication:

    • Team-based authentication
    • Credit-based billing system
    • Rate limiting per endpoint
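
To illustrate the queue layer, here is a minimal BullMQ sketch; the queue name, job payload, and priority value are hypothetical, not the project's actual definitions:

import { Queue, Worker } from "bullmq";
import IORedis from "ioredis";

// Shared connection; BullMQ workers require maxRetriesPerRequest: null.
const connection = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  maxRetriesPerRequest: null,
});

// Hypothetical queue name; the project's real queues may differ.
const scrapeQueue = new Queue("scrape", { connection });

// Enqueue a job; in BullMQ, a lower priority number runs first.
await scrapeQueue.add("scrape-url", { url: "https://example.com" }, { priority: 1 });

// Worker that consumes jobs from the same queue.
new Worker(
  "scrape",
  async (job) => {
    console.log(`processing ${job.data.url}`);
    // ...actual scraping logic would run here...
  },
  { connection },
);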

📚 Libraries Used

Core Dependencies

  • Express.js - Web framework
  • TypeScript - Type-safe JavaScript
  • BullMQ - Queue management
  • Redis - Caching and job storage
  • Winston - Logging

AI & ML Libraries

  • @ai-sdk/anthropic - Anthropic Claude integration
  • @ai-sdk/openai - OpenAI integration
  • @ai-sdk/google - Google AI integration
  • @ai-sdk/groq - Groq integration

Web Scraping

  • Playwright - Browser automation
  • Cheerio - Server-side jQuery
  • Axios - HTTP client
  • Undici - High-performance HTTP client
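
As a quick illustration of how these pair up, a minimal sketch that fetches a page with Undici and queries it with Cheerio (the target URL is arbitrary):

import { request } from "undici";
import * as cheerio from "cheerio";

// Fetch raw HTML with Undici, then query it with Cheerio's jQuery-like API.
const { body } = await request("https://example.com");
const html = await body.text();

const $ = cheerio.load(html);
console.log($("title").text()); // page title
$("a[href]").each((_, el) => console.log($(el).attr("href"))); // every link target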

Content Processing

  • Turndown - HTML to Markdown
  • PDF-parse - PDF text extraction
  • Mammoth - Word document processing
  • JSDOM - DOM manipulation
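
For example, Turndown's core conversion is a single call; a minimal sketch:

import TurndownService from "turndown";

// Convert an HTML fragment to Markdown using ATX-style (#) headings.
const turndown = new TurndownService({ headingStyle: "atx" });
const markdown = turndown.turndown("<h1>Hello</h1><p>This is <strong>bold</strong>.</p>");
console.log(markdown); // "# Hello\n\nThis is **bold**."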

Native Libraries

  • Rust-based crawler - High-performance web crawling
  • Go HTML-to-Markdown - Fast HTML conversion
  • Rust PDF parser - Efficient PDF processing

Database & Storage

  • Supabase - Database and authentication
  • Google Cloud Storage - File storage
  • MongoDB - Document storage

Monitoring & Analytics

  • Sentry - Error tracking
  • PostHog - Analytics
  • Winston - Structured logging

🐳 Docker Deployment

Build and Run

# Build the Docker image
docker build -t smart-firecrawl-api .

# Run the container
docker run -p 3002:3002 \
  -e REDIS_URL=redis://host.docker.internal:6379 \
  -e DATABASE_URL=your_database_url \
  smart-firecrawl-api

Docker Compose

version: "3.8"
services:
  app:
    build: .
    ports:
      - "3002:3002"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=your_database_url
    depends_on:
      - redis

  redis:
    image: redis:latest
    ports:
      - "6379:6379"
🔍 Development Tools

Available Scripts

  • npm start - Start development server
  • npm run build - Build for production
  • npm test - Run tests
  • npm run format - Format code with Prettier
  • npm run workers - Start queue workers

Testing

  • Jest - Testing framework
  • Supertest - HTTP testing
  • E2E tests - End-to-end testing
  • Unit tests - Component testing

Code Quality

  • TypeScript - Static type checking
  • Prettier - Code formatting
  • ESLint - Code linting (if configured)

🚀 Deployment

Fly.io Deployment

# Deploy to Fly.io
npm run deploy:fly

# Deploy to staging
npm run deploy:fly:staging

Environment Variables for Production

IS_PRODUCTION=true
ENV=production
PORT=3002
HOST=0.0.0.0

📊 Monitoring

Health Checks

  • GET /serverHealthCheck - Server health status
  • GET /is-production - Production status

Queue Monitoring

  • Bull Board UI available at /admin/{BULL_AUTH_KEY}/queues
  • Real-time queue statistics
  • Job status tracking
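
Wiring Bull Board into the Express app typically looks like the sketch below (using the @bull-board packages; the queue name and BULL_AUTH_KEY wiring are assumptions, not the project's exact setup):

import express from "express";
import { Queue } from "bullmq";
import { createBullBoard } from "@bull-board/api";
import { BullMQAdapter } from "@bull-board/api/bullMQAdapter";
import { ExpressAdapter } from "@bull-board/express";

const app = express();
const serverAdapter = new ExpressAdapter();

// Hypothetical queue; the app would register its real queues here.
const scrapeQueue = new Queue("scrape", {
  connection: { host: "localhost", port: 6379 },
});

createBullBoard({
  queues: [new BullMQAdapter(scrapeQueue)],
  serverAdapter,
});

// Mount the dashboard behind the BULL_AUTH_KEY path segment.
const basePath = `/admin/${process.env.BULL_AUTH_KEY}/queues`;
serverAdapter.setBasePath(basePath);
app.use(basePath, serverAdapter.getRouter());

app.listen(3002);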

Logging

  • Structured logging with Winston
  • Error tracking with Sentry
  • Performance monitoring

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

📄 License

This project is licensed under the ISC License.

🆘 Support

For support and questions:

  • Create an issue in the repository
  • Contact the development team
  • Check the documentation for common issues

Note: This is a production-ready web scraping API with enterprise features including AI integration, scalable queue processing, and comprehensive monitoring capabilities.