# Smart Firecrawl API

A comprehensive web scraping and crawling API service built with Node.js, TypeScript, and Express. It provides web scraping with multiple content extraction methods, AI-powered features, and scalable queue-based processing.
## 🏗️ Project Structure

```
smart-firecrawl-api/
├── src/                          # Main source code
│   ├── controllers/              # API controllers
│   │   ├── v0/                   # Legacy API endpoints
│   │   └── v1/                   # Current API endpoints
│   ├── lib/                      # Core libraries and utilities
│   │   ├── extract/              # Content extraction modules
│   │   ├── deep-research/        # AI-powered research features
│   │   └── generate-llmstxt/     # LLM text generation
│   ├── routes/                   # Express route definitions
│   ├── scraper/                  # Web scraping engines
│   │   ├── scrapeURL/            # URL scraping logic
│   │   └── WebScraper/           # Web scraper implementations
│   ├── services/                 # Background services
│   │   ├── billing/              # Credit and billing management
│   │   ├── queue-service.ts      # Queue management
│   │   └── redis.ts              # Redis configuration
│   ├── search/                   # Search functionality
│   └── types/                    # TypeScript type definitions
├── sharedLibs/                   # Native libraries
│   ├── crawler/                  # Rust-based crawler
│   ├── go-html-to-md/            # Go-based HTML to Markdown
│   ├── html-transformer/         # Rust HTML transformer
│   └── pdf-parser/               # Rust PDF parser
├── chrome-extension/             # Browser extension
├── firecrawl-extension/          # Firecrawl extension
├── page-discovery-extension/     # Page discovery extension
├── dist/                         # Compiled JavaScript output
├── node_modules/                 # Node.js dependencies
└── tests/                        # Test files
```
## 🚀 Getting Started

### Prerequisites

- Node.js 22+
- pnpm (recommended) or npm
- Redis server
- Docker (for containerized deployment)
### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd smart-firecrawl-api
   ```

2. **Install dependencies**

   ```bash
   pnpm install
   # or
   npm install
   ```

3. **Set up environment variables**

   Create a `.env` file with the following variables:

   ```bash
   # Database
   DATABASE_URL=your_database_url
   REDIS_URL=redis://localhost:6379

   # Authentication
   JWT_SECRET=your_jwt_secret

   # External Services
   SENTRY_DSN=your_sentry_dsn
   SLACK_WEBHOOK_URL=your_slack_webhook

   # AI Services
   OPENAI_API_KEY=your_openai_key
   ANTHROPIC_API_KEY=your_anthropic_key

   # Storage
   GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
   ```

4. **Start the Redis server**

   ```bash
   # Using Docker
   docker run -d -p 6379:6379 redis:latest
   ```

   (Note: the repository's `npm run mongo-docker` script starts a MongoDB container, not Redis.)
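With the `.env` template above in place, a startup check that fails fast on missing configuration can save debugging time. A minimal sketch in TypeScript — the `missingEnvVars` helper and the variable subset below are illustrative, not part of the project:

```typescript
// Return the names of required variables that are missing or blank.
function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name] || env[name]!.trim() === "");
}

// Example: a partially filled environment.
const sample = { DATABASE_URL: "postgres://...", REDIS_URL: "" };
console.log(missingEnvVars(sample, ["DATABASE_URL", "REDIS_URL", "JWT_SECRET"]));
// → ["REDIS_URL", "JWT_SECRET"]
```

At startup, passing `process.env` and exiting when the returned array is non-empty gives a clear failure mode instead of obscure runtime errors later.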
### Development

1. **Start the development server**

   ```bash
   npm run start:dev
   # or
   pnpm start:dev
   ```

2. **Start the queue workers** (in separate terminals)

   ```bash
   # Main queue worker
   npm run workers

   # Index worker
   npm run index-worker
   ```

3. **Run tests**

   ```bash
   npm test
   ```
### Production

1. **Build the project**

   ```bash
   npm run build
   ```

2. **Start the production server**

   ```bash
   npm run start:production
   ```

3. **Start the production workers**

   ```bash
   npm run worker:production
   npm run index-worker:production
   ```
## 🔧 Core Functionality

### API Endpoints

#### Scraping

- `POST /v1/scrape` - Scrape a single URL
- `GET /v1/scrape/:jobId` - Get scrape job status

#### Crawling

- `POST /v1/crawl` - Start a website crawl
- `GET /v1/crawl/:jobId` - Get crawl job status
- `DELETE /v1/crawl/:jobId` - Cancel a crawl job

#### Batch Operations

- `POST /v1/batch/scrape` - Batch scrape multiple URLs
- `GET /v1/batch/scrape/:jobId` - Get batch job status

#### AI Features

- `POST /v1/extract` - Extract structured data using AI
- `POST /v1/deep-research` - Perform deep research on topics
- `POST /v1/llmstxt` - Generate LLM-optimized text

#### Search

- `POST /v1/search` - Search the web
- `POST /v1/map` - Map website structure
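As a usage sketch, a `POST /v1/scrape` call could be assembled as below. The `ScrapeRequest` fields and the `buildScrapeCall` helper are illustrative assumptions; consult the controllers in `src/controllers/v1/` for the parameters the deployed version actually accepts.

```typescript
// Hypothetical request shape; field names mirror common Firecrawl-style
// options and may differ from this API's actual schema.
interface ScrapeRequest {
  url: string;
  formats?: string[]; // e.g. ["markdown", "html"]
}

// Assemble the endpoint URL and fetch init for POST /v1/scrape.
function buildScrapeCall(baseUrl: string, req: ScrapeRequest, token: string) {
  return {
    endpoint: `${baseUrl}/v1/scrape`,
    init: {
      method: "POST" as const,
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(req),
    },
  };
}

const call = buildScrapeCall(
  "http://localhost:3002",
  { url: "https://example.com", formats: ["markdown"] },
  "YOUR_API_KEY",
);
// fetch(call.endpoint, call.init).then((res) => res.json());
```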
### Key Features

- **Multi-Engine Scraping**: Supports various scraping engines including Playwright, Puppeteer, and custom implementations
- **AI-Powered Extraction**:
  - Structured data extraction using LLMs
  - Content summarization and analysis
  - Deep research capabilities
- **Queue-Based Processing**:
  - BullMQ for job management
  - Redis for caching and coordination
  - Priority-based job processing
- **Content Processing**:
  - HTML to Markdown conversion
  - PDF parsing and extraction
  - Image and media handling
- **Rate Limiting & Authentication**:
  - Team-based authentication
  - Credit-based billing system
  - Per-endpoint rate limiting
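To illustrate the rate-limiting idea, here is a minimal in-memory token-bucket sketch. The real service applies per-team, per-endpoint limits backed by Redis; this class is only a teaching aid, not the project's implementation:

```typescript
// Token bucket: `capacity` tokens, refilled at `refillPerSec` tokens/second.
// Each accepted request consumes one token; bursts drain the bucket.
class TokenBucket {
  private tokens: number;
  constructor(
    private capacity: number,
    private refillPerSec: number,
    private lastRefill = 0,
  ) {
    this.tokens = capacity;
  }

  // `now` is a timestamp in seconds, injected to keep the sketch testable.
  tryRemove(now: number): boolean {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Injecting `now` rather than calling `Date.now()` internally makes the limiter deterministic under test; a Redis-backed variant would store `tokens` and `lastRefill` per team key.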
## 📚 Libraries Used

### Core Dependencies

- **Express.js** - Web framework
- **TypeScript** - Type-safe JavaScript
- **BullMQ** - Queue management
- **Redis** - Caching and job storage
- **Winston** - Logging

### AI & ML Libraries

- **@ai-sdk/anthropic** - Anthropic Claude integration
- **@ai-sdk/openai** - OpenAI integration
- **@ai-sdk/google** - Google AI integration
- **@ai-sdk/groq** - Groq integration

### Web Scraping

- **Playwright** - Browser automation
- **Cheerio** - Server-side jQuery-like HTML parsing
- **Axios** - HTTP client
- **Undici** - High-performance HTTP client

### Content Processing

- **Turndown** - HTML to Markdown conversion
- **pdf-parse** - PDF text extraction
- **Mammoth** - Word document processing
- **JSDOM** - DOM manipulation

### Native Libraries

- **Rust-based crawler** - High-performance web crawling
- **Go HTML-to-Markdown** - Fast HTML conversion
- **Rust PDF parser** - Efficient PDF processing

### Database & Storage

- **Supabase** - Database and authentication
- **Google Cloud Storage** - File storage
- **MongoDB** - Document storage

### Monitoring & Analytics

- **Sentry** - Error tracking
- **PostHog** - Analytics
- **Winston** - Structured logging
## 🐳 Docker Deployment

### Build and Run

```bash
# Build the Docker image
docker build -t smart-firecrawl-api .

# Run the container
docker run -p 3002:3002 \
  -e REDIS_URL=redis://host.docker.internal:6379 \
  -e DATABASE_URL=your_database_url \
  smart-firecrawl-api
```
### Docker Compose

```yaml
version: "3.8"
services:
  app:
    build: .
    ports:
      - "3002:3002"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=your_database_url
    depends_on:
      - redis
  redis:
    image: redis:latest
    ports:
      - "6379:6379"
```
## 🔍 Development Tools

### Available Scripts

- `npm start` - Start the development server
- `npm run build` - Build for production
- `npm test` - Run tests
- `npm run format` - Format code with Prettier
- `npm run workers` - Start queue workers

### Testing

- **Jest** - Testing framework
- **Supertest** - HTTP testing
- **E2E tests** - End-to-end testing
- **Unit tests** - Component testing

### Code Quality

- **TypeScript** - Static type checking
- **Prettier** - Code formatting
- **ESLint** - Code linting (if configured)
## 🚀 Deployment

### Fly.io Deployment

```bash
# Deploy to Fly.io
npm run deploy:fly

# Deploy to staging
npm run deploy:fly:staging
```

### Environment Variables for Production

```bash
IS_PRODUCTION=true
ENV=production
PORT=3002
HOST=0.0.0.0
```
## 📊 Monitoring

### Health Checks

- `GET /serverHealthCheck` - Server health status
- `GET /is-production` - Production status
### Queue Monitoring

- Bull Board UI available at `/admin/{BULL_AUTH_KEY}/queues`
- Real-time queue statistics
- Job status tracking
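Job status tracking can be pictured as a small state machine. The states below loosely mirror a BullMQ-style job lifecycle; the transition table is illustrative only, not the project's actual logic:

```typescript
type JobState = "waiting" | "active" | "completed" | "failed";

// Allowed transitions in this simplified lifecycle sketch.
const transitions: Record<JobState, JobState[]> = {
  waiting: ["active"],               // a worker picks the job up
  active: ["completed", "failed"],   // the job finishes one way or the other
  completed: [],                     // terminal
  failed: ["waiting"],               // a failed job may be re-queued for retry
};

function canTransition(from: JobState, to: JobState): boolean {
  return transitions[from].includes(to);
}
```

A guard like this makes illegal status updates (e.g. `completed → active`) fail loudly instead of silently corrupting job records.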
### Logging

- Structured logging with Winston
- Error tracking with Sentry
- Performance monitoring
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

## 📄 License

This project is licensed under the ISC License.

## 🆘 Support

For support and questions:

- Create an issue in the repository
- Contact the development team
- Check the documentation for common issues

---

**Note**: This is a production-ready web scraping API with enterprise features including AI integration, scalable queue processing, and comprehensive monitoring capabilities.