Firecrawl & Playwright - Complete Implementation Guide
A comprehensive guide to the Smart Firecrawl API and Smart Playwright Service. This document centralizes the documentation for both services, covering their architecture, features, and integration patterns.
Table of Contents
- System Architecture
- Smart Firecrawl API
- Smart Playwright Service
- Integration Patterns
- Deployment & Operations
- Advanced Features
- Best Practices
- Troubleshooting
- Use Cases
System Architecture
The Smart Firecrawl API and Smart Playwright Service form a comprehensive web scraping ecosystem with the following architecture:
┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                      │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                     Smart Firecrawl API                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   Controllers   │  │    Services     │  │    Queue    │  │
│  │     (v0/v1)     │  │    (Billing)    │  │  (BullMQ)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │    Scrapers     │  │   Extractors    │  │   AI/LLM    │  │
│  │  (Playwright)   │  │    (Content)    │  │  (Claude)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                  Smart Playwright Service                    │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │     Browser     │  │     Stealth     │  │  Resource   │  │
│  │   Automation    │  │    Features     │  │  Blocking   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                      External Services                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │      Redis      │  │    Database     │  │   Storage   │  │
│  │     (Queue)     │  │   (Supabase)    │  │    (GCS)    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────────┘
Key Components
- Smart Firecrawl API: Main orchestration service with AI-powered extraction
- Smart Playwright Service: Microservice for complex browser automation
- Queue System: BullMQ with Redis for job management
- AI Integration: Multiple LLM providers (OpenAI, Anthropic, Google, Groq)
- Content Processing: Native libraries for HTML-to-Markdown, PDF parsing
Smart Firecrawl API
The Smart Firecrawl API is a web scraping and crawling service built with Node.js, TypeScript, and Express. It pairs AI-powered content extraction with queue-based processing and enterprise features such as authentication, billing, and rate limiting.
Project Structure
smart-firecrawl-api/
├── src/ # Main source code
│ ├── controllers/ # API controllers
│ │ ├── v0/ # Legacy API endpoints
│ │ └── v1/ # Current API endpoints
│ ├── lib/ # Core libraries and utilities
│ │ ├── extract/ # Content extraction modules
│ │ ├── deep-research/ # AI-powered research features
│ │ └── generate-llmstxt/ # LLM text generation
│ ├── routes/ # Express route definitions
│ ├── scraper/ # Web scraping engines
│ │ ├── scrapeURL/ # URL scraping logic
│ │ └── WebScraper/ # Web scraper implementations
│ ├── services/ # Background services
│ │ ├── billing/ # Credit and billing management
│ │ ├── queue-service.ts # Queue management
│ │ └── redis.ts # Redis configuration
│ ├── search/ # Search functionality
│ └── types/ # TypeScript type definitions
├── sharedLibs/ # Native libraries
│ ├── crawler/ # Rust-based crawler
│ ├── go-html-to-md/ # Go-based HTML to Markdown
│ ├── html-transformer/ # Rust HTML transformer
│ └── pdf-parser/ # Rust PDF parser
├── chrome-extension/ # Browser extension
├── firecrawl-extension/ # Firecrawl extension
├── page-discovery-extension/ # Page discovery extension
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
└── tests/ # Test files
Getting Started
Prerequisites
- Node.js 22+
- pnpm (recommended) or npm
- Redis server
- Docker (for containerized deployment)
Installation
- Clone and install dependencies
git clone <repository-url>
cd smart-firecrawl-api
pnpm install
- Set up environment variables
# Database
DATABASE_URL=your_database_url
REDIS_URL=redis://localhost:6379
# Authentication
JWT_SECRET=your_jwt_secret
# External Services
SENTRY_DSN=your_sentry_dsn
SLACK_WEBHOOK_URL=your_slack_webhook
# AI Services
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
# Storage
GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
- Start services
# Start Redis
docker run -d -p 6379:6379 redis:latest
# Start development server
npm run start:dev
# Start queue workers (separate terminals)
npm run workers
npm run index-worker
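Once Redis, the dev server, and the workers are running, a quick smoke test confirms the API is reachable (the health endpoints are described under Monitoring & Health Checks below):
# Verify the API is up
curl http://localhost:3002/serverHealthCheck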
API Endpoints
Scraping Endpoints
- POST /v1/scrape - Scrape a single URL
- GET /v1/scrape/:jobId - Get scrape job status
Crawling Endpoints
- POST /v1/crawl - Start a website crawl
- GET /v1/crawl/:jobId - Get crawl job status
- DELETE /v1/crawl/:jobId - Cancel a crawl job
Batch Operations
- POST /v1/batch/scrape - Batch scrape multiple URLs
- GET /v1/batch/scrape/:jobId - Get batch job status
AI Features
- POST /v1/extract - Extract structured data using AI
- POST /v1/deep-research - Perform deep research on topics
- POST /v1/llmstxt - Generate LLM-optimized text
Search & Mapping
- POST /v1/search - Search the web
- POST /v1/map - Map website structure
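Neither search nor map is demonstrated later in this guide, so here is a minimal sketch of a map request; any body fields beyond url are assumptions to verify against your deployment:
// Minimal sketch: map a site's structure (body shape beyond "url" is an assumption)
const mapResponse = await fetch("http://localhost:3002/v1/map", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer your-api-key",
  },
  body: JSON.stringify({ url: "https://example.com" }),
});
const siteMap = await mapResponse.json();
console.log(siteMap);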
Core Features
- Multi-Engine Scraping: Supports various scraping engines, including Playwright, Puppeteer, and custom implementations
- AI-Powered Extraction:
  - Structured data extraction using LLMs
  - Content summarization and analysis
  - Deep research capabilities
- Queue-Based Processing (see the sketch after this list):
  - BullMQ for job management
  - Redis for caching and coordination
  - Priority-based job processing
- Content Processing:
  - HTML to Markdown conversion
  - PDF parsing and extraction
  - Image and media handling
- Rate Limiting & Authentication:
  - Team-based authentication
  - Credit-based billing system
  - Rate limiting per endpoint
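To make priority-based processing concrete, here is a minimal BullMQ sketch; the queue and job names are illustrative, not the API's internal names:
// Illustrative only: priority-based job processing with BullMQ
const { Queue } = require("bullmq");

const queue = new Queue("scrape", {
  connection: { host: "localhost", port: 6379 },
});

// Lower numbers run first, so urgent jobs jump ahead of bulk work
await queue.add("scrape-url", { url: "https://example.com" }, { priority: 1 });
await queue.add("scrape-url", { url: "https://example.org" }, { priority: 10 });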
Basic Usage Examples
Scrape Single URL
// Using the API directly
const response = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
formats: ["markdown", "html"],
onlyMainContent: true,
}),
});
const result = await response.json();
console.log(result.data.markdown);
Crawl Website
// Start a crawl job
const crawlResponse = await fetch("http://localhost:3002/v1/crawl", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
crawlerOptions: {
includes: ["/blog/*", "/articles/*"],
excludes: ["/admin/*", "/login"],
limit: 10,
},
pageOptions: {
formats: ["markdown"],
onlyMainContent: true,
},
}),
});
const crawlJob = await crawlResponse.json();
console.log("Crawl job ID:", crawlJob.jobId);
Smart Playwright Service
The Smart Playwright Service is a lightweight, high-performance web scraping microservice built with Node.js, TypeScript, and Playwright. It handles advanced browser automation with stealth features and performance optimizations suited to production environments.
Project Structure
smart-playwright-service/
├── api.ts # Main API server implementation
├── helpers/ # Utility functions
│ └── get_error.ts # HTTP error handling
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
├── package.json # Project configuration
├── tsconfig.json # TypeScript configuration
├── Dockerfile # Docker container setup
├── .dockerignore # Docker ignore patterns
├── .gitignore # Git ignore patterns
└── README.md # Project documentation
Getting Started
Prerequisites
- Node.js 18+
- npm or pnpm
- Docker (for containerized deployment)
Installation
- Clone and install dependencies
git clone <repository-url>
cd smart-playwright-service
npm install
npx playwright install
- Set up environment variables
# Server Configuration
PORT=3003
# Proxy Configuration (Optional)
PROXY_SERVER=your_proxy_server
PROXY_USERNAME=your_proxy_username
PROXY_PASSWORD=your_proxy_password
# Performance Settings
BLOCK_MEDIA=true
- Start the service
# Development
npm run dev
# Production
npm run build
npm start
API Endpoints
Health Check
- GET /health - Service health status
Web Scraping
- POST /scrape - Scrape a single URL with advanced options
Scraping Request Format
{
"url": "https://example.com",
"wait_after_load": 1000,
"timeout": 15000,
"headers": {
"Custom-Header": "value",
"User-Agent": "Mozilla/5.0...",
"Cookie": "session=abc123; user=john"
},
"check_selector": "#content"
}
Response Format
{
"content": "<html>...</html>",
"pageStatusCode": 200,
"contentType": "text/html",
"pageError": null
}
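A client should check pageStatusCode and pageError before using content. A minimal consumer of the documented response shape:
// Minimal client-side handling of the documented response shape
const res = await fetch("http://localhost:3003/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", timeout: 15000 }),
});
const { content, pageStatusCode, pageError } = await res.json();
if (pageError || pageStatusCode >= 400) {
  throw new Error(`Scrape failed (${pageStatusCode}): ${pageError}`);
}
console.log(content.slice(0, 200)); // raw HTML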
Key Features
- Advanced Browser Automation (see the per-request sketch after this list):
  - Playwright-based browser automation
  - Stealth mode to avoid detection
  - Random user-agent rotation
  - Custom viewport and device simulation
- Smart Request Blocking:
  - Blocks ad-serving domains automatically
  - Optional media file blocking for performance
  - Configurable request filtering
- Cookie Management:
  - Automatic cookie parsing from headers
  - Support for prefixed secure cookies (__Secure-, __Host-)
  - Domain-specific cookie handling
  - Google services cookie optimization
- Proxy Support:
  - HTTP/HTTPS proxy configuration
  - Authentication support
  - Environment-based proxy settings
- Performance Optimizations:
  - Fresh browser instance per request
  - Resource blocking for faster loading
  - Configurable timeouts and waits
  - Memory-efficient cleanup
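To make the per-request lifecycle concrete, here is a minimal sketch of what the service does for each /scrape call: launch a fresh browser, apply headers and optional media blocking, navigate, honor wait_after_load and check_selector, and clean up. This is an illustration of the documented behavior, not the actual source:
// Illustrative sketch of the per-request flow (names and structure are assumptions)
const { chromium } = require("playwright");

async function scrapeOnce({ url, headers = {}, wait_after_load = 0, timeout = 15000, check_selector }) {
  // Fresh browser instance per request keeps memory usage predictable
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({ extraHTTPHeaders: headers });
    const page = await context.newPage();

    // Optional media blocking for faster loads (BLOCK_MEDIA=true)
    if (process.env.BLOCK_MEDIA === "true") {
      await page.route("**/*.{png,jpg,jpeg,gif,mp4,webm,woff,woff2}", (route) =>
        route.abort()
      );
    }

    const response = await page.goto(url, { timeout, waitUntil: "load" });
    if (wait_after_load > 0) await page.waitForTimeout(wait_after_load);
    if (check_selector) await page.waitForSelector(check_selector, { timeout });

    return {
      content: await page.content(),
      pageStatusCode: response ? response.status() : null,
      contentType: response ? response.headers()["content-type"] : null,
      pageError: null,
    };
  } finally {
    await browser.close(); // memory-efficient cleanup
  }
}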
Usage Examples
Basic Scraping
curl -X POST http://localhost:3003/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"timeout": 15000
}'
Advanced Scraping with Headers
curl -X POST http://localhost:3003/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"wait_after_load": 2000,
"timeout": 30000,
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Cookie": "session=abc123; user=john"
},
"check_selector": "#main-content"
}'
Integration Patterns
Service Communication
The Smart Firecrawl API and Smart Playwright Service work together through a microservice architecture:
// Firecrawl API configuration for Playwright integration
const config = {
playwrightService: {
url:
process.env.PLAYWRIGHT_MICROSERVICE_URL || "http://localhost:3003/scrape",
timeout: 30000,
retries: 3,
},
};
Hybrid Scraping Strategy
async function intelligentScrape(url, options = {}) {
// 1. Try Firecrawl first for simple content
try {
const firecrawlResult = await firecrawl.scrapeUrl(url, {
formats: ["markdown"],
onlyMainContent: true,
...options,
});
if (firecrawlResult.success) {
return {
method: "firecrawl",
data: firecrawlResult.data,
};
}
} catch (error) {
console.log("Firecrawl failed, trying Playwright...");
}
// 2. Fallback to Playwright for complex content
try {
const playwrightResult = await fetch(config.playwrightService.url, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 2000,
timeout: 30000,
...options,
}),
});
const data = await playwrightResult.json();
return {
method: "playwright",
data: {
content: data.content,
statusCode: data.pageStatusCode,
},
};
} catch (error) {
throw new Error(`Both scraping methods failed: ${error.message}`);
}
}
Queue-Based Processing
// Firecrawl API queue integration (BullMQ)
const { Queue, Worker } = require("bullmq");

const connection = { host: "localhost", port: 6379 };
const scrapeQueue = new Queue("scrape", { connection });

// Add a job to the queue
const job = await scrapeQueue.add("scrape-url", {
  url: "https://example.com",
  options: {
    usePlaywright: true,
    formats: ["markdown"],
  },
});

// Process jobs
const worker = new Worker(
  "scrape",
  async (job) => {
    const { url, options } = job.data;
    if (options.usePlaywright) {
      // Route dynamic pages through the Playwright service
      return await callPlaywrightService(url, options);
    }
    // Use Firecrawl directly for static content
    return await firecrawl.scrapeUrl(url, options);
  },
  { connection }
);
Deployment & Operations
Docker Deployment
Smart Firecrawl API
# Dockerfile for Smart Firecrawl API
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./
# Install all dependencies; the TypeScript build needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies from the final image
RUN npm prune --omit=dev
EXPOSE 3002
CMD ["npm", "start"]
# Build and run
docker build -t smart-firecrawl-api .
docker run -p 3002:3002 \
-e REDIS_URL=redis://host.docker.internal:6379 \
-e DATABASE_URL=your_database_url \
smart-firecrawl-api
Smart Playwright Service
# Dockerfile for Smart Playwright Service
FROM mcr.microsoft.com/playwright:v1.40.0-focal
WORKDIR /app
COPY package*.json ./
# Install all dependencies; the TypeScript build needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies from the final image
RUN npm prune --omit=dev
EXPOSE 3003
CMD ["npm", "start"]
Docker Compose Setup
version: "3.8"
services:
firecrawl-api:
build: ./smart-firecrawl-api
ports:
- "3002:3002"
environment:
- REDIS_URL=redis://redis:6379
- DATABASE_URL=your_database_url
- PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3003/scrape
depends_on:
- redis
- playwright-service
playwright-service:
build: ./smart-playwright-service
ports:
- "3003:3003"
environment:
- BLOCK_MEDIA=true
volumes:
- /tmp/.cache:/tmp/.cache
redis:
image: redis:latest
ports:
- "6379:6379"
Environment Configuration
Firecrawl API Environment
# Production Environment
IS_PRODUCTION=true
ENV=production
PORT=3002
HOST=0.0.0.0
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/firecrawl
REDIS_URL=redis://localhost:6379
# AI Services
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
GROQ_API_KEY=your_groq_key
# Storage
GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
# Monitoring
SENTRY_DSN=your_sentry_dsn
SLACK_WEBHOOK_URL=your_slack_webhook
Playwright Service Environment
# Production Environment
NODE_ENV=production
PORT=3003
# Proxy Configuration
PROXY_SERVER=your_proxy_server
PROXY_USERNAME=your_proxy_username
PROXY_PASSWORD=your_proxy_password
# Performance
BLOCK_MEDIA=true
Monitoring & Health Checks
Firecrawl API Health Endpoints
# Server health
curl http://localhost:3002/serverHealthCheck
# Production status
curl http://localhost:3002/is-production
# Queue monitoring (with auth)
curl http://localhost:3002/admin/{BULL_AUTH_KEY}/queues
Playwright Service Health
# Service health
curl http://localhost:3003/health
Scaling Considerations
Horizontal Scaling
# Docker Compose with multiple instances
services:
  firecrawl-api-1:
    build: ./smart-firecrawl-api
    ports: ["3002:3002"]
  firecrawl-api-2:
    build: ./smart-firecrawl-api
    ports: ["3003:3002"]
  playwright-service-1:
    build: ./smart-playwright-service
    ports: ["3004:3003"]
  playwright-service-2:
    build: ./smart-playwright-service
    ports: ["3005:3003"]
Load Balancing
# Nginx configuration
upstream firecrawl_api {
server localhost:3002;
server localhost:3003;
}
upstream playwright_service {
server localhost:3004;
server localhost:3005;
}
server {
listen 80;
location /api/ {
proxy_pass http://firecrawl_api;
}
location /scrape/ {
proxy_pass http://playwright_service;
}
}
Advanced Features
AI-Powered Content Extraction
Custom LLM Extraction
// Firecrawl API - Custom extraction with multiple LLM providers
const extractResult = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
extractionPrompt:
"Extract all product information including name, price, and description",
extractionSchema: {
type: "object",
properties: {
products: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "string" },
description: { type: "string" },
},
},
},
},
},
llmProvider: "anthropic", // or 'openai', 'google', 'groq'
}),
});
Deep Research Capabilities
// Firecrawl API - Deep research on topics
const researchResult = await fetch("http://localhost:3002/v1/deep-research", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
topic: "AI tools for developers",
depth: "comprehensive",
sources: 10,
includeAnalysis: true,
}),
});
Native Library Integration
Rust-based Crawler
// Using native Rust crawler for high-performance crawling
const crawlResult = await fetch("http://localhost:3002/v1/crawl", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
crawlerOptions: {
engine: "rust", // Use Rust-based crawler
maxConcurrency: 10,
respectRobotsTxt: true,
delayBetweenRequests: 1000,
},
}),
});
Go HTML-to-Markdown
// High-performance HTML to Markdown conversion
const markdownResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
formats: ["markdown"],
converter: "go", // Use Go-based converter
}),
});
Stealth and Anti-Detection
Playwright Service Stealth Features
// Advanced stealth configuration
const stealthResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: "https://example.com",
stealth: {
userAgent: "random", // Random user agent rotation
viewport: { width: 1920, height: 1080 },
timezone: "America/New_York",
locale: "en-US",
geolocation: { latitude: 40.7128, longitude: -74.006 },
},
antiDetection: {
removeWebdriver: true,
spoofChrome: true,
randomizeFingerprint: true,
},
}),
});
Performance Optimizations
Resource Blocking
// Block unnecessary resources for faster scraping
const optimizedResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: "https://example.com",
blockResources: {
images: true,
stylesheets: false,
fonts: true,
media: true,
ads: true,
analytics: true,
},
customBlockList: [
"*.google-analytics.com",
"*.facebook.com",
"*.doubleclick.net",
],
}),
});
Best Practices
Error Handling and Retries
// Robust error handling with exponential backoff
async function robustScrape(url, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
// Try Firecrawl first
const firecrawlResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, formats: ["markdown"] }),
});
if (firecrawlResult.ok) {
return await firecrawlResult.json();
}
} catch (error) {
console.log(`Firecrawl attempt ${attempt} failed:`, error.message);
}
// Fallback to Playwright
try {
const playwrightResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, timeout: 30000 }),
});
if (playwrightResult.ok) {
return await playwrightResult.json();
}
} catch (error) {
console.log(`Playwright attempt ${attempt} failed:`, error.message);
}
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries} attempts`);
}
// Exponential backoff
await new Promise((resolve) =>
setTimeout(resolve, Math.pow(2, attempt) * 1000)
);
}
}
Rate Limiting and Respectful Scraping
class RespectfulScraper {
  constructor(delayMs = 1000) {
    this.delayMs = delayMs;
    this.lastRequest = 0;
  }

  async scrape(url) {
    // Respect rate limits: wait until delayMs has passed since the last request
    const timeSinceLastRequest = Date.now() - this.lastRequest;
    if (timeSinceLastRequest < this.delayMs) {
      await new Promise((resolve) =>
        setTimeout(resolve, this.delayMs - timeSinceLastRequest)
      );
    }
    this.lastRequest = Date.now();
    return await this.performScrape(url);
  }

  // Missing from the original snippet: delegate to the Playwright Service
  // (illustrative; swap in whichever scraping call you actually use)
  async performScrape(url) {
    const response = await fetch("http://localhost:3003/scrape", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url }),
    });
    return await response.json();
  }
}
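Usage is straightforward; every call is automatically spaced out:
// Scrape two pages at least 1.5 seconds apart
const scraper = new RespectfulScraper(1500);
const first = await scraper.scrape("https://example.com");
const second = await scraper.scrape("https://example.org");
console.log(first.pageStatusCode, second.pageStatusCode);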
Resource Management
// Sequential batch scraping; the Playwright Service disposes of its browser
// after each request, so no client-side cleanup is required
async function scrapeWithCleanup(urls) {
const results = [];
for (const url of urls) {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url }),
});
const data = await response.json();
results.push({ url, data });
} catch (error) {
console.error(`Failed to scrape ${url}:`, error);
}
}
return results;
}
Configuration Management
// Environment-based configuration
const config = {
development: {
firecrawlUrl: "http://localhost:3002",
playwrightUrl: "http://localhost:3003",
timeout: 30000,
retries: 3,
},
production: {
firecrawlUrl: "https://api.firecrawl.dev",
playwrightUrl: "https://playwright.firecrawl.dev",
timeout: 60000,
retries: 5,
},
};
const env = process.env.NODE_ENV || "development";
const currentConfig = config[env];
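The selected config can then drive requests, including a client-side timeout; adjust the production URLs above to your own deployments:
// Use the environment-specific endpoints and timeout
const response = await fetch(`${currentConfig.firecrawlUrl}/v1/scrape`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", formats: ["markdown"] }),
  // Abort the request if it exceeds the configured timeout (Node 18+)
  signal: AbortSignal.timeout(currentConfig.timeout),
});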
Troubleshooting
Common Issues
1. Service Communication Failures
// Health check both services
async function checkServices() {
try {
const firecrawlHealth = await fetch(
"http://localhost:3002/serverHealthCheck"
);
const playwrightHealth = await fetch("http://localhost:3003/health");
console.log("Firecrawl API:", firecrawlHealth.ok ? "OK" : "FAILED");
console.log("Playwright Service:", playwrightHealth.ok ? "OK" : "FAILED");
} catch (error) {
console.error("Service check failed:", error);
}
}
2. Timeout Issues
// Solution: Increase timeouts and add retries
const scrapeWithTimeout = async (url) => {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
timeout: 60000, // 60 seconds
wait_after_load: 5000, // Wait 5 seconds after load
}),
});
return await response.json();
} catch (error) {
console.error("Timeout error:", error);
throw error;
}
};
3. Memory Leaks
The Playwright Service launches a fresh browser per request and disposes of it afterwards, so browser memory is reclaimed server-side. On the client, avoid firing hundreds of parallel requests; process URLs sequentially, as in the scrapeWithCleanup pattern shown under Resource Management above.
4. Anti-Bot Detection
// Solution: Use stealth mode and realistic behavior
const stealthScrape = async (url) => {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
stealth: {
userAgent: "random",
viewport: { width: 1920, height: 1080 },
timezone: "America/New_York",
},
antiDetection: {
removeWebdriver: true,
spoofChrome: true,
randomizeFingerprint: true,
},
}),
});
return await response.json();
};
Debugging Tips
// Enable debugging for both services
const debugScrape = async (url) => {
console.log("Starting scrape for:", url);
try {
// Try Firecrawl first
const firecrawlResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, formats: ["markdown"] }),
});
if (firecrawlResult.ok) {
console.log("Firecrawl success");
return await firecrawlResult.json();
}
} catch (error) {
console.log("Firecrawl failed:", error.message);
}
// Fallback to Playwright
try {
const playwrightResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url }),
    });
    if (!playwrightResult.ok) {
      throw new Error(`Playwright service returned ${playwrightResult.status}`);
    }
    console.log("Playwright success");
    return await playwrightResult.json();
} catch (error) {
console.error("Both services failed:", error);
throw error;
}
};
Use Cases
1. E-commerce Product Monitoring
// Monitor product prices and availability
async function monitorProducts(productUrls) {
const results = [];
for (const url of productUrls) {
try {
// Use Playwright for dynamic e-commerce sites
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 3000,
check_selector: ".product-info",
}),
});
const data = await response.json();
// Extract product data using Firecrawl AI
const extractResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract product name, price, availability, and images",
extractionSchema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "string" },
availability: { type: "string" },
images: { type: "array", items: { type: "string" } },
},
},
}),
});
const extractedData = await extractResponse.json();
results.push({ url, ...extractedData, timestamp: new Date() });
} catch (error) {
console.error(`Failed to monitor ${url}:`, error);
}
}
return results;
}
2. Social Media Content Extraction
// Extract social media posts and interactions
async function extractSocialContent(url) {
try {
// Use Playwright for dynamic social media content
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 5000,
check_selector: '[data-testid="post"]',
stealth: {
userAgent: "random",
viewport: { width: 1920, height: 1080 },
},
}),
});
const data = await response.json();
// Use Firecrawl AI to extract structured data
const extractResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract all posts with author, text, timestamp, and engagement metrics",
extractionSchema: {
type: "object",
properties: {
posts: {
type: "array",
items: {
type: "object",
properties: {
author: { type: "string" },
text: { type: "string" },
timestamp: { type: "string" },
likes: { type: "number" },
shares: { type: "number" },
},
},
},
},
},
}),
});
return await extractResponse.json();
} catch (error) {
console.error("Social content extraction failed:", error);
throw error;
}
}
3. Research and Content Aggregation
// Deep research on topics using both services
async function deepResearch(topic) {
try {
// Use Firecrawl's deep research capabilities
const researchResponse = await fetch(
"http://localhost:3002/v1/deep-research",
{
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
topic,
depth: "comprehensive",
sources: 20,
includeAnalysis: true,
}),
}
);
const researchData = await researchResponse.json();
// Use Playwright for additional dynamic content
const additionalSources = await Promise.all(
researchData.sources.slice(0, 5).map(async (source) => {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: source.url,
wait_after_load: 3000,
}),
});
return await response.json();
} catch (error) {
console.error(`Failed to scrape ${source.url}:`, error);
return null;
}
})
);
return {
research: researchData,
additionalContent: additionalSources.filter(Boolean),
};
} catch (error) {
console.error("Deep research failed:", error);
throw error;
}
}
4. Form Automation and Data Collection
// Automate form submissions and data collection
async function automateFormSubmission(formUrl, formData) {
try {
// Use Playwright for form automation
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: formUrl,
formData,
wait_after_load: 2000,
check_selector: ".success-message, .error-message",
}),
});
const result = await response.json();
// Use Firecrawl to extract the result page content
const extractResponse = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: formUrl,
formats: ["markdown"],
onlyMainContent: true,
}),
});
const extractedContent = await extractResponse.json();
return {
formResult: result,
pageContent: extractedContent,
};
} catch (error) {
console.error("Form automation failed:", error);
throw error;
}
}
5. Competitive Intelligence
// Monitor competitor websites and extract insights
async function competitiveIntelligence(competitorUrls) {
const results = [];
for (const url of competitorUrls) {
try {
// Use both services for comprehensive analysis
const [firecrawlResult, playwrightResult] = await Promise.all([
fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
formats: ["markdown"],
onlyMainContent: true,
}),
}),
fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 3000,
}),
}),
]);
const firecrawlData = await firecrawlResult.json();
const playwrightData = await playwrightResult.json();
// Use AI to extract competitive insights
const insightsResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract key business insights, pricing, features, and competitive advantages",
extractionSchema: {
type: "object",
properties: {
pricing: { type: "string" },
features: { type: "array", items: { type: "string" } },
advantages: { type: "array", items: { type: "string" } },
weaknesses: { type: "array", items: { type: "string" } },
},
},
}),
});
const insights = await insightsResponse.json();
results.push({
url,
firecrawl: firecrawlData,
playwright: playwrightData,
insights,
});
} catch (error) {
console.error(`Failed to analyze ${url}:`, error);
}
}
return results;
}
Resources
Smart Firecrawl API
- Documentation: Comprehensive API documentation with examples
- GitHub: Smart Firecrawl API Repository
- Docker Hub: Pre-built Docker images for easy deployment
- API Reference: Complete endpoint documentation with request/response examples
Smart Playwright Service
- Documentation: Service-specific documentation and configuration
- GitHub: Smart Playwright Service Repository
- Docker Hub: Optimized Docker images with Playwright browsers
- Configuration Guide: Environment variables and deployment options
Integration Resources
- Architecture Diagrams: Visual representations of system components
- Deployment Guides: Step-by-step deployment instructions
- Monitoring Setup: Health checks, logging, and alerting configuration
- Performance Tuning: Optimization guides for production environments
Conclusion
The Smart Firecrawl API and Smart Playwright Service form a comprehensive web scraping ecosystem that combines the best of both worlds:
Key Advantages
- Intelligent Content Extraction: AI-powered extraction with multiple LLM providers
- Advanced Browser Automation: Stealth features and anti-detection capabilities
- Scalable Architecture: Queue-based processing with Redis coordination
- Native Performance: Rust and Go libraries for high-performance operations
- Enterprise Features: Authentication, billing, monitoring, and rate limiting
Use Cases Covered
- E-commerce Monitoring: Product tracking and price monitoring
- Social Media Analysis: Content extraction and sentiment analysis
- Research Automation: Deep research with AI-powered insights
- Form Automation: Complex form submissions and data collection
- Competitive Intelligence: Market analysis and competitor monitoring
Best Practices
- Respectful Scraping: Rate limiting and ethical practices
- Error Handling: Robust retry mechanisms and fallback strategies
- Resource Management: Proper cleanup and memory management
- Security: Authentication, authorization, and data protection
- Monitoring: Health checks, logging, and performance tracking
Getting Started
- Deploy Services: Use Docker Compose for easy setup
- Configure Environment: Set up API keys and database connections
- Test Integration: Verify both services are working correctly
- Implement Use Cases: Start with simple scraping and scale up
- Monitor Performance: Set up logging and alerting
This comprehensive solution provides everything needed for modern web scraping applications, from simple content extraction to complex AI-powered data processing. The combination of Firecrawl's intelligent extraction capabilities and Playwright's advanced browser automation creates a powerful platform for any web scraping needs.
Remember to always respect website terms of service, implement proper error handling, and use reasonable delays to ensure ethical and sustainable scraping practices.