Firecrawl & Playwright - Complete Implementation Guide
A comprehensive guide to the Smart Firecrawl API and Smart Playwright Service. This document centralizes the documentation for both services, covering their architecture, features, and integration patterns.
Table of Contents
- System Architecture
- Smart Firecrawl API
- Smart Playwright Service
- Integration Patterns
- Deployment & Operations
- Advanced Features
- Best Practices
- Troubleshooting
- Use Cases
System Architecture
The Smart Firecrawl API and Smart Playwright Service form a comprehensive web scraping ecosystem with the following architecture:
┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                      │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                     Smart Firecrawl API                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │   Controllers   │  │    Services     │  │    Queue    │  │
│  │     (v0/v1)     │  │    (Billing)    │  │  (BullMQ)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │    Scrapers     │  │   Extractors    │  │   AI/LLM    │  │
│  │  (Playwright)   │  │    (Content)    │  │  (Claude)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                  Smart Playwright Service                    │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │     Browser     │  │     Stealth     │  │  Resource   │  │
│  │   Automation    │  │    Features     │  │  Blocking   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                      External Services                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │      Redis      │  │    Database     │  │   Storage   │  │
│  │     (Queue)     │  │   (Supabase)    │  │    (GCS)    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────────┘
Key Components
- Smart Firecrawl API: Main orchestration service with AI-powered extraction
- Smart Playwright Service: Microservice for complex browser automation
- Queue System: BullMQ with Redis for job management
- AI Integration: Multiple LLM providers (OpenAI, Anthropic, Google, Groq)
- Content Processing: Native libraries for HTML-to-Markdown, PDF parsing
Smart Firecrawl API
The Smart Firecrawl API is a web scraping and crawling service built with Node.js, TypeScript, and Express. It pairs AI-powered content extraction with queue-based processing and enterprise features such as authentication, billing, and rate limiting.
Project Structure
smart-firecrawl-api/
├── src/ # Main source code
│ ├── controllers/ # API controllers
│ │ ├── v0/ # Legacy API endpoints
│ │ └── v1/ # Current API endpoints
│ ├── lib/ # Core libraries and utilities
│ │ ├── extract/ # Content extraction modules
│ │ ├── deep-research/ # AI-powered research features
│ │ └── generate-llmstxt/ # LLM text generation
│ ├── routes/ # Express route definitions
│ ├── scraper/ # Web scraping engines
│ │ ├── scrapeURL/ # URL scraping logic
│ │ └── WebScraper/ # Web scraper implementations
│ ├── services/ # Background services
│ │ ├── billing/ # Credit and billing management
│ │ ├── queue-service.ts # Queue management
│ │ └── redis.ts # Redis configuration
│ ├── search/ # Search functionality
│ └── types/ # TypeScript type definitions
├── sharedLibs/ # Native libraries
│ ├── crawler/ # Rust-based crawler
│ ├── go-html-to-md/ # Go-based HTML to Markdown
│ ├── html-transformer/ # Rust HTML transformer
│ └── pdf-parser/ # Rust PDF parser
├── chrome-extension/ # Browser extension
├── firecrawl-extension/ # Firecrawl extension
├── page-discovery-extension/ # Page discovery extension
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
└── tests/ # Test files
Getting Started
Prerequisites
- Node.js 22+
- pnpm (recommended) or npm
- Redis server
- Docker (for containerized deployment)
Installation
- Clone and install dependencies
git clone <repository-url>
cd smart-firecrawl-api
pnpm install
- Set up environment variables
# Database
DATABASE_URL=your_database_url
REDIS_URL=redis://localhost:6379
# Authentication
JWT_SECRET=your_jwt_secret
# External Services
SENTRY_DSN=your_sentry_dsn
SLACK_WEBHOOK_URL=your_slack_webhook
# AI Services
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
# Storage
GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
- Start services
# Start Redis
docker run -d -p 6379:6379 redis:latest
# Start development server
npm run start:dev
# Start queue workers (separate terminals)
npm run workers
npm run index-worker
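Once Redis, the dev server, and the workers are running, a quick smoke test confirms the API is reachable (the health endpoints are described under Monitoring & Health Checks below):
# Verify the API is up
curl http://localhost:3002/serverHealthCheck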
API Endpoints
Scraping Endpoints
- POST /v1/scrape - Scrape a single URL
- GET /v1/scrape/:jobId - Get scrape job status
Crawling Endpoints
- POST /v1/crawl - Start a website crawl
- GET /v1/crawl/:jobId - Get crawl job status
- DELETE /v1/crawl/:jobId - Cancel a crawl job
Batch Operations
- POST /v1/batch/scrape - Batch scrape multiple URLs
- GET /v1/batch/scrape/:jobId - Get batch job status
AI Features
- POST /v1/extract - Extract structured data using AI
- POST /v1/deep-research - Perform deep research on topics
- POST /v1/llmstxt - Generate LLM-optimized text
Search & Mapping
- POST /v1/search - Search the web
- POST /v1/map - Map website structure
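Neither search nor map is demonstrated later in this guide, so here is a minimal sketch of a map request; any body fields beyond url are assumptions to verify against your deployment:
// Minimal sketch: map a site's structure (body shape beyond "url" is an assumption)
const mapResponse = await fetch("http://localhost:3002/v1/map", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer your-api-key",
  },
  body: JSON.stringify({ url: "https://example.com" }),
});
const siteMap = await mapResponse.json();
console.log(siteMap);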
Core Features
- Multi-Engine Scraping: Supports various scraping engines, including Playwright, Puppeteer, and custom implementations
- AI-Powered Extraction:
  - Structured data extraction using LLMs
  - Content summarization and analysis
  - Deep research capabilities
- Queue-Based Processing (see the sketch after this list):
  - BullMQ for job management
  - Redis for caching and coordination
  - Priority-based job processing
- Content Processing:
  - HTML to Markdown conversion
  - PDF parsing and extraction
  - Image and media handling
- Rate Limiting & Authentication:
  - Team-based authentication
  - Credit-based billing system
  - Rate limiting per endpoint
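To make priority-based processing concrete, here is a minimal BullMQ sketch; the queue and job names are illustrative, not the API's internal names:
// Illustrative only: priority-based job processing with BullMQ
const { Queue } = require("bullmq");

const queue = new Queue("scrape", {
  connection: { host: "localhost", port: 6379 },
});

// Lower numbers run first, so urgent jobs jump ahead of bulk work
await queue.add("scrape-url", { url: "https://example.com" }, { priority: 1 });
await queue.add("scrape-url", { url: "https://example.org" }, { priority: 10 });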
Basic Usage Examples
Scrape Single URL
// Using the API directly
const response = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
formats: ["markdown", "html"],
onlyMainContent: true,
}),
});
const result = await response.json();
console.log(result.data.markdown);
Crawl Website
// Start a crawl job
const crawlResponse = await fetch("http://localhost:3002/v1/crawl", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
crawlerOptions: {
includes: ["/blog/*", "/articles/*"],
excludes: ["/admin/*", "/login"],
limit: 10,
},
pageOptions: {
formats: ["markdown"],
onlyMainContent: true,
},
}),
});
const crawlJob = await crawlResponse.json();
console.log("Crawl job ID:", crawlJob.jobId);
Smart Playwright Service
The Smart Playwright Service is a lightweight, high-performance web scraping microservice built with Node.js, TypeScript, and Playwright. It handles advanced browser automation with stealth features and performance optimizations suited to production environments.
Project Structure
smart-playwright-service/
├── api.ts # Main API server implementation
├── helpers/ # Utility functions
│ └── get_error.ts # HTTP error handling
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
├── package.json # Project configuration
├── tsconfig.json # TypeScript configuration
├── Dockerfile # Docker container setup
├── .dockerignore # Docker ignore patterns
├── .gitignore # Git ignore patterns
└── README.md # Project documentation
Getting Started
Prerequisites
- Node.js 18+
- npm or pnpm
- Docker (for containerized deployment)
Installation
- Clone and install dependencies
git clone <repository-url>
cd smart-playwright-service
npm install
npx playwright install
- Set up environment variables
# Server Configuration
PORT=3003
# Proxy Configuration (Optional)
PROXY_SERVER=your_proxy_server
PROXY_USERNAME=your_proxy_username
PROXY_PASSWORD=your_proxy_password
# Performance Settings
BLOCK_MEDIA=true
- Start the service
# Development
npm run dev
# Production
npm run build
npm start
API Endpoints
Health Check
- GET /health - Service health status
Web Scraping
- POST /scrape - Scrape a single URL with advanced options
Scraping Request Format
{
"url": "https://example.com",
"wait_after_load": 1000,
"timeout": 15000,
"headers": {
"Custom-Header": "value",
"User-Agent": "Mozilla/5.0...",
"Cookie": "session=abc123; user=john"
},
"check_selector": "#content"
}
Response Format
{
"content": "<html>...</html>",
"pageStatusCode": 200,
"contentType": "text/html",
"pageError": null
}
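A client should check pageStatusCode and pageError before using content. A minimal consumer of the documented response shape:
// Minimal client-side handling of the documented response shape
const res = await fetch("http://localhost:3003/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", timeout: 15000 }),
});
const { content, pageStatusCode, pageError } = await res.json();
if (pageError || pageStatusCode >= 400) {
  throw new Error(`Scrape failed (${pageStatusCode}): ${pageError}`);
}
console.log(content.slice(0, 200)); // raw HTML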
Key Features
- Advanced Browser Automation (see the per-request sketch after this list):
  - Playwright-based browser automation
  - Stealth mode to avoid detection
  - Random user-agent rotation
  - Custom viewport and device simulation
- Smart Request Blocking:
  - Blocks ad-serving domains automatically
  - Optional media file blocking for performance
  - Configurable request filtering
- Cookie Management:
  - Automatic cookie parsing from headers
  - Support for prefixed secure cookies (__Secure-, __Host-)
  - Domain-specific cookie handling
  - Google services cookie optimization
- Proxy Support:
  - HTTP/HTTPS proxy configuration
  - Authentication support
  - Environment-based proxy settings
- Performance Optimizations:
  - Fresh browser instance per request
  - Resource blocking for faster loading
  - Configurable timeouts and waits
  - Memory-efficient cleanup
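To make the per-request lifecycle concrete, here is a minimal sketch of what the service does for each /scrape call: launch a fresh browser, apply headers and optional media blocking, navigate, honor wait_after_load and check_selector, and clean up. This is an illustration of the documented behavior, not the actual source:
// Illustrative sketch of the per-request flow (names and structure are assumptions)
const { chromium } = require("playwright");

async function scrapeOnce({ url, headers = {}, wait_after_load = 0, timeout = 15000, check_selector }) {
  // Fresh browser instance per request keeps memory usage predictable
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({ extraHTTPHeaders: headers });
    const page = await context.newPage();

    // Optional media blocking for faster loads (BLOCK_MEDIA=true)
    if (process.env.BLOCK_MEDIA === "true") {
      await page.route("**/*.{png,jpg,jpeg,gif,mp4,webm,woff,woff2}", (route) =>
        route.abort()
      );
    }

    const response = await page.goto(url, { timeout, waitUntil: "load" });
    if (wait_after_load > 0) await page.waitForTimeout(wait_after_load);
    if (check_selector) await page.waitForSelector(check_selector, { timeout });

    return {
      content: await page.content(),
      pageStatusCode: response ? response.status() : null,
      contentType: response ? response.headers()["content-type"] : null,
      pageError: null,
    };
  } finally {
    await browser.close(); // memory-efficient cleanup
  }
}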
Usage Examples
Basic Scraping
curl -X POST http://localhost:3003/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"timeout": 15000
}'
Advanced Scraping with Headers
curl -X POST http://localhost:3003/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"wait_after_load": 2000,
"timeout": 30000,
"headers": {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Cookie": "session=abc123; user=john"
},
"check_selector": "#main-content"
}'
Integration Patterns
Service Communication
The Smart Firecrawl API and Smart Playwright Service work together through a microservice architecture:
// Firecrawl API configuration for Playwright integration
const config = {
playwrightService: {
url:
process.env.PLAYWRIGHT_MICROSERVICE_URL || "http://localhost:3003/scrape",
timeout: 30000,
retries: 3,
},
};
Hybrid Scraping Strategy
async function intelligentScrape(url, options = {}) {
// 1. Try Firecrawl first for simple content
try {
const firecrawlResult = await firecrawl.scrapeUrl(url, {
formats: ["markdown"],
onlyMainContent: true,
...options,
});
if (firecrawlResult.success) {
return {
method: "firecrawl",
data: firecrawlResult.data,
};
}
} catch (error) {
console.log("Firecrawl failed, trying Playwright...");
}
// 2. Fallback to Playwright for complex content
try {
const playwrightResult = await fetch(config.playwrightService.url, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 2000,
timeout: 30000,
...options,
}),
});
const data = await playwrightResult.json();
return {
method: "playwright",
data: {
content: data.content,
statusCode: data.pageStatusCode,
},
};
} catch (error) {
throw new Error(`Both scraping methods failed: ${error.message}`);
}
}
Queue-Based Processing
// Firecrawl API queue integration (BullMQ)
const { Queue, Worker } = require("bullmq");

const connection = { host: "localhost", port: 6379 };
const scrapeQueue = new Queue("scrape", { connection });

// Add a job to the queue
const job = await scrapeQueue.add("scrape-url", {
  url: "https://example.com",
  options: {
    usePlaywright: true,
    formats: ["markdown"],
  },
});

// Process jobs
const worker = new Worker(
  "scrape",
  async (job) => {
    const { url, options } = job.data;
    if (options.usePlaywright) {
      // Route dynamic pages through the Playwright service
      return await callPlaywrightService(url, options);
    }
    // Use Firecrawl directly for static content
    return await firecrawl.scrapeUrl(url, options);
  },
  { connection }
);
Deployment & Operations
Docker Deployment
Smart Firecrawl API
# Dockerfile for Smart Firecrawl API
FROM node:22-alpine
WORKDIR /app
COPY package*.json ./
# Install all dependencies; the TypeScript build needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies from the final image
RUN npm prune --omit=dev
EXPOSE 3002
CMD ["npm", "start"]
# Build and run
docker build -t smart-firecrawl-api .
docker run -p 3002:3002 \
-e REDIS_URL=redis://host.docker.internal:6379 \
-e DATABASE_URL=your_database_url \
smart-firecrawl-api
Smart Playwright Service
# Dockerfile for Smart Playwright Service
FROM mcr.microsoft.com/playwright:v1.40.0-focal
WORKDIR /app
COPY package*.json ./
# Install all dependencies; the TypeScript build needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies from the final image
RUN npm prune --omit=dev
EXPOSE 3003
CMD ["npm", "start"]
Docker Compose Setup
version: "3.8"
services:
firecrawl-api:
build: ./smart-firecrawl-api
ports:
- "3002:3002"
environment:
- REDIS_URL=redis://redis:6379
- DATABASE_URL=your_database_url
- PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3003/scrape
depends_on:
- redis
- playwright-service
playwright-service:
build: ./smart-playwright-service
ports:
- "3003:3003"
environment:
- BLOCK_MEDIA=true
volumes:
- /tmp/.cache:/tmp/.cache
redis:
image: redis:latest
ports:
- "6379:6379"
Environment Configuration
Firecrawl API Environment
# Production Environment
IS_PRODUCTION=true
ENV=production
PORT=3002
HOST=0.0.0.0
# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/firecrawl
REDIS_URL=redis://localhost:6379
# AI Services
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GOOGLE_API_KEY=your_google_key
GROQ_API_KEY=your_groq_key
# Storage
GCS_FIRE_ENGINE_BUCKET_NAME=your_bucket_name
# Monitoring
SENTRY_DSN=your_sentry_dsn
SLACK_WEBHOOK_URL=your_slack_webhook
Playwright Service Environment
# Production Environment
NODE_ENV=production
PORT=3003
# Proxy Configuration
PROXY_SERVER=your_proxy_server
PROXY_USERNAME=your_proxy_username
PROXY_PASSWORD=your_proxy_password
# Performance
BLOCK_MEDIA=true
Monitoring & Health Checks
Firecrawl API Health Endpoints
# Server health
curl http://localhost:3002/serverHealthCheck
# Production status
curl http://localhost:3002/is-production
# Queue monitoring (with auth)
curl http://localhost:3002/admin/{BULL_AUTH_KEY}/queues
Playwright Service Health
# Service health
curl http://localhost:3003/health
Scaling Considerations
Horizontal Scaling
# Docker Compose with multiple instances
services:
  firecrawl-api-1:
    build: ./smart-firecrawl-api
    ports: ["3002:3002"]
  firecrawl-api-2:
    build: ./smart-firecrawl-api
    ports: ["3003:3002"]
  playwright-service-1:
    build: ./smart-playwright-service
    ports: ["3004:3003"]
  playwright-service-2:
    build: ./smart-playwright-service
    ports: ["3005:3003"]
Load Balancing
# Nginx configuration
upstream firecrawl_api {
server localhost:3002;
server localhost:3003;
}
upstream playwright_service {
server localhost:3004;
server localhost:3005;
}
server {
listen 80;
location /api/ {
proxy_pass http://firecrawl_api;
}
location /scrape/ {
proxy_pass http://playwright_service;
}
}
Advanced Features
AI-Powered Content Extraction
Custom LLM Extraction
// Firecrawl API - Custom extraction with multiple LLM providers
const extractResult = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
extractionPrompt:
"Extract all product information including name, price, and description",
extractionSchema: {
type: "object",
properties: {
products: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "string" },
description: { type: "string" },
},
},
},
},
},
llmProvider: "anthropic", // or 'openai', 'google', 'groq'
}),
});
Deep Research Capabilities
// Firecrawl API - Deep research on topics
const researchResult = await fetch("http://localhost:3002/v1/deep-research", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
topic: "AI tools for developers",
depth: "comprehensive",
sources: 10,
includeAnalysis: true,
}),
});
Native Library Integration
Rust-based Crawler
// Using native Rust crawler for high-performance crawling
const crawlResult = await fetch("http://localhost:3002/v1/crawl", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
crawlerOptions: {
engine: "rust", // Use Rust-based crawler
maxConcurrency: 10,
respectRobotsTxt: true,
delayBetweenRequests: 1000,
},
}),
});
Go HTML-to-Markdown
// High-performance HTML to Markdown conversion
const markdownResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: "Bearer your-api-key",
},
body: JSON.stringify({
url: "https://example.com",
formats: ["markdown"],
converter: "go", // Use Go-based converter
}),
});
Stealth and Anti-Detection
Playwright Service Stealth Features
// Advanced stealth configuration
const stealthResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: "https://example.com",
stealth: {
userAgent: "random", // Random user agent rotation
viewport: { width: 1920, height: 1080 },
timezone: "America/New_York",
locale: "en-US",
geolocation: { latitude: 40.7128, longitude: -74.006 },
},
antiDetection: {
removeWebdriver: true,
spoofChrome: true,
randomizeFingerprint: true,
},
}),
});
Performance Optimizations
Resource Blocking
// Block unnecessary resources for faster scraping
const optimizedResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: "https://example.com",
blockResources: {
images: true,
stylesheets: false,
fonts: true,
media: true,
ads: true,
analytics: true,
},
customBlockList: [
"*.google-analytics.com",
"*.facebook.com",
"*.doubleclick.net",
],
}),
});
Best Practices
Error Handling and Retries
// Robust error handling with exponential backoff
async function robustScrape(url, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
// Try Firecrawl first
const firecrawlResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, formats: ["markdown"] }),
});
if (firecrawlResult.ok) {
return await firecrawlResult.json();
}
} catch (error) {
console.log(`Firecrawl attempt ${attempt} failed:`, error.message);
}
// Fallback to Playwright
try {
const playwrightResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, timeout: 30000 }),
});
if (playwrightResult.ok) {
return await playwrightResult.json();
}
} catch (error) {
console.log(`Playwright attempt ${attempt} failed:`, error.message);
}
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries} attempts`);
}
// Exponential backoff
await new Promise((resolve) =>
setTimeout(resolve, Math.pow(2, attempt) * 1000)
);
}
}
Rate Limiting and Respectful Scraping
class RespectfulScraper {
  constructor(delayMs = 1000) {
    this.delayMs = delayMs;
    this.lastRequest = 0;
  }

  async scrape(url) {
    // Respect rate limits: wait until delayMs has passed since the last request
    const timeSinceLastRequest = Date.now() - this.lastRequest;
    if (timeSinceLastRequest < this.delayMs) {
      await new Promise((resolve) =>
        setTimeout(resolve, this.delayMs - timeSinceLastRequest)
      );
    }
    this.lastRequest = Date.now();
    return await this.performScrape(url);
  }

  // Missing from the original snippet: delegate to the Playwright Service
  // (illustrative; swap in whichever scraping call you actually use)
  async performScrape(url) {
    const response = await fetch("http://localhost:3003/scrape", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url }),
    });
    return await response.json();
  }
}
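Usage is straightforward; every call is automatically spaced out:
// Scrape two pages at least 1.5 seconds apart
const scraper = new RespectfulScraper(1500);
const first = await scraper.scrape("https://example.com");
const second = await scraper.scrape("https://example.org");
console.log(first.pageStatusCode, second.pageStatusCode);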
Resource Management
// Sequential batch scraping; the Playwright Service disposes of its browser
// after each request, so no client-side cleanup is required
async function scrapeWithCleanup(urls) {
const results = [];
for (const url of urls) {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url }),
});
const data = await response.json();
results.push({ url, data });
} catch (error) {
console.error(`Failed to scrape ${url}:`, error);
}
}
return results;
}
Configuration Management
// Environment-based configuration
const config = {
development: {
firecrawlUrl: "http://localhost:3002",
playwrightUrl: "http://localhost:3003",
timeout: 30000,
retries: 3,
},
production: {
firecrawlUrl: "https://api.firecrawl.dev",
playwrightUrl: "https://playwright.firecrawl.dev",
timeout: 60000,
retries: 5,
},
};
const env = process.env.NODE_ENV || "development";
const currentConfig = config[env];
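The selected config can then drive requests, including a client-side timeout; adjust the production URLs above to your own deployments:
// Use the environment-specific endpoints and timeout
const response = await fetch(`${currentConfig.firecrawlUrl}/v1/scrape`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://example.com", formats: ["markdown"] }),
  // Abort the request if it exceeds the configured timeout (Node 18+)
  signal: AbortSignal.timeout(currentConfig.timeout),
});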
Troubleshooting
Common Issues
1. Service Communication Failures
// Health check both services
async function checkServices() {
try {
const firecrawlHealth = await fetch(
"http://localhost:3002/serverHealthCheck"
);
const playwrightHealth = await fetch("http://localhost:3003/health");
console.log("Firecrawl API:", firecrawlHealth.ok ? "OK" : "FAILED");
console.log("Playwright Service:", playwrightHealth.ok ? "OK" : "FAILED");
} catch (error) {
console.error("Service check failed:", error);
}
}
2. Timeout Issues
// Solution: Increase timeouts and add retries
const scrapeWithTimeout = async (url) => {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
timeout: 60000, // 60 seconds
wait_after_load: 5000, // Wait 5 seconds after load
}),
});
return await response.json();
} catch (error) {
console.error("Timeout error:", error);
throw error;
}
};
3. Memory Leaks
The Playwright Service launches a fresh browser per request and disposes of it afterwards, so browser memory is reclaimed server-side. On the client, avoid firing hundreds of parallel requests; process URLs sequentially, as in the scrapeWithCleanup pattern shown under Resource Management above.
4. Anti-Bot Detection
// Solution: Use stealth mode and realistic behavior
const stealthScrape = async (url) => {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
stealth: {
userAgent: "random",
viewport: { width: 1920, height: 1080 },
timezone: "America/New_York",
},
antiDetection: {
removeWebdriver: true,
spoofChrome: true,
randomizeFingerprint: true,
},
}),
});
return await response.json();
};
Debugging Tips
// Enable debugging for both services
const debugScrape = async (url) => {
console.log("Starting scrape for:", url);
try {
// Try Firecrawl first
const firecrawlResult = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url, formats: ["markdown"] }),
});
if (firecrawlResult.ok) {
console.log("Firecrawl success");
return await firecrawlResult.json();
}
} catch (error) {
console.log("Firecrawl failed:", error.message);
}
// Fallback to Playwright
try {
const playwrightResult = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url }),
    });
    if (!playwrightResult.ok) {
      throw new Error(`Playwright service returned ${playwrightResult.status}`);
    }
    console.log("Playwright success");
    return await playwrightResult.json();
} catch (error) {
console.error("Both services failed:", error);
throw error;
}
};
Use Cases
1. E-commerce Product Monitoring
// Monitor product prices and availability
async function monitorProducts(productUrls) {
const results = [];
for (const url of productUrls) {
try {
// Use Playwright for dynamic e-commerce sites
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 3000,
check_selector: ".product-info",
}),
});
const data = await response.json();
// Extract product data using Firecrawl AI
const extractResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract product name, price, availability, and images",
extractionSchema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "string" },
availability: { type: "string" },
images: { type: "array", items: { type: "string" } },
},
},
}),
});
const extractedData = await extractResponse.json();
results.push({ url, ...extractedData, timestamp: new Date() });
} catch (error) {
console.error(`Failed to monitor ${url}:`, error);
}
}
return results;
}
2. Social Media Content Extraction
// Extract social media posts and interactions
async function extractSocialContent(url) {
try {
// Use Playwright for dynamic social media content
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 5000,
check_selector: '[data-testid="post"]',
stealth: {
userAgent: "random",
viewport: { width: 1920, height: 1080 },
},
}),
});
const data = await response.json();
// Use Firecrawl AI to extract structured data
const extractResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract all posts with author, text, timestamp, and engagement metrics",
extractionSchema: {
type: "object",
properties: {
posts: {
type: "array",
items: {
type: "object",
properties: {
author: { type: "string" },
text: { type: "string" },
timestamp: { type: "string" },
likes: { type: "number" },
shares: { type: "number" },
},
},
},
},
},
}),
});
return await extractResponse.json();
} catch (error) {
console.error("Social content extraction failed:", error);
throw error;
}
}
3. Research and Content Aggregation
// Deep research on topics using both services
async function deepResearch(topic) {
try {
// Use Firecrawl's deep research capabilities
const researchResponse = await fetch(
"http://localhost:3002/v1/deep-research",
{
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
topic,
depth: "comprehensive",
sources: 20,
includeAnalysis: true,
}),
}
);
const researchData = await researchResponse.json();
// Use Playwright for additional dynamic content
const additionalSources = await Promise.all(
researchData.sources.slice(0, 5).map(async (source) => {
try {
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: source.url,
wait_after_load: 3000,
}),
});
return await response.json();
} catch (error) {
console.error(`Failed to scrape ${source.url}:`, error);
return null;
}
})
);
return {
research: researchData,
additionalContent: additionalSources.filter(Boolean),
};
} catch (error) {
console.error("Deep research failed:", error);
throw error;
}
}
4. Form Automation and Data Collection
// Automate form submissions and data collection
async function automateFormSubmission(formUrl, formData) {
try {
// Use Playwright for form automation
const response = await fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: formUrl,
formData,
wait_after_load: 2000,
check_selector: ".success-message, .error-message",
}),
});
const result = await response.json();
// Use Firecrawl to extract the result page content
const extractResponse = await fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url: formUrl,
formats: ["markdown"],
onlyMainContent: true,
}),
});
const extractedContent = await extractResponse.json();
return {
formResult: result,
pageContent: extractedContent,
};
} catch (error) {
console.error("Form automation failed:", error);
throw error;
}
}
5. Competitive Intelligence
// Monitor competitor websites and extract insights
async function competitiveIntelligence(competitorUrls) {
const results = [];
for (const url of competitorUrls) {
try {
// Use both services for comprehensive analysis
const [firecrawlResult, playwrightResult] = await Promise.all([
fetch("http://localhost:3002/v1/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
formats: ["markdown"],
onlyMainContent: true,
}),
}),
fetch("http://localhost:3003/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
wait_after_load: 3000,
}),
}),
]);
const firecrawlData = await firecrawlResult.json();
const playwrightData = await playwrightResult.json();
// Use AI to extract competitive insights
const insightsResponse = await fetch("http://localhost:3002/v1/extract", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
url,
extractionPrompt:
"Extract key business insights, pricing, features, and competitive advantages",
extractionSchema: {
type: "object",
properties: {
pricing: { type: "string" },
features: { type: "array", items: { type: "string" } },
advantages: { type: "array", items: { type: "string" } },
weaknesses: { type: "array", items: { type: "string" } },
},
},
}),
});
const insights = await insightsResponse.json();
results.push({
url,
firecrawl: firecrawlData,
playwright: playwrightData,
insights,
});
} catch (error) {
console.error(`Failed to analyze ${url}:`, error);
}
}
return results;
}
Resources
Smart Firecrawl API
- Documentation: Comprehensive API documentation with examples
- GitHub: Smart Firecrawl API Repository
- Docker Hub: Pre-built Docker images for easy deployment
- API Reference: Complete endpoint documentation with request/response examples
Smart Playwright Service
- Documentation: Service-specific documentation and configuration
- GitHub: Smart Playwright Service Repository
- Docker Hub: Optimized Docker images with Playwright browsers
- Configuration Guide: Environment variables and deployment options
Integration Resources
- Architecture Diagrams: Visual representations of system components
- Deployment Guides: Step-by-step deployment instructions
- Monitoring Setup: Health checks, logging, and alerting configuration
- Performance Tuning: Optimization guides for production environments
Conclusion
The Smart Firecrawl API and Smart Playwright Service form a comprehensive web scraping ecosystem that combines the best of both worlds:
Key Advantages
- Intelligent Content Extraction: AI-powered extraction with multiple LLM providers
- Advanced Browser Automation: Stealth features and anti-detection capabilities
- Scalable Architecture: Queue-based processing with Redis coordination
- Native Performance: Rust and Go libraries for high-performance operations
- Enterprise Features: Authentication, billing, monitoring, and rate limiting
Use Cases Covered
- E-commerce Monitoring: Product tracking and price monitoring
- Social Media Analysis: Content extraction and sentiment analysis
- Research Automation: Deep research with AI-powered insights
- Form Automation: Complex form submissions and data collection
- Competitive Intelligence: Market analysis and competitor monitoring
Best Practices
- Respectful Scraping: Rate limiting and ethical practices
- Error Handling: Robust retry mechanisms and fallback strategies
- Resource Management: Proper cleanup and memory management
- Security: Authentication, authorization, and data protection
- Monitoring: Health checks, logging, and performance tracking
Getting Started
- Deploy Services: Use Docker Compose for easy setup
- Configure Environment: Set up API keys and database connections
- Test Integration: Verify both services are working correctly
- Implement Use Cases: Start with simple scraping and scale up
- Monitor Performance: Set up logging and alerting
This comprehensive solution provides everything needed for modern web scraping applications, from simple content extraction to complex AI-powered data processing. The combination of Firecrawl's intelligent extraction capabilities and Playwright's advanced browser automation creates a powerful platform for any web scraping needs.
Remember to always respect website terms of service, implement proper error handling, and use reasonable delays to ensure ethical and sustainable scraping practices.