Smart Playwright Service

A lightweight, high-performance web scraping microservice built with Node.js, TypeScript, and Playwright. It combines browser automation, stealth features, and performance optimizations suited to production environments.

🏗️ Project Structure

smart-playwright-service/
├── api.ts # Main API server implementation
├── helpers/ # Utility functions
│ └── get_error.ts # HTTP error handling
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
├── package.json # Project configuration
├── tsconfig.json # TypeScript configuration
├── Dockerfile # Docker container setup
├── .dockerignore # Docker ignore patterns
├── .gitignore # Git ignore patterns
└── README.md # Project documentation

🚀 Getting Started

Prerequisites

  • Node.js 18+
  • npm or pnpm
  • Docker (for containerized deployment)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd smart-playwright-service
  2. Install dependencies

    npm install
    # or
    pnpm install
  3. Install Playwright browsers

    npx playwright install
  4. Set up environment variables: create a .env file with the following variables:

    # Server Configuration
    PORT=3003

    # Proxy Configuration (Optional)
    PROXY_SERVER=your_proxy_server
    PROXY_USERNAME=your_proxy_username
    PROXY_PASSWORD=your_proxy_password

    # Performance Settings
    BLOCK_MEDIA=true

Development

  1. Start the development server

    npm run dev
    # or
    pnpm dev
  2. Build and start production server

    npm run build
    npm start

Production

  1. Build the project

    npm run build
  2. Start production server

    npm start

🔧 Core Functionality

API Endpoints

Health Check

  • GET /health - Service health status

Web Scraping

  • POST /scrape - Scrape a single URL with advanced options

Scraping Request Format

{
  "url": "https://example.com",
  "wait_after_load": 1000,
  "timeout": 15000,
  "headers": {
    "Custom-Header": "value",
    "User-Agent": "Mozilla/5.0...",
    "Cookie": "session=abc123; user=john"
  },
  "check_selector": "#content"
}

Response Format

{
  "content": "<html>...</html>",
  "pageStatusCode": 200,
  "contentType": "text/html",
  "pageError": null
}

Key Features

  1. Advanced Browser Automation:

    • Playwright-based browser automation
    • Stealth mode to avoid detection
    • Random user-agent rotation
    • Custom viewport and device simulation
  2. Smart Request Blocking:

    • Blocks ad-serving domains automatically
    • Optional media file blocking for performance
    • Configurable request filtering
  3. Cookie Management:

    • Automatic cookie parsing from headers
    • Support for secure cookie prefixes (__Secure-, __Host-)
    • Domain-specific cookie handling
    • Google services cookie optimization
  4. Proxy Support:

    • HTTP/HTTPS proxy configuration
    • Authentication support
    • Environment-based proxy settings
  5. Performance Optimizations:

    • Fresh browser instance per request
    • Resource blocking for faster loading
    • Configurable timeouts and waits
    • Memory-efficient cleanup
  6. Error Handling:

    • Comprehensive HTTP status code mapping
    • Detailed error messages
    • Graceful failure handling
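
As a sketch of the cookie handling described in feature 3, a simplified parser that honors the __Secure- and __Host- prefixes might look like the following; the service's actual implementation may differ.

```typescript
// A Playwright-style cookie object (subset of Playwright's Cookie fields).
interface ParsedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  secure: boolean;
}

// Parse a raw Cookie header ("session=abc123; user=john") into cookie objects
// for a given domain. Cookies with the __Secure- or __Host- prefix must be
// sent over HTTPS, so they are marked secure; __Host- cookies are additionally
// expected to use path "/" on the exact host.
function parseCookieHeader(header: string, domain: string): ParsedCookie[] {
  return header
    .split(";")
    .map((pair) => pair.trim())
    .filter((pair) => pair.includes("="))
    .map((pair) => {
      const eq = pair.indexOf("=");
      const name = pair.slice(0, eq).trim();
      const value = pair.slice(eq + 1).trim();
      const secure = name.startsWith("__Secure-") || name.startsWith("__Host-");
      return { name, value, domain, path: "/", secure };
    });
}
```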

📚 Libraries Used

Core Dependencies

  • Express.js - Web framework
  • TypeScript - Type-safe JavaScript
  • Playwright - Browser automation
  • Body-parser - Request parsing
  • Dotenv - Environment variable management

Utility Libraries

  • User-agents - Random user-agent generation
  • Node.js - Runtime environment

Development Dependencies

  • @types/express - Express type definitions
  • @types/node - Node.js type definitions
  • @types/body-parser - Body-parser type definitions
  • @types/user-agents - User-agents type definitions
  • ts-node - TypeScript execution
  • TypeScript - TypeScript compiler

🐳 Docker Deployment

Build and Run

# Build the Docker image
docker build -t smart-playwright-service .

# Run the container
docker run -p 3003:3003 \
  -e PROXY_SERVER=your_proxy_server \
  -e BLOCK_MEDIA=true \
  smart-playwright-service

Docker Compose

version: "3.8"
services:
  playwright-service:
    build: .
    ports:
      - "3003:3003"
    environment:
      - PROXY_SERVER=your_proxy_server
      - BLOCK_MEDIA=true
    volumes:
      - /tmp/.cache:/tmp/.cache

Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| PORT | Server port | 3003 | No |
| PROXY_SERVER | Proxy server URL | null | No |
| PROXY_USERNAME | Proxy username | null | No |
| PROXY_PASSWORD | Proxy password | null | No |
| BLOCK_MEDIA | Block media files | false | No |
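
The table above maps directly onto a typed config loader. A minimal sketch (the `loadConfig` helper is illustrative, with defaults taken from the table):

```typescript
// Typed service configuration with the defaults from the table above.
interface ServiceConfig {
  port: number;
  proxyServer: string | null;
  proxyUsername: string | null;
  proxyPassword: string | null;
  blockMedia: boolean;
}

// Read configuration from an environment map (e.g. process.env),
// falling back to the documented defaults when a variable is unset.
function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  return {
    port: env.PORT ? Number(env.PORT) : 3003,
    proxyServer: env.PROXY_SERVER ?? null,
    proxyUsername: env.PROXY_USERNAME ?? null,
    proxyPassword: env.PROXY_PASSWORD ?? null,
    blockMedia: env.BLOCK_MEDIA === "true", // anything else means false
  };
}
```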

🔍 Usage Examples

Basic Scraping

curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "timeout": 15000
  }'

Advanced Scraping with Headers

curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_after_load": 2000,
    "timeout": 30000,
    "headers": {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Cookie": "session=abc123; user=john"
    },
    "check_selector": "#main-content"
  }'

Health Check

curl http://localhost:3003/health

🚀 Integration with Firecrawl

To integrate this service with the main Firecrawl API:

  1. Set environment variable in your main API:

    PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape
  2. The service will be automatically used for scraping operations that require browser automation.

🔧 Configuration Options

Request Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string | Target URL to scrape | Required |
| wait_after_load | number | Wait time after page load (ms) | 0 |
| timeout | number | Navigation timeout (ms) | 15000 |
| headers | object | Custom HTTP headers | |
| check_selector | string | CSS selector to wait for | null |
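
Applying the defaults from this table on the server side can be sketched as follows; the `withDefaults` helper is hypothetical, and an empty object for `headers` is an assumed default since the table leaves that cell blank.

```typescript
// Request parameters after defaults from the table above have been applied.
interface ResolvedParams {
  url: string;
  wait_after_load: number;
  timeout: number;
  headers: Record<string, string>;
  check_selector: string | null;
}

// Validate an incoming request body and fill in the documented defaults.
// `url` is the only required field; everything else falls back.
function withDefaults(body: Record<string, unknown>): ResolvedParams {
  if (typeof body.url !== "string") throw new Error("url is required");
  return {
    url: body.url,
    wait_after_load: typeof body.wait_after_load === "number" ? body.wait_after_load : 0,
    timeout: typeof body.timeout === "number" ? body.timeout : 15000,
    headers: (body.headers as Record<string, string> | undefined) ?? {},
    check_selector: typeof body.check_selector === "string" ? body.check_selector : null,
  };
}
```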

Browser Configuration

  • Headless Mode: Always enabled for performance
  • User Agent: Random rotation with realistic strings
  • Viewport: 1920x1080 (configurable)
  • JavaScript: Enabled
  • Images: Blocked when BLOCK_MEDIA=true (off by default)
  • Media: Blocked when BLOCK_MEDIA=true (off by default)
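
The user-agent rotation and context settings above can be sketched as plain option objects; a hypothetical two-entry pool stands in here for the `user-agents` package the service actually uses.

```typescript
// Illustrative pool of realistic user-agent strings; the service draws
// from the much larger `user-agents` package instead.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

function randomUserAgent(): string {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Browser-context options matching the configuration above, shaped like
// Playwright's browser.newContext() options but shown as a plain object.
function contextOptions() {
  return {
    viewport: { width: 1920, height: 1080 },
    javaScriptEnabled: true,
    userAgent: randomUserAgent(),
  };
}
```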

Blocked Domains

The service automatically blocks requests to known ad-serving domains:

  • Google Analytics
  • Google Tag Manager
  • DoubleClick
  • Facebook tracking
  • Amazon ads
  • And many more...
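
A request filter against such a list can be sketched as below; the host list here is an illustrative subset, not the service's real list.

```typescript
// Illustrative subset of blocked ad/tracking hosts.
const BLOCKED_HOSTS = [
  "google-analytics.com",
  "googletagmanager.com",
  "doubleclick.net",
  "facebook.net",
  "amazon-adsystem.com",
];

// True if the request URL's hostname is a blocked host or a subdomain of one.
function isBlockedRequest(url: string): boolean {
  const host = new URL(url).hostname;
  return BLOCKED_HOSTS.some((b) => host === b || host.endsWith(`.${b}`));
}
```

With Playwright, this predicate would typically be wired into request interception, e.g. `await page.route("**/*", (route) => isBlockedRequest(route.request().url()) ? route.abort() : route.continue());`.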

📊 Performance Features

Memory Management

  • Fresh browser instance per request
  • Automatic cleanup after each scrape
  • Resource blocking to reduce memory usage

Speed Optimizations

  • Media file blocking
  • Ad domain blocking
  • Minimal resource loading
  • Efficient cookie handling

Reliability

  • Comprehensive error handling
  • Timeout management
  • Graceful failure recovery
  • Health check monitoring

🔍 Development Tools

Available Scripts

  • npm run dev - Start development server
  • npm run build - Build for production
  • npm start - Start production server

TypeScript Configuration

  • Strict type checking enabled
  • ES2016 target
  • CommonJS modules
  • Source maps for debugging

🚀 Deployment

Production Considerations

  1. Resource Management:

    • Monitor memory usage
    • Set appropriate timeouts
    • Configure proxy if needed
  2. Security:

    • Use HTTPS in production
    • Implement rate limiting
    • Secure proxy credentials
  3. Monitoring:

    • Health check endpoint
    • Log monitoring
    • Error tracking

Scaling

  • Horizontal Scaling: Deploy multiple instances
  • Load Balancing: Use reverse proxy
  • Resource Limits: Set memory and CPU limits

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is licensed under the ISC License.

🆘 Support

For support and questions:

  • Create an issue in the repository
  • Check the documentation for common issues
  • Review the health check endpoint

Note: This is a production-ready web scraping microservice optimized for performance, reliability, and stealth operation. It's designed to work seamlessly with the main Firecrawl API for advanced browser automation tasks.