Smart Playwright Service

A lightweight, high-performance web scraping microservice built with Node.js, TypeScript, and Playwright. It combines browser automation, stealth features, and performance optimizations suited to production environments.

🏗️ Project Structure

smart-playwright-service/
├── api.ts # Main API server implementation
├── helpers/ # Utility functions
│ └── get_error.ts # HTTP error handling
├── dist/ # Compiled JavaScript output
├── node_modules/ # Node.js dependencies
├── package.json # Project configuration
├── tsconfig.json # TypeScript configuration
├── Dockerfile # Docker container setup
├── .dockerignore # Docker ignore patterns
├── .gitignore # Git ignore patterns
└── README.md # Project documentation

🚀 Getting Started

Prerequisites

  • Node.js 18+
  • npm or pnpm
  • Docker (for containerized deployment)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd smart-playwright-service
  2. Install dependencies

    npm install
    # or
    pnpm install
  3. Install Playwright browsers

    npx playwright install
  4. Set up environment variables: create a .env file with the following variables:

    # Server Configuration
    PORT=3003

    # Proxy Configuration (Optional)
    PROXY_SERVER=your_proxy_server
    PROXY_USERNAME=your_proxy_username
    PROXY_PASSWORD=your_proxy_password

    # Performance Settings
    BLOCK_MEDIA=true

Development

  1. Start the development server

    npm run dev
    # or
    pnpm dev
  2. Build and start production server

    npm run build
    npm start

Production

  1. Build the project

    npm run build
  2. Start production server

    npm start

🔧 Core Functionality

API Endpoints

Health Check

  • GET /health - Service health status

Web Scraping

  • POST /scrape - Scrape a single URL with advanced options

Scraping Request Format

{
  "url": "https://example.com",
  "wait_after_load": 1000,
  "timeout": 15000,
  "headers": {
    "Custom-Header": "value",
    "User-Agent": "Mozilla/5.0...",
    "Cookie": "session=abc123; user=john"
  },
  "check_selector": "#content"
}

Response Format

{
  "content": "<html>...</html>",
  "pageStatusCode": 200,
  "contentType": "text/html",
  "pageError": null
}

Key Features

  1. Advanced Browser Automation:

    • Playwright-based browser automation
    • Stealth mode to avoid detection
    • Random user-agent rotation
    • Custom viewport and device simulation
  2. Smart Request Blocking:

    • Blocks ad-serving domains automatically
    • Optional media file blocking for performance
    • Configurable request filtering
  3. Cookie Management:

    • Automatic cookie parsing from headers
    • Support for secure cookie prefixes (__Secure-, __Host-)
    • Domain-specific cookie handling
    • Google services cookie optimization
  4. Proxy Support:

    • HTTP/HTTPS proxy configuration
    • Authentication support
    • Environment-based proxy settings
  5. Performance Optimizations:

    • Fresh browser instance per request
    • Resource blocking for faster loading
    • Configurable timeouts and waits
    • Memory-efficient cleanup
  6. Error Handling:

    • Comprehensive HTTP status code mapping
    • Detailed error messages
    • Graceful failure handling
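
As a sketch of the cookie handling described in feature 3, a simplified parser that honors the __Secure- and __Host- prefixes might look like the following; the service's actual implementation may differ.

```typescript
// A Playwright-style cookie object (subset of Playwright's Cookie fields).
interface ParsedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  secure: boolean;
}

// Parse a raw Cookie header ("session=abc123; user=john") into cookie objects
// for a given domain. Cookies with the __Secure- or __Host- prefix must be
// sent over HTTPS, so they are marked secure; __Host- cookies are additionally
// expected to use path "/" on the exact host.
function parseCookieHeader(header: string, domain: string): ParsedCookie[] {
  return header
    .split(";")
    .map((pair) => pair.trim())
    .filter((pair) => pair.includes("="))
    .map((pair) => {
      const eq = pair.indexOf("=");
      const name = pair.slice(0, eq).trim();
      const value = pair.slice(eq + 1).trim();
      const secure = name.startsWith("__Secure-") || name.startsWith("__Host-");
      return { name, value, domain, path: "/", secure };
    });
}
```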

📚 Libraries Used

Core Dependencies

  • Express.js - Web framework
  • TypeScript - Type-safe JavaScript
  • Playwright - Browser automation
  • Body-parser - Request parsing
  • Dotenv - Environment variable management

Utility Libraries

  • User-agents - Random user-agent generation
  • Node.js - Runtime environment

Development Dependencies

  • @types/express - Express type definitions
  • @types/node - Node.js type definitions
  • @types/body-parser - Body-parser type definitions
  • @types/user-agents - User-agents type definitions
  • ts-node - TypeScript execution
  • TypeScript - TypeScript compiler

🐳 Docker Deployment

Build and Run

# Build the Docker image
docker build -t smart-playwright-service .

# Run the container
docker run -p 3003:3003 \
  -e PROXY_SERVER=your_proxy_server \
  -e BLOCK_MEDIA=true \
  smart-playwright-service

Docker Compose

version: "3.8"
services:
  playwright-service:
    build: .
    ports:
      - "3003:3003"
    environment:
      - PROXY_SERVER=your_proxy_server
      - BLOCK_MEDIA=true
    volumes:
      - /tmp/.cache:/tmp/.cache

Environment Variables

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| PORT | Server port | 3003 | No |
| PROXY_SERVER | Proxy server URL | null | No |
| PROXY_USERNAME | Proxy username | null | No |
| PROXY_PASSWORD | Proxy password | null | No |
| BLOCK_MEDIA | Block media files | false | No |
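
The table above maps directly onto a typed config loader. A minimal sketch (the `loadConfig` helper is illustrative, with defaults taken from the table):

```typescript
// Typed service configuration with the defaults from the table above.
interface ServiceConfig {
  port: number;
  proxyServer: string | null;
  proxyUsername: string | null;
  proxyPassword: string | null;
  blockMedia: boolean;
}

// Read configuration from an environment map (e.g. process.env),
// falling back to the documented defaults when a variable is unset.
function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  return {
    port: env.PORT ? Number(env.PORT) : 3003,
    proxyServer: env.PROXY_SERVER ?? null,
    proxyUsername: env.PROXY_USERNAME ?? null,
    proxyPassword: env.PROXY_PASSWORD ?? null,
    blockMedia: env.BLOCK_MEDIA === "true", // anything else means false
  };
}
```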

🔍 Usage Examples

Basic Scraping

curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "timeout": 15000
  }'

Advanced Scraping with Headers

curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_after_load": 2000,
    "timeout": 30000,
    "headers": {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Cookie": "session=abc123; user=john"
    },
    "check_selector": "#main-content"
  }'

Health Check

curl http://localhost:3003/health

🚀 Integration with Firecrawl

To integrate this service with the main Firecrawl API:

  1. Set environment variable in your main API:

    PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape
  2. The service will be automatically used for scraping operations that require browser automation.

🔧 Configuration Options

Request Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string | Target URL to scrape | Required |
| wait_after_load | number | Wait time after page load (ms) | 0 |
| timeout | number | Navigation timeout (ms) | 15000 |
| headers | object | Custom HTTP headers | |
| check_selector | string | CSS selector to wait for | null |
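
Applying the defaults from this table on the server side can be sketched as follows; the `withDefaults` helper is hypothetical, and an empty object for `headers` is an assumed default since the table leaves that cell blank.

```typescript
// Request parameters after defaults from the table above have been applied.
interface ResolvedParams {
  url: string;
  wait_after_load: number;
  timeout: number;
  headers: Record<string, string>;
  check_selector: string | null;
}

// Validate an incoming request body and fill in the documented defaults.
// `url` is the only required field; everything else falls back.
function withDefaults(body: Record<string, unknown>): ResolvedParams {
  if (typeof body.url !== "string") throw new Error("url is required");
  return {
    url: body.url,
    wait_after_load: typeof body.wait_after_load === "number" ? body.wait_after_load : 0,
    timeout: typeof body.timeout === "number" ? body.timeout : 15000,
    headers: (body.headers as Record<string, string> | undefined) ?? {},
    check_selector: typeof body.check_selector === "string" ? body.check_selector : null,
  };
}
```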

Browser Configuration

  • Headless Mode: Always enabled for performance
  • User Agent: Random rotation with realistic strings
  • Viewport: 1920x1080 (configurable)
  • JavaScript: Enabled
  • Images: Blocked when BLOCK_MEDIA=true (off by default)
  • Media: Blocked when BLOCK_MEDIA=true (off by default)
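
The user-agent rotation and context settings above can be sketched as plain option objects; a hypothetical two-entry pool stands in here for the `user-agents` package the service actually uses.

```typescript
// Illustrative pool of realistic user-agent strings; the service draws
// from the much larger `user-agents` package instead.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

function randomUserAgent(): string {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Browser-context options matching the configuration above, shaped like
// Playwright's browser.newContext() options but shown as a plain object.
function contextOptions() {
  return {
    viewport: { width: 1920, height: 1080 },
    javaScriptEnabled: true,
    userAgent: randomUserAgent(),
  };
}
```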

Blocked Domains

The service automatically blocks requests to known ad-serving domains:

  • Google Analytics
  • Google Tag Manager
  • DoubleClick
  • Facebook tracking
  • Amazon ads
  • And many more...
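
A request filter against such a list can be sketched as below; the host list here is an illustrative subset, not the service's real list.

```typescript
// Illustrative subset of blocked ad/tracking hosts.
const BLOCKED_HOSTS = [
  "google-analytics.com",
  "googletagmanager.com",
  "doubleclick.net",
  "facebook.net",
  "amazon-adsystem.com",
];

// True if the request URL's hostname is a blocked host or a subdomain of one.
function isBlockedRequest(url: string): boolean {
  const host = new URL(url).hostname;
  return BLOCKED_HOSTS.some((b) => host === b || host.endsWith(`.${b}`));
}
```

With Playwright, this predicate would typically be wired into request interception, e.g. `await page.route("**/*", (route) => isBlockedRequest(route.request().url()) ? route.abort() : route.continue());`.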

📊 Performance Features

Memory Management

  • Fresh browser instance per request
  • Automatic cleanup after each scrape
  • Resource blocking to reduce memory usage

Speed Optimizations

  • Media file blocking
  • Ad domain blocking
  • Minimal resource loading
  • Efficient cookie handling

Reliability

  • Comprehensive error handling
  • Timeout management
  • Graceful failure recovery
  • Health check monitoring

🔍 Development Tools

Available Scripts

  • npm run dev - Start development server
  • npm run build - Build for production
  • npm start - Start production server

TypeScript Configuration

  • Strict type checking enabled
  • ES2016 target
  • CommonJS modules
  • Source maps for debugging

🚀 Deployment

Production Considerations

  1. Resource Management:

    • Monitor memory usage
    • Set appropriate timeouts
    • Configure proxy if needed
  2. Security:

    • Use HTTPS in production
    • Implement rate limiting
    • Secure proxy credentials
  3. Monitoring:

    • Health check endpoint
    • Log monitoring
    • Error tracking

Scaling

  • Horizontal Scaling: Deploy multiple instances
  • Load Balancing: Use reverse proxy
  • Resource Limits: Set memory and CPU limits

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is licensed under the ISC License.

🆘 Support

For support and questions:

  • Create an issue in the repository
  • Check the documentation for common issues
  • Review the health check endpoint

Note: This is a production-ready web scraping microservice optimized for performance, reliability, and stealth operation. It's designed to work seamlessly with the main Firecrawl API for advanced browser automation tasks.