# Smart Playwright Service
A lightweight, high-performance web scraping microservice built with Node.js, TypeScript, and Playwright. This service provides efficient web scraping capabilities with advanced browser automation, stealth features, and optimized performance for production environments.
## 🏗️ Project Structure

```text
smart-playwright-service/
├── api.ts            # Main API server implementation
├── helpers/          # Utility functions
│   └── get_error.ts  # HTTP error handling
├── dist/             # Compiled JavaScript output
├── node_modules/     # Node.js dependencies
├── package.json      # Project configuration
├── tsconfig.json     # TypeScript configuration
├── Dockerfile        # Docker container setup
├── .dockerignore     # Docker ignore patterns
├── .gitignore        # Git ignore patterns
└── README.md         # Project documentation
```
## 🚀 Getting Started

### Prerequisites

- Node.js 18+
- npm or pnpm
- Docker (for containerized deployment)

### Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd smart-playwright-service
  ```

- Install dependencies:

  ```bash
  npm install
  # or
  pnpm install
  ```

- Install Playwright browsers:

  ```bash
  npx playwright install
  ```

- Set up environment variables: create a `.env` file with the following variables:

  ```bash
  # Server Configuration
  PORT=3003

  # Proxy Configuration (Optional)
  PROXY_SERVER=your_proxy_server
  PROXY_USERNAME=your_proxy_username
  PROXY_PASSWORD=your_proxy_password

  # Performance Settings
  BLOCK_MEDIA=true
  ```
### Development

- Start the development server:

  ```bash
  npm run dev
  # or
  pnpm dev
  ```

- Build and start the production server:

  ```bash
  npm run build
  npm start
  ```
### Production

- Build the project:

  ```bash
  npm run build
  ```

- Start the production server:

  ```bash
  npm start
  ```
## 🔧 Core Functionality

### API Endpoints

#### Health Check

`GET /health` - Service health status

#### Web Scraping

`POST /scrape` - Scrape a single URL with advanced options
#### Scraping Request Format

```json
{
  "url": "https://example.com",
  "wait_after_load": 1000,
  "timeout": 15000,
  "headers": {
    "Custom-Header": "value",
    "User-Agent": "Mozilla/5.0...",
    "Cookie": "session=abc123; user=john"
  },
  "check_selector": "#content"
}
```
#### Response Format

```json
{
  "content": "<html>...</html>",
  "pageStatusCode": 200,
  "contentType": "text/html",
  "pageError": null
}
```
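
For reference, the request and response bodies correspond to shapes along these lines (a sketch only; the interface names and field optionality are assumptions based on the examples above):

```typescript
// Hypothetical type definitions for the /scrape contract shown above.
interface ScrapeRequest {
  url: string;                      // target URL (required)
  wait_after_load?: number;         // extra wait after page load, in ms
  timeout?: number;                 // navigation timeout, in ms
  headers?: Record<string, string>; // custom HTTP headers, including Cookie
  check_selector?: string;          // CSS selector that must appear before returning
}

interface ScrapeResponse {
  content: string;                  // rendered HTML of the page
  pageStatusCode: number;           // HTTP status returned by the target site
  contentType: string;              // e.g. "text/html"
  pageError: string | null;         // error description, or null on success
}
```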
### Key Features

- **Advanced Browser Automation**:
  - Playwright-based browser automation
  - Stealth mode to avoid detection
  - Random user-agent rotation
  - Custom viewport and device simulation
- **Smart Request Blocking** (see the sketch after this list):
  - Blocks ad-serving domains automatically
  - Optional media file blocking for performance
  - Configurable request filtering
- **Cookie Management**:
  - Automatic cookie parsing from headers
  - Support for secure cookie prefixes (`__Secure-`, `__Host-`)
  - Domain-specific cookie handling
  - Google services cookie optimization
- **Proxy Support**:
  - HTTP/HTTPS proxy configuration
  - Authentication support
  - Environment-based proxy settings
- **Performance Optimizations**:
  - Fresh browser instance per request
  - Resource blocking for faster loading
  - Configurable timeouts and waits
  - Memory-efficient cleanup
- **Error Handling**:
  - Comprehensive HTTP status code mapping
  - Detailed error messages
  - Graceful failure handling
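
To illustrate the request blocking described above, here is a minimal Playwright sketch; the domain list, function name, and navigation options are illustrative rather than the service's actual implementation:

```typescript
import { chromium } from "playwright";

// Illustrative ad-domain list; the real service maintains its own, longer list.
const BLOCKED_DOMAINS = [
  "google-analytics.com",
  "googletagmanager.com",
  "doubleclick.net",
  "facebook.net",
  "amazon-adsystem.com",
];

const BLOCK_MEDIA = process.env.BLOCK_MEDIA === "true";

async function scrapeWithBlocking(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Intercept every request and drop ad domains and (optionally) heavy media.
  await page.route("**/*", (route) => {
    const request = route.request();
    const target = request.url();
    const type = request.resourceType();

    if (BLOCKED_DOMAINS.some((domain) => target.includes(domain))) {
      return route.abort();
    }
    if (BLOCK_MEDIA && ["image", "media", "font"].includes(type)) {
      return route.abort();
    }
    return route.continue();
  });

  await page.goto(url, { waitUntil: "domcontentloaded", timeout: 15000 });
  const html = await page.content();
  await browser.close(); // fresh browser per request, cleaned up after use
  return html;
}
```

Aborting requests before they leave the browser both reduces tracking noise and cuts memory and bandwidth per scrape.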
## 📚 Libraries Used

### Core Dependencies
- Express.js - Web framework
- TypeScript - Type-safe JavaScript
- Playwright - Browser automation
- Body-parser - Request parsing
- Dotenv - Environment variable management

### Utility Libraries
- User-agents - Random user-agent generation
- Node.js - Runtime environment

### Development Dependencies
- @types/express - Express type definitions
- @types/node - Node.js type definitions
- @types/body-parser - Body-parser type definitions
- @types/user-agents - User-agents type definitions
- ts-node - TypeScript execution
- TypeScript - TypeScript compiler
## 🐳 Docker Deployment

### Build and Run

```bash
# Build the Docker image
docker build -t smart-playwright-service .

# Run the container
docker run -p 3003:3003 \
  -e PROXY_SERVER=your_proxy_server \
  -e BLOCK_MEDIA=true \
  smart-playwright-service
```

### Docker Compose

```yaml
version: "3.8"
services:
  playwright-service:
    build: .
    ports:
      - "3003:3003"
    environment:
      - PROXY_SERVER=your_proxy_server
      - BLOCK_MEDIA=true
    volumes:
      - /tmp/.cache:/tmp/.cache
```
### Environment Variables

| Variable | Description | Default | Required |
|---|---|---|---|
| `PORT` | Server port | `3003` | No |
| `PROXY_SERVER` | Proxy server URL | `null` | No |
| `PROXY_USERNAME` | Proxy username | `null` | No |
| `PROXY_PASSWORD` | Proxy password | `null` | No |
| `BLOCK_MEDIA` | Block media files | `false` | No |
## 🔍 Usage Examples

### Basic Scraping

```bash
curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "timeout": 15000
  }'
```

### Advanced Scraping with Headers

```bash
curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_after_load": 2000,
    "timeout": 30000,
    "headers": {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Cookie": "session=abc123; user=john"
    },
    "check_selector": "#main-content"
  }'
```

### Health Check

```bash
curl http://localhost:3003/health
```
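
The same endpoint can also be called from Node.js 18+ using the built-in `fetch`; a minimal sketch (field values are examples only):

```typescript
// Minimal client for the /scrape endpoint using the built-in fetch (Node.js 18+).
async function scrape(targetUrl: string) {
  const response = await fetch("http://localhost:3003/scrape", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      url: targetUrl,
      timeout: 15000,
      check_selector: "body",
    }),
  });

  // Shape follows the Response Format documented above.
  const result = (await response.json()) as {
    content: string;
    pageStatusCode: number;
    contentType: string;
    pageError: string | null;
  };

  console.log(result.pageStatusCode, result.contentType);
  console.log(result.content.slice(0, 200)); // first 200 characters of the HTML
}

scrape("https://example.com").catch(console.error);
```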
## 🚀 Integration with Firecrawl

To integrate this service with the main Firecrawl API:

- Set the following environment variable in your main API:

  ```bash
  PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape
  ```

- The service will then be used automatically for scraping operations that require browser automation.
## 🔧 Configuration Options

### Request Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `url` | string | Target URL to scrape | Required |
| `wait_after_load` | number | Wait time after page load (ms) | `0` |
| `timeout` | number | Navigation timeout (ms) | `15000` |
| `headers` | object | Custom HTTP headers | |
| `check_selector` | string | CSS selector to wait for | `null` |

### Browser Configuration

The browser context is created with the following defaults (see the sketch after this list):

- Headless Mode: Always enabled for performance
- User Agent: Random rotation with realistic strings
- Viewport: 1920x1080 (configurable)
- JavaScript: Enabled
- Images: Blocked by default (configurable)
- Media: Blocked by default (configurable)
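
As a rough sketch, a context matching these defaults could be set up as follows (the helper name is hypothetical; `user-agents` is the dependency listed above for user-agent rotation):

```typescript
import { chromium } from "playwright";
import UserAgent from "user-agents";

// Hypothetical helper: headless browser, 1920x1080 viewport,
// and a realistic random desktop user agent.
async function newStealthContext() {
  const browser = await chromium.launch({ headless: true });
  const userAgent = new UserAgent({ deviceCategory: "desktop" }).toString();

  const context = await browser.newContext({
    userAgent,
    viewport: { width: 1920, height: 1080 },
    javaScriptEnabled: true,
  });

  return { browser, context };
}
```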

### Blocked Domains

The service automatically blocks requests to known ad-serving domains:
- Google Analytics
- Google Tag Manager
- DoubleClick
- Facebook tracking
- Amazon ads
- And many more...

## 📊 Performance Features

### Memory Management

- Fresh browser instance per request
- Automatic cleanup after each scrape
- Resource blocking to reduce memory usage

### Speed Optimizations

- Media file blocking
- Ad domain blocking
- Minimal resource loading
- Efficient cookie handling

### Reliability

- Comprehensive error handling (see the status-code mapping sketch after this list)
- Timeout management
- Graceful failure recovery
- Health check monitoring
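
The status-code mapping mentioned above (handled in `helpers/get_error.ts`) can be pictured as something like the following; the exact messages and signature here are assumptions, not the helper's actual code:

```typescript
// Hypothetical sketch of an error helper: map an upstream HTTP status code
// to a human-readable message for the pageError field.
export function getError(statusCode: number | null): string | null {
  // Successful responses carry no error.
  if (statusCode === null || (statusCode >= 200 && statusCode < 300)) {
    return null;
  }

  const messages: Record<number, string> = {
    403: "Forbidden: the target site refused the request",
    404: "Not Found: the requested page does not exist",
    429: "Too Many Requests: the target site is rate limiting",
    500: "Internal Server Error on the target site",
  };

  return messages[statusCode] ?? `Request failed with status code ${statusCode}`;
}
```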

## 🔍 Development Tools

### Available Scripts

- `npm run dev` - Start the development server
- `npm run build` - Build for production
- `npm start` - Start the production server
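
These scripts typically correspond to `package.json` entries along these lines (a sketch; the exact entry point and compiler invocation are assumptions):

```json
{
  "scripts": {
    "dev": "ts-node api.ts",
    "build": "tsc",
    "start": "node dist/api.js"
  }
}
```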

### TypeScript Configuration

- Strict type checking enabled
- ES2016 target
- CommonJS modules
- Source maps for debugging
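
A `tsconfig.json` matching these settings would look roughly like this (the `outDir` and `esModuleInterop` entries are assumptions):

```json
{
  "compilerOptions": {
    "target": "ES2016",
    "module": "CommonJS",
    "strict": true,
    "sourceMap": true,
    "outDir": "./dist",
    "esModuleInterop": true
  }
}
```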

## 🚀 Deployment

### Production Considerations

- **Resource Management**:
  - Monitor memory usage
  - Set appropriate timeouts
  - Configure a proxy if needed
- **Security**:
  - Use HTTPS in production
  - Implement rate limiting
  - Secure proxy credentials
- **Monitoring**:
  - Health check endpoint
  - Log monitoring
  - Error tracking

### Scaling

- Horizontal Scaling: Deploy multiple instances
- Load Balancing: Use reverse proxy
- Resource Limits: Set memory and CPU limits

## 🤝 Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request

## 📄 License

This project is licensed under the ISC License.

## 🆘 Support

For support and questions:
- Create an issue in the repository
- Check the documentation for common issues
- Review the health check endpoint
Note: This is a production-ready web scraping microservice optimized for performance, reliability, and stealth operation. It's designed to work seamlessly with the main Firecrawl API for advanced browser automation tasks.