Apify Scraper

Apify is a web scraping and automation platform that provides a comprehensive suite of tools for data extraction, web automation, and data processing. It is primarily a cloud platform, but its open-source SDK also lets you develop and run scrapers locally.

Overview

Apify is a cloud-based web scraping and automation platform that allows developers to:

  • Extract data from websites using pre-built scrapers (Actors)
  • Build custom scrapers using JavaScript/TypeScript
  • Automate web interactions and form submissions
  • Process and transform scraped data
  • Schedule and monitor scraping tasks
  • Store and export data in various formats

Key Features

🚀 Pre-built Actors

  • Ready-to-use scrapers for popular websites
  • E-commerce platforms (Amazon, eBay, Shopify)
  • Social media platforms (Instagram, Twitter, LinkedIn)
  • News and content sites
  • Job boards and real estate sites

🛠️ Custom Development

  • JavaScript/TypeScript-based actor development
  • Puppeteer and Playwright integration
  • Advanced browser automation
  • Proxy rotation and anti-detection features

📊 Data Management

  • Multiple storage options (Dataset, Key-Value Store, Request Queue)
  • Data export in JSON, CSV, Excel formats
  • Real-time data processing
  • Data transformation and cleaning

Automation & Scheduling

  • Cron-based scheduling
  • Webhook triggers
  • API integration
  • Monitoring and alerting

Getting Started

1. Account Setup

  1. Visit the Apify Console (https://console.apify.com)
  2. Sign up for a free account
  3. Verify your email address
  4. Access the dashboard

2. First Steps

// Basic actor example
import { Actor } from "apify";

await Actor.init();
console.log("Actor started!");

// Your scraping logic here
const input = await Actor.getInput();
console.log("Input:", input);

await Actor.exit();

3. Installation

# Install the Apify CLI
npm install -g apify-cli

# Login to your account
apify login

# Create a new actor
apify create my-scraper

Core Concepts

Actors

Actors are the core building blocks of Apify. They are JavaScript/TypeScript applications that can:

  • Scrape websites
  • Process data
  • Interact with web pages
  • Handle authentication
  • Manage data storage

Input/Output

  • Input: Configuration data passed to actors
  • Output: Scraped data stored in datasets
  • State: Persistent data between actor runs
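
The minimal sketch below ties the three concepts together: it reads the input, pushes one output record to the default dataset, and persists a small state object in the key-value store so a later or resurrected run can pick it up. The STATE key name and the processed counter are illustrative choices, not fixed conventions.

// Minimal input/output/state sketch (runs inside an actor)
import { Actor } from "apify";

await Actor.init();

// Input: configuration passed to this run (Console, API, or INPUT.json locally)
const input = await Actor.getInput();

// State: load whatever a previous run left behind (key name is arbitrary)
const state = (await Actor.getValue("STATE")) ?? { processed: 0 };

// ... scraping work would go here ...
state.processed += 1;

// Output: store results in the default dataset
await Actor.pushData({ startUrl: input?.url, processed: state.processed });

// State: persist it for the next run
await Actor.setValue("STATE", state);

await Actor.exit();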

Storage

  • Dataset: Structured data storage (JSON, CSV)
  • Key-Value Store: File and binary data storage
  • Request Queue: URL management and crawling

Actors and Scrapers

Using Pre-built Actors

// Example: Calling the pre-built Web Scraper actor from the Apify Store
import { Actor } from "apify";

await Actor.init();

// Input fields must match the called actor's input schema
// (apify/web-scraper also expects a pageFunction; see its README)
const run = await Actor.call("apify/web-scraper", {
  startUrls: [{ url: "https://example.com" }],
  maxRequestsPerCrawl: 10,
  maxConcurrency: 1,
});

// Read the results the called actor stored in its default dataset
const dataset = await Actor.openDataset(run.defaultDatasetId);
const { items } = await dataset.getData();
console.log("Scraped data:", items);

await Actor.exit();

Building Custom Actors

// Example: Building a custom actor with Crawlee's PlaywrightCrawler
import { Actor } from "apify";
import { PlaywrightCrawler } from "crawlee";

await Actor.init();

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, log }) {
    log.info(`Processing: ${request.url}`);

    // Extract data
    const title = await page.title();
    const content = await page.textContent("body");

    // Save to dataset
    await Actor.pushData({
      url: request.url,
      title,
      content,
    });
  },

  // In Crawlee, the error is passed as the second argument
  async failedRequestHandler({ request, log }, error) {
    log.error(`Request failed: ${request.url} (${error.message})`);
  },
});

await crawler.run(["https://example.com"]);
await Actor.exit();

Data Storage

Dataset Storage

// Save data to the default dataset
await Actor.pushData({
  name: "Product Name",
  price: "$29.99",
  description: "Product description",
  url: "https://example.com/product",
});

// Retrieve data from the default dataset
const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

Key-Value Store

// Save files and binary data
await Actor.setValue("screenshot", screenshotBuffer, {
  contentType: "image/png",
});

// Retrieve stored data
const screenshot = await Actor.getValue("screenshot");

Request Queue

// Open the default request queue and add URLs to crawl
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({
  url: "https://example.com/page1",
  userData: { pageType: "product" },
});

// Process requests (crawlers normally do this for you)
const request = await requestQueue.fetchNextRequest();

Scheduling and Automation

Cron Scheduling

// Schedule an actor to run daily at 9:00 UTC
// (schedules are configured in the Apify Console or via the Schedules API)
const schedule = {
  cron: "0 9 * * *",
  timezone: "UTC",
};
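
Schedules are usually created in the Apify Console, but they can also be managed programmatically. A rough sketch with the apify-client package follows; the schedule name is made up, and the field names (cronExpression, actions, RUN_ACTOR) reflect the Schedules API as I understand it, so verify them against the current API reference before relying on this.

// Hedged sketch: create a schedule with apify-client
// (verify field names against the Schedules API reference)
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const schedule = await client.schedules().create({
  name: "daily-scrape",
  cronExpression: "0 9 * * *", // daily at 9:00
  timezone: "UTC",
  isEnabled: true,
  actions: [{ type: "RUN_ACTOR", actorId: "YOUR_ACTOR_ID" }],
});

console.log("Created schedule:", schedule.id);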

Webhook Triggers

// Trigger an actor run via webhook (token passed as a query parameter)
const webhookUrl =
  "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=YOUR_TOKEN";

API Integration

// Start an actor run via the API (the JSON body is passed to the actor as its input)
const response = await fetch(
  "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: "https://example.com" }),
  }
);
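
After the run starts, its status and results can be read back through the same API. A sketch continuing from the response above; it assumes the APIFY_TOKEN environment variable is set, and uses the waitForFinish query parameter, which as I understand it blocks the status call for up to the given number of seconds:

// Wait (up to 60 s) for the run to finish, then read its default dataset
const { data: run } = await response.json();
const headers = { Authorization: `Bearer ${process.env.APIFY_TOKEN}` };

const finished = await fetch(
  `https://api.apify.com/v2/actor-runs/${run.id}?waitForFinish=60`,
  { headers }
).then((res) => res.json());

const items = await fetch(
  `https://api.apify.com/v2/datasets/${finished.data.defaultDatasetId}/items?format=json`,
  { headers }
).then((res) => res.json());

console.log(`Run ${finished.data.status}, ${items.length} items`);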

Best Practices

1. Respectful Scraping

// Add delays between requests
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    // Add random delay
    await page.waitForTimeout(Math.random() * 3000 + 1000);

    // Your scraping logic
  },
  maxConcurrency: 1, // Limit concurrent requests
  maxRequestsPerCrawl: 100, // Set reasonable limits
});

2. Error Handling

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, log }) {
    try {
      // Scraping logic
      await page.waitForSelector(".content");
      const data = await page.evaluate(() => {
        return document.querySelector(".content").textContent;
      });

      await Actor.pushData({ url: request.url, data });
    } catch (error) {
      log.error(`Failed to scrape ${request.url}: ${error.message}`);
      // Retry logic or skip
    }
  },

  // In Crawlee, the error is passed as the second argument
  async failedRequestHandler({ request, log }, error) {
    log.error(`Request ${request.url} failed: ${error.message}`);
  },
});

3. Proxy Usage

// Residential proxies from the Apify proxy pool
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ["RESIDENTIAL"],
  countryCode: "US",
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  requestHandler: async ({ request, page }) => {
    // Your scraping logic
  },
});

4. Data Validation

// Validate scraped data
const validateData = (data) => {
  const required = ["title", "price", "url"];
  return required.every((field) => data[field] && data[field].trim());
};

// Use validation
const scrapedData = await page.evaluate(() => ({
  title: document.querySelector("h1")?.textContent,
  price: document.querySelector(".price")?.textContent,
  url: window.location.href,
}));

if (validateData(scrapedData)) {
  await Actor.pushData(scrapedData);
}

Troubleshooting

Common Issues

1. Rate Limiting

// Solution: Add delays and respect robots.txt
const crawler = new PlaywrightCrawler({
  maxConcurrency: 1,
  requestHandler: async ({ request, page }) => {
    await page.waitForTimeout(2000); // 2 second delay
    // Your logic
  },
});

2. Anti-Bot Detection

// Solution: launch with stealthy flags and a realistic user agent
// (Playwright has no page.setUserAgent(); the UA is set when the browser is launched)
const crawler = new PlaywrightCrawler({
  launchContext: {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    launchOptions: {
      args: ["--no-sandbox", "--disable-setuid-sandbox"],
    },
  },
  requestHandler: async ({ page }) => {
    // Your logic
  },
});

3. Memory Issues

// Solution: limit concurrency and let the crawler manage page lifecycles
const crawler = new PlaywrightCrawler({
  maxConcurrency: 5, // fewer parallel pages means less memory
  requestHandler: async ({ request, page }) => {
    try {
      // Your scraping logic
    } finally {
      // Clean up anything you opened yourself;
      // the crawler closes the page automatically after the handler returns
    }
  },
});

Debugging Tips

// Enable detailed logging
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page, log }) => {
    log.info(`Processing: ${request.url}`);

    // Take screenshots for debugging (written to local disk, useful when running locally)
    await page.screenshot({ path: `debug-${Date.now()}.png` });

    // Log page content
    const content = await page.textContent("body");
    log.debug(`Page content length: ${content?.length}`);
  },
});
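
On the Apify platform, files written to the run's local disk are generally not accessible after the run finishes, so debug screenshots are better pushed to the key-value store (the same setValue call used in the Data Storage section). A small sketch; the debug- key prefix is just an illustrative naming choice:

// Store debug screenshots in the key-value store instead of local files
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page, log }) => {
    const screenshot = await page.screenshot({ fullPage: true });
    await Actor.setValue(`debug-${Date.now()}`, screenshot, {
      contentType: "image/png",
    });
    log.info(`Saved debug screenshot for ${request.url}`);
  },
});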

Use Cases

1. E-commerce Price Monitoring

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const productData = await page.evaluate(() => ({
      name: document.querySelector(".product-title")?.textContent,
      price: document.querySelector(".price")?.textContent,
      availability: document.querySelector(".stock")?.textContent,
      timestamp: new Date().toISOString(),
    }));

    await Actor.pushData(productData);
  },
});

2. Lead Generation

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const contacts = await page.evaluate(() => {
      const elements = document.querySelectorAll(".contact");
      return Array.from(elements).map((el) => ({
        name: el.querySelector(".name")?.textContent,
        email: el.querySelector(".email")?.textContent,
        phone: el.querySelector(".phone")?.textContent,
      }));
    });

    await Actor.pushData(contacts);
  },
});

3. Content Aggregation

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const articles = await page.evaluate(() => {
      const articleElements = document.querySelectorAll("article");
      return Array.from(articleElements).map((article) => ({
        title: article.querySelector("h2")?.textContent,
        content: article.querySelector(".content")?.textContent,
        author: article.querySelector(".author")?.textContent,
        date: article.querySelector(".date")?.textContent,
      }));
    });

    await Actor.pushData(articles);
  },
});

Advanced Features

Custom Storage

// Open a named dataset, persisted under your account and shared across runs
const dataset = await Actor.openDataset("products");

// Push records with a consistent shape: title, price, category, url
await dataset.pushData({
  title: "Product Name",
  price: 29.99,
  category: "electronics",
  url: "https://example.com/product",
});

Data Transformation

// Transform data before saving
const transformData = (rawData) => ({
  id: rawData.id,
  name: rawData.title.trim(),
  price: parseFloat(rawData.price.replace("$", "")),
  category: rawData.category.toLowerCase(),
  scrapedAt: new Date().toISOString(),
});

await Actor.pushData(transformData(scrapedData));

Monitoring and Alerts

// Register a webhook from inside the actor to get alerted about run outcomes
await Actor.addWebhook({
  eventTypes: ["ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED"],
  requestUrl: "https://example.com/apify-alerts", // your alerting endpoint
});

// The Apify Console additionally offers built-in monitoring and alerting for actor runs

Resources

  • Apify documentation: https://docs.apify.com
  • Apify Store (pre-built actors): https://apify.com/store
  • Apify API reference: https://docs.apify.com/api/v2

Conclusion

Apify provides a powerful and flexible platform for web scraping and automation. With its pre-built actors, custom development capabilities, and comprehensive data management features, it's an excellent choice for developers looking to extract and process web data at scale.

Remember to always respect website terms of service, implement proper error handling, and use reasonable delays to ensure ethical and sustainable scraping practices.