Apify Scraper

Apify is a web scraping and automation platform that provides a comprehensive suite of tools for data extraction, web automation, and data processing. It is primarily a cloud platform, but its open-source SDK also lets you develop and run scrapers locally.

Overview

Apify is a cloud-based web scraping and automation platform that allows developers to:

  • Extract data from websites using pre-built scrapers (Actors)
  • Build custom scrapers using JavaScript/TypeScript
  • Automate web interactions and form submissions
  • Process and transform scraped data
  • Schedule and monitor scraping tasks
  • Store and export data in various formats

Key Features

🚀 Pre-built Actors

  • Ready-to-use scrapers for popular websites
  • E-commerce platforms (Amazon, eBay, Shopify)
  • Social media platforms (Instagram, Twitter, LinkedIn)
  • News and content sites
  • Job boards and real estate sites

🛠️ Custom Development

  • JavaScript/TypeScript-based actor development
  • Puppeteer and Playwright integration
  • Advanced browser automation
  • Proxy rotation and anti-detection features

📊 Data Management

  • Multiple storage options (Dataset, Key-Value Store, Request Queue)
  • Data export in JSON, CSV, Excel formats
  • Real-time data processing
  • Data transformation and cleaning

Automation & Scheduling

  • Cron-based scheduling
  • Webhook triggers
  • API integration
  • Monitoring and alerting

Getting Started

1. Account Setup

  1. Visit the Apify Console (https://console.apify.com)
  2. Sign up for a free account
  3. Verify your email address
  4. Access the dashboard

2. First Steps

// Basic actor example
import { Actor } from "apify";

await Actor.init();
console.log("Actor started!");

// Your scraping logic here
const input = await Actor.getInput();
console.log("Input:", input);

await Actor.exit();

3. Installation

# Install the Apify CLI
npm install -g apify-cli

# Login to your account
apify login

# Create a new actor
apify create my-scraper

Core Concepts

Actors

Actors are the core building blocks of Apify. They are JavaScript/TypeScript applications that can:

  • Scrape websites
  • Process data
  • Interact with web pages
  • Handle authentication
  • Manage data storage

Input/Output

  • Input: Configuration data passed to actors
  • Output: Scraped data stored in datasets
  • State: Persistent data between actor runs
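
The minimal sketch below ties the three concepts together: it reads the input, pushes one output record to the default dataset, and persists a small state object in the key-value store so a later or resurrected run can pick it up. The STATE key name and the processed counter are illustrative choices, not fixed conventions.

// Minimal input/output/state sketch (runs inside an actor)
import { Actor } from "apify";

await Actor.init();

// Input: configuration passed to this run (Console, API, or INPUT.json locally)
const input = await Actor.getInput();

// State: load whatever a previous run left behind (key name is arbitrary)
const state = (await Actor.getValue("STATE")) ?? { processed: 0 };

// ... scraping work would go here ...
state.processed += 1;

// Output: store results in the default dataset
await Actor.pushData({ startUrl: input?.url, processed: state.processed });

// State: persist it for the next run
await Actor.setValue("STATE", state);

await Actor.exit();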

Storage

  • Dataset: Structured data storage (JSON, CSV)
  • Key-Value Store: File and binary data storage
  • Request Queue: URL management and crawling

Actors and Scrapers

Using Pre-built Actors

// Example: Calling the pre-built Web Scraper actor from the Apify Store
import { Actor } from "apify";

await Actor.init();

// Input fields must match the called actor's input schema
// (apify/web-scraper also expects a pageFunction; see its README)
const run = await Actor.call("apify/web-scraper", {
  startUrls: [{ url: "https://example.com" }],
  maxRequestsPerCrawl: 10,
  maxConcurrency: 1,
});

// Read the results the called actor stored in its default dataset
const dataset = await Actor.openDataset(run.defaultDatasetId);
const { items } = await dataset.getData();
console.log("Scraped data:", items);

await Actor.exit();

Building Custom Actors

// Example: Building a custom actor with Crawlee's PlaywrightCrawler
import { Actor } from "apify";
import { PlaywrightCrawler } from "crawlee";

await Actor.init();

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, log }) {
    log.info(`Processing: ${request.url}`);

    // Extract data
    const title = await page.title();
    const content = await page.textContent("body");

    // Save to dataset
    await Actor.pushData({
      url: request.url,
      title,
      content,
    });
  },

  // In Crawlee, the error is passed as the second argument
  async failedRequestHandler({ request, log }, error) {
    log.error(`Request failed: ${request.url} (${error.message})`);
  },
});

await crawler.run(["https://example.com"]);
await Actor.exit();

Data Storage

Dataset Storage

// Save data to the default dataset
await Actor.pushData({
  name: "Product Name",
  price: "$29.99",
  description: "Product description",
  url: "https://example.com/product",
});

// Retrieve data from the default dataset
const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

Key-Value Store

// Save files and binary data
await Actor.setValue("screenshot", screenshotBuffer, {
  contentType: "image/png",
});

// Retrieve stored data
const screenshot = await Actor.getValue("screenshot");

Request Queue

// Open the default request queue and add URLs to crawl
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({
  url: "https://example.com/page1",
  userData: { pageType: "product" },
});

// Process requests (crawlers normally do this for you)
const request = await requestQueue.fetchNextRequest();

Scheduling and Automation

Cron Scheduling

// Schedule an actor to run daily at 9:00 UTC
// (schedules are configured in the Apify Console or via the Schedules API)
const schedule = {
  cron: "0 9 * * *",
  timezone: "UTC",
};
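
Schedules are usually created in the Apify Console, but they can also be managed programmatically. A rough sketch with the apify-client package follows; the schedule name is made up, and the field names (cronExpression, actions, RUN_ACTOR) reflect the Schedules API as I understand it, so verify them against the current API reference before relying on this.

// Hedged sketch: create a schedule with apify-client
// (verify field names against the Schedules API reference)
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const schedule = await client.schedules().create({
  name: "daily-scrape",
  cronExpression: "0 9 * * *", // daily at 9:00
  timezone: "UTC",
  isEnabled: true,
  actions: [{ type: "RUN_ACTOR", actorId: "YOUR_ACTOR_ID" }],
});

console.log("Created schedule:", schedule.id);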

Webhook Triggers

// Trigger an actor run via webhook (token passed as a query parameter)
const webhookUrl =
  "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=YOUR_TOKEN";

API Integration

// Start an actor run via the API (the JSON body is passed to the actor as its input)
const response = await fetch(
  "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: "https://example.com" }),
  }
);
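
After the run starts, its status and results can be read back through the same API. A sketch continuing from the response above; it assumes the APIFY_TOKEN environment variable is set, and uses the waitForFinish query parameter, which as I understand it blocks the status call for up to the given number of seconds:

// Wait (up to 60 s) for the run to finish, then read its default dataset
const { data: run } = await response.json();
const headers = { Authorization: `Bearer ${process.env.APIFY_TOKEN}` };

const finished = await fetch(
  `https://api.apify.com/v2/actor-runs/${run.id}?waitForFinish=60`,
  { headers }
).then((res) => res.json());

const items = await fetch(
  `https://api.apify.com/v2/datasets/${finished.data.defaultDatasetId}/items?format=json`,
  { headers }
).then((res) => res.json());

console.log(`Run ${finished.data.status}, ${items.length} items`);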

Best Practices

1. Respectful Scraping

// Add delays between requests
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    // Add random delay
    await page.waitForTimeout(Math.random() * 3000 + 1000);

    // Your scraping logic
  },
  maxConcurrency: 1, // Limit concurrent requests
  maxRequestsPerCrawl: 100, // Set reasonable limits
});

2. Error Handling

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, log }) {
    try {
      // Scraping logic
      await page.waitForSelector(".content");
      const data = await page.evaluate(() => {
        return document.querySelector(".content").textContent;
      });

      await Actor.pushData({ url: request.url, data });
    } catch (error) {
      log.error(`Failed to scrape ${request.url}: ${error.message}`);
      // Retry logic or skip
    }
  },

  // In Crawlee, the error is passed as the second argument
  async failedRequestHandler({ request, log }, error) {
    log.error(`Request ${request.url} failed: ${error.message}`);
  },
});

3. Proxy Usage

// Residential proxies from the Apify proxy pool
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ["RESIDENTIAL"],
  countryCode: "US",
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  requestHandler: async ({ request, page }) => {
    // Your scraping logic
  },
});

4. Data Validation

// Validate scraped data
const validateData = (data) => {
  const required = ["title", "price", "url"];
  return required.every((field) => data[field] && data[field].trim());
};

// Use validation
const scrapedData = await page.evaluate(() => ({
  title: document.querySelector("h1")?.textContent,
  price: document.querySelector(".price")?.textContent,
  url: window.location.href,
}));

if (validateData(scrapedData)) {
  await Actor.pushData(scrapedData);
}

Troubleshooting

Common Issues

1. Rate Limiting

// Solution: Add delays and respect robots.txt
const crawler = new PlaywrightCrawler({
  maxConcurrency: 1,
  requestHandler: async ({ request, page }) => {
    await page.waitForTimeout(2000); // 2 second delay
    // Your logic
  },
});

2. Anti-Bot Detection

// Solution: launch with stealthy flags and a realistic user agent
// (Playwright has no page.setUserAgent(); the UA is set when the browser is launched)
const crawler = new PlaywrightCrawler({
  launchContext: {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    launchOptions: {
      args: ["--no-sandbox", "--disable-setuid-sandbox"],
    },
  },
  requestHandler: async ({ page }) => {
    // Your logic
  },
});

3. Memory Issues

// Solution: limit concurrency and let the crawler manage page lifecycles
const crawler = new PlaywrightCrawler({
  maxConcurrency: 5, // fewer parallel pages means less memory
  requestHandler: async ({ request, page }) => {
    try {
      // Your scraping logic
    } finally {
      // Clean up anything you opened yourself;
      // the crawler closes the page automatically after the handler returns
    }
  },
});

Debugging Tips

// Enable detailed logging
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page, log }) => {
    log.info(`Processing: ${request.url}`);

    // Take screenshots for debugging (written to local disk, useful when running locally)
    await page.screenshot({ path: `debug-${Date.now()}.png` });

    // Log page content
    const content = await page.textContent("body");
    log.debug(`Page content length: ${content?.length}`);
  },
});
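
On the Apify platform, files written to the run's local disk are generally not accessible after the run finishes, so debug screenshots are better pushed to the key-value store (the same setValue call used in the Data Storage section). A small sketch; the debug- key prefix is just an illustrative naming choice:

// Store debug screenshots in the key-value store instead of local files
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page, log }) => {
    const screenshot = await page.screenshot({ fullPage: true });
    await Actor.setValue(`debug-${Date.now()}`, screenshot, {
      contentType: "image/png",
    });
    log.info(`Saved debug screenshot for ${request.url}`);
  },
});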

Use Cases

1. E-commerce Price Monitoring

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const productData = await page.evaluate(() => ({
      name: document.querySelector(".product-title")?.textContent,
      price: document.querySelector(".price")?.textContent,
      availability: document.querySelector(".stock")?.textContent,
      timestamp: new Date().toISOString(),
    }));

    await Actor.pushData(productData);
  },
});

2. Lead Generation

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const contacts = await page.evaluate(() => {
      const elements = document.querySelectorAll(".contact");
      return Array.from(elements).map((el) => ({
        name: el.querySelector(".name")?.textContent,
        email: el.querySelector(".email")?.textContent,
        phone: el.querySelector(".phone")?.textContent,
      }));
    });

    await Actor.pushData(contacts);
  },
});

3. Content Aggregation

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page }) => {
    const articles = await page.evaluate(() => {
      const articleElements = document.querySelectorAll("article");
      return Array.from(articleElements).map((article) => ({
        title: article.querySelector("h2")?.textContent,
        content: article.querySelector(".content")?.textContent,
        author: article.querySelector(".author")?.textContent,
        date: article.querySelector(".date")?.textContent,
      }));
    });

    await Actor.pushData(articles);
  },
});

Advanced Features

Custom Storage

// Open a named dataset, persisted under your account and shared across runs
const dataset = await Actor.openDataset("products");

// Push records with a consistent shape: title, price, category, url
await dataset.pushData({
  title: "Product Name",
  price: 29.99,
  category: "electronics",
  url: "https://example.com/product",
});

Data Transformation

// Transform data before saving
const transformData = (rawData) => ({
  id: rawData.id,
  name: rawData.title.trim(),
  price: parseFloat(rawData.price.replace("$", "")),
  category: rawData.category.toLowerCase(),
  scrapedAt: new Date().toISOString(),
});

await Actor.pushData(transformData(scrapedData));

Monitoring and Alerts

// Register a webhook from inside the actor to get alerted about run outcomes
await Actor.addWebhook({
  eventTypes: ["ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED"],
  requestUrl: "https://example.com/apify-alerts", // your alerting endpoint
});

// The Apify Console additionally offers built-in monitoring and alerting for actor runs

Resources

  • Apify documentation: https://docs.apify.com
  • Apify Store (pre-built actors): https://apify.com/store
  • Apify API reference: https://docs.apify.com/api/v2

Conclusion

Apify provides a powerful and flexible platform for web scraping and automation. With its pre-built actors, custom development capabilities, and comprehensive data management features, it's an excellent choice for developers looking to extract and process web data at scale.

Remember to always respect website terms of service, implement proper error handling, and use reasonable delays to ensure ethical and sustainable scraping practices.