Processing Scraped Data

Clean, transform, and analyze data scraped with Apify. Learn deduplication, field normalization, JSON-to-CSV conversion, and best practices for reliable export pipelines.

TL;DR

Raw scraped data is messy. Clean it by removing duplicates, normalizing formats, and validating fields. Export to CSV, JSON, or a database. Automate processing inside your scraper or in a post-processing pipeline.

Why Processing Matters

Scraped data often has problems:

  • Duplicate entries from pagination or retries
  • Inconsistent formats (dates, phones, currencies)
  • Missing required fields
  • HTML artifacts and extra whitespace
  • Wrong data types (numbers as strings)

Clean data before analysis. Garbage in, garbage out.
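
A single raw item often combines several of these problems at once. For example (values hypothetical):

// Hypothetical raw item, straight from a scraper
const raw = {
    name: '  Wireless Mouse \n',                        // extra whitespace
    price: '$1,299.99',                                 // number stored as a string
    url: 'https://example.com/p/123?utm_source=feed',   // tracking params create duplicate URLs
    availability: 'In Stock',                           // free text instead of a boolean
};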

Common Data Issues

| Problem | Example | Solution |
| --- | --- | --- |
| Duplicates | Same product appears 3 times | Dedupe by unique ID or URL |
| Inconsistent dates | "Jan 1, 2026" vs "2026-01-01" | Parse and normalize to ISO 8601 |
| Price formatting | "$1,299.99" as a string | Extract the number, store as a float |
| Extra whitespace | " Product Name \n" | Trim and collapse spaces |
| Missing fields | Phone number null on some rows | Set defaults or filter rows out |
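
The date row deserves a closer look, since the examples below do not cover it. A minimal sketch in JavaScript; parsing of non-ISO strings such as "Jan 1, 2026" is engine-dependent, so a date library (e.g. date-fns or Luxon) is safer in production:

// Normalize assorted date strings to ISO 8601 (sketch)
// Non-ISO parsing is engine- and timezone-dependent; prefer a date library in production
function normalizeDate(value) {
    const parsed = new Date(value);
    return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
}

normalizeDate('Jan 1, 2026');  // '2026-01-01T00:00:00.000Z' in UTC; shifts with local timezone
normalizeDate('2026-01-01');   // '2026-01-01T00:00:00.000Z'
normalizeDate('not a date');   // null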

Processing in JavaScript

Clean data before saving in Crawlee:

import { CheerioCrawler, Dataset } from 'crawlee';

function cleanProduct(raw) {
    return {
        // Normalize name
        name: raw.name?.trim() || 'Unknown',

        // Extract number from price string
        price: parseFloat(raw.price?.replace(/[$,]/g, '')) || 0,

        // Normalize URL
        url: raw.url?.split('?')[0],

        // ISO date format
        scrapedAt: new Date().toISOString(),

        // Boolean conversion
        inStock: raw.availability?.toLowerCase().includes('in stock') || false,
    };
}

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const rawData = {
            name: $('.product-title').text(),
            price: $('.price').text(),
            url: request.url,
            availability: $('.stock-status').text(),
        };

        const cleaned = cleanProduct(rawData);
        await Dataset.pushData(cleaned);
    },
});

Processing in Python

import json
import re
from datetime import datetime, timezone

def clean_product(raw):
    # Extract price as float
    price_str = raw.get('price', '$0')
    price = float(re.sub(r'[^\d.]', '', price_str) or 0)

    return {
        'name': raw.get('name', '').strip(),
        'price': price,
        'url': raw.get('url', '').split('?')[0],
        # Timezone-aware UTC timestamp, matching toISOString() in the JS example
        'scraped_at': datetime.now(timezone.utc).isoformat(),
        'in_stock': 'in stock' in raw.get('availability', '').lower(),
    }

# Process dataset
with open('dataset.json') as f:
    raw_data = json.load(f)

cleaned = [clean_product(item) for item in raw_data]

Deduplication

Remove duplicate entries:

// JavaScript - dedupe by URL
function deduplicate(items) {
    const seen = new Set();
    return items.filter(item => {
        if (seen.has(item.url)) return false;
        seen.add(item.url);
        return true;
    });
}

// Python - dedupe by URL
def deduplicate(items):
    seen = set()
    unique = []
    for item in items:
        if item['url'] not in seen:
            seen.add(item['url'])
            unique.append(item)
    return unique
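
When items have no stable URL, a composite key over other fields works the same way. A sketch; which fields identify an item is your call (name + price here is illustrative):

// JavaScript - dedupe by a caller-supplied key
function deduplicateBy(items, keyFn) {
    const seen = new Set();
    return items.filter(item => {
        const key = keyFn(item);
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}

// Example: treat name + price as an item's identity
const unique = deduplicateBy(items, item => `${item.name}|${item.price}`);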

Data Validation

Check required fields and data types:

function validateProduct(item) {
    const errors = [];

    if (!item.name || item.name.length < 2) {
        errors.push('Invalid name');
    }

    if (typeof item.price !== 'number' || item.price < 0) {
        errors.push('Invalid price');
    }

    if (!item.url?.startsWith('http')) {
        errors.push('Invalid URL');
    }

    return {
        isValid: errors.length === 0,
        errors,
        data: item,
    };
}

// Filter to valid items only
const validItems = items
    .map(validateProduct)
    .filter(result => result.isValid)
    .map(result => result.data);
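
Rather than dropping invalid items silently, consider keeping the rejects for inspection; a small addition to the same pipeline:

// Collect rejects so validation failures stay visible
const rejects = items
    .map(validateProduct)
    .filter(result => !result.isValid);

console.log(`Dropped ${rejects.length} invalid items`, rejects.slice(0, 5));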

Export Formats

| Format | Best For | Notes |
| --- | --- | --- |
| JSON | APIs, nested data | Preserves types and structure |
| CSV | Excel, flat data | Flatten nested objects first |
| Excel | Business users | Limited to 1,048,576 rows per sheet |
| Database | Large datasets, queries | PostgreSQL and MongoDB are common choices |
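
As the CSV row notes, nested objects need flattening first. A minimal one-function sketch; dot-separated column names and ';'-joined arrays are conventions, not requirements:

// Flatten nested objects into dot-notation keys for CSV export (sketch)
function flatten(obj, prefix = '') {
    const flat = {};
    for (const [key, value] of Object.entries(obj)) {
        const name = prefix ? `${prefix}.${key}` : key;
        if (value && typeof value === 'object' && !Array.isArray(value)) {
            Object.assign(flat, flatten(value, name));
        } else {
            flat[name] = Array.isArray(value) ? value.join(';') : value;
        }
    }
    return flat;
}

flatten({ name: 'Mouse', specs: { color: 'black' }, tags: ['sale', 'new'] });
// { name: 'Mouse', 'specs.color': 'black', tags: 'sale;new' }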

Apify Dataset Features

Apify handles export automatically:

  • Format conversion: Download as JSON, CSV, Excel, XML
  • Streaming: Handle datasets larger than memory
  • Webhooks: Trigger processing when run completes
  • API access: Fetch results programmatically

// Fetch and process via API
const response = await fetch(
    'https://api.apify.com/v2/datasets/DATASET_ID/items?format=json',
    { headers: { Authorization: 'Bearer YOUR_TOKEN' } }
);
const items = await response.json();
const cleaned = items.map(cleanProduct);

Common Questions

Q: Clean during scraping or after?

A: Both. Do lightweight cleaning during scraping (trimming whitespace, converting types) and heavier processing afterwards (deduplication, validation) once you have the full dataset.

Q: What about very large datasets?

A: Use stream processing. Do not load everything into memory; store results in a database and process in chunks (see the sketch below).
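
The dataset items endpoint accepts offset and limit parameters, so chunked processing can be a simple paging loop. A sketch; saveToDatabase is a hypothetical placeholder for your own storage layer:

// Process a large dataset in pages instead of loading it all at once
const PAGE_SIZE = 1000;

for (let offset = 0; ; offset += PAGE_SIZE) {
    const response = await fetch(
        `https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&offset=${offset}&limit=${PAGE_SIZE}`,
        { headers: { Authorization: 'Bearer YOUR_TOKEN' } },
    );
    const page = await response.json();
    if (page.length === 0) break;

    // saveToDatabase is hypothetical -- swap in your own storage layer
    await saveToDatabase(page.map(cleanProduct));
}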

Q: How do I handle encoding issues?

A: Always decode as UTF-8. Replace or remove invalid characters. Watch for HTML entities (&amp; → &).
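
Note that Cheerio's .text() already decodes entities for you. For raw strings, a minimal decoder covering the most common entities (a dedicated library such as he handles the full set):

// Decode the most common HTML entities (sketch; use a library for full coverage)
const ENTITIES = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };

function decodeEntities(str) {
    return str.replace(/&(?:amp|lt|gt|quot|#39);/g, match => ENTITIES[match]);
}

decodeEntities('Ben &amp; Jerry&#39;s'); // "Ben & Jerry's"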