Essence
API Reference

Crawl

POST /api/v1/crawl — multi-page traversal

Multi-page traversal with URL frontier, deduplication, robots.txt compliance, and rate limiting.

Endpoint

POST /api/v1/crawl

A streaming variant is also available:

POST /api/v1/crawl/stream

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | | Required. Starting URL |
| maxDepth | integer | 2 | Maximum link depth to follow |
| limit | integer | 100 | Maximum pages to crawl |
| includePaths | string[] | | Glob patterns for URLs to include (e.g., ["/blog/*"]) |
| excludePaths | string[] | | Glob patterns for URLs to exclude (e.g., ["/admin/*"]) |
| allowBackwardLinks | boolean | false | Follow links up the URL hierarchy |
| allowExternalLinks | boolean | false | Follow links to external domains |
| detectPagination | boolean | true | Auto-follow pagination links |
| maxPaginationPages | integer | 50 | Max pagination pages to follow |
| ignoreSitemap | boolean | false | Skip sitemap.xml discovery |
| useParallel | boolean | false | Use parallel crawler |

All scrape parameters (formats, engine, timeout, etc.) also apply to each crawled page.
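As a sketch of how crawl options and per-page scrape options combine in one request body (the specific values below, including the engine name and the timeout unit, are illustrative assumptions):

```python
# Illustrative payload mixing crawl parameters with per-page scrape
# parameters; the engine value and timeout unit are assumptions.
payload = {
    # Crawl parameters
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    # Scrape parameters, applied to each crawled page
    "formats": ["markdown", "html"],
    "timeout": 30000,  # assumed to be milliseconds
}
```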

Examples

cURL

curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"]
  }'

Python

import requests

response = requests.post("http://localhost:8080/api/v1/crawl", json={
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"]
})

data = response.json()
for page in data["data"]:
    print(f"{page['metadata']['title']}: {page['metadata']['url']}")

JavaScript

const response = await fetch("http://localhost:8080/api/v1/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 2,
    limit: 50,
    includePaths: ["/docs/*"],
    formats: ["markdown"],
  }),
});

const data = await response.json();
data.data.forEach((page) => {
  console.log(`${page.metadata.title}: ${page.metadata.url}`);
});

Streaming Crawl

Use the streaming endpoint to receive results as pages are crawled:

curl -X POST http://localhost:8080/api/v1/crawl/stream \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 20
  }'

The response uses Server-Sent Events (SSE) with these event types:

| Event Type | Description |
|---|---|
| status | Progress updates (pages crawled, queue size, current URL) |
| document | A completed page document |
| error | Error for an individual URL |
| complete | Final summary with totals |

Response (Non-Streaming)

The standard /api/v1/crawl endpoint returns all results at once:

{
  "success": true,
  "data": [
    {
      "title": "Docs - Getting Started",
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nWelcome...",
      "metadata": {
        "title": "Docs - Getting Started",
        "url": "https://docs.example.com/getting-started",
        "statusCode": 200,
        "contentType": "text/html",
        "wordCount": 450,
        "readingTime": 2
      }
    }
  ]
}
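Given that shape, a minimal sketch for post-processing the non-streaming response (field names taken from the sample above):

```python
def summarize_crawl(resp: dict) -> list[str]:
    """Return 'title: url' lines from a crawl response, or raise on failure."""
    if not resp.get("success"):
        raise RuntimeError("crawl did not succeed")
    return [
        f"{page['metadata']['title']}: {page['metadata']['url']}"
        for page in resp["data"]
    ]
```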

Python (Streaming)

import requests
import json

response = requests.post(
    "http://localhost:8080/api/v1/crawl/stream",
    json={"url": "https://docs.example.com", "limit": 20},
    stream=True
)

for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: "):
            event = json.loads(decoded[6:])
            print(f"[{event['type']}] {event.get('current_url', '')}")
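To act on the event types listed earlier rather than just printing them, a dispatch sketch might look like this (the type field is taken from the example above; any per-event fields beyond it are assumptions):

```python
import json

def handle_sse_line(line: bytes, documents: list) -> bool:
    """Parse one SSE line, collect document events, return True on complete."""
    decoded = line.decode("utf-8")
    if not decoded.startswith("data: "):
        return False  # comments, keepalives, and blank lines carry no event
    event = json.loads(decoded[len("data: "):])
    kind = event.get("type")
    if kind == "document":
        documents.append(event)  # a completed page document
    elif kind == "error":
        print("error:", event)   # error for an individual URL
    elif kind == "status":
        pass                     # progress update; could report queue size here
    return kind == "complete"    # the final summary ends the stream
```

With requests, this would be called for each line of `response.iter_lines()`, stopping once it returns True.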
