# API Reference

## Crawl

### POST /api/v1/crawl — multi-page traversal

Multi-page traversal with URL frontier, deduplication, robots.txt compliance, and rate limiting.

### Endpoint
```
POST /api/v1/crawl
```

A streaming variant is also available:

```
POST /api/v1/crawl/stream
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | — | Required. Starting URL |
| maxDepth | integer | 2 | Maximum link depth to follow |
| limit | integer | 100 | Maximum pages to crawl |
| includePaths | string[] | — | Glob patterns for URLs to include (e.g., ["/blog/*"]) |
| excludePaths | string[] | — | Glob patterns for URLs to exclude (e.g., ["/admin/*"]) |
| allowBackwardLinks | boolean | false | Follow links up the URL hierarchy |
| allowExternalLinks | boolean | false | Follow links to external domains |
| detectPagination | boolean | true | Auto-follow pagination links |
| maxPaginationPages | integer | 50 | Maximum pagination pages to follow |
| ignoreSitemap | boolean | false | Skip sitemap.xml discovery |
| useParallel | boolean | false | Use the parallel crawler |
All scrape parameters (formats, engine, timeout, etc.) also apply to each crawled page.
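The path filters use glob patterns. As a rough mental model of the include/exclude logic (a sketch only; the server's exact matching rules may differ), the filtering can be approximated with Python's `fnmatch`:

```python
from fnmatch import fnmatch

def url_allowed(path, include=None, exclude=None):
    """Approximate the includePaths/excludePaths filter: a URL path is
    kept if it matches no exclude pattern and, when include patterns
    are given, matches at least one of them."""
    if exclude and any(fnmatch(path, pat) for pat in exclude):
        return False
    if include:
        return any(fnmatch(path, pat) for pat in include)
    return True

url_allowed("/blog/post-1", include=["/blog/*"])   # True
url_allowed("/admin/users", exclude=["/admin/*"])  # False
```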
## Examples

### cURL
```bash
curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"]
  }'
```

### Python
```python
import requests

response = requests.post("http://localhost:8080/api/v1/crawl", json={
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"],
})
data = response.json()

for page in data["data"]:
    print(f"{page['metadata']['title']}: {page['metadata']['url']}")
```

### JavaScript
```javascript
const response = await fetch("http://localhost:8080/api/v1/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 2,
    limit: 50,
    includePaths: ["/docs/*"],
    formats: ["markdown"],
  }),
});

const data = await response.json();
data.data.forEach((page) => {
  console.log(`${page.metadata.title}: ${page.metadata.url}`);
});
```

## Streaming Crawl
Use the streaming endpoint to receive results as pages are crawled:
```bash
curl -X POST http://localhost:8080/api/v1/crawl/stream \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 20
  }'
```

The response uses Server-Sent Events (SSE) with these event types:
| Event Type | Description |
|---|---|
| status | Progress updates (pages crawled, queue size, current URL) |
| document | A completed page document |
| error | Error for an individual URL |
| complete | Final summary with totals |
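Each SSE frame arrives as a `data: <json>` line on the response body. A minimal parser sketch (assuming, as in the Python streaming example on this page, that each event's JSON payload fits on a single data line):

```python
import json

def iter_sse_events(lines):
    """Yield one event dict per 'data: ...' line of an SSE stream.
    Minimal sketch: multi-line SSE data frames and other SSE fields
    (event:, id:, retry:) are not handled."""
    for raw in lines:
        if raw.startswith("data: "):
            yield json.loads(raw[len("data: "):])
```

Feed it decoded lines from `response.iter_lines()` and dispatch on each event's type.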
## Response (Non-Streaming)
The standard /api/v1/crawl endpoint returns all results at once:
```json
{
  "success": true,
  "data": [
    {
      "title": "Docs - Getting Started",
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nWelcome...",
      "metadata": {
        "title": "Docs - Getting Started",
        "url": "https://docs.example.com/getting-started",
        "statusCode": 200,
        "contentType": "text/html",
        "wordCount": 450,
        "readingTime": 2
      }
    }
  ]
}
```

### Python (Streaming)
```python
import requests
import json

response = requests.post(
    "http://localhost:8080/api/v1/crawl/stream",
    json={"url": "https://docs.example.com", "limit": 20},
    stream=True,
)

for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: "):
            event = json.loads(decoded[6:])
            print(f"[{event['type']}] {event.get('current_url', '')}")
```
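To consume the stream programmatically rather than just print it, a common pattern is to fold the events into a list of pages and stop at the `complete` event. A sketch, assuming `document` events carry the page payload in a `data` field (the exact field name may differ; inspect your server's actual event shape):

```python
import json

def collect_documents(sse_lines):
    """Fold decoded 'data: ...' SSE lines into a list of crawled pages,
    stopping when the "complete" event arrives. The "data" field on
    "document" events is an assumption about the payload shape."""
    documents = []
    for decoded in sse_lines:
        if not decoded.startswith("data: "):
            continue
        event = json.loads(decoded[6:])
        if event["type"] == "document":
            documents.append(event.get("data"))
        elif event["type"] == "complete":
            break
    return documents
```

With the streaming request above, call it as `collect_documents(line.decode("utf-8") for line in response.iter_lines() if line)`.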