# API Reference

## Crawl

### POST /api/v1/crawl — multi-page traversal

Multi-page traversal with URL frontier, deduplication, robots.txt compliance, and rate limiting.

### Endpoint
```
POST /api/v1/crawl
```

A streaming variant is also available:

```
POST /api/v1/crawl/stream
```

### Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | — | Required. Starting URL |
| maxDepth | integer | 2 | Maximum link depth to follow |
| limit | integer | 100 | Maximum pages to crawl |
| includePaths | string[] | — | Glob patterns for URLs to include (e.g., ["/blog/*"]) |
| excludePaths | string[] | — | Glob patterns for URLs to exclude (e.g., ["/admin/*"]) |
| allowBackwardLinks | boolean | false | Follow links up the URL hierarchy |
| allowExternalLinks | boolean | false | Follow links to external domains |
| detectPagination | boolean | true | Auto-follow pagination links |
| maxPaginationPages | integer | 50 | Maximum pagination pages to follow |
| ignoreSitemap | boolean | false | Skip sitemap.xml discovery |
| useParallel | boolean | false | Use the parallel crawler |
All scrape parameters (formats, engine, timeout, etc.) also apply to each crawled page.
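The path filters use glob patterns. As a rough mental model of the include/exclude logic (a sketch only; the server's exact matching rules may differ), the filtering can be approximated with Python's `fnmatch`:

```python
from fnmatch import fnmatch

def url_allowed(path, include=None, exclude=None):
    """Approximate the includePaths/excludePaths filter: a URL path is
    kept if it matches no exclude pattern and, when include patterns
    are given, matches at least one of them."""
    if exclude and any(fnmatch(path, pat) for pat in exclude):
        return False
    if include:
        return any(fnmatch(path, pat) for pat in include)
    return True

url_allowed("/blog/post-1", include=["/blog/*"])   # True
url_allowed("/admin/users", exclude=["/admin/*"])  # False
```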
## Examples

### cURL
```bash
curl -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"]
  }'
```

### Python
```python
import requests

response = requests.post("http://localhost:8080/api/v1/crawl", json={
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 50,
    "includePaths": ["/docs/*"],
    "formats": ["markdown"],
})
data = response.json()

for page in data["data"]:
    print(f"{page['metadata']['title']}: {page['metadata']['url']}")
```

### JavaScript
```javascript
const response = await fetch("http://localhost:8080/api/v1/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://docs.example.com",
    maxDepth: 2,
    limit: 50,
    includePaths: ["/docs/*"],
    formats: ["markdown"],
  }),
});

const data = await response.json();
data.data.forEach((page) => {
  console.log(`${page.metadata.title}: ${page.metadata.url}`);
});
```

## Streaming Crawl
Use the streaming endpoint to receive results as pages are crawled:
```bash
curl -X POST http://localhost:8080/api/v1/crawl/stream \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 2,
    "limit": 20
  }'
```

The response uses Server-Sent Events (SSE) with these event types:
| Event Type | Description |
|---|---|
| status | Progress updates (pages crawled, queue size, current URL) |
| document | A completed page document |
| error | Error for an individual URL |
| complete | Final summary with totals |
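Each SSE frame arrives as a `data: <json>` line on the response body. A minimal parser sketch (assuming, as in the Python streaming example on this page, that each event's JSON payload fits on a single data line):

```python
import json

def iter_sse_events(lines):
    """Yield one event dict per 'data: ...' line of an SSE stream.
    Minimal sketch: multi-line SSE data frames and other SSE fields
    (event:, id:, retry:) are not handled."""
    for raw in lines:
        if raw.startswith("data: "):
            yield json.loads(raw[len("data: "):])
```

Feed it decoded lines from `response.iter_lines()` and dispatch on each event's type.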
## Response (Non-Streaming)
The standard /api/v1/crawl endpoint returns all results at once:
```json
{
  "success": true,
  "data": [
    {
      "title": "Docs - Getting Started",
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nWelcome...",
      "metadata": {
        "title": "Docs - Getting Started",
        "url": "https://docs.example.com/getting-started",
        "statusCode": 200,
        "contentType": "text/html",
        "wordCount": 450,
        "readingTime": 2
      }
    }
  ]
}
```

### Python (Streaming)
```python
import requests
import json

response = requests.post(
    "http://localhost:8080/api/v1/crawl/stream",
    json={"url": "https://docs.example.com", "limit": 20},
    stream=True,
)

for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: "):
            event = json.loads(decoded[6:])
            print(f"[{event['type']}] {event.get('current_url', '')}")
```
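To consume the stream programmatically rather than just print it, a common pattern is to fold the events into a list of pages and stop at the `complete` event. A sketch, assuming `document` events carry the page payload in a `data` field (the exact field name may differ; inspect your server's actual event shape):

```python
import json

def collect_documents(sse_lines):
    """Fold decoded 'data: ...' SSE lines into a list of crawled pages,
    stopping when the "complete" event arrives. The "data" field on
    "document" events is an assumption about the payload shape."""
    documents = []
    for decoded in sse_lines:
        if not decoded.startswith("data: "):
            continue
        event = json.loads(decoded[6:])
        if event["type"] == "document":
            documents.append(event.get("data"))
        elif event["type"] == "complete":
            break
    return documents
```

With the streaming request above, call it as `collect_documents(line.decode("utf-8") for line in response.iter_lines() if line)`.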