Extract
POST /api/v1/extract — structured data extraction
Extract structured data from web pages using CSS selectors, LLM-based extraction, or both. Returns JSON objects conforming to your schema.
Endpoint
POST /api/v1/extract
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | — | Required. URLs to extract from (1-10) |
| schema | object | — | JSON Schema defining the desired output structure |
| prompt | string | — | Natural language instruction for LLM extraction |
| selectors | object | — | CSS selector mappings: {"fieldName": "css.selector"} |
| mode | string | "auto" | "auto", "css", or "llm" |
| engine | string | "auto" | "auto", "http", or "browser" |
| timeout | integer | 30000 | Timeout per URL in milliseconds |
| llmBaseUrl | string | — | OpenAI-compatible API base URL (required for "llm" mode) |
| llmModel | string | "gpt-4o-mini" | LLM model name |
| llmApiKey | string | — | API key for the LLM service |
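The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, assuming a hypothetical helper named `build_extract_request` (it is not part of the API); it enforces only what the table states: 1 to 10 URLs, the allowed `mode` and `engine` values, and `llmBaseUrl` for "llm" mode.

```python
ALLOWED_MODES = {"auto", "css", "llm"}
ALLOWED_ENGINES = {"auto", "http", "browser"}


def build_extract_request(urls, mode="auto", engine="auto", timeout=30000, **extra):
    """Hypothetical client-side helper: builds the JSON body for POST /api/v1/extract."""
    if not 1 <= len(urls) <= 10:
        raise ValueError("urls must contain between 1 and 10 entries")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    if mode == "llm" and "llmBaseUrl" not in extra:
        raise ValueError('llmBaseUrl is required for "llm" mode')
    body = {"urls": list(urls), "mode": mode, "engine": engine, "timeout": timeout}
    # Optional fields (schema, prompt, selectors, llm*) are passed through untouched.
    body.update(extra)
    return body
```

The returned dictionary can be passed directly as the `json=` argument of `requests.post`.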
Extraction Modes
CSS Mode ("css")
Rule-based extraction using CSS selectors. No external dependencies, no API keys.
Each key in selectors becomes a field in the output. The CSS selector is run against the page HTML, and the matched element's text content is extracted. If a schema is provided, values are coerced to the specified types.
LLM Mode ("llm")
AI-powered extraction using any OpenAI-compatible API. Requires llmBaseUrl; llmApiKey is optional. The page content (as Markdown) is sent to the LLM along with your schema and prompt.
Supports both Chat Completions (/v1/chat/completions) and Responses (/v1/responses) API formats — auto-detected from the URL.
Auto Mode ("auto", default)
Tries CSS extraction first (if selectors provided). If the result is less than 50% complete, falls back to LLM extraction (if credentials provided). If neither selectors nor LLM credentials are given, returns an error.
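The auto-mode rules above can be sketched as a small decision function. This is an illustration only, not the service's implementation: the names `completeness` and `choose_strategy` are hypothetical, and "complete" is assumed to mean the fraction of requested fields that came back non-empty, since the exact definition is not given here.

```python
def completeness(result, fields):
    """Fraction of requested fields holding a non-empty value (an assumed definition)."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if result.get(name) not in (None, "", []))
    return filled / len(fields)


def choose_strategy(has_selectors, has_llm_credentials, css_result=None, fields=()):
    """Return "css" or "llm" following the auto-mode rules, or raise if neither is possible."""
    if not has_selectors and not has_llm_credentials:
        raise ValueError("auto mode needs selectors or LLM credentials")
    if has_selectors:
        if completeness(css_result or {}, fields) >= 0.5:
            return "css"
        # CSS result is less than 50% complete: fall back only if an LLM is configured.
        return "llm" if has_llm_credentials else "css"
    return "llm"
```

Under this assumed definition, a result with one of two fields filled is exactly 50% complete and therefore stays with the CSS result.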
Examples
CSS Extraction
curl -X POST http://localhost:8080/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    "mode": "css",
    "selectors": {
      "title": "h1",
      "price": "p.price_color",
      "availability": "p.availability",
      "description": "#product_description ~ p"
    },
    "schema": {
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string"},
        "description": {"type": "string"}
      }
    }
  }'
LLM Extraction
curl -X POST http://localhost:8080/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/about"],
    "mode": "llm",
    "prompt": "Extract company information from this page",
    "schema": {
      "properties": {
        "companyName": {"type": "string"},
        "founded": {"type": "string"},
        "employees": {"type": "number"},
        "description": {"type": "string"}
      }
    },
    "llmBaseUrl": "https://api.openai.com",
    "llmModel": "gpt-4o-mini",
    "llmApiKey": "sk-..."
  }'
Python
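import requests

# Request CSS extraction of two fields, coerced to the schema types.
response = requests.post("http://localhost:8080/api/v1/extract", json={
    "urls": ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    "mode": "css",
    "selectors": {
        "title": "h1",
        "price": "p.price_color"
    },
    "schema": {
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"}
        }
    }
})
response.raise_for_status()  # fail early on HTTP errors
data = response.json()

# One entry per requested URL; a failed URL carries an "error" field instead.
for item in data["data"]:
    if "error" in item:
        print(f"extraction failed: {item['error']}")
        continue
    print(f"{item['title']}: {item['price']}")
JavaScript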
const response = await fetch("http://localhost:8080/api/v1/extract", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    urls: ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    mode: "css",
    selectors: { title: "h1", price: "p.price_color" },
    schema: {
      properties: {
        title: { type: "string" },
        price: { type: "number" },
      },
    },
  }),
});
const { data } = await response.json();
console.log(data[0].title, data[0].price);
Response
{
  "success": true,
  "data": [
    {
      "title": "A Light in the Attic",
      "price": 51.77,
      "availability": "In stock",
      "description": "It's hard to imagine a world without A Light in the Attic..."
    }
  ]
}
Each element in data corresponds to one URL from the request. If a URL fails, its entry will contain an error field instead.
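Because a failed URL yields an entry with an error field, client code usually needs to separate successes from failures. A minimal sketch, assuming entries in data appear in the same order as the request's urls (the text says each element corresponds to one URL but does not state the ordering); `split_results` is a hypothetical helper name.

```python
def split_results(urls, data):
    """Pair each response entry with its request URL and separate failures."""
    succeeded, failed = {}, {}
    for url, entry in zip(urls, data):
        if "error" in entry:
            # Failed URL: keep only the error value reported for it.
            failed[url] = entry["error"]
        else:
            succeeded[url] = entry
    return succeeded, failed
```

Calling `split_results(request_urls, response_json["data"])` returns two dictionaries keyed by URL, so failed URLs can be retried without touching the successful ones.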
Schema Type Coercion
When a schema is provided with CSS extraction, values are coerced:
| Schema Type | Behavior |
|---|---|
| "string" | Raw text content (default) |
| "number" / "integer" | Strips non-numeric chars, parses as float |
| "boolean" | "true", "yes", "1" → true |
| "array" | Collects all matching elements |