Extract

Extract structured data from web pages using CSS selectors, LLM-based extraction, or both. Returns JSON objects conforming to your schema.

Endpoint

POST /api/v1/extract

Parameters

Parameter	Type	Default	Description
`urls`	string[]	—	Required. URLs to extract from (1-10)
`schema`	object	—	JSON Schema defining the desired output structure
`prompt`	string	—	Natural language instruction for LLM extraction
`selectors`	object	—	CSS selector mappings: `{"fieldName": "css.selector"}`
`mode`	string	`"auto"`	`"auto"`, `"css"`, or `"llm"`
`engine`	string	`"auto"`	`"auto"`, `"http"`, or `"browser"`
`timeout`	integer	`30000`	Timeout per URL in milliseconds
`llmBaseUrl`	string	—	OpenAI-compatible API base URL (required for `"llm"` mode)
`llmModel`	string	`"gpt-4o-mini"`	LLM model name
`llmApiKey`	string	—	API key for the LLM service

Extraction Modes

CSS Mode (`"css"`)

Rule-based extraction using CSS selectors. No external dependencies, no API keys.

Each key in selectors becomes a field in the output. The CSS selector is run against the page HTML, and the matched element's text content is extracted. If a schema is provided, values are coerced to the specified types.

LLM Mode (`"llm"`)

AI-powered extraction using any OpenAI-compatible API. Requires llmBaseUrl and optionally llmApiKey. The page content (as Markdown) is sent to the LLM along with your schema and prompt.

Supports both Chat Completions (/v1/chat/completions) and Responses (/v1/responses) API formats — auto-detected from the URL.

Auto Mode (`"auto"`, default)

Tries CSS extraction first (if selectors provided). If the result is less than 50% complete, falls back to LLM extraction (if credentials provided). If neither selectors nor LLM credentials are given, returns an error.

Examples

CSS Extraction

curl -X POST http://localhost:8080/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    "mode": "css",
    "selectors": {
      "title": "h1",
      "price": "p.price_color",
      "availability": "p.availability",
      "description": "#product_description ~ p"
    },
    "schema": {
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string"},
        "description": {"type": "string"}
      }
    }
  }'

LLM Extraction

curl -X POST http://localhost:8080/api/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/about"],
    "mode": "llm",
    "prompt": "Extract company information from this page",
    "schema": {
      "properties": {
        "companyName": {"type": "string"},
        "founded": {"type": "string"},
        "employees": {"type": "number"},
        "description": {"type": "string"}
      }
    },
    "llmBaseUrl": "https://api.openai.com",
    "llmModel": "gpt-4o-mini",
    "llmApiKey": "sk-..."
  }'

Python

import requests

response = requests.post("http://localhost:8080/api/v1/extract", json={
    "urls": ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    "mode": "css",
    "selectors": {
        "title": "h1",
        "price": "p.price_color"
    },
    "schema": {
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"}
        }
    }
})

data = response.json()
for item in data["data"]:
    print(f"{item['title']}: {item['price']}")

JavaScript

const response = await fetch("http://localhost:8080/api/v1/extract", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    urls: ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    mode: "css",
    selectors: { title: "h1", price: "p.price_color" },
    schema: {
      properties: {
        title: { type: "string" },
        price: { type: "number" },
      },
    },
  }),
});

const { data } = await response.json();
console.log(data[0].title, data[0].price);

Response

{
  "success": true,
  "data": [
    {
      "title": "A Light in the Attic",
      "price": 51.77,
      "availability": "In stock",
      "description": "It's hard to imagine a world without A Light in the Attic..."
    }
  ]
}

Each element in data corresponds to one URL from the request. If a URL fails, its entry will contain an error field instead.

Schema Type Coercion

When a schema is provided with CSS extraction, values are coerced:

Schema Type	Behavior
`"string"`	Raw text content (default)
`"number"` / `"integer"`	Strips non-numeric chars, parses as float
`"boolean"`	`"true"`, `"yes"`, `"1"` → `true`
`"array"`	Collects all matching elements

Extract

On this page