One-command setup
Run docker compose up and you're live. No Node, Python, or host dependencies required.
Self-Hosted Web Crawler
Crawlit turns any website into clean Markdown — no API key required, no per-page costs, no rate limits. Self-hosted in one command.
docker compose up
One command. Chromium bundled. Zero host deps.
Features
Run docker compose up and you're live. No Node, Python, or host dependencies required.
Playwright + puppeteer-stealth handles JS-heavy sites, fingerprint detection, and bot-protected pages.
Distributed breadth-first crawl via BullMQ + Redis. Jobs survive restarts. URL deduplication built in.
Schema-guided JSON via OpenAI or Anthropic. Pass a JSON Schema, get back typed structured data.
Mozilla Readability strips nav and ads. Turndown converts to Markdown with YAML frontmatter. LLM-ready.
Prometheus metrics at /metrics — scrape totals, durations, cache hits, browser pool size.
Quickstart
No package manager needed on your machine.
git clone https://github.com/arufian/Crawlit.git
cd Crawlit
Copy the example env file. All defaults work out of the box — skip this for local use.
cp .env.example .env
# Add OPENAI_API_KEY or ANTHROPIC_API_KEY for LLM extraction
First run downloads Chromium inside the container (~90MB). Subsequent starts are instant.
docker compose up --build
# API ready at http://localhost:3000
Try It Now
curl -X POST http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'
AI Agent Integration
Free. Open-source. Self-hosted. Now your AI coding assistant can scrape the web for you.
Crawlit is built for developers who want full control over their web scraping — no per-page costs, no rate limits, no vendor lock-in. With the new AI agent skill, any AI coding assistant can now control Crawlit directly.
Whether you're using Claude Code, OpenAI Codex, OpenCode, or any other AI tool, the skill provides a consistent interface to scrape, crawl, and map websites. Your AI agent becomes your web research assistant.
# Clone the skill repository
git clone https://github.com/arufian/crawlit-skill.git
# Follow installation instructions for your AI tool
# Supports: Claude Code, Codex, OpenCode, and more
API Reference
All endpoints accept JSON and return JSON. Base URL: http://localhost:3000
/v1/scrape
Scrape a single page. Returns clean Markdown, HTML, links, or extracted structured data.
curl -X POST http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://news.ycombinator.com",
"formats": ["markdown", "links"],
"mode": "browser",
"extract": {
"schema": {
"type": "object",
"properties": {
"topStory": { "type": "string" },
"points": { "type": "number" }
},
"required": ["topStory"]
},
"prompt": "Extract the top story and its points"
}
}'
{
"success": true,
"data": {
"markdown": "# Hacker News\n\n...",
"links": ["https://example.com/story-1", "..."],
"metadata": {
"title": "Hacker News",
"url": "https://news.ycombinator.com",
"scraped_at": "2026-04-27T10:00:00.000Z"
},
"extract": {
"topStory": "Show HN: Crawlit — self-hosted web crawler",
"points": 312
}
}
}
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | Page URL to scrape |
formats | array | ["markdown"] | markdown, html, links, rawHtml |
mode | string | "http" | http (fast) or browser (JS rendering) |
waitFor | number | — | ms to wait after page load (browser mode) |
actions | array | — | Click/scroll/type actions before capture |
proxy | string | — | Per-request proxy: http://user:pass@host:port |
save | boolean | false | Save Markdown to ./output/<host>/ |
extract | object | — | LLM extraction config — schema + prompt |
skipCache | boolean | false | Bypass Redis cache for this request |
timeout | number | 30000 | Request timeout in ms |
/v1/crawl
Start an async BFS site crawl. Returns a job ID. Poll for results or cancel anytime.
# Start a crawl
curl -X POST http://localhost:3000/v1/crawl \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.example.com", "maxDepth": 2, "limit": 100, "save": true}'
# Poll status
curl http://localhost:3000/v1/crawl/<job-id>
# Cancel
curl -X DELETE http://localhost:3000/v1/crawl/<job-id>
// POST — start response
{ "success": true, "id": "abc123", "url": "/v1/crawl/abc123" }
// GET — poll response
{
"status": "running",
"completed": 14,
"total": 47,
"data": [
{
"url": "https://docs.example.com/guide",
"markdown": "# Guide\n\n...",
"metadata": { "title": "Guide", "scraped_at": "2026-04-27T10:00:00.000Z" }
}
]
}
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | Seed URL to begin crawl from |
maxDepth | number | 3 | Max link depth from the seed URL |
limit | number | 100 | Max total pages to crawl |
allowedDomains | array | seed domain | Restrict crawl to these domains |
mode | string | "http" | http or browser |
formats | array | ["markdown"] | Output formats per page |
onlyMainContent | boolean | true | Strip nav/ads via Readability |
save | boolean | false | Write each page to ./output/ |
proxy | string | — | Proxy applied to every page in this crawl |
/v1/map
Discover all URLs on a site. Fast — tries sitemap.xml first, falls back to link extraction.
curl -X POST http://localhost:3000/v1/map \
-H "Content-Type: application/json" \
-d '{"url": "https://docs.example.com", "limit": 1000}'
{
"success": true,
"links": [
"https://docs.example.com/",
"https://docs.example.com/getting-started",
"https://docs.example.com/api-reference",
"..."
],
"total": 342
}
| Field | Type | Default | Description |
|---|---|---|---|
url | string | required | Site URL to map |
limit | number | 1000 | Max URLs to return |
Under the Hood
Configuration
Copy .env.example to .env. All variables are optional — sensible defaults apply.
| Variable | Default | Description |
|---|---|---|
PORT | 3000 | API server port |
REDIS_URL | redis://redis:6379 | Redis connection string |
WORKER_CONCURRENCY | 5 | Concurrent crawl jobs per worker process |
BROWSER_HEADLESS | true | Run Playwright headless |
CACHE_TTL_SECONDS | 3600 | Redis cache TTL (1 hour default) |
LLM_PROVIDER | openai | openai or anthropic |
OPENAI_API_KEY | — | Required when using OpenAI for LLM extraction |
ANTHROPIC_API_KEY | — | Required when using Anthropic for LLM extraction |
PROXY_URL | — | Global residential proxy: http://user:pass@host:port |
LOG_LEVEL | info | debug · info · warn · error |
Open Source · MIT License
Self-hosted. Free forever. Yours to extend.