Crawlit — Self-Hosted Web Crawler for AI Workflows

Features

Everything you need.
Nothing you don't.

One-command setup

Run docker compose up and you're live. No Node, Python, or host dependencies required.

Stealth browsing

Playwright + puppeteer-stealth handles JS-heavy sites, fingerprint detection, and bot-protected pages.

BFS crawling at scale

Distributed breadth-first crawl via BullMQ + Redis. Jobs survive restarts. URL deduplication built in.

LLM extraction

Schema-guided JSON via OpenAI or Anthropic. Pass a JSON Schema, get back typed structured data.

Clean Markdown

Mozilla Readability strips nav and ads. Turndown converts to Markdown with YAML frontmatter. LLM-ready.

Observable

Prometheus metrics at /metrics — scrape totals, durations, cache hits, browser pool size.

Quickstart

Up and running in 60 seconds.

01

Clone the repo

No package manager needed on your machine.

git clone https://github.com/arufian/Crawlit.git
cd Crawlit

02

Configure (optional)

Copy the example env file. All defaults work out of the box — skip this for local use.

cp .env.example .env
# Add OPENAI_API_KEY or ANTHROPIC_API_KEY for LLM extraction

03

Start the crawler

First run downloads Chromium inside the container (~90MB). Subsequent starts are instant.

docker compose up --build
# API ready at http://localhost:3000

Try It Now

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

AI Agent Integration

Let AI agents control Crawlit.

Free. Open-source. Self-hosted. Now your AI coding assistant can scrape the web for you.

Crawlit is built for developers who want full control over their web scraping — no per-page costs, no rate limits, no vendor lock-in. With the new AI agent skill, any AI coding assistant can now control Crawlit directly.

Whether you're using Claude Code, OpenAI Codex, OpenCode, or any other AI tool, the skill provides a consistent interface to scrape, crawl, and map websites. Your AI agent becomes your web research assistant.

Install the Skill

# Clone the skill repository
git clone https://github.com/arufian/crawlit-skill.git

# Follow installation instructions for your AI tool
# Supports: Claude Code, Codex, OpenCode, and more

Works With Your Favorite Tools

Claude Code

OpenAI Codex

OpenCode

Any AI Tool

View on GitHub

API Reference

Three endpoints. That's it.

All endpoints accept JSON and return JSON. Base URL: http://localhost:3000

POST /v1/scrape

Scrape a single page. Returns clean Markdown, HTML, links, or extracted structured data.

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["markdown", "links"],
    "mode": "browser",
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "topStory":  { "type": "string" },
          "points":    { "type": "number" }
        },
        "required": ["topStory"]
      },
      "prompt": "Extract the top story and its points"
    }
  }'

{
  "success": true,
  "data": {
    "markdown": "# Hacker News\n\n...",
    "links": ["https://example.com/story-1", "..."],
    "metadata": {
      "title": "Hacker News",
      "url": "https://news.ycombinator.com",
      "scraped_at": "2026-04-27T10:00:00.000Z"
    },
    "extract": {
      "topStory": "Show HN: Crawlit — self-hosted web crawler",
      "points": 312
    }
  }
}

Field	Type	Default	Description
`url`	string	required	Page URL to scrape
`formats`	array	`["markdown"]`	`markdown`, `html`, `links`, `rawHtml`
`mode`	string	`"http"`	`http` (fast) or `browser` (JS rendering)
`waitFor`	number	—	ms to wait after page load (browser mode)
`actions`	array	—	Click/scroll/type actions before capture
`proxy`	string	—	Per-request proxy: `http://user:pass@host:port`
`save`	boolean	`false`	Save Markdown to `./output/<host>/`
`extract`	object	—	LLM extraction config — schema + prompt
`skipCache`	boolean	`false`	Bypass Redis cache for this request
`timeout`	number	`30000`	Request timeout in ms

POST /v1/crawl

Start an async BFS site crawl. Returns a job ID. Poll for results or cancel anytime.

# Start a crawl
curl -X POST http://localhost:3000/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "maxDepth": 2, "limit": 100, "save": true}'

# Poll status
curl http://localhost:3000/v1/crawl/<job-id>

# Cancel
curl -X DELETE http://localhost:3000/v1/crawl/<job-id>

// POST — start response
{ "success": true, "id": "abc123", "url": "/v1/crawl/abc123" }

// GET — poll response
{
  "status": "running",
  "completed": 14,
  "total": 47,
  "data": [
    {
      "url": "https://docs.example.com/guide",
      "markdown": "# Guide\n\n...",
      "metadata": { "title": "Guide", "scraped_at": "2026-04-27T10:00:00.000Z" }
    }
  ]
}

Field	Type	Default	Description
`url`	string	required	Seed URL to begin crawl from
`maxDepth`	number	`3`	Max link depth from the seed URL
`limit`	number	`100`	Max total pages to crawl
`allowedDomains`	array	seed domain	Restrict crawl to these domains
`mode`	string	`"http"`	`http` or `browser`
`formats`	array	`["markdown"]`	Output formats per page
`onlyMainContent`	boolean	`true`	Strip nav/ads via Readability
`save`	boolean	`false`	Write each page to `./output/`
`proxy`	string	—	Proxy applied to every page in this crawl

POST /v1/map

Discover all URLs on a site. Fast — tries sitemap.xml first, falls back to link extraction.

curl -X POST http://localhost:3000/v1/map \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "limit": 1000}'

{
  "success": true,
  "links": [
    "https://docs.example.com/",
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "..."
  ],
  "total": 342
}

Field	Type	Default	Description
`url`	string	required	Site URL to map
`limit`	number	`1000`	Max URLs to return

Under the Hood

Built on boring, reliable parts.

TypeScript Fastify Playwright BullMQ Redis Mozilla Readability Turndown Vercel AI SDK Prometheus Docker

Configuration

Configure via environment variables.

Copy .env.example to .env. All variables are optional — sensible defaults apply.

Variable	Default	Description
`PORT`	`3000`	API server port
`REDIS_URL`	`redis://redis:6379`	Redis connection string
`WORKER_CONCURRENCY`	`5`	Concurrent crawl jobs per worker process
`BROWSER_HEADLESS`	`true`	Run Playwright headless
`CACHE_TTL_SECONDS`	`3600`	Redis cache TTL (1 hour default)
`LLM_PROVIDER`	`openai`	`openai` or `anthropic`
`OPENAI_API_KEY`	—	Required when using OpenAI for LLM extraction
`ANTHROPIC_API_KEY`	—	Required when using Anthropic for LLM extraction
`PROXY_URL`	—	Global residential proxy: `http://user:pass@host:port`
`LOG_LEVEL`	`info`	`debug` · `info` · `warn` · `error`

Open Source · MIT License

Ready to crawl?

Self-hosted. Free forever. Yours to extend.

Star on GitHub Read the docs

Scrape the web.Feed your LLM.

Everything you need.Nothing you don't.