Self-Hosted Web Crawler

Scrape the web.
Feed your LLM.

Crawlit turns any website into clean Markdown: no API key required for scraping, no per-page costs, no vendor-imposed rate limits. Self-hosted in one command.

docker compose up
One command. Chromium bundled. No host dependencies beyond Docker.

Features

Everything you need.
Nothing you don't.

One-command setup

Run docker compose up and you're live. No Node, Python, or other host dependencies beyond Docker.

Stealth browsing

Playwright + puppeteer-stealth handles JS-heavy sites, fingerprint detection, and bot-protected pages.

BFS crawling at scale

Distributed breadth-first crawl via BullMQ + Redis. Jobs survive restarts. URL deduplication built in.
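
As a rough illustration of breadth-first crawling with URL deduplication, here is a minimal in-memory sketch in Python. It is not Crawlit's implementation (which, per the description above, distributes work via BullMQ + Redis). The names bfs_crawl and get_links are hypothetical, and fragment stripping is an assumed normalization rule.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag


def bfs_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],  # hypothetical link fetcher
    max_depth: int = 2,
    limit: int = 100,
) -> list[str]:
    """Visit pages breadth-first, skipping URLs that were already queued."""
    start = urldefrag(start_url).url
    seen = {start}               # deduplication set
    queue = deque([(start, 0)])  # FIFO queue gives breadth-first order
    visited: list[str] = []

    while queue and len(visited) < limit:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            link = urldefrag(link).url  # assumed rule: /a and /a#top are one page
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

A real deployment would keep the queue and the seen set in a shared store so that jobs survive restarts; the FIFO ordering and the "check before enqueue" rule stay the same.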

LLM extraction

Schema-guided JSON via OpenAI or Anthropic. Pass a JSON Schema, get back typed structured data.

Clean Markdown

Mozilla Readability strips nav and ads. Turndown converts to Markdown with YAML frontmatter. LLM-ready.
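
To make "Markdown with YAML frontmatter" concrete, here is a small Python sketch of the final assembly step only. The function name and the metadata keys (title, url) are assumptions for illustration; the fields Crawlit actually emits are not listed here.

```python
import json


def with_frontmatter(markdown: str, meta: dict[str, str]) -> str:
    """Prepend a YAML frontmatter block to a Markdown body."""
    lines = ["---"]
    for key, value in meta.items():
        # json.dumps yields a double-quoted scalar, which is also valid YAML
        # and escapes colons, quotes, and newlines inside the value.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + markdown.lstrip("\n")


print(with_frontmatter("# Title\n\nBody", {"title": "A: B", "url": "https://example.com"}))
```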

Observable

Prometheus metrics at /metrics — scrape totals, durations, cache hits, browser pool size.
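
If you want to read /metrics without a Prometheus server, the text exposition format is easy to parse. The sketch below assumes samples carry no timestamps, and the metric names in the sample are hypothetical, not Crawlit's real series names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: no trailing timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


SAMPLE = """\
# HELP crawlit_scrapes_total Hypothetical counter name
# TYPE crawlit_scrapes_total counter
crawlit_scrapes_total{status="ok"} 42
crawlit_browser_pool_size 3
"""
metrics = parse_metrics(SAMPLE)
```

The input text can come from a plain GET against http://localhost:3000/metrics.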

Quickstart

Up and running in 60 seconds.

01

Clone the repo

No package manager needed on your machine.

git clone https://github.com/arufian/Crawlit.git
cd Crawlit

02

Configure (optional)

Copy the example env file. All defaults work out of the box — skip this for local use.

cp .env.example .env
# Add OPENAI_API_KEY or ANTHROPIC_API_KEY for LLM extraction

03

Start the crawler

First run downloads Chromium inside the container (~90 MB). Subsequent starts skip the download.

docker compose up --build
# API ready at http://localhost:3000

Try It Now

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
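
The same request can be sent from Python using only the standard library. This is a sketch: the helper names are hypothetical, and the response is returned as decoded JSON without assuming its shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # default API address from the quickstart


def build_scrape_request(url: str, formats: list[str]) -> urllib.request.Request:
    """Build the same POST /v1/scrape request as the curl example."""
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url: str, formats: list[str]) -> dict:
    """Send the request and decode the JSON response (shape not assumed)."""
    request = build_scrape_request(url, formats)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

With the container running, scrape("https://example.com", ["markdown"]) performs the call shown above.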

AI Agent Integration

Let AI agents control Crawlit.

Free. Open-source. Self-hosted. Now your AI coding assistant can scrape the web for you.

Crawlit is built for developers who want full control over their web scraping, with no vendor lock-in. The AI agent skill lets an AI coding assistant drive Crawlit directly.

Whether you're using Claude Code, OpenAI Codex, OpenCode, or any other AI tool, the skill provides a consistent interface to scrape, crawl, and map websites. Your AI agent becomes your web research assistant.

Install the Skill

# Clone the skill repository
git clone https://github.com/arufian/crawlit-skill.git

# Follow installation instructions for your AI tool
# Supports: Claude Code, Codex, OpenCode, and more

Works With Your Favorite Tools

Claude Code
OpenAI Codex
OpenCode
Any AI Tool

API Reference

Three endpoints. That's it.

All endpoints accept JSON and return JSON. Base URL: http://localhost:3000

POST /v1/scrape

Scrape a single page. Returns clean Markdown, HTML, links, or extracted structured data.

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["markdown", "links"],
    "mode": "browser",
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "topStory":  { "type": "string" },
          "points":    { "type": "number" }
        },
        "required": ["topStory"]
      },
      "prompt": "Extract the top story and its points"
    }
  }'
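
LLM output is worth checking before you trust it. The sketch below tests an extracted object against the required and type keywords of the schema from the request above. It is a deliberately tiny subset of JSON Schema, the function name is hypothetical, and where the extracted object sits in the response is not assumed.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "topStory": {"type": "string"},
        "points": {"type": "number"},
    },
    "required": ["topStory"],
}

PY_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def _matches(value: object, type_name: str) -> bool:
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    expected = PY_TYPES.get(type_name)
    return expected is None or isinstance(value, expected)


def check_extracted(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the object passed."""
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in data
    ]
    for key, spec in schema.get("properties", {}).items():
        if key in data and not _matches(data[key], spec.get("type", "")):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

For full JSON Schema support, a dedicated validator library is the better choice; this only covers the two keywords used in the example.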

POST /v1/crawl

Start an async BFS site crawl. Returns a job ID. Poll for results or cancel anytime.

# Start a crawl
curl -X POST http://localhost:3000/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "maxDepth": 2, "limit": 100, "save": true}'

# Poll status (set JOB_ID to the job ID returned by the crawl request)
curl "http://localhost:3000/v1/crawl/$JOB_ID"

# Cancel
curl -X DELETE "http://localhost:3000/v1/crawl/$JOB_ID"
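
Polling by hand gets old quickly, so here is a sketch of a polling loop in Python. The "status" field and the state names (completed, failed, cancelled) are assumptions; check an actual response from the status endpoint and adjust them. The status fetcher is passed in as a callable so the loop stays independent of the HTTP client.

```python
import time
from typing import Callable


def wait_for_crawl(
    get_status: Callable[[], dict],
    # Assumed terminal states; not confirmed by the API reference.
    done_states: frozenset = frozenset({"completed", "failed", "cancelled"}),
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status until the job reports a terminal state."""
    for _ in range(max_polls):
        job = get_status()
        if job.get("status") in done_states:  # "status" key is an assumption
            return job
        sleep(interval)
    raise TimeoutError("crawl did not finish within the polling budget")
```

Here get_status would wrap a GET to the status endpoint for your job ID and return the decoded JSON.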

POST /v1/map

Discover the URLs on a site, up to limit. Fast: tries sitemap.xml first, then falls back to link extraction.

curl -X POST http://localhost:3000/v1/map \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "limit": 1000}'
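
The "sitemap first, links second" strategy can be sketched in a few lines of Python. This illustrates the idea only and is not Crawlit's code: the function names are hypothetical, and the caller is assumed to have already fetched the sitemap text (or None) and the page HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value, ignoring the XML namespace prefix."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


class _LinkParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def urls_from_html(html: str, base_url: str) -> list[str]:
    """Collect <a href> targets and resolve them against base_url."""
    parser = _LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def map_site(sitemap_xml: Optional[str], html: str, base_url: str, limit: int = 1000) -> list[str]:
    urls: list[str] = []
    if sitemap_xml:
        try:
            urls = urls_from_sitemap(sitemap_xml)
        except ET.ParseError:
            urls = []  # malformed sitemap: fall through to link extraction
    if not urls:
        urls = urls_from_html(html, base_url)
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(urls))[:limit]
```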

Under the Hood

Built on boring, reliable parts.

TypeScript · Fastify · Playwright · BullMQ · Redis · Mozilla Readability · Turndown · Vercel AI SDK · Prometheus · Docker

Configuration

Configure via environment variables.

Copy .env.example to .env. All variables are optional — sensible defaults apply.

Variable             Default              Description
PORT                 3000                 API server port
REDIS_URL            redis://redis:6379   Redis connection string
WORKER_CONCURRENCY   5                    Concurrent crawl jobs per worker process
BROWSER_HEADLESS     true                 Run Playwright headless
CACHE_TTL_SECONDS    3600                 Redis cache TTL (1 hour default)
LLM_PROVIDER         openai               openai or anthropic
OPENAI_API_KEY       (none)               Required when using OpenAI for LLM extraction
ANTHROPIC_API_KEY    (none)               Required when using Anthropic for LLM extraction
PROXY_URL            (none)               Global residential proxy: http://user:pass@host:port
LOG_LEVEL            info                 debug · info · warn · error
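
To show how "optional with defaults" plays out, here is a Python sketch that resolves the non-secret variables above from an environment mapping. Crawlit itself is written in TypeScript, so this only mirrors the documented defaults. The Settings class, load_settings, and the "true"/"false" parsing rule are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    port: int
    redis_url: str
    worker_concurrency: int
    browser_headless: bool
    cache_ttl_seconds: int
    llm_provider: str
    proxy_url: Optional[str]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable, falling back to the documented default."""
    return Settings(
        port=int(env.get("PORT", "3000")),
        redis_url=env.get("REDIS_URL", "redis://redis:6379"),
        worker_concurrency=int(env.get("WORKER_CONCURRENCY", "5")),
        # Assumed parsing rule: only the literal "true" enables headless mode.
        browser_headless=env.get("BROWSER_HEADLESS", "true").lower() == "true",
        cache_ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "3600")),
        llm_provider=env.get("LLM_PROVIDER", "openai"),
        proxy_url=env.get("PROXY_URL") or None,  # unset or empty means no proxy
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Calling load_settings({}) yields the defaults; passing {"PORT": "8080"} overrides only the port.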

Open Source · MIT License

Ready to crawl?

Self-hosted. Free forever. Yours to extend.