
open extractor agent

an inspectable ai web extraction agent.

Paste a URL, describe the data you want, and watch a local agent capture the page, plan an extraction, write Python, execute it, and return structured output.

fastapi + playwright · open-source · mit · local web app · case study · jun 2026

web extraction should show its work, not hide inside a scraper box.

Open Extractor Agent is a local web app for turning public websites into structured data. You give it a URL and a plain-English goal, and it builds the extraction pipeline in front of you: capture, scouts, planner, code writer, execution, presenter.

The point is not only scraping. The point is inspectability. You can watch the flow diagram, read the plan, inspect the generated Python, see the raw trace, stop a run, and understand why the agent produced the result.

at a glance

the shape of it

main job: turn a public URL and plain-English request into structured data
runtime: local FastAPI web app, opened in your browser
capture: Playwright for JS-rendered pages, httpx for simpler fetches
agent flow: structure capture, parallel scouts, planner, code writer, execution, presenter
outputs: CSV, HTML table, JSON, Markdown, table views, downloadable results
providers: OpenAI, Claude, DeepSeek, Grok, MiniMax, OpenRouter, custom endpoints
visibility: live SVG flow, token events, stop control, inspector, generated Python
license: open-source · MIT

what it does

describe the data, get the file

It is built for the messy first pass: you know the website and the data you need, but you do not want to hand-write selectors before you have even proven the extraction works.

URL: extract every product name, price, rating, and detail link as CSV
URL: crawl all pages and return a clean table of job titles, companies, and apply links
URL: find every pricing tier, feature, and limit, then return JSON
URL: collect article titles, authors, dates, tags, and canonical links
URL: extract forms, hidden fields, API hints, and visible table data
URL: summarize the result and give me a Markdown table I can paste into Notion

architecture

the agent pipeline

Figure: Open Extractor Agent architecture, showing URL input, FastAPI orchestrator, Playwright capture, parser, API inspector, agent harness, Python extractor, execution sandbox and final result.

how it works

a visible extraction loop

url -> capture -> site context -> scouts -> planner -> python -> execute -> repair -> present (looping until usable output)

The app treats the page like evidence. It captures rendered structure, static HTML, visible tables, forms and network hints, then gives agents narrow jobs before generating Python. That keeps the output grounded in what the page actually exposes.

01 · compose

Paste a URL, describe the extraction goal, choose output format, max pages, provider, model, and whether browser rendering should run.

02 · orchestrate

FastAPI streams NDJSON events so the interface can show progress instead of waiting for one silent response.
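
The streaming idea can be sketched without the web framework. NDJSON is one JSON document per line, so the interface can render each event as soon as its line arrives. The event names below (`stage`, `status`, `tokens`) are assumptions for illustration, not the app's actual event schema. In a FastAPI app, a generator like `to_ndjson` could plausibly be handed to `StreamingResponse`, but that wiring is not shown here.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def to_ndjson(events: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield one compact JSON document per line (NDJSON)."""
    for event in events:
        # json.dumps escapes newlines inside strings, so each event stays on one line.
        yield json.dumps(event, separators=(",", ":")) + "\n"


def parse_ndjson(chunks: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Reassemble events from arbitrary text chunks, as a streaming client would."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a partial line waits for the next chunk.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)


# Hypothetical events; the real app's fields may differ.
events = [
    {"stage": "capture", "status": "started"},
    {"stage": "planner", "status": "done", "tokens": 412},
]
stream = "".join(to_ndjson(events))

# Simulate the network splitting the stream at arbitrary boundaries.
chunks = [stream[i:i + 7] for i in range(0, len(stream), 7)]
decoded = list(parse_ndjson(chunks))
```

Because each event is a full line, the client never has to wait for the whole response to parse the first one.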

03 · capture

Playwright reads rendered DOM, visible text, screenshots and interaction traces; httpx and parsers collect static structure, links, forms and tables.

04 · package context

The app turns raw page evidence into a compact site context package: DOM summary, network hints, goal memory and discovered tables.

05 · scout in parallel

Specialized scouts inspect structure, network/API hints, page patterns and extraction risks before the planner decides what to do.
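
A minimal sketch of the parallel scout step, assuming each scout is an async function that reads the same context package and returns a small report. The scout names, the context keys (`tables`, `requests`, `requires_js`) and the report shapes are hypothetical; real scouts would call an LLM rather than inspect a dict.

```python
import asyncio
from typing import Any, Dict, List

Context = Dict[str, Any]


async def structure_scout(context: Context) -> Dict[str, Any]:
    # Hypothetical scout: counts the tables found during capture.
    await asyncio.sleep(0)  # stand-in for an LLM or I/O call
    return {"scout": "structure", "tables": len(context.get("tables", []))}


async def network_scout(context: Context) -> Dict[str, Any]:
    # Hypothetical scout: flags requests that look like JSON APIs.
    await asyncio.sleep(0)
    hints = [url for url in context.get("requests", []) if "/api/" in url]
    return {"scout": "network", "api_hints": hints}


async def risk_scout(context: Context) -> Dict[str, Any]:
    # Hypothetical scout: notes conditions that make extraction fragile.
    await asyncio.sleep(0)
    risks = ["content needs browser rendering"] if context.get("requires_js") else []
    return {"scout": "risk", "risks": risks}


async def run_scouts(context: Context) -> List[Dict[str, Any]]:
    """Run all scouts concurrently and keep the reports that succeeded."""
    results = await asyncio.gather(
        structure_scout(context),
        network_scout(context),
        risk_scout(context),
        return_exceptions=True,  # one failing scout should not sink the run
    )
    return [r for r in results if not isinstance(r, BaseException)]


context = {
    "tables": [["name", "price"]],
    "requests": ["https://example.com/api/products", "https://example.com/app.js"],
    "requires_js": True,
}
reports = asyncio.run(run_scouts(context))
```

`asyncio.gather` returns results in the order the scouts were passed, so the planner can rely on a stable report order even though the scouts run concurrently.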

06 · plan

The planner emits strict JSON: what data to extract, which selectors or APIs matter, how many pages to crawl, and which output format fits.
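
"Strict JSON" only helps if the output is checked before anything runs. The sketch below shows one way to do that. The key names (`fields`, `selectors`, `max_pages`, `output_format`) are an assumed schema for illustration; the app's real plan protocol is not documented here.

```python
import json
from typing import Any, Dict

# Hypothetical schema: illustrative key names, not the app's real protocol.
REQUIRED_KEYS = {"fields": list, "selectors": dict, "max_pages": int, "output_format": str}
ALLOWED_FORMATS = {"csv", "json", "markdown", "html"}


def parse_plan(raw: str) -> Dict[str, Any]:
    """Parse planner output and reject anything that is not a well-formed plan."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if key not in plan:
            raise ValueError(f"plan is missing '{key}'")
        value = plan[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"'{key}' must be of type {expected.__name__}")
    if plan["max_pages"] < 1:
        raise ValueError("'max_pages' must be at least 1")
    if plan["output_format"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {plan['output_format']}")
    return plan


raw_plan = (
    '{"fields": ["name", "price"],'
    ' "selectors": {"name": ".product h2", "price": ".product .price"},'
    ' "max_pages": 3, "output_format": "csv"}'
)
plan = parse_plan(raw_plan)
```

A rejected plan produces a specific error message, which a repair step could feed back to the model instead of retrying blind.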

07 · write code

A code-writing agent generates a one-off Python extractor using Playwright, httpx, BeautifulSoup and custom selectors for that page.

08 · execute and repair

The script runs in a subprocess with a wall-clock timeout. If the output is empty or wrong, an evaluation loop can repair the plan or the generated code.
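
A minimal sketch of the execute step, assuming the generated script prints its result to stdout. `run_extractor` and its return shape are hypothetical, and passing the code with `-c` is a simplification. As the limitations below state, a subprocess with a timeout bounds run time but is not a sandbox.

```python
import subprocess
import sys
from typing import Dict


def run_extractor(code: str, timeout_s: float = 30.0) -> Dict[str, object]:
    """Run generated Python in a child process and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "timeout", "stdout": "", "stderr": ""}

    if proc.returncode != 0:
        reason = "error"
    elif not proc.stdout.strip():
        reason = "empty output"  # a signal that repair may be needed
    else:
        reason = "ok"
    return {
        "ok": reason == "ok",
        "reason": reason,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


good = run_extractor("import json; print(json.dumps([{'title': 'Example'}]))")
slow = run_extractor("import time; time.sleep(5)", timeout_s=0.5)
empty = run_extractor("pass")
```

The `reason` field is what an evaluation loop would branch on: `timeout` and `error` point at the code, while `empty output` may point at the plan or the selectors.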

09 · present

The presenter renders a table, JSON, CSV, Markdown or summary, while the inspector keeps the plan, generated Python and raw trace available.
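
The presenter step is mostly formatting. Here is a sketch of rendering the same rows as CSV, JSON or a Markdown table with the standard library. `render_rows` is a hypothetical helper, and it assumes every row shares the keys of the first row.

```python
import csv
import io
import json
from typing import Dict, List


def render_rows(rows: List[Dict[str, str]], fmt: str) -> str:
    """Render extracted rows as CSV, JSON, or a Markdown table."""
    if not rows:
        return ""
    columns = list(rows[0].keys())  # assumes uniform keys across rows
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    if fmt == "markdown":
        header = "| " + " | ".join(columns) + " |"
        divider = "| " + " | ".join("---" for _ in columns) + " |"
        body = [
            # Escape pipes so cell text cannot break the table layout.
            "| " + " | ".join(str(row.get(c, "")).replace("|", "\\|") for c in columns) + " |"
            for row in rows
        ]
        return "\n".join([header, divider, *body])
    raise ValueError(f"unsupported format: {fmt}")


rows = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget, large", "price": "19.50"},
]
as_csv = render_rows(rows, "csv")
as_markdown = render_rows(rows, "markdown")
```

The `csv` module quotes the second name because it contains a comma, which is the kind of detail that hand-built string joins tend to miss.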

FastAPI · Uvicorn · Playwright · httpx · BeautifulSoup · Python · Vanilla JS · SVG · NDJSON

core product

what makes it useful

plain-English extraction

You describe the data you want instead of writing selectors first: products, prices, links, tables, APIs, forms, article metadata or job listings.

rendered-page support

Playwright can inspect JavaScript-heavy pages, SPAs, WordPress, Next.js, ASP.NET and normal server-rendered sites.

multi-page crawl

Ask for all pages or set a fixed page limit. Either way, the orchestrator keeps the run bounded.
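
One way to keep a crawl bounded is a breadth-first walk that stops at a page budget and never leaves the starting host. This is a sketch of that idea, not the orchestrator's actual code. `bounded_crawl` and the injected `get_links` function are hypothetical, and the fake site below stands in for real fetches.

```python
from collections import deque
from typing import Callable, List
from urllib.parse import urljoin, urlparse


def bounded_crawl(
    start_url: str,
    get_links: Callable[[str], List[str]],
    max_pages: int = 5,
) -> List[str]:
    """Visit pages breadth-first, staying on one host and within a page budget."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Fake site so the sketch runs without network access.
site = {
    "https://example.com/jobs": ["/jobs?page=2", "https://other.example/ad"],
    "https://example.com/jobs?page=2": ["/jobs?page=3", "/jobs"],
    "https://example.com/jobs?page=3": ["/jobs?page=4"],
    "https://example.com/jobs?page=4": [],
}
pages = bounded_crawl("https://example.com/jobs", lambda u: site.get(u, []), max_pages=3)
```

With `max_pages=3` the crawl stops before page 4, skips the off-site link, and does not revisit the first page when page 2 links back to it.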

live agent diagram

The SVG flow builds itself stage by stage, including the number of agents running in parallel.

generated code you can read

The inspector exposes the generated Python before and after execution, so the extractor is inspectable instead of magical.

provider choice

OpenAI, Anthropic, DeepSeek, xAI, MiniMax, OpenRouter and custom OpenAI-compatible endpoints can be selected per run.

fallback mode

Without an API key, it still runs deterministic extraction for headings, links, tables, API hints and a text sample.
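
Deterministic extraction needs no model at all. The sketch below pulls headings, links and simple tables from static HTML using only the standard library. The app itself lists BeautifulSoup in its stack, so `html.parser` is a stand-in chosen to keep the example dependency-free, and `FallbackExtractor` is a hypothetical name. Nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class FallbackExtractor(HTMLParser):
    """Collect headings, link targets, and flat tables from static HTML."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[str] = []
        self.links: List[str] = []
        self.tables: List[List[List[str]]] = []
        self._heading: Optional[List[str]] = None  # text of the open heading
        self._row: Optional[List[str]] = None      # cells of the open row
        self._cell: Optional[List[str]] = None     # text of the open cell

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._heading = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._heading is not None:
            self._heading.append(data)
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._heading is not None:
            self.headings.append("".join(self._heading).strip())
            self._heading = None
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


html = """
<h1>Pricing</h1>
<p>See the <a href="/docs">docs</a>.</p>
<table>
  <tr><th>Tier</th><th>Price</th></tr>
  <tr><td>Free</td><td>$0</td></tr>
</table>
"""
extractor = FallbackExtractor()
extractor.feed(html)
```

The same input always produces the same output, which is what makes this mode useful for checking an installation before spending tokens.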

stop and token awareness

Long runs can be cancelled, and token events are surfaced so a local tool does not feel like a black box spending silently.

landscape

where it fits

This is not trying to be the biggest scraper platform. It is a local, inspectable agent for learning, prototyping and getting from a public page to a usable structured result quickly.

Octoparse / ParseHub

visual scraper tools

Open Extractor Agent is code-first and local: it creates a one-off Python extractor you can inspect, change and rerun.

Browse AI

hosted automation

This project is not a hosted bot platform. It is a local agent playground for public web extraction with visible planning and generated code.

Apify actors

cloud actor marketplace

Apify is stronger for production scraping fleets. Open Extractor Agent is better for learning, prototyping and seeing how an agent reasons.

BeautifulSoup scripts

manual Python

Manual scripts are precise but slow to start. Here the agent drafts the script from page evidence, then shows the code.

ChatGPT copy-paste

manual workflow

Instead of pasting snippets into a chat, the app captures the page, plans, writes, runs and presents the result in one local UI.

reasoning

the decisions behind it

why generated python is visible

The useful artifact is not only the final table. It is also the extractor. Showing the Python makes the agent auditable, teachable and forkable.

why the live diagram matters

The SVG flow is not decoration. It gives the user a mental model of what the agent is doing: capture, scouts, planner, code, run, repair, present.

why provider choice matters

Extraction planning can be cheap and fast, or stronger and more careful, depending on the site. Per-run provider selection keeps the tool flexible.

why fallback mode exists

A local tool should be testable before you spend tokens. Fallback mode proves installation, crawler behavior and UI rendering without an API key.

honest limitations

  • Generated Python runs in a subprocess with a timeout, not a true sandbox.
  • It only uses what the browser can see. It will not bypass logins, CAPTCHAs or paywalls.
  • Hostile or changing page markup can still break selectors.
  • Small models may fail the strict JSON protocol more often.
  • Some sites never finish browser network idle; plain HTTP mode can be better there.
  • Use max pages politely. The tool is built for bounded extraction, not hammering sites.

what's next

  • A stronger execution sandbox for generated Python.
  • Reusable extraction recipes saved from successful runs.
  • XLSX export and richer table cleaning.
  • Optional user-supplied session cookies for sites you are allowed to access.
  • Better selector self-healing when a page changes.
  • Scheduled runs and diff detection for repeat monitoring.

faq

quick answers for search

What is Open Extractor Agent?

Open Extractor Agent is a local, open-source web extraction agent that turns a public URL and a plain-English request into structured data such as CSV, tables, JSON, Markdown or HTML.

Does it work without an LLM API key?

Yes. Without a key it runs in deterministic fallback mode and extracts headings, links, tables, API hints and a text sample. LLM planning and code generation need a configured provider.

Which LLM providers does it support?

It supports OpenAI, Anthropic Claude, DeepSeek, xAI Grok, MiniMax, OpenRouter, and custom OpenAI-compatible endpoints.

Is it a safe sandbox?

No. The generated Python is run in a subprocess with a wall-clock timeout, but it is not a real sandbox. The inspector exists so you can read generated code and traces.

What stack is it built with?

The backend uses FastAPI, Uvicorn, Playwright, httpx and BeautifulSoup. The frontend is plain HTML, CSS and JavaScript with a live SVG flow diagram and NDJSON streaming.

“paste a url, describe the data, then watch the agent build the extraction in public.”

More inspectable than a black-box scraping tool, faster to start than a blank Python file, and clearer than a chat transcript. The promise is simple: structured data with the reasoning trail still attached.