File Extraction API
Turn PDF, DOCX, XLSX, PPTX and other office documents into clean, LLM-ready Markdown with one POST request. Preserves headings, tables, lists, and metadata — built for RAG pipelines, legal review automation, and AI training-data preparation.
Early-access subscribers get bonus credits at launch. No spam — one notification when the API goes live.
What you'll get at launch
- Format coverage: PDF, DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, EPUB and plain text. One endpoint, every common document type.
- Structure preserved: headings, ordered/unordered lists, tables, code blocks, and inline links survive the conversion.
- URL or upload: point us at a public PDF URL, or POST the file body directly. Same response shape.
- Unified billing: uses your existing SearchCans API key and credit balance; no separate quota.
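The two input modes above (URL or direct upload) could look roughly like this from a client's point of view. This is only a sketch: the endpoint path, the Bearer auth header, and the "url" JSON field are assumptions for illustration, since the request format has not been published yet. The function builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical endpoint: the real path has not been published yet.
EXTRACT_ENDPOINT = "https://api.example.com/v1/extract"


def build_extract_request(
    api_key: str,
    *,
    url: Optional[str] = None,
    file_bytes: Optional[bytes] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for either input mode."""
    if (url is None) == (file_bytes is None):
        raise ValueError("pass exactly one of url or file_bytes")

    # Assumed auth scheme; the actual header name may differ.
    headers = {"Authorization": f"Bearer {api_key}"}
    if url is not None:
        # URL mode: a small JSON body naming the document to fetch.
        body = json.dumps({"url": url}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    else:
        # Upload mode: the raw file bytes are the request body.
        body = file_bytes
        headers["Content-Type"] = "application/octet-stream"

    return urllib.request.Request(
        EXTRACT_ENDPOINT, data=body, headers=headers, method="POST"
    )
```

Passing the result to urllib.request.urlopen would send it; both modes share one code path after that point, which mirrors the "same response shape" promise.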
Who is this for?
RAG & LLM apps
Ingest enterprise document corpora (PDF, DOCX, slide decks) into vector stores with a single ETL step — no LangChain document loaders to maintain.
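Because the output is Markdown with headings preserved, one plausible ingestion step is to split each document on its headings before embedding. The sketch below is a generic illustration, not part of the API: chunk_markdown is a hypothetical helper, and it assumes ATX-style headings (lines starting with #) in the returned Markdown.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_markdown(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping the heading as metadata."""
    chunks: List[Dict[str, str]] = []
    title = ""  # text before the first heading gets an empty title
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        # Store the section collected so far, skipping empty sections.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"heading": title, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside code fences are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded and stored with its heading as metadata, so retrieved passages keep their section context.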
Legal & compliance
Convert contracts, filings, and policy PDFs into structured Markdown for clause extraction, diff review, or LLM-assisted summarisation.
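Once two versions of a contract exist as Markdown, a diff review can be done with ordinary text tooling. As a minimal sketch (changed_lines is a hypothetical helper, and the sample clauses are invented for illustration), Python's standard difflib module can list the lines that differ:

```python
import difflib
from typing import List


def changed_lines(old_md: str, new_md: str) -> List[str]:
    """Return removed ("- ") and added ("+ ") lines between two Markdown versions."""
    delta = difflib.ndiff(old_md.splitlines(), new_md.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in delta if line.startswith(("- ", "+ "))]


old = "# Terms\n\nPayment due in 30 days.\n"
new = "# Terms\n\nPayment due in 45 days.\n"
for line in changed_lines(old, new):
    print(line)
```

Line-level diffs work best when each clause sits on its own line or paragraph, which is one reason preserved structure matters for this use case.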
Knowledge bases
Bulk-convert internal wikis, runbooks, and onboarding decks into chat-ready content for Notion, Confluence, or custom search.
AI training data
Build clean, structured datasets from reports, journal PDFs, and form filings at predictable per-call cost. Scanned, image-only sources will depend on the OCR support planned for after launch.
FAQ
When will the File Extraction API launch?
Target launch is June 2026. Waitlist subscribers will be notified the moment the endpoint is live and will receive bonus credits.
Can I parse a PDF today using Reader API?
Yes. The existing Reader API already parses PDFs when the URL points to a PDF document and the file=1 parameter is set. The dedicated File Extraction API will expand format coverage (DOCX, XLSX, PPTX, etc.) and add direct file-upload support.
Will OCR be supported for scanned PDFs?
OCR for image-only PDFs is on the roadmap. The launch version will cover text-based PDFs and office documents; OCR support will follow as a separate parameter.
Need document parsing today?
For text-based PDFs, the Reader API already handles extraction via the file=1 parameter — same authentication, same credit balance.
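As a rough illustration of the file=1 parameter, the snippet below builds a Reader API request URL for a PDF. Only file=1 comes from the text above; the base URL and the "url" parameter name are placeholders assumed for this sketch, and whether the parameter travels in the query string or the request body should be checked against the Reader API documentation.

```python
from urllib.parse import urlencode

# Hypothetical base URL and "url" parameter name; only file=1 is from the text.
READER_ENDPOINT = "https://api.example.com/reader"


def reader_pdf_query(pdf_url: str) -> str:
    """Build a Reader API request URL that asks for PDF parsing."""
    return READER_ENDPOINT + "?" + urlencode({"url": pdf_url, "file": 1})
```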