SearchCans
Coming June 2026

File Extraction API

Turn PDF, DOCX, XLSX, PPTX and other office documents into clean, LLM-ready Markdown with one POST request. Preserves headings, tables, lists, and metadata — built for RAG pipelines, legal review automation, and AI training-data preparation.

Early-access subscribers get bonus credits at launch. No spam — one notification when the API goes live.

What you'll get at launch

  • Format coverage — PDF, DOCX, DOC, XLSX, XLS, PPTX, ODT, RTF, EPUB and plain text. One endpoint, every common document type.
  • Structure preserved — headings, ordered/unordered lists, tables, code blocks, and inline links survive the conversion.
  • URL or upload — point us at a public PDF URL, or POST the file body directly. Same response shape.
  • Unified billing — uses your existing SearchCans API key and credit balance; no separate quota.
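
As a rough illustration of the two input modes above, the sketch below builds one request that points at a public document URL and one that posts the file body directly. The endpoint has not been published, so the URL, the JSON field name `url`, and the bearer-token header are all placeholders and assumptions rather than the documented interface. The functions only construct the requests; nothing is sent.

```python
import json
import urllib.request

# Hypothetical endpoint: the real path has not been published yet.
EXTRACT_ENDPOINT = "https://api.example.com/v1/file-extract"


def build_url_request(api_key: str, document_url: str) -> urllib.request.Request:
    """Build a POST that points the service at a publicly reachable document."""
    body = json.dumps({"url": document_url}).encode("utf-8")  # field name is assumed
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
            "Content-Type": "application/json",
        },
    )


def build_upload_request(
    api_key: str, file_bytes: bytes, content_type: str
) -> urllib.request.Request:
    """Build a POST that sends the raw file as the request body."""
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=file_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
```

Either request could then be passed to `urllib.request.urlopen` once the real endpoint and field names are known.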

Who is this for?

RAG & LLM apps

Ingest enterprise document corpora (PDF, DOCX, slide decks) into vector stores with a single ETL step — no LangChain document loaders to maintain.
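
Because headings are preserved in the Markdown output, one possible downstream step is to split each converted document into heading-scoped chunks before embedding. The helper below is a minimal sketch of that idea, not part of the API; the function name is hypothetical, and it handles only `#`-style headings.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks, one per heading."""
    chunks: List[Tuple[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if title or any(l.strip() for l in lines):
                chunks.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    # Flush whatever follows the last heading.
    if title or any(l.strip() for l in lines):
        chunks.append((title, "\n".join(lines).strip()))
    return chunks
```

Text before the first heading is returned with an empty title. Lines starting with `#` inside fenced code blocks would be treated as headings, so a production splitter would need to track fences.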

Legal & compliance

Convert contracts, filings, and policy PDFs into structured Markdown for clause extraction, diff review, or LLM-assisted summarisation.

Knowledge bases

Bulk-convert internal wikis, runbooks, and onboarding decks into chat-ready content for Notion, Confluence, or custom search.

AI training data

Build clean, structured datasets from text-based reports, journal PDFs, and form filings at a predictable per-call cost.

FAQ

When will the File Extraction API launch?

Target launch is June 2026. Waitlist subscribers will be notified the moment the endpoint is live and will receive bonus credits.

Can I parse a PDF today using Reader API?

Yes. The existing Reader API already parses PDFs when the URL points to a PDF document and the file=1 parameter is set. The dedicated File Extraction API will expand format coverage (DOCX, XLSX, PPTX, etc.) and add direct file-upload support.

Will OCR be supported for scanned PDFs?

OCR for image-only PDFs is on the roadmap. The launch version will cover text-based PDFs and office documents; OCR support will follow as a separate parameter.

Need document parsing today?

For text-based PDFs, the Reader API already handles extraction via the file=1 parameter — same authentication, same credit balance.
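
A minimal sketch of such a call is shown below. Only the file=1 parameter comes from the description above; the endpoint URL, the `url` parameter name, and the use of a query string are assumptions made for illustration, so check the Reader API documentation for the real values.

```python
from urllib.parse import urlencode

# Placeholder: substitute the Reader API endpoint from the official docs.
READER_ENDPOINT = "https://api.example.com/reader"


def build_reader_pdf_url(pdf_url: str) -> str:
    """Return a Reader API request URL with PDF parsing switched on."""
    params = {
        "url": pdf_url,  # parameter name is assumed
        "file": 1,       # from the text above: enables PDF extraction
    }
    return f"{READER_ENDPOINT}?{urlencode(params)}"
```

The request would be sent with your existing SearchCans API key, using whatever authentication scheme the Reader API already expects.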