File Extraction API

Convert PDF, DOCX, XLSX, PPTX and other documents to clean, structured Markdown with one POST request. Headings, tables, lists, and code blocks are all preserved — built for RAG pipelines, LLM ingestion, and document automation workflows.


PDF to Markdown — Live Output Demo

Adobe PDF sample, fileMarkdown output (structure preserved, LLM-ready):
# PDF BOOKMARK SAMPLE

**Sample Date:** May 2001
**Prepared by:** Accelio Present Applied Technology

**Features Demonstrated:**
- Primary bookmarks in a PDF file.
- Secondary bookmarks in a PDF file.

### Overview

This sample consists of a simple form containing four distinct
fields. The data file contains eight separate records.

By default, the data file will produce a PDF file containing
eight separate pages. The selective use of the bookmark file
will produce the same PDF with a separate pane containing
bookmarks.

### Sample Files

| Filename            | Description                     |
|---------------------|---------------------------------|
| ap_bookmark.IFD     | The template design.            |
| ap_bookmark.mdf     | Template targeted for PDF output|
| ap_bookmark.dat     | Sample data file in DAT format. |
| ap_bookmark.bmk     | Sample bookmark file.           |
| ap_bookmark.pdf     | Sample PDF output.              |

Source: c4611_sample_explain.pdf — Adobe official sample document, parsed via File Extraction API.

Supported Formats

PDF, DOCX, DOC, XLSX, XLS, PPTX, EPUB

API Endpoint

POST https://www.searchcans.com/api/v1/url

Code Examples

curl -X POST "https://www.searchcans.com/api/v1/url" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "t": "url",
    "s": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "file": 1
  }'
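For readers who would rather call the endpoint from code, the same request can be sketched in Python with only the standard library. The helper name `build_extraction_request` is hypothetical; the URL, headers, and JSON body mirror the curl example above. The function only builds the request, so it can be inspected without a valid key or network access.

```python
import json
import urllib.request

API_URL = "https://www.searchcans.com/api/v1/url"


def build_extraction_request(doc_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    # Same body as the curl example: "file": 1 turns on file parsing.
    payload = {"t": "url", "s": doc_url, "file": 1}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_extraction_request(
        "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
        "YOUR_API_KEY",
    )
    # Sending needs a real key and network access, for example:
    # with urllib.request.urlopen(req, timeout=60) as resp:
    #     body = json.load(resp)
    print(req.get_method(), req.full_url)
```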

Example Response

{
  "code": 0,
  "data": {
    "id": "q_2061257715480530944_065651eb-1b1d-44ee-8496-c12e5b220cea",
    "markdown": "",
    "title": "",
    "description": "",
    "fileUrl": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "fileMarkdown": "# **PDF BOOKMARK SAMPLE**\n\n**Sample Date:** May 2001\n\n**Prepared by:** Accelio Present Applied Technology\n\n**Created and Tested Using:** - Accelio Present Central 5.4\n\n                         - Accelio Present Output Designer 5.4\n\n**Features Demonstrated:** - Primary bookmarks i… (721 chars total)"
  },
  "msg": "success",
  "requestId": null
}
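A response shaped like the one above could be unpacked along these lines. This is a minimal sketch: it assumes that `code` equal to 0 indicates success (as in the example) and that any other value should be treated as a failure, which this page does not state explicitly. `ExtractionError` and `extract_markdown` are hypothetical names.

```python
from typing import Any


class ExtractionError(RuntimeError):
    """Raised when a response does not look like a successful extraction."""


def extract_markdown(response: dict[str, Any]) -> str:
    """Return data.fileMarkdown from a decoded JSON response."""
    # Assumption: code == 0 means success, as in the example response.
    if response.get("code") != 0:
        raise ExtractionError(
            f"API returned code={response.get('code')!r}, msg={response.get('msg')!r}"
        )
    data = response.get("data") or {}
    markdown = data.get("fileMarkdown") or ""
    if not markdown:
        raise ExtractionError("response contains no fileMarkdown content")
    return markdown
```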

Request Body (JSON)

| Parameter | Type    | Description | Example |
|-----------|---------|-------------|---------|
| t         | string  | API type; always "url" for the File Extraction API. | "url" |
| s         | string  | URL pointing to the document to extract (PDF, DOCX, DOC, XLSX, XLS, PPTX, EPUB, etc.). | "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf" |
| file      | integer | Set to 1 to trigger file parsing. The URL must point directly to a document file. | 1 |
| proxy     | integer | Proxy tier for auth-protected or CDN-gated file URLs: 0 = none (default), 1 = Shared Pool (+2 credits), 2 = Datacenter (+5 credits), 3 = Residential (+10 credits). | 0 |
| d         | integer | Request timeout in milliseconds (default: 30000). Use 60000+ for large documents. | 60000 |
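The parameters above can be assembled and checked client-side before a request is sent. The sketch below is one way to do that; `build_payload` and `proxy_surcharge` are hypothetical helpers, and the surcharge values are copied from the proxy row of the table. Optional fields are omitted when they equal the documented defaults.

```python
# Extra credits per proxy tier, taken from the parameter table:
# 0 = none, 1 = Shared Pool, 2 = Datacenter, 3 = Residential.
PROXY_SURCHARGE = {0: 0, 1: 2, 2: 5, 3: 10}
DEFAULT_TIMEOUT_MS = 30000


def proxy_surcharge(proxy: int) -> int:
    """Return the extra credits listed for a proxy tier."""
    if proxy not in PROXY_SURCHARGE:
        raise ValueError(f"proxy must be one of {sorted(PROXY_SURCHARGE)}, got {proxy!r}")
    return PROXY_SURCHARGE[proxy]


def build_payload(doc_url: str, proxy: int = 0, timeout_ms: int = DEFAULT_TIMEOUT_MS) -> dict:
    """Build the JSON body for a file extraction request."""
    proxy_surcharge(proxy)  # validates the tier
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    payload = {"t": "url", "s": doc_url, "file": 1}
    if proxy != 0:
        payload["proxy"] = proxy
    if timeout_ms != DEFAULT_TIMEOUT_MS:
        payload["d"] = timeout_ms
    return payload
```

For example, `build_payload(url, proxy=2, timeout_ms=60000)` yields a body with `"proxy": 2` and `"d": 60000`, and `proxy_surcharge(2)` reports the 5 extra credits listed for the Datacenter tier.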

API Response Fields

The File Extraction API returns document content as structured Markdown, preserving all headings, tables, lists, and code blocks from the source file.

fileMarkdown

Full document content as clean Markdown. Headings, tables, lists, and inline code are preserved from the original file structure.

fileUrl

The original document URL that was submitted in the request — echoed back for correlation in batch processing pipelines.

id

Unique request ID for tracing and support. Format: q_{uid}_{uuid}.

title / description

Document metadata when available from the file's internal properties. May be empty for documents without embedded metadata.
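Because fileUrl is echoed back, results from a batch can be matched to the URLs that were submitted. The following is a rough sketch of that correlation; `index_by_file_url` is a hypothetical helper, and it assumes the echoed URL is identical to the submitted string and that `code` equal to 0 marks a successful response.

```python
from typing import Any, Optional


def index_by_file_url(
    responses: list[dict[str, Any]], submitted_urls: list[str]
) -> dict[str, Optional[dict[str, Any]]]:
    """Map each submitted URL to its response data, or None if no result matched."""
    by_url: dict[str, dict[str, Any]] = {}
    for response in responses:
        data = response.get("data") or {}
        # Assumption: code == 0 marks a usable result.
        if response.get("code") == 0 and data.get("fileUrl"):
            by_url[data["fileUrl"]] = data
    return {url: by_url.get(url) for url in submitted_urls}
```

URLs that map to None are the ones to retry or report.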

Document Parsing for RAG Pipelines

The File Extraction API is built for LLM ingestion workflows. Pass the returned fileMarkdown directly into a vector store or chunking pipeline — no post-processing needed.
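As an illustration of the chunking step, the sketch below splits fileMarkdown into one chunk per heading, using only the standard library. It is one possible strategy, not something the API prescribes; `chunk_by_heading` is a hypothetical helper, and lines inside fenced code blocks are never treated as headings.

```python
import re

# ATX headings: one to six '#' characters followed by whitespace and the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks: list[dict] = []
    heading, body, in_fence = "", [], False

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # entering or leaving a code fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading alongside the text, so the heading can be stored as metadata in the vector store.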

For documents hosted behind authentication (AWS S3 presigned URLs, SharePoint, CDN-gated files), use the proxy parameter to route through a residential or datacenter proxy tier. Set "d": 60000 or higher for large documents to avoid timeout errors.
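One way to apply this advice is to start without a proxy and move to a more expensive tier only if extraction fails. The sketch below assumes that policy; the API does not prescribe it. `fetch_with_escalation` is a hypothetical helper, and `send` is a caller-supplied function that posts a payload and returns the decoded JSON response.

```python
from typing import Callable, Optional


def fetch_with_escalation(
    doc_url: str,
    send: Callable[[dict], dict],
    tiers: tuple = (0, 2, 3),  # none, then Datacenter, then Residential
    timeout_ms: int = 60000,
) -> Optional[dict]:
    """Try each proxy tier in order; return the first successful response, else None."""
    for tier in tiers:
        payload = {"t": "url", "s": doc_url, "file": 1, "d": timeout_ms}
        if tier:
            payload["proxy"] = tier
        response = send(payload)
        data = response.get("data") or {}
        # Assumption: code == 0 with non-empty fileMarkdown counts as success.
        if response.get("code") == 0 and data.get("fileMarkdown"):
            return response
    return None
```

Since each proxy tier adds credits, ordering the tiers from cheapest to most expensive keeps the cost of a successful extraction as low as the source allows.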

File Extraction vs Reader API

| Feature       | File Extraction API                   | Reader API          |
|---------------|---------------------------------------|---------------------|
| Input         | Document file URL (PDF, DOCX, XLSX…)  | Any web page URL    |
| Output field  | fileMarkdown                          | markdown            |
| JS rendering  | Not required                          | Optional via mode=1 |
| Key parameter | file: 1                               | (default behavior)  |

Same endpoint? Yes, both use POST /api/v1/url.
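Since both APIs share one endpoint, a client can choose between them per URL. The sketch below guesses from the file extension, which is a heuristic and an assumption of this example; it also assumes that a Reader API request uses the same "t" and "s" fields without "file". `payload_for` and `content_of` are hypothetical helpers.

```python
from urllib.parse import urlparse

# Extensions taken from the Supported Formats list.
DOCUMENT_EXTENSIONS = (".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".epub")


def payload_for(url: str) -> dict:
    """Add "file": 1 when the URL path ends in a supported document extension."""
    payload = {"t": "url", "s": url}
    if urlparse(url).path.lower().endswith(DOCUMENT_EXTENSIONS):
        payload["file"] = 1
    return payload


def content_of(data: dict) -> str:
    """Return fileMarkdown if present, otherwise the Reader API's markdown field."""
    return data.get("fileMarkdown") or data.get("markdown") or ""
```

Document URLs without a recognizable extension (for example, download links with opaque paths) would need `"file": 1` set explicitly.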

Start Building Free

Start extracting documents immediately. 100 free credits on sign up.