File Extraction API

Convert PDF, DOCX, XLSX, PPTX and other documents to clean, structured Markdown with one POST request. Headings, tables, lists, and code blocks are all preserved — built for RAG pipelines, LLM ingestion, and document automation workflows.


PDF to Markdown — Live Output Demo

Adobe PDF sample, fileMarkdown output (structure preserved, LLM-ready):

# PDF BOOKMARK SAMPLE

**Sample Date:** May 2001
**Prepared by:** Accelio Present Applied Technology

**Features Demonstrated:**
- Primary bookmarks in a PDF file.
- Secondary bookmarks in a PDF file.

### Overview

This sample consists of a simple form containing four distinct
fields. The data file contains eight separate records.

By default, the data file will produce a PDF file containing
eight separate pages. The selective use of the bookmark file
will produce the same PDF with a separate pane containing
bookmarks.

### Sample Files

| Filename            | Description                     |
|---------------------|---------------------------------|
| ap_bookmark.IFD     | The template design.            |
| ap_bookmark.mdf     | Template targeted for PDF output|
| ap_bookmark.dat     | Sample data file in DAT format. |
| ap_bookmark.bmk     | Sample bookmark file.           |
| ap_bookmark.pdf     | Sample PDF output.              |

Source: c4611_sample_explain.pdf — Adobe official sample document, parsed via File Extraction API.

Supported Formats

PDF DOCX DOC XLSX XLS PPTX EPUB

API Endpoint

POST https://www.searchcans.com/api/v1/url

Code Examples

curl -X POST "https://www.searchcans.com/api/v1/url" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "t": "url",
    "s": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "file": 1
  }'

import requests

response = requests.post(
    "https://www.searchcans.com/api/v1/url",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "t": "url",
        "s": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
        "file": 1,
    }
)
response.raise_for_status()  # Fail early on HTTP errors before parsing JSON
data = response.json()["data"]
print(data["fileMarkdown"][:500])   # Parsed document as Markdown

const response = await fetch("https://www.searchcans.com/api/v1/url", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    t: "url",
    s: "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    file: 1,
  }),
});
const { data } = await response.json();
console.log(data.fileMarkdown.slice(0, 500));   // Parsed document as Markdown

Example Response


{
  "code": 0,
  "data": {
    "id": "q_2061257715480530944_065651eb-1b1d-44ee-8496-c12e5b220cea",
    "markdown": "",
    "title": "",
    "description": "",
    "fileUrl": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
    "fileMarkdown": "# **PDF BOOKMARK SAMPLE**\n\n**Sample Date:** May 2001\n\n**Prepared by:** Accelio Present Applied Technology\n\n**Created and Tested Using:** - Accelio Present Central 5.4\n\n                         - Accelio Present Output Designer 5.4\n\n**Features Demonstrated:** - Primary bookmarks i… (721 chars total)"
  },
  "msg": "success",
  "requestId": null
}

Request Body (JSON)

Parameter	Type	Required	Description	Example
t	string	Required	API type; always "url" for the File Extraction API.	`"url"`
s	string	Required	URL pointing to the document to extract (PDF, DOCX, DOC, XLSX, XLS, PPTX, EPUB, etc.).	`"https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf"`
file	integer	Required	Set to 1 to trigger file parsing. The URL must point directly to a document file.	`1`
proxy	integer	Optional	Proxy tier for auth-protected or CDN-gated file URLs: 0 = none (default), 1 = Shared Pool (+2 credits), 2 = Datacenter (+5 credits), 3 = Residential (+10 credits).	`0`
d	integer	Optional	Request timeout in milliseconds (default: 30000). Use 60000+ for large documents.	`60000`
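The parameters in the table can be assembled into a request body with a small helper. The following is a minimal sketch, not part of the API: `build_payload` and its argument names are hypothetical, and the range check only mirrors the proxy tiers and default timeout listed above.

```python
from typing import Any, Dict

VALID_PROXY_TIERS = {0, 1, 2, 3}  # tiers listed in the parameter table
DEFAULT_TIMEOUT_MS = 30000        # documented default for "d"


def build_payload(doc_url: str, proxy: int = 0, timeout_ms: int = DEFAULT_TIMEOUT_MS) -> Dict[str, Any]:
    """Build a File Extraction request body (hypothetical helper)."""
    if proxy not in VALID_PROXY_TIERS:
        raise ValueError(f"proxy must be one of {sorted(VALID_PROXY_TIERS)}, got {proxy}")
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    # Required fields: API type, document URL, and the file-parsing switch.
    payload: Dict[str, Any] = {"t": "url", "s": doc_url, "file": 1}
    # Send optional fields only when they differ from the documented defaults.
    if proxy != 0:
        payload["proxy"] = proxy
    if timeout_ms != DEFAULT_TIMEOUT_MS:
        payload["d"] = timeout_ms
    return payload
```

Calling `build_payload("https://example.com/report.pdf", proxy=2, timeout_ms=60000)` would produce a body with `"proxy": 2` and `"d": 60000` alongside the three required fields; the example.com URL is a placeholder.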

API Response Fields

The File Extraction API returns document content as structured Markdown, preserving all headings, tables, lists, and code blocks from the source file.

fileMarkdown

Full document content as clean Markdown. Headings, tables, lists, and inline code are preserved from the original file structure.

fileUrl

The original document URL that was submitted in the request — echoed back for correlation in batch processing pipelines.

id

Unique request ID for tracing and support. Format: q_{uid}_{uuid}.

title / description

Document metadata when available from the file's internal properties. May be empty for documents without embedded metadata.
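As an illustration of reading these fields, the sketch below pulls `fileMarkdown` out of a decoded response body. It assumes that a non-zero `code` signals a failure, which the example response (showing only `code: 0`) does not confirm; `extract_markdown` is a hypothetical helper name, and the sample body is an abbreviated copy of the example response.

```python
from typing import Any, Dict


def extract_markdown(body: Dict[str, Any]) -> str:
    """Return fileMarkdown from a decoded response body, or raise on failure."""
    # Assumption: a non-zero "code" means the request failed.
    if body.get("code") != 0:
        raise RuntimeError(f"extraction failed: code={body.get('code')} msg={body.get('msg')}")
    data = body.get("data") or {}
    markdown = data.get("fileMarkdown") or ""
    if not markdown:
        # fileUrl is echoed back, so it identifies which document came back empty.
        raise RuntimeError(f"no fileMarkdown returned for {data.get('fileUrl')}")
    return markdown


# Abbreviated copy of the example response (fileMarkdown shortened).
sample = {
    "code": 0,
    "data": {
        "id": "q_2061257715480530944_065651eb-1b1d-44ee-8496-c12e5b220cea",
        "markdown": "",
        "title": "",
        "description": "",
        "fileUrl": "https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf",
        "fileMarkdown": "# **PDF BOOKMARK SAMPLE**\n\n**Sample Date:** May 2001",
    },
    "msg": "success",
    "requestId": None,
}

print(extract_markdown(sample).splitlines()[0])
```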

Document Parsing for RAG Pipelines

The File Extraction API is built for LLM ingestion workflows. Pass the returned fileMarkdown directly into a vector store or chunking pipeline — no post-processing needed.
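One possible chunking step is sketched below: it splits the returned Markdown at heading lines so each chunk carries its own section title. This is an illustrative approach rather than anything the API prescribes; `chunk_by_heading` and `max_chars` are hypothetical names, and the heading check is a simple regex that would also match `#` lines inside code fences.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")  # ATX-style Markdown heading


def chunk_by_heading(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown into chunks that start at headings and stay under max_chars where possible."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in markdown.splitlines():
        starts_section = bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        # Close the running chunk at a new heading or when it would grow too large.
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only blank lines.
    return [c for c in chunks if c]
```

A single line longer than `max_chars` is kept whole, and a size-based split can land in the middle of a table, so a production pipeline would likely want table-aware splitting.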

For documents hosted behind authentication or access controls (AWS S3 presigned URLs, SharePoint, CDN-gated files), use the proxy parameter to route the request through the datacenter ("proxy": 2) or residential ("proxy": 3) tier. Set "d": 60000 or higher for large documents to avoid timeout errors.
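A caller that does not know in advance which tier a file needs could try the tiers in ascending order. The sketch below is one way to do that under stated assumptions: `extract_with_escalation` is a hypothetical helper, `send` is any caller-supplied function that posts a payload and returns the decoded JSON body, and success is assumed to mean `code` 0 with a non-empty `fileMarkdown`.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Extra credits per proxy tier, as listed in the request parameter table.
PROXY_SURCHARGE = {0: 0, 1: 2, 2: 5, 3: 10}


def extract_with_escalation(
    doc_url: str,
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
    timeout_ms: int = 60000,
) -> Tuple[Optional[str], Optional[int]]:
    """Try each proxy tier in order; return (markdown, tier) or (None, None)."""
    for tier in sorted(PROXY_SURCHARGE):
        payload = {"t": "url", "s": doc_url, "file": 1, "proxy": tier, "d": timeout_ms}
        body = send(payload)
        # Assumption: success is code 0 with a non-empty fileMarkdown.
        markdown = (body.get("data") or {}).get("fileMarkdown")
        if body.get("code") == 0 and markdown:
            return markdown, tier
    return None, None
```

With the `requests` library, `send` could be a small wrapper around `requests.post("https://www.searchcans.com/api/v1/url", headers=..., json=payload).json()`. `PROXY_SURCHARGE[tier]` then gives the extra credits listed for the tier that succeeded; how failed attempts are billed is not stated in the source.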

File Extraction vs Reader API

Feature	File Extraction API	Reader API
Input	Document file URL (PDF, DOCX, XLSX…)	Any web page URL
Output field	`fileMarkdown`	`markdown`
JS rendering	Not required	Optional via `mode=1`
Key parameter	`file: 1`	(default behavior)
Same endpoint?	Yes, `POST /api/v1/url`	Yes, `POST /api/v1/url`
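Because both APIs share one endpoint, a client can pick the request body and the response field from the URL alone. The following sketch uses a file-extension heuristic based on the Supported Formats list; `payload_and_field` is a hypothetical helper, and the Reader API body shown (just `t` and `s`) is an assumption drawn from the "default behavior" row above.

```python
from typing import Any, Dict, Tuple
from urllib.parse import urlparse

# Extensions taken from the Supported Formats list.
DOCUMENT_EXTENSIONS = (".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".epub")


def payload_and_field(url: str) -> Tuple[Dict[str, Any], str]:
    """Return (request body, response field to read) for a URL."""
    # Look only at the path so query strings such as ?dl=1 do not hide the extension.
    path = urlparse(url).path.lower()
    if path.endswith(DOCUMENT_EXTENSIONS):
        return {"t": "url", "s": url, "file": 1}, "fileMarkdown"
    # Assumed Reader API body: same fields without "file".
    return {"t": "url", "s": url}, "markdown"
```

Document URLs without a recognizable extension (for example, some presigned or redirecting links) would not be detected by this heuristic and would need `file: 1` set explicitly.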

Start Building Free

Start extracting documents immediately. 100 free credits on sign up.
